When Data Becomes A Liability
By Anthony · 6 min read · January 11, 2021
87% of the American population can be uniquely identified by a combination of just their ZIP code, gender, and date of birth. 1 It doesn’t take much data to accurately identify a person.
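To make the idea of "uniquely identified" concrete, here is a minimal sketch that counts how many records in a small table are the only ones with their particular combination of ZIP code, gender, and date of birth. The records and the function name `unique_fraction` are made up for illustration. This is not the method used in the cited study.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, gender, date_of_birth). Not real data.
records = [
    ("02139", "F", "1985-03-14"),
    ("02139", "F", "1985-03-14"),
    ("02139", "M", "1990-07-02"),
    ("94103", "F", "1978-11-30"),
    ("94103", "M", "1978-11-30"),
]

def unique_fraction(rows):
    """Return the share of rows whose combination of fields appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

# Three of the five toy records are one of a kind.
print(unique_fraction(records))  # 0.6
```

The more fields you combine, the fewer people share the same combination, which is why a handful of seemingly harmless attributes can single someone out.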
Since 2020, more people than ever have been carrying out their daily activities online. This includes working, shopping, playing, and even communicating with friends and family, all from the same device.
These daily interactions generate vast amounts of data. For example, Google alone processes more than 40,000 searches per second (3.5 billion searches per day) 2. And in the last two years alone, an astonishing 90% of the world’s data has been created.
To put that into perspective, consider how much data is generated every single day by a handful of services 3:
- More than 500 million tweets are sent.
- 294 billion emails are sent.
- 4 petabytes of data are created on Facebook.
- 4 terabytes of data are created from each connected car.
- 65 billion messages are sent on WhatsApp.
That’s a lot of data, and you may be wondering: why would it matter if someone knows that I took an Uber to a restaurant last Friday, or that I like watching comedy movies on Saturday nights? Your data is more valuable than you think, but first, let's consider why data is being collected at all.
Why collect data?
Data can, for example, help website owners understand how to improve the experience for their visitors. The people running a website need to measure broad metrics such as unique visitors, time spent on the website, how fast the website loads in each country, and so on. Without these metrics, they'd essentially be flying without navigation instruments.
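As a rough sketch of what measuring these broad metrics can look like, the snippet below reduces a list of page-view events to three aggregate numbers. The event fields (`visitor`, `country`, `load_ms`, `seconds_on_page`) and the function `summarize` are assumptions made up for this example, not the format of any particular analytics tool.

```python
from collections import defaultdict
from statistics import median

# Hypothetical page-view events; the field names are assumptions for illustration.
events = [
    {"visitor": "a", "country": "DE", "load_ms": 420, "seconds_on_page": 35},
    {"visitor": "a", "country": "DE", "load_ms": 380, "seconds_on_page": 60},
    {"visitor": "b", "country": "US", "load_ms": 900, "seconds_on_page": 12},
    {"visitor": "c", "country": "US", "load_ms": 700, "seconds_on_page": 48},
]

def summarize(page_views):
    """Reduce raw page views to aggregate metrics only."""
    visitors = set()
    total_seconds = 0
    load_by_country = defaultdict(list)
    for view in page_views:
        visitors.add(view["visitor"])
        total_seconds += view["seconds_on_page"]
        load_by_country[view["country"]].append(view["load_ms"])
    return {
        "unique_visitors": len(visitors),
        "avg_seconds_on_page": total_seconds / len(page_views) if page_views else 0.0,
        "median_load_ms": {c: median(v) for c, v in load_by_country.items()},
    }

summary = summarize(events)
print(summary["unique_visitors"])      # 3
print(summary["avg_seconds_on_page"])  # 38.75
print(summary["median_load_ms"])       # {'DE': 400.0, 'US': 800.0}
```

Note that the summary contains no per-visitor rows. In this sketch, the aggregates are all a site owner would need to keep in order to answer the questions above.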
Consider this: there are now billions of videos on YouTube, and more than 300 hours of video are uploaded every minute. Finding good content that is relevant to you within that ocean of content is like looking for a needle in a haystack. That's why companies have been building machine learning models that can better predict which content to show you given billions of possible options. These models are also responsible for flagging inappropriate content and automatically categorizing it to deliver a better experience for everyone.
But what is relevant for me might not be relevant for you, and algorithms like these learn from billions of example data points about the interactions performed by visitors across websites and apps. That is, broadly speaking, how most ranking and recommender systems work: you collect vast amounts of data to train a model that can make predictions for a given task, often with remarkable accuracy (and biases 4). There are many different approaches being developed, and some even learn on your device without requiring your data to be transferred to a cloud service 5. But that's for another post.
While many use cases do lead to improved experiences for the user, there have been cases in which the data has been repurposed for manipulating and influencing people 6. With great power comes great responsibility, and it's unfortunate that there have been so many stories of abuse of power by those who have collected so much data.
So where do we draw the line? Data collection is necessary for businesses to build and improve the products that people use every day. As a saying often attributed to Peter Drucker goes: “If you can’t measure it, you can’t improve it.” Data analytics is an essential part of the Internet ecosystem, and it’s important to increase awareness of the responsibilities that come along with it.
More is not always better
Over the past decades, companies have been hoarding vast amounts of data, but at some point we have to ask ourselves: is having more data always useful? As with most things in life, there are tradeoffs to every choice. It greatly depends on the problem at hand, the purpose for which the data was collected, how long you will keep it, and plenty of other factors. The line is simply too blurry, and depends a lot on the context.
Consider this: let's assume Netflix collects a history of the movies you’ve recently watched, and whether you finished watching them or not. This data can help their recommendation algorithm understand what kind of content to suggest to you next. Do you like romantic comedies, but dislike horror movies? The algorithm will learn this over time, and improve the experience for you.
That may seem like a fair use of your data, but can your viewing history be exploited to influence you on broader topics? Consider a technique used in filmmaking called juxtaposition 7: placing two things close together to create a contrasting effect. This can be used to alter your perception of the relationship between the two elements placed next to each other. For example, by placing a video about a highly controversial topic, which you might dislike, next to an unrelated video about any other topic, you could create associations without them being explicitly there. Our brains tend to look for patterns, even where there are none. And over time, these implicit associations can build up, having profound effects on us and influencing our behaviour.
As the world becomes aware of the responsibility that the entities collecting this data hold, privacy regulations such as the General Data Protection Regulation (GDPR) are establishing a framework to give individuals control over their personal data, and to simplify the regulatory environment for businesses that make use of this data. GDPR compliance is an important topic for many companies, a number of which have already received large fines 8. A lot has been changing in recent years, and these regulations could have significant effects on how companies innovate and grow. That’s why a solid, principled foundation is key to building a long-term, sustainable ecosystem.
You’re not exempt from privacy regulations
The responsibility to properly handle personal data applies to everyone who collects it, not only to big enterprises. Whether it’s your family doctor using handwritten forms, or the local mom-and-pop restaurant making food deliveries, no one is exempt from the responsibility that comes with handling personal data.
That’s why every piece of data you control is a liability. Collecting data just for the sake of it can do more harm than good, so you should consider whether you truly need it. Once you have collected it, you must ensure that it is protected with high security standards, as you automatically become the data controller under various privacy regulations.
So even on a small scale, whether it’s your personal blog or a small online shop, you have the responsibility to handle your visitors' data with care. They are trusting you with their data, and they are becoming increasingly aware of this, which is a good thing.
Privacy by default
To be able to innovate and improve the products we use every day, companies need to be able to measure and understand user behaviour. To do so, collecting and aggregating data is a necessary evil, so that business decisions can be based on reality and not on opinions.
Businesses should take steps towards privacy by design, which is crucial to minimizing risks and essential to building long-term trust. Those numbers on reports and dashboards represent people, and they deserve better.
Businesses that play a major role in society in particular must take the utmost care to avoid models that are biased by the data fed into them, or unfairly balanced systems that have life-changing implications for their users. The topic of data ethics is broad and deserves a more thorough article than this paragraph, so I won't go into it for now.
Today there are more and more ways to collect all kinds of data, including private, personally identifiable data. More data is not always better, and to reduce your liability, embrace the principle of data minimization.
You should only collect as much data as you truly require in order to provide your service. Collecting and processing personal data is a sensitive topic, but if you recognize this and handle it properly, it can benefit everyone, including those who entrusted you with it.
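One simple way to put data minimization into practice is an allow-list: decide up front which fields your service needs, and drop everything else before it is stored. The sketch below assumes an imaginary delivery shop. The field names, `REQUIRED_FIELDS`, and the `minimize` function are hypothetical and only meant to illustrate the principle.

```python
# Hypothetical allow-list: the only fields this imaginary shop needs to deliver an order.
REQUIRED_FIELDS = frozenset({"order_id", "delivery_address", "items"})

def minimize(record, allowed=REQUIRED_FIELDS):
    """Keep only the allow-listed fields; everything else is dropped before storage."""
    return {key: value for key, value in record.items() if key in allowed}

# A made-up form submission that contains more than the shop needs.
submitted = {
    "order_id": 1017,
    "delivery_address": "12 Example Street",
    "items": ["margherita"],
    "date_of_birth": "1985-03-14",
    "gender": "F",
}

stored = minimize(submitted)
print(sorted(stored))  # ['delivery_address', 'items', 'order_id']
```

An allow-list fails safe: a new field added to the form is not stored until someone consciously decides it is required.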
Disclaimer: I'm not a lawyer, and this is not legal advice. Any advice on this website is general in nature and not to be taken as professional advice.
1: L. Sweeney. Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. Pittsburgh, 2000.
2: Google Search Statistics. Internet Live Stats.
3: How much data is generated each day? World Economic Forum.
4: Biases in Machine Learning. Towards Data Science.
5: Federated Learning: Collaborative Machine Learning without Centralized Training Data. Google AI Blog.
6: Facebook–Cambridge Analytica data scandal. Wikipedia.
7: Juxtaposition and Montage. Hollywood Lexicon.
8: GDPR Enforcement Tracker. CMS Law Tax.