Before the EU General Data Protection Regulation (GDPR) and other modern privacy regulations were introduced, evidence was already mounting that histories of human mobility containing detailed location data are vulnerable to simple reidentification attacks. This line of research likely contributed to the GDPR explicitly treating pseudonymized location data (i.e., data stripped of obvious identifiers, such as a name or phone number) as personal rather than anonymous. Still, given how useful this data can be, academics and industry practitioners have kept asking whether there is a simple fix to privacy in these datasets. Namely, if the dataset were big enough, would individual records become anonymous by being “lost in the crowd”?

These questions become ever more pressing as evidence grows for the availability and utility of human mobility data. For example, we interact daily with many digital services when using our phone, paying with our credit card, or using public transport with a smart card. Throughout these interactions, our location data is collected broadly and at scale. Vodafone alone holds the location trajectories of close to a third of the U.K.’s population. This mobility data is a very rich source of insights in many areas, such as improving urban planning, studying poverty at scale, and monitoring and containing the spread of pandemics, such as COVID-19.

As anyone who has worked with personal data will attest, before such data can be used, the privacy of the people it describes must be protected. For a long time, the industry standard has been to “anonymize” the data: modify it in such a way that no individual can be identified, and then release it. Techniques proposed by researchers include pseudonymization (removing direct identifiers, such as a name or phone number) and generalization (reducing the data’s precision, for example by rounding timestamps to the hour).
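As a rough sketch (not taken from the article), the snippet below shows what pseudonymization and generalization might look like on a toy table of location records. The column names, the salted-hash pseudonym, and the rounding choices are all assumptions made for illustration.

```python
import hashlib

import pandas as pd

# Toy location records; the column names are assumptions for illustration.
records = pd.DataFrame({
    "user_id": ["alice", "bob", "alice"],
    "timestamp": pd.to_datetime([
        "2021-03-01 08:17:42",
        "2021-03-01 08:55:03",
        "2021-03-01 18:02:10",
    ]),
    "lat": [51.5214, 51.5033, 51.5121],
    "lon": [-0.1431, -0.1196, -0.1240],
})

# Pseudonymization: replace the direct identifier with a salted hash.
SALT = "change-me"
records["pseudonym"] = records["user_id"].apply(
    lambda u: hashlib.sha256((SALT + u).encode()).hexdigest()[:12]
)
records = records.drop(columns=["user_id"])

# Generalization: round timestamps to the hour and coordinates to coarser cells.
records["timestamp"] = records["timestamp"].dt.floor("h")
records["lat"] = records["lat"].round(2)
records["lon"] = records["lon"].round(2)

print(records)
```

Note that even after these steps, all of a person’s points remain linked under the same pseudonym, which is exactly what reidentification attacks exploit.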

However, anonymizing location data is notoriously challenging: fundamentally, no transformation has been found that preserves both user privacy and the utility of the resulting data for general-purpose use.

Indeed, a vast body of research has shown this data to be highly reidentifiable. Researchers previously showed that knowing four random points of someone’s trajectory, such as when and where they take their morning coffee, was enough to uniquely identify that person 95% of the time in a dataset of 1.5 million people. Other studies have replicated similarly high unicity numbers with location data obtained from vehicles, smart cards in public transport, credit card transactions, and mobile phone metadata in a number of countries.
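To make “unicity” concrete, here is a minimal sketch in Python (our own illustration, not the code behind any of the studies above): for each person, we pretend an attacker knows k of their space-time points and check whether anyone else’s trajectory also contains all k of them. The toy data and the (place, hour) representation are assumptions for the example.

```python
import random
from collections import defaultdict


def unicity(trajectories, k=4, seed=0):
    """Estimate the fraction of users uniquely identified by k known points.

    `trajectories` maps a pseudonym to a set of (place, hour) tuples.
    """
    rng = random.Random(seed)

    # Inverted index: which users visited each (place, hour) cell.
    visitors = defaultdict(set)
    for user, points in trajectories.items():
        for p in points:
            visitors[p].add(user)

    unique = 0
    eligible = 0
    for user, points in trajectories.items():
        if len(points) < k:
            continue  # not enough points to draw the attacker's knowledge from
        eligible += 1
        known = rng.sample(sorted(points), k)
        # Users whose trajectories contain every one of the k known points.
        matches = set.intersection(*(visitors[p] for p in known))
        if matches == {user}:
            unique += 1
    return unique / eligible if eligible else 0.0


# Tiny toy dataset of (antenna_id, hour) cells; values are made up.
toy = {
    "u1": {("A", 8), ("B", 9), ("C", 18), ("D", 20)},
    "u2": {("A", 8), ("B", 9), ("E", 18), ("F", 20)},
    "u3": {("G", 7), ("B", 9), ("C", 18), ("H", 21)},
}
print(unicity(toy, k=2))
```

Run over a real mobility dataset, this kind of measurement is what yields the unicity figures quoted above.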

But what happens when the dataset is much bigger, like that of Vodafone UK? Do trajectories get “lost in the crowd” and become effectively anonymous?

This is the question we address in our recent article, and the answer is no. We show that people remain unique in population-scale datasets, and thus that dataset size alone is not sufficient to protect individual privacy. We propose a simple statistical model that (1) reproduces the reidentification rates observed in real datasets, and (2) shows that the reidentification risk remains very high even for a population of 20 million people (93% of people are unique with three known points).

Dataset size is thus no protection against simple reidentification attacks. But all is not lost. On the one hand, the community researching privacy-enhancing technologies is extremely active and producing promising results.

On the other hand, regulators, as evidenced by more principled legislation such as the GDPR, are working with these researchers to draft data protection guidelines. This kind of collaboration is what it will take to develop privacy-preserving methods that unlock the utility of high-dimensional personal data. We seem to be on the right track, but we’re not there yet.
