Anonymization is hard.
Just as with cryptography, most people are not qualified to build their own. Unlike cryptography, the research is at a far earlier stage, and pre-built code is virtually nonexistent. That hasn’t stopped people from claiming certain datasets (like this) are anonymized and (sadly) having them re-identified. Those datasets are generally deidentified rather than anonymized: the names and obvious identifiers are stripped out, but the rest of the data is left untouched. Deidentification doesn’t tend to successfully anonymize data because there are so many other data sources in the world that still contain identifying information; figure out where some identified dataset and the deidentified data align, and you’ve re-identified the dataset. If the dataset had been anonymized, it would have been transformed such that re-identification was impossible, no matter what other information the attacker has to hand.
But fear not! Good anonymization does exist!
This point is worth recognizing because anonymization is a deeply important technique for making the world work with greater privacy. Differential privacy holds great promise and is a critical building block behind techniques like federated learning. Note that differential privacy is not a panacea; there are complex discussions lurking there, including what counts as an appropriate “privacy budget.”
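To make the differential-privacy idea concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. The function name and the epsilon value are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise is drawn from
    Laplace(0, 1/epsilon).
    """
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Each release spends epsilon from the overall privacy budget;
# repeated queries compose, which is why the budget discussion matters.
noisy = laplace_count(true_count=1_337, epsilon=0.5)
```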
There are simpler techniques that are also useful. For example, if I run a popular web service, I need enough server capacity to run that service. My users would strongly prefer that I predict how much I’ll need and provision that amount in advance rather than periodically running out. To make that prediction, I need to know how much traffic I had at points in the past and how much I have now. Storing that I had approximately 55 million pages loaded on a particular day with a peak of 5,000 requests per second gives me enough information to project load. The crowds are large, the rounding substantial, and, for some traffic measurement techniques, the measurement may not even be precise. We have removed the user data and rendered the numbers anonymous.
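As a sketch of that kind of coarse aggregation (the function and field names here are hypothetical), the idea is to keep only rounded, crowd-level numbers and discard everything per-user:

```python
from collections import Counter
from datetime import datetime

def daily_capacity_stats(request_times: list[datetime]) -> dict:
    """Collapse a day of raw request timestamps into coarse numbers.

    Keeps only what capacity planning needs: an approximate daily
    total and a rounded peak requests-per-second. Nothing per-user
    survives the aggregation.
    """
    per_second = Counter(t.replace(microsecond=0) for t in request_times)
    peak_rps = max(per_second.values(), default=0)
    return {
        "approx_daily_pages": round(len(request_times), -6),  # nearest million
        "approx_peak_rps": round(peak_rps, -3),               # nearest thousand
    }
```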
Deidentification is not anonymization (in virtually all cases), but it’s still useful as a data minimization technique. Anonymization is not always an option: If I buy software from an app store, I would be exceedingly displeased if the app store anonymized those records so I couldn’t run the software any more! Anonymized pictures of my kids would defeat the point. But deidentification is practical for certain types of processing. When training a machine-learning model to recommend new apps, the training data doesn’t need to include who I am, just what I have.
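As an illustration of that split (the record shape below is a made-up example, not any real app store’s schema), deidentification for training keeps the “what I have” and drops the “who I am”:

```python
def deidentify_for_training(record: dict) -> dict:
    """Keep 'what I have', drop 'who I am' before the record
    enters an ML training set."""
    return {
        # Dropped on purpose: account id, name, email, payment details.
        "installed_apps": record["installed_apps"],
        "device_class": record["device_class"],  # e.g. "phone", "tablet"
    }

purchase_record = {
    "account_id": "u-48151623",
    "email": "someone@example.com",
    "installed_apps": ["maps", "podcasts", "chess"],
    "device_class": "phone",
}
training_row = deidentify_for_training(purchase_record)
# training_row is deidentified, not anonymized: no identity remains,
# but the recommender still sees what this user has installed.
```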
There’s a counterintuitive pitfall to avoid in deidentification: Overdoing it can cause other privacy problems. When someone asks you to delete their data, you need to delete it completely, not just from the primary data store but from all these secondary data stores, like caches, analytics datasets, and ML training datasets. If you’ve deidentified that data in a meaningful way, then it’s awfully hard to figure out what part of the dataset to delete. Because the data is not anonymized, you are still responsible for doing this deletion. The most effective way to do this is to delete the entire dataset periodically. Some datasets only need to exist for a limited time, like slices of server logs used for analytics. Some datasets can be periodically recreated from the primary data store, like ML training datasets.
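A hedged sketch of that whole-dataset approach, assuming a hypothetical 30-day retention window:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed policy window, not a standard

def dataset_expired(created_at: datetime) -> bool:
    """Whole-dataset retention: rather than hunting for one user's
    rows inside a deidentified dataset, drop the entire slice on a
    schedule and, where needed, rebuild it from the primary store.
    """
    return datetime.now(timezone.utc) - created_at > RETENTION
```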
In some cases, deidentification is useful in long-term datasets. For example, we live in a world where each user may have multiple mobile devices. Sometimes a server needs to keep track of where (which user, which device) certain pieces of data came from, for example, so that data can be automatically deleted when that particular device is no longer in use. That means the server needs some kind of device identifier. Looking more closely, however, the server only needs a few bits of information to differentiate between the relatively small number of devices owned by one user. Thus, the server doesn’t need the whole device identifier: it can hash the identifier and throw away most of the bits of the resulting hash. All the server really needs, in this case, is the ability to differentiate between devices from the same user, so it retains just enough bits of the hash to do that and no more.
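Here is a minimal sketch of that truncated-hash idea (SHA-256 and the 8-bit tag size are assumptions for illustration):

```python
import hashlib

def short_device_tag(device_id: str, bits: int = 8) -> int:
    """Hash a device identifier and keep only the low `bits` bits.

    Enough to tell one user's handful of devices apart, but far too
    little to map the tag back to the full identifier.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)

# 8 bits distinguishes up to 256 devices per user, far more than anyone
# actually owns; the other 248 bits of the hash are discarded.
tag = short_device_tag("3F2504E0-4F89-11D3-9A0C-0305E82C3301")
```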