We previously wrote this 101-level guide to deidentification, hoping to make it easier to understand how deidentification works in practice. This article is meant to be a 201-level follow-up, focused on what deidentification is, what it isn’t and how organizations should actually go about deidentifying their data.
So let’s dive right in.
What are direct and indirect identifiers?
Identifiers are personal attributes that can be used to help identify an individual. Identifiers that are unique to a single individual, such as Social Security numbers, passport numbers and taxpayer identification numbers, are known as “direct identifiers.” The remaining kinds of identifiers are known as “indirect identifiers” and generally consist of personal attributes that are not unique to a specific individual on their own. Examples of indirect identifiers include height, race, hair color and more. Indirect identifiers can often be used in combination to single out an individual’s records. Carnegie Mellon University's Latanya Sweeney, for example, famously showed that 87% of the U.S. population could be uniquely identified using only three indirect identifiers: gender, five-digit ZIP code and birthdate.
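To make that risk concrete, here is a minimal sketch, using a small, entirely hypothetical table that contains no direct identifiers at all, of how to count the records that are already unique on gender, ZIP code and birthdate alone:

```python
import pandas as pd

# Hypothetical table of records with no direct identifiers, only indirect ones.
records = pd.DataFrame({
    "gender":    ["F", "M", "F", "M", "F"],
    "zip_code":  ["02139", "02139", "10001", "10001", "02139"],
    "birthdate": ["1985-03-02", "1990-07-14", "1985-03-02", "1990-07-14", "1962-11-30"],
})

# How many records share each (gender, zip_code, birthdate) combination?
group_sizes = records.groupby(["gender", "zip_code", "birthdate"]).size()

# A combination that maps to exactly one record singles that person out.
unique_count = int((group_sizes == 1).sum())
print(f"{unique_count} of {len(records)} records are unique on these three indirect identifiers.")
```

Even in this toy table, every record is singled out by that combination — the same dynamic Sweeney demonstrated at population scale.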
What’s pseudonymization?
Pseudonymization can be thought of as the masking of direct identifiers. As we explained in our 101-level guide, there are a variety of different masking techniques that can be used to hide direct identifiers (and some are stronger than others). While pseudonymization can reduce re-identification risk — which is why recent data protection laws, like the EU General Data Protection Regulation and the California Consumer Privacy Act, incentivize pseudonymization — it does not by itself meet the level of protections required for true anonymization.
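As an illustration, here is a minimal sketch of one common masking approach, keyed hashing of a direct identifier. The field names and key handling are hypothetical; real deployments typically use a tokenization or key management service rather than a hard-coded secret.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice this would live in a key management system,
# because anyone holding the key can re-compute (and re-link) the pseudonyms.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(direct_identifier: str) -> str:
    """Replace a direct identifier (e.g., an SSN) with a keyed hash."""
    return hmac.new(SECRET_KEY, direct_identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"ssn": "123-45-6789", "zip_code": "02139", "birth_year": 1985}
record["ssn"] = pseudonymize(record["ssn"])

# The SSN is now masked, but zip_code and birth_year (indirect identifiers)
# are untouched -- the record is pseudonymized, not anonymized.
print(record)
```

Note that the indirect identifiers are left intact, which is exactly why pseudonymization on its own does not reach anonymization.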
So what’s anonymization?
Anonymization is the process through which personal data are transformed into non-personal data. From a technical point of view, a big part of this process involves altering the data to make it difficult to match any records with the individuals they represent.
Legal standards for what counts as anonymization vary. Some laws, like the U.S. Health Insurance Portability and Accountability Act, set the threshold at the point where a statistical expert attests that only a “very small” risk of re-identification remains, through what is known as the expert determination standard — highlighting how subjective the legal concept of anonymization is in practice.
Regulators in other jurisdictions frame the threshold differently: the U.K. Information Commissioner’s Office refers to remote re-identification risks, while the former EU Article 29 Working Party looked for “robust[ness] against identification performed by the most likely and reasonable means the data controller or any third party may employ.” In all cases, however, there’s always some level of risk assessment involved and always some consequent level of uncertainty. This means that knowing whether anonymization has been achieved is rarely a black-and-white proposition.
What’s a risk-based approach to anonymization?
If we’re realistic about anonymization — and realism is the job of all lawyers, in our view! — the best we can hope for is getting the risk of re-identification low enough that the data can reasonably be treated as functionally anonymized. Here, the concept of “functional anonymization” means that the data is sufficiently anonymized to pose little risk given the broader controls imposed on that data. This risk-based approach finds its roots in statistical disclosure control methods and research, considering “the whole of the data situation,” to quote the U.K. anonymization framework, and it tends to be favored by regulators in multiple jurisdictions, from the U.S. Federal Trade Commission to the U.K. ICO and more.
Can mathematical techniques help with anonymization, like k-anonymization and differential privacy?
Yes! But the key caveat is this: not on their own.
There are a host of statistical techniques that can help preserve privacy and lead to functional anonymization when combined with additional controls. For this reason, these techniques are often referred to as privacy-enhancing technologies. But these tools also require oversight, analysis and a range of context controls to meaningfully protect data, meaning that it is not enough to run fancy math against a dataset to anonymize it (as nice as that would be). These types of techniques include:
- K-anonymization, which is a data generalization technique that ensures each combination of indirect identifiers appears in at least a certain number of records, making it difficult to single out any individual within a dataset (that minimum number of matching records is referred to as “k,” hence the name). For example, if data has been k-anonymized with k set to 10 and the indirect identifiers are race and age, every combination of race and age that appears in the data will appear in at least 10 records. The higher we set k, the harder it will be to use indirect identifiers to find the record of any specific individual (a minimal sketch of a k-anonymity check appears just after this list).
- Differential privacy, which is a family of mathematical techniques that formally limit the amount of private information that can be inferred about each data subject. There are two main flavors of differential privacy, offering slightly different privacy guarantees: “global,” which offers data subjects deniability of participation, and “local,” which offers deniability of record content. Both operate by introducing randomization into computations on the data to prevent an attacker from reasoning about individual data subjects with certainty. Ultimately, these techniques afford data subjects deniability while still allowing analysts to learn from the data.
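The k-anonymity sketch promised above is below. The column names, the age bands and the choice of quasi-identifiers are all hypothetical, and a real k-anonymization workflow would also generalize or suppress values until a check like this passes:

```python
import pandas as pd

def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every combination of quasi-identifier values appears in at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical data, with exact ages already generalized into bands.
df = pd.DataFrame({
    "race":      ["A", "A", "B", "B", "A", "B"],
    "age_band":  ["30-39", "30-39", "40-49", "40-49", "30-39", "40-49"],
    "diagnosis": ["flu", "cold", "flu", "flu", "cold", "cold"],  # sensitive attribute, not a quasi-identifier
})

print(satisfies_k_anonymity(df, ["race", "age_band"], k=3))  # True: each combination appears 3 times
print(satisfies_k_anonymity(df, ["race", "age_band"], k=4))  # False
```

Here, generalizing exact ages into bands is what lets each combination reach the required group size; with raw ages, most combinations would likely be unique.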
While PETs like k-anonymization and differential privacy can offer mathematical guarantees for individual datasets, it’s important to note that these guarantees rest on assumptions about what other data is available, and those assumptions can change over time. The availability of new data, for example, can create new indirect identifiers, exposing formerly k-anonymized data to attacks.
Differential privacy faces similar issues if an attacker is allowed access to other differentially private outputs computed over the same input, since the privacy guarantee degrades as those outputs accumulate. In practice, this means that context is always a critical factor in applying PETs to data. You can read more about PETs in this UN report on privacy-preserving techniques.
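For differential privacy, a minimal sketch of the “global” flavor applied to a single counting query, using the Laplace mechanism, might look like the following. The epsilon values and data are purely illustrative, and a production system would rely on a vetted library and track a privacy budget across every query run over the same data:

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values: list[bool], epsilon: float) -> float:
    """Differentially private count using the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person changes
    the true count by at most 1), so noise is drawn from Laplace(scale = 1 / epsilon).
    """
    true_count = sum(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical query: how many people in the dataset have a given condition?
has_condition = [True, False, True, True, False, False, True]
print(dp_count(has_condition, epsilon=0.5))  # noisier answer, stronger privacy
print(dp_count(has_condition, epsilon=5.0))  # closer to the true count of 4, weaker privacy
```

That budget is exactly why access to many outputs over the same input matters: each additional differentially private answer consumes more of the overall guarantee.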
So if I want to functionally anonymize data, what should I do?
When performing functional anonymization, we recommend, for starters, combining controls on data and on context. Data controls include the types of operations we’ve discussed in the 101-level guide and in this post, like masking and differential privacy, which we might simply refer to as “data transformation techniques.”
But equally important are controls on context, which include items like access controls, auditing, query monitoring, data sharing agreements, purpose restrictions and more. Context can be thought of as the broader environment in which the data actually sits — the more controls placed on the data and its environment, the lower the re-identification risk will be. Within this framework, it’s also helpful to think about all the different types of attacks and disclosures you’re trying to avoid and to assess how likely each scenario is given your controls.
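One crude but sometimes useful way to make this concrete is simply to write the controls and attack scenarios down and check them against each other. The sketch below is purely illustrative; every control name and scenario in it is hypothetical:

```python
# Purely illustrative: a crude way to make "data controls + context controls"
# concrete when reviewing re-identification risk before a data release.
controls = {
    "direct_identifiers_masked": True,            # data control
    "quasi_identifiers_generalized": True,        # data control
    "access_restricted_to_named_analysts": True,  # context control
    "queries_audited": True,                      # context control
    "data_sharing_agreement_in_place": False,     # context control
}

# Hypothetical attack/disclosure scenarios and the controls expected to mitigate them.
attack_scenarios = [
    {"name": "analyst links records to a public voter file",
     "mitigated_by": ["quasi_identifiers_generalized", "queries_audited"]},
    {"name": "recipient re-shares the dataset outside the agreed purpose",
     "mitigated_by": ["data_sharing_agreement_in_place"]},
]

for scenario in attack_scenarios:
    missing = [c for c in scenario["mitigated_by"] if not controls.get(c, False)]
    status = "needs attention" if missing else "covered"
    print(f"{scenario['name']}: {status}" + (f" (missing: {missing})" if missing else ""))
```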
As with our 101-level guide, we’d like for this post to be interactive, so if you think we’re missing an important area or simply have feedback for us, please comment below or reach out to governance@immuta.com.