
Deidentification 201: A lawyer’s guide to pseudonymization and anonymization


We previously wrote this 101-level guide to deidentification, hoping to make it easier to understand how deidentification works in practice. This article is meant to be a 201-level follow-up, focused on what deidentification is, what it isn’t and how organizations should think about deidentifying their data in practice. 

So let’s dive right in.

What are direct and indirect identifiers?

Identifiers are personal attributes that can be used to help identify an individual. Identifiers that are unique to a single individual, such as Social Security numbers, passport numbers and taxpayer identification numbers, are known as “direct identifiers.” The remaining kinds of identifiers are known as “indirect identifiers” and generally consist of personal attributes that are not unique to a specific individual on their own. Examples of indirect identifiers include height, race, hair color and more. Indirect identifiers can often be used in combination to single out an individual’s records. Carnegie Mellon University's Latanya Sweeney, for example, famously showed that 87% of the U.S. population could be uniquely identified using only three indirect identifiers: gender, five-digit ZIP code and birthdate. 
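To see how quickly indirect identifiers combine to single out records, here is a short Python sketch over a made-up five-record dataset (the records are illustrative, not real data):

```python
from collections import Counter

# Toy records: (gender, five-digit ZIP, birthdate) — made-up values
records = [
    ("F", "02139", "1960-07-12"),
    ("M", "02139", "1960-07-12"),
    ("F", "02139", "1985-03-01"),
    ("F", "02139", "1985-03-01"),
    ("M", "94110", "1972-11-30"),
]

# Count how many records share each combination of indirect identifiers
counts = Counter(records)

# A record whose identifier combination is unique can be singled out
unique = [r for r in records if counts[r] == 1]
print(len(unique), "of", len(records), "records are uniquely identifiable")
# → 3 of 5 records are uniquely identifiable
```

Even though no single attribute here is unique, three of the five records can be singled out by the combination, which is the effect Sweeney's study measured at population scale.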

What’s pseudonymization? 

Pseudonymization can be thought of as the masking of direct identifiers. As we explained in our 101-level guide, there are a variety of different masking techniques that can be used to hide direct identifiers (and some are stronger than others). While pseudonymization can reduce re-identification risk — which is why recent data protection laws, like the EU General Data Protection Regulation and the California Consumer Privacy Act, incentivize pseudonymization — it does not by itself meet the level of protections required for true anonymization. 
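As a concrete, simplified example of masking a direct identifier, the sketch below replaces a Social Security number with a keyed-hash pseudonym. The key, field names and truncation length are illustrative assumptions; real deployments would use a vetted tokenization scheme with proper key management.

```python
import hmac
import hashlib

# Secret key held separately from the data; without it, pseudonyms
# cannot easily be re-generated or linked back to identifiers.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(direct_identifier: str) -> str:
    """Replace a direct identifier (e.g. an SSN) with a stable pseudonym."""
    digest = hmac.new(SECRET_KEY, direct_identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"ssn": "123-45-6789", "zip": "02139", "age": 34}
record["ssn"] = pseudonymize(record["ssn"])
# The direct identifier is now masked, but the indirect identifiers
# (zip, age) remain untouched.
```

Note that the record remains personal data: the indirect identifiers are untouched, which is why pseudonymization alone does not amount to anonymization.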

So what’s anonymization?

Anonymization is the process through which personal data are transformed into non-personal data. From a technical point of view, a big part of this process involves altering the data to make it difficult to match any records with the individuals they represent.

Legal standards for what counts as anonymization vary. Some laws, like the U.S. Health Insurance Portability and Accountability Act, set this threshold at the point where a statistical expert attests that only a “very small” risk of re-identification exists, through a process known as the expert determination standard. This highlights how subjective the legal concept of anonymization is in practice.

Regulators in other jurisdictions, such as the U.K. Information Commissioner's Office, refer to “remote” re-identification risks, or, in the case of the former EU Article 29 Working Party, to “robust[ness] against identification performed by the most likely and reasonable means the data controller or any third party may employ.” In all cases, however, there’s always some level of risk assessment involved and always some consequent level of uncertainty. This means that knowing whether anonymization has been achieved is rarely a black-and-white proposition.

What’s a risk-based approach to anonymization?

If we’re realistic about anonymization — and realism is the job of all lawyers, in our view! — the best we can hope for is getting the risks of re-identification low enough that the data can be considered reasonably, or functionally, anonymized. Here, the concept of “functional anonymization” means that the data is sufficiently anonymized to pose little risk given the broader controls imposed on that data. This risk-based approach finds its roots in statistical disclosure control methods and research, considering “the whole of the data situation,” to quote the U.K. anonymization framework. Grounded in statistical methods — with a healthy dose of realism — this type of risk-based approach tends to be favored by regulators in multiple jurisdictions, from the U.S. Federal Trade Commission to the U.K. ICO and more.

Can mathematical techniques help with anonymization, like k-anonymization and differential privacy?

Yes! But the key caveat is this: not on their own.

There are a host of statistical techniques that can help preserve privacy and lead to functional anonymization when combined with additional controls. For this reason, these techniques are often referred to as privacy-enhancing technologies. But these tools also require oversight, analysis and a host of context controls to meaningfully protect data, meaning that it is not enough to run fancy math against a dataset to anonymize it (as nice as that would be). These types of techniques include:

  • K-anonymization, which is a data generalization technique that ensures each record’s indirect identifiers match those of a specific number of other records, making it difficult to single out individuals within a dataset (the minimum number of matching records is referred to as “k,” hence the name). For example, if data has been k-anonymized with k set to 10 and the indirect identifiers include race and age, every combination of race and age that appears in the data will appear in at least 10 records. The higher we set k, the harder it will be to use indirect identifiers to find the record of any specific individual.
  • Differential privacy, which is a family of mathematical techniques that formally limit the amount of private information that can be inferred about each data subject. There are two main flavors of differential privacy, offering slightly different privacy guarantees: “global,” which offers data subjects deniability of participation, and “local,” which offers deniability of record content. Despite being slightly different, both operate by introducing randomization into computations on data to prevent an attacker from reasoning about its subjects with certainty. Ultimately, these techniques afford data subjects deniability while still allowing analysts to learn from the data.
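As a rough sketch of what the k-anonymity property means in code, the following Python function checks whether every combination of quasi-identifier values in a tiny, made-up dataset appears at least k times (the field names and values are illustrative):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values
    appears in at least k rows of the dataset."""
    combos = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in combos.values())

# Hypothetical generalized data: ages already bucketed into ranges
rows = [
    {"race": "A", "age": "30-39", "diagnosis": "flu"},
    {"race": "A", "age": "30-39", "diagnosis": "cold"},
    {"race": "B", "age": "40-49", "diagnosis": "flu"},
    {"race": "B", "age": "40-49", "diagnosis": "asthma"},
]
print(is_k_anonymous(rows, ["race", "age"], k=2))  # True
print(is_k_anonymous(rows, ["race", "age"], k=3))  # False
```

In practice, achieving k-anonymity is the harder part: it requires generalizing or suppressing values (as in the age buckets above) until every combination clears the threshold.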

While PETs like k-anonymization and differential privacy can offer mathematical guarantees for individual datasets, it’s important to note these guarantees are based on assumptions about the availability of other data that can change over time. The availability of new data, for example, can create new indirect identifiers, exposing formerly k-anonymized data to attacks.

Differential privacy faces similar issues if an attacker is allowed access to other differentially private outputs over the same input. In practice, this means that context is always a critical factor in applying PETs to data. You can read more about PETs in this UN report on privacy-preserving techniques.
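To illustrate why repeated releases over the same input erode the guarantee, here is a minimal Python sketch (the function name and parameters are hypothetical, not from any particular DP library) that adds Laplace noise to a count and tracks the cumulative privacy budget under basic composition:

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    A counting query has sensitivity 1, so Laplace(1/epsilon) noise
    yields an epsilon-differentially private release.
    """
    # Sample Laplace(0, 1/epsilon) noise via the inverse-CDF method
    u = random.random() - 0.5
    scale = 1.0 / epsilon
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Each release spends privacy budget; under basic composition, the
# total privacy loss is the sum of the per-release epsilons.
epsilon_per_query = 0.1
releases = [dp_count(42, epsilon_per_query) for _ in range(10)]
total_epsilon = epsilon_per_query * len(releases)  # 1.0 spent in total
```

Under basic composition, ten releases at epsilon of 0.1 each consume a total budget of 1.0, which is why deployed systems typically cap or monitor the number of queries any consumer can run against the same data.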

So if I want to functionally anonymize data, what should I do?

When performing functional anonymization, we recommend, for starters, combining controls on data and on context. Data controls include the types of operations we’ve discussed in the 101-level guides and in this post, like masking and differential privacy, and which we might simply refer to as “data transformation techniques.”

But equally important are controls on context, which include items like access controls, auditing, query monitoring, data sharing agreements, purpose restrictions and more. Context can be thought of as the broader environment in which the data actually sits — the more controls placed on the data, the lower the re-identification risk will be. Within this framework, it’s also helpful to think about all the different types of attacks and disclosures you’re trying to avoid and assess how likely each scenario is given your controls.

As with our 101-level guide, we’d like for this post to be interactive, so if you think we’re missing an important area or simply have feedback for us, please comment below or reach out to governance@immuta.com.

Photo by Charlie Egan on Unsplash



5 Comments


  • comment Jussi Leppälä • May 29, 2020
    Is it somewhat misleading to compare privacy risks coming from learning new data in the two cases you discuss: equivalence class-based methods like k-anonymization and differential privacy?  In the case of k-anonymization, the loss of privacy can be sudden and complete for an individual depending on the nature of the additional information.  This is especially true if we define k-anonymity over quasi-identifiers and accept unique “confidential” attributes which are always expected to remain secret.  A new differentially private data release only introduces a measurable privacy loss within the parameters of the new release. The privacy guarantees of the original release do not change.
  • comment Jussi Leppälä • May 29, 2020
    Thank you for the article.  I think it clarifies the concepts well and I agree that the broader context is extremely important.
  • comment Alfred Rossi • May 30, 2020
    Hi Jussi,
    
    The goal is not to compare the relative strengths of k-anonymization and differential privacy, but only to observe that context controls remain necessary in either case; in the case of differential privacy, this means safeguarding the outputs themselves. I now elaborate on this issue:
    
    It's true that an epsilon-differentially private release only reveals at most epsilon bits of private information about any data subject. We say that such a release comes with a privacy loss of epsilon, and you are right that this is a fixed property of the release. The problem comes when there are multiple releases, as the net privacy loss is the sum of the privacy losses of the individual releases. (This follows essentially immediately from the definition of differential privacy – see, e.g., Theorem 3.16 in The Algorithmic Foundations of Differential Privacy.)
    
    It's natural to wonder how much additional certainty each release confers to an adversary hoping to infer private information. The answer is: quite a lot. I am happy to outline this in more detail but, roughly, one can model the adversary as a Bayesian learning process that uses each release to update its beliefs. One can show that the reduction in uncertainty is as much as 2^epsilon.
    
    Intuitively, this makes sense: if epsilon=1, then the adversary can potentially gain up to 1 bit of new information about some data subject, cutting the uncertainty in its guess in half. It follows immediately that if an adversary has access to, say, k different epsilon-differentially private releases, then they could be more certain by a factor of up to 2^k over one with access to just a single release. That means an attacker with access to just 10 releases can be over a factor of 1,000 more certain in their beliefs about individuals. Such is the concern, and why we remark that context controls should be implemented even in the case of differential privacy.
  • comment Alfred Rossi • May 30, 2020
    For scale, suppose that with one release the adversary can infer someone's location along the U.S. East Coast with an uncertainty the length of the whole coastline. After 10 releases, the worst-case bound on uncertainty has been reduced to a stretch of coast short enough to be walked in less than an hour.
  • comment Jussi Leppälä • Jun 1, 2020
    Hi Alfred, thank you so much for the clarification and wonderful illustrations.  I fully agree that even in the case of differential privacy, releasing more information will result in an additional loss of privacy.  Your illustrations show clearly that the erosion of privacy can be surprisingly quick.  Sometimes, differentially private releases are allowed for several stakeholders through the same interface assuming that they do not cooperate.  If the assumption does not hold true, the privacy implications will be significant.  I was quick to make the original comment because I have been fascinated by the conceptual clarity of differential privacy compared to some other anonymization methods.  That clarity includes the effects of additional future information.