
Privacy Tech | Deidentification versus anonymization


Editor's Note:

This is the third in a series of Privacy Tech posts focused on privacy engineering and UX design from Humu Chief Privacy Officer Lea Kissner.

Anonymization is hard.

Just like cryptography, most people are not qualified to build their own. Unlike cryptography, the research is at a far earlier stage, and pre-built code is virtually unavailable. That hasn’t stopped people from claiming certain datasets (like this) are anonymized and (sadly) having them re-identified. Those datasets are generally deidentified rather than anonymized: the names and obvious identifiers are stripped out, but the rest of the data is left untouched. Deidentification doesn’t tend to successfully anonymize data because there are so many sources of data in the world that still contain identifying information; figure out where an identified dataset and the deidentified data align, and you’ve re-identified the dataset. If the dataset were anonymized, it would have been transformed such that re-identification was impossible, no matter what other information the attacker has at hand.

But fear not! Good anonymization does exist!

This is an important point to recognize because anonymization is a deeply important technique in making the world work with greater privacy. Differential privacy holds great promise and is a critical building block behind techniques like federated learning. Note that differential privacy is not a panacea; there are complex discussions lurking there, including appropriate “privacy budget.”
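To make the differential privacy mention concrete, here is a minimal sketch of the Laplace mechanism, the classic way to release a count with an epsilon-differential-privacy guarantee. The function name and interface are illustrative, not from any particular library; a real deployment would also need careful privacy-budget accounting, as noted above.

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the
    Laplace mechanism. A counting query has sensitivity 1, so the
    noise scale is 1 / epsilon; smaller epsilon means more noise
    and stronger privacy."""
    # Sample Laplace(0, 1/epsilon) noise by inverting its CDF.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

Each call spends part of the privacy budget, which is exactly the "complex discussion" the post alludes to: ask the same question many times and the noise averages away.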

There are simpler techniques that are also useful. For example, if I run a popular web service, I need enough server power to run that service. My users would strongly prefer that I predict how much I’ll need and provision that amount in advance rather than periodically running out. To make that prediction, I need to know how much traffic I had at points in the past and how much I have in the present. Storing that I had approximately 55 million page loads on a particular day with a peak of 5,000 requests per second gives me enough information to project load. The crowds are large, the rounding substantial, and, for some traffic measurement techniques, the measurement may not be precise. We have removed the user data and rendered the numbers anonymous.
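The capacity-planning example above can be sketched in a few lines: aggregate over a very large crowd, then round coarsely so the released figures carry no per-user signal. The function name and the rounding units are assumptions chosen to match the numbers in the text.

```python
def anonymized_load_stats(page_loads: int, peak_rps: int) -> dict:
    """Round traffic aggregates coarsely: daily page loads to the
    nearest 5 million, peak requests per second to the nearest 1,000."""
    def round_to(value: int, unit: int) -> int:
        return round(value / unit) * unit

    return {
        "approx_page_loads": round_to(page_loads, 5_000_000),
        "approx_peak_rps": round_to(peak_rps, 1_000),
    }
```

So an exact count of 54,871,203 page loads with a peak of 4,812 requests per second would be stored as "approximately 55 million" and "about 5,000", which is all the load projection needs.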


Deidentification is not anonymization (in virtually all cases), but it’s still useful as a data minimization technique. Anonymization is not always an option: If I buy software from an app store, I would be exceedingly displeased if the app store anonymized those records so I couldn’t run the software any more! Anonymized pictures of my kids would defeat the point. But deidentification is practical for certain types of processing. When training a machine-learning model to recommend new apps, the training data doesn’t need to include who I am, just what I have.
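As a sketch of the app-recommendation example, deidentification for model training can be as simple as an allowlist of the fields the model actually needs, dropping direct identifiers before the records ever reach the training pipeline. The field names here are hypothetical.

```python
def deidentify_for_training(records: list[dict]) -> list[dict]:
    """Strip direct identifiers from app-install records, keeping
    only the fields a recommender model needs: what I have, not
    who I am."""
    ALLOWED_FIELDS = {"installed_apps", "device_type", "country"}
    return [
        {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
        for record in records
    ]
```

Note this is minimization, not anonymization: the remaining fields could still be linkable, which is why the deletion obligations discussed next still apply.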

There’s a counterintuitive pitfall to avoid in deidentification: Overdoing it can cause other privacy problems. When someone asks you to delete their data, you need to delete it completely, not just from the primary data store but from all the secondary data stores, like caches, analytics datasets, and ML training datasets. If you’ve deidentified that data in a meaningful way, then it’s awfully hard to figure out which part of the dataset to delete. Because the data is not anonymized, you are still responsible for that deletion. The most effective way to handle this is to delete the entire dataset periodically. Some datasets only need to exist for a limited time, like slices of server logs used for analytics. Some datasets can be periodically recreated from the primary data store, like ML training datasets.

In some cases, deidentification is useful in long-term datasets. For example, we live in a world where each user may have multiple mobile devices. In some cases, a server needs to keep track of where (which user, which device) certain pieces of data came from, for example, so that data can be automatically deleted when that particular device is no longer in use. That means the server needs some kind of device identifier. Looking more closely, however, the server only needs a few bits of information to differentiate between the relatively small number of devices owned by one user. Thus, the server doesn’t need the whole device identifier: It can hash the device identifier and throw away most of the bits of the resulting hash. All that the server really needs, in this case, is the ability to differentiate between devices from the same user, and retaining enough bits of the hash accomplishes that; there’s no need for more.
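The hash-and-truncate idea above can be sketched as follows. This is an illustrative implementation, not the one any particular server uses; eight bits gives 256 possible tags, plenty to tell one user's handful of devices apart but far too few to identify a device on their own.

```python
import hashlib

def short_device_tag(device_id: str, bits: int = 8) -> int:
    """Hash the full device identifier and keep only a few
    low-order bits of the digest. The tag distinguishes devices
    belonging to the same user while discarding almost all of the
    identifying information in the original identifier."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)
```

Because the tag is deterministic, the server can recompute it for an incoming device and match stored records; because so few bits survive, many unrelated devices worldwide share each tag.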

Photo by Llanydd Lloyd on Unsplash

9 Comments


  • comment Adedapo Odeyemi • Jun 18, 2019
    I thought the risk-based de-identification methodology proposed by Khaled El Emam in his book, Guide to the De-Identification of Personal Health Information, produces a desired level of data anonymization. Also, as Khaled once said, anonymization is a term used by European jurisdictions while de-identification is the preferred term under HIPAA? Lastly, whether we are referring to de-identification / anonymization, I think what is at issue is whether we are able to mitigate a number of disclosure risks: 1. identity disclosure; 2. attribute disclosure; and 3. inferential disclosure. Currently, I am not sure any technique exists to mitigate 2 & 3 above.
  • comment Mohammad Mousavi Ghamsari • Jun 19, 2019
    Thank you! The difference has become especially relevant because of the GDPR. When personal data is de-identified, the GDPR is still applicable. If personal data is anonymized, the GDPR is NOT applicable. In the GDPR, de-identification is mentioned as a security measure.
  • comment Stefan Keller • Jun 19, 2019
    I found this article to be a bit confusing. 
    Khaled El Emam and his works are certainly one good source of information, as is GDPR's Art 4 definition of "pseudonymization" (which also mentions TOM to prevent re-identification), ISO 25237 and the Article 29 Working Party's Opinion 05/2014 on “Anonymisation Techniques" ( at http://collections.internetmemory.org/haeu/20171122154227/http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf )
    
    To my understanding, the arguments in the latter, even though from 2014, are still what e.g. the CNIL would use. I would use GDPR's Art 4 definition of "pseudonymization" to argue that anonymization (as the extreme of pseudonymization) can be conditional on the presence of TOM to prevent re-identification. You might want to use C‑582/14 Breyer v Bundesrepublik Deutschland in this context.
  • comment Andy Ketch • Jun 19, 2019
    Lea brings up a great point: anonymization is hard to do well. Practitioners, tools, techniques, and threats vary greatly, though, as Lea notes, it is possible to be successful with anonymization. Let us keep in mind there is no such thing as zero risk of re-identification when data is shared. Getting to a safe level of statistical risk is the goal of de-identification, and this requires measurement.
    
    As Adedapo and Stefan commented earlier, a risk-based approach has proven successful for de-identifying data. Based on the context of use and the threat, one can measure the data risk and then apply the right level of de-identification, using the right techniques, to bring the data below the risk threshold; that way data can be made safe. A well-defined governance model for the use of data is an important basis of any de-identification effort. There are applications for anonymization, strong pseudonymization, and pseudonymization.
    
    Knowing what to do and how to do it per context of use is where data de-identification experience and technique show their value to keep data safe while still being as appropriately usable as possible according to individual trust and regulatory expectations.
  • comment Lea Kissner • Jun 19, 2019
    Happily, the state of research on anonymization and de-identification has advanced quite a bit since Khaled El Emam wrote that book and the CNIL put out that report; both attacks and defense are far stronger. (I haven't read Khaled El Emam's book in particular.) I expect this to remain a strong area of research for at least 5-10 years and things are going to change quickly.
    
    We can't bend math to the law. What we can do is build laws, regulation, and policy both with our current understanding of these matters and the understanding that these will evolve quickly. Describing desired aims rather than particular techniques works better for this.
    
    Adedapo: if I understand correctly what you mean by attribute and inferential disclosure, there are techniques to prevent those sorts of attacks, though they may not be practical in every scenario.
  • comment Steven Malloy • Jun 20, 2019
    Thank you Lea. A very interesting article. I look forward to reading more from you.
  • comment Adedapo Odeyemi • Jun 23, 2019
    Hello Lea, thanks for your reply. Can you please direct me to some literature that cover attribute and inferential disclosure controls?
  • comment Lea Kissner • Jun 24, 2019
    Adedapo: want to drop me a line and we'll discuss interactively? I'd like to get a solid definition of what you're looking for here and how comfortable you are with pointers to CS theory literature rather than throwing a bunch of math over the wall at you. :)
  • comment Adedapo Odeyemi • Jun 26, 2019
    Sure, Lea. I have sent you a message via twitter.