28 Jan. 2020

Is my dataset anonymized? A primer

When you collect datasets online, through an application programming interface, or from a biobank, some providers may describe those datasets as anonymized. However, the concept of anonymization is debated, and it covers only data for which it is no longer possible to re-identify the person behind the data; it is an irreversible process. For instance, the mere use of a hash function will not automatically transform personal data into anonymous data, because a hash function always yields the same output for the same input: anyone who can guess or enumerate the possible inputs can recompute the hashes and match them. Nowadays, it is more common to find pseudonymized data, in which the direct identifiers are replaced by a code through a key or a correspondence table. This process is reversible.
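The difference between hashing and a correspondence table can be illustrated with a minimal Python sketch. The email addresses, the `diagnosis` field and the helper names (`sha256_hex`, `reidentify`, `pseudonymize`) are hypothetical and chosen only for illustration; this is not a description of any provider's actual process.

```python
import hashlib
import secrets


def sha256_hex(value: str) -> str:
    """Deterministic hash: the same input always gives the same digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical "anonymized" release: the email was replaced by its hash.
released = {sha256_hex("alice@example.com"): {"diagnosis": "asthma"}}

# An attacker who can enumerate plausible emails simply re-hashes them.
candidates = ["bob@example.com", "alice@example.com", "carol@example.com"]


def reidentify(released_rows, candidate_ids):
    """Match released hashes against hashes of guessed identifiers."""
    matches = {}
    for candidate in candidate_ids:
        digest = sha256_hex(candidate)
        if digest in released_rows:
            matches[candidate] = released_rows[digest]
    return matches


reidentified = reidentify(released, candidates)


def pseudonymize(identifiers):
    """Replace each identifier with a random code and keep the table."""
    return {identifier: secrets.token_hex(8) for identifier in identifiers}


# Whoever holds the correspondence table can reverse the pseudonymization.
table = pseudonymize(candidates)
reverse_table = {code: identifier for identifier, code in table.items()}
original = reverse_table[table["alice@example.com"]]
```

In this sketch the hashed release is re-identified without any secret, and the pseudonymized data remains reversible for whoever holds `table`. In both cases the data would likely still be personal data.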

There is one question to which data scientists want a clear answer: Can I consider my dataset anonymized? Frequently, lawyers and people with a legal background are troubled by this question, and they prefer not to answer clearly because the answer would require analyzing the technical means that the provider of the personal data put in place. And yet, the business needs an answer to plan its work. Unfortunately, there is no clear legal basis that provides a checklist of the required measures.

GDPR

The EU General Data Protection Regulation provides in Recital 26 that “to determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.”

According to the GDPR, two conditions must therefore be analyzed: First, is the natural person identifiable? Second, if yes, are there objective factors that reasonably prevent the re-identification of the natural person?

Identifiability

To answer the first question, the "Opinion 05/2014 on Anonymisation Techniques," adopted April 10, 2014, by the former Article 29 Data Protection Working Party (now, the European Data Protection Board) is still useful. In fact, if you can isolate data that belongs to an individual (singling out), if you are able to link at least two records or databases concerning the same person (linkability), or if you can deduce the value of an attribute that concerns a natural person (inference), you probably have personal data in the legal sense. As a reminder, personal data is a very broad legal concept, which may include different types of data.
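As a rough illustration of singling out and linkability, the following Python sketch checks a toy dataset. The column names, the choice of quasi-identifiers and the external register are assumptions made for the example; a real assessment would depend on the actual data and on what other sources are available.

```python
from collections import Counter

# Assumed quasi-identifiers: attributes that are not names but may identify.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

records = [
    {"zip": "1201", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "1201", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "1203", "birth_year": 1975, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical external source that still carries names.
external = [{"name": "John Doe", "zip": "1203", "birth_year": 1975, "sex": "M"}]


def qi_key(record, qis=QUASI_IDENTIFIERS):
    """Tuple of quasi-identifier values for one record."""
    return tuple(record[q] for q in qis)


def singled_out(rows, qis=QUASI_IDENTIFIERS):
    """Records whose quasi-identifier combination is unique in the dataset."""
    counts = Counter(qi_key(r, qis) for r in rows)
    return [r for r in rows if counts[qi_key(r, qis)] == 1]


def link(rows, other, qis=QUASI_IDENTIFIERS):
    """Pairs (external record, dataset record) sharing quasi-identifiers."""
    index = {}
    for r in rows:
        index.setdefault(qi_key(r, qis), []).append(r)
    return [(o, r) for o in other for r in index.get(qi_key(o, qis), [])]


unique_rows = singled_out(records)
linked = link(records, external)
```

Here one record can be isolated, and the same record can be linked to a named person in the external register, which suggests the dataset would still contain personal data.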

Objective factors

Regarding the second criterion, the costs, time frame and technological developments must be considered. For instance, if re-identification would require enormous efforts in terms of costs, time and technology, then re-identification is unlikely. Concerning technological developments, different techniques already exist, such as homomorphic encryption, deterministic encryption, zero-knowledge proofs or other techniques of randomization (noise addition, permutation of values, differential privacy) and generalization (aggregation and k-anonymity, l-diversity). For instance, Google promotes its own open-source tool, Private Join and Compute. Recently, scientists proposed a new system called “MedCo,” which secures the distribution of clinical or genomic data. This system uses collective homomorphic encryption and obfuscation techniques and would certainly be useful, particularly for hospitals. However, the nature of genetic personal data, or technical advances over time, can undermine any anonymization.
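Two of the techniques named above, generalization toward k-anonymity and noise addition in the style of differential privacy, can be sketched in a few lines of Python. The records, the choice of quasi-identifiers, the decade-based generalization and the `epsilon` value are all assumptions for illustration, and the noise function is a teaching sketch rather than a production-grade mechanism.

```python
import random
from collections import Counter


def k_anonymity(rows, qis):
    """Smallest group size over the quasi-identifier combinations."""
    counts = Counter(tuple(r[q] for q in qis) for r in rows)
    return min(counts.values()) if counts else 0


def generalize_birth_year(record):
    """Generalization: replace the exact birth year by its decade."""
    out = dict(record)
    out["birth_year"] = f"{record['birth_year'] // 10 * 10}s"
    return out


def noisy_count(true_count, epsilon, rng):
    """Count plus Laplace noise; a counting query has sensitivity 1."""
    scale = 1.0 / epsilon
    # The difference of two exponential draws follows a Laplace distribution.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rows = [
    {"zip": "1201", "birth_year": 1981, "diagnosis": "asthma"},
    {"zip": "1201", "birth_year": 1984, "diagnosis": "flu"},
    {"zip": "1201", "birth_year": 1975, "diagnosis": "flu"},
    {"zip": "1201", "birth_year": 1979, "diagnosis": "diabetes"},
]
qis = ("zip", "birth_year")

k_before = k_anonymity(rows, qis)            # every row is unique
generalized = [generalize_birth_year(r) for r in rows]
k_after = k_anonymity(generalized, qis)      # rows now share decades

flu_cases = sum(1 for r in rows if r["diagnosis"] == "flu")
published = noisy_count(flu_cases, epsilon=1.0, rng=random.Random(0))
```

In this toy example generalization raises k from 1 to 2, which is still weak protection. Whether a given k or `epsilon` makes re-identification not reasonably likely remains a case-by-case assessment.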

Scope of the analysis

As highlighted in the European Parliamentary Research Service study about blockchain and the GDPR, it is unclear whether the analysis should follow an objective or a subjective approach. In other words, should the analysis be carried out between the supplier and recipient only (subjective approach), or should third parties that could obtain access to the personal data also be considered (objective approach)? Pursuant to the Breyer case, it is not required that all information about a natural person be in the hands of one single entity. Conversely, if identification requires disproportionate efforts, technically speaking, or if it is prohibited by law, the risk of re-identification is deemed insignificant. In the Breyer case, although German law does not allow the internet service provider to transmit the additional information directly to the data controller, legal channels exist so the data controller is able to contact the competent authority to obtain that information from the ISP. In a nutshell, it is necessary to take the objective approach and to consider whether a third party could help re-identify the person or may legally be required to do so. Concretely, this requires consideration of the legal and technical means available to the data controller to re-identify a person, taking into account its ability to (lawfully) contact a third party.

Note that the answer to the question posed at the beginning, "Can I consider my dataset anonymized?" is not easy. The business cannot simply take the provider's word for it, and technical means may evolve over time. The business must first analyze whether the data may identify a natural person, and if yes, it must analyze how the data has been technically protected and whether legal means exist to request additional information from the provider or another third party. This process requires time and energy, and sometimes it is easier to forget about the distinction between anonymization and pseudonymization.

In some circumstances, the business would do better to consider how to comply with the law rather than trying to circumvent it. It would be even easier if the lawmaker removed the notion of “anonymized personal data” and submitted all personal data, regardless of its form, to the law. However, this initiative would probably increase the administrative burden for the business and for certain authorities (in particular, ethics commissions or institutional review boards).
