Nonpersonal data is at the forefront of modern data analytics. In general, anonymized or deidentified personal data falls outside the scope of privacy and data protection regulations, offering freedom from the restrictions of privacy compliance while allowing for greater data utilization. However, despite the growing demand for techniques to anonymize or deidentify personal data, the topic remains the subject of intense discussion at the intersection of privacy law and engineering.
One part of the problem is that legal definitions of anonymization and deidentification are not clear and uniform around the world. If you work in a globally operating company, you may ask yourself whether deidentification and anonymization are one and the same, or whether deidentified data equals pseudonymized data. And where does data aggregation fit in?
This lack of uniformity is perpetuated by the absence of concrete legal guidelines and case law to determine which technologies and techniques are state of the art and which are outdated. Under which exact circumstances can personal data be considered anonymized or deidentified? Which privacy technologies and methods suffice?
The appropriate risk assessment to claim sufficient anonymization or deidentification can differ between legal regimes globally and, very often, clear guidelines are missing even within one jurisdiction.
GDPR: Defining anonymized and pseudonymized personal data
According to the EU General Data Protection Regulation, personal data refers to any information related to an identified or identifiable individual who can be directly or indirectly identified, e.g., by utilizing an identifier like a name, identification number, location data, an online identifier or other specific factors related to their physical, physiological, genetic, mental, economic, cultural or social identity.
In regard to anonymized personal data, Recital 26 of the GDPR gives further directions, noting its principles of data protection do not "apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable."
Alongside the concept of anonymized data, the GDPR describes pseudonymized data as personal data that can only be attributed to a specific data subject with the use of additional information, which is kept separately and subject to technical and organizational measures that ensure it is not attributed to an identified or identifiable natural person. Pseudonymization serves as a security measure to "reduce the risks to the data subjects concerned" and supports data protection by design by enabling unlinkability and data minimization.
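To make the distinction concrete, the following minimal sketch, written in Python with hypothetical field names, illustrates the mechanics of pseudonymization: direct identifiers are replaced with random tokens, and the token-to-identity mapping, the "additional information," is held in a separate, secured store.

```python
import secrets

# Minimal pseudonymization sketch (illustrative only; field names are hypothetical).
# The token table is the "additional information" the GDPR requires to be kept
# separately and protected by technical and organizational measures.

def pseudonymize(records, key_store):
    """Replace each direct identifier with a random token; store the mapping in key_store."""
    pseudonymized = []
    for record in records:
        token = secrets.token_hex(16)            # random, meaningless identifier
        key_store[token] = record["name"]        # mapping kept separately from the data set
        pseudonymized.append({"id": token, "diagnosis": record["diagnosis"]})
    return pseudonymized

key_store = {}   # in practice: a separately secured system, not the same database
data = [{"name": "Alice Example", "diagnosis": "A10"},
        {"name": "Bob Example", "diagnosis": "B20"}]
print(pseudonymize(data, key_store))
```

Whoever holds both the pseudonymized data set and the key store can reattribute every record, which is why the data remains personal data in the hands of that party.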
In determining whether data is anonymized or pseudonymized, Recital 26 of the GDPR takes a risk-based approach in which all means of reidentification that could reasonably be used by the controller or a third party to identify an individual have to be taken into account. This "reasonability test" considers objective aspects, e.g., time and technical means, and contextual elements on a case-by-case basis, e.g., the nature and volume of the data. In general, circumstances such as reidentification attacks, technological progress and disclosure of nonpersonal information need to be considered to determine the reasonable likelihood of identifying a natural person.
Absolute and relative approach under the GDPR
In practice, these provisions have led to an array of open questions and legal uncertainty, as where to draw the line between anonymized data and personal data differs depending on whether an absolute or a relative approach is taken.
The absolute approach holds that, for data to be considered anonymous, no remaining risk of reidentification is acceptable. In addition, identifiability must be assessed not only from the perspective of the controller, but also from that of any potential third party, including malicious actors. This approach even goes as far as to say data cannot be considered anonymized as long as the original data is still available. This position was taken by the European Data Protection Board's predecessor, the Article 29 Working Party, in its Opinion 05/2014 on Anonymisation Techniques, which the EDPB still refers to today.
The relative approach assumes a residual risk of reidentification will always remain. "Means likely reasonably to be used" to identify the data subject do not include means of identification that are prohibited by law or practically impossible, requiring a disproportionate effort in terms of time, cost and manpower. This view was taken by the Court of Justice of the European Union in the case commonly known as "Breyer," shortly before the GDPR entered into force.
Both approaches coexist today, but the emphasized approach differs between member states, the EDPB guidelines and international agreements of the EU. For example, when the EDPB answered the European Commission's questions on the consistent application of the GDPR for health research, it referred to both the opinion of the Article 29 Working Party and the Breyer case.
Latest developments in the EU
A new judgment, issued 26 April by the European General Court, has the potential to reignite the conversation around the appropriate risk assessment for anonymization under the GDPR. In this case, personal data was rendered pseudonymous by allocating an alphanumeric code to individual comments people submitted through a form. The alphanumeric code consisted of a 33-digit globally unique identifier randomly generated at the time the responses were received. The dispute concerned whether the data remained pseudonymous when passed on to a third party without the key, or whether it could be treated as anonymous from the third party's perspective.
The EGC ruled the same data could be categorized as either personal or nonpersonal, depending on the factual and legal circumstances involved in a given scenario, as well as each party's capacity to identify the subject of the data. To assess whether reidentification is reasonably possible, this test of feasibility and effort shall be carried out from the perspective of the recipient of the information, considering whether reidentification is legally and factually possible by the recipient.
This decision is expected to be challenged on appeal to the European Court of Justice. If upheld, it could cause a major shift in the context of international transfers of personal data. New cryptographic privacy-enhancing techniques, such as homomorphic encryption and multiparty computation, may come to be recognized as anonymization techniques, as they allow computation on encrypted data, or the derivation of a common output from several parties' input data, without the underlying data being accessible to the party that stores or processes it. This could represent a paradigm shift in navigating the increasingly complex and uncertain world of international data transfers.
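As an illustration of the multiparty computation idea, the sketch below, plain Python with made-up salary figures, uses additive secret sharing: each party's input is split into random shares, the parties sum their shares locally, and only the combination of the partial sums reveals the aggregate, never any individual input.

```python
import secrets

MOD = 2**61 - 1  # a large prime modulus for additive secret sharing

def share(value, n_parties=3):
    """Split value into n additive shares that individually reveal nothing."""
    shares = [secrets.randbelow(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Each party contributes a private salary; no party ever sees another's input.
salaries = [52_000, 61_000, 47_000]
all_shares = [share(s) for s in salaries]

# Party i locally sums the i-th share of every input ...
partial_sums = [sum(column) % MOD for column in zip(*all_shares)]
# ... and only the combination of all partial sums reveals the aggregate.
print(sum(partial_sums) % MOD)   # 160000
```

Whether such a setup counts as anonymization or merely as strong pseudonymization for the parties involved is exactly the kind of question the EGC's perspective-dependent test raises.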
On the other hand, in light of the vast variety of existing reidentification attacks, many will question the lack of a reidentification risk assessment by the EGC, which focused only on the legal means for reidentification.
New U.K. bill defining pseudonymous, but not anonymized/deidentified data
Under the current U.K. data protection regime, the definitions of anonymized and pseudonymized data are similar to those in the GDPR. However, on 8 March the U.K. introduced a new draft Data Protection and Digital Information (No. 2) Bill. This draft legislation does not refer explicitly to anonymized or deidentified information, but instead extends and specifies the definition of personal data.
According to the new U.K. draft, "information relating to an identifiable living individual" implies the living individual is identifiable, directly or indirectly, by the controller, processor or another person at the time of the processing, or that the controller or processor can reasonably assume another person is likely to identify the individual as a result of, or at the time of, the processing.
In contrast to the GDPR, the new U.K. definition explicitly includes "another person" apart from the controller or processor. This aligns with the so-called "motivated intruder test", explored in a draft Information Commissioner's Office guide on anonymization from October 2021, as part of assessing the risk of identifiability. This test considers whether an intruder with reasonable competence, access to resources and investigative techniques could successfully identify individuals based on information claimed to be anonymous. The guide suggests not limiting the assessment to the capabilities of an ordinary person but also considering determined individuals with specific reasons to identify individuals, such as investigative journalists, estranged partners, stalkers or industrial spies.
It is also notable that, in its draft guidance, the ICO explicitly addresses the question of whether the process of anonymizing personal data itself counts as processing, and answers it with a clear yes.
The U.S. FTC cracks down on false claims of "anonymized data"
In the absence of comprehensive federal privacy legislation in the U.S., the Federal Trade Commission is the primary agency responsible for enforcing consumer protection laws in the country.
In this capacity, the FTC published a blog post last year stating business claims of data being "anonymous" or "anonymized" are often deceptive. The FTC emphasizes that research has shown "anonymized" data can often be reidentified, and warns that untrue claims about anonymization can constitute a deceptive trade practice in violation of the FTC Act.
While the FTC does not explicitly define anonymized data, those explanations indicate it considers data anonymized when it cannot be reidentified. This terminology seems to correspond with the one used by the GDPR, which is particularly interesting as U.S. federal and state laws generally do not use the term "anonymization."
HIPAA: Defining deidentified health data
With the Privacy Rule under the 1996 U.S. Health Insurance Portability and Accountability Act, a seemingly clear definition of nonpersonal health information was introduced. Health information is no longer classified as protected health information, and is no longer covered by HIPAA's Privacy Rule, once it has been deidentified according to one of two methods.
Under the first, an expert with relevant knowledge and experience must assess and document that the risk is very small that the deidentified health information, alone or in combination with other reasonably available data, could be used by an anticipated recipient to identify an individual. Under the second, the "Safe Harbor" method, the information must be stripped of 18 categories of identifiers, and the HIPAA-covered entity must have no actual knowledge that the remaining information could be used to identify an individual.
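As a rough illustration only, not a complete implementation of all 18 identifier categories and with hypothetical field names, a Safe Harbor-style transformation might suppress listed identifiers and generalize the remaining quasi-identifiers along these lines:

```python
# Simplified sketch of Safe Harbor-style field suppression (illustrative only;
# it does not cover all 18 identifier categories, and the field names are made up).

DIRECT_IDENTIFIERS = {"name", "ssn", "email", "phone", "medical_record_number"}

def safe_harbor_deidentify(record):
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue                                  # drop listed identifiers entirely
        if field == "zip":
            out["zip3"] = value[:3]                   # keep only the first three digits
        elif field == "birth_date":
            out["birth_year"] = value[:4]             # reduce dates to the year
        else:
            out[field] = value
    return out

record = {"name": "Alice Example", "ssn": "000-00-0000", "zip": "02139",
          "birth_date": "1980-07-14", "diagnosis": "A10"}
print(safe_harbor_deidentify(record))
```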
Whether HIPAA's definition of deidentified data meets the definition of anonymized data used under GDPR is subject to debate. On multiple occasions research has shown the current HIPAA Safe Harbor cannot reliably anonymize data and is not sufficient to protect data against reidentification. Therefore, there are good reasons to consider common HIPAA deidentification approaches as pseudonymizing personal health data under the GDPR.
U.S. state privacy law: Deidentified information
The nomenclature gets even trickier when it comes to U.S. state privacy laws. The laws of California, Colorado, Connecticut, Indiana, Iowa, Virginia, Utah and Texas all explicitly exclude deidentified and aggregated consumer information from their broad definitions of personal information.
The wording in all of these laws is comparable, with the California Consumer Privacy Act, as amended by the California Privacy Rights Act, being the most concrete. It defines deidentified data as "information that cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer," with additional safeguards to be put in place by the business that possesses the information.
These additional safeguards are threefold. First, businesses must take reasonable measures to ensure the information cannot be associated with a consumer or household. Second, businesses must publicly commit to maintain and use the information in deidentified form and not to attempt to reidentify it, except to test whether their deidentification processes satisfy the legal requirements. Third, businesses must contractually oblige any recipients of the information to comply with the very same safeguards.
Comparing these definitions with those in the GDPR, one can argue U.S. state privacy laws are in fact one step ahead, by requiring businesses to make provisions against reidentification.
What complicates the matter, though, is the introduction of aggregate consumer information as a new concept and term in these state laws.
is "information that relates to a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device." The definition further states explicitly this "does not mean one or more individual consumer records that have been deidentified."
This wording and the lack of clear guidance for its interpretation seems problematic in several ways.
First, the term "aggregation" poses a problem due to its broad interpretation. It can refer to collating data, counting people in a group, adding up numerical data or calculating statistics of data about people.
Second, a significant problem with aggregated data can be caused by unique and unusual information appearing within the data itself. In 2000, when computer scientist Latanya Sweeney conducted experiments with the 1990 U.S. Census summary data, she found 87% of the U.S. population reported characteristics that likely made them unique based only on their ZIP code, date of birth and assigned sex.
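A short sketch, plain Python over entirely made-up records, shows how such uniqueness can be measured in practice: count how many records share each combination of quasi-identifiers and report the share that are unique.

```python
from collections import Counter

# Sketch: measure how many people are unique on the quasi-identifiers Sweeney
# studied (ZIP code, date of birth, sex). The records below are fabricated.

population = [
    {"zip": "02139", "dob": "1980-07-14", "sex": "F"},
    {"zip": "02139", "dob": "1975-01-02", "sex": "M"},
    {"zip": "02139", "dob": "1980-07-14", "sex": "F"},  # shares a combination
    {"zip": "94105", "dob": "1990-11-30", "sex": "M"},
]

counts = Counter((p["zip"], p["dob"], p["sex"]) for p in population)
unique = sum(1 for p in population if counts[(p["zip"], p["dob"], p["sex"])] == 1)
print(f"{unique / len(population):.0%} of records are unique on ZIP + DOB + sex")
```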
Third, the Database Reconstruction Theorem from 2003 states that too many statistics published too accurately from a confidential database expose the entire database with near certainty. Requesting too much information from aggregated data can lead to successful database reconstruction attacks.
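A toy differencing attack shows the underlying problem in miniature. The figures below are made up and real reconstruction attacks combine many more statistics, but the principle is the same: two seemingly harmless aggregates can pin down one person's value.

```python
# Sketch of a differencing attack: two "harmless" aggregate statistics combine
# to expose a single individual's value. The data is fabricated.

salaries = {"alice": 52_000, "bob": 61_000, "carol": 47_000}

total_all = sum(salaries.values())                                          # published statistic 1
total_without_carol = sum(v for k, v in salaries.items() if k != "carol")  # published statistic 2

print(total_all - total_without_carol)   # 47000 -> Carol's exact salary
```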
This is true even for the most advanced anonymization or deidentification techniques like differential privacy, which the academic community widely recognizes as the leading statistical definition of data privacy. The number of queries a user can issue cannot be unlimited. Instead, a so-called "privacy budget" sets limits on the queries that can be answered while maintaining the privacy guarantee. In differential privacy, this privacy budget can be calculated and tracked, which distinguishes it as a method that provides meaningful privacy guarantees even when multiple analysis results are released from the same data.
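The sketch below, illustrative Python with a hypothetical PrivateCounter class and an arbitrarily chosen budget, shows the basic mechanics: each counting query adds Laplace noise calibrated to its epsilon, spends part of the budget, and is refused once the budget is exhausted.

```python
import random

# Minimal sketch of the Laplace mechanism with a privacy budget (epsilon).
# The class name, epsilon values and data are illustrative, not a production design.

class PrivateCounter:
    def __init__(self, data, total_epsilon=1.0):
        self.data = data
        self.remaining_epsilon = total_epsilon

    def noisy_count(self, predicate, epsilon=0.25):
        if epsilon > self.remaining_epsilon:
            raise RuntimeError("Privacy budget exhausted")
        self.remaining_epsilon -= epsilon
        true_count = sum(1 for row in self.data if predicate(row))
        # A counting query has sensitivity 1, so the noise scale is 1/epsilon.
        # The difference of two exponential variables yields Laplace(0, 1/epsilon) noise.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

counter = PrivateCounter([{"age": a} for a in (23, 35, 41, 67, 29)])
print(counter.noisy_count(lambda row: row["age"] > 30))
```

After four queries at epsilon 0.25 each, this counter refuses to answer, which is precisely the kind of ongoing control that a one-time "aggregate and release" approach lacks.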
Given the many opportunities for data leakage from aggregated personal information, it is hard to justify that U.S. state privacy laws do not require additional safeguards for consumer information once the data has been aggregated.
International standardization efforts and outlook
In light of those different definitions and the ambiguity about appropriate legal assessments of reidentification risks, the new Privacy-enhancing data deidentification framework ISO/IEC 27559:2022 could be just the groundbreaking reference global privacy professionals need.
The new framework, developed over a five-year period, aimed to establish best practices for the reuse and sharing of personal data. It builds upon the ISO/IEC 20889:2018 standard on privacy-enhancing data deidentification terminology and classification of techniques. It encompasses various dimensions of data protection and includes context, data and identifiability assessments. Different scenarios are outlined within the framework, considering how deidentified data is made available to users in different environments, whether they are internal or external to the custodian organization. Each scenario introduces different risks that can be identified and mitigated using the implementation framework. Governance of the deidentification process and resulting data is also emphasized to ensure ongoing risk monitoring and management. Once adopted by national standards bodies, the ISO/IEC 27559 standard can serve as a basis for compliance assessments, with auditors evaluating the controls required to manage risk.
Another forward-thinking initiative comes from the National Institute of Standards and Technology. In its ongoing effort to keep guidelines for government agencies on deidentification up to date, another call for public comment on NIST Special Publication 800-188 De-Identifying Government Data Sets was recently closed.
Additionally, the NIST Privacy Engineering Program set up a Collaborative Research Cycle, inviting participants to explore deidentification technologies by submitting deidentified instances of rich demographic data using any technique. The goal of this initiative is to advance research, innovation and understanding of data deidentification techniques. The first release of the acceleration bundle is planned for 19 May with additional releases scheduled for the summer.
The state of the art is evolving — don't miss out!
When it comes to protecting sensitive data through anonymization or deidentification, reducing risks to the greatest extent possible requires a proactive approach. In today's rapidly evolving digital landscape, relying on state-of-the-art privacy and security measures is paramount, while depending on legacy data protection measures can leave organizations vulnerable to emerging threats.
A thorough assessment should consider and document the threat landscape, potential vulnerabilities and the likelihood of reidentification attacks, as well as factors such as the nature of the data and the available contextual information.
Legacy data protection measures, although once effective, may now fall short in safeguarding anonymized personal information against reidentification attempts. Challenging questions around personal data feeding machine-learning development will only perpetuate potential issues. To maintain a robust defense against ever-evolving risks, organizations must adapt and embrace advanced privacy and security practices for anonymization and deidentification.
This is in the best interest of the data subjects concerned, and it strengthens organizations' privacy postures in general, helping to build stakeholder trust, demonstrating a commitment to responsible data use and contributing to a more secure environment for the reuse and sharing of personal data.