Inherently identifiable: Is it possible to anonymize health and genetic data?

Nearly 25 million people have taken an at-home DNA testing kit and shared that data with one of four ancestry and health databases. With this proliferation of genetic testing and biometric data collection, the practices used to deidentify this data deserve increased scrutiny. Biometric data, namely genetic information and health records, is innately identifiable. This article looks at whether biometric data can ever truly be anonymized, the methods of deidentification and best practices, and the current state of biometric data under the EU General Data Protection Regulation.

Defining biometric data

Biometric data is defined in many different ways, and the definition largely depends on the jurisdiction. The GDPR defines biometric data in Article 4(14) and lays out ground rules in Article 9 for processing special categories of personal data. Overall, it increases protections for special kinds of data, including biometric data. Article 9 states in part, “processing of personal data revealing racial or ethnic origin … genetic data, biometric data for the purpose of uniquely identifying a natural person … shall be prohibited.” Article 9(4) allows EU member states to create their own stricter regulations.

One interesting exclusion of data, including biometric data, from protection in the GDPR occurs when a data subject dies. Recital 27 Section 1 of the GDPR states, "this regulation does not apply to the personal data of deceased persons." As with Article 9, Recital 27 Section 2 allows member states to create their own stricter rules. In Slovakia, when the data subject is deceased, consent may be given and withdrawn by a "close person." It is unclear which approach is best, but as technology advances, the definition of biometric data should advance as well.

Deidentified biometric data

Biometric data has long been used for many different purposes and in a multitude of fields. However, the method used to anonymize it is largely the same. Typically, a sample is stripped of most of its personally identifying information, such as dates, locations and demographics. The data controller then classifies the biometric data as deidentified or anonymized (i.e., the sample supposedly cannot be linked back to the subject). As computing power grows, merely stripping a biometric data set of its markers is not enough to ensure it cannot be traced back to the data subject. Researchers are able to use complex algorithms, including artificial intelligence, to reorganize or reidentify the biometric data. By combining an organized set with other information from social media or other internet sources, they are able to narrow down with unsettling precision what data likely belongs to certain individuals.
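The reidentification route described above can be sketched in a few lines of Python. This is a toy illustration, not a method taken from any of the studies discussed here: every record, field name (birth_year, zip3, sex, marker) and value is invented, and it assumes the released data set still carries a few coarse attributes after the direct identifiers are removed.

```python
# Toy linkage sketch. All records, field names, and values are hypothetical.
DIRECT_IDENTIFIERS = {"name", "email"}


def strip_identifiers(record):
    """Naive deidentification: drop only the direct identifiers."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def link(deidentified, public, keys):
    """Map each released record index to a name when exactly one
    public profile shares all of the given attributes."""
    matches = {}
    for i, rec in enumerate(deidentified):
        candidates = [
            p["name"] for p in public
            if all(p.get(k) == rec.get(k) for k in keys)
        ]
        if len(candidates) == 1:  # a unique match means reidentification
            matches[i] = candidates[0]
    return matches


study = [
    {"name": "Alice Example", "email": "a@example.org",
     "birth_year": 1980, "zip3": "021", "sex": "F", "marker": "AGGT"},
    {"name": "Bob Example", "email": "b@example.org",
     "birth_year": 1975, "zip3": "021", "sex": "M", "marker": "AGCT"},
    {"name": "Carol Example", "email": "c@example.org",
     "birth_year": 1980, "zip3": "945", "sex": "F", "marker": "TGCT"},
]
released = [strip_identifiers(r) for r in study]

# Stand-in for information scraped from social media or other public sources.
public_profiles = [
    {"name": "Alice Example", "birth_year": 1980, "zip3": "021", "sex": "F"},
    {"name": "Bob Example", "birth_year": 1975, "zip3": "021", "sex": "M"},
    {"name": "Dan Example", "birth_year": 1990, "zip3": "100", "sex": "M"},
]

reidentified = link(released, public_profiles, keys=("birth_year", "zip3", "sex"))
print(reidentified)
```

In this sketch, two of the three released records are linked back to a name even though no name or email was published. The third stays unlinked only because no matching public profile exists, not because the stripping step protected it.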

There is growing skepticism in the field of data protection and privacy law that biometric data can ever truly be deidentified or anonymized. This skepticism rests on the fact that biometric data is inherently identifiable. DNA, fingerprints, facial features and even ears are considered unique to the individual data subject, and regardless of whether the labels are stripped away, they still contain identifying information. In recent years, engineers developed a way to decode DNA (even with a small amount of DNA strands) and make a person's unique DNA highly and inherently identifiable. The question of whether such data can be deidentified has the most impact on research, as studies often rely on deidentifying biometric data to avoid violating the Health Insurance Portability and Accountability Act when they release the study results. The failure to separate the data subject from the results of the study creates huge privacy risks. “If we move into a society where we’re required to use biometrics to identify ourselves, and that information is compromised, anyone can impersonate us,” Electronic Frontier Foundation Senior Attorney Jennifer Lynch said.

Studies regarding deidentified biometric and health data

With the comprehensive and widespread use of the internet for distribution of information, biometric data is uniquely vulnerable. In April 2018, researchers Fida Dankar, Andrey Ptitsyn and Samar Dankar published an article in the journal Human Genomics outlining the challenges and concerns of "the development of large-scale deidentified biomedical databases." Of note, they outlined the features that make genomic data so hard to deidentify and also why it is so valuable to researchers. First, it provides information not just about the data subject but also about their entire familial lineage, including genetic conditions and predispositions. Next, genomic data is "highly distinguishable." A data subject can be identified from a sequence of only 30 single-nucleotide polymorphisms (SNPs). Finally, and most troubling, genes are extremely stable and rarely change. Due to the highly distinguishable and stable nature of DNA, it should be considered inherently identifiable and subject to stricter deidentification criteria than merely removing a few labels.
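A rough back-of-the-envelope calculation suggests why so few SNPs can single out a person. The sketch below is not the cited paper's analysis. It assumes each SNP has two alleles and therefore three possible genotypes, that SNPs vary independently of one another, and that the world population is about 8 billion. Real genotype frequencies are uneven and relatives share DNA, so the figures are an upper bound on how many distinct profiles a panel can encode.

```python
# Back-of-the-envelope sketch under simplifying assumptions (see lead-in).
WORLD_POPULATION = 8_000_000_000  # rough figure, assumed for illustration
GENOTYPES_PER_SNP = 3             # e.g. AA, Aa, aa for a two-allele SNP


def distinct_profiles(n_snps):
    """Upper bound on the genotype combinations across n independent SNPs."""
    return GENOTYPES_PER_SNP ** n_snps


def min_snps_to_exceed(population):
    """Smallest SNP count whose combinations outnumber the population."""
    n = 0
    while distinct_profiles(n) <= population:
        n += 1
    return n


print(distinct_profiles(30))                 # combinations for a 30-SNP panel
print(min_snps_to_exceed(WORLD_POPULATION))  # idealized minimum panel size
```

Under these idealized assumptions, a 30-SNP panel allows roughly 2 x 10^14 combinations, tens of thousands of times more than there are people, which leaves headroom for the uneven frequencies and family resemblance that the model ignores.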

As technology and artificial intelligence advance, the ability to manipulate and organize large deidentified data lakes also increases. In a 2018 Journal of the American Medical Association article, Chinese researchers Liangyuan Na, Cong Yang and Chi-Cheng Lo were able to accurately match 95% of adults to their data in a deidentified user dataset, in an attempt to show that approved deidentification methods did not afford an adequate level of protection. The researchers wrote that the current practices for deidentification of data might be insufficient to ensure privacy.

Practical consequences

With technological advancements moving at a rate much faster than legislation, the best approach to deidentifying biometric data remains unclear. Comprehensive statutes, like the GDPR, are too new to determine whether they can provide a successful long-term framework. The sectoral approach, however, is met with much criticism because of increased compliance costs and the small size of the government agencies tasked with privacy enforcement. Ultimately, the current legislative approach to regulating the deidentification and anonymization of biometric data needs improvement. EU member states need to devote more resources to policing this issue. Further, clear, up-to-date guidelines that multinational companies can follow would promote better information security practices.

Editor's Note: This article was corrected to reflect that the GDPR defines biometric data in Article 4(14).

Photo by Louis Reed on Unsplash
