Though it is no panacea, de-identification remains a critical part of every privacy pro’s toolbox.
At a workshop earlier this month, the Future of Privacy Forum and EY attempted to drill down into some of the challenges that privacy pros face in day-to-day practice. By day’s end, it became clear that disputes over terminology continue to cloud how practitioners can pursue effective de-identification strategies. De-identification remains a concept without a clear definition: Practitioners focus on the percentage of records in a data set that could be re-identified, while technologists and advocates often assert that if even a single record is re-identifiable, de-identification is a misnomer.
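To make that dispute concrete, here is a minimal sketch in Python, using invented toy records, that contrasts the two framings: the share of records that are unique on a handful of quasi-identifiers versus whether any single record is.

```python
from collections import Counter

# Invented toy records: (zip_prefix, birth_year, sex) serve as quasi-identifiers.
records = [
    ("481", 1985, "F"), ("481", 1985, "F"),
    ("481", 1990, "M"), ("992", 1971, "F"),
    ("992", 1971, "F"), ("992", 1971, "F"),
]

group_sizes = Counter(records)

# The practitioner's framing: what share of records is unique, and so at
# plausible risk of re-identification?
unique_records = sum(n for n in group_sizes.values() if n == 1)
print(f"Share of re-identifiable records: {unique_records / len(records):.0%}")  # 17%

# The critic's framing: is even one record unique? If so, on this view,
# calling the data set "de-identified" is a misnomer.
print("Any record re-identifiable:", any(n == 1 for n in group_sizes.values()))  # True
```

The same toy dataset yields “17 percent at risk” under one framing and “not de-identified at all” under the other, which is precisely why the two camps talk past each other.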
Predictably, any discussion about de-identification begins by examining how policymakers inside and outside of industry define personally identifiable information and what they consider to be truly anonymous. As Peter Swire argued at the workshop, a line must be drawn somewhere if we are to have this sort of privacy regime. To escape those regimes, organizations frequently broaden what counts as anonymous information in their public privacy policies, prompting a push-and-pull with regulators. The Federal Trade Commission’s Maneesha Mithal explained that a variety of factors can inform whether the agency might view information as personal or not. She highlighted the sensitivity of the information and the persistence of any identifier as potential factors to consider, suggesting that “in this day and age, to say a device identifier is not personal information, I think that would be difficult.”
Yet while law and policy tend to classify data as residing on one side of that line or the other, practice supports the notion that data instead sits on a spectrum of identifiability.
In 2011, Profs. Dan Solove and Paul Schwartz proposed a categorization scheme, termed “PII 2.0,” with identified, identifiable and non-identifiable information. Yet in the years since that sliding scale was proposed, little has been done to define the middle ground where de-identification lives and breathes. Policymakers remain reluctant to recognize this spectrum, or to recognize the need for some intermediate category or categories of data that can be subject to some but not all privacy restrictions. At the Future of Privacy Forum, we are presenting a paper at the Amsterdam Privacy Conference that seeks to clarify terminology and flesh out this middle ground.
Terminology challenges aside, de-identification is also challenged by the rise of big data. Big data, or more accurately, the increase in sharing and analysis of different data sources, has disrupted how indirect identifiers function in the spectrum’s middle ground. More precise data, and more data over time, change how practitioners should evaluate whether a given piece of information could become an indirect identifier. (Technologists can be quick to suggest there is no mathematical difference between a direct and an indirect identifier, further complicating the situation.)
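The technologists’ point is easiest to see in a linkage attack. Below is a hypothetical sketch in Python, with entirely invented data, of how a handful of indirect identifiers, joined against an auxiliary public source, can behave collectively like a direct identifier.

```python
# A hypothetical linkage attack, with invented data: quasi-identifiers in a
# "de-identified" dataset are joined against a public auxiliary source.

deidentified = [
    {"zip": "02139", "birth_year": 1962, "sex": "F", "diagnosis": "condition A"},
    {"zip": "02139", "birth_year": 1984, "sex": "M", "diagnosis": "condition B"},
]

public_roster = [
    {"name": "J. Doe", "zip": "02139", "birth_year": 1962, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def key(row):
    """Project a row onto its quasi-identifier values."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)

for record in deidentified:
    matches = [p for p in public_roster if key(p) == key(record)]
    if len(matches) == 1:
        # A unique match links a name back to a "de-identified" record.
        print(f"{matches[0]['name']} -> {record['diagnosis']}")
```

No single field here identifies anyone on its own; the combination does, and the distinction between direct and indirect identifiers erodes further as more auxiliary data becomes available.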
Plus, there is growing consensus that industry must be much more careful with terminology. For example, privacy policies that describe persistent identifiers, linked to other data and shared in some capacity, as “anonymous” are clearly problematic. Mithal suggested that the FTC’s enforcement actions might support bifurcating a discussion about identifiability into two pieces: first, data pointers to explicitly personal information, as in the MySpace case; and second, situations such as the Netflix re-identification example, where “a bucket of clues” and otherwise publicly available information can be used to re-identify individuals.
So when should data be considered identifiable?
Workshop participants repeatedly invoked HIPAA as providing lessons and guidelines for determining the identifiability of a given dataset. Practitioners might consider whether a data element is replicable, distinguishable compared to the relevant population, or knowable by adversaries. However, properly evaluating a dataset in light of these principles will require a larger, better-trained body of professionals who understand what is going into a dataset. Decades of de-identification debates have still produced only a small population of experts who understand the technical aspects of data masking and perturbation, and privacy pros may be in the best position to help organizations understand the need for this kind of talent.
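The “distinguishable” prong, in particular, maps loosely onto the familiar k-anonymity measure. As a minimal sketch, assuming invented records and field names, a practitioner might screen a dataset like this:

```python
from collections import Counter

def min_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier
    values: the k in k-anonymity. A result of 1 means at least one record
    is fully distinguishable within this dataset. (A real assessment would
    also compare against the relevant outside population, not just the
    dataset itself.)"""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())

# Invented records with already-generalized fields.
rows = [
    {"zip": "481**", "age_band": "30-39", "sex": "F"},
    {"zip": "481**", "age_band": "30-39", "sex": "F"},
    {"zip": "992**", "age_band": "40-49", "sex": "M"},
]

print(min_group_size(rows, ("zip", "age_band", "sex")))  # -> 1: one record stands out
```

A result of 1 flags at least one fully distinguishable record; generalizing fields further (wider ZIP prefixes, coarser age bands) is the usual way to raise that floor.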
Meanwhile, the role that controls play remains paramount, yet privacy pros often lack clarity about what, precisely, that means. As a result, de-identification critics take the position that only technical solutions provide real protection. In practice, however, internal organizational controls can be effective tools.
Specific tactics industry representatives stressed at the workshop included basics like written policies and procedures, simple physical restrictions on data storage and access, and more involved strategies such as credentialing vendors, downstream data users and even data sources, along with audits and assessments up and down the data supply chain.
Virtually every participant acknowledged that technical de-identification could never be a silver bullet, but that privacy pros ought to advocate for multiple levels of administrative controls and additional technical protections on top of basic de-identification. Selecting an appropriate de-identification strategy may begin with a basic analysis of the organization’s purpose in using data, and this determination may necessarily be sector-specific.
However, the workshop revealed that much of the thinking that goes into these determinations is not even sector-specific but organization-specific, and the body of knowledge that privacy pros can draw from to read up on de-identification is limited. Businesses with different models have found themselves trapped by different understandings of what is appropriate and what constitutes best practice.
What is apparent is that the entire profession needs more guidance, and that will only come when organizations begin sharing their practices and procedures for how they view the spectrum of identifiable data. Plus, regulators could always provide clearer signals as to how they understand anonymity and de-identification.
Our workshop brought a number of smart legal and technical minds into one room, and the constant refrain was for a more frank exchange of what everyone considered to be de-identified. Many organizations, particularly those with sophisticated marketing apparatuses, have developed detailed procedures and controls for different data states, but much of this thinking is locked behind closed doors and inaccessible to others, even as every industry becomes increasingly data-driven.
Certainly, the age of big data has made the term “anonymous” utterly verboten, but treating more and more information as personally identifiable is no solution either: It would have stark consequences for consumers and for industry alike. The only way forward is to begin mapping the de-identification middle ground. There’s much work to be done. We invite privacy pros to engage with us further in this important conversation and to reach out for a further summary of the workshop’s proceedings.