25 April 2016

PII, cookies and de-ID: Shades of gray

EU laws, as well as HIPAA and COPPA in the U.S., have labeled these identifiers personal in many cases. Yet, in most privacy policies, it remains widespread practice to describe these kinds of data points as “non-personal” or “anonymous.” The New York Times website, for example, labels various categories of information as “non-personal” in its privacy policy, including device IDs, cookies, log files, reading history, and even location information.

Speaking recently to Network Advertising Initiative members, Jessica Rich, the FTC’s Consumer Protection Bureau chief, drew attention to these practices while describing the FTC’s position on persistent identifiers and privacy. In a follow-up blog post, Rich noted, “We regard data as ‘personally identifiable,’ and thus warranting privacy protections, when it can be reasonably linked to a particular person, computer, or device. In many cases, persistent identifiers such as device identifiers, MAC addresses, static IP addresses, or cookies meet this test.”

This is, of course, what the FTC has said for years. Rich pointed to both the agency’s 2009 staff report on online behavioral advertising and the 2012 Privacy Report. It’s also what Rich and other senior staff have conveyed to privacy professionals at IAPP conferences and events like the FPF Practical De-Identification Workshop.

Technologists, regulators and consumer advocates have long taken a strict view of de-identification, referring to any information that could possibly be linked to an identity as personal. In Europe, regulators avoid the term de-identification altogether, employing instead a view of anonymization that leaves little room for nuance. European regulation seems to imply that any risk of re-identification, however remote and by whichever third party, brings data under the full remit of data protection law.

Despite a broad consensus around the need for and value of de-identification, the debate as to whether and when data can be said to be truly de-identified appears interminable. Although academics, regulators, and other stakeholders have sought for years to establish common standards for de-identification, they have so far failed to adopt even a common terminology.

It’s this very gap between common terms and shared understandings that the Future of Privacy Forum has taken on in a new paper, Shades of Gray: Seeing the Full Spectrum of Practical Data De-Identification (forthcoming in the Santa Clara Law Review), with an accompanying Visual Guide to Practical Data De-Identification. In these works, we propose parameters for calibrating legal rules to data depending on multiple gradations of identifiability, while also assessing other factors such as an organization’s safeguards and controls. We suggest that rather than treating data as a black or white dichotomy, policymakers should view the states of data as various shades of gray. We also provide guidance on where to place important legal and technical boundaries between different categories of identifiability based on the presence of direct identifiers (e.g., name, social security number) or indirect identifiers (e.g., date of birth, gender) as well as technical, organizational and legal controls preventing employees, researchers or other third parties from re-identifying individuals.

We also specifically address where on the spectrum data items like cookies and unique device identifiers (but also license plates or medical record numbers) lie. We propose three “degrees of identifiability,” recognizing that there is a qualitative distinction between data that is explicitly personal (such as a name or social security number) and data that is merely potentially identifiable (such as a cookie, in many cases). We also recognize a distinction between potentially identifiable data with few or no controls and situations where responsible organizations have implemented reasonable technical and administrative safeguards such that data becomes not readily identifiable (e.g., hashing a MAC address and making public commitments not to re-identify individuals).
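To make the MAC address example concrete, here is a minimal sketch of what such a technical safeguard might look like. It is illustrative only and not drawn from the paper: the function names, the choice of HMAC-SHA256, and the key handling are all assumptions. It contrasts an unkeyed hash, which anyone can reverse by hashing candidate addresses into a look-up directory, with a keyed hash that cannot be matched without a secret the organization keeps under its own controls.

```python
import hashlib
import hmac
import secrets


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and strip separators so equal addresses hash equally."""
    cleaned = mac.strip().lower()
    for sep in (":", "-", "."):
        cleaned = cleaned.replace(sep, "")
    if len(cleaned) != 12 or any(c not in "0123456789abcdef" for c in cleaned):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return cleaned


def plain_hash(mac: str) -> str:
    """Unkeyed SHA-256: stable, but anyone can recompute it for candidate addresses."""
    return hashlib.sha256(normalize_mac(mac).encode("ascii")).hexdigest()


def keyed_hash(mac: str, key: bytes) -> str:
    """HMAC-SHA256 under a secret key (hypothetical: held separately by the organization)."""
    return hmac.new(key, normalize_mac(mac).encode("ascii"), hashlib.sha256).hexdigest()


def build_lookup(candidates):
    """Look-up directory mapping unkeyed hashes back to the addresses that produced them."""
    return {plain_hash(mac): normalize_mac(mac) for mac in candidates}


if __name__ == "__main__":
    key = secrets.token_bytes(32)   # assumption: stored apart from the data, access-controlled
    device = "AA:BB:CC:00:11:22"    # made-up address for illustration
    directory = build_lookup(["aa:bb:cc:00:11:21", "aa:bb:cc:00:11:22"])
    # The unkeyed hash is found in the directory, so the address is recovered.
    print(directory.get(plain_hash(device)))
    # The keyed hash is not in the directory; matching it would require the key.
    print(directory.get(keyed_hash(device, key)))
```

Even the keyed version leaves a persistent pseudonym in the data set, so it only moves data toward the “not readily identifiable” category when paired with administrative commitments of the kind described above.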

To be sure, because these categories of data still contain direct (although obscured) identifiers and often are linked to a variety of indirect identifiers, no credible expert would ever certify them as perfectly “de-identified” or “anonymous.” And yet, even if not foolproof, these intermediate shades of data help companies unlock value while mitigating privacy risks. Our paper continues, examining data at various stages of identifiability along a spectrum ranging from explicitly personal data to very highly aggregated, anonymous data, with careful attention to intermediate steps, including pseudonymous data (contains indirect but not direct identifiers) and de-identified data (contains neither direct nor indirect identifiers).

Rich struck a similar balance in her comments, stating, “If you’re collecting persistent identifiers, be careful about making blanket statements to people assuring them that you don’t collect any personal information or that the data you collect is anonymous. And as you assess the risks to the data you collect, consider all your data, not just the data associated with a person’s name or email address. Certainly, all forms of personal information don’t need the same level of protection, but you’ll want to provide protections that are appropriate to the risks.”

We think this is the right approach, recognizing the nuance of different levels of identifiability and the value in calibrating rules to the level of privacy risks. Indeed, fiery rhetoric aside, even European regulators in several member states have adopted a more pragmatic approach, accepting softer obligations for pseudonymous data.

Over the past decade, with the de-identification debate raging, organizations around the world necessarily continued to rely on a wide range of technical, administrative and legal measures to reduce the identifiability of personal data to enable critical uses and valuable research while providing protection to individuals’ identity and privacy.

To be sure, in certain cases businesses characterize data as de-identified despite the presence of persistent identifiers that are broadly disseminated and readily reversed through look-up directories; that view of what counts as personal data is too narrow. At the same time, an overly restrictive approach by regulators, which does not recognize the risk mitigation afforded by different levels of identifiability, misses the point. Regulation should provide incentives, even if not a safe harbor, for various shades of de-identification. While zero risk is more protective than low risk, low risk should still be preferred to a baseline of fully identified data.

A regulatory approach recognizing the complete spectrum of data categories will create incentives for organizations to avoid explicit identification and to deploy elaborate safeguards and controls, while at the same time allowing them to maintain the utility of data sets.
