The California Consumer Privacy Act has made plenty of waves since its announcement in April 2018. A near-look-alike of the EU General Data Protection Regulation, it is the first of its kind in the U.S. and presents many complications for global businesses with California residents as their consumers. The CCPA will demand revisions to many data-handling practices, chiefly around data subject access rights, and, depending on your organization's prior approach to legal frameworks such as the Gramm-Leach-Bliley Act, it also expands the definition of personal information.
Of course, one of the reasons competent privacy counsel is in demand in 2019 is to advise on big data monetization within the guardrails of the law. To analyze data effectively, data scientists cannot have their hands tied by privacy regulations. Deidentification of data, a process first approached from a U.S. legal perspective under the Health Insurance Portability and Accountability Act, can facilitate data utility in environments like the cloud or within nascent technologies, such as machine learning.
Under the CCPA, "deidentified" means information that cannot reasonably identify a particular consumer, provided the organization has implemented technical safeguards and business processes that prohibit reidentification, as well as processes to prevent inadvertent release of the deidentified information. Further, no one in the organization may attempt to reidentify the information.
Effective deidentification work is really three-pronged. You must:
- Use a deidentification method: A process whereby the form of the data is depersonalized. For example, instead of a name being stored as "John," it is stored as "J*@n" (in this instance, masking; see the sketch after this list).
- Assess the likelihood of reidentification: As Professor Latanya Sweeney's research at Carnegie Mellon showed in 2000, 87% of the U.S. population can be uniquely identified by the combination of gender, ZIP code and date of birth. Removing data that directly points to the individual, such as a driver's license number, is part of the battle, but consider the effect of deidentifying someone as "a female executive" at a company with 30 employees: Reidentification becomes rather simple.
- Implement controls: Ensure the data is shared only with parties that have a purpose in receiving it, that appropriate segregation of duties exists, and that data that is highly sensitive in its raw form is treated with enhanced confidentiality.
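To make the first prong concrete, here is a minimal sketch of character masking in Python. The `mask_name` helper and its substitution alphabet are hypothetical illustrations, not a standard; production deidentification should rely on vetted tooling and documented policy.

```python
import random

# Substitution characters; purely illustrative.
MASK_CHARS = "*@#&^?"

def mask_name(name: str) -> str:
    """Mask a name by keeping the first and last characters and
    replacing the interior with symbols, e.g., "John" -> "J*@n"."""
    if len(name) <= 2:
        return name
    middle = "".join(random.choice(MASK_CHARS) for _ in name[1:-1])
    return name[0] + middle + name[-1]

print(mask_name("John"))  # e.g., "J*@n" (symbols chosen at random)
```

Masking obscures the stored value, but as the second prong notes, it does nothing about quasi-identifiers like ZIP code and age; those must be assessed separately.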
Deidentification can be quite complicated to execute with the precision needed to truly eliminate privacy risk. The old complications in this process expand under the CCPA, as the dimensions of what is protected have expanded. Most privacy programs will require modification to accommodate the requirements imposed for California residents, such as protecting data inferences.
Beyond the PI elements the statute enumerates, companies also need to pay closer attention to employee data, and one of the chief challenges is that publicly available information is no longer a blanket exemption: Only data lawfully made available from government records is exempted.
The increased capacity of artificial intelligence to repurpose and study data, combined with the CCPA's rigor, will deeply complicate the deidentification process. Here's an example of a deidentification case without AI and CCPA restrictions versus one impacted by both.
Name: John Smith
ZIP code: 11029
Career: Electrical engineer
Nationality: English
The key under the old mechanism was simply to ensure the individual could not be identified from the data provided. A robust deidentification approach might look like the below:
Name: Jo&^ 1m8t?
ZIP code: 1102A
Career: Engineer
Nationality: European
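As a rough sketch of how the transformation above might be automated, the Python below blurs the ZIP code and generalizes the career and nationality fields. The hierarchy tables are hypothetical stand-ins for a governed generalization taxonomy.

```python
# Hypothetical generalization hierarchies; a real program would keep
# these as governed, documented mappings.
CAREER_BUCKETS = {"Electrical engineer": "Engineer", "Mechanical engineer": "Engineer"}
NATIONALITY_BUCKETS = {"English": "European", "French": "European"}

def generalize_record(record: dict) -> dict:
    """Return a coarser copy of a record: blur the last ZIP digit and
    replace specific career/nationality values with broad buckets."""
    out = dict(record)
    out["zip"] = record["zip"][:-1] + "A"  # 11029 -> 1102A
    out["career"] = CAREER_BUCKETS.get(record["career"], "Other")
    out["nationality"] = NATIONALITY_BUCKETS.get(record["nationality"], "Other")
    return out

record = {"zip": "11029", "career": "Electrical engineer", "nationality": "English"}
print(generalize_record(record))
# {'zip': '1102A', 'career': 'Engineer', 'nationality': 'European'}
```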
Now, the combination of those data elements is unlikely to tell you who the data subject is. However, contrast those elements under AI and the CCPA umbrella. PI is defined as information "that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked." One must assume the drafters view PI as more expansive than mere identification; "relating to" or "describing" must address something not covered by identifiability. By adding terms like "associated with" and "relates to," it is clear that personal information extends beyond data that can identify an individual. For instance, mentioning that a celebrity owns the largest brownstone on 52nd Avenue may not identify a particular neighbor, but it does associate everyone who lives on that block.
Further, take an AI use case where the goal is predictive analytics. Say you want to predict which individuals would invest in a vehicle with updated iPhone XR technology capabilities. Consider some likely data elements here:
- Goods and services purchased by the individual.
- Age.
- ZIP code.
- Gender.
- Name.
- Email address.
- Geolocation.
- Device ID.
- Employment history.
You get the picture. Now, each of these elements will likely be converted from its raw form to make the information less identifiable. For example, given access to broader information on the individual, instead of listing each phone accessory purchased, the data set may generalize the information to state "phone accessories," as sketched below.
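As a sketch of that generalization step, the snippet below collapses itemized purchases into coarse categories. The item-to-category map is an assumption for illustration; a real system would draw on a maintained product taxonomy.

```python
# Hypothetical item-to-category map; a production system would derive
# this from a maintained product taxonomy.
PURCHASE_CATEGORIES = {
    "iPhone XR case": "phone accessories",
    "Lightning cable": "phone accessories",
    "Roof rack": "vehicle accessories",
}

def generalize_purchases(items: list[str]) -> set[str]:
    """Collapse individual purchases into categories so the shared
    data conveys interests rather than an itemized history."""
    return {PURCHASE_CATEGORIES.get(item, "other") for item in items}

print(generalize_purchases(["iPhone XR case", "Lightning cable", "Roof rack"]))
# {'phone accessories', 'vehicle accessories'} (set order may vary)
```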
Nevertheless, when the data is repurposed, each time AI or a human studies it, the multitude of combinations may allow the individual to be reidentified, and much of the information along with them. If one runs the following deidentified combination ...
- ZIP code: 110**
- Age: 24–34
- Employment history: CPA at big 4 accounting firm (as opposed to auditor at Deloitte)
- Purchase history: Boat, swing set and hybrid vehicle
... there may be only a select few individuals with precisely those values, and the right forms of technology may isolate that data; hence, an individual can be reidentified. With the amount of personal information specifically named under the CCPA (for example, olfactory data), a host of new privacy risks has emerged.
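One way to gauge this combination risk before release is a simple k-anonymity-style check: count how many records share each quasi-identifier combination and flag any group smaller than a threshold k. This is a minimal sketch under those assumptions; production risk assessment uses far more sophisticated statistical models.

```python
from collections import Counter

def flag_small_groups(records: list[dict], quasi_ids: list[str], k: int = 5) -> list[tuple]:
    """Return quasi-identifier combinations shared by fewer than k
    records; those individuals face elevated reidentification risk."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return [combo for combo, n in counts.items() if n < k]

records = [
    {"zip": "110**", "age": "24-34", "job": "CPA at Big 4 firm"},
    {"zip": "110**", "age": "24-34", "job": "Teacher"},
    # ... the rest of the deidentified data set
]
print(flag_small_groups(records, ["zip", "age", "job"]))
# Unique combinations, like the CPA record above, get flagged.
```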
Solutions
- Implement access controls: Deidentification is not a risk-free activity. Converting the data still requires access to potentially raw data, and even data that is already deidentified carries inherent privacy risk. Access controls are huge here (see the sketch after this list).
- Ensure notice transparency: Privacy and cookie notices require a close look if your company is using this kind of technology. Ensure that notices are as transparent as possible and explain that data is being repurposed. Privacy today is an effective marketing tool; we know that. Still, give consumers enough of a lens that regulatory risk is reduced as much as possible.
- Assess protected-class risks: The CCPA loops in all California-recognized protected classes, including discrimination against those who opt out of varied forms of data handling. Assess whether product offerings made on account of deidentified data run that risk; for example, denying someone's loan application because they are homeless, without a strong documented rationale.
- Do your data mapping: Deidentified information can still be personal. Work with your service providers so they have this data mapped and, for internal purposes, have your files ready for access or deletion requests. Data mapping has to be of utmost importance internally; develop a good working relationship with your data science team so those algorithms are as transparent as possible.
- Check for Brazilian law overlap: Many organizations are scrambling yet again, even after overhauling their programs for the GDPR, because Brazil's new federal privacy law, the LGPD, may be interpreted as offering only a limited benefit for deidentified data. Take a close look here to avoid duplicated effort.
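On the access-control point above, here is a minimal sketch of gating raw versus deidentified copies behind role checks. The roles and data tiers are hypothetical; in practice this enforcement belongs in your identity and access-management stack.

```python
# Hypothetical role-to-tier policy; actual roles and tiers would come
# from your IAM system and data classification policy.
ACCESS_POLICY = {
    "data_scientist": {"deidentified"},
    "privacy_officer": {"deidentified", "raw"},
}

def fetch_dataset(role: str, tier: str, datasets: dict) -> dict:
    """Return the requested data tier only if the role is permitted,
    enforcing segregation between raw and deidentified copies."""
    if tier not in ACCESS_POLICY.get(role, set()):
        raise PermissionError(f"role {role!r} may not access {tier!r} data")
    return datasets[tier]

datasets = {"raw": {"name": "John Smith"}, "deidentified": {"name": "J*@n"}}
print(fetch_dataset("data_scientist", "deidentified", datasets))
# fetch_dataset("data_scientist", "raw", datasets) would raise PermissionError
```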
Organizations are already bent out of shape over the changes in the privacy sphere. Let’s keep our internal and external clients satisfied by thinking ahead.