The California Consumer Privacy Act is notorious for the haste with which it was drafted. Many provisions of the statute require clarification, and the attorney general’s office is holding a series of public forums before issuing clarifying regulations.

Among the concepts not well defined by the CCPA are deidentification, pseudonymization, and aggregation. It is helpful to look at some of the challenges the CCPA creates with its imprecise language on these topics and to point out the limited benefits the CCPA offers a business for each type of data treatment technique.

What does the CCPA say about these concepts?

The CCPA defines deidentification, aggregation, and pseudonymization in § 1798.140.

“Deidentified” (§ 1798.140(h)) "[M]eans information that cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly to a particular consumer, provided that a business that uses deidentified information [has implemented technical and process safeguards that prohibit reidentification of the consumer, has implemented processes to prevent inadvertent release of deidentified information, and makes no attempt to reidentify the information]."

“Aggregate consumer information” (§ 1798.140(a)) "[M]eans information that relates to a group or category of consumers, from which individual consumer identities have been removed, that is not linked or reasonably linkable to any consumer or household, including via a device. 'Aggregate consumer information' does not mean one or more individual consumer records that have been deidentified."

“Pseudonymize” or “Pseudonymization” (§ 1798.140(r)) "[M]eans the processing of personal information in a manner that renders the personal information no longer attributable to a specific consumer without the use of additional information, provided that the additional information is kept separately and is subject to technical and organizational measures to ensure that the personal information is not attributed to an identified or identifiable consumer."

The term “deidentification” is found in:

  • The definition of aggregate consumer information (§ 1798.140(a)).
  • The definition of publicly available information (§ 1798.140(o)(2)).
  • The definition of research (§ 1798.140(s)(2)).
  • Subsection 1798.145(a)(5), where the statute states that nothing in the CCPA should restrict a business’s ability to “collect, use, retain, sell, or disclose consumer information that is deidentified or in the aggregate.”

The term “aggregate” or the phrase “aggregate consumer information” is found in:

  • The definition of publicly available information (§ 1798.140(o)(2)).
  • The definition of research (§ 1798.140(s)(2)).
  • Subsection 1798.145(a)(5), where the statute states that nothing in the CCPA should restrict a business’s ability to “collect, use, retain, sell, or disclose consumer information that is deidentified or in the aggregate.”

The terms “pseudonymize,” “pseudonymization,” and “pseudonym” are found in:

  • The definition of research (§ 1798.140(s)(2)).
  • Subsection 1798.140(x), where the definition of unique identifier includes a “unique pseudonym” that “can be used to recognize a consumer.” “Pseudonym” is used here as a noun naming an identifier, whereas in the definition of research and in their own definition subsection, “pseudonymize” and “pseudonymization” describe a process. The term’s appearance in § 1798.140(x) is therefore likely intended to be a different concept from the one referred to in the definition subsection (§ 1798.140(r)) or the research subsection.

Additionally, “personal information” excludes “publicly available information” (§ 1798.140(o)(2)). And, “‘publicly available’ means information that is lawfully made available from federal, state, or local government records.” If the data is used for a purpose incompatible with the purpose for which the government made it available and maintains it, then it is no longer “publicly available.”

Does the CCPA provide any advantages for deidentifying, pseudonymizing, or aggregating information?

Yes, but the practical advantages are minimal.

The statute regards deidentification and aggregation as distinct and separate concepts: “‘Aggregate consumer information’ does not mean . . . consumer records that have been deidentified” (§ 1798.140(a)). It also discusses pseudonymized data, deidentified data, and aggregated data using distinguishing language: “Subsequently pseudonymized and deidentified, or deidentified and in the aggregate” (§ 1798.140(s)(2)).

What is left unclarified is whether any of these three methods for treating data eases the business obligations found in the CCPA. The short answer is yes, but the intent of that easing is doubtful and the practical advantages it offers are minimal.

The advantages of deidentification and aggregation are connected but limited.

In its description of publicly available information, the law suggests, on a strictly textual reading, that deidentification and aggregation surprisingly disadvantage a business. Information that is publicly available is explicitly not personal information, but the definition of publicly available is narrow and only includes information made available by a government entity; information made public by a private organization is not “publicly available.” The law becomes convoluted when it then states that “‘[p]ublicly available’ does not include consumer information that is deidentified or aggregate consumer information.” So, publicly available information is not personal information, which removes it from the obligations associated with personal information, but if it is deidentified or in the aggregate it is no longer publicly available and is, presumably, personal information again.

Clearly the California legislature does not want to discourage the government from deidentifying or aggregating the information it makes publicly available. An explanation for this strange phrasing is to interpret the provision as an elaboration intended to clarify that a private organization cannot avoid the obligations imposed on personal information simply by releasing deidentified or aggregate consumer information to the public. The availability of information to the public does not bear on its status as “publicly available”—rather, its source (a government entity) and its purpose determine that status.

The advantages of pseudonymization apply only in a research context.

The effect of pseudonymization appears to be entirely restricted to research. “Research” is defined narrowly as “scientific, systematic study . . . conducted in the public interest in the area of public health” (§ 1798.140(s)). To qualify as research, personal information must be either pseudonymized and deidentified, or deidentified and aggregated (§ 1798.140(s)(2)). Research can be a “business purpose” (§ 1798.140(d)(6)), for which a business must comply with the requirements associated with a consumer request for disclosure under § 1798.115. This creates obligations for a business, not exemptions. The only exemption for research in the CCPA is an exemption from the obligation for a business to comply with a consumer’s request for deletion if the personal information is “necessary” for research (§ 1798.105(d)(6)).

In short: If a business conducts research (narrowly defined), it must either (1) pseudonymize and deidentify the personal information, or (2) deidentify and aggregate the personal information. If that requirement is satisfied, then the business need not comply with a consumer’s request to delete the consumer’s personal information (it still must respond to the request, however).

The CCPA imprecisely uses foundational concepts for deidentification, pseudonymization, and aggregation.

The CCPA offers little advantage for deidentifying, pseudonymizing, or aggregating personal information, and it also fails to gracefully connect each concept to the others. All three concepts are related techniques applied to reduce the risk of reidentification, and thus the threat to personal privacy.

Pseudonymization is a subcategory of deidentification, but it is also distinct. Both rely on the fundamental concepts of direct and indirect identifiers. Deidentification, according to NIST and ISO, is a “general term for any process of removing the association between a set of identifying data and the data subject” (or, in GDPR terms, at a minimum removing direct identifiers). Pseudonymization is a “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms”; in other words, the data is indirectly identifiable. Whether one uses the CCPA’s terminology (“linked, directly or indirectly,” “linked or reasonably linkable,” or “attributable”), NIST’s and ISO’s definitions, or the direct and indirect identifiability from the GDPR, the fundamental concept is the same: both deidentification and pseudonymization rely on methods to treat directly identifiable and indirectly identifiable information in a certain way to reduce the risk of reidentification.
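To make the distinction concrete, the following is a minimal Python sketch of pseudonymization, not a compliance recipe. It assumes a record is a simple dictionary, that "name" and "email" are the direct identifiers (hypothetical field names chosen for illustration), and that a keyed hash is an acceptable way to generate the pseudonym. The secret key plays the role of the "additional information" that the statute requires be kept separately.

```python
import hashlib
import hmac

# Assumption: these hypothetical fields are the direct identifiers in a record.
DIRECT_IDENTIFIERS = ("name", "email")


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Remove direct identifiers and attach a keyed pseudonym in their place."""
    # Join the direct identifiers into one stable string for this consumer.
    identity = "|".join(str(record[field]) for field in DIRECT_IDENTIFIERS)
    # Keyed hash: without secret_key, the pseudonym cannot be recomputed
    # from the identity, so the key must be stored apart from the data.
    pseudonym = hmac.new(
        secret_key, identity.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # The remaining fields are kept; they may still identify someone indirectly.
    treated = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    treated["pseudonym"] = pseudonym
    return treated
```

The output still carries fields such as a ZIP code or an age, which is why pseudonymized data remains indirectly identifiable: anyone holding the key, or enough outside information about those remaining fields, may be able to re-link a record to a person.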

Aggregation is a specific method of deidentification that alters quasi-identifiers — akin to indirect identifiers — in order to reduce the risk of reidentification.
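As a rough illustration of aggregation in this sense, the Python sketch below generalizes two quasi-identifiers and reports only group counts. The field names ("zip", "age"), the generalization rules (three-digit ZIP prefix, ten-year age band), and the minimum group size of 3 are all assumptions made for the example; neither the CCPA nor the sources cited here prescribe them.

```python
from collections import Counter


def aggregate(records: list[dict], min_group_size: int = 3) -> dict:
    """Count consumers per generalized (ZIP prefix, age band) group."""
    counts = Counter()
    for record in records:
        # Generalize quasi-identifiers so many consumers share each value.
        zip3 = str(record["zip"])[:3]
        decade = (int(record["age"]) // 10) * 10
        age_band = f"{decade}-{decade + 9}"
        counts[(zip3, age_band)] += 1
    # Suppress small groups, which could single out an individual.
    return {group: n for group, n in counts.items() if n >= min_group_size}
```

Unlike a pseudonymized record, the result contains no row for any one consumer, only counts for groups, which tracks the statute's statement that aggregate consumer information "does not mean one or more individual consumer records that have been deidentified."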

Deidentification, pseudonymization, and aggregation are each a technique for reducing the identifiability of individuals in a data set. More specifically, they look beyond direct identifiers and address whether data can be linked indirectly to an individual.

Conclusion

The CCPA treats each of these techniques as distinct and attempts to assign advantages specific to each one, but it fails to address why one technique grants advantages in a narrow slice of the CCPA and another technique grants advantages in a different narrow slice of the law. For example, the ability to deny a consumer’s request for deletion is reserved for research, which requires that information be pseudonymized, and the relationship between publicly available information and deidentification and aggregation may or may not have a bearing on whether that information is personal information.

Do deidentification and aggregation grant benefits for research? Can publicly available information be pseudonymized? Are the terms interchangeable? If they are not interchangeable, where and why does one apply, but not the others? Is there a practical difference between “linked, directly or indirectly,” “linked or reasonably linkable,” and “attributable”? If so, what is that difference? These are the questions left for the California Attorney General and the state legislature to answer before January 1, 2020.
