Artificial intelligence has existed for several decades, but only recently have large language models, powered by generative pre-trained transformers, been developed into user-friendly formats. These advancements have seized the attention of the broader public and increasingly fuel generative AI applications with a growing impact on people's daily lives.

LLMs require re-identification risk mitigation because they depend on data sharing

Multiparty AI projects and LLMs require effective risk mitigation to avoid unauthorized re-identification of individuals, because these projects are predicated on data sharing among numerous parties.

With this ready access to data, individuals can be more easily re-identified via the Mosaic Effect, which occurs when multiple data sets are combined to identify individuals within them, even if each data set was considered anonymous on its own. There is still an opportunity to confront this risk preemptively, before momentum in AI/LLM adoption becomes akin to a "runaway train" and the integration and utilization of data sets intensifies past the point of "no return."
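To make the Mosaic Effect concrete, here is a minimal sketch using entirely hypothetical data: two data sets that each appear anonymous on their own are joined on shared quasi-identifiers (ZIP code, birth date and sex), and names re-attach to sensitive attributes. All names, values and column choices are illustrative assumptions.

```python
# Minimal sketch of a Mosaic Effect linkage, using hypothetical data.
# Each data set looks "anonymous" on its own; joining them on shared
# quasi-identifiers re-identifies the individuals.
import pandas as pd

# "Anonymized" health data: direct identifiers removed, quasi-identifiers kept.
health = pd.DataFrame({
    "zip": ["02138", "02139", "02141"],
    "birth_date": ["1964-07-31", "1971-02-14", "1983-11-02"],
    "sex": ["F", "M", "F"],
    "diagnosis": ["hypertension", "diabetes", "asthma"],
})

# A public record (e.g., a voter roll) that legitimately contains names.
public = pd.DataFrame({
    "name": ["A. Smith", "B. Jones", "C. Lee"],
    "zip": ["02138", "02139", "02141"],
    "birth_date": ["1964-07-31", "1971-02-14", "1983-11-02"],
    "sex": ["F", "M", "F"],
})

# Neither data set identifies anyone by itself; the join does.
reidentified = health.merge(public, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```

No single data set here is identifying on its own; re-identification emerges only from the combination, which is why protections applied to each data set in isolation can still fail.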

Guardrails are not enough

While encryption, access controls, masking and tokenization act as "guardrails" that curtail specific privacy vulnerabilities, they fall short of precluding unauthorized re-identification when data is shared between parties. The diversity of data sets used in LLMs and multiparty AI projects opens many avenues for correlating seemingly innocuous data points to reveal sensitive information, even in the presence of these guardrails.

To illustrate, access controls and encryption safeguard data against unauthorized access, but they do not restrict authorized parties from exploiting the data, including using it to reveal identities via the Mosaic Effect. Masking and tokenization protect data in isolation; the Mosaic Effect, however, arises from integrating and correlating diverse, ostensibly "anonymized" data sets, unveiling identities and patterns that are imperceptible within any single data set.

Consequently, these guardrail methodologies are inadequate to address the risks stemming from the amalgamation of data from varied origins.
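A short, hypothetical sketch illustrates the gap: even when the direct identifier is tokenized and never appears in either data set, a deterministic token is itself a stable join key, so records from different sources still link. The helper function, names and values below are assumptions for illustration only.

```python
# Sketch of why per-field "guardrails" do not stop linkage: tokenizing the
# direct identifier protects it in isolation, but a deterministic token is
# a stable join key across data sets. Hypothetical data throughout.
import hashlib

def tokenize(value: str) -> str:
    # Deterministic tokenization: the same input yields the same token
    # everywhere it is used.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

purchases = [{"customer": tokenize("alice@example.com"), "item": "insulin"}]
locations = [{"customer": tokenize("alice@example.com"), "home_zip": "02138"}]

# The raw email never appears, yet the two records still link exactly,
# correlating a sensitive purchase with a home location.
linked = [
    {**p, **l}
    for p in purchases
    for l in locations
    if p["customer"] == l["customer"]
]
print(linked)
```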

Recent developments highlighting re-identification risk

Various regulatory bodies and governmental institutions emphasize the need to manage re-identification risks more effectively and to evolve regulatory frameworks to require concrete, enforceable measures that complement foundational principles and ethical commitments. Recent news developments underscore the growing recognition, and urgency, of managing the privacy risks associated with AI and data sharing.

  • In February 2023, commissioners on the U.S. Federal Trade Commission stated that while the U.S. Health Insurance Portability and Accountability Act includes protocols for de-identification (i.e., modifying or removing protected health information to protect identities within a data set), the aggregation of de-identified data sets creates a risk of unauthorized re-identification. Compliance with HIPAA de-identification standards is not necessarily enough to prevent unauthorized re-identification via the Mosaic Effect, given advancements in data analysis and escalating data availability.
  • In March 2023, the U.S. White House Office of Science and Technology Policy adopted a "National Strategy to Advance Privacy-Preserving Data Sharing and Analytics." The strategy underscores the critical need for robust protective measures against unauthorized re-identification, especially via the Mosaic Effect, which can occur when various data sets are combined. The goal is to mitigate the "risk of harm to individuals and society" that can emanate from data sharing and analytics practices.
  • In September 2023, the U.S. Senate adopted a bipartisan framework for risk-based regulation of AI, including warnings from a noted legal expert that legislative frameworks around AI should extend beyond foundational principles and incorporate substantial technical controls to counter unauthorized identification risks like the Mosaic Effect. Simply championing AI transparency, addressing bias and supporting ethical principles is inadequate; there is an urgent need for concrete, enforceable protective measures to impede the integration of anonymized data sets that could culminate in unintended identifications.
  • Looking ahead, the EU is anticipated to approve the EU Artificial Intelligence Act in early 2024, requiring quantifiable and auditable technical risk mitigation safeguards. Many expect the act to set the global standard for AI regulation, much as the EU General Data Protection Regulation has for data protection. For AI regulation to be truly impactful, it must mandate quantifiable and auditable safeguards that protect individuals' identities against unauthorized re-identification attempts. The act's emphasis on technical risk mitigation safeguards underscores the importance of moving beyond theoretical frameworks to practical, enforceable solutions that protect individual privacy in the evolving AI landscape.

A performant privacy approach to mitigating AI/LLM re-identification risk

The word "performant," when applied to computing technology, means "characterized by a high or excellent level of performance or efficiency," or "working in an effective way." The concept of "performant privacy," as outlined by International Data Corporation, is crucial in implementing robust safeguards that embed high-utility, tailored privacy controls directly into the data. This ensures the data maintains its integral speed and accuracy characteristics, similar to processing unprotected data while protecting against unauthorized use, disclosure or re-identification.

This concept is pivotal in preventing the unauthorized identification of individuals via the Mosaic Effect. It enables harmonizing data protection and utility, facilitating secure and seamless data flows and mitigating compliance risks. This not only augments the value of the data but also ensures that it is accessible and usable for expedited decision-making processes and optimizing business value without compromising the fundamental privacy rights of individuals.

To achieve true performant privacy, it is imperative to manage AI/LLM-associated re-identification risks within the broader context of the modern data ecosystem, extending beyond individual data sets. Implementing privacy controls that transcend traditional boundaries is essential to balancing data utility and individual privacy.

Integrating performant privacy measures designed to mitigate the risks associated with the Mosaic Effect can help reconcile data utility and privacy, so the economic potential of AI/LLMs can be achieved while respecting the fundamental rights of individuals.
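As one illustrative sketch of what embedding controls directly into the data can look like (an assumed technique chosen for illustration, not a description of any particular vendor's product), per-context keyed pseudonyms preserve analytic utility within a given sharing context while breaking the cross-data-set joins on which the Mosaic Effect depends. The keys, names and function below are hypothetical.

```python
# Sketch of per-use-case pseudonyms: each sharing context derives a different
# keyed pseudonym for the same person, so records remain joinable within a
# context but cannot be linked across contexts without the keys.
import hashlib
import hmac

def pseudonym(identifier: str, context_key: bytes) -> str:
    # Keyed HMAC: deterministic within one context, unlinkable across
    # contexts, and not reversible from the pseudonym alone.
    return hmac.new(context_key, identifier.encode(), hashlib.sha256).hexdigest()[:12]

KEY_RESEARCH = b"key-for-research-partner"    # hypothetical per-context keys
KEY_MARKETING = b"key-for-marketing-partner"

email = "alice@example.com"
print(pseudonym(email, KEY_RESEARCH))   # joinable within the research context
print(pseudonym(email, KEY_MARKETING))  # different token: cross-context joins fail
```

Because linking across contexts requires access to the keys, re-identification becomes a controlled, auditable decision rather than an uncontrolled side effect of data sharing.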

Editor's Note:

Gibson Dunn’s Jane Horvath, Cognizant’s Diptesh Singh and Anonos’ Gary LaFever will lead a breakout session on performant risk management for LLMs at the IAPP Privacy. Security. Risk. conference, 6 Oct. at 14:00 PT in the San Diego Ballroom A, North Tower.