As the conversation around building artificial intelligence models intensifies, new laws are being introduced that aim to empower individuals by granting them more control over their information.

Large language models relying on vast datasets — typically sourced through web scraping — have put the role of data brokers and their practices under a new spotlight.  

Laws like the forthcoming California Delete Act promise to curb questionable data practices by expanding rights to request deletion of personal data from brokers. Building on foundations like the EU General Data Protection Regulation and the California Consumer Privacy Act, they seem poised to shift power back to individuals in an increasingly opaque data economy.

However, the actual mechanics reveal some troubling deficiencies. Specifically, the broad exclusion of publicly available data found in the definition of personal data creates a paradox in which information that society may commonly consider "private" remains easily accessible. The policy implications of this gap span from criminal justice to personal security.

As society advances in addressing challenges — such as countering bias in AI models, expunging past minor convictions to aid re-entry and lower recidivism rates, and redacting information to safeguard officials' privacy for security reasons — the existing gaps in the legal framework could render these efforts effectively moot. Unless this issue is addressed, the continuous collection and distribution of such potentially harmful public data are likely to persist indefinitely.

A public carve-out

Plenty of ink has been spilled over the various approaches to personal data across the globe. Yet the stark contrast between the U.S. and the EU is nowhere more obvious than in the definition itself. The GDPR provides a concise, yet broad, definition of personal data in Article 4(1), encompassing information relating to both identified and identifiable individuals, whether identified directly or indirectly.

It states, "'Personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person."  

What is truly impressive about this definition has as much to do with what isn't included as what is. The GDPR's expansive definition of "personal data" notably lacks any exclusion of publicly available information. Yet when we examine many U.S. privacy laws, we find exactly that.

Oregon's Consumer Privacy Act states, "'Personal data' means data, derived data or any unique identifier that is linked to or is reasonably linkable to a consumer or to a device that identifies, is linked to or is reasonably linkable to one or more consumers in a household." It continues, "'Personal data' does not include deidentified data or data that: (A) Is lawfully available through federal, state or local government records or through widely distributed media; or (B) A controller reasonably has understood to have been lawfully made available to the public by a consumer." 

Similarly, California's Delete Act excludes "publicly available information" from its core definition of "personal information."

These few words, therefore, carry significant implications. They mean that activities like the web scraping of public records — ranging from criminal histories and social media profiles to home addresses — are not covered by privacy protections, leaving most individuals vulnerable.

While individuals might expect rights over things like criminal records, addresses or social media profiles, they may be disturbed to learn these protections do not necessarily apply because of the publicly available data exclusion.

Public implications

The ramifications of such exclusions are particularly striking when examining their effects on public policies aimed at data redaction or expungement. As efforts increase to clear past minor convictions to facilitate re-entry and lower recidivism rates, this information, if it was public before, can still be captured and shared indefinitely by data brokers who are not governed by expungement laws.  

Therefore, despite legal frameworks — like Oregon's ORS 137.225, which allows certain convictions to be set aside, and the U.S. Fair Credit Reporting Act, which restricts data misuse under specific conditions — individuals continue to struggle to assert these rights against private entities. Therein lies the privacy paradox: laws permit individuals to revoke public access in one sphere while failing to address continued availability through unregulated third-party channels.

Possible counterarguments exist around precedent and social goods. Some may contend that past convictions necessarily involve enduring consequences, so no guarantee of removal is warranted, or that aggregating public data serves business or security interests that outweigh the privacy risks to individuals.

However, these claims reflect outdated assumptions. Mounting research on recidivism and re-entry argues against permanent stigmatization. Moreover, unchecked data collection poses increasing societal risks, including biased AI models based on faulty web scraping, profiling and information manipulation. 

Essentially, there is a legislative gap in which once-public data removed from official records continues circulating beyond the individual's control. This data privacy gap contradicts evolving social values and public policies aiming to restrict access to outdated or dangerous information. Without a reexamination of the publicly available data exception, such policies are becoming functionally pointless.

Public solution  

Fundamentally, we must reassess what obligations private entities owe individuals regarding information originating in public records. As laws evolve to restrict access, the accompanying rationale should also constrain downstream usage rights — particularly for entities holding data purely for commercial gain. 

Yet current exclusion frameworks provide no pathway for this, instead allowing a pre-existing notion of "public" data to persist unchanged. Particularly concerning is the assumption that public availability remains static over time. But we increasingly understand the need for more dynamic conceptions in areas like criminal records. If legal remedies allow for expungement post-sentencing, it is logical that related privacy rights should follow.

In essence, publicly available data exclusions require more nuanced formulation to align with progressive privacy policy and social values. Where law and technology permit official records to be updated or restricted over time, prior public availability should not grant perpetual, uncontrolled use. Perhaps exclusions should be explicitly temporary in nature.

Bridging gaps 

The urgency of re-evaluating and updating the criteria for publicly available data exclusions is evident if those criteria are to keep pace with the dynamic realm of data privacy and protection. Notably, many scholars have pointed out the inherent biases in current AI models.

As LLMs persist in gathering and analyzing data beyond the reach of existing privacy regulations, we risk embedding the adverse outcomes of flawed policies into the core of our AI systems. This occurs even as existing privacy paradigms undermine contemporary policies designed to remediate the failures of the past.

Moving toward a model that treats these exclusions as provisional, with privacy rights reinstated once the basis for exclusion no longer applies, could offer a more balanced and fair approach to personal data privacy.  

As societal perspectives on transparency, accountability and forgiveness continue advancing in coming years, we must bridge policy gaps that would otherwise allow outdated assumptions to undermine emerging progress in nascent fields like AI. Individual rights and AI's positive impact on society need not be mutually exclusive — but achieving both requires conscious examination of existing legal deficiencies. 

Graham Reynolds is senior counsel at Gordon Rees Scully Mansukhani.