8 Nov. 2023

Training AI on personal data scraped from the web

A cautionary tale is unfolding at the intersections of global privacy, data protection law, web scraping and artificial intelligence. Companies that deploy generative AI tools are facing a "barrage of lawsuits" for allegedly using "enormous volumes of data across the internet" to train their programs.

For example, the class action lawsuit PM v. OpenAI LP, filed in San Francisco federal court in late June, claimed OpenAI uses "stolen private information, including personally identifiable information, from hundreds of millions of internet users, including children of all ages, without their informed knowledge or consent." The list of legal violations alleged in the suit invokes the Electronic Communications Privacy Act, the Computer Fraud and Abuse Act, the California Invasion of Privacy Act (which prohibits the recording of phone calls, unless everyone in the conversation consents), the Illinois Biometric Information Privacy Act, California's Unfair Competition Law and other state consumer protection laws. The lawsuit's claims include, but are not limited to, invasion of privacy, intrusion upon seclusion, unjust enrichment and receipt of stolen property.

One of the many issues raised by this lawsuit relates to the lawfulness of web scraping. As Google recently disclosed in an update to its privacy policies, many AI products and services are trained on personal information scraped from the web. Probing the legality of web scraping through an investigation of relevant statutes, case law and global enforcement actions can therefore clarify some of the unfolding dynamics of AI regulation.

Web scraping and the U.S. Computer Fraud and Abuse Act

In the U.S., most legal discussions around the use of data scraped from the web invoke the Computer Fraud and Abuse Act, enacted in 1986. As the country's first dedicated computer crime statute, the CFAA came into effect at a time when fewer than 30 million computers were in use throughout the U.S. Its main goal was to address the nascent crime of hacking. In essence, the statute prohibits intentionally accessing a computer "without authorization" or in a way that "exceeds authorized access." It has been amended and expanded numerous times, including by the USA PATRIOT Act in 2001 and, most recently, through passage of the Identity Theft Enforcement and Restitution Act in 2008.

Over the years, the CFAA has received its fair share of criticism from legal scholars and civil society organizations. They primarily decried its failure to define what "without authorization" means, as well as its "harsh penalty schemes," including the ease with which felony liability is triggered. Notably, the law does not distinguish low-level crimes from more serious ones, and is redundant with other statutes on wire fraud. Several high-profile cases have involved the CFAA, including U.S. v. Aaron Swartz, which concerned the late activist's efforts to download millions of scholarly articles from the digital library JSTOR.

Van Buren: A precedent applicable to web scraping

In 2021, the U.S. Supreme Court ruling in Van Buren v. United States essentially reined in the CFAA, interpreting it from a narrower perspective after decades of mission creep. At the center of the case was Nathan Van Buren, a police officer who was given access to a government database for work purposes but used it for personal reasons. In Van Buren, the court developed what has been referred to as a gates-up-or-down inquiry. For someone to violate the CFAA, "a person needs to bypass a gate that is down that the person isn't supposed to bypass." In Van Buren's case, the Supreme Court ruled he did not violate the CFAA because he was provided access to the database. In other words, the gate was open to him.

Although not directly related to web scraping, the Van Buren case is notable for narrowing the scope of the CFAA. When reading the CFAA as a trespass statute, however, Van Buren left unanswered questions about distinguishing between a gate that triggers liability and a speedbump (i.e., a provider-imposed restriction) that does not. Professor Orin Kerr wrote about the nuance of this distinction in his article Norms of Computer Trespass. He argues for a set of rules under the rubric of authentication, which defines access as "unauthorized when the computer owner requires authentication to access the computer and the access is not by the authenticated user or his agent."

LinkedIn vs. hiQ Labs: Web scraping's pyrrhic victory

The closest thing to a directly applicable legal standard for web scraping involves the legal saga that unfolded in hiQ Labs v. LinkedIn Corp. The now-defunct data analytics firm hiQ, pronounced "high-cue," was engaged in scraping information from public-facing user profiles hosted by LinkedIn. In May 2017, LinkedIn sent a cease-and-desist letter to hiQ, alleging its practices violated the CFAA, the Digital Millennium Copyright Act, California Penal Code 502(c) and California Penal Code 602(k).

Later that year, the U.S. District Court for the Northern District of California granted a preliminary injunction in favor of hiQ, which had raised questions about the CFAA's applicability. The Court ordered LinkedIn to "remove any existing technical barriers to hiQ's access to LinkedIn members' public profiles, and to refrain from erecting any legal or technical barriers that block hiQ's access to those profiles," among other actions. The Ninth Circuit Court of Appeals upheld this order in a 2019 appeal, also known as hiQ I. The Supreme Court then granted LinkedIn's petition for writ of certiorari, vacated the judgment and remanded the case for further consideration in light of Van Buren v. U.S., which was decided in the interim in June 2021.

Then, in April 2022, the Ninth Circuit reaffirmed its original decision. This decision in hiQ II relied explicitly on the Supreme Court's reasoning in Van Buren. As the court explained, "the concept of 'without authorization' does not apply to public websites." In other words, it concluded LinkedIn and its users had assumed the risk that a third-party might view the public-facing user profile containing personal information such as name, email address, education and employment history. Furthermore, the court reasoned:

"… giving companies like LinkedIn free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use—risks the possible creation of information monopolies that would disserve the public interest."

The saga concluded in December 2022 when the two parties reached a confidential settlement to resolve all their remaining disputes after the district court ruled that hiQ breached LinkedIn's User Agreement.

Some hailed the outcome of the LinkedIn-hiQ saga as a victory for web scraping and for the researchers, journalists and companies that engage in it. Yet, importantly, given that the two parties ultimately reached a private settlement, the case did not establish a binding legal precedent with respect to web scraping. Moreover, the cost of this supposed victory was significant. Although the injunction against LinkedIn initially succeeded in court, hiQ's business declined in the meantime, as the company could no longer obtain funding from investors, retain employees, renew client contracts or solicit new business. hiQ began winding down its operations as early as 2018.

Web scraping and U.S. state privacy laws

Legal scholars and practitioners have raised questions about what, if any, restrictions U.S. state privacy laws impose on the practice of web scraping. To begin with, most exclude publicly available information from their definitions of personal information. For example, the CPRA defines publicly available information as:

"… information that a business has a reasonable basis to believe is lawfully made available to the public by the consumer or from widely distributed media; or information made available by a person to whom the consumer has disclosed the information if the consumer has not restricted the information to a specific audience."

Thus, as Husch Blackwell's David Stauss, CIPP/E, CIPP/US, CIPT, FIP, PLS, and Stacy Weber, CIPP/E, CIPP/US, CIPT, explained in the firm's "Byte Back" resource, "personal information that a consumer makes publicly available on social media platforms could fit within the exception."

If the personal data scraped from the web does not qualify for this exclusion, however, the other requirements governing the use of personal information within U.S. state privacy laws may apply to its collection and processing. Section 1798.100(b) of the CCPA, for example, requires any qualifying business collecting covered consumer information to notify the consumers "at or before the point of collection" about which categories of information it plans to collect and for what purposes. Because the definition of collect under the CCPA includes "obtaining" or "gathering" consumer PII "by any means," the conduct of scraping a website seems to fall within that description. However, within the CCPA Final Regulations, approved by the California Office of Administrative Law in March, Section 7012(h) further clarifies that:

"A business that neither collects nor controls the collection of personal information directly from the consumer does not need to provide Notice at Collection to the consumer if it neither sells nor shares the consumer's personal information."

Thus, according to a blog post by Nate Garhart, special counsel at Farella, Braun, and Martel, no notice needs to be provided for:

A data scraper that does not sell the scraped personal information.
A data scraper that uses the scraped information for their own purposes, even for marketing to identified customers.
A data scraper that collects data, deidentifies it, and then sells the deidentified collection of data.

On the other hand, according to Garhart, a scraper selling collections of scraped data that include personal information would be subject to the requirement to provide notice at collection. This may apply to AI products trained on personal data scraped from the web. Yet, important questions remain about whether and to what extent the protections of U.S. state privacy laws would apply to data that is scraped from the web and used to train AI.
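The notice rule quoted above reduces to a short decision check. The following Python sketch is a simplified illustration of Section 7012(h) as quoted in this article, not a compliance tool. The class name, field names and function name are hypothetical, and the sketch assumes that selling deidentified data does not count as selling personal information, consistent with Garhart's third example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapingPractice:
    """Hypothetical summary of how a business obtains and uses personal information."""

    # True if the business collects, or controls the collection of,
    # personal information directly from the consumer.
    collects_directly_from_consumer: bool
    # True if the business sells personal information (assumption: selling
    # deidentified data is recorded here as False).
    sells_personal_information: bool
    # True if the business shares personal information.
    shares_personal_information: bool


def notice_at_collection_required(practice: ScrapingPractice) -> bool:
    """Apply the quoted Section 7012(h) exemption as a simple boolean rule."""
    exempt = (
        not practice.collects_directly_from_consumer
        and not practice.sells_personal_information
        and not practice.shares_personal_information
    )
    return not exempt


# A scraper that keeps scraped data for its own purposes.
internal_use = ScrapingPractice(False, False, False)
# A scraper that sells collections containing personal information.
data_seller = ScrapingPractice(False, True, False)

print(notice_at_collection_required(internal_use))
print(notice_at_collection_required(data_seller))
```

Under these assumptions, the first practice falls within the exemption and the second does not, which mirrors the distinction Garhart draws.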

Web scraping and the GDPR

While complicated questions remain within the U.S. legal system, across the pond, web scraping interacts quite differently with the EU General Data Protection Regulation. For AI technologies driven by personal data or special categories of personal data that are scraped from the internet, GDPR provisions governing the collection and processing of personal data for data controllers likely apply. Heightened obligations may also apply to the collection and processing of one of these special categories of personal data.

What differentiates the GDPR from U.S. privacy law is its default prohibition on collecting and processing personal data unless the controller has a lawful basis. The GDPR outlines six lawful bases that can justify data collection and processing: consent, contract, legal obligation, vital interest, public task and legitimate interest. Under most interpretations of the GDPR, these Article 6 requirements apply whether the information is obtained from a publicly accessible source or collected directly from the data subject.

Explicit GDPR guidance for web scrapers came in 2020, when France's data protection authority, the Commission nationale de l'informatique et des libertés, reminded companies that they must obtain individuals' "freely given, specific, informed and unambiguous consent" to reuse contact details that are published in online public spaces. Similarly, ABB Senior Counsel Piotr Foitzik, CIPP/A, CIPP/C, CIPP/E, CIPP/G, CIPP/US, CIPM, CIPT, FIP, wrote in an article, "there is no doubt, that when the personal data comes from publicly available sources, the data subjects must be notified in line with Article 14."

Given the nature of web scraping, however, consent, transparency and the right to object are difficult principles to operationalize. Other legal bases hold little promise for web scraping as well. Notably, in the Italian Supervisory Authority's March 2022 decision to fine Clearview AI 20 million euros for scraping the web for biometric data, the regulator rejected the company's legitimate interest claim as a lawful basis for its data processing. A similar joint investigation by the U.K. Information Commissioner's Office and the Office of the Australian Information Commissioner against the company included complaints around transparency, purpose limitation and storage limitation. In May 2022, the ICO issued a USD9 million fine against Clearview AI and ordered it to stop obtaining, and to delete, the data it already held on U.K. citizens. A U.K. tribunal overturned the fine in October 2023, saying the GDPR is not applicable to foreign law enforcement activities.

Nonetheless, legal hurdles still exist for web scraping in the EU. Given such challenges, AI models that rely on web-scraped data may be in a difficult legal position with respect to global data protection laws. Indeed, they have already drawn the ire of numerous global privacy and data protection authorities.

Warning from global privacy regulators

The practice of web scraping is certainly on the radar of global privacy regulators. In late August, 12 international data protection and privacy authorities released a joint statement around web scraping to instill greater accountability on social media companies that collect personal information and make it publicly available. Of course, many of the privacy concerns around web scraping originate with bad actors, or those who use web scraping to perpetrate cyberattacks, create fraudulent loan or credit card applications, gather political intelligence, and deliver bulk unsolicited marketing messages. Data protection regulators have seen an increased number of reports of these kinds of incidents and are cautioning all companies to be on heightened alert.

Yet, the regulators' statement stressed that even publicly available personal data is still subject to the protection of data privacy laws. In other words, they are worried web scraping authorized by social media companies may still run counter to individuals' privacy expectations. As an Electronic Privacy Information Center article put it, "When we make information available for the public to view on social media or the web, we do not expect or intend that others will take that information and do with it as they please."

As IAPP Managing Director, D.C., Cobun Zweifel-Keegan, CIPP/US, CIPM, pointed out in his weekly dispatch, though, "Exceptions and exclusions for public data make it difficult to bring enforcement actions over those who make later use of this data." This difficulty partly explains why the strategy of global regulators has been to urge social media companies to take more technical, procedural and legal actions that prevent their websites from being scraped in the first place. These measures, many of which major platforms like Meta already devote substantial resources toward implementing, include things such as rate limiting the number of visits per hour or day by one account, taking more steps to detect bots by using CAPTCHAs, blocking IP addresses where data scraping activity is identified, and sending cease-and-desist letters to suspected and confirmed web scrapers. Moreover, regulators advise that these controls be used in proportion to the sensitivity of the information at stake.
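Of the measures listed above, rate limiting per account is the most straightforward to illustrate. The Python sketch below shows one common approach, a sliding-window limiter, and is not a description of how any particular platform implements it. The class name, method name and thresholds are hypothetical, and the caller supplies the current time in seconds (for example from time.monotonic()) so the logic is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per account within a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each account ID to the timestamps of its recent allowed requests.
        self._history = defaultdict(deque)

    def allow(self, account_id: str, now: float) -> bool:
        """Return True and record the request if the account is under its limit."""
        timestamps = self._history[account_id]
        # Discard requests that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True


# Hypothetical policy: 100 profile views per account per hour.
limiter = SlidingWindowRateLimiter(max_requests=100, window_seconds=3600)
if not limiter.allow("account-123", now=0.0):
    print("Request blocked: hourly limit reached")
```

In line with the regulators' advice on proportionality, an operator could configure a lower max_requests for endpoints that expose more sensitive information.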

Importantly, companies' use of personal data scraped from the web, including for the training of AI technologies, can undermine consumers' trust and thereby have detrimental consequences for the digital economy. Indeed, one of the findings from the IAPP Privacy and Consumer Trust Report concerned a set of behaviors referred to as privacy self-defense. These include deciding against an online purchase, deleting a smartphone app or avoiding a particular website due to privacy concerns. When consumers lose trust in how their data is being collected and used, they are more likely to engage in these self-defensive behaviors to protect their privacy.

Practical takeaways

While global privacy law and AI governance remain in constant motion and the legal status of web scraping continues to develop, there are several actionable takeaways for companies involved in data scraping to consider.

For those engaged in web scraping activities, some practical considerations include:

Reviewing the scraped website's terms of use and/or user agreement.
Minimizing or even avoiding the collection of PII.
Being prepared to halt scraping activity if subject to a cease-and-desist letter.
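The second consideration, minimizing the collection of PII, can be applied before scraped records are stored. The Python sketch below is a minimal illustration under stated assumptions: the function names, field names and placeholder tokens are hypothetical, and the regular expressions catch only simple email and phone formats, so they would not detect every form of personal information.

```python
import re

# Simplified patterns (assumptions): real-world PII detection needs broader coverage.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


def minimize_record(record: dict, allowed_fields: set) -> dict:
    """Keep only allow-listed fields and redact PII inside any text values."""
    return {
        key: redact_pii(value) if isinstance(value, str) else value
        for key, value in record.items()
        if key in allowed_fields
    }


scraped = {
    "title": "Engineer",
    "email": "jane.doe@example.com",
    "bio": "Contact jane.doe@example.com or +1 (415) 555-0100.",
}
# Only the fields needed for the stated purpose are retained.
print(minimize_record(scraped, {"title", "bio"}))
```

An allow-list is used here rather than a block-list so that any field not explicitly needed is dropped by default.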

For those whose data is being scraped, practical considerations include:

Creating internal knowledge systems to build organizational awareness of the likelihood and impact of scraping risks.
Updating terms of service policies to ensure unauthorized scraping is explicitly prohibited.
Removing or limiting public access to data based on its sensitivity.

Last but certainly not least, all organizations should stay updated on applicable global privacy laws and guidance from global data protection authorities on web scraping.

Unwelcome guests?

In a Colorado Technology Law Journal article, one author compared web scrapers to "unwelcome guests of a private dinner party held at a public restaurant." As the privacy expectations of consumers and regulators evolve, web scrapers may find that fewer parties are willing to let them listen in.

Approaches to web scraping vary considerably across countries and jurisdictions, making navigation of the legal landscape applicable to AI technologies that scrape personal data from the web a considerable challenge. Ultimately, companies developing and deploying AI models that rely on web-scraped data should consider both the benefits and the risks of this kind of information collection and data use by employing a risk-based approach. There are undoubtedly benefits brought about by technologies that scrape large quantities of data and turn those into innovative products and services, but these technologies also introduce new privacy risks for which systems of internal and external accountability must be in place.