As long as there is a public internet, people will scrape it. The free and open accessibility of most of the internet has led to many technological breakthroughs, from search engines to powerful generative artificial intelligence models like ChatGPT. Web scraping — the automated extraction of data across websites — is a common and often harmless activity that fuels innovation and the flow of knowledge.
At the same time, scraping can lead to a large array of supposed harms, from co-opting intellectual property to straining computing resources to invading individual privacy. Bots can make use of scraped information in ways no human user could, decontextualizing and amplifying it or incorporating it into machine-learning models. For example, Clearview AI scraped more than 10 billion faces from photos across the internet to create its faceprint database. Technologies like this have the potential to threaten the anonymity and contextual integrity of online spaces. Other companies have misused scraped personal information to generate spam and other unwanted marketing.
Social media platforms and other companies with a large public web presence have a variety of incentives to push back on scraping — for their own benefit and that of their users. Knowing this, the law firm Venable recently launched a project called the Mitigating Unauthorized Scraping Alliance. The firm hosted a series of panels in Washington, D.C., this month discussing some of the pitfalls of unauthorized scraping and the various legal paths to address potential harms. The main takeaway was the crucial need to strike a balance between allowing legitimate and beneficial uses of web scraping and preventing harmful and unethical ones.
The most prominent example of a legal tool that has failed to respond to potentially harmful scraping is the Computer Fraud and Abuse Act, which was designed to combat hacking. The high-profile case of hiQ Labs v. LinkedIn, after a lengthy court process, resulted in a narrowing of the potential use of the CFAA. After the U.S. Supreme Court clarified in Van Buren v. United States that the only relevant test was whether systems were accessed "without authorization," the U.S. Court of Appeals for the Ninth Circuit ruled this does not apply to "public websites."
For the same reason, despite the myriad possibilities of privacy harms flowing from malicious scraping, data privacy law in its current state cannot provide a solution. Every comprehensive privacy law includes an exception for publicly available information. There are many reasons why this is a reasonable exception. But the concept of publicness seems increasingly fragile in a world of ubiquitous computing power and connected databases.
Do you remember phone books? Phone books were actual physical books that would show up at your doorstep every year with a list of every address and phone number of your closest 10,000 neighbors. This was information made public by the phone company. But it was geographically limited, and only the most well-resourced and motivated individuals, private eyes and such, would bypass this contextual limitation. Phone books, on the internet, would be entirely different — immediately accessible to the globe and able to be cross-referenced with hundreds of other sources of information. That distinction is probably why you don't see this service flourishing anymore.
Although the public exception is no doubt here to stay, it is worth considering how it can be refined over time to clarify the level of publicity that counts. Is a website de facto public? The Ninth Circuit says the answer is yes, even if the site includes a friendly robots.txt file politely asking not to be scraped. What about if the site makes you prove you aren't a robot? Or forces you to log in? What if it is a chatroom or Telegram group that anyone can join? Does it matter if the group has five members or a million members? As soon as a scraping bot is allowed to access such a space, is it public?
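For the technically inclined, it is worth noting that the robots.txt convention the Ninth Circuit brushed aside is purely advisory. A minimal Python sketch — using hypothetical rules and a hypothetical bot name, not any real site's policy — shows how a well-behaved crawler consults robots.txt before fetching, and how easily a scraper could simply skip the check:

```python
from urllib import robotparser

# Hypothetical robots.txt rules for an example site.
# The file asks crawlers to avoid /private/; it cannot enforce this.
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A polite bot checks can_fetch() before each request;
# nothing technical stops an impolite one from ignoring the answer.
print(parser.can_fetch("ExampleBot", "https://example.com/private/data"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/about"))         # True
```

The asymmetry is the point: compliance is a choice made by the scraper, not a barrier imposed by the site, which is why courts and regulators have struggled to treat robots.txt as a meaningful access control.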
In his book "Privacy at the Margins," professor Scott Skinner-Thompson documents how similar reasoning has led to the erosion of privacy causes of action under tort law, particularly the common law tort of public disclosure. In many instances, any prior exposure of information in "public," no matter how minimal the audience, can negate a public disclosure claim. Skinner-Thompson includes the compelling example of a gay man who was outed by a pastor of his church but lost his case because he had previously visited gay bars. The fact that some people knew his secret was enough for it to be public.
Uniformity around our understanding of what counts as public information will be hard to achieve given the variegated structures that host information online and the siloed regulatory environments in which this question will be litigated. But the sooner we can reach consensus, the sooner we can collectively guard against privacy harms.
Here's what else I’m thinking about:
- The FTC's case against Kochava is moving forward. After a hearing on a motion to dismiss in the U.S. District Court for the District of Idaho, the judge has instructed the Federal Trade Commission to consolidate the countersuits and add more "flesh on the bones" to the allegations, including more specific allegations of harm, so that both parties can fully engage on the facts. For more on the case, see my column from six months ago.
- Is last year's federal privacy bill impacting state proposals? It is too early to say for sure. Politico reports on some American Data Privacy and Protection Act copycat bills, though neither of these has yet been assigned to a committee. Meanwhile, other state bills are advancing apace.
- Chris Inglis retired from his post as the first national cyber director. The office is now headed by acting Director Kemba Eneas Walden. Inglis appeared on a recent episode of The Lawfare Podcast to discuss his legacy, including the new National Cybersecurity Strategy.
- Every collection is a potential violation. An important case at the Illinois Supreme Court found that each swipe of a fingerprint or other request for biometric information can lead to an independent claim of a Biometric Information Privacy Act violation. This means significantly more potential liability for these claims than many expected.
- Gentleman’s rules for reading each other’s mail. Kenneth Propp’s compelling headline is just the start of a comprehensive analysis of the background of the Organization for Economic Co-operation and Development’s work to establish principles for government access.
Under scrutiny
Microsoft's rapid deployment of the AI functionality it acquired from OpenAI into its Bing search engine is critiqued in a New York Times op-ed by Reid Blackman. An earlier piece in the Times by technology columnist Kevin Roose, recounting a "conversation" he had with the system, is worth reading alongside a well-reasoned critique by Mike Solana.
Upcoming happenings
- March 1 at 8:30 a.m. EST, the Innovation, Data, and Commerce Subcommittee of the House Energy and Commerce Committee hosts a hearing on Promoting U.S. Innovation and Individual Liberty through a National Standard for Data Privacy (Rayburn).
- March 1 at 8:30 a.m. EST, Politico hosts Privacy: Who's Winning? (hybrid).
Please send feedback, updates and scrapings to cobun@iapp.org.