Editor's note: Shortly before publication of this article, CNET reported Anthropic agreed to settle in Bartz v. Anthropic over alleged use of copyrighted material to train its AI models.

Recent developments across all three branches of the federal government have reaffirmed the centrality of copyright as a major artificial intelligence policy issue. On 16 July, Sen. Josh Hawley, R-Mo., led a Senate Committee on the Judiciary hearing titled “Too Big to Prosecute?: Examining the AI Industry’s Mass Ingestion of Copyrighted Works for AI Training.” Meanwhile, major lawsuits are underway between creators and AI developers, alleging this mass ingestion violates copyright protections. And on 23 July, the White House released its AI Action Plan, which sidestepped direct copyright concerns but was accompanied by a presidential address that seemed to signal support for developing AI without burdensome legal hurdles.

Yet as lawmakers and litigants stake out positions, a familiar question is taking new shape: how should the law balance creators’ rights with the drive to innovate? Recent developments suggest that fair use in the context of AI may depend as much on the sourcing of copyrighted materials as on their use.

Too big to prosecute? 

Hawley framed the July hearing as an opportunity to critique AI developers’ use of copyrighted works without licensing or compensation. The senator's opening remarks did not hide his opinion on the matter, equating the use of copyrighted materials for foundation model training to “the largest intellectual property theft in American history.” Hawley even called out individual companies, including Meta and Anthropic, accusing them of downloading copyrighted books and articles from illegal "shadow libraries."  

Sen. Dick Durbin, D-Ill., joined the criticism of foundation model developers; he even called out Anthropic’s CEO for his decision to “steal (copyrighted works) to avoid legal, practice, and business slog, whatever that means.” Durbin's colorful turn of phrase echoed the judge in one of the most high-profile large language model copyright cases — who in turn was quoting Anthropic’s co-founder.

Though both senators explicitly bought into the prevailing U.S. policy narrative of supporting the innovation economy, they made clear in their remarks that AI innovation should not come at the expense of creators and their works. Durbin’s opening remarks highlighted how “AI innovation and protection of intellectual property rights are not mutually exclusive. That is why it is troubling to hear stories about the steps Big Tech companies are taking to train their AI models on copyrighted materials without compensating the creators (of) those works.”

In the week following the hearing, Hawley introduced the AI Accountability and Personal Data Protection Act, co-sponsored by Sen. Richard Blumenthal, D-Conn.   

Hawley’s bill would uniquely merge copyright and privacy protections, creating a private right of action for the misuse of covered data. The new tort would allow individuals to sue a developer if it uses the individual’s personal data, inferences or copyrighted works to train LLMs without their express, informed consent. The bill specifies that consent must be willfully given and cannot be obtained as a condition for a product or service unless reasonably necessary to operate it. The burden to obtain consent is on whoever appropriates, uses, collects, processes, sells or otherwise exploits covered data. This includes training generative AI systems.

In the copyright context, creators would be given the power to sue AI developers — and perhaps those who use appropriative AI systems — for not getting express, informed consent before using their work. Instead of joining large class actions, each creator could pursue individual litigation to seek redress of their copyright claims.

Hawley will likely face an uphill climb for this bill. In an era when deregulation and competition dominate the narrative, Hawley may be hard pressed to find support for a private right of action that could expose AI developers to a wave of private litigation.

Still, this example shows that some Republicans in Congress remain open to exploring various AI guardrails. Meanwhile, it remains up to the courts to determine how existing copyright law applies to generative AI.

Action in the courts: The fair use defense 

As litigation unfolds between creators and AI developers over alleged copyright infringement, the ever-popular fair use defense is at the core of the debate. The fair use doctrine allows for the secondary use of copyrighted works in certain circumstances. Courts apply this defense through a case-by-case analysis using four statutory factors.

LLM developers assert that training these systems constitutes a fair use, despite requiring the collection and storage of millions of copyrighted works. This argument recently came to a head in Bartz v. Anthropic, a prominent case in the U.S. District Court for the Northern District of California.

Judge William Alsup addressed fair use at the summary judgment stage — a point in the case where the judge can resolve legal issues that clearly favor one side, even assuming the plaintiff authors’ factual allegations to be true. Alsup determined that training Anthropic’s Claude model on the copyrighted works probably qualifies as fair use, but that storing the same books in a central library only qualifies as fair use if the copies were legally obtained. The judge's analysis provides a good snapshot of the four statutory fair use factors as applied in the LLM context.

Alsup split his analysis of each factor across the two separate uses at issue in the case: the initial creation of a centralized digital library housing all the books Anthropic collected — the company’s stated goal was to collect “all the books in the world” — and the subsequent training of the Claude LLM on a portion of these works. Anthropic argued that only the LLM training should count as a “use” under copyright, but at this initial stage the court considered both uses, giving the authors the benefit of the doubt.

Further, when analyzing each factor for the central library use, Alsup reviewed two different sets of collected works. The library included thousands of copyrighted works acquired by purchasing and scanning physical copies; thousands more were downloaded from online repositories “assembled from unauthorized copies of copyrighted books — that is, pirated.”  

Purpose and character of the use

In determining fair use, courts first examine how the copyrighted works were used and whether the use generates a new purpose or meaning to the material. This is usually referred to as the transformative nature of the use. Whether the new use is for commercial purposes also frequently plays into this factor.  

For the central library, the court found the conversion of legally possessed copyrighted works from print to digital format was transformative and weighed toward fair use. Anthropic stored the digitized copies for purposes of searchability and organization. While the conversion did not add new meaning or expression, the court found the purpose of internal organization sufficiently transformative to favor fair use.

The pirated copies did not receive the same treatment. Despite the identical practice of copying and retaining full works, the court held that using pirated copies is not likely to count as transformative: copying a work that was already pirated does not become fair use just because it might later be put to a fair use. The retention of pirated works is not transformative.

For the training step of developing an AI model, the court found the use is likely to be considered transformative. Alsup compared model training to teaching a child to read: readers are not expected to pay each time they draw on a book they have read, and similarly, Anthropic should not be charged for each LLM output that reflects trained knowledge. The model was designed to “turn a hard corner and create something different,” and this commercial but transformative purpose weighed in favor of fair use.

Nature of the copyrighted work

This factor generally focuses on the type of copyrighted material at issue and the degree of protection it receives. Creative, highly expressive works get strong protection while mainly factual works get less protection.  

In Bartz, this factor weighs against fair use for all the uses. Alsup, in line with precedent, stated the main function of this factor is to assess the relationship between the nature of the works at issue and the secondary use. Because the works were chosen for their expressive value and artistic writing style, precisely to make LLMs better at producing the same, this factor weighs against fair use across the board.

Amount and substantiality of the portion used

This factor considers whether the amount copied was reasonable in relation to the purpose, as articulated by the U.S. Supreme Court in Campbell v. Acuff-Rose Music. Courts have traditionally viewed copying entire works as weighing against fair use. However, in Authors Guild v. Google, the U.S. Court of Appeals for the Second Circuit clarified that what ultimately matters is not the amount copied per se, but what is made available to the public in the secondary use.

Alsup concluded that the full retention of the legally acquired works favored fair use, as the purpose was to create a training library. Conversely, he concluded that the retention of pirated works weighed against fair use because copies obtained from pirated sources were infringing from the start, and no amount of retention could be reasonable.

As for the training of specific LLMs, this factor also favored fair use. Anthropic argued that large-scale copying was necessary to train an effective LLM, and the court accepted that the scale of data used was appropriate given the technical demands of model development. Importantly, the court noted there were no allegations that the model produced infringing outputs. That necessity supported fair use.

The effect of the use upon the market

This factor examines whether the secondary use undermines the market for the original work or its licensing potential. Even a transformative use can weigh against fair use if it causes substantial market harm, as the U.S. Supreme Court emphasized in Andy Warhol Foundation for the Visual Arts v. Goldsmith.

Storing legally acquired works and converting them from print to digital was neutral under the fourth factor. The plaintiffs alleged market displacement because Anthropic could now easily distribute the digital copies, but the court noted there was no evidence Anthropic had distributed them or intended to. Without more, the alleged market harm was speculative.

In contrast, retaining pirated works weighed against fair use as copying unauthorized material undermines the market and deprives authors of compensation. The court assessed the harm in the aggregate, warning that widespread reliance on pirated sources would erode market protections.

The effect of the use on the market weighs for fair use for training LLMs. While Claude might generate summaries or derivative insights, these outputs did not constitute realistic market substitutes that harmed the authors' economic interests in their original works. So long as this remains the market reality, this factor favors fair use for LLM training.  

What is next for Anthropic and other developers?

Given the possibility of Anthropic’s fair use defense failing with regard to its collection of pirated materials, the court denied summary judgment on that issue and later denied Anthropic’s bid to appeal the decision before trial. Coupled with the judge’s decision a few days later to certify a class, the case has major ramifications for the future of LLM training.

Thousands of authors can now join the litigation against Anthropic, with a trial date already set for 1 December. Meanwhile, other similar cases are proliferating, pitting creators including authors, designers and musicians against foundation model developers such as OpenAI, Anthropic and Meta.

The wider lessons for AI governance professionals from this case are mixed.

Here the court has reaffirmed that training LLMs on copyrighted works can qualify as fair use because the process is “spectacularly transformative” — analyzing statistical patterns to generate new text rather than reproducing originals. Format-shifting lawfully acquired print books into digital form for internal use, including for training purposes, is also considered permissible.  

Yet, the storage of pirated works in a central library did not satisfy fair use, creating some nuance about the reality of training LLMs. While transformative use strongly favors fair use, the source of the copyrighted works is crucial. If there is a legal way to acquire copyrighted works, then the developer must pursue that route instead of sourcing from pirated libraries.  

Operationally, AI governance professionals must balance these factors carefully: copying entire works may be permissible if justified by model performance needs, but the burden remains on developers to justify the scale of copying. Courts favor uses that do not substitute for or replicate original works; developers should continue implementing safeguards to prevent infringing outputs.

The decision signals that transformative use and lawful acquisition are key pillars of defensible AI training practices. Retention policies and data provenance remain high-risk areas for litigation and regulatory scrutiny.

What about the executive branch?

On 9 May, the U.S. Copyright Office released a prepublication version of its report on generative AI training, covering issues including fair use. Though only advisory, the office's opinion can be an important factor in analyzing practices under copyright law. The report's reasoning largely agreed with the Bartz opinion, particularly on the first three factors. However, the report focuses mostly on AI training and does not concentrate on the collection and storage of large numbers of works in a central library, which proved determinative in the Bartz analysis.

Notably, Alsup and the Copyright Office disagreed on the fourth factor, particularly over which markets the Copyright Act protects. While both agreed there is potential for an AI training licensing market, they disagreed over whether it should be considered in the fair use analysis. The court reasoned that the right to license is not protected under the Copyright Act and that creators are not free to control downstream markets. The Copyright Office took a wider stance, treating the inability to sell licenses for one’s work as a negative market impact weighing against fair use.

Although this is the current official guidance from the U.S. government, the Trump administration has cast doubt on whether it will remain the official position. Immediately after the publication of the draft report, Trump fired the director of the Copyright Office. And though the administration’s AI Action Plan does not weigh in on the copyright issue, in his accompanying speech, Trump said, “You can’t be expected to have a successful AI program when every single article, book or anything else that you’ve read or studied, you’re supposed to pay for.”

There is a world of distance between this opinion and Hawley’s private right of action for unauthorized use of copyrighted materials for training LLMs. The intensity of the debate may help to explain why the AI Action Plan remained silent on copyright issues. Unless Congress acts, it remains up to the courts to determine how copyright applies to this revolutionary technology.

The intersection of AI development and copyright law continues to prove fraught. As AI models become more sophisticated, so do the legal and ethical questions surrounding the use of copyrighted materials for training. Courts are grappling with the boundaries of the fair use doctrine; legislators are proposing stricter consent requirements; and the executive branch is signaling deregulatory priorities. Even as the tension between AI innovation and creators’ intellectual property rights remains unresolved, innovation is undoubtedly the top priority in Washington today.

Dayton Fiddler is a student at the University of Colorado Law School and served as a summer legal policy intern at the IAPP. 

Cobun Zweifel-Keegan, CIPP/US, CIPM, is the managing director, Washington, D.C., for the IAPP.