Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.
Artificial intelligence is increasingly the subject of costly litigation around the world, particularly in the realm of copyright and intellectual property. The debate over copyright and AI, and generative AI in particular, is usually framed as a battle pitting model developers, who seek to innovate, against creatives and other rightsholders, who fear having their work exploited at their expense. Jurisdictions are reacting to the rapid rise of AI in this field by issuing guidance, publishing judicial opinions and conducting public consultations to explore potential solutions, frequently highlighting the importance of increased transparency and traceability. But, in the realm of intellectual property, how can transparency be achieved without skewing the existing balance of competing interests between rightsholders and AI developers?
The U.S. National Institute of Standards and Technology defines transparency as "a property of openness and accountability throughout the supply chain." In the context of AI, NIST treats transparency as one of several properties that characterize a trustworthy, responsible system. These principles, which also include explainability, interpretability and accountability, work together to build trust and ensure fairness and lawfulness. In the context of copyright, transparency can promote a "fair and equitable sharing of benefits" for rightsholders, who want to be credited and compensated for their work, and developers, who rely on quality data for training purposes. But, as NIST's AI Risk Management Framework notes, the level of transparency and information availability will vary depending on the stage of the AI lifecycle in question.
At a high level, generative AI functions by learning from massive amounts of training data, assigning different weights to variables that shape its outputs, and then producing new content based on statistical probabilities. This lifecycle can be broken into three main stages: input, model development and output.
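To make the "statistical probabilities" point concrete, the following minimal sketch, written in Python purely for illustration, shows the basic move a generative model repeats over and over: sample the next token from a learned probability distribution. The vocabulary and probabilities are invented for this example and do not correspond to any real model.

    import random

    # Toy illustration only: a generative model ultimately samples the next
    # token from a probability distribution learned during training.
    # These probabilities are invented for the example.
    next_token_probs = {
        "court": 0.40,
        "model": 0.35,
        "work": 0.20,
        "banana": 0.05,
    }

    def sample_next_token(probs):
        """Pick one token according to its learned weight."""
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        return random.choices(tokens, weights=weights, k=1)[0]

    prompt = "The judge ruled that the"
    print(prompt, sample_next_token(next_token_probs))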
Breaking it down: What does transparency really mean?
At the input stage, transparency mostly concerns the data developers use to train their models and whether they lawfully obtained it. Higher-quality training data yields more accurate outputs, so developers often prefer higher-quality content, even if it may be subject to copyright protection, over publicly available social media posts, for example. Copyright infringement issues therefore arise primarily at the input phase, through data scraping practices that collect books, audio files, movies and other rich forms of text and media to train AI models. A lack of transparency into the training process makes it harder for rightsholders to obtain sufficient evidence to pursue legal action or proper remuneration.
When rightsholders pursue copyright litigation, they bear the burden of proving their claim with evidence of infringement. However, developers have disproportionately more information about the training data they use, requiring costly litigation and discovery proceedings to even begin to understand whether an infringement occurred. This predictably disincentivizes litigation and works in favor of tech companies and developers, some of whom may be violating copyright protections.
In the U.S., cases in the Northern District of California have addressed whether training models on protected works is lawful under the fair use exception to U.S. copyright law. Fair use is a broad exception to copyright infringement that involves a judicial analysis of four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and the effect of the use upon the potential market for or value of the work.
While the court in Bartz v. Anthropic recognized that using protected works, namely books, to train a model could be considered a transformative fair use, a different judge on the same court indicated that this type of use is likely unlawful in most cases. In that latter case, Kadrey v. Meta, the court nonetheless ruled in favor of the developer on a motion for summary judgment because of evidentiary shortcomings in the plaintiffs' arguments.
In Bartz, the court conducted the four-part fair use analysis on the books used to train Anthropic's Claude. The author-plaintiffs in that case did not allege any copyright infringement via the AI model's output — only in the model's input and training — which ultimately weighed in favor of Anthropic's fair use of the books as training data.
The court notably distinguished between works that were purchased or otherwise lawfully obtained and those that were pirated. While using lawfully obtained works can constitute fair use, using illicitly acquired works is generally an infringement of copyright law. In analyzing the third and fourth factors, the court again considered the lack of alleged infringement via the AI model's output, recognized that the use of billions of words to train a large language model was reasonably necessary, and determined the use would not likely have a direct effect on the market for the original works.
Much of the court's analysis in Kadrey focused on the fourth factor: the potential effect of the use on the market value of the works. As in Bartz, the plaintiffs alleged copyright infringement based on Meta's use of copyright-protected books to train its LLM, Llama. The court granted summary judgment in favor of Meta based on the types of market harm the plaintiffs alleged. Notably, however, the opinion of Judge Vince Chhabria of the U.S. District Court for the Northern District of California indicated that, had the plaintiffs raised the claim and presented evidence that the use of their works for training would lead to a market flooded with similar, AI-generated works, Meta's use of copyrighted materials to train its AI model might not have been deemed lawful.
In the federal district court for the Southern District of New York, ongoing litigation between The New York Times and OpenAI has also challenged the use of protected works for AI model training as a violation of copyright. At the time of this publication, the court has not yet issued a final ruling on whether OpenAI's use qualifies as fair use.
This trend is also emerging elsewhere around the world. A case in the U.K., Getty Images v. Stability AI, originally included a claim that Stability AI's use of protected images to train an image generation tool infringed copyright. That claim was ultimately dropped, largely because of evidentiary issues, leaving only a secondary infringement claim. Such issues could be alleviated if developers adopted increased transparency during the training process, producing more available evidence and making it easier to enforce copyright law across the board.
The European Parliament released a report highlighting that tracing whether protected works have been used as an input is not merely a binary, yes-or-no question. Instead, it is a statistical problem of determining the degree of influence a protected work had in producing a generative model. Here, increased transparency and traceability can help rightsholders identify when infringement has occurred and, subsequently, the extent to which they are entitled to remuneration. However, this raises separate questions as to how remuneration would be determined.
Model development
Generative AI models are developed in a complex, iterative manner that is often described as a black box. At this stage, infringement is discussed less frequently, largely because of the lack of transparency into the development process behind proprietary walls. However, transparency in this phase can alleviate infringement issues during other stages. In model development, transparency often looks more like explainability and interpretability, as recognized by NIST: understanding how and why a model made a given decision.
At the development stage, disclosure can aid in understanding a model and be valuable for evaluating accuracy, as well as addressing ethical concerns like fairness and bias. The EU AI Act requires documentation and the disclosure of various elements of a model's development process; these include the model's architecture and number of parameters, the modality and format of inputs and outputs, training methodologies, key design choices and information about the computational resources and energy used for training. Additional information is required for high-risk systems and those that are being provided to third parties for downstream development.
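As a rough illustration of what such documentation might look like in machine-readable form, the sketch below collects the kinds of fields described above into a single record. The field names and figures are invented for this example and do not reproduce the AI Act's official documentation templates.

    from dataclasses import dataclass, asdict
    import json

    # Illustrative only: field names and values are invented and do not
    # reproduce the EU AI Act's official documentation templates.
    @dataclass
    class ModelDocumentation:
        model_name: str
        architecture: str
        parameter_count: int
        input_modalities: list
        output_modalities: list
        training_methodology: str
        key_design_choices: list
        compute_gpu_hours: float
        estimated_energy_kwh: float

    doc = ModelDocumentation(
        model_name="example-llm-7b",  # hypothetical model
        architecture="decoder-only transformer",
        parameter_count=7_000_000_000,
        input_modalities=["text"],
        output_modalities=["text"],
        training_methodology="self-supervised pretraining plus instruction tuning",
        key_design_choices=["byte-pair tokenizer", "8k context window"],
        compute_gpu_hours=150_000.0,      # placeholder figure
        estimated_energy_kwh=120_000.0,   # placeholder figure
    )

    print(json.dumps(asdict(doc), indent=2))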
The EU released a non-binding Code of Practice to supplement the transparency and copyright provisions of the AI Act and provide sample documentation for developers of general-purpose AI. Since its publication, several prominent tech companies, including Google, Microsoft, OpenAI and IBM, have committed to following these best practices.
Further, as described above, the European Parliament's report, "Technical Aspects of Generative AI in the Context of Copyright," addresses some ways that developers can increase the traceability of their models. Besides increased research on the reverse traceability of a model, the report recommends adopting open-source frameworks for auditing purposes, implementing tools that can measure the likelihood of a specific datapoint influencing an output, standardizing dataset documentation and conducting independent testing. Without transparency and insight into the model development itself, it's difficult to paint a picture of how rightsholders' works are being used and whether their rights have been violated.
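As a very rough illustration of what one traceability signal could look like, the sketch below measures how much of a candidate source text reappears verbatim in a generated output, using word n-gram overlap. This is a toy proxy chosen for this article, not the Parliament's recommended method and not how influence-measurement research actually quantifies a datapoint's contribution to a model.

    def ngram_set(text, n=5):
        """All overlapping word n-grams in a text, lowercased."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap_score(source, output, n=5):
        """Fraction of the source's n-grams that reappear verbatim in the output.

        A crude, illustrative traceability signal only.
        """
        src = ngram_set(source, n)
        out = ngram_set(output, n)
        return len(src & out) / len(src) if src else 0.0

    # Invented strings for demonstration purposes.
    protected_work = "the quick brown fox jumps over the lazy dog near the riverbank"
    generated_text = "a quick brown fox jumps over the lazy dog in the story"
    print(f"verbatim 5-gram overlap: {overlap_score(protected_work, generated_text):.2f}")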
In the case of Getty v. Stability AI, the remaining claim reviewed by the High Court of Justice concerned whether Stability AI was liable for secondary infringement on the theory that an open-source model and its accompanying weights constituted an "infringing copy." Notably, the court accepted evidence that AI models do not actually store any protected works. The court accordingly concluded that model weights are not themselves infringing copies and dismissed the remaining claims in favor of Stability AI.
Generated output
Once an output is generated, two main copyright issues arise: whether the newly generated work is copyrightable and whether its generation itself violates copyright law. Additional issues may arise concerning deepfakes and images that appropriate an individual's likeness. Denmark, for example, has proposed legislation that would give individuals copyright over their own image and likeness to combat deepfakes. In the U.S., the TAKE IT DOWN Act applies in the narrower context of non-consensual pornographic content generated by AI.
In many jurisdictions, only work authored by a human is eligible for copyright protection. This was reaffirmed in the U.S. by the U.S. Court of Appeals for the District of Columbia Circuit in Thaler v. Perlmutter, where the court held that a work generated solely by a machine or AI model was not eligible for copyright protection under the Copyright Act. The U.K., by contrast, is one of the few jurisdictions that provides copyright protection to computer-generated works under the Copyright, Designs and Patents Act, though some protections differ from those for human-authored works. It remains unclear in either jurisdiction where the boundary lies between human- and computer-generated content when both humans and AI contribute to the creation process and, consequently, how such works would be treated under the law.
Some of the same risks and issues concerning inputs and source training data arise at the output stage. When a work is derivative of its inputs, it is important to know how much influence a given protected work had on the resulting output. Not only does this affect remuneration, but it also helps rightsholders ascertain whether infringement has occurred when a generated work appears substantially similar to their own.
In this context, technical solutions to track attribution and influence can again be useful. Research teams at the Massachusetts Institute of Technology and other academic institutions around the world have been piloting approaches to address this issue, demonstrating that this kind of transparency is achievable. For example, one team built a tool known as the "Data Provenance Explorer" to help AI developers search datasets for licensing restrictions, enable dataset creators to track data and properly allocate credit, and empower researchers and policymakers seeking to better understand this field.
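The sketch below gives a sense of what searching a dataset catalog for licensing restrictions can look like in practice. The records, license list and filtering rule are invented for illustration and do not mirror the Data Provenance Explorer's actual schema or interface.

    # Hypothetical dataset catalog entries; the structure is invented for this
    # sketch and does not reflect any real tool's data model.
    catalog = [
        {"name": "news-corpus-a", "license": "CC-BY-4.0", "source": "publisher feed"},
        {"name": "fiction-set-b", "license": "all-rights-reserved", "source": "scraped"},
        {"name": "forum-dump-c", "license": "CC-BY-SA-4.0", "source": "public forum"},
    ]

    PERMISSIVE = {"CC-BY-4.0", "CC0-1.0", "MIT"}

    def usable_for_training(entry):
        """Flag entries whose declared license permits reuse (illustrative rule only)."""
        return entry["license"] in PERMISSIVE

    for entry in catalog:
        status = "ok to consider" if usable_for_training(entry) else "needs clearance"
        print(f"{entry['name']:15s} {entry['license']:22s} -> {status}")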
Without transparency into how outputs are produced, it is difficult for rightsholders to seek recourse or remuneration, or otherwise obtain evidence for litigation, as in Getty Images and The New York Times v. OpenAI. As indicated above, the effect of an AI-generated output on the market for a copyright-protected input can influence a court's fair use analysis.
Methods for increasing transparency
One way for developers and tech companies to be transparent about their practices is to disclose the data used to train their models. For example, under Article 53 of the EU AI Act, developers of general-purpose AI models must disclose information about the data used for training, testing and validation, including the type of data.
Separately, many companies enter into licensing agreements, consenting to share their work with developers for training purposes in exchange for compensation. Notably, The New York Times has agreed to share its publications with Amazon for training purposes; The Washington Post has a similar agreement with OpenAI. In the U.S., Sens. Josh Hawley, R-Mo., and Richard Blumenthal, D-Conn., proposed a bill that pushes for further transparency and would require these licensing agreements to include information about what shared data third parties may access. The bill, if passed, would also provide enhanced remedies and protections against the theft of protected works for training purposes.
Some jurisdictions, such as the EU, allow rightsholders to opt out of sharing their work with developers using a technical measure known as robots.txt. However, there are limits to this approach: rightsholders cannot opt out of their work being used for research, improving accessibility, or other appropriate uses under the text and data mining exception of the Directive on Copyright in the Digital Single Market. Some private companies are also providing technical solutions to make this capability more widely available or to extend it to jurisdictions without such a mechanism in place. The European Parliament has recognized the need for increased research into technical solutions that can address traceability gaps and retrospectively track the influence of inputs within a model, particularly in the name of fair remuneration.
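As a minimal sketch of how a robots.txt-based opt-out operates, the example below uses Python's standard-library robots.txt parser to check whether a crawler may fetch a page. "ExampleAIBot" is a hypothetical AI-training crawler name, and the example assumes the crawler voluntarily honors the file, which is precisely the limitation rightsholders face.

    from urllib.robotparser import RobotFileParser

    # Illustrative robots.txt directives; "ExampleAIBot" is a hypothetical
    # AI-training crawler, not a real product name.
    robots_txt = """
    User-agent: ExampleAIBot
    Disallow: /

    User-agent: *
    Allow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A crawler that honors the file checks permission before fetching a URL.
    print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/story.html"))   # False
    print(parser.can_fetch("SearchIndexer", "https://example.com/articles/story.html"))  # True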
At the output stage, labeling mechanisms can indicate whether a work was created by a human or AI model; the metadata of generated media can give insight into where, when and by whom a piece was created. Some coalitions of organizations across various industries have formed to share content credentials about the work they create and adopt transparent practices that align with principles like privacy, security and authenticity. Research has shown that public perception generally favors the disclosure of AI use in creating content like news, indicating that transparency is also a favorable business practice.
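The fragment below sketches, in simplified form, what such a label might record. It is a toy stand-in for industry content-credential standards, which embed signed, tamper-evident manifests in the media itself rather than writing a loose sidecar file; all names in the example are hypothetical.

    import json
    from datetime import datetime, timezone

    def provenance_record(creator, tool, ai_generated):
        """Build a simple, illustrative provenance label for a generated asset."""
        return {
            "creator": creator,
            "generation_tool": tool,
            "ai_generated": ai_generated,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }

    # Hypothetical names; a real deployment would sign the record and bind
    # it cryptographically to the file it describes.
    label = provenance_record("Example Newsroom", "hypothetical-image-model-v2", True)

    with open("generated_image.provenance.json", "w") as f:
        json.dump(label, f, indent=2)

    print(json.dumps(label, indent=2))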
Conclusion
Each stage of generative AI development poses different copyright and intellectual property challenges, challenges that increased transparency can help address. At each stage, transparent practices look different and involve a mix of legal, technical and business solutions. Ultimately, increased transparency can pave a path toward a better balance between innovation for developers and fairness for rightsholders and other creatives.
Stephanie Forbes is a former summer privacy fellow for the IAPP.


