Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.
A key challenge in training generative artificial intelligence models is ensuring compliance with copyright laws. Large language models, like those developed by OpenAI or Google DeepMind, require vast amounts of text, images and other data for training to generate high-quality responses. It's no secret that these datasets are often compiled through web scraping, using publicly available content.
The EU AI Act reinforces the need for copyright compliance, particularly concerning LLMs. Recital 105 highlights that developing and training general-purpose AI models requires access to extensive amounts of text, images, videos and other data. The act acknowledges that "text and data mining techniques may be used extensively in this context for the retrieval and analysis of such content, which may be protected by copyright and related rights. Any use of copyright protected content requires the authorisation of the rightsholder concerned unless relevant copyright exceptions and limitations apply."
The act defines general-purpose AI models as those trained on large datasets that exhibit significant generality, performing a wide range of distinct tasks. Examples could include ChatGPT or Google's PaLM — handling code generation, translation and joke explanation — or Anthropic's Claude — capable of content creation, vision analysis and answering complex questions.
Although the AI Act refers to general-purpose AI providers only, other AI developers are not off the hook. Provisions of the directive on "copyright and related rights in the Digital Single Market" still apply to anyone attempting to use a copyrighted work. It's worth noting that the directive was actually the first legislative attempt to address copyright issues arising from AI training through web scraping.
DSM Directive provisions on AI training and copyright
The DSM directive introduced a text and data mining exception to copyright protection. While text and data mining covers a wide range of computational analysis, including search engine indexing, it also extends to data scraping for AI training. The directive, however, was enacted in 2019 — before the rise of generative AI tools — so legislators may not have fully anticipated the impact of LLMs on online copyrighted works.
Generally, web scraping of copyrighted content for AI training is permitted under the DSM directive, provided rightsholders have not explicitly opted out. Rightsholders can reserve their rights using machine-readable means, namely, technical protocols that web crawlers — bots used to scrape data — can recognize and respect. Recital 18 mentions that machine-readable reservations can include metadata or website terms and conditions — though, in practice, most crawlers do not process website terms and conditions. If a rightsholder has expressly reserved their rights, general-purpose AI providers must obtain authorization before using the content for training.
AI Act requirements for copyright compliance in general-purpose AI models
Article 53 of the AI Act imposes two key obligations on general-purpose AI providers.
The first is to implement a policy complying with EU copyright law, in particular identifying and complying with the reservation of rights expressed under the DSM directive.
The second requirement is to draw up and make publicly available a sufficiently detailed summary about the content used for training. This transparency measure, hopefully, will allow creators to verify whether their works have been used in training and whether opt-out requests have been honored.
The third draft of the General-Purpose AI Code of Practice: Copyright section
The AI Act does not specify what a copyright compliance policy should entail but encourages general-purpose AI providers to develop industry best practices, referred to as codes of practice. On 11 March 2025, a group of independent experts, facilitated by the AI Office, submitted the third draft of the General-Purpose AI Code of Practice, drawing on a process involving nearly 1,000 stakeholders, EU member state representatives and international observers.
The code's copyright section outlines five measures to ensure compliance with copyright protection under the AI Act. Of particular interest is signatories' commitment to "identify and comply with rights reservations when crawling the World Wide Web."
Ensuring crawlers respect machine-readable opt-outs
The draft Code of Practice states signatories should employ crawlers that read and follow instructions expressed in accordance with the Robots Exclusion Protocol.
Robots.txt is a file websites use to control how web crawlers, including search engine bots, access and index content on the site. It provides instructions on which parts of a website should not be crawled and is currently the most common technical protocol for reserving creators' rights. However, it is important to remember that robots.txt only provides guidance to compliant bots. It does not block access to copyrighted works; it merely informs the crawler whether rights have been reserved.
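For illustration, the short sketch below shows how a compliant crawler might consult robots.txt before fetching a page, using Python's standard library. The user-agent name and URLs are hypothetical; a real AI-training bot would publish its own identifier.

```python
# Minimal sketch of a compliant crawler consulting robots.txt before fetching
# a page, using Python's standard library. The user-agent name "ExampleAIBot"
# and the example.com URLs are purely illustrative assumptions.
from urllib import robotparser

ROBOTS_URL = "https://example.com/robots.txt"
USER_AGENT = "ExampleAIBot"  # hypothetical AI-training crawler identifier

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # download and parse the site's robots.txt

page = "https://example.com/articles/some-work.html"
if parser.can_fetch(USER_AGENT, page):
    print("robots.txt permits this user agent to crawl the page")
else:
    print("robots.txt disallows crawling: treat as a rights reservation and skip")
```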
In these circumstances, the commitment of the code's signatories to employ crawlers that follow this guidance is an important step. Unfortunately, copyrighted content can still be scraped by bots that simply ignore the reservation.
Nevertheless, it should be noted that although robots.txt is the most widely respected protocol, several others are in use, and the lack of a unified standard for rights reservation does not make life easier for general-purpose AI providers.
In his paper "Considerations for opt-out compliance policies by AI model developers," Paul Keller provided insights on the existing types of protocols, which could be divided into two main categories — location-based and unit-based protocols.
Location-based protocols, like robots.txt, ai.txt, the TDM Reservation Protocol, meta tags or HTTP headers, are applied by domain owners to all content on a website and, unfortunately, may block search engine indexing as well.
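To make the location-based approach more concrete, the sketch below checks for a reservation expressed as an HTTP response header, along the lines of the TDM Reservation Protocol. The header names and URL are illustrative assumptions drawn from that protocol's community specification, not a definitive implementation.

```python
# Rough sketch of checking a location-based reservation expressed as an HTTP
# response header, in the spirit of the TDM Reservation Protocol. The header
# names "tdm-reservation" and "tdm-policy" and the URL are illustrative
# assumptions, not a definitive implementation of the specification.
from urllib import request

page = "https://example.com/articles/some-work.html"
with request.urlopen(page) as response:
    reservation = response.headers.get("tdm-reservation")  # "1" would mean rights reserved
    policy = response.headers.get("tdm-policy")  # optional link to licensing terms

if reservation == "1":
    print("TDM rights reserved; licensing policy (if any):", policy)
else:
    print("no TDM reservation expressed through this header")
```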
Unit-based protocols allow a specific work to be tagged with metadata informing the crawler of the creator's wish to opt out of AI training. For instance, an image can be tagged with metadata that includes details about the content's origin and any usage restrictions, such as "not for AI training." In contrast to location-based signals, these metadata tags travel with the individual work, which gives the creator more granular control.
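A unit-based check might look like the simplified sketch below, which scans a file's embedded XMP metadata for a reservation marker. The marker string is a hypothetical placeholder: real vocabularies, such as the IPTC photo metadata properties, define their own identifiers, and a production crawler would rely on a proper metadata parser.

```python
# Simplified sketch of a unit-based check: scan a file's embedded XMP packet
# for a rights-reservation marker. The marker string "ai-training=prohibited"
# is a hypothetical placeholder; real vocabularies define their own properties,
# and a production crawler would use a proper metadata parser.
def work_reserves_ai_training(path: str) -> bool:
    """Return True if the file's embedded XMP packet contains the illustrative marker."""
    with open(path, "rb") as f:
        data = f.read()
    start = data.find(b"<x:xmpmeta")
    end = data.find(b"</x:xmpmeta>")
    if start == -1 or end == -1:
        return False  # no XMP packet found, so no unit-based reservation detected
    xmp_packet = data[start:end].decode("utf-8", errors="ignore")
    return "ai-training=prohibited" in xmp_packet


if __name__ == "__main__":
    print(work_reserves_ai_training("photo.jpg"))
```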
To tackle the above challenge, the code contains an additional commitment to make best efforts to identify protocols that have either resulted from a cross-industry standard-setting process aiming to achieve a unified protocol for rights reservation, or are "state-of-the-art and widely adopted by rightsholders." This means that less common or newly introduced opt-out mechanisms may not necessarily be followed unless they become industry standards. Although the DSM directive does not limit the machine-readable means that can be used to express the opt-out, the proposal to follow "state-of-the-art" protocols may contribute to a faster standardization process.
The code encourages signatories to support standardization efforts and engage in discussions to develop appropriate machine-readable standards for expressing rights reservations. This commitment would be a significant step forward, supporting the long-awaited effort to design and implement a unified protocol for reserving copyright under the DSM directive.
Risks of a unified protocol for reserving rights
Although a unified opt-out protocol may be a dream come true for large AI providers, it is not without potential risks. If general-purpose AI providers only follow widely adopted protocols, other, often very good, solutions may disappear from the market. This may also limit the choice available to authors who would prefer a different option to protect their works.
The AI Act's copyright requirements also have an extraterritorial effect. As such, the obligation to have a copyright compliance policy in place will apply to any general-purpose AI provider putting its product on the EU market, regardless of where the training took place. These providers may also be expected to follow the unified protocol, once agreed upon.
Magdalena Serafin, AIGP, CIPP/US, is a privacy leader helping organizations implement and maintain global privacy frameworks.