Artificial intelligence has uprooted traditional copyright protections, testing the limits of the current legal framework. This upheaval stems from the vast amount of information used to train AI models, especially generative AI models, which often includes data protected by copyright law.
In the EU, these issues have been discussed in the context of the Copyright and Information Society Directive and the Copyright in the Digital Single Market (CDSM) Directive. EU institutions have taken steps to explain or anticipate how EU legislation should apply. A study by the EU Intellectual Property Office focused on the development of generative AI and its implications for copyright law, and in November 2025 the office expanded this work by launching the Copyright Knowledge Centre.
Similarly, the European Parliament released a report offering recommendations, such as giving developers a proactive responsibility to document the origins of their datasets. The report also recommends establishing a balanced regulatory ecosystem that allows for responsible innovation, as well as remuneration schemes that protect rightsholders against literal copying and recognize the influence protected works may have on outputs.
In parallel, litigation over generative AI has appeared in EU member state courts and on the docket of the Court of Justice of the European Union. Decisions from the lower courts have outlined how to develop an AI model in a manner that complies with copyright law. At the same time, the EU AI Act incorporates transparency obligations for model developers, including requirements to disclose information on the origin of model training data. The intent of such requirements is to empower rightsholders to exercise their rights.
Broadly speaking, the intersection of copyright and AI in the EU raises several difficult questions. These range from alleged copyright infringements across the AI life cycle, to the potential role of licensing in safeguarding certain individual rights, and to compliance with new transparency obligations that may spur a new wave of litigation or further licensing of protected works.
The CDSM Directive's TDM exception and the AI Act's transparency obligations
The debate on these issues has centered on text and data mining, or TDM, exceptions. This article explores emerging EU legal guidance on AI model development, data sourcing and copyright issues. Alongside the TDM exceptions, transparency about the origin of training data has been a priority for EU legislators, resulting in specific reporting requirements in the AI Act.
Introduced in the CDSM Directive, enacted in 2019, the TDM exception expressly permits the reproduction of protected works without authorization, contrary to the general rule under which such copying would be infringement, when carried out by research organizations and cultural heritage institutions for scientific research purposes. More broadly, the directive aims to balance the interests of authors and other rightsholders against technological innovations that rely heavily on text and data mining.
The directive defines text and data mining as "any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations." It also clearly distinguishes between copying data for scientific research conducted by a research organization or cultural heritage institution and copying data for all other purposes.
The TDM exception for research purposes applies when a reproduction is made by a research or cultural heritage institution for scientific purposes using materials to which they have lawful access. Any reproductions created under this exception must also be subject to appropriate security methods.
Under the CDSM Directive, research organizations are defined as entities whose purpose is to conduct scientific research on a not-for-profit basis, to reinvest all profits in scientific research or to pursue a public interest goal recognized by a member state. Article 4 of the directive, meanwhile, extends a TDM exception without a research-purpose requirement to anyone, provided the copied works are lawfully accessed and the rightsholder has not opted out of the mining.
Litigation on this matter has been limited but significant. In Germany, the Hamburg Regional Court in LAION v. Kneschke held that the TDM exception for scientific research justified the collection of publicly available images and their textual descriptions, while stressing that opt-outs under the non-scientific TDM exception must be respected as well. Similarly, the Munich Regional Court in GEMA v. OpenAI found that, in principle, large language models are covered by the TDM exception because the reproductions made during training do not, by themselves, harm an author's rights.
Additionally, Article 53(1)(d) of the EU AI Act requires that certain elements of a model's development be disclosed and documented in a public, clear and detailed document. One of the central categories developers must document and report is the source of the training data used to develop the model. Recital 107 underscores the importance of transparency on the origin of the data used throughout the AI life cycle, noting that making this information publicly available enables rightsholders to protect their copyrights and exercise their legitimate interests whenever possible.
Furthermore, once the EU AI Act took effect, the European Commission issued an Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models, which outlines the content used to train a model. Under the template, providers should disclose any publicly available databases, private databases, user data and synthetic data, along with a "comprehensive narrative description" of the data collected through online scraping.
This last item must also include information about the AI crawler used, as well as a list of the most relevant domains scraped based on the size of the content. The existence of the transparency requirement — and the accompanying detailed disclosure list — is likely to open the door to further litigation as rightsholders will have more tools to justify their claims instead of relying on experiments to identify the presence, use and influence of protected works in the AI life cycle.
AI life cycle stages under the TDM exception and the AI Act’s transparency rules
In the LAION case, the court provided a clear roadmap for distinguishing the stages of the AI life cycle. The first stage is assembling a dataset that may be used to train the model; the second is the actual training of the neural network on that dataset; and the third is the use of the trained model to create new content.
Each stage presents its own issues and interacts in its own way with the TDM exception and the transparency requirement. These steps in the life cycle can also be framed as the design, development and deployment stages.
Design stage
In the design stage, model developers gather the data that will be used to train the model. Different types of data are collected, including public domain information not protected by copyright law, datasets licensed from third parties and data scraped from the web, usually without consent, including personal information and protected works.
This last method raises copyright concerns because scraped data often includes works covered by copyright protection. Because of its higher quality, such data is generally effective in model training, but copying it may infringe the rightsholder's exclusive right to reproduce protected works.
The design stage is the most litigated stage of the AI life cycle and has become a central topic in debates over copyright limitations, as the data collected usually copies content from an original source.
However, the source of the collected data is not always clear, even when publicly available content is used. As such, many works protected under copyright law are copied into datasets without the rightsholder’s knowledge. In those scenarios, the AI Act's transparency requirement to make a list of the content used for training publicly available will give rightsholders the notice they need to exercise their rights.
However, under the TDM exception for scientific purposes, the CDSM Directive may still permit such copying, since consent is not required at the time of data collection, especially where no reservation against copying or compiling the data was clearly expressed.
The plaintiff in the LAION case, a photographer, sued the nonprofit organization alleging it infringed his copyright by reproducing his photo without authorization to create a dataset for the purpose of training AI models. In its defense, LAION explained the datasets were aggregated through the reproduction of publicly available images and descriptions, and the dataset it built was publicly available and free.
The court concluded the TDM exception for scientific research applied because LAION is a nonprofit organization and its practices of mining and analyzing the data can be classified as a public benefit since the resulting dataset is available free of charge.
The plaintiff argued web scraping is inherently a for-profit, commercially driven activity because it focuses on extracting the intellectual content of the works and carries the potential for their reproduction. The court disagreed, explaining this view ignores the distinct and independent stages of the AI training cycle. The court further stated that where the author, rightsholder or an authorized individual has unambiguously voiced a reservation against the use of text and data mining technologies in a machine-readable manner, as provided under the TDM exception for other purposes, that objection must be respected.
The LAION decision establishes that web scraping is protected under the TDM exception if the scraper is a nonprofit organization or if the rightsholder or an authorized representative has not expressly objected to the scraping. It also opens the door to further litigation against for-profit businesses that engage in web scraping to build datasets for AI models.
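The CDSM Directive does not prescribe what a machine-readable reservation must look like, and the appropriate opt-out format remains contested. One emerging convention is the W3C community-drafted TDM Reservation Protocol, or TDMRep, sketched below; the domain and policy URL are illustrative assumptions, not language drawn from the directive or any decision.

```text
# Illustrative sketch of a machine-readable TDM opt-out following the
# W3C TDM Reservation Protocol (TDMRep). Domain and URLs are examples.

# Option 1: HTTP response header sent with each protected resource
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json

# Option 2: site-wide declaration served at /.well-known/tdmrep.json
[
  {
    "location": "/*",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```

Whether signals of this kind satisfy the directive's machine-readability requirement is exactly the sort of question the pending cases may help resolve.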
Development stage
Compared with the design stage, it is particularly difficult to determine when an alleged copyright infringement occurs in the development stage because of the lack of transparency from AI model developers.
The adoption of transparency requirements prevents developers from hiding behind the "black box" of model training, as there will be documentation explaining the origin of the data. This documentation may open the door for authors and other rightsholders to regain control of their rights, especially in cases of literal copying and storage within the model. It may also encourage further legal recourse to preserve copyrights, such as seeking removal of works from datasets, seeking monetary compensation or relying on the TDM exception reservations of the CDSM Directive's Article 4.
When the model is developed, the datasets are analyzed so that patterns, parameters and relationships in the data can be identified and used to produce outputs in response to user prompts. A properly trained model generalizes from its training data rather than reproducing the original works literally. When the model instead stores and copies the content, memorization occurs, potentially resulting in copyright infringement.
In the GEMA case, the court held that the model's memorization of protected works was copyright infringement not covered by either version of the TDM exception. GEMA, a German collecting society and performance rights organization, argued OpenAI's ChatGPT had memorized the lyrics of songs written by at least nine authors. OpenAI countered that its model did not store the data used for its training, so no memorization could occur.
The court disagreed, explaining that while there is some consensus that large language models are covered by the TDM exception, the exception does not apply when memorization occurs. Memorization during training interferes with rightsholders' rights because the LLM retains copies of the training data, with the result that protected works can be literally or substantially reproduced in outputs when prompted by users. The court stated that developers seeking to rely on the TDM exception must ensure the covered training data is not permanently stored in the model, as such storage harms the right of reproduction.
The GEMA decision supports two conclusions. First, the TDM exceptions apply in the model-development stage only if no memorization of protected works occurs, since memorization can result in an unauthorized reproduction within the model and deprive rightsholders of their rights. Second, preventing memorization is a burden on developers, who must make significant efforts to implement technical safeguards against it.
Deployment stage
Once the model is deployed, two more copyright-related issues arise. The first concerns the copyrightability of output generated from users' prompts, an issue European courts have met with caution. For example, the Prague Municipal Court in Case No. 10 C 13/2023 refused to recognize copyright in an image generated through an AI model because the claimant could not prove he created the image, meaning he was not its author and the image therefore could not be protected.
Memorization that surfaces in outputs, on the other hand, is a more nuanced issue. In those cases, copyright infringement may occur because the generated output is a literal copy of material used for training. The GEMA decision addresses this point: the court reiterated that where literal reproduction occurs in the model, specific economic rights of the rightsholder are infringed. That infringement occurs even when the output results from user prompts, as the rightsholder still loses the protection copyright law grants them.
While the court in the GEMA case did not focus on the TDM exception at this stage, it remains unclear how the exception would apply, given that the data had already been scraped and analyzed by the model by that point in the life cycle.
Further questions and looking forward
The LAION and GEMA decisions serve as relevant starting points for understanding the intersection of copyright law with the AI life cycle in the EU. However, both come from lower member state courts and remain subject to review by appellate courts and potentially the CJEU. Meanwhile, other cases are being litigated in the lower courts of member states, and one is currently awaiting a decision from the CJEU.
The CJEU will decide Like Company v. Google Ireland, which could become the most relevant case on the application of the TDM exception to the AI life cycle. The case, originally filed in the Budapest District Court and later referred to the CJEU, was brought by Like Company, a Hungarian press publisher operating several online news portals, which claims Google infringed its copyright when its Gemini AI assistant reproduced and summarized the contents of its websites.
The plaintiff claimed that some of the summaries were either identical or almost identical to the original news content on its website. The Budapest court raised several questions to the CJEU, including whether Gemini’s outputs infringe the copyright holder’s rights of communication to the public and reproduction, and whether the TDM exception for non-scientific purposes applies to reproductions made during the model-training stage.
In Germany, GEMA v. Suno AI is before the Munich Regional Court. GEMA is suing the AI audio company over its processing of recordings by artists GEMA represents. The plaintiff argues Suno's audio outputs reproduce harmonies, melodies and rhythms similar to songs in GEMA's repertoire and alleges the company trained its AI model for commercial purposes using copyrighted works without authorization.
In France, SNE v. Meta is before the Paris Judicial Court. The National Union of Publishers, or SNE, and other authors' associations are suing Meta over the inclusion of protected works in datasets used to train its model for profit, which they consider copyright infringement and economic free riding.
These three cases are particularly relevant because the defendants are for-profit entities, so the decisions may clarify the applicability of the TDM exception outside the research context and help resolve the debate over the appropriate opt-out language for rightsholders or their authorized representatives.
Furthermore, the GEMA v. OpenAI decision imposes new responsibilities on developers to ensure their AI systems do not memorize protected works. These obligations reinforce the need for developer accountability and may prompt further discussion about the scope and nature of the measures needed to protect rightsholders when relying on the TDM exceptions, particularly if the decision is upheld and future cases follow a similar approach.
Conclusion
Issues related to the EU TDM exceptions arise at different stages of the AI life cycle, although their applicability and coverage are more likely to be assessed in the later stages. While some questions remain unanswered, these early judicial decisions have the potential to serve as general principles for how the exceptions should be applied at each stage of AI development in the EU. Moreover, the implementation of new transparency requirements will clarify how AI models gather and use copyrighted materials, which could empower rightsholders to protect their legal interests.
David Botero is a Westin Fellow at the IAPP.

