
Who Owns AI Training Data? The Copyright Lawsuits That Will Define the Future of AI

February 24, 2026

Image: a judge’s desk with a brass gavel, a laptop displaying AI-generated art, and scattered photographs.

The Billion-Dollar Question Behind Every AI Model

Every large language model and image generator in commercial operation today was trained on data created by humans: articles, books, photographs, artwork, code, music, and video. The companies that built these models (OpenAI, Google, Meta, Anthropic, Stability AI, Midjourney) ingested this material at a scale unprecedented in the history of intellectual property. OpenAI’s GPT-4 training dataset is estimated to include over a trillion tokens drawn from books, websites, academic papers, and proprietary content. Stability AI’s Stable Diffusion model was trained on subsets of the LAION-5B dataset, which contains 5.85 billion image-text pairs scraped from the internet, many of them copyrighted photographs and artworks.

The fundamental legal question is simple to state and enormously consequential to answer: does training an AI model on copyrighted material without permission constitute copyright infringement, or is it a lawful use, whether through fair use (in the US), fair dealing (in the UK), or text and data mining exceptions (in the EU)? The answer will determine whether AI companies owe billions in licensing fees, whether their existing models face injunctive relief, and whether the entire business model of generative AI requires restructuring.

The financial stakes are not abstract. OpenAI generated approximately $6 billion in revenue in 2024 and its annualized revenue surpassed $20 billion in 2025, more than tripling year-over-year. Stability AI, despite persistent financial difficulties, raised over $200 million in venture funding across multiple rounds. Bloomberg Intelligence projects the broader generative AI market to reach $1.3 trillion by 2032, growing at a compound annual rate of roughly 42%. If courts rule that training constitutes infringement, retroactive licensing obligations could consume a significant portion of this revenue, or force fundamental changes in how models are built.


The Active Cases: A Legal Map

The most closely watched case is The New York Times Company v. Microsoft Corporation and OpenAI Inc., filed in the Southern District of New York in December 2023. The Times alleges that OpenAI and Microsoft systematically copied millions of its articles to train GPT models, that the models can reproduce Times content verbatim, and that ChatGPT and Bing Chat directly compete with the Times for readers, diverting advertising and subscription revenue. The complaint includes exhibits showing ChatGPT reproducing nearly word-for-word passages from Times articles, including investigative pieces that cost hundreds of thousands of dollars to produce.

OpenAI’s defense rests on fair use, the four-factor balancing test codified in 17 U.S.C. Section 107. The company argues that training is “transformative” because it creates a fundamentally different product (a general-purpose AI assistant, not a news archive), that only a small portion of any individual work is reproduced in model outputs, and that AI tools are complementary to rather than substitutive of traditional news consumption. OpenAI has also argued that the Times engaged in adversarial prompting to elicit verbatim reproduction, behavior it says is unrepresentative of normal use. In March 2025, Judge Sidney Stein narrowed the scope of the lawsuit but allowed the core copyright infringement claims to proceed. In January 2026, the court affirmed an order compelling OpenAI to produce a sample of 20 million ChatGPT conversation logs for discovery, a significant win for the Times that could reveal patterns about whether AI outputs substitute for original works. Summary judgment briefing is scheduled to conclude by April 2026; no trial date has been set.

Getty Images v. Stability AI, filed in both the UK High Court and the US District Court for the District of Delaware, presents the visual arts dimension. Getty alleges that Stability AI scraped over 12 million Getty-owned images, complete with visible Getty watermarks, to train Stable Diffusion. The evidence is vivid: Stable Diffusion outputs sometimes include garbled versions of the Getty watermark, strongly suggesting that the training data included watermarked images. The UK case reached a landmark outcome in November 2025, when the High Court largely rejected Getty’s copyright claims, holding that AI model weights are not a “copy” of the training images in the sense required by the Copyright, Designs and Patents Act. Getty had abandoned its primary copyright and database right infringement claims after accepting there was no evidence that training took place in the UK. The court did find limited trademark infringements where outputs reproduced garbled Getty watermarks. Getty has been granted permission to appeal the decision. The separate US case in Delaware, where Getty is seeking damages that have been increased to as much as $1.7 billion, remains ongoing.

Additional cases form a growing constellation. Authors including Sarah Silverman, Michael Chabon, and Paul Tremblay have sued OpenAI and Meta alleging that their books were used in training without authorization, likely sourced from pirate libraries like Library Genesis and Z-Library. Most non-copyright claims in those cases have been dismissed, but the core copyright infringement allegations survive. Music publishers have sued AI music generation companies. Visual artists filed a class action against Stability AI, Midjourney, and DeviantArt. In total, over 50 lawsuits related to AI training data were pending in US courts as of late 2025, with additional proceedings in the UK, EU, and Japan. The first substantive US fair use rulings arrived in June 2025: in Bartz v. Anthropic, Judge William Alsup held that training on lawfully acquired books was transformative fair use while retaining pirated copies was not, a ruling followed by Anthropic’s $1.5 billion settlement with authors; in Kadrey v. Meta, Judge Vince Chhabria granted Meta summary judgment on fair use while cautioning that stronger market-harm evidence could change the outcome in future cases. Appellate guidance is expected in mid-to-late 2026.



The Legal Arguments: Fair Use, Text Mining, and the Global Divide

The fair use analysis in US courts will turn on the four statutory factors, with the “transformative use” and “market effect” factors likely proving decisive. The Supreme Court’s May 2023 decision in Andy Warhol Foundation v. Goldsmith (598 U.S. 508) narrowed the transformative use doctrine, ruling 7-2 that simply adding new expression is insufficient if the new work serves substantially the same purpose as the original. AI companies must argue that an AI model is a fundamentally different type of product from the works it was trained on, not merely a new way to access the same information.

The market effect factor is equally contentious. Publishers argue that AI-generated content directly displaces demand for the original works: why subscribe to the New York Times if ChatGPT can provide the same information? AI companies counter that their models generate new types of value and that many users would not have accessed the original content at all. Empirical evidence is accumulating and points strongly toward substitution. Data published in the Reuters Institute’s Journalism and Technology Trends and Predictions 2026 report, based on Chartbeat analytics, found that Google search traffic to publishers declined globally by approximately one-third in the year to November 2025, with AI-driven search features playing a significant role. Some publishers have reported traffic declines of 20% to 30% or more as AI chatbots and answer engines provide direct responses to user queries.

Outside the US, the legal landscape diverges sharply. The EU’s Copyright Directive (2019/790) provides a text and data mining (TDM) exception under Article 4, which allows mining of lawfully accessible works for any purpose, unless the rights holder has “expressly reserved” their rights in a machine-readable way. This opt-out mechanism means that European publishers and artists who failed to implement robots.txt exclusions or metadata reservations may have inadvertently waived their objection. The EU AI Act, which entered into force on August 1, 2024, adds a transparency obligation: providers of general-purpose AI models must publish sufficiently detailed summaries of the content used for training. The obligations for GPAI providers became applicable on August 2, 2025, and the European Commission’s AI Office published its mandatory training data summary template on July 24, 2025, requiring structured disclosure of data types, sources, and collection methods.
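In practice, the machine-readable reservation the Directive contemplates is most often expressed through robots.txt exclusions aimed at named AI crawlers. The sketch below, using Python’s standard urllib.robotparser, shows how such an opt-out reads to the crawlers it targets; GPTBot, Google-Extended, and CCBot are real crawler user agents, but the robots.txt content and URL here are hypothetical illustrations, not any publisher’s actual policy:

```python
from urllib import robotparser

# Hypothetical robots.txt expressing an Article 4-style opt-out:
# named AI crawlers are excluded, ordinary crawling is not.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot"]

def crawler_access(robots_txt: str, url: str = "https://example.com/article") -> dict:
    """Report, per user agent, whether the policy permits fetching the URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    report = {agent: rp.can_fetch(agent, url) for agent in AI_CRAWLERS}
    # An ordinary search crawler falls through to the wildcard rule.
    report["Googlebot"] = rp.can_fetch("Googlebot", url)
    return report

print(crawler_access(ROBOTS_TXT))
```

A caveat worth noting: robots.txt is a voluntary convention, so whether honoring it satisfies, or ignoring it defeats, the “expressly reserved” standard is itself a contested legal question.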

Japan’s approach is the most permissive. Article 30-4 of Japan’s Copyright Act allows the use of copyrighted works for data analysis, including AI training, without the rights holder’s permission, as long as the purpose is not to “enjoy” the expression itself. This has made Japan an attractive jurisdiction for AI training operations. However, the Agency for Cultural Affairs published its “Checklist & Guidance on AI and Copyright” on July 31, 2025, clarifying that the exception does not apply when the use unfairly harms the interests of the copyright owner, including cases where training data is used to produce outputs that compete with or substitute for the original works.


Business Model Implications: From Scraping to Licensing

Regardless of how the courts ultimately rule, the AI industry is already shifting toward a licensing model. OpenAI has signed content licensing deals with the Associated Press (July 2023), Axel Springer (December 2023), Le Monde and Prisa Media (March 2024), the Financial Times (April 2024), and News Corp (May 2024, reportedly worth over $250 million over five years). Google has established licensing arrangements with Reddit ($60 million annually, announced February 2024) and multiple publishers. These deals suggest that the major AI companies are hedging against adverse court rulings by building licensed training data portfolios.

The licensing trend creates its own set of market dynamics. Large publishers with recognizable brands and extensive archives command premium prices. Small publishers, independent journalists, and individual artists lack the bargaining power to negotiate meaningful licensing terms. The result may be a two-tier system: major content owners extract licensing revenue from AI companies, while the vast majority of creators whose work was used in training receive nothing.

For Algeria’s context, the copyright implications intersect with the country’s intellectual property framework. Algeria is a signatory to the Berne Convention and has domestic copyright law (Ordinance No. 03-05 of July 19, 2003) that protects literary and artistic works. However, Algeria has no equivalent of the US fair use doctrine or the EU TDM exception. If an AI company trained its model on Algerian news articles, academic papers, or creative works, whether through Common Crawl datasets or direct scraping, Algerian rights holders would theoretically have infringement claims. The practical enforcement of such claims against companies with no Algerian presence or assets is another matter entirely.
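For rights holders trying to assess that exposure, Common Crawl publishes a queryable CDX index of the pages each crawl captured. A minimal sketch of building such a lookup follows; the index endpoint is Common Crawl’s public CDX API, but the crawl label and the Algerian domain are illustrative assumptions, and a real lookup needs network access and a currently listed crawl ID:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Common Crawl CDX index. The crawl label below is illustrative;
# current crawl IDs are listed at index.commoncrawl.org.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2025-47-index"

def build_cdx_query(domain: str, limit: int = 5) -> str:
    """Build a CDX query URL covering all captured pages under a domain."""
    params = {"url": f"{domain}/*", "output": "json", "limit": str(limit)}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def captures(domain: str, limit: int = 5) -> list:
    """Fetch capture records (one JSON object per response line)."""
    with urlopen(build_cdx_query(domain, limit)) as resp:
        return [json.loads(line) for line in resp.read().splitlines()]

if __name__ == "__main__":
    # Hypothetical Algerian news domain; a non-empty result would mean
    # the crawl archived pages from the site.
    print(build_cdx_query("example-presse.dz"))
```

A non-empty result shows only that pages were archived, not that any particular model trained on them, but it is concrete evidence when mapping which national content sits in widely used training corpora.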


🧭 Decision Radar (Algeria Lens)

Relevance for Algeria: Moderate to high. Algerian content creators, publishers, and academic institutions are affected as both consumers of AI tools and potential rights holders whose works may have been used in training.
Infrastructure Ready? Not applicable in the traditional sense. The relevant infrastructure is legal: Algeria’s copyright enforcement mechanisms and judicial expertise in digital IP are underdeveloped.
Skills Available? Limited. Algeria has intellectual property lawyers but few with expertise in AI-specific copyright issues. Academic and judicial training is needed.
Action Timeline: 12-24 months for key US rulings; EU framework already operational, with GPAI transparency obligations in effect since August 2025; Algeria should monitor and prepare domestic policy responses.
Key Stakeholders: Ministry of Culture, ONDA (Office National des Droits d’Auteur), Algerian publishers and media companies, academic institutions, AI startups, international AI companies serving the Algerian market.
Decision Type: Policy monitoring and preparedness. Algeria should track international rulings, assess the exposure of Algerian content in training datasets, and consider whether its copyright framework needs AI-specific provisions.

Quick Take: The first substantive US fair use rulings on AI training arrived in mid-2025, the UK’s Getty v. Stability AI decision has set early precedent, and appellate guidance is expected through 2026. The emerging licensing model favors large content owners and creates a new revenue stream for publishers willing to negotiate. Algeria’s creators and institutions should begin documenting and asserting their rights now, before the market structure solidifies without them.

