A new investigation has raised serious questions about where OpenAI gets the data used to train its powerful AI models like GPT-4o. The research, conducted by the AI Disclosures Project, found strong evidence that OpenAI’s latest model recognizes content from paywalled O’Reilly Media programming books – suggesting these copyrighted materials may have been used without permission in its training.
The Big Questions About AI Training Data
At the heart of this debate is a simple but crucial question: Where do AI companies get the massive amounts of data needed to train their models? While some data comes from publicly available internet sources, there are growing concerns that copyrighted books, paywalled articles, and other restricted materials are being used without proper authorization.
The AI Disclosures Project, led by tech publishing pioneer Tim O’Reilly and economist Ilan Strauss, conducted an experiment using 34 copyrighted books from O’Reilly Media, the publisher known for its technical guides on programming and software development. Their goal? To see whether OpenAI’s models had been trained on this material.
How the Study Worked
- Researchers used a legally obtained dataset of O’Reilly books (both publicly available and paywalled versions)
- They tested whether OpenAI’s models could distinguish between real O’Reilly book content and AI-paraphrased versions
- The method, called DE-COP, helps detect whether an AI model has “memorized” specific copyrighted text (a simplified sketch of the idea follows this list)
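To make the method concrete, here is a minimal sketch of a DE-COP-style quiz in Python. It is simplified to two options so that pure guessing lands at 50%, the baseline the study’s numbers are measured against; the actual protocol uses more options and a separate model to generate the paraphrases. The prompt wording and function below are illustrative assumptions, not the researchers’ code, though the `openai` client calls are the official ones.

```python
# Minimal sketch of a DE-COP-style membership quiz (illustrative, not the study's code).
# With two options, a model that never saw the book should score around 50% by chance.
import random

from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def quiz_once(verbatim: str, paraphrase: str, model: str = "gpt-4o") -> bool:
    """Show a verbatim book excerpt and a paraphrase in random order and
    ask the model which one is the original. True if it picks correctly."""
    options = [verbatim, paraphrase]
    random.shuffle(options)
    prompt = (
        "Exactly one of these passages is quoted verbatim from a published book; "
        "the other is a paraphrase. Reply with only the letter of the verbatim passage.\n\n"
        f"A. {options[0]}\n\nB. {options[1]}"
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    letter = (reply.choices[0].message.content or "").strip().upper()[:1]
    if letter not in ("A", "B"):
        return False  # count a malformed reply as a miss
    return options[("A", "B").index(letter)] == verbatim
```

Repeated over many excerpts, the fraction of correct picks becomes the recognition rate: hovering near 50% looks like guessing, while scores well above it suggest the passages were memorized during training.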
Key Findings: GPT-4o Shows Strong Recognition of Paywalled Books
The results were striking:
GPT-4o (OpenAI’s newest model) showed an 82% recognition rate for paywalled O’Reilly content – meaning it clearly “remembered” this copyrighted material.
In contrast, GPT-3.5 Turbo (an older model) barely recognized the books, scoring just above 50% – essentially random guessing.
GPT-4o Mini, a smaller version of the model, showed no significant recognition of the books at all.
This suggests that OpenAI may have included copyrighted O’Reilly books in GPT-4o’s training data, while earlier models such as GPT-3.5 Turbo were trained on little or none of this material.
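Why does “just above 50%” count as guessing while 82% counts as recognition? On a two-option quiz, a model with no memory of the text lands near 50% by chance alone, and basic statistics makes the gap obvious. The trial counts below are invented for illustration (the published study reports AUROC scores rather than raw accuracy), but they show the arithmetic:

```python
# Hypothetical sanity check: is a recognition rate meaningfully above 50% chance?
# The trial counts here are invented for illustration, not the study's raw data.
from scipy.stats import binomtest

TRIALS = 100  # number of quiz questions, chosen arbitrarily
for model, correct in [("GPT-4o", 82), ("GPT-3.5 Turbo", 54)]:
    result = binomtest(correct, TRIALS, p=0.5, alternative="greater")
    print(f"{model}: {correct}/{TRIALS} correct, p = {result.pvalue:.2e}")

# 82/100 is wildly improbable under pure guessing (p is far below 0.001),
# while 54/100 is entirely consistent with chance (p is roughly 0.24).
```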
Where Did OpenAI Get These Books?
The study notes that all 34 tested books were available on LibGen (Library Genesis), a controversial shadow library often used to bypass paywalls. While OpenAI hasn’t confirmed using LibGen, the findings raise concerns about whether AI companies are sourcing training data from legally questionable sources.
Why This Matters: The Growing Battle Over AI and Copyright
This isn’t just about one set of programming books – it’s part of a much larger debate:
1. Are AI Companies Exploiting Creators?
- Authors, journalists, and artists worry their work is being used without payment or consent
- If AI models train on paywalled books, should publishers be compensated?
- Some argue this could reduce incentives for professional content creation
2. Legal Gray Areas
- Current copyright laws weren’t written with AI in mind
- Courts are still deciding whether AI training counts as “fair use”
- The EU AI Act is pushing for more transparency, but enforcement remains unclear
3. A Growing Market for Licensed Data
Some companies are trying to do things the right way:
- Defined.ai and others now offer licensed training data with proper permissions
- Media companies such as the Associated Press and Axel Springer have struck licensing deals with AI firms
- Will voluntary licensing become the norm, or will regulation force AI companies to pay up?
OpenAI’s Response (Or Lack Thereof)
So far, OpenAI hasn’t directly addressed these findings. The company has previously stated that it uses a mix of publicly available data, licensed content, and synthetic data for training. However, it has never provided a full list of its training sources, citing competitive reasons.
Critics argue that without true transparency, it’s impossible to know whether AI models are built ethically – or if they’re profiting from others’ unpaid work.
What Happens Next? Legal Battles and Policy Changes
This study adds fuel to an already fiery debate:
Lawsuits Piling Up
- The New York Times is suing OpenAI for copyright infringement
- Authors like George R.R. Martin and John Grisham have filed similar cases
- Courts may soon decide whether AI training violates copyright law
Push for Regulation
- The EU AI Act will require providers of general-purpose AI models to publish summaries of their training data
- The U.S. is considering similar rules, but progress is slower
- Will governments force AI companies to reveal what’s in their training data?
Alternative Solutions
- Some propose a royalty system, where AI firms pay content creators per use
- Others suggest opt-in systems, where creators must consent before their work is used
- Could blockchain or watermarking help track AI training sources?
The Bigger Picture: Can AI and Creators Coexist?
This study highlights a fundamental tension in AI development:
- AI needs vast amounts of data to improve and stay competitive
- Creators deserve compensation when their work is used commercially
Finding a balance won’t be easy. If AI companies ignore copyright concerns, they risk legal battles and public backlash. But if regulation becomes too strict, could it stifle innovation?
One thing is clear: The way AI models are trained today will shape the internet’s future. If professional content creators can’t make a living because AI absorbs their work without payment, we might see a decline in high-quality books, journalism, and art.
What Can Be Done?
- More transparency from AI companies about training data
- Better licensing systems to compensate creators
- Clearer laws on AI and copyright
Final Thoughts: A Turning Point for AI Ethics
The AI Disclosures Project’s findings don’t just implicate OpenAI – they point to a systemic issue in how the AI industry operates. As these models grow more powerful, the debate over who gets paid, who gets credit, and what counts as fair use will only intensify.
Will AI companies voluntarily clean up their practices? Or will it take lawsuits and regulations to force change? The next few years will be critical in determining whether AI develops as a tool that benefits everyone – or one that exploits the very creators who make its existence possible.