A new investigation has raised serious questions about where OpenAI gets the data used to train its powerful AI models like GPT-4o. The research, conducted by the AI Disclosures Project, found strong evidence that OpenAI’s latest model recognizes content from paywalled O’Reilly Media programming books – suggesting these copyrighted materials may have been used without permission in its training.
The Big Questions About AI Training Data
At the heart of this debate is a simple but crucial question: Where do AI companies get the massive amounts of data needed to train their models? While some data comes from publicly available internet sources, there are growing concerns that copyrighted books, paywalled articles, and other restricted materials are being used without proper authorization.
The AI Disclosures Project, led by tech publishing pioneer Tim O’Reilly and economist Ilan Strauss, conducted an experiment using 34 copyrighted books from O’Reilly Media, the publisher known for its technical guides on programming and software development. Their goal? To see whether OpenAI’s models had been trained on this material.
How the Study Worked
- Researchers used a legally obtained dataset of O’Reilly books (both publicly available and paywalled versions)
- They tested whether OpenAI’s models could distinguish between real O’Reilly book content and AI-paraphrased versions
- The method, called DE-COP, helps detect whether an AI model has “memorized” specific copyrighted text (a simplified sketch of the idea follows this list)
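To make the method concrete, here is a minimal sketch of a DE-COP-style quiz in Python. It is simplified to two options so that pure guessing lands at 50%, the baseline the study’s numbers are measured against; the actual protocol uses more options and a separate model to generate the paraphrases. The prompt wording and function below are illustrative assumptions, not the researchers’ code, though the `openai` client calls are the official ones.

```python
# Minimal sketch of a DE-COP-style membership quiz (illustrative, not the study's code).
# With two options, a model that never saw the book should score around 50% by chance.
import random

from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def quiz_once(verbatim: str, paraphrase: str, model: str = "gpt-4o") -> bool:
    """Show a verbatim book excerpt and a paraphrase in random order and
    ask the model which one is the original. True if it picks correctly."""
    options = [verbatim, paraphrase]
    random.shuffle(options)
    prompt = (
        "Exactly one of these passages is quoted verbatim from a published book; "
        "the other is a paraphrase. Reply with only the letter of the verbatim passage.\n\n"
        f"A. {options[0]}\n\nB. {options[1]}"
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    letter = (reply.choices[0].message.content or "").strip().upper()[:1]
    if letter not in ("A", "B"):
        return False  # count a malformed reply as a miss
    return options[("A", "B").index(letter)] == verbatim
```

Repeated over many excerpts, the fraction of correct picks becomes the recognition rate: hovering near 50% looks like guessing, while scores well above it suggest the passages were memorized during training.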
Key Findings: GPT-4o Shows Strong Recognition of Paywalled Books
The results were striking:
GPT-4o (OpenAI’s newest model) showed an 82% recognition rate for paywalled O’Reilly content – meaning it clearly “remembered” this copyrighted material.
In contrast, GPT-3.5 Turbo (an older model) barely recognized the books, scoring just above 50% – essentially random guessing.
GPT-4o Mini, a smaller version of the model, showed no significant recognition of the books at all.
This suggests that OpenAI may have included copyrighted O’Reilly books in GPT-4o’s training data, while earlier models such as GPT-3.5 Turbo were trained on little or none of this material.
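Why does “just above 50%” count as guessing while 82% counts as recognition? On a two-option quiz, a model with no memory of the text lands near 50% by chance alone, and basic statistics makes the gap obvious. The trial counts below are invented for illustration (the published study reports AUROC scores rather than raw accuracy), but they show the arithmetic:

```python
# Hypothetical sanity check: is a recognition rate meaningfully above 50% chance?
# The trial counts here are invented for illustration, not the study's raw data.
from scipy.stats import binomtest

TRIALS = 100  # number of quiz questions, chosen arbitrarily
for model, correct in [("GPT-4o", 82), ("GPT-3.5 Turbo", 54)]:
    result = binomtest(correct, TRIALS, p=0.5, alternative="greater")
    print(f"{model}: {correct}/{TRIALS} correct, p = {result.pvalue:.2e}")

# 82/100 is wildly improbable under pure guessing (p is far below 0.001),
# while 54/100 is entirely consistent with chance (p is roughly 0.24).
```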
Where Did OpenAI Get These Books?
The study notes that all 34 tested books were available on LibGen (Library Genesis), a controversial shadow library often used to bypass paywalls. While OpenAI hasn’t confirmed using LibGen, the findings raise concerns about whether AI companies are sourcing training data from legally questionable sources.
Why This Matters: The Growing Battle Over AI and Copyright
This isn’t just about one set of programming books – it’s part of a much larger debate:
1. Are AI Companies Exploiting Creators?
- Authors, journalists, and artists worry their work is being used without payment or consent
- If AI models train on paywalled books, should publishers be compensated?
- Some argue this could reduce incentives for professional content creation
2. Legal Gray Areas
- Current copyright laws weren’t written with AI in mind
- Courts are still deciding whether AI training counts as “fair use”
- The EU AI Act is pushing for more transparency, but enforcement remains unclear
3. A Growing Market for Licensed Data
Some companies are trying to do things the right way:
- Defined.ai and others now offer licensed training data with proper permissions
- Media companies such as the Associated Press and Axel Springer have struck licensing deals with AI firms
- Will voluntary licensing become the norm, or will regulation force AI companies to pay up?
OpenAI’s Response (Or Lack Thereof)
So far, OpenAI hasn’t directly addressed these findings. The company has previously stated that it uses a mix of publicly available data, licensed content, and synthetic data for training. However, it has never provided a full list of its training sources, citing competitive reasons.
Critics argue that without true transparency, it’s impossible to know whether AI models are built ethically – or if they’re profiting from others’ unpaid work.
What Happens Next? Legal Battles and Policy Changes
This study adds fuel to an already fiery debate:
Lawsuits Piling Up
- The New York Times is suing OpenAI for copyright infringement
- Authors like George R.R. Martin and John Grisham have filed similar cases
- Courts may soon decide whether AI training violates copyright law
Push for Regulation
- The EU AI Act will require providers of general-purpose AI models to publish summaries of their training data
- The U.S. is considering similar rules, but progress is slower
- Will governments force AI companies to reveal what’s in their training data?
Alternative Solutions
- Some propose a royalty system, where AI firms pay content creators per use
- Others suggest opt-in systems, where creators must consent before their work is used
- Could blockchain or watermarking help track AI training sources?
The Bigger Picture: Can AI and Creators Coexist?
This study highlights a fundamental tension in AI development:
- AI needs vast amounts of data to improve and stay competitive
- Creators deserve compensation when their work is used commercially
Finding a balance won’t be easy. If AI companies ignore copyright concerns, they risk legal battles and public backlash. But if regulation becomes too strict, could it stifle innovation?
One thing is clear: The way AI models are trained today will shape the internet’s future. If professional content creators can’t make a living because AI absorbs their work without payment, we might see a decline in high-quality books, journalism, and art.
What Can Be Done?
- More transparency from AI companies about training data
- Better licensing systems to compensate creators
- Clearer laws on AI and copyright
Final Thoughts: A Turning Point for AI Ethics
The AI Disclosures Project’s findings don’t just implicate OpenAI – they point to a systemic issue in how the AI industry operates. As these models grow more powerful, the debate over who gets paid, who gets credit, and what counts as fair use will only intensify.
Will AI companies voluntarily clean up their practices? Or will it take lawsuits and regulations to force change? The next few years will be critical in determining whether AI develops as a tool that benefits everyone – or one that exploits the very creators who make its existence possible.