OpenAI’s models ‘memorized’ copyrighted content, new study suggests

A new study appears to lend credibility to allegations that OpenAI has trained at least some of its AI models on copyrighted content.
OpenAI is embroiled in lawsuits filed by authors, programmers, and other rights holders who accuse the company of using their works — books, codebases, and so on — to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that U.S. copyright law contains no carve-out for training data.
The study, which was co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data “memorized” by models that sit behind an API, such as OpenAI’s.
Models are prediction engines. Trained on a lot of data, they learn patterns — that is how they are able to generate essays, images, and more. Most outputs are not verbatim copies of the training data, but because of the way models “learn,” some inevitably are. Image models have been found to regurgitate screenshots from films they were trained on, while language models have been observed effectively plagiarizing news articles.
The study’s method relies on words the co-authors call “high-surprisal” — that is, words that stand out as statistically uncommon in the context of a larger body of text. For example, the word “radar” in the sentence “Jack and I sat perfectly still with the radar humming” would be considered high-surprisal because it is statistically less likely than words such as “engine” or “radio” to appear before “humming.”
The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to “guess” the words that had been masked. If a model guessed correctly, it likely memorized the snippet during training, the co-authors conclude.
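The masking step described above can be sketched in a few lines of code. The sketch below is a simplified illustration, not the study’s actual implementation: it ranks words by surprisal under a toy unigram model (the paper’s notion of surprisal is contextual), masks the most surprising word, and records the answer a memorizing model would be expected to guess. The reference corpus here is hypothetical.

```python
import math
from collections import Counter

def surprisal(word, counts, total, vocab):
    # Surprisal under a toy unigram model with add-one smoothing:
    # -log2 P(word). Rarer words score higher. (The study uses a
    # contextual notion of surprisal; this only illustrates the idea.)
    return -math.log2((counts[word.lower()] + 1) / (total + vocab))

def mask_most_surprising(sentence, counts, total, vocab, k=1):
    # Replace the k most surprising words with [MASK]. A model that has
    # memorized the passage should still be able to guess them.
    words = sentence.split()
    ranked = sorted(range(len(words)),
                    key=lambda i: surprisal(words[i], counts, total, vocab),
                    reverse=True)[:k]
    masked = [("[MASK]" if i in ranked else w) for i, w in enumerate(words)]
    answers = {i: words[i] for i in ranked}
    return " ".join(masked), answers

# Hypothetical reference corpus in which "radar" never appears,
# so it stands out as the most surprising word in the excerpt.
corpus = ("jack and i sat perfectly still with the engine humming "
          "jack and i sat perfectly still with the radio humming").split()
counts = Counter(corpus)
masked, answers = mask_most_surprising(
    "Jack and I sat perfectly still with the radar humming",
    counts, len(corpus), len(counts))
print(masked)   # Jack and I sat perfectly still with the [MASK] humming
print(answers)  # {8: 'radar'}
```

The masked sentence and the hidden answers would then be sent to the model under test; a high rate of correct guesses on many such excerpts is the signal of memorization.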
According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted ebook samples called BookMIA. The results also indicated that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the “contentious data” the models may have been trained on.
“In order to have large language models that are trustworthy, we need models that we can probe, audit, and examine scientifically,” Ravichander said. “Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem.”
OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has some content licensing deals in place and offers opt-out mechanisms that let copyright owners flag content they would prefer the company not use for training purposes, it has lobbied several governments to codify “fair use” rules around AI training.
2025-04-04 18:42:00