Study claims OpenAI trains AI models on copyrighted data

A new study from the AI Disclosures Project has raised questions about the data OpenAI uses to train its large language models (LLMs). The research indicates that OpenAI’s GPT-4o model shows “strong recognition” of paywalled and copyrighted content from O’Reilly Media books.
The AI Disclosures Project, led by technologist Tim O’Reilly and economist Ilan Strauss, aims to address the potentially harmful societal effects of the commercialization of AI by calling for improved corporate and technological transparency. The project’s working paper highlights the lack of disclosure in AI, drawing parallels with financial disclosure standards and their role in fostering robust securities markets.
The study used a legally obtained dataset of 34 copyrighted O’Reilly Media books to investigate whether OpenAI’s LLMs were trained on copyrighted data without consent. The researchers applied the DE-COP membership inference attack to determine whether the models could distinguish human-authored O’Reilly texts from LLM-paraphrased versions.
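To make the idea concrete, here is a minimal Python sketch of a DE-COP-style quiz. It is an illustration under stated assumptions, not the researchers' code: the passage is mixed in with its paraphrases, a model is asked to pick the original, and the fraction of correct picks (the guess rate) is compared with chance. The `choose` callable is a hypothetical stand-in for a query to the model under test, and the number of paraphrases per passage is whatever the caller supplies.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One quiz item: the human-authored passage plus its LLM paraphrases.
Passage = Tuple[str, Sequence[str]]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the original in among its paraphrases.

    Returns the shuffled option texts and the index of the original.
    """
    tagged = [(original, True)] + [(p, False) for p in paraphrases]
    rng.shuffle(tagged)
    texts = [text for text, _ in tagged]
    answer = next(i for i, (_, is_original) in enumerate(tagged) if is_original)
    return texts, answer


def guess_rate(passages: Sequence[Passage],
               choose: Callable[[List[str]], int],
               seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the original passage.

    `choose` is a hypothetical placeholder: in a real test it would send the
    options to the model and parse the index of the option it selects.
    """
    if not passages:
        raise ValueError("need at least one passage")
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in passages:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(passages)
```

With three paraphrases per passage, random guessing would land near 25%, so a guess rate well above that on a set of passages hints that the model has seen the original wording before.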
Key findings from the report include:
- GPT-4o shows “strong recognition” of paywalled O’Reilly book content, with an AUROC score of 82%. By contrast, OpenAI’s earlier model, GPT-3.5 Turbo, does not show the same level of recognition (AUROC just above 50%).
- GPT-4o shows stronger recognition of non-public O’Reilly book content than of publicly accessible samples (AUROC scores of 82% and 64%, respectively).
- GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples than of non-public ones (AUROC scores of 64% and 54%).
- GPT-4o Mini, a smaller model, showed no knowledge of either public or non-public O’Reilly Media content when tested (AUROC of about 50%).
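For readers unfamiliar with the metric, AUROC measures how well a score separates two groups: 50% means no better than chance, and 100% means perfect separation. The sketch below computes it in plain Python as the probability that a randomly chosen score from one group exceeds a randomly chosen score from the other, with ties counted as half. The function name and the example scores are illustrative assumptions, not data or code from the study.

```python
from typing import Sequence


def auroc(member_scores: Sequence[float],
          non_member_scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    `member_scores` are scores for items suspected to be in the training
    data; `non_member_scores` are scores for items believed not to be.
    """
    if not member_scores or not non_member_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0   # member ranked above non-member
            elif m == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(member_scores) * len(non_member_scores))


# Illustrative, made-up scores: three of the four pairs are ordered correctly.
example = auroc([0.9, 0.4], [0.6, 0.2])  # 0.75
```

Read this way, a reported AUROC of 82% suggests the model's responses separate the two groups of passages far better than chance, while values near 50% suggest no detectable difference.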
The researchers suggest that access violations may have occurred via the LibGen database, since all of the O’Reilly books tested were found there. They also acknowledge that newer LLMs are better at distinguishing human-authored from machine-generated language, which does not reduce the method’s ability to classify data.
The study highlights the potential for “temporal bias” in the results, caused by changes in language over time. To account for this, the researchers tested two models (GPT-4o and GPT-4o Mini) that were trained on data from the same period.
The report notes that although the evidence is specific to OpenAI and O’Reilly Media books, it likely reflects a systemic problem around the use of copyrighted data. It argues that using training data without compensation could reduce the quality and diversity of content on the internet as revenue streams for professional content creation shrink.
The AI Disclosures Project emphasizes the need for stronger accountability in how AI companies train their models. It suggests that liability provisions that incentivize improved corporate transparency in disclosing data provenance may be an important step towards facilitating commercial markets for training data licensing and remuneration.
The EU AI Act’s disclosure requirements could help trigger a positive cycle of disclosure standards if they are properly specified and enforced. Ensuring that IP holders know when their work has been used in model training is seen as a crucial step towards creating AI markets for content creators’ data.
Despite evidence that AI companies may be obtaining data illegally for model training, a market is emerging in which AI model developers pay for content through licensing deals. Companies such as Defined.ai facilitate the purchase of training data, obtaining consent from data providers and stripping out personally identifiable information.
The report concludes that, using 34 proprietary O’Reilly Media books, the study provides empirical evidence that OpenAI likely trained GPT-4o on non-public, copyrighted data.
(Photo by Sergey Tukakov)
2025-04-02 09:04:00