Breaking News

OpenAI and Microsoft are teaming up with Harvard’s libraries to train AI models on 600-year-old books

Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.

Nearly one million books, published as early as the 15th century and in 254 languages, are part of a Harvard University collection being released to AI researchers on Thursday. Also coming soon are troves of old newspapers and government documents held by the Boston Public Library.

Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.

“It is a prudent decision to start with public domain data because that’s less controversial right now than content that’s still under copyright,” said Burton Davis, a deputy general counsel at Microsoft.

Davis said libraries also hold “significant amounts of interesting cultural, historical and language data” that is missing from the past few decades of online commentary that AI chatbots have mostly learned from. Fears of running out of data have also led AI developers to turn to “synthetic” data, which is made by the chatbots themselves and is of lower quality.

Supported by Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries and museums around the world on how to make their historic collections AI-ready in a way that also benefits the communities they serve.

“We’re trying to move some of the power from this current AI moment back to these institutions,” said Aristana Scourtas, who manages research at Harvard Law School’s Library Innovation Lab. “Librarians have always been the stewards of data and the stewards of information.”

Harvard’s newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s: a Korean painter’s handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.

It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.

“A lot of the data that’s been used in AI training has not come from original sources,” said the data initiative’s executive director, Greg Leppert, who is also chief technologist at Harvard’s Berkman Klein Center for Internet & Society. This book collection goes “all the way back to the physical copy that was scanned by the institutions that actually collected those items,” he said.

Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn’t think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens: units of data, each of which can represent a piece of a word.

Harvard’s new AI training collection has an estimated 242 billion tokens, an amount that is hard for humans to fathom but still just a drop of what is being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.

Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from “shadow libraries” of pirated works.

Now, with some reservations, the real libraries are standing up.

OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University’s Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.

When the company first reached out to the Boston Public Library, one of the biggest in the United States, the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.

“OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,” Chapel said.

Digitization is expensive. It has been painstaking work, for instance, for Boston’s library to scan and curate dozens of New England’s French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.

Harvard’s collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.

Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The matter was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.

Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the United States typically last for 95 years, and longer for sound recordings.

The new effort was applauded on Thursday by the same authors’ group that sued Google over its book project and more recently has brought AI companies to court.

“Many of these titles exist only in the stacks of major libraries, and the creation and use of this dataset will provide expanded access to these volumes and the knowledge within,” said Mary Rasenberger, CEO of the Authors Guild, in a statement on Thursday. “Above all, the creation of a large legal training dataset will democratize the creation of new AI models.”

How useful all of this will be for the next generation of AI tools remains to be seen, as the data is shared on Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.

The book collection is more linguistically diverse than typical AI data sources. Fewer than half of the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.

Leppert said that a book collection steeped in 19th-century thought could also be “immensely critical” for the tech industry’s efforts to build AI agents that can plan and reason as well as humans.

“At a university, you have a lot of pedagogy around what it means to reason,” Leppert said. “You have a lot of scientific information about how to run processes and how to run analyses.”

At the same time, there is also plenty of outdated data, from debunked scientific and medical theories to racist and colonial narratives.

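“When you’re dealing with such a large data set, there are some tricky issues around harmful content and language,” said Kristi Mukk, a coordinator at Harvard’s Library Innovation Lab, who said the initiative is trying to provide guidance about mitigating the risks of using the data, to “help them make their own informed decisions and use AI responsibly.”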


The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP’s text archives.

This story was originally featured on Fortune.com

2025-06-12 19:10:00
