A major AI training data set contains millions of examples of personal data



“Anything you put online can [be] and probably has been scraped.”

The researchers found thousands of instances of validated identity documents, including images of credit cards, driver’s licenses, passports, and birth certificates, as well as more than 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to do so because of issues such as image clarity.)

A number of the résumés disclosed sensitive information including disability status, the results of background checks, the birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (such as references).

Examples of identity-related documents found in CommonPool’s small-scale data set, showing a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and the text has been paraphrased to avoid direct quotations. The images have been redacted to show the presence of faces without identifying the individuals.

Courtesy of the researchers

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use.

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022.

Although commercial models often do not disclose which data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool’s curators did not respond to emailed questions.

And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are] many downstream models” trained on this data set, says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author.

Good intentions are not enough

“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab.


2025-07-18 13:08:00
