Meta’s Investment in AI Data Labeling Explained

Earlier this summer, Meta made a $14.3 billion bet on a company most people have never heard of: Scale AI. The deal, which gave Meta a 49 percent stake, sent Scale AI’s other customers, including OpenAI and Google, scrambling to exit their contracts for fear that Meta might gain insight into how they train their AI models.
Scale AI is a pioneer in data labeling for artificial intelligence models. The industry, in essence, does what it says on the tin. A basic example is the thumbs-up and thumbs-down icons you may have seen if you’ve used ChatGPT. One labels a response as positive. The other, negative.
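In code, a single piece of that feedback boils down to a labeled (prompt, response) pair. Here is a minimal sketch in Python; the field names are illustrative, not drawn from any real labeling API:

```python
from dataclasses import dataclass

@dataclass
class PreferenceLabel:
    prompt: str       # what the user asked
    response: str     # what the model answered
    thumbs_up: bool   # the human's verdict

# Two hypothetical feedback events, as a labeling pipeline might store them
labels = [
    PreferenceLabel("What is 2 + 2?", "4", thumbs_up=True),
    PreferenceLabel("What is 2 + 2?", "5, probably", thumbs_up=False),
]

# Only the positively rated responses become a reward signal for training
positive = [label for label in labels if label.thumbs_up]
print(len(positive))  # 1
```

Pipelines built on this kind of record feed the positive/negative split back into fine-tuning, which is where the labeling industry earns its keep.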
But as AI models have grown, both in size and in popularity, this simple task has ballooned into a monster facing every organization that trains or fine-tunes a model.
“The vast majority of compute is spent on pretraining data that has poor quality,” says Sara Hooker, vice president of research at Cohere Labs. “We need to reduce this, to improve it, by applying high-quality gold-dust data.”
What is data labeling?
In the past, computer scientists relied on the intuition of “garbage in, garbage out”: bad inputs inevitably lead to bad outputs.
However, as Hooker suggests, training modern AI models challenges that intuition. Large language models are trained on raw text data scraped from the public internet, much of which is low quality (Reddit posts tend to outnumber academic papers).
Cleaning and sorting training data is possible in theory, but with modern models training on internet-scale data, it’s impractical due to the sheer volume involved. That’s a problem, because popular AI training datasets include racist, sexist, and criminal content. Training data can also contain subtler problems, such as satirical or intentionally misleading advice. Put simply: a lot of garbage finds its way into training data.
That’s where data labeling steps in to clean up the mess. Instead of trying to scrub every problematic element from the training data, human experts manually annotate the model’s output after it has been trained. This fine-tunes the model, suppressing unwanted responses and changing its behavior.
Sajjad Abdoli, an AI scientist at Perle, describes this process as creating “golden standards” for fine-tuning AI models. What exactly that standard contains depends on the model’s purpose. “We walk our customers through this procedure, and we create quality assessment standards,” Abdoli says.
Consider a typical chatbot. Most companies want to create a chatbot that’s helpful, accurate, and polite, so data labelers annotate with those goals in mind. A human data reviewer reads the responses the model generates for a set of test prompts. A response that delivers brief, accurate information is labeled positive. A rambling or insulting response is labeled negative.
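A rubric like that can be sketched as a simple scoring function. This is a toy illustration with made-up criteria, not any company’s actual methodology; real rubrics also check factual accuracy against reference answers:

```python
def rate_response(response: str, max_words: int = 60) -> str:
    """Label a chatbot response against a toy concise-and-polite rubric."""
    concise = len(response.split()) <= max_words
    polite = not any(word in response.lower() for word in ("idiot", "stupid"))
    return "positive" if concise and polite else "negative"

print(rate_response("The capital of France is Paris."))  # positive
print(rate_response("Only an idiot would ask that."))    # negative
```

Aggregated over thousands of test prompts, labels like these become the signal that steers the model’s behavior during fine-tuning.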
Not all AI models are meant to be chatbots, or even to focus on text. As an example, Abdoli described Perle’s work helping a customer with a model for labeling photos. Perle contracted human experts to accurately label objects in thousands of images, creating a benchmark that could be used to improve the model. “We found a huge gap between what the human experts identified in the picture and what the machine learning model could recognize.”
Why did Meta invest billions of dollars in Scale AI?
Data labeling is essential to fine-tuning any AI model, but that alone doesn’t explain why Meta was willing to invest more than $14 billion in Scale AI. To understand that, we need to look at the AI industry’s latest obsession: agentic AI.
Sam Altman, CEO of OpenAI, believes AI will allow one person to build a billion-dollar company (or more). To make that dream come true, though, AI companies need to build agentic models capable of complex, multi-step workflows that may span days or even weeks and involve the use of many software tools.
It turns out that data labeling is a key ingredient in the agentic AI recipe.
“Take a world where you have many agents that interact with each other,” said Jason Liang, a senior vice president at the AI data-labeling company SuperAnnotate. “A person will have to step in and review. Did the agent call the correct tool? Did it prompt the next agent correctly?”
In fact, the problem is more complicated than it first appears, because evaluating an AI agent requires judging both individual actions and the overall plan. For example, several agents may call each other in sequence, each for reasons that appear justified. “In reality, it was possible that the first agent should have just contacted the fourth and skipped the two in the middle,” says Liang.
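Liang’s two levels of review, per-step and whole-plan, can be sketched as checks over an agent trajectory. Everything here (the step fields, the tool names, the known shortest plan) is hypothetical:

```python
# Each step records which agent acted and which tool it called.
trajectory = [
    {"agent": "planner", "tool": "search"},
    {"agent": "researcher", "tool": "summarize"},
    {"agent": "writer", "tool": "draft"},
]

ALLOWED_TOOLS = {"search", "summarize", "draft"}

def step_ok(step: dict) -> bool:
    """Per-step check: did the agent call a valid tool?"""
    return step["tool"] in ALLOWED_TOOLS

def plan_ok(traj: list, shortest_plan: tuple = ("search", "draft")) -> bool:
    """Whole-plan check: a reviewer flags trajectories longer than
    the known shortest path to the same result."""
    return len(traj) <= len(shortest_plan)

print(all(step_ok(s) for s in trajectory))  # True: every step is locally valid
print(plan_ok(trajectory))  # False: the middle step was unnecessary
```

The gap between the two printed results is exactly Liang’s point: each step can look justified on its own while the plan as a whole is wasteful, which is why humans still review trajectories.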
Agentic AI also requires models that can solve problems in high-stakes fields, where an agent’s results can have life-or-death consequences. Perle’s Abdoli pointed to medicine as a leading use case. An AI doctor capable of careful diagnosis, even in a single field or under limited conditions, would be hugely valuable. But creating such an agent, if it’s possible at all, will push the data labeling industry to its limits.
“If you’re collecting medical notes, data from CT scans, or data like that, then you need to source doctors [to label and annotate the data]. It’s very expensive. However, for these types of activities, the accuracy and quality of the data is the most important thing,” Abdoli says.
The effect of synthetic data on AI training
But if AI models require human experts to label data to judge and improve them, where does that need end? Will we have teams of doctors labeling data in offices instead of doing actual medical work?
This is where synthetic data enters the picture.
Instead of relying entirely on human experts, data labeling companies often use AI models to create training data for other AI models, essentially letting machines teach machines. Modern data labeling pipelines are often a mix of manual human annotation and automated AI teachers designed to reinforce the desired model behavior.
“You have a teacher, and your teacher, in this case, is just another deep neural network, which generates an example,” says Cohere’s Hooker. “Then the student model is trained on that example.” She notes that the key is using a high-quality teacher, and using a variety of AI “teachers” rather than relying on a single model. This avoids the problem of model collapse, in which the output quality of a model trained largely on AI-generated data degrades.
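The multi-teacher setup Hooker describes can be sketched roughly as follows. The “teachers” here are stand-in functions rather than real networks, and the mixing strategy is an assumption for illustration:

```python
import random

# Stand-ins for two distinct teacher models, each with its own style.
def teacher_a(prompt: str) -> str:
    return prompt.upper()   # pretend output style of teacher A

def teacher_b(prompt: str) -> str:
    return prompt[::-1]     # pretend output style of teacher B

def build_student_dataset(prompts, teachers, seed=0):
    """Sample a teacher per prompt so the student sees varied supervision,
    rather than collapsing toward a single model's quirks."""
    rng = random.Random(seed)
    return [(p, rng.choice(teachers)(p)) for p in prompts]

dataset = build_student_dataset(["hello", "world"], [teacher_a, teacher_b])
print(len(dataset))  # 2 synthetic (prompt, target) pairs for the student
```

In a real pipeline the targets would be generated text from large models and the student would be trained on the pairs; the structural point is only that supervision is drawn from several teachers, not one.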
DeepSeek R1, the model from the Chinese company of the same name that made waves in January for its cheap training, is an extreme example of how synthetic data works in practice. It achieved reasoning performance on par with the best models from OpenAI, Anthropic, and Google without traditional human feedback. Instead, DeepSeek R1 was trained on “cold start” data consisting of a few thousand human-selected examples of chain-of-thought reasoning. DeepSeek then used rule-based rewards to reinforce reasoning behavior in the model.
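A rule-based reward checks machine-verifiable properties of an output, such as correct formatting and a correct final answer, rather than asking a human to judge it. A toy version, with an assumed answer format that is not DeepSeek’s actual one:

```python
import re

def rule_based_reward(output: str, correct_answer: str) -> float:
    """Score a reasoning trace: +0.5 for showing work inside <think> tags,
    +0.5 for the right final answer. No human judgment involved."""
    reward = 0.0
    if re.search(r"<think>.+</think>", output, re.DOTALL):
        reward += 0.5
    match = re.search(r"answer:\s*(\S+)", output, re.IGNORECASE)
    if match and match.group(1) == correct_answer:
        reward += 0.5
    return reward

trace = "<think>2 + 2 makes 4</think> Answer: 4"
print(rule_based_reward(trace, "4"))  # 1.0
```

Because the reward is computed entirely by code, it can be applied to millions of model outputs during reinforcement learning at no labeling cost, which is what made the approach so cheap.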
However, SuperAnnotate’s Liang warned that synthetic data is not a silver bullet. While the AI industry is often keen to automate wherever possible, attempts to use models for ever more complex tasks can reveal edge cases that only humans catch. “When we started seeing companies putting models into production, they all come to the realization, holy moly, I need to introduce humans into this mix.”
This is exactly why data labeling companies like Scale AI, Perle, and SuperAnnotate (among dozens of others) are in the spotlight. The best way to fine-tune agentic AI models to handle complex or specialized cases, whether through human annotation, synthetic data, some mix of the two, or techniques not yet discovered, remains an open question. Meta’s $14 billion bet suggests the answer won’t be cheap.
2025-08-01 13:00:00