From 100,000 to Under 500 Labels: How Google AI Cuts LLM Training Data by Orders of Magnitude

Google Research has unveiled a pioneering way to fine-tune large language models (LLMs) that reduces the amount of training data required by up to 10,000x while preserving, or even improving, model quality. The approach centers on active learning, focusing expert labeling effort on the most informative examples: the “borderline cases” where model uncertainty peaks.
The traditional bottleneck
Fine-tuning LLMs for tasks that demand a deep understanding of context and culture, such as ad content safety or moderation, has traditionally required huge, high-quality datasets. Because most of the data is benign, only a small fraction of examples matters for detecting a policy violation, which drives up the cost and complexity of data curation. Standard methods also struggle to keep pace when policies or problematic patterns shift, forcing expensive retraining.
Google’s active learning breakthrough
How it works:
- LLM as scout: An LLM scans a vast corpus (hundreds of billions of examples) and flags the cases it is least certain about.
- Targeted expert labeling: Instead of annotating thousands of random examples, human experts label only these borderline items.
- Iteration: The process repeats, with each new batch of “problematic” examples chosen around the model’s latest points of confusion.
- Rapid convergence: The model is fine-tuned over multiple rounds, and iteration continues until its output closely matches expert judgment, as measured by Cohen’s Kappa, which scores agreement between annotators beyond what chance alone would produce (a minimal sketch of the loop follows this list).
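
A minimal sketch of this curation loop in Python follows. It is an illustration of the technique as described, not Google’s published code: the `model` and `experts` objects and their methods (`predict_proba`, `classify`, `fine_tune`, `annotate`) are hypothetical placeholders, and only `cohen_kappa_score` from scikit-learn is a real library call.

```python
from sklearn.metrics import cohen_kappa_score

KAPPA_TARGET = 0.8  # expert-alignment threshold reported in the research

def select_most_uncertain(corpus, scores, k):
    """Keep the k examples closest to the decision boundary (score ~ 0.5)."""
    ranked = sorted(zip(corpus, scores), key=lambda pair: abs(pair[1] - 0.5))
    return [example for example, _ in ranked[:k]]

def curation_loop(model, corpus, experts, max_rounds=10):
    """Iteratively fine-tune on expert-labeled borderline examples.

    `model` and `experts` are hypothetical stand-ins for a classifier
    (predict_proba / classify / fine_tune) and an annotation workflow
    (annotate); they do not correspond to any real Google API.
    """
    labeled = []
    for _ in range(max_rounds):
        # 1. LLM as scout: score the corpus and keep only the cases
        #    the model is least certain about.
        scores = model.predict_proba(corpus)
        borderline = select_most_uncertain(corpus, scores, k=50)

        # 2. Targeted expert labeling of just those borderline items,
        #    yielding (example, label) pairs.
        labeled.extend(experts.annotate(borderline))

        # 3. Fine-tune on the curated set, then compare the model's
        #    outputs against the expert labels.
        model.fine_tune(labeled)
        predictions = [model.classify(example) for example, _ in labeled]
        gold = [label for _, label in labeled]

        # 4. Stop once chance-corrected agreement is high enough.
        if cohen_kappa_score(gold, predictions) >= KAPPA_TARGET:
            break
    return model
```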

Impact:
- Dramatic drop in data needs: In experiments with the Gemini Nano-1 and Nano-2 models, equal or better alignment with human experts was achieved using 250-450 well-chosen examples instead of ~100,000 random crowdsourced labels, a reduction of three to four orders of magnitude.
- Model quality: On the most complex tasks and with the larger model, alignment improved by 55-65% over the baseline, indicating more reliable agreement with policy experts.
- Label efficiency: To achieve reliable gains with such small datasets, high label quality was consistently necessary (Cohen’s Kappa > 0.8); a quick way to check that bar is sketched below.
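
For reference, Cohen’s Kappa is defined as κ = (pₒ − pₑ) / (1 − pₑ), where pₒ is the observed agreement between two raters and pₑ is the agreement expected by chance, so κ = 1 means perfect agreement and κ = 0 means agreement no better than chance. Below is a quick, illustrative way to check the 0.8 label-quality bar with scikit-learn; the labels are made up for demonstration.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same twelve items (illustrative data only).
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0]
annotator_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # 0.83 here, above the 0.8 quality bar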

Why it matters
This approach inverts the traditional paradigm. Instead of drowning models in vast pools of noisy, redundant data, it uses the LLM’s ability to surface ambiguous cases and reserves human judgment for where expert input is most valuable. The key benefits:
- Cost reduction: Labeling far fewer examples sharply cuts annotation budgets and operational overhead.
- Faster updates: Being able to retrain models on a handful of examples makes adapting to new abuse patterns, policy changes, or domain shifts fast and feasible.
- Societal impact: Improved contextual and cultural understanding makes automated systems that handle sensitive content safer and more reliable.
In summary
Google’s new methodology enables fine-tuning LLMs for complex, evolving tasks with only hundreds (not hundreds of thousands) of high-fidelity, targeted labels, opening the door to stronger, more adaptable, and more cost-effective models.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.