
AI That Teaches Itself: Tsinghua University’s ‘Absolute Zero’ Trains LLMs With Zero External Data

LLMs have shown notable gains in reasoning capability through reinforcement learning with verifiable rewards (RLVR), which relies on outcome-based feedback rather than imitating intermediate reasoning steps. Current RLVR work faces a critical scaling challenge: it depends heavily on manually curated collections of questions and answers for training. As reasoning models advance, constructing large-scale, high-quality datasets becomes increasingly unsustainable, mirroring the data bottleneck already identified in LLM pretraining. Moreover, exclusive reliance on human-designed tasks may constrain an AI system's capacity for autonomous learning and improvement, especially as systems evolve beyond human intellectual capabilities.
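
To make the contrast concrete, here is a minimal sketch of an outcome-based verifiable reward. The function name and the "####" answer delimiter are illustrative assumptions, not taken from the paper; the point is that only the final answer is graded, never the intermediate reasoning.

```python
# Illustrative sketch of RLVR-style outcome-based feedback (hypothetical
# names; the "####" answer delimiter is an assumption, not a paper's format).

def rlvr_reward(completion: str, verified_answer: str) -> float:
    """Return 1.0 when the model's final answer matches the verified one."""
    final_answer = completion.split("####")[-1].strip()
    return 1.0 if final_answer == verified_answer else 0.0

print(rlvr_reward("step 1... step 2... #### 42", "42"))  # prints 1.0
```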

Researchers have explored various ways to enhance LLM reasoning capabilities. STaR pioneered self-bootstrapping, using expert iteration and rejection sampling of self-generated responses to improve chain-of-thought (CoT) reasoning. The o1 model scaled this idea up and achieved state-of-the-art results, and R1 later became the first open-weight model to match or exceed o1's performance by introducing the "zero" setting, in which RL is applied directly to the base LLM. In parallel, self-play methods have evolved from Schmidhuber's early two-agent formulations into far more capable systems such as AlphaGo and AlphaZero. Recent approaches including SPIN, Self-Rewarding Language Models, SPC, and SPAG apply self-play to LLM alignment and reasoning.

Researchers from Tsinghua University, the Beijing Institute for General Artificial Intelligence (BIGAI), and Pennsylvania State University have proposed an RLVR paradigm called Absolute Zero, which enables a single model to autonomously generate and solve the tasks that maximize its own learning progress, without relying on any external data. Within this paradigm, the researchers introduce the Absolute Zero Reasoner (AZR), which self-evolves both its training curriculum and its reasoning ability by using a code executor to validate proposed code reasoning tasks and verify candidate answers, providing a unified source of verifiable reward that grounds open-ended learning. AZR can be implemented effectively across model scales and remains compatible with different model classes, suggesting broad applicability.
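
To illustrate the executor-as-verifier idea, here is a minimal sketch assuming tasks are expressed as small Python programs; the task format and helper names are assumptions for illustration, not AZR's actual implementation. A proposed task is accepted only if it executes, and its execution result becomes the verified answer.

```python
# Hedged sketch of a code executor acting as a unified verifier.
# The convention that a task program defines `f` is an assumption.

def validate_task(program: str, test_input):
    """Keep a proposed task only if the program executes successfully;
    its output then becomes the verified gold answer."""
    namespace = {}
    try:
        exec(program, namespace)      # a real system would sandbox this step
        return namespace["f"](test_input)
    except Exception:
        return None                   # malformed proposals are discarded

def solve_reward(predicted_output, gold_output) -> float:
    """Outcome-based reward: 1.0 if the solver reproduces the executed result."""
    return 1.0 if predicted_output == gold_output else 0.0

# Example: a self-proposed task asks the solver to predict this program's output
program = "def f(x):\n    return sorted(x)[::-1]"
gold = validate_task(program, [3, 1, 2])   # executor yields [3, 2, 1]
print(solve_reward([3, 2, 1], gold))       # prints 1.0
```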

LLMs provide a natural framework for implementing AZR in a multi-task learning setting. During each iteration of the Absolute Zero online self-play loop, AZR proposes new reasoning tasks conditioned on the task type and on previously self-generated examples, with an explicit prompt to generate diverse tasks; it then attempts to solve those tasks and receives grounded feedback on its responses. AZR uses the code executor as a flexible, verifiable interface, enabling reasoning tasks expressed in code to be automatically constructed, executed, and checked. Finally, the AZR algorithm covers buffer initialization and management, construction of valid task proposals, solution verification, and advantage estimation via Task-Relative REINFORCE++. A structural sketch of one iteration follows.
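
The following hedged sketch, reusing `validate_task` and `solve_reward` from the earlier snippet, shows the shape of one self-play iteration: propose, validate against the executor, buffer, and solve. The learnability-based proposer reward and the paper's Task-Relative REINFORCE++ advantage estimation are deliberately omitted; this is a structural illustration under assumed interfaces, not the authors' algorithm.

```python
import random

class ToyModel:
    """Stand-in for the single LLM acting as both proposer and solver;
    a real system would generate these outputs from conditioned prompts."""
    def propose(self, task_type, references):
        return "def f(x):\n    return x * 2", 5   # (program, test input)
    def solve(self, task_type, program, test_input):
        return 10

def self_play_step(model, buffer, task_type="deduction"):
    # 1. Propose: condition on past self-generated examples for diversity.
    references = random.sample(buffer, k=min(3, len(buffer)))
    program, test_input = model.propose(task_type, references)

    # 2. Validate: the code executor filters out invalid proposals.
    gold = validate_task(program, test_input)     # from the sketch above
    if gold is None:
        return 0.0
    buffer.append((program, test_input))          # grow the task buffer

    # 3. Solve: the same model attempts its own task; reward is outcome-based.
    prediction = model.solve(task_type, program, test_input)
    return solve_reward(prediction, gold)

buffer = [("def f(x):\n    return x + 1", 3)]     # seed example
print(self_play_step(ToyModel(), buffer))         # prints 1.0
```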

Absolute Zero Reasoner-Coder-7B achieves state-of-the-art performance among 7B models on combined math and coding averages, surpassing the previous best model by 1.8 absolute percentage points, even though the benchmarks are entirely out of distribution for both its math and code reasoning training. It also outperforms models trained on expert-curated human data by 0.3 absolute percentage points in coding, despite never having access to such data. Scaling analysis shows that AZR delivers larger gains on larger models: the 7B and 14B models continue to improve past 200 training steps, while the 3B model plateaus. Out-of-distribution performance gains likewise grow with model size: +5.7, +10.2, and +13.2 percentage points for 3B, 7B, and 14B, respectively.

In conclusion, the researchers introduced the Absolute Zero paradigm to address the data limitations of existing RLVR frameworks, along with AZR, which trains models to propose and solve code-grounded reasoning tasks. A remaining limitation concerns safety management in self-improving systems: the team observed several concerning chains of thought from the Llama-3.1-8B variant, which they term "uh-oh moments." The results indicate that while the Absolute Zero paradigm reduces the need for human intervention in task curation, ongoing oversight is still necessary to address lingering safety concerns, highlighting a critical direction for future research.


Check out the Paper, Model on Hugging Face, and GitHub Page. Also, don't forget to follow us on Twitter.



Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores practical applications of AI with a focus on understanding the real-world impact of AI technologies. He aims to explain complex AI concepts in a clear and accessible way.


