A new, challenging AGI test stumps most AI models

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.
So far, the new test, called ARC-AGI-2, has stumped most models.
Reasoning models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.
The ARC-AGI tests consist of puzzle-like problems in which an AI must identify visual patterns in grids of differently colored squares and generate the correct "answer" grid. The problems are designed to force an AI to adapt to novel problems it hasn't seen before.
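To make the puzzle format concrete, here is a minimal, hypothetical Python sketch of how an ARC-style task might be represented: each grid is a 2D list of integer color codes, and a task bundles a few demonstration input/output pairs plus a test input whose answer grid the solver must produce. The `task` structure and `solve` rule below are illustrative assumptions, not the Arc Prize Foundation's actual data format or any real ARC-AGI-2 puzzle.

```python
# Hypothetical ARC-style task: grids are 2D lists of integer color codes.
# The demonstration pairs imply a rule the solver must infer and then
# apply to the test input.
task = {
    "train": [
        # Illustrative rule: every colored (non-zero) cell is recolored to 2.
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[3, 0], [0, 3]], "output": [[2, 0], [0, 2]]},
    ],
    "test": {"input": [[0, 5], [5, 5]]},
}

def solve(grid):
    """Apply the inferred rule: recolor every non-zero cell to 2."""
    return [[2 if cell else 0 for cell in row] for row in grid]

# A solver is scored on whether its answer grid matches the target exactly.
answer = solve(task["test"]["input"])
print(answer)  # [[0, 2], [2, 2]]
```

The point of the format is that the rule differs for every task, so a solver cannot simply memorize answers; it must infer each transformation from a handful of examples.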
The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, "panels" of these people answered 60% of the test's questions correctly – far better than any of the models.
In a post on X, Chollet claimed that ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests are designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on.
Chollet said that, unlike ARC-AGI-1, the new test prevents AI models from relying on "brute force" – extensive computing power – to find solutions. Chollet previously acknowledged that this was a major flaw of ARC-AGI-1.
To address the first test's flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.
"Intelligence is not solely defined by the ability to solve problems or achieve high scores," wrote Greg Kamradt, co-founder of the Arc Prize Foundation, in a blog post. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, 'Can AI acquire [the] skill to solve a task?' but also, 'At what efficiency or cost?'"
ARC-AGI-1 went unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model o3, which outperformed all other AI models and matched human performance. However, as we noted at the time, o3's performance on ARC-AGI-1 came at a steep price.
The version of o3 that was first to reach new heights on ARC-AGI-1 – o3 (low) – scored 75.7% on that test, but got just 4% on ARC-AGI-2 using $200 worth of computing power per task.
The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.
Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to achieve 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
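The efficiency framing can be illustrated with the article's own numbers. Below is a rough Python sketch (not the foundation's official formula – the `efficiency` function is a hypothetical figure of merit) comparing o3 (low)'s reported ARC-AGI-2 result against the Arc Prize 2025 contest target on both axes, score and cost per task.

```python
def efficiency(score_pct, cost_per_task_usd):
    """Hypothetical figure of merit: accuracy points per dollar spent per task."""
    return score_pct / cost_per_task_usd

# Figures reported in the article:
o3_low = efficiency(4.0, 200.00)         # o3 (low): 4% at $200/task
contest_target = efficiency(85.0, 0.42)  # contest goal: 85% at $0.42/task

print(f"o3 (low): {o3_low:.2f} points/$")               # 0.02 points/$
print(f"contest target: {contest_target:.2f} points/$")  # 202.38 points/$
```

By this rough measure, the contest asks for roughly four orders of magnitude better cost-efficiency than o3 (low) demonstrated, which shows why efficiency, not raw score alone, is the new test's emphasis.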
2025-03-25 00:29:00