This AI Paper Introduces the Kolmogorov-Test: A Compression-as-Intelligence Benchmark for Evaluating Code-Generating Language Models

Compression is a cornerstone of machine intelligence, deeply rooted in Kolmogorov complexity theory, which defines the shortest program that reproduces a given sequence. Unlike traditional compression methods that search for repetition and redundancy, the Kolmogorov framework casts compression as the problem of discovering structured patterns through programs. While the theory promises optimal compression, its uncomputability is a major obstacle. However, the emergence of large language models capable of generating code opens an intriguing opportunity: testing how far modern systems can approach this theoretical ideal by reasoning through code rather than matching patterns.
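To make this concrete, here is a minimal, hypothetical Python illustration (not from the paper): a kilobyte of patterned data can be replaced by a thirty-character program that regenerates it.

```python
# Illustrative sketch: Kolmogorov-style compression replaces raw data
# with a short program that regenerates it exactly.
data = bytes(range(256)) * 4                  # 1,024 bytes of patterned data
program = "output = bytes(range(256)) * 4"    # a far shorter description

print(len(data))     # 1024
print(len(program))  # 30
```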

A fundamental problem is that current tools struggle to compress data sequences with short, executable code. Models often echo the input instead of producing programs that reproduce it, revealing a gap in genuine pattern understanding. This becomes especially evident with real-world audio, text, or DNA sequences, where complex logical structures must be uncovered to achieve effective compression. The core challenge is to ensure that the generated program reproduces the sequence exactly while using a minimal, well-reasoned set of instructions. Moreover, although synthetic training data is useful for controlled evaluation, it often fails to support strong generalization to natural data, which is essential for practical applications.
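A hedged sketch of these two requirements, assuming generated programs place their result in an `output` variable (an illustrative convention, not the paper's actual interface):

```python
# Hypothetical checker for the two requirements above: the program must
# reproduce the target sequence exactly, and it must be shorter than the
# data it describes, so merely echoing the input does not count.

def is_valid_compression(program: str, target: bytes) -> bool:
    namespace = {}
    try:
        exec(program, namespace)   # untrusted code: sandbox this in practice
    except Exception:
        return False               # crashing programs are rejected outright
    reproduced = namespace.get("output") == target
    shorter = len(program.encode()) < len(target)
    return reproduced and shorter

target = bytes(range(100))
print(is_valid_compression("output = bytes(range(100))", target))  # True
print(is_valid_compression(f"output = {target!r}", target))        # False: echoing is too long
```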

Many compression tools exist, ranging from traditional algorithms such as GZIP to newer neural compression systems. GZIP remains a strong baseline, especially for long or repetitive sequences, thanks to its effective encoding of statistical regularity. More recently, language-modeling approaches have been combined with arithmetic coding, using next-token probabilities to compress the input data. However, these methods typically require access to the full model weights at decoding time, which limits their efficiency and applicability. Code-generating models such as GPT-4 and Llama have also been evaluated in prompted settings, asked to write Python programs that reproduce an input sequence. Yet they often produce long, inaccurate code with limited success, especially when facing unseen or complex sequences.
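For reference, a GZIP-style baseline is easy to reproduce with Python's standard zlib module; the paper's exact compression settings are not specified here, so this is only a rough stand-in.

```python
import zlib

def deflate_size(data: bytes) -> int:
    """Bytes needed to store `data` at zlib's strongest compression level."""
    return len(zlib.compress(data, level=9))

repetitive = b"ACGT" * 1000   # highly regular data compresses very well
print(len(repetitive), "->", deflate_size(repetitive))   # e.g. 4000 -> ~30
```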

Researchers from Meta AI and Tel Aviv University introduced the Kolmogorov-Test (KT), a benchmark for assessing the reasoning abilities of code-generating language models. The test evaluates a model's ability to produce the shortest program that outputs a given input sequence. Unlike typical benchmarks, KT emphasizes logical composition and program generation over predictive text modeling. The sequences span natural data from audio (LibriSpeech), text (Wikipedia enwik9), and DNA (GRCh38), as well as synthetic sequences generated by a custom-designed domain-specific language (DSL). This DSL supports the construction of structured sequences by composing operations such as range creation, sequence manipulation, merging, and filtering.
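A toy illustration of what such a DSL could look like, with made-up operator names rather than the paper's actual primitives:

```python
# Hypothetical mini-DSL in the spirit described above: each operator maps
# sequences to sequences, and a program is a short composition of operators.

def make_range(start, stop, step=1):
    return list(range(start, stop, step))

def concat(a, b):
    return a + b

def interleave(a, b):                 # one possible "merge" operation
    return [x for pair in zip(a, b) for x in pair]

def keep_if(seq, pred):               # a "filter" operation
    return [x for x in seq if pred(x)]

# A structured sequence with a short DSL description:
seq = concat(make_range(0, 5), keep_if(make_range(0, 30), lambda x: x % 7 == 0))
print(seq)   # [0, 1, 2, 3, 4, 0, 7, 14, 21, 28]
```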

The researchers developed an automated framework that generates millions of synthetic program-sequence pairs using this DSL. These pairs were then used to train and evaluate models, including both large pre-trained models and specially trained ones such as SeqCoder. Performance was measured with two metrics: accuracy, whether the generated program reproduces the sequence, and precision, how compact the correct program is compared to GZIP compression. The test covers sequences of varying length, with synthetic sequences averaging 76 bytes and real-world sequences 128 bytes.
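A sketch of how such scoring could be implemented, reusing the `output` convention from the earlier checker; the accuracy and precision definitions follow the article's description, and all names are illustrative rather than the paper's code.

```python
import zlib

def score_candidate(program: str, target: bytes) -> tuple[bool, float]:
    """Accuracy: does the program reproduce `target` exactly?
    Ratio: program size relative to the GZIP baseline (lower is better)."""
    namespace = {}
    try:
        exec(program, namespace)          # sandbox untrusted code in real use
    except Exception:
        return False, float("inf")
    accurate = namespace.get("output") == target
    ratio = len(program.encode()) / len(zlib.compress(target, level=9))
    return accurate, ratio

target = bytes(range(100))
print(score_candidate("output = bytes(range(100))", target))
```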

The results showed that even the strongest models struggle. GPT-4 achieved 69.5% accuracy on high-quality audio but dropped to 36.4% on 8-bit audio and 50.3% on DNA data. Llama-3.1-405B performed worse, with accuracy as low as 3.9% on audio and only 24.8% on DNA. On synthetic data, SeqCoder-8B reached 92.5% accuracy with a precision score of 0.56, outperforming traditional tools such as GZIP. On real-world data, however, its accuracy remained near zero. This discrepancy illustrates the difficulty of transferring success from synthetic benchmarks to more diverse and noisy sequences, highlighting the limitations of current training regimes and the need for new strategies.

Overall, this research clearly frames compression as a problem of code generation. The KT benchmark provides a rigorous and diverse test of reasoning and structure discovery, exposing the stark gap between synthetic training environments and real-world applications. The methodology and benchmark set a high bar for future models that aim to unify reasoning with compression, but substantial innovation is still needed to meet this challenge.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 85k+ ML SubReddit.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
