Evaluating frontier AI R&D capabilities of language model agents against human experts
View a PDF of the paper titled Evaluating frontier AI R&D capabilities of language model agents against human experts, by Hjalmar Wijk and 22 other authors
Abstract: Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, existing evaluations of AI R&D capabilities are few, and none offer a highly realistic setting with a direct comparison to human performance. We present RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We validate that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent score given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics, for example writing a faster custom kernel than any of our human experts, and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source our evaluation environments, human expert data, analysis code, and agent trajectories to facilitate future research.
Submission history
From: Nayef Barrick
[v1] Fri, 22 Nov 2024 18:30:46 UTC (18,094 KB)
[v2] Tue, 27 May 2025 03:32:23 UTC (19,789 KB)



