A Chinese firm has just launched a constantly changing set of AI benchmarks

Development of the benchmark began at HongShan in 2022, soon after ChatGPT's breakout success, as an internal tool for evaluating which models were worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the effort, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, the team decided to release it to the public.
Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model's aptitude on various subjects. The other is more like a technical job interview, assessing how much real-world economic value a model might deliver.
Xbench's tools for gauging raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA is not a radical departure from existing graduate-level STEM benchmarks such as GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, written by graduate students and vetted by professors. Scoring rewards not only the correct answer but also the chain of reasoning that leads to it.
DeepResearch, by contrast, focuses on a model's ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature, questions that cannot simply be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model's willingness to admit when there is not enough data. One question from the published set: "How many Chinese cities in the three northwestern provinces border a foreign country?" (The answer is 12, and only 33% of the tested models got it right, if you are wondering.)
On the company's website, the researchers say they want to add more dimensions to the testing: for example, how creative a model is in solving problems, how collaborative it is when working with other models, and how reliable it is.
The team has committed to updating the test questions once a quarter and to maintaining a data set that is half public and half private.
To assess real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruiting and marketing. For example, one task asks a model to source five qualified battery-engineer candidates and justify each pick. Another asks it to match advertisers with suitable video creators from a pool of more than 800 influencers.
The website also teases upcoming categories, including finance, accounting, and design. The question sets for those categories have not yet been open-sourced.
ChatGPT o3 ranks first in both of the current professional categories. For recruiting, Perplexity and Claude 3.5 Sonnet ranked second and third, respectively. For marketing, Claude, Grok, and Gemini all perform well.
"It is really hard for benchmarks to include hard things," said Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at New York University. "But Xbench represents a promising start."