FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction


[Submitted on 16 Aug 2025 (v1), last revised 5 Sep 2025 (this version, v3)]

Authors: Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren, Zhenwei Zhu, Tianle Cai, Zhui Chen, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufute Wen, GE Zhang, Kaiyuan Zhang, XIN ZHO, Mengdi Wang, Wenhao Huw

View a PDF of the paper titled FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction, by Zhiyuan Zeng and 30 other authors


Abstract: Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields such as politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely because of the challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning capabilities, search capabilities, and integration of external tools, such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. In addition, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including vulnerability to fake web pages and temporal validity. Our goal is to establish a dynamic evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.