
Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark for Healthcare AI Agents

A team of Stanford University researchers has released MedAgentBench, a new benchmark suite designed to evaluate the agentic capabilities of large language models (LLMs) in healthcare contexts. Unlike previous question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment in which AI systems must interact with tools, plan, and carry out multi-step clinical tasks. This represents a major shift from static reasoning tests toward evaluating agentic capabilities in live, tool-driven medical workflows.

https://ai.nejm.org/doi/full/10.1056/aidbp2500144

Why do we need an agent benchmark in healthcare?

Recent LLMs have moved beyond static text-based responses toward agentic behavior: decomposing high-level instructions, calling application programming interfaces (APIs), integrating patient data, and automating complex operations. In medicine, this evolution could help address staffing shortages, documentation burden, and administrative inefficiency.

While general-purpose agent benchmarks exist (for example, AgentBench, AgentBoard, Tau-Bench), healthcare has lacked a unified benchmark that captures the complexity of medical data, FHIR interoperability standards, and longitudinal patient records. MedAgentBench fills this gap by providing a reproducible, clinically grounded evaluation framework.

What does MedAgentBench contain?

How are the tasks organized?

MedAgentBench consists of 300 tasks across 10 categories, all written by licensed physicians. The tasks cover patient information retrieval, laboratory results, documentation, test ordering, referrals, and medication management. Tasks average 2-3 steps and mirror workflows encountered in inpatient and outpatient care.
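
To make the task format concrete, here is a purely illustrative sketch of how one such task might be represented. The field names and values are assumptions for illustration, not taken from the paper.

```python
# Hypothetical representation of a MedAgentBench-style task record.
# All field names and values are illustrative assumptions.
task = {
    "category": "medication_management",   # one of the 10 task categories
    "instruction": (
        "Order a one-time dose of acetaminophen 650 mg PO "
        "for the patient and document it in the chart."
    ),
    "patient_id": "example-patient-001",   # links to one of the 100 profiles
    "expected_steps": 3,                   # tasks average 2-3 steps
}
```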

What patient data underpins the benchmark?

The benchmark draws on 100 realistic patient profiles extracted from Stanford's data warehouse, comprising more than 700,000 records spanning labs, vital signs, diagnoses, procedures, and medication orders. The data were de-identified to protect privacy while maintaining clinical validity.

How is the environment built?

The environment is FHIR-compliant, supporting both read (GET) and write (POST) operations on EHR data. AI systems can simulate real clinical interactions such as documenting vital signs or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
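
To illustrate what FHIR-style read and write interactions look like, here is a minimal Python sketch using the requests library. The base URL, patient ID, and resource contents are hypothetical placeholders, not the benchmark's actual endpoints; only the GET/POST pattern reflects what the article describes.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical local FHIR endpoint

# Read (GET): fetch the most recent glucose lab result for a patient.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={
        "patient": "example-patient-001",  # hypothetical patient ID
        "code": "2345-7",                  # LOINC code for glucose
        "_sort": "-date",
        "_count": 1,
    },
)
bundle = resp.json()

# Write (POST): document a vital sign (heart rate) as a new Observation.
new_obs = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                         "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example-patient-001"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
resp = requests.post(f"{FHIR_BASE}/Observation", json=new_obs)
print(resp.status_code)  # a FHIR server typically returns 201 Created
```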

How are the models evaluated?

  • Metric: success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.
  • Models tested: 12 LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.
  • Orchestrator agent: a basic scaffold configured with nine FHIR functions, limited to eight interaction rounds per task (see the sketch after this list).
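
The round-limited orchestration pattern can be sketched as a simple loop. The helper names below (call_llm, execute_fhir_function) are hypothetical stand-ins for a model client and a FHIR tool executor; only the eight-round cap mirrors the setup the article describes.

```python
MAX_ROUNDS = 8  # the benchmark caps each task at eight interaction rounds

def run_task(task_prompt, fhir_functions, call_llm, execute_fhir_function):
    """Drive one task: alternate model turns and tool calls until the model
    answers or the round budget runs out. Helper signatures are assumed."""
    history = [{"role": "user", "content": task_prompt}]
    for _ in range(MAX_ROUNDS):
        reply = call_llm(history, tools=fhir_functions)
        if reply.get("type") == "final_answer":
            return reply["content"]  # model declares the task complete
        # Otherwise the model requested a tool call (e.g., a FHIR GET/POST).
        result = execute_fhir_function(reply["name"], reply["arguments"])
        history.append({"role": "tool", "content": result})
    return None  # round budget exhausted; the task counts as a failure
```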

Which models performed best?

  • Claude 3.5 Sonnet v2: best overall at 69.67% success, and especially strong on retrieval tasks (85.33%).
  • GPT-4o: 64.0% success, showing balanced retrieval and action performance.
  • DeepSeek-V3: 62.67% success, the leader among open-weight models.
  • Note: most models excelled at query tasks but struggled with action-based tasks that require safe, multi-step execution.

What mistakes did the models make?

Two dominant failure modes emerged:

  1. Instruction adherence – API calls that were invalid or incorrectly formatted.
  2. Output mismatch – returning full sentences when numeric values were required.

These errors highlight gaps in precision and reliability, both of which are crucial for clinical deployment.
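
To see why strict pass@1 scoring penalizes the second failure mode, consider a minimal, assumed grading check: a verbose sentence fails where a bare numeric answer passes. The matching rule here is an illustration, not the benchmark's actual grader.

```python
def matches_expected_number(model_output: str, expected: float) -> bool:
    """Strict check: the output must parse as exactly the expected number."""
    try:
        return float(model_output.strip()) == expected
    except ValueError:
        return False  # full sentences or extra text do not parse as a number

assert matches_expected_number("98", 98.0)
assert not matches_expected_number("The patient's glucose is 98 mg/dL.", 98.0)
```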

Summary

MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 physician-written tasks with a FHIR-compliant environment and 100 patient profiles. The results show strong but limited reliability, with Claude 3.5 Sonnet v2 leading at 69.67%, which highlights the gap between query success and safe action execution. Although restricted to single-institution data and an EHR focus, MedAgentBench offers an open, reproducible framework to guide the next generation of reliable healthcare agents.


Check out the paper and technical blog. Feel free to visit our GitHub page for tutorials, code, and notebooks. Also, follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
