
“Dr. Google” had its issues. Can ChatGPT Health do better?

Some doctors see LLMs as a boon to medical literacy. The average patient may have difficulty navigating the vast landscape of online medical information—in particular, distinguishing high-quality sources from polished but factually questionable sites—but LLMs can do that task for them, at least in theory. Treating patients who have Googled their symptoms "requires a lot of attacking the patient's anxiety [and] reducing misinformation," says Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist. But now, he says, "you see patients with college education, high school education, asking questions on the level of something an early medical student might ask."

The release of ChatGPT Health, and Anthropic's subsequent announcement of new health integrations for Claude, suggest that AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. Such uses certainly come with risks, given LLMs' well-documented tendencies to agree with users and to fabricate information rather than admit ignorance.

But these risks must also be weighed against the potential benefits. There's an analogy here to self-driving vehicles: when policymakers consider whether to allow Waymo in their city, the key measure is not whether its cars are involved in accidents at all, but whether they cause less damage than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google — and early evidence suggests it may be — it will likely reduce the enormous burden of medical misinformation and unnecessary health anxiety that the Internet has created.

However, determining how effective chatbots like ChatGPT or Claude are for consumer health is difficult. "An open-ended chatbot is very difficult to evaluate," says Danielle Bitterman, clinical lead for data science and artificial intelligence at the Mass General Brigham health system. Large language models score well on medical licensing exams, but those exams use multiple-choice questions that don't reflect how people actually use chatbots to look up medical information.

Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, tried to fill this gap by evaluating how GPT-4o responded to licensing-exam questions when it couldn't see the list of possible answers. Medical experts who reviewed the responses rated only about half of them as completely correct. But licensing-exam questions are deliberately written to be tricky in ways that the answer choices help resolve, and they remain a distant approximation of the kinds of things a user might type into ChatGPT.

A different study, which tested GPT-4o on more realistic prompts written by human volunteers, found that it answered medical questions correctly in about 85% of cases. When I spoke with Amulya Yadav, an associate professor at Penn State who runs the Responsible AI for Social Liberation Lab and led the study, he made it clear that he's not personally a fan of patient-facing medical LLMs. But he readily admits that, technically, they seem up to the task, noting that human doctors misdiagnose their patients 10% to 15% of the time. "If I look at it abstractly, it seems to me that the world is going to change, whether I like it or not," he says.

For people searching for medical information online, an LLM seems to be a better option than Google, Yadav says. Succi, the radiologist, reached a similar conclusion: LLMs could be a better alternative to web searches, he found, when he compared GPT-4's responses to questions about common chronic medical conditions with the information provided in Google's knowledge panel, the information box that sometimes appears to the right of search results.

Since the Yadav and Succi studies appeared online, in the first half of 2025, OpenAI has released multiple new versions of GPT, and it is reasonable to expect that GPT-5.2 will perform better than its predecessors. But the studies have important limitations: they focus on direct, factual questions, and they examine only short interactions between users and chatbots or web search tools. Some of the weaknesses of LLMs – most notably their sycophancy and their tendency to hallucinate – may be more likely to rear their heads in longer conversations and with people dealing with more complex problems. Riva Lederman, a professor at the University of Melbourne who studies technology and health, suggests that patients who don't like the diagnosis or treatment recommendations they receive from a doctor may seek a second opinion from an LLM – and if the LLM is sycophantic, it may encourage them to reject their doctor's advice.

Some studies have found that LLMs do hallucinate and display sycophancy in response to health-related prompts. In one study, GPT-4 and GPT-4o would readily accept and act on incorrect drug information included in a user's question. In another, GPT-4o frequently fabricated definitions for fake syndromes and laboratory tests mentioned in a user's message. Given the abundance of medically dubious diagnoses and treatments circulating online, these patterns of behavior could contribute to the spread of medical misinformation, especially if people perceive LLMs as trustworthy.

OpenAI has reported that the GPT-5 series of models is significantly less sycophantic and less prone to hallucination than its predecessors, so the results of these studies may not apply to ChatGPT Health. The company also evaluated the model underlying ChatGPT Health on its responses to health-specific prompts, using the publicly available HealthBench benchmark. HealthBench rewards models that express uncertainty when appropriate, recommend that users seek medical attention when necessary, and refrain from causing undue alarm by telling users their condition is more serious than it actually is. It's reasonable to assume that the ChatGPT Health base model exhibited those behaviors in testing, although Bitterman points out that some of the prompts in HealthBench were generated by LLMs rather than written by real users, which may limit how well the benchmark translates to the real world.
