AI Fails at Primary Patient Diagnosis More Than 80% of the Time, Study Finds

by Editor

AI language models fail to produce an appropriate early diagnosis more than 80% of the time, suggesting they are not yet safe for unsupervised clinical use, according to a new study.

Generative artificial intelligence (AI) still lacks the reasoning processes needed for safe clinical use, a new study has found.

AI chatbots have improved their diagnostic accuracy when presented with comprehensive clinical information, but still failed to produce an appropriate differential diagnosis more than 80% of the time, according to researchers at Mass General Brigham, a Boston-based non-profit hospital and research network and one of the largest health systems in the United States.

The study, published in the open-access medical journal JAMA Network Open, found that large language models (LLMs) fall short of the reasoning required for clinical use.

“Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment,” said Marc Succi, co-author of the study.

He added that AI cannot yet replicate differential diagnosis, which is central to clinical reasoning, and which he considers the “art of medicine”.

Differential diagnosis is the first step for healthcare professionals to identify a condition, separating it from others with similar symptoms.

How the models were tested

The research team analysed the functioning of 21 LLMs, including the latest available versions of Claude, DeepSeek, Gemini, GPT and Grok.

They evaluated the LLMs on 29 standardised clinical vignettes using a newly developed tool called PrIME-LLM.

The tool assesses a model’s ability across different stages of clinical reasoning: conducting an initial diagnosis, ordering appropriate tests, arriving at a final diagnosis, and planning treatment.

To simulate how clinical cases unfold, the researchers gradually fed the models information, beginning with basics such as a patient’s age, sex and symptoms, before adding physical examination findings and laboratory results.

In a real-world clinical setting, an accurate differential diagnosis is essential before advancing to the next step. In the study, however, the models were given the additional information regardless, so that they could proceed to the next stage even if they failed at the differential diagnosis step.
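The staged evaluation described above can be sketched as a simple loop that feeds a model cumulative case information and records its answer at each step. This is a minimal, hypothetical illustration only: the stage names, example vignette text, and `ask_model` callable are placeholders, not the study's actual PrIME-LLM tool or data.

```python
def evaluate_vignette(ask_model, stages):
    """Feed a model cumulative case information, one stage at a time,
    and record its answer at each step (a sketch of staged evaluation)."""
    context = []   # information revealed so far
    answers = []   # (stage_name, model_answer) pairs
    for stage_name, stage_text in stages:
        context.append(stage_text)
        prompt = "\n".join(context) + f"\nTask: provide your {stage_name}."
        answers.append((stage_name, ask_model(prompt)))
    return answers

# Usage with a stand-in "model" that simply echoes the task line:
stages = [
    ("differential diagnosis", "Patient: 54-year-old male with chest pain."),
    ("test ordering", "Physical exam: blood pressure 150/95, clear lungs."),
    ("final diagnosis", "Labs: elevated troponin."),
]
result = evaluate_vignette(lambda p: p.splitlines()[-1], stages)
```

In this shape, a model's early answers are made with deliberately incomplete context, which is the condition under which the study found the models weakest.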

The researchers found that the language models achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty.

Study author Arya Rao noted that by evaluating LLMs in a stepwise fashion, research moves past treating them like test-takers and puts them in a doctor’s position.

“These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information,” she added.

The researchers found that all of the models failed to produce an appropriate differential diagnosis more than 80% of the time.

On final diagnosis, success rates ranged from around 60% to over 90% depending on the model.

Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text.

The results identified a top-performing cluster that included Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, Gemini 3.0 Flash and Gemini 3.0 Pro.

Medical professionals are still key

However, the authors noted that despite improvements across versions and the advantages of reasoning-optimised models, off-the-shelf LLMs have not yet reached the level required for safe deployment and remain limited in advanced clinical reasoning.

“Our results reinforce that large language models in healthcare continue to require a ‘human in the loop’ and very close oversight,” Succi noted.

Susana Manso García, a member of the Artificial Intelligence and Digital Health working group of the Spanish Society of Family and Community Medicine, who was not involved in the study, said the findings carry a clear message for the public.

“The study itself insists they [language models] should not be used to make clinical decisions without supervision. Therefore, whilst artificial intelligence represents a promising tool, human clinical judgement remains indispensable,” she said.

“The recommendation for the public is to use these technologies with caution and, when faced with any health concern, always consult a healthcare professional.”


Originally written by: Marta Iraola Iribarren

Source: Euro News

Published on: 14 April 2026

Link to original article: AI fails at primary patient diagnosis more than 80% of the time, study finds
