Microsoft Claims AI Diagnostic Tool Outperforms Human Doctors

Thu Jul 03 2025

Key points

  • AI “orchestrator” uses five virtual agents debating diagnoses
  • Solved 85.5pc of complex cases versus 20pc for doctors
  • Technology could be integrated into Microsoft’s Copilot chatbot and Bing
  • Not peer-reviewed or clinically approved yet

ISLAMABAD: Microsoft has developed an artificial intelligence-powered medical tool that it claims is four times more successful than human doctors at diagnosing complex conditions, and has unveiled research that it believes could accelerate treatment.

According to the Financial Times, the “Microsoft AI Diagnostic Orchestrator” (MAI-DxO) is the first project from an AI health unit established last year by Mustafa Suleyman, who recruited staff from DeepMind, the Google-owned research lab he co-founded.

In an interview, Suleyman, chief executive of Microsoft AI, described the trial as a step towards “medical superintelligence” that could help ease staffing shortages and cut long waiting times in overburdened healthcare systems.

The new system works through an “orchestrator” that assembles a virtual panel of five AI agents acting as “doctors”. Each agent has a specific role, such as generating hypotheses or selecting diagnostic tests, and together they interact and “debate” to decide on the best course of action.
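The panel-and-debate design described above can be illustrated with a toy sketch. It is not Microsoft’s implementation, whose internals the report does not describe. Everything beyond “five agents with roles such as generating hypotheses or selecting tests” is a hypothetical assumption: the three extra role names, the simple majority vote standing in for a “debate”, the `run_case` loop, and the rule-based stub agents used in place of large language models.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: an "agent" reads the case notes gathered so far and
# proposes the next action, either a test to order or "diagnose" to stop.
Agent = Callable[[list[str]], str]


@dataclass
class Panel:
    agents: dict[str, Agent]  # role name -> agent (role names are illustrative)

    def debate(self, notes: list[str]) -> tuple[str, dict[str, str]]:
        """One round: every role proposes an action; the majority proposal wins."""
        proposals = {role: agent(notes) for role, agent in self.agents.items()}
        # Counter.most_common breaks ties by first-encountered order.
        action, _ = Counter(proposals.values()).most_common(1)[0]
        return action, proposals


def run_case(panel: Panel, lab: dict[str, str], max_rounds: int = 5) -> list[str]:
    """Debate, order the winning test, and repeat until the panel votes to
    diagnose or max_rounds is reached. Returns the accumulated case notes."""
    notes = ["presenting complaint"]
    for _ in range(max_rounds):
        action, _ = panel.debate(notes)
        if action == "diagnose":
            break
        # Stand-in for a real test result; unknown tests give a placeholder.
        notes.append(lab.get(action, f"{action}: no result"))
    return notes


# Toy rule-based stand-ins for the language-model agents (hypothetical).
def positive(notes: list[str]) -> bool:
    return "biopsy: positive" in notes


demo_panel = Panel({
    "hypothesis": lambda n: "diagnose" if positive(n) else "order biopsy",
    "test_chooser": lambda n: "diagnose" if positive(n) else "order biopsy",
    "challenger": lambda n: "order CT scan",  # always argues for one more test
    "cost_steward": lambda n: "diagnose" if len(n) > 1 else "order biopsy",
    "checklist": lambda n: "diagnose" if positive(n) else "order CT scan",
})
```

With `run_case(demo_panel, {"order biopsy": "biopsy: positive"})`, the panel outvotes the challenger, orders one biopsy, then votes to diagnose, returning `["presenting complaint", "biopsy: positive"]`. A majority vote is the simplest possible aggregation; a real system would presumably let agents respond to one another’s arguments over several exchanges.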

Analysing case studies

To assess its capabilities, MAI-DxO was tested on 304 case studies from the New England Journal of Medicine (NEJM), which detail how doctors diagnosed some of the most challenging cases.

This allowed researchers to evaluate whether the programme could reach the correct diagnosis and explain its reasoning through a novel “chain of debate” technique, which prompts the AI models to give a step-by-step account of how they solved each problem.

Microsoft used leading large language models from OpenAI, Meta, Anthropic, Google, xAI and DeepSeek. The orchestrator improved the performance of all these models but was most effective when paired with OpenAI’s o3 reasoning model, which correctly solved 85.5 per cent of the NEJM cases.

Experienced human doctors

By comparison, experienced human doctors achieved about 20 per cent accuracy, although during the trial they were not allowed to consult textbooks or colleagues, which might have improved their results.

A version of this technology could soon be integrated into Microsoft’s Copilot AI chatbot and Bing search engine, which collectively handle 50 million health-related queries daily.

Suleyman stated that Microsoft is approaching “AI models that are not just slightly better, but dramatically superior to human performance: faster, cheaper, and four times more accurate”.

“That is going to be truly transformative,” he added.

Unravelling biological mysteries

Suleyman’s initiative follows DeepMind’s pioneering AI breakthroughs in health and biology. Demis Hassabis, head of Google DeepMind, shared last year’s Nobel Prize in Chemistry for using AI to predict the structures of proteins, which are essential to life.

Microsoft has invested nearly $14 billion in OpenAI and holds exclusive rights to use and market its technology. However, tensions persist between the two companies as OpenAI seeks to convert to a for-profit structure, prompting disputes over the future terms of their partnership.

While OpenAI’s model delivered the best results, Suleyman emphasised that Microsoft remains “agnostic” about which of the four “world-class models” MAI-DxO employs.

“We have long believed these will become commodities… it’s the combined orchestrator that sets us apart,” he explained.

New entry point

Dominic King, who formerly led DeepMind’s health unit and joined Microsoft last year, said the programme had “performed better than anything we’ve ever seen” and presents an “opportunity to serve as a new entry point into healthcare”.

The AI models were also instructed to take costs into account, which significantly reduced the number of diagnostic tests needed to reach a correct diagnosis during the trials and, in some instances, saved hundreds of thousands of dollars.

However, King cautioned that the technology remains in early stages, has yet to undergo peer review, and is not ready for clinical deployment.

“This is a landmark study,” said Eric Topol, cardiologist and founder of the Scripps Research Translational Institute. “Although not tested in real-world medical practice, it is the first to demonstrate generative AI’s potential for efficiency in medicine — in both accuracy and cost savings.” 
