HGM Advisory

January 2026

AMBOSS Lisa 1.0 tops the NOHARM AI-CDS benchmark study

Thomas Hagemeijer

Founder & CEO, HGM Advisory

Key takeaway

AMBOSS Lisa 1.0 achieved the highest overall score in the NOHARM benchmark for AI clinical decision support (AI-CDS), outperforming systems from Google, Glass Health, OpenAI, and Alibaba, as well as licensed physicians, on safety, accuracy, and hallucination resistance across 100+ clinical scenarios.

Lisa 1.0 by German company AMBOSS achieved the highest score in the NOHARM study, outperforming Google, Glass Health, OpenAI, Alibaba, and human physicians. The study tests how safely AI systems make medical decisions across 100+ clinical cases.

What is the NOHARM benchmark?

NOHARM (National Online Health Assessment for Responsible Medicine) evaluates AI systems that provide clinical decision support. It tests whether AI can take a patient presentation and produce differential diagnoses, recommend workups, and suggest evidence-based treatments — while measuring hallucination rates and harmful recommendations. The benchmark comprises over 100 clinical vignettes spanning internal medicine, emergency medicine, pediatrics, surgery, and psychiatry. Each case is scored on diagnostic accuracy, treatment appropriateness, safety, completeness, and hallucination rate.

Results: Lisa 1.0 leads the field

AMBOSS Lisa 1.0 achieved the highest composite score of 82.4/100, with a safety score of 91.2 and the lowest hallucination rate of 2.1%. The physician control group scored 76.5 on average, meaning Lisa outperformed the average licensed physician by 5.9 points.

Rank | System | Composite Score (out of 100) | Safety Score | Hallucination Rate
1 | AMBOSS Lisa 1.0 | 82.4 | 91.2 | 2.1%
2 | Google Med-PaLM 2 | 79.1 | 85.7 | 4.2%
3 | Glass Health AI | 76.8 | 83.4 | 5.1%
4 | Human Physicians (avg) | 76.5 | 82.1 | N/A
5 | OpenAI GPT-4 Medical | 74.2 | 78.6 | 7.3%
6 | Alibaba Qwen-Med | 72.8 | 76.3 | 8.1%
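
For readers who want to check the margins themselves, the comparison against the physician baseline can be recomputed from the table in a few lines of Python. This is only an illustrative sketch: the figures are transcribed from the results table above, and the names `RESULTS`, `margins_vs_baseline`, and `lowest_hallucination` are made up for this example and are not part of the NOHARM study or any AMBOSS tooling.

```python
# Rows transcribed from the results table:
# (system, composite score, safety score, hallucination rate in %).
# The hallucination rate is None where the table reports N/A.
RESULTS = [
    ("AMBOSS Lisa 1.0", 82.4, 91.2, 2.1),
    ("Google Med-PaLM 2", 79.1, 85.7, 4.2),
    ("Glass Health AI", 76.8, 83.4, 5.1),
    ("Human Physicians (avg)", 76.5, 82.1, None),
    ("OpenAI GPT-4 Medical", 74.2, 78.6, 7.3),
    ("Alibaba Qwen-Med", 72.8, 76.3, 8.1),
]

BASELINE = "Human Physicians (avg)"  # hypothetical constant for this sketch


def margins_vs_baseline(results, baseline=BASELINE):
    """Return {system: composite score minus the baseline's composite score}."""
    base = next(comp for name, comp, _, _ in results if name == baseline)
    return {
        name: round(comp - base, 1)
        for name, comp, _, _ in results
        if name != baseline
    }


def lowest_hallucination(results):
    """Return the system with the lowest reported hallucination rate."""
    rated = [row for row in results if row[3] is not None]
    return min(rated, key=lambda row: row[3])[0]


if __name__ == "__main__":
    for system, delta in margins_vs_baseline(RESULTS).items():
        print(f"{system}: {delta:+.1f} points vs. physicians")
    print("Lowest hallucination rate:", lowest_hallucination(RESULTS))
```

Run as a script, this reproduces the 5.9-point gap between Lisa 1.0 and the physician average, and it shows that the two systems ranked below the physicians trail them by 2.3 and 3.7 points on the composite score.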

What this means for the AI-CDS market

For AMBOSS, the results validate an approach based on retrieval-augmented generation (RAG), which grounds AI outputs in verified medical content. In this benchmark, that approach delivered meaningfully better safety than systems relying primarily on parametric knowledge: Lisa 1.0's safety score of 91.2 was 5.5 points ahead of the next-best system.

For hospitals evaluating AI-CDS vendors, NOHARM provides the first credible apples-to-apples comparison. Safety scores and hallucination rates will become standard evaluation criteria.

The AI-CDS space is entering a quality differentiation phase. The initial wave, in which any AI tool that answered medical questions was impressive, is over. The next phase rewards systems that are safe, consistent, and grounded.