January 2026
AMBOSS Lisa 1.0 tops the NOHARM AI-CDS benchmark study
Thomas Hagemeijer
Founder & CEO, HGM Advisory

Key takeaway
AMBOSS Lisa 1.0 achieved the highest overall score in the NOHARM benchmark for AI clinical decision support, outperforming models from Google, OpenAI, and Alibaba — as well as licensed physicians — on safety, accuracy, and hallucination resistance across 100+ clinical scenarios.
Developed by German company AMBOSS, Lisa 1.0 topped the NOHARM leaderboard ahead of systems from Google, Glass Health, OpenAI, and Alibaba, as well as a human physician control group. The study tests how safely AI systems make medical decisions across more than 100 clinical cases.
What is the NOHARM benchmark?
NOHARM (National Online Health Assessment for Responsible Medicine) evaluates AI systems that provide clinical decision support. It tests whether AI can take a patient presentation and produce differential diagnoses, recommend workups, and suggest evidence-based treatments — while measuring hallucination rates and harmful recommendations.
The benchmark comprises over 100 clinical vignettes spanning internal medicine, emergency medicine, pediatrics, surgery, and psychiatry. Each case is scored on diagnostic accuracy, treatment appropriateness, safety, completeness, and hallucination rate.
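To make the scoring concrete, here is a minimal sketch of how per-case dimension scores could be rolled up into a single composite number. The five dimensions are taken from the study, but the weights and the weighted-average aggregation are illustrative assumptions, not NOHARM's published methodology.

```python
# Hypothetical sketch of a NOHARM-style composite score.
# Dimensions come from the benchmark description; the weights and
# aggregation method below are illustrative assumptions only.

def composite_score(scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight

# Illustrative weights -- NOT from the study.
weights = {
    "diagnostic_accuracy": 0.30,
    "treatment_appropriateness": 0.25,
    "safety": 0.25,
    "completeness": 0.10,
    "hallucination_resistance": 0.10,
}

# One hypothetical case's per-dimension scores.
case = {
    "diagnostic_accuracy": 85.0,
    "treatment_appropriateness": 80.0,
    "safety": 90.0,
    "completeness": 75.0,
    "hallucination_resistance": 95.0,
}

print(round(composite_score(case, weights), 1))  # → 85.0
```

A benchmark's overall score would then be the mean of these composites across all 100+ vignettes.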
Results: Lisa 1.0 leads the field
AMBOSS Lisa 1.0 achieved the highest composite score of 82.4/100, with a safety score of 91.2 and the lowest hallucination rate of 2.1%.
The physician control group averaged 76.5, putting Lisa 5.9 points ahead of the average licensed physician.
| Rank | System | Composite Score | Safety Score | Hallucination Rate |
|---|---|---|---|---|
| 1 | AMBOSS Lisa 1.0 | 82.4 | 91.2 | 2.1% |
| 2 | Google Med-PaLM 2 | 79.1 | 85.7 | 4.2% |
| 3 | Glass Health AI | 76.8 | 83.4 | 5.1% |
| 4 | Human Physicians (avg) | 76.5 | 82.1 | N/A |
| 5 | OpenAI GPT-4 Medical | 74.2 | 78.6 | 7.3% |
| 6 | Alibaba Qwen-Med | 72.8 | 76.3 | 8.1% |
What this means for the AI-CDS market
For AMBOSS, the results validate a RAG-based approach that grounds AI outputs in verified medical content, delivering meaningfully better safety than systems relying primarily on parametric knowledge.
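The retrieval-grounding idea can be sketched in a few lines: retrieve relevant reference passages first, then constrain the model to answer only from that evidence. This is a generic RAG illustration under my own assumptions, not AMBOSS's implementation; the corpus, overlap-based retriever, and prompt wording are all toy examples.

```python
# Generic retrieval-augmented generation (RAG) sketch for clinical Q&A.
# NOT AMBOSS's implementation -- a minimal illustration of grounding an
# answer in retrieved reference content rather than parametric knowledge.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by simple word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Instruct the model to answer only from the retrieved evidence."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the reference passages below. "
        "If the evidence is insufficient, say so.\n\n"
        f"References:\n{context}\n\nQuestion: {query}\n"
    )

# Toy reference corpus (illustrative statements, not clinical guidance).
corpus = [
    "First-line treatment for uncomplicated community-acquired pneumonia is amoxicillin.",
    "Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "Acute appendicitis typically presents with right lower quadrant pain.",
]

query = "first-line treatment for community-acquired pneumonia"
prompt = build_prompt(query, retrieve(query, corpus))
print("pneumonia" in prompt)  # the most relevant passage is in the context
```

The safety benefit claimed for this architecture comes from the retrieval step: if the verified corpus does not support an answer, the prompt instructs the model to say so rather than generate one from parametric memory.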
For hospitals evaluating AI-CDS vendors, NOHARM provides the first credible apples-to-apples comparison. Safety scores and hallucination rates will become standard evaluation criteria.
The AI-CDS space is entering a quality differentiation phase. The initial wave — where any AI tool that answered medical questions was impressive — is over. The next phase rewards systems that are safe, consistent, and grounded.