MedQA Performance Evaluation
Comparative analysis of large language models on medical question answering tasks
Performance
# | Model | Accuracy | 95% CI | Input $/1M Tokens | Output $/1M Tokens |
---|---|---|---|---|---|
1 | OpenAI o3-mini | 95.19% | 94.76% - 95.59% | $1.1 | $4.4 |
2 | Llama 4 Maverick | 92.55% | 92.05% - 93.05% | $0.2 | $0.6 |
3 | DeepSeek-R1 | 91.91% | 91.36% - 92.42% | $0.55 | $2.19 |
4 | Llama 3.3 70B Instruct | 90.88% | 90.31% - 91.42% | $0.12 | $0.3 |
5 | Llama 4 Scout | 89.71% | 89.16% - 90.26% | $0.09 | $0.48 |
6 | Claude 3.7 Sonnet | 87.88% | 87.23% - 88.5% | $3 | $15 |
7 | Gemini Flash 2.0 | 82.7% | 81.95% - 83.42% | $0.1 | $0.4 |
8 | Gemini 2.0 Flash Lite | 77.18% | 76.36% - 77.99% | $0.075 | $0.3 |
9 | Mistral Large 2411 | 74.23% | 73.37% - 75.07% | $2 | $6 |
10 | OpenAI GPT-4o-mini | 73.99% | 73.13% - 74.84% | $0.15 | $0.6 |
11 | Mistral Small 3.1 24B | 68.77% | 67.86% - 69.67% | $0.1 | $0.3 |
12 | Qwen2.5 32B Instruct | 68.77% | 67.86% - 69.66% | $0.79 | $0.79 |
13 | Gemma 3 27b | 67.44% | 66.53% - 68.35% | $0.1 | $0.2 |
14 | Llama 3.2 3B Instruct | 67.29% | 66.37% - 68.2% | $0.015 | $0.025 |
15 | Gemini Flash 1.5 8B | 59.59% | 58.63% - 60.54% | $0.0375 | $0.15 |
Last updated: April 6, 2025
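The 95% confidence intervals appear to be binomial proportion intervals over the evaluated questions. The exact interval method is not stated on this page, so the following is a minimal sketch using the Wilson score interval (an assumption) that produces bounds of this shape; the counts in the usage line are hypothetical, chosen only to illustrate a ~95.19% accuracy over 10,178 questions.

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for an accuracy estimate.

    correct / total is the observed accuracy; z = 1.96 gives a 95% interval.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Hypothetical example: 9,688 correct answers out of 10,178 questions (~95.19%).
low, high = wilson_ci(9688, 10178)
print(f"{9688 / 10178:.2%}  (95% CI: {low:.2%} - {high:.2%})")
```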
Benchmark Methodology
How we evaluate language models on medical tasks
Evaluation Framework
Our benchmarking methodology follows a rigorous protocol designed to assess LLMs on their medical knowledge, reasoning, and clinical relevance.
Key Principles
- Standardized Prompting: We use identical prompts across all models to ensure fair and consistent comparisons.
- Default Model Configurations: All models are run with their default configurations.
- Objective Evaluation: We measure performance against standardized evaluation datasets to provide quantifiable and reproducible results.
- Transparency and Collaboration: Our methods are open-source to encourage collaboration and community contributions.
Standard Prompt Template
You are a medical assistant. Please answer the following multiple choice question.
Question: {question}
Options:
{options}
## Output Format:
Please provide your answer in JSON format that contains an "answer" field.
Example response format:
{"answer": "X. exact option text here"}
Important:
- Please ensure that your answer is in valid JSON format.
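The `{question}` and `{options}` placeholders are filled per question before the prompt is sent to each model. The rendering code is not shown on this page, so the following is a minimal sketch of one way to do it; the `build_prompt` helper and the letter-labelled option formatting are assumptions, chosen to match the sample prompt in the next section.

```python
PROMPT_TEMPLATE = """You are a medical assistant. Please answer the following multiple choice question.

Question: {question}

Options:
{options}

## Output Format:
Please provide your answer in JSON format that contains an "answer" field.

Example response format:
{{"answer": "X. exact option text here"}}

Important:
- Please ensure that your answer is in valid JSON format."""

def build_prompt(question: str, options: list[str]) -> str:
    # Label the options A, B, C, ... and join them one per line.
    labeled = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return PROMPT_TEMPLATE.format(question=question, options=labeled)

print(build_prompt(
    "Which one of the following specimen is not refrigerated prior to inoculation?",
    ["CSF", "Pus", "Urine", "Sputum"],
))
```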
Example Prompt
Question extracted from the MedMCQA dataset
Sample Question Prompt
You are a medical assistant. Please answer the following multiple choice question.
Question:
Which one of the following specimen is not refrigerated prior to inoculation?
Options:
A. CSF
B. Pus
C. Urine
D. Sputum
## Output Format:
Please provide your answer in JSON format that contains an "answer" field.
Example response format:
{"answer": "X. exact option text here"}
Important:
- Please ensure that your answer is in valid JSON format.
Model Response Example
{"answer": "A. CSF"}
Evaluation Details
- Correct Answer: A. CSF
- Subject: Microbiology
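The model's JSON response is parsed and compared against the ground-truth option. The exact parsing and matching rules are not published on this page; the sketch below assumes that only the leading option letter needs to match and that malformed responses count as incorrect.

```python
import json

def score_response(raw_response: str, correct_answer: str) -> bool:
    """Return True if the model's JSON answer names the correct option.

    Both strings are expected in the form "A. exact option text here";
    only the leading option letter is compared (an assumption).
    """
    try:
        answer = str(json.loads(raw_response)["answer"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # malformed or non-JSON responses are scored as incorrect
    return answer.strip().upper()[:1] == correct_answer.strip().upper()[:1]

print(score_response('{"answer": "A. CSF"}', "A. CSF"))  # True
```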
Benchmark Datasets
Information about our evaluation datasets
MedQA [Paper] [GitHub]
Multiple choice question answering based on the United States Medical Licensing Examination (USMLE).
- Size: 12,723 questions (English version)
- Benchmark Set: We evaluate the models against the 10,178 questions in the train split of the data
- Format: Multiple choice questions
- Citation: Jin, Di, et al. "What disease does this patient have? A large-scale open domain question answering dataset from medical exams." *Applied Sciences* 11.14 (2021): 6421.
MedMCQA [Paper] [GitHub]
A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.
- Size: 194,000+ questions
- Benchmark Set: We evaluate the models against the 120,765 single-select questions in the train split of the data
- Format: Multiple choice questions
- Coverage: Twenty-one medical subjects including anatomy, physiology, biochemistry, pathology, and pharmacology
- Citation: Pal, A., Umapathi, L.K. and Sankarasubbu, M., 2022, April. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In *Conference on Health, Inference, and Learning* (pp. 248-260). PMLR.
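For reference, a minimal sketch of selecting the single-select portion of the MedMCQA train split with the Hugging Face `datasets` library. The `openlifescienceai/medmcqa` dataset id and the `choice_type`, `opa`-`opd`, and `subject_name` field names come from the public MedMCQA release, not from this page; adjust them if the hosted schema differs.

```python
from datasets import load_dataset

# Assumed dataset id from the public MedMCQA release on the Hugging Face Hub.
medmcqa = load_dataset("openlifescienceai/medmcqa", split="train")

# Keep only single-select questions, as described above.
single_select = medmcqa.filter(lambda row: row["choice_type"] == "single")
print(len(single_select))

# Each row carries the question stem, four options (opa-opd), the correct
# option index (cop), and the subject name.
example = single_select[0]
print(example["question"], example["opa"], example["subject_name"])
```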