MedQA Performance Evaluation
Comparative analysis of large language models on medical question answering tasks
Performance
# | Model | Accuracy | 95% CI | Input $/1M Tokens | Output $/1M Tokens |
---|---|---|---|---|---|
1 | OpenAI o3-mini | 95.19% | 94.76% - 95.59% | $1.1 | $4.4 |
2 | Llama 4 Maverick | 92.55% | 92.05% - 93.05% | $0.2 | $0.6 |
3 | DeepSeek-R1 | 91.91% | 91.36% - 92.42% | $0.55 | $2.19 |
4 | Llama 3.3 70B Instruct | 90.88% | 90.31% - 91.42% | $0.12 | $0.3 |
5 | Llama 4 Scout | 89.71% | 89.16% - 90.26% | $0.09 | $0.48 |
6 | Claude 3.7 Sonnet | 87.88% | 87.23% - 88.5% | $3 | $15 |
7 | Gemini Flash 2.0 | 82.7% | 81.95% - 83.42% | $0.1 | $0.4 |
8 | Gemini 2.0 Flash Lite | 77.18% | 76.36% - 77.99% | $0.075 | $0.3 |
9 | Mistral Large 2411 | 74.23% | 73.37% - 75.07% | $2 | $6 |
10 | OpenAI GPT-4o-mini | 73.99% | 73.13% - 74.84% | $0.15 | $0.6 |
11 | Mistral Small 3.1 24B | 68.77% | 67.86% - 69.67% | $0.1 | $0.3 |
12 | Qwen2.5 32B Instruct | 68.77% | 67.86% - 69.66% | $0.79 | $0.79 |
13 | Gemma 3 27b | 67.44% | 66.53% - 68.35% | $0.1 | $0.2 |
14 | Llama 3.2 3B Instruct | 67.29% | 66.37% - 68.2% | $0.015 | $0.025 |
15 | Gemini Flash 1.5 8B | 59.59% | 58.63% - 60.54% | $0.0375 | $0.15 |
Last updated: April 6, 2025
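The 95% confidence intervals appear to be binomial proportion intervals over the evaluated questions. The exact interval method is not stated on this page, so the following is a minimal sketch using the Wilson score interval (an assumption) that produces bounds of this shape; the counts in the usage line are hypothetical, chosen only to illustrate a ~95.19% accuracy over 10,178 questions.

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for an accuracy estimate.

    correct / total is the observed accuracy; z = 1.96 gives a 95% interval.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Hypothetical example: 9,688 correct answers out of 10,178 questions (~95.19%).
low, high = wilson_ci(9688, 10178)
print(f"{9688 / 10178:.2%}  (95% CI: {low:.2%} - {high:.2%})")
```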
Benchmark Methodology
How we evaluate language models on medical tasks
Evaluation Framework
Our benchmarking methodology follows a rigorous protocol designed to assess LLMs on their medical knowledge, reasoning, and clinical relevance.
Key Principles
- Standardized Prompting: We use identical prompts across all models to ensure fair and consistent comparisons.
- Default Model Configurations: All models are run with their default configurations.
- Objective Evaluation: We measure performance against standardized evaluation datasets to provide quantifiable and reproducible results.
- Transparency and Collaboration: Our methods are open-source to encourage collaboration and community contributions.
Standard Prompt Template
You are a medical assistant. Please answer the following multiple choice question.
Question: {question}
Options:
{options}
## Output Format:
Please provide your answer in JSON format that contains an "answer" field.
Example response format:
{"answer": "X. exact option text here"}
Important:
- Please ensure that your answer is in valid JSON format.
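The `{question}` and `{options}` placeholders are filled per question before the prompt is sent to each model. The rendering code is not shown on this page, so the following is a minimal sketch of one way to do it; the `build_prompt` helper and the letter-labelled option formatting are assumptions, chosen to match the sample prompt in the next section.

```python
PROMPT_TEMPLATE = """You are a medical assistant. Please answer the following multiple choice question.

Question: {question}

Options:
{options}

## Output Format:
Please provide your answer in JSON format that contains an "answer" field.

Example response format:
{{"answer": "X. exact option text here"}}

Important:
- Please ensure that your answer is in valid JSON format."""

def build_prompt(question: str, options: list[str]) -> str:
    # Label the options A, B, C, ... and join them one per line.
    labeled = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return PROMPT_TEMPLATE.format(question=question, options=labeled)

print(build_prompt(
    "Which one of the following specimen is not refrigerated prior to inoculation?",
    ["CSF", "Pus", "Urine", "Sputum"],
))
```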
Example Prompt
Question extracted from the MedMCQA dataset
Sample Question Prompt
You are a medical assistant. Please answer the following multiple choice question.
Question:
Which one of the following specimen is not refrigerated prior to inoculation?
Options:
A. CSF
B. Pus
C. Urine
D. Sputum
## Output Format:
Please provide your answer in JSON format that contains an "answer" field.
Example response format:
{"answer": "X. exact option text here"}
Important:
- Please ensure that your answer is in valid JSON format.
Model Response Example
{"answer": "A. CSF"}
Evaluation Details
- Correct Answer: A. CSF
- Subject: Microbiology
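The model's JSON response is parsed and compared against the ground-truth option. The exact parsing and matching rules are not published on this page; the sketch below assumes that only the leading option letter needs to match and that malformed responses count as incorrect.

```python
import json

def score_response(raw_response: str, correct_answer: str) -> bool:
    """Return True if the model's JSON answer names the correct option.

    Both strings are expected in the form "A. exact option text here";
    only the leading option letter is compared (an assumption).
    """
    try:
        answer = str(json.loads(raw_response)["answer"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # malformed or non-JSON responses are scored as incorrect
    return answer.strip().upper()[:1] == correct_answer.strip().upper()[:1]

print(score_response('{"answer": "A. CSF"}', "A. CSF"))  # True
```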
Benchmark Datasets
Information about our evaluation datasets
MedQA [Paper] [GitHub]
Multiple choice question answering based on the United States Medical Licensing Examination (USMLE).
- Size: 12,723 questions (English version)
- Benchmark Set: We evaluate the models against the 10,178 questions in the train split of the data
- Format: Multiple choice questions
- Citation: Jin, Di, et al. "What disease does this patient have? A large-scale open domain question answering dataset from medical exams." *Applied Sciences* 11.14 (2021): 6421.
MedMCQA [Paper] [GitHub]
A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.
- Size: 194,000+ questions
- Benchmark Set: We evaluate the models against the 120,765 single-select questions in the train split of the data
- Format: Multiple choice questions
- Coverage: Twenty-one medical subjects including anatomy, physiology, biochemistry, pathology, and pharmacology
- Citation: Pal, A., Umapathi, L.K. and Sankarasubbu, M., 2022, April. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In *Conference on Health, Inference, and Learning* (pp. 248-260). PMLR.
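For reference, a minimal sketch of selecting the single-select portion of the MedMCQA train split with the Hugging Face `datasets` library. The `openlifescienceai/medmcqa` dataset id and the `choice_type`, `opa`-`opd`, and `subject_name` field names come from the public MedMCQA release, not from this page; adjust them if the hosted schema differs.

```python
from datasets import load_dataset

# Assumed dataset id from the public MedMCQA release on the Hugging Face Hub.
medmcqa = load_dataset("openlifescienceai/medmcqa", split="train")

# Keep only single-select questions, as described above.
single_select = medmcqa.filter(lambda row: row["choice_type"] == "single")
print(len(single_select))

# Each row carries the question stem, four options (opa-opd), the correct
# option index (cop), and the subject name.
example = single_select[0]
print(example["question"], example["opa"], example["subject_name"])
```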