MedQA Performance Evaluation

Comparative analysis of large language models on medical question answering tasks
Performance
| # | Provider | Model | Accuracy | 95% CI | Input Price ($/1M tokens) | Output Price ($/1M tokens) |
|---|----------|-------|----------|--------|---------------------------|----------------------------|
| 1 | OpenAI | OpenAI o3-mini | 95.19% | 94.76% - 95.59% | $1.10 | $4.40 |
| 2 | Meta | Llama 4 Maverick | 92.55% | 92.05% - 93.05% | $0.20 | $0.60 |
| 3 | DeepSeek | DeepSeek-R1 | 91.91% | 91.36% - 92.42% | $0.55 | $2.19 |
| 4 | Meta | Llama 3.3 70B Instruct | 90.88% | 90.31% - 91.42% | $0.12 | $0.30 |
| 5 | Meta | Llama 4 Scout | 89.71% | 89.16% - 90.26% | $0.09 | $0.48 |
| 6 | Anthropic | Claude 3.7 Sonnet | 87.88% | 87.23% - 88.50% | $3.00 | $15.00 |
| 7 | Google | Gemini Flash 2.0 | 82.70% | 81.95% - 83.42% | $0.10 | $0.40 |
| 8 | Google | Gemini 2.0 Flash Lite | 77.18% | 76.36% - 77.99% | $0.075 | $0.30 |
| 9 | Mistral AI | Mistral Large 2411 | 74.23% | 73.37% - 75.07% | $2.00 | $6.00 |
| 10 | OpenAI | OpenAI GPT-4o-mini | 73.99% | 73.13% - 74.84% | $0.15 | $0.60 |
| 11 | Mistral AI | Mistral Small 3.1 24B | 68.77% | 67.86% - 69.67% | $0.10 | $0.30 |
| 12 | Qwen | Qwen2.5 32B Instruct | 68.77% | 67.86% - 69.66% | $0.79 | $0.79 |
| 13 | Google | Gemma 3 27B | 67.44% | 66.53% - 68.35% | $0.10 | $0.20 |
| 14 | Meta | Llama 3.2 3B Instruct | 67.29% | 66.37% - 68.20% | $0.015 | $0.025 |
| 15 | Google | Gemini Flash 1.5 8B | 59.59% | 58.63% - 60.54% | $0.0375 | $0.15 |
Last updated: April 6, 2025
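
The page does not state how the 95% confidence intervals are computed, but they are consistent with a standard binomial interval over the 10,178 benchmark questions. Below is a minimal sketch, assuming a Wilson score interval; the function name and the example counts are illustrative.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for an accuracy estimate."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Example: 9,688 / 10,178 correct, i.e. roughly 95.19% accuracy
lo, hi = wilson_interval(correct=9688, total=10178)
print(f"95% CI: {lo:.2%} - {hi:.2%}")  # close to the 94.76% - 95.59% reported above
```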

Benchmark Methodology

How we evaluate language models on medical tasks

Evaluation Framework

Our benchmarking methodology follows a rigorous protocol designed to assess LLMs on their medical knowledge, reasoning, and clinical relevance.

Key Principles

  • Standardized Prompting: We use identical prompts across all models to ensure fair and consistent comparisons.
  • Default Model Configurations: All models are run with their default configurations; a minimal call sketch follows this list.
  • Objective Evaluation: We measure performance against standardized evaluation datasets to provide quantifiable and reproducible results.
  • Transparency and Collaboration: Our methods are open-source to encourage collaboration and community contributions.
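
The evaluation harness itself is not shown on this page. As a rough illustration of the first two principles, the sketch below sends one rendered prompt to a model through an OpenAI-compatible client without overriding any sampling parameters; the client, default model name, and helper function are illustrative placeholders rather than the leaderboard's actual code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send the identical prompt to a model, leaving all sampling settings at their defaults."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # No temperature, top_p, or max_tokens overrides: the model's default configuration is used.
    )
    return response.choices[0].message.content
```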

Standard Prompt Template

You are a medical assistant. Please answer the following multiple choice question.

Question: {question}

Options:
{options}

## Output Format:
Please provide your answer in JSON format that contains an "answer" field.

Example response format:
{"answer": "X. exact option text here"}

Important:
- Please ensure that your answer is in valid JSON format.
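
As a concrete illustration, the sketch below fills the template above for one question; the helper name is hypothetical, and options are assumed to be lettered A, B, C, and so on.

```python
PROMPT_TEMPLATE = """You are a medical assistant. Please answer the following multiple choice question.

Question: {question}

Options:
{options}

## Output Format:
Please provide your answer in JSON format that contains an "answer" field.

Example response format:
{{"answer": "X. exact option text here"}}

Important:
- Please ensure that your answer is in valid JSON format."""

def render_prompt(question: str, options: list[str]) -> str:
    """Fill the standard template with one question and its lettered options."""
    lettered = "\n".join(f"{chr(65 + i)}. {text}" for i, text in enumerate(options))
    return PROMPT_TEMPLATE.format(question=question, options=lettered)
```

The doubled braces around the JSON example keep str.format from treating it as a placeholder, so the template renders exactly as shown above.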

Example Prompt

Question extracted from the MedMCQA dataset

Sample Question Prompt

You are a medical assistant. Please answer the following multiple choice question.

Question:
Which one of the following specimen is not refrigerated prior to inoculation?

Options:
A. CSF
B. Pus
C. Urine
D. Sputum

## Output Format:
Please provide your answer in JSON format that contains an "answer" field.

Example response format:
{"answer": "X. exact option text here"}

Important:
- Please ensure that your answer is in valid JSON format.

Model Response Example

{"answer": "A. CSF"}

Evaluation Details

  • Correct Answer: A. CSF
  • Subject: Microbiology
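
Because replies are constrained to JSON, grading can reduce to parsing the "answer" field and comparing it with the gold option. Below is a minimal sketch, assuming exact-match grading and that malformed output is simply counted as incorrect; the page does not specify either detail.

```python
import json

def score_response(raw_reply: str, correct_answer: str) -> bool:
    """Parse the model's JSON reply and compare it to the gold answer, e.g. "A. CSF"."""
    try:
        answer = json.loads(raw_reply)["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # malformed or non-JSON output is counted as incorrect
    return str(answer).strip() == correct_answer.strip()

print(score_response('{"answer": "A. CSF"}', "A. CSF"))  # True
```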

Benchmark Datasets

Information about our evaluation datasets

MedQA [Paper] [GitHub]

Multiple choice question answering based on the United States Medical Licensing Examination (USMLE).

  • Size: 12,723 questions (English version)
  • Benchmark Set: We evaluate the models against the 10,178 questions in the train split of the dataset (a loading sketch follows this list)
  • Format: Multiple choice questions
  • Citation: Jin, Di, et al. "What disease does this patient have? A large-scale open domain question answering dataset from medical exams." Applied Sciences 11.14 (2021): 6421.
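
A minimal loading sketch for the benchmark set described above; the file path and field names follow the dataset's GitHub release and should be treated as assumptions here.

```python
import json

def load_medqa(path: str = "data_clean/questions/US/train.jsonl") -> list[dict]:
    """Load the English MedQA train split, assumed to be JSONL with one question per line."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            # Assumed fields: "question" (str), "options" (letter -> text), "answer_idx" (letter)
            questions.append({
                "question": item["question"],
                "options": [f"{letter}. {text}" for letter, text in sorted(item["options"].items())],
                "correct": f'{item["answer_idx"]}. {item["options"][item["answer_idx"]]}',
            })
    return questions
```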

MedMCQA [Paper] [GitHub]

A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.

  • Size: 194,000+ questions
  • Benchmark Set: We evaluate the models against the 120,765 single-select questions in the train split of the dataset (a loading sketch follows this list)
  • Format: Multiple choice questions
  • Coverage: Twenty-one medical subjects including anatomy, physiology, biochemistry, pathology, and pharmacology
  • Citation: Pal, A., Umapathi, L.K. and Sankarasubbu, M., 2022, April. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In *Conference on Health, Inference, and Learning* (pp. 248-260). PMLR.
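
A minimal loading sketch for the single-select benchmark set described above, using the Hugging Face datasets library; the dataset id and column names are assumptions based on the public release.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset id and column names for MedMCQA.
medmcqa = load_dataset("openlifescienceai/medmcqa", split="train")

# Keep only single-select questions, matching the benchmark set described above.
single_select = medmcqa.filter(lambda row: row["choice_type"] == "single")

def to_options(row: dict) -> list[str]:
    """Assemble the four lettered options from the opa-opd columns."""
    return [f"{letter}. {row[key]}" for letter, key in zip("ABCD", ["opa", "opb", "opc", "opd"])]
```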