MMLU-Pro: 2026 AI Leaderboard
MMLU's harder successor: 10 answer choices and more reasoning.
What it tests
MMLU-Pro is a successor to MMLU that expands each question to 10 answer choices (up from 4) and rewrites prompts to require multi-step reasoning rather than pure recall.
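One practical effect of going from 4 to 10 choices is a lower random-guess floor. The short sketch below works through that arithmetic. It assumes uniform guessing over all options, and the helper name `chance_accuracy` is made up for illustration, not taken from any benchmark tooling.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice question."""
    if num_choices < 1:
        raise ValueError("a question needs at least one answer choice")
    return 1.0 / num_choices


# Assumption: every question offers the full option count (4 for MMLU, 10 for MMLU-Pro).
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance level: {mmlu_chance:.0%}")
print(f"MMLU-Pro chance level: {mmlu_pro_chance:.0%}")
```

Under this assumption a model gains far less from lucky guesses, so a given score reflects more of what the model actually knows.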
How it is scored
Scoring uses the same accuracy metric as MMLU (the percentage of questions answered correctly), applied to the harder, reformulated question bank. Frontier models score roughly 10-20 points lower here than on base MMLU.
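As a rough illustration of the accuracy metric, the sketch below scores a handful of predicted answer letters against a gold key. It is a minimal example under stated assumptions, not the official evaluation harness: the function name `accuracy_percent` and the sample data are hypothetical, and answers are assumed to be single letters A through J.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of questions where the predicted letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions, each with options A through J.
answers = ["A", "J", "C", "F"]
predictions = ["A", "B", "C", "F"]

score = accuracy_percent(predictions, answers)
print(f"Accuracy: {score:.1f}%")
```

Real evaluation pipelines also have to extract the answer letter from free-form model output, which this sketch leaves out.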
Why it matters
Worth watching because base MMLU has saturated. MMLU-Pro still has headroom, making it a better discriminator for top-tier models in 2026.
Leaderboard (5 models)
Sorted by MMLU-Pro score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.
| # | Model | Tier | MMLU-Pro score | Variant | Overall |
|---|---|---|---|---|---|
| 1 | DeepSeek DeepSeek V4-Pro (launched 2026-04-24; scores below are the V3.2 baseline pending third-party V4 verification, which typically lands 3-7 days post-launch) | A | 85% | MMLU-Pro | 8.0/10 |
| 2 | Qwen (Alibaba) Qwen3.5-397B MoE | A | 83.5% | MMLU-Pro | 8.8/10 |
| 3 | GLM / Z.ai (Zhipu AI) GLM-5.1 (744B MoE / 40B active) | A | 81.2% | MMLU-Pro | 8.0/10 |
| 4 | Llama 4 (Meta) Llama 4 Maverick (17B/400B MoE) | B | 80.5% | MMLU-Pro | 7.9/10 |
| 5 | Nemotron (Nvidia) Nemotron 3 Ultra (253B) | B | 79.8% | MMLU-Pro | 7.8/10 |
About MMLU-Pro
- Creator: TIGER-Lab, 2024
- Unit: % (max 100)
- Official source: https://arxiv.org/abs/2406.01574