HumanEval: 2026 AI Leaderboard
164 Python programming problems: does the generated code pass unit tests?
What it tests
HumanEval consists of 164 handwritten Python programming problems, each paired with unit tests that the model does not see. The model is given a function signature and docstring and must generate a function body that passes every test.
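A minimal sketch of how a single problem can be checked, assuming the layout of the public HumanEval dataset (a `prompt`, a `test` that defines `check(candidate)`, and an `entry_point` name). The problem below is a made-up illustration, not one of the 164 tasks, and `passes` is a hypothetical helper. A real harness would run generated code in a sandbox with a timeout instead of calling `exec` directly.

```python
# Hypothetical problem in a HumanEval-style layout (not a real benchmark task).
problem = {
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes every assertion in the test."""
    program = problem["prompt"] + completion + "\n\n" + problem["test"]
    namespace: dict = {}
    try:
        # Defines both the candidate function and check() in one namespace.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        # Any failed assertion, runtime error, or syntax error counts as a miss.
        return False
    return True


good = "    return a + b\n"  # model output that solves the task
bad = "    return a - b\n"   # model output that fails the first assertion
print(passes(problem, good), passes(problem, bad))
```

The model only ever sees `prompt`; the `test` string is appended by the harness after generation.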
How it is scored
pass@1 -- the percentage of problems solved on the first attempt. Frontier models in 2026 sit in the 94-99% range, so this benchmark is effectively saturated for top-tier LLMs.
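With one generated sample per problem, pass@1 is simply solved problems divided by total problems. When several samples are drawn per problem, the HumanEval paper uses an unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass. The sketch below implements that formula; the function names and the 156-of-164 example are illustrative assumptions, not figures from the leaderboard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over all problems, as a percentage.

    results holds one (n, c) pair per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per problem, 156 of 164 problems solved.
results = [(1, 1)] * 156 + [(1, 0)] * 8
print(round(benchmark_score(results), 1))
```

In this hypothetical run the score is 156/164, roughly 95.1%, which shows how little headroom remains: each additional problem solved moves the score by about 0.6 points.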
Why it matters
HumanEval is still useful as a floor check: any serious coding model should clear 90% here. To tell top models apart on realistic work, look to SWE-bench Verified and LiveCodeBench, which still separate the field.
Leaderboard (14 models)
Sorted by HumanEval score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.
| # | Model | Tier | HumanEval score | Variant | Overall |
|---|---|---|---|---|---|
| 1 | Codex (OpenAI) GPT-5.2-Codex (launched 2026-04-23 -- SOTA on SWE-Bench Pro and Terminal-Bench 2.0; first-party scores below pending detailed third-party verification) | A | 95% | HumanEval | 8.3/10 |
| 2 | ChatGPT GPT-5.5 (launched 2026-04-23; scores below are the GPT-5.4 baseline -- GPT-5.5 launch benchmarks per OpenAI are logged in Known Issues, pending third-party verification) | A | 95% | HumanEval | 8.8/10 |
| 3 | Claude (Anthropic) Claude Opus 4.7 (4.6 baseline scores shown; 4.7 announced 13% coding lift, 3x production task completion) | A | 94% | HumanEval | 8.5/10 |
| 4 | Gemini (Google) Gemini 3.1 Ultra | A | 93.5% | HumanEval | 8.3/10 |
| 5 | Qwen (Alibaba) Qwen3.5-397B MoE | A | 92.5% | HumanEval | 8.8/10 |
| 6 | Mistral AI Mistral Medium 3.5 (vendor-published; third-party verification pending) | B | 92% | HumanEval | 7.5/10 |
| 7 | DeepSeek DeepSeek V4-Pro (launched 2026-04-24; scores below are the V3.2 baseline pending third-party V4 verification, which typically lands 3-7 days post-launch) | A | 91.5% | HumanEval | 8.0/10 |
| 8 | Muse Spark (Meta) Muse Spark | A | 91% | HumanEval | 8.8/10 |
| 9 | Grok Grok 4.20 | B | 90% | HumanEval | 7.5/10 |
| 10 | Nemotron (Nvidia) Nemotron 3 Ultra (253B) | B | 89.6% | HumanEval | 7.8/10 |
| 11 | GLM / Z.ai (Zhipu AI) GLM-5.1 (744B MoE / 40B active) | A | 89.1% | HumanEval | 8.0/10 |
| 12 | Llama 4 (Meta) Llama 4 Maverick (17B/400B MoE) | B | 88% | HumanEval | 7.9/10 |
| 13 | Gemma 4 (Google) Gemma 4 31B | A | 85% | HumanEval | 8.3/10 |
| 14 | Falcon (TII) Falcon 3 10B | B | 73.8% | HumanEval | 7.1/10 |
About HumanEval
- Creator: OpenAI, 2021
- Unit: % (max 100)
- Official source: https://arxiv.org/abs/2107.03374