
HumanEval: 2026 AI Leaderboard

164 Python programming problems: does the generated code pass unit tests?

What it tests

HumanEval is a set of 164 handwritten Python programming problems, each paired with unit tests that the model does not see. The model is given a function signature and docstring and must generate a function body that passes every test.
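To make the setup concrete, here is a minimal sketch of how one problem could be checked. The problem, the completion, and the `passes` helper are all hypothetical; the layout (a prompt, a completion, and a test that defines `check(candidate)`) is loosely modelled on the published dataset, and a real harness would also sandbox the code and enforce a timeout.

```python
# Hypothetical problem in the HumanEval style: the prompt is a signature plus docstring.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# The model supplies only the function body (illustrative completion).
COMPLETION = "    return a + b\n"

# Unit tests the model never sees; they exercise the candidate function.
TEST = '''def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in test.

    Assumption: no sandboxing or timeout here, so only run trusted code.
    """
    namespace: dict = {}
    try:
        # Define the candidate function and the check() function together.
        exec(prompt + completion + "\n" + test, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a miss.
        return False


print(passes(PROMPT, COMPLETION, TEST, "add"))
```

A problem counts as solved only if every assertion holds, so a completion that handles one test case but not the other scores zero for that problem.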

How it is scored

pass@1 -- the percentage of problems solved on the first attempt. Frontier models in 2026 sit in the 94-99% range, so this benchmark is effectively saturated for top-tier LLMs.
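As a rough illustration of the arithmetic, the sketch below computes pass@1 from a list of per-problem outcomes. The function names and the 156-of-164 figure are made up for the example. The second function is the unbiased pass@k estimator described in the original HumanEval paper, included only as an assumption about how scores are derived when several samples are drawn per problem; the leaderboard here reports pass@1 only.

```python
from math import comb


def pass_at_1(first_attempt_results: list[bool]) -> float:
    """Percentage of problems whose first generated sample passed all tests."""
    return 100.0 * sum(first_attempt_results) / len(first_attempt_results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem pass@k estimate from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k); averaged over problems to get a score.
    """
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative data only: 156 of 164 problems solved on the first attempt.
results = [True] * 156 + [False] * 8
print(round(pass_at_1(results), 1))  # 95.1
```

With one sample per problem (n = 1, k = 1) the estimator reduces to a plain pass/fail count, which is why pass@1 can be read as a simple percentage.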

Why it matters

Still useful as a floor-check: any serious coding model should clear 90% here. For real-world discrimination, SWE-bench Verified and LiveCodeBench are the benchmarks that still separate the field.

Leaderboard (14 models)

Sorted by HumanEval score. The Tier column shows the tool's overall AIToolTier rank, which blends this benchmark with pricing, features, and real-world usability.

# | Tool | Model | Tier | HumanEval score
1 | Codex (OpenAI) | GPT-5.2-Codex (launched 2026-04-23 -- SOTA on SWE-Bench Pro and Terminal-Bench 2.0; first-party scores below pending detailed third-party verification) | A | 95%
2 | ChatGPT | GPT-5.5 (launched 2026-04-23; scores below are the GPT-5.4 baseline -- GPT-5.5 launch benchmarks per OpenAI are logged in Known Issues, pending third-party verification) | A | 95%
3 | Claude (Anthropic) | Claude Opus 4.7 (4.6 baseline scores shown; 4.7 announced 13% coding lift, 3x production task completion) | A | 94%
4 | Gemini (Google) | Gemini 3.1 Ultra | A | 93.5%
5 | Qwen (Alibaba) | Qwen3.5-397B MoE | A | 92.5%
6 | Mistral AI | Mistral Medium 3.5 (vendor-published; third-party verification pending) | B | 92%
7 | DeepSeek | DeepSeek V4-Pro (launched 2026-04-24; scores below are the V3.2 baseline pending third-party V4 verification, which typically lands 3-7 days post-launch) | A | 91.5%
8 | Muse Spark (Meta) | Muse Spark | A | 91%
9 | Grok | Grok 4.20 | B | 90%
10 | Nemotron (Nvidia) | Nemotron 3 Ultra (253B) | B | 89.6%
11 | GLM / Z.ai (Zhipu AI) | GLM-5.1 (744B MoE / 40B active) | A | 89.1%
12 | Llama 4 (Meta) | Llama 4 Maverick (17B/400B MoE) | B | 88%
13 | Gemma 4 (Google) | Gemma 4 31B | A | 85%
14 | Falcon (TII) | Falcon 3 10B | B | 73.8%

About HumanEval

Creator: OpenAI, 2021
Unit: % (max 100)
