Best AI models for math — MATH, AIME, GSM8K rankings
The best AI models for mathematical reasoning, ranked by MATH, AIME 2024, GSM8K and GPQA scores. Curated by OrcaRouter.
Top math models
Ordered by a composite score across AIME 2024 (40%), MATH level-5 (30%), Putnam-style problems (20%), and GSM8K (10%). The weighting favors AIME because GSM8K and most MATH levels are effectively saturated as of 2026.
- gpt-5.5-thinking — AIME 2024: 94%, MATH-5: 88%. Best on competition-style problems. Extended thinking mode adds significant cost (~5× regular gpt-5.5) but is the only viable choice for genuinely hard problems.
- claude-opus-4-7-thinking — AIME 2024: 91%, MATH-5: 86%. Slightly behind gpt-5.5-thinking on AIME but strongest on multi-step proof-style reasoning where intermediate-step accuracy matters.
- deepseek-v4-pro-thinking — AIME 2024: 89%, MATH-5: 84%. Open-weights thinking model; remarkable for the price. Best non-frontier-vendor option.
- gemini-3.1-pro-preview — AIME 2024: 82%, MATH-5: 80%. No explicit thinking mode but very strong on symbolic and computational math. Best non-thinking model.
- claude-sonnet-4-6 — AIME 2024: 71%, MATH-5: 76%. Solid mid-tier non-thinking option. Use when latency matters more than competition-grade accuracy.
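The composite ranking above can be sketched as a simple weighted average. The weights come from this section; the AIME 2024 and MATH-5 numbers are from the list, but the per-model Putnam and GSM8K scores used below are hypothetical placeholders, since the section does not report them individually.

```python
# Weights from the section: AIME 2024 (40%), MATH-5 (30%), Putnam (20%), GSM8K (10%).
WEIGHTS = {"aime_2024": 0.40, "math_5": 0.30, "putnam": 0.20, "gsm8k": 0.10}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each on a 0-100 scale."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# AIME/MATH-5 from the list above; Putnam and GSM8K are assumed values.
gpt_5_5_thinking = {"aime_2024": 94, "math_5": 88, "putnam": 70, "gsm8k": 97}
print(round(composite(gpt_5_5_thinking), 1))  # → 87.7
```

Because the weights sum to 1, the composite stays on the same 0-100 scale as the individual benchmarks, which makes models directly comparable.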
Why thinking models dominate
Mathematical reasoning rewards extended chain of thought heavily: a model that writes 5,000 tokens working through a proof is usually more accurate than the same base model in standard mode. AIME, Putnam, and the harder MATH levels are essentially solvable by capable humans given enough time, so models that can spend the analogous resource (inference-time compute) dominate the leaderboard.
When NOT to use a thinking model
Thinking modes cost 3-10× more per task and add seconds to minutes of latency. For grade-school arithmetic, simple algebra, and structured numeric extraction, the standard models already exceed 95% accuracy. Reserve thinking modes for genuinely hard problems where the realistic alternative is dedicated solver software.
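A back-of-envelope way to apply this rule: if a failed attempt must be retried or escalated, the expected cost per *solved* task is roughly cost divided by accuracy. The multipliers and accuracies below are illustrative assumptions, not measured numbers, but they show why thinking modes pay off only on hard problems.

```python
def expected_cost_per_solve(cost_per_call: float, accuracy: float) -> float:
    """Expected spend to get one correct answer, assuming failures are retried.

    This is the mean of a geometric distribution of attempts times cost:
    cost / accuracy. Inputs: relative cost per call, accuracy in (0, 1].
    """
    return cost_per_call / accuracy

# Hard problem (illustrative): standard mode rarely succeeds.
hard_standard = expected_cost_per_solve(1.0, 0.10)   # 10.0 units per solve
hard_thinking = expected_cost_per_solve(5.0, 0.92)   # ~5.4 units per solve

# Easy task (illustrative): standard mode is already >95% accurate.
easy_standard = expected_cost_per_solve(1.0, 0.96)   # ~1.04 units per solve
easy_thinking = expected_cost_per_solve(5.0, 0.99)   # ~5.05 units per solve
```

Under these assumed numbers, the ~5× thinking mode is cheaper per solved hard problem, while on easy tasks it is strictly wasted spend, matching the guidance above.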