Best AI models for math — MATH, AIME, and GSM8K rankings
The best AI models for mathematical reasoning, ranked by MATH, AIME 2024, GSM8K, and GPQA scores. Curated by OrcaRouter.
Top math models
Ordered by composite score across AIME 2024 (40%), MATH-level-5 (30%), Putnam-style problems (20%), and GSM8K (10%). Score weights AIME most heavily because GSM8K and most of MATH are saturated by 2026.
- gpt-5.5-thinking — AIME 2024: 94%, MATH-5: 88%. Best on competition-style problems. Extended thinking mode adds significant cost (~5× regular gpt-5.5) but is the only viable choice for genuinely hard problems.
- claude-opus-4-7-thinking — AIME 2024: 91%, MATH-5: 86%. Slightly behind gpt-5.5-thinking on AIME but strongest on multi-step proof-style reasoning where intermediate-step accuracy matters.
- deepseek-v4-pro-thinking — AIME 2024: 89%, MATH-5: 84%. Open-weights thinking model; remarkable for the price. Best non-frontier-vendor option.
- gemini-3.1-pro-preview — AIME 2024: 82%, MATH-5: 80%. No explicit thinking mode but very strong on symbolic and computational math. Best non-thinking model.
- claude-sonnet-4-6 — AIME 2024: 71%, MATH-5: 76%. Solid mid-tier non-thinking option. Use when latency matters more than competition-grade accuracy.
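The composite weighting described above can be sketched in a few lines. This is a minimal illustration, not OrcaRouter's actual scoring code; the weights come from the text, and the Putnam and GSM8K figures in the example are placeholders, not published results.

```python
# Benchmark weights as stated in the ranking description above.
WEIGHTS = {"aime_2024": 0.40, "math_5": 0.30, "putnam": 0.20, "gsm8k": 0.10}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average over the four benchmarks (each on a 0-100 scale)."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example using gpt-5.5-thinking's listed AIME and MATH-5 numbers; the
# Putnam and GSM8K values here are made-up placeholders for illustration.
example = {"aime_2024": 94, "math_5": 88, "putnam": 70, "gsm8k": 98}
print(round(composite_score(example), 1))  # -> 87.8
```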
Why thinking models dominate
Mathematical reasoning rewards extended chain-of-thought heavily — a model that writes 5,000 tokens working through a proof is usually more accurate than the same base model answering directly. AIME, Putnam, and the harder MATH levels are essentially solvable by humans given enough time, so models that can spend "time" (inference compute) on a problem dominate the leaderboard.
When NOT to use a thinking model
Thinking modes cost 3-10× more per task and add seconds-to-minutes of latency. For grade-school arithmetic, simple algebra, and structured numeric extraction, the standard models are already above 95% accuracy. Reserve thinking modes for genuinely hard problems where the alternative would be dedicated solver software.
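The cost/latency trade-off above amounts to a simple routing rule: send easy tasks to a fast non-thinking model and reserve thinking mode for hard problems. A minimal sketch follows; the model names are taken from the list above, but the difficulty labels and the routing function itself are assumptions for illustration, not an OrcaRouter API.

```python
def pick_model(difficulty: str) -> str:
    """Map a rough task-difficulty label to a model tier.

    'easy'   -> grade-school arithmetic, simple algebra, numeric extraction
    'medium' -> symbolic/computational math where latency still matters
    'hard'   -> competition-grade problems (AIME/Putnam level)
    """
    if difficulty == "easy":
        return "claude-sonnet-4-6"       # non-thinking, low latency
    if difficulty == "medium":
        return "gemini-3.1-pro-preview"  # strongest non-thinking option
    return "gpt-5.5-thinking"            # extended thinking, ~5x the cost

print(pick_model("easy"))  # -> claude-sonnet-4-6
print(pick_model("hard"))  # -> gpt-5.5-thinking
```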