LLM Stats Score
Methodology v2.0 · published April 30, 2026 · updated June 10, 2026
The LLM Stats Score is a single number summarizing how capable a large language model is across the dimensions that matter for real workloads. It exists because no single public benchmark captures frontier capability, and because composite scores published elsewhere rely on hand-picked weights that are hard to audit. Instead of a weighted average, the LLM Stats Score is a skill rating: every tracked general-capability benchmark becomes a competition between models, and a TrueSkill rating system aggregates the outcomes.
How it’s computed
Each benchmark in the general-capability set is treated as a multiplayer game: all models with a recorded score on that benchmark are ranked against each other by raw score. A TrueSkill rating system processes these games and maintains, for each model, a skill estimate μ and an uncertainty σ. Models are initialized with a prior derived from their average normalized performance across all tracked benchmarks, then refined through the games. Because TrueSkill consumes rank order within each benchmark rather than raw values, the score is invariant to each benchmark’s scale and grading scheme.
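As a rough illustration of how one benchmark becomes a ranked game, the sketch below turns raw scores into rank order. The function name, the model names, and the tie handling (equal scores share a rank) are assumptions made for this example, as is the premise that a higher raw score is better; none of them are taken from the LLM Stats implementation.

```python
def ranks_from_scores(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank on one benchmark (0 = best).

    Assumes higher raw scores are better; equal scores share a rank.
    """
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {value: position for position, value in enumerate(distinct)}
    return {model: rank_of[value] for model, value in scores.items()}


# Hypothetical results for one benchmark, reported on a 0-100 scale.
percent = {"model-a": 91.2, "model-b": 88.0, "model-c": 88.0, "model-d": 70.5}

# The same results on a 0-1 scale yield the same game, because only
# the ordering of the scores is used.
fraction = {model: score / 100 for model, score in percent.items()}

game = ranks_from_scores(percent)
same_game = ranks_from_scores(fraction)
```

Here `game` and `same_game` are identical, which is the scale invariance described above: rescaling a benchmark changes none of the ranks.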
The published LLM Stats Score is the conservative rating μ − 3σ. A model with few benchmark results carries high uncertainty and is ranked conservatively until it accumulates more evidence.
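A small sketch of the conservative rating, using made-up μ and σ values, shows why a higher mean alone does not guarantee a higher published score. The `Rating` class and the two example models are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    mu: float     # skill estimate
    sigma: float  # uncertainty around the estimate

    @property
    def displayed(self) -> float:
        """Conservative rating: mu - 3 * sigma."""
        return self.mu - 3 * self.sigma


# Hypothetical ratings, chosen only to illustrate the ordering effect.
well_covered = Rating(mu=30.0, sigma=1.5)   # many benchmark results
thin_coverage = Rating(mu=33.0, sigma=4.0)  # higher mean, few results

print(well_covered.displayed)   # 25.5
print(thin_coverage.displayed)  # 21.0
```

The model with the higher mean ends up lower on the leaderboard until additional benchmark results reduce its σ.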
Rating system parameters
| Parameter | Value | Meaning |
|---|---|---|
| Initial rating (μ₀) | 25 + 5g | Each model starts at μ = 25, shifted by g — its cross-benchmark z-score — so the prior reflects overall performance before any games are played. |
| Initial uncertainty (σ₀) | 25 / 3 ≈ 8.33 | Standard TrueSkill prior. Uncertainty shrinks as a model plays more benchmark games. |
| Rating passes | 3 | Every benchmark game is replayed three times so ratings converge regardless of game order. |
| Minimum models per benchmark | 3 | A benchmark only counts as a game once at least 3 models have a recorded score on it. |
| Draw probability | 0.05 | Probability assigned to ties between adjacent models in a game. |
| Displayed score | μ − 3σ | The conservative rating. Models must accumulate benchmark evidence — not just a high mean — to rank highly. |
| Confidence interval | μ ± 2σ | Taken directly from the TrueSkill posterior. |
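The prior in the first two rows can be sketched as follows. The page does not spell out how the cross-benchmark z-score g is computed, so this example assumes one plausible reading: standardize scores within each benchmark (population standard deviation), then average each model's z-scores over the benchmarks it appears on. The function names and sample data are hypothetical.

```python
from statistics import mean, pstdev

MU_BASE = 25.0        # base initial rating from the table
PRIOR_SCALE = 5.0     # multiplier applied to g
SIGMA_0 = 25.0 / 3    # initial uncertainty, about 8.33


def cross_benchmark_z(results: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average per-benchmark z-score for each model (assumed definition of g).

    `results` maps benchmark name -> {model name: raw score}.
    """
    per_model: dict[str, list[float]] = {}
    for scores in results.values():
        values = list(scores.values())
        center, spread = mean(values), pstdev(values)
        for model, score in scores.items():
            # A benchmark where every model scores the same carries no signal.
            z = (score - center) / spread if spread > 0 else 0.0
            per_model.setdefault(model, []).append(z)
    return {model: mean(zs) for model, zs in per_model.items()}


def initial_ratings(results: dict[str, dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Return model -> (mu_0, sigma_0) with mu_0 = 25 + 5g."""
    return {
        model: (MU_BASE + PRIOR_SCALE * g, SIGMA_0)
        for model, g in cross_benchmark_z(results).items()
    }


# Hypothetical scores on two benchmarks with different scales.
results = {
    "bench-x": {"model-a": 90.0, "model-b": 80.0, "model-c": 70.0},
    "bench-y": {"model-a": 0.9, "model-b": 0.8, "model-c": 0.7},
}
priors = initial_ratings(results)
```

Under these assumptions the middle model starts at about μ = 25, the strongest model starts above it, and the weakest below it, all with the same initial σ.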
What goes in (and what doesn’t)
Two sources feed the score. First, catalog results: benchmark scores recorded for each model, sourced from lab scorecards, papers, and official posts, or measured by the LLM Stats team. Second, verified evaluation runs executed on our own infrastructure, which also appear on community leaderboards.
Each model-benchmark pair has exactly one canonical number. When LLM Stats verifies a result, the verified value replaces the self-reported one and is used everywhere. Where no verification exists yet, the lab-reported number is used and labeled as self-reported on the model page, with a link to its source.
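The precedence rule can be sketched as a small selection step. The `Result` record, its field names, and the choice to keep the first record seen when two records have the same provenance are assumptions for illustration; the page only specifies that a verified value wins over a self-reported one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    score: float
    verified: bool  # True for a verified run, False for a self-reported score


def canonical_results(results: list[Result]) -> dict[tuple[str, str], Result]:
    """Pick one result per (model, benchmark), preferring verified values."""
    chosen: dict[tuple[str, str], Result] = {}
    for result in results:
        key = (result.model, result.benchmark)
        current = chosen.get(key)
        # Take the first record seen, and upgrade to a verified one if it appears.
        if current is None or (result.verified and not current.verified):
            chosen[key] = result
    return chosen


# Hypothetical records: one pair has both kinds of result, one has only a
# self-reported score.
records = [
    Result("model-a", "bench-x", 88.0, verified=False),
    Result("model-a", "bench-x", 85.4, verified=True),
    Result("model-b", "bench-x", 79.1, verified=False),
]
canonical = canonical_results(records)
```

In this example `model-a` is represented by the verified 85.4, while `model-b` keeps its self-reported 79.1 because nothing has replaced it.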
Benchmarks are excluded from rating games when fewer than 3 models have a recorded score on them, since a ranking between fewer participants carries too little signal.
Category indexes
The same rating system runs independently per category over the benchmarks tagged with it. The LLM Stats Score is the rating for the general category; Coding, Agent, Reasoning, and the other indexes are separate computations over their own benchmark sets. The overall score is not an average of the category indexes.
| Index | Benchmark category |
|---|---|
| LLM Stats Score | general |
| Reasoning Index | reasoning |
| Coding Index | code |
| Agent Index | agents |
| Math Index | math |
| Long Context Index | long_context |
| Vision Index | vision |
A model appears in an index once it has at least one qualifying benchmark result in that category. Models without one are shown as unrated (“—”) rather than scored at zero.
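The rated-versus-unrated rule might look like the sketch below. It assumes that a "qualifying" benchmark is one tagged with the category that also meets the 3-model minimum; the function names, benchmark names, and rating values are hypothetical.

```python
UNRATED = "\u2014"  # the dash shown for models without a rating in an index


def models_in_index(
    category: str,
    benchmark_tags: dict[str, set[str]],
    results: dict[str, dict[str, float]],
    min_models: int = 3,
) -> set[str]:
    """Models with at least one result on a qualifying benchmark in `category`."""
    rated: set[str] = set()
    for benchmark, scores in results.items():
        tagged = category in benchmark_tags.get(benchmark, set())
        if tagged and len(scores) >= min_models:
            rated.update(scores)
    return rated


def display(model: str, index_ratings: dict[str, float]) -> str:
    """Format a model's index value, using the dash for unrated models."""
    rating = index_ratings.get(model)
    return UNRATED if rating is None else f"{rating:.1f}"


# Hypothetical tags and results.
benchmark_tags = {"bench-code-1": {"code"}, "bench-code-2": {"code"}, "bench-math-1": {"math"}}
results = {
    "bench-code-1": {"model-a": 71.0, "model-b": 64.5, "model-c": 58.2},
    "bench-code-2": {"model-a": 40.0, "model-d": 35.0},  # only 2 models: excluded
    "bench-math-1": {"model-a": 90.0, "model-b": 85.0, "model-c": 80.0},
}
coding_models = models_in_index("code", benchmark_tags, results)

# Hypothetical Coding Index values for the rated models.
coding_index = {"model-a": 24.3, "model-b": 19.8, "model-c": 15.1}
```

With this data `model-d` has a coding result only on a benchmark that is too small to count, so `display("model-d", coding_index)` returns the dash rather than a zero.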
Versioning
Methodology versions are dated. Version 1.0 (published April 30, 2026) described an earlier fixed-weight composite design; version 2.0 (June 10, 2026) documents the TrueSkill rating system in production today. When the rating system or benchmark eligibility rules change, we increment the version and update this page.
Current version: 2.0 (June 10, 2026).
Citation
LLM Stats (2026). LLM Stats Score (v2.0). LLM Stats. https://llm-stats.com/methodology/llm-stats-score
FAQ
What is the LLM Stats Score?
The LLM Stats Score is a skill rating that summarizes a model's overall capability. Every general-capability benchmark we track is treated as a multiplayer game in which models are ranked by their score, and a TrueSkill rating system processes those games to produce a skill estimate (mu) and uncertainty (sigma) per model. The displayed score is the conservative rating mu − 3·sigma, so models must demonstrate ability across real benchmark evidence to rank highly.
How is it different from other composite scores?
Three things: (1) it has no hand-picked benchmark weights — each benchmark contributes as a ranked game, which makes the score scale-invariant and robust to benchmarks with different score ranges; (2) it is uncertainty-aware — models with thin benchmark coverage carry high sigma and are ranked conservatively; (3) the underlying per-benchmark scores are linkable from each model page, with provenance labeled, so anyone can audit the inputs.
How are benchmarks weighted?
There are no explicit per-benchmark weights. Each benchmark is one game per rating pass, and because TrueSkill consumes rank order rather than raw values, a benchmark's score scale doesn't matter — only how models rank against each other on it. Benchmarks with more participating models produce more decisive rating updates. Benchmarks with fewer than 3 scored models are excluded.
Do self-reported scores count toward the score?
Yes, with clear provenance rules. Per model and benchmark there is only ever one canonical number. When LLM Stats verifies a result on its own infrastructure, the verified value replaces the self-reported one and is used everywhere. Where no verification exists yet, the lab-reported number is used at face value and is labeled as self-reported on the model page. Verified results always take precedence over self-reported ones for the same model and benchmark.
Why does a model show “—” in some category indexes?
Each category index (Coding, Agent, Reasoning, …) is an independent TrueSkill computation over the benchmarks tagged with that category. A “—” means the model is unrated in that category — it has no recorded score on any qualifying benchmark in it — not that it scored zero. The overall LLM Stats Score is computed separately over the general benchmark set, so a model can have an overall score while being unrated in individual categories.
How often is the score updated?
Continuously. Ratings are recomputed every few minutes from the latest benchmark data, and new benchmark scores or verified evaluation runs are folded in as soon as they are ingested. Methodology versions are dated — when the rating system itself changes, we publish a new version of this page.
How should I cite the LLM Stats Score?
LLM Stats (2026). LLM Stats Score (v2.0). LLM Stats. https://llm-stats.com/methodology/llm-stats-score
Can I reproduce the score myself?
Yes. Each component benchmark score is linked from the model page and raw scores are exposed via the public API. The rating system is standard TrueSkill with the parameters documented above: initialize each model at mu = 25 + 5g (g = cross-benchmark z-score), rank models within each qualifying benchmark, run 3 rating passes, and report mu − 3·sigma.
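The orchestration around the rating update could be sketched as below. This is only the outer loop: the TrueSkill update itself is passed in as a function, since a faithful reproduction would use a real TrueSkill implementation (for example the open-source `trueskill` Python package) configured with the documented draw probability of 0.05. The page does not state the remaining TrueSkill parameters (β and τ), so those would have to be assumed. Every name here is hypothetical, and `placeholder_update` is explicitly not TrueSkill.

```python
from typing import Callable

Rating = tuple[float, float]   # (mu, sigma)
Game = dict[str, int]          # model -> rank within one benchmark, 0 = best
RateGame = Callable[[dict[str, Rating], Game], dict[str, Rating]]


def run_rating_passes(
    priors: dict[str, Rating],
    games: list[Game],
    rate_game: RateGame,
    passes: int = 3,
) -> dict[str, Rating]:
    """Replay every benchmark game `passes` times, carrying ratings forward."""
    ratings = dict(priors)
    for _ in range(passes):
        for ranks in games:
            current = {model: ratings[model] for model in ranks}
            ratings.update(rate_game(current, ranks))
    return ratings


def leaderboard(ratings: dict[str, Rating]) -> list[tuple[str, float]]:
    """Sort models by the displayed score mu - 3 * sigma, best first."""
    scored = [(model, mu - 3 * sigma) for model, (mu, sigma) in ratings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def placeholder_update(current: dict[str, Rating], ranks: Game) -> dict[str, Rating]:
    """NOT TrueSkill: keeps mu and shrinks sigma by 10% so the loop can run.

    Swap in a real TrueSkill rating call here to reproduce actual scores.
    """
    return {model: (mu, sigma * 0.9) for model, (mu, sigma) in current.items()}


# Hypothetical priors (mu = 25 + 5g) and a single three-model game.
priors = {"model-a": (27.0, 8.0), "model-b": (25.0, 8.0), "model-c": (23.0, 8.0)}
games = [{"model-a": 0, "model-b": 1, "model-c": 2}]
final = run_rating_passes(priors, games, placeholder_update)
board = leaderboard(final)
```

Keeping the update pluggable separates the parts the page documents (priors, game construction, three passes, the μ − 3σ display) from the part that depends on the TrueSkill implementation and its unstated parameters.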