This leaderboard combines two independent signals: arena performance and benchmark scores. Arena rankings use the conservative TrueSkill rating (μ − 3σ), calculated from blind human voting in the coding arena. Each generation gives the same prompt to 4 randomly sampled models. Users see the live outputs (rendered websites, playable games, animated visualizations) and pick the best one without knowing which model made it. Blind voting keeps brand bias out of the result, so votes reflect actual output quality.
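To make the conservative rating concrete, here is a minimal Python sketch of μ − 3σ. The model names and rating values are hypothetical, and the defaults shown (μ = 25, σ = 25/3) are the standard TrueSkill starting values, not figures confirmed for this leaderboard.

```python
from dataclasses import dataclass

# Standard TrueSkill starting values for an unrated player (assumed here).
DEFAULT_MU = 25.0
DEFAULT_SIGMA = 25.0 / 3.0


@dataclass(frozen=True)
class Rating:
    mu: float     # estimated skill
    sigma: float  # uncertainty of that estimate


def conservative_rating(r: Rating) -> float:
    """Lower-bound skill estimate: mu - 3 * sigma."""
    return r.mu - 3.0 * r.sigma


# Hypothetical ratings, for illustration only.
ratings = {
    "model-a": Rating(mu=30.0, sigma=1.5),  # many votes, low uncertainty
    "model-b": Rating(mu=32.0, sigma=4.0),  # higher mean, but few votes
    "new-model": Rating(DEFAULT_MU, DEFAULT_SIGMA),  # no votes yet
}

# Sort by the conservative rating, best first.
leaderboard = sorted(
    ratings,
    key=lambda name: conservative_rating(ratings[name]),
    reverse=True,
)
```

In this sketch model-a outranks model-b despite a lower μ, because its rating is far more certain. A model with no votes starts at roughly 0 and climbs as votes reduce σ.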
The 7 coding arenas cover distinct real-world tasks: React website generation (the most popular), HTML5 Canvas game development, p5.js creative coding and animation, D3.js data visualization, Three.js 3D scene creation, SVG illustration, and Tone.js MIDI composition. A model needs to perform well across multiple arenas to rank highly — single-arena specialists get averaged down.
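The averaging effect can be illustrated with a short Python sketch. It assumes an unweighted mean over the seven arenas; the actual aggregation may differ. The arena keys and all rating values are hypothetical.

```python
from statistics import mean

# Hypothetical short keys for the seven arenas listed above.
ARENAS = ["react", "canvas", "p5js", "d3", "threejs", "svg", "tonejs"]


def overall_rating(per_arena: dict[str, float]) -> float:
    """Unweighted mean of per-arena ratings (an assumed aggregation)."""
    return mean(per_arena[arena] for arena in ARENAS)


# A specialist that dominates one arena and is weak everywhere else.
specialist = {arena: 12.0 for arena in ARENAS}
specialist["react"] = 40.0

# A generalist that is solid in every arena.
generalist = {arena: 24.0 for arena in ARENAS}

specialist_score = overall_rating(specialist)  # (40 + 6 * 12) / 7 = 16.0
generalist_score = overall_rating(generalist)  # 24.0
```

Under this assumption the generalist ranks higher even though the specialist holds the single best arena rating.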
Benchmark scores come from evaluations like SWE-bench Verified (real GitHub issue resolution), HumanEval (function-level code generation), and LiveCodeBench (competitive programming). These measure different coding skills: SWE-bench tests multi-file debugging in real repositories, HumanEval tests whether short, self-contained functions pass their unit tests, and LiveCodeBench tests problem-solving under constraints. We source scores from official model cards and independent reproductions.
The final ranking weights arena performance heavily because it measures end-to-end coding ability on open-ended tasks — the kind of work developers actually use AI for. Benchmark scores provide a cross-check and help differentiate models with similar arena ratings. Rankings update continuously: arena scores shift as new votes come in, and benchmark columns update when new evaluation results are published.
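One way to picture "weights arena performance heavily" is a weighted blend of normalized scores. The Python sketch below is an illustration under stated assumptions, not the leaderboard's actual formula: the 0.8 arena weight, the min-max normalization, and all model names and values are hypothetical.

```python
ARENA_WEIGHT = 0.8  # hypothetical; the real weighting is not published here


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.5 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def final_scores(
    arena: dict[str, float],
    benchmark: dict[str, float],
    arena_weight: float = ARENA_WEIGHT,
) -> dict[str, float]:
    """Blend normalized arena and benchmark scores, arena-heavy."""
    a, b = min_max(arena), min_max(benchmark)
    return {
        name: arena_weight * a[name] + (1.0 - arena_weight) * b[name]
        for name in arena
    }


# Hypothetical inputs: model-a and model-b are nearly tied in the arena.
arena = {"model-a": 25.5, "model-b": 25.4, "model-c": 18.0}
benchmark = {"model-a": 60.0, "model-b": 72.0, "model-c": 75.0}

scores = final_scores(arena, benchmark)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the benchmark signal separates the two models with near-identical arena ratings, putting model-b ahead of model-a. Model-c stays last despite the best benchmark score, because the arena term dominates.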