AI Benchmarks 2026

Name: AI & LLM Benchmark Results
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Compare 300+ AI and LLM benchmarks in one place — reasoning, coding, math, vision, tool use and more. Every benchmark links to a live leaderboard with independently verified model scores, updated continuously.

Every AI benchmark, with a live leaderboard

LLM Stats indexes 528+ AI and LLM benchmarks across reasoning, coding, math, vision, tool use and domain knowledge. Each benchmark below opens a live leaderboard ranking 300+ models by independently verified score — read the scoring methodology or browse the full LLM Leaderboard.

Reasoning

MMLU-Pro — A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, elimin…
MMLU — Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and prof…
Humanity's Last Exam — Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to…
MATH — MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem…
HumanEval — A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing…
MMMU — MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and…
MMMU-Pro — A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering text-only answerable que…
AIME 2024 — American Invitational Mathematics Examination 2024, consisting of 30 challenging mathematical reasoning problems from AIME I and AIME II competitions…
LiveCodeBench v6 — LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems fro…
BrowseComp — BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entang…
GSM8k — Grade School Math 8K, a dataset of 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elemen…
MMMLU — Multilingual Massive Multitask Language Understanding dataset released by OpenAI, featuring professionally translated MMLU test questions across 14 l…
MMLU-Redux — An improved version of the MMLU benchmark featuring manually re-annotated questions to identify and correct errors in the original dataset. Provides…
SimpleQA — SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains…
CharXiv-R — CharXiv-R is the reasoning component of the CharXiv benchmark, focusing on complex reasoning questions that require synthesizing information across v…
ARC-C — The AI2 Reasoning Challenge (ARC) Challenge Set is a multiple-choice question-answering benchmark containing grade-school level science questions tha…
Tau2 Telecom — τ²-Bench telecom domain evaluates conversational agents in a dual-control environment modeled as a Dec-POMDP, where both agent and user use tools in…
MBPP — MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmer…
AI2D — AI2D is a dataset of 4,903 illustrative diagrams from grade school natural sciences (such as food webs, human physiology, and life cycles) with over…
MGSM — MGSM (Multilingual Grade School Math) is a benchmark of 250 grade-school math problems manually translated from…
SuperGPQA — SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. The benchmark contains 25…
DROP — DROP (Discrete Reasoning Over Paragraphs) is a reading comprehension benchmark requiring discrete reasoning over paragraph content. It contains crowd…
MMLU-ProX — Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and pr…
SWE-bench Multilingual — A multilingual benchmark for issue resolving in software engineering that covers Java, TypeScript, JavaScript, Go, Rust, C, and C++. Contains 1,632 h…
SWE-Bench Pro — SWE-Bench Pro is an advanced version of SWE-Bench that evaluates language models on complex, real-world software engineering tasks requiring extended…
HellaSwag — A challenging commonsense natural language inference dataset that uses Adversarial Filtering to create questions trivial for humans (>95% accuracy) b…
Arena Hard — Arena-Hard-Auto is an automatic evaluation benchmark for instruction-tuned LLMs consisting of 500 challenging real-world prompts curated by BenchBuil…
Tau2 Retail — τ²-bench retail domain evaluates conversational AI agents in customer service scenarios within a dual-control environment where both agent and user c…
TAU-bench Retail — A benchmark for evaluating tool-agent-user interaction in retail environments. Tests language agents' ability to handle dynamic conversations with us…
Terminal-Bench — Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates how well agents can handle real-world, end-to-end tas…
VideoMMMU — Video-MMMU evaluates Large Multimodal Models' ability to acquire knowledge from expert-level professional videos across six disciplines through three…
ChartQA — ChartQA is a large-scale benchmark comprising 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed…
MCP Atlas — MCP Atlas is a benchmark for evaluating AI models on scaled tool use capabilities, measuring how well models can coordinate and utilize multiple tool…
t2-bench — t2-bench is a benchmark for evaluating agentic tool use capabilities, measuring how well models can select, sequence, and utilize tools to solve comp…
Tau2 Airline — TAU2 airline domain benchmark for evaluating conversational agents in dual-control environments where both AI agents and users interact with tools in…
TAU-bench Airline — Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability…
MMStar — MMStar is an elite vision-indispensable multimodal benchmark comprising 1,500 challenge samples meticulously selected by humans to evaluate 6 core ca…
PolyMATH — Polymath is a challenging multi-modal mathematical reasoning benchmark designed to evaluate the general cognitive reasoning abilities of Multi-modal…
Winogrande — WinoGrande: An Adversarial Winograd Schema Challenge at Scale. A large-scale dataset of 44,000 pronoun resolution problems designed to test machine c…
BIG-Bench Hard — BIG-Bench Hard (BBH) is a subset of 23 challenging BIG-Bench tasks selected because prior language model evaluations did not outperform average human…
Multi-IF — Multi-IF benchmarks LLMs on multi-turn and multilingual instruction following. It expands upon IFEval by incorporating multi-turn sequences and trans…
Toolathlon — Tool Decathlon is a comprehensive benchmark for evaluating AI agents' ability to use multiple tools across diverse task categories. It measures profi…
BFCL-v3 — Berkeley Function Calling Leaderboard v3 (BFCL-v3) is an advanced benchmark that evaluates large language models' function calling capabilities throu…
C-Eval — C-Eval is a comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese co…
MMBench-V1.1 — Version 1.1 of MMBench, an improved bilingual benchmark for assessing multi-modal capabilities of vision-language models through multiple-choice ques…
TriviaQA — A large-scale reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs author…
TruthfulQA — TruthfulQA is a benchmark to measure whether language models are truthful in generating answers to questions. It comprises 817 questions that span 38…
MVBench — A comprehensive multi-modal video understanding benchmark covering 20 challenging video tasks that require temporal understanding beyond single-frame…
AIME 2026 — All 30 problems from the 2026 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with…
ARC-AGI v2 — ARC-AGI-2 is an upgraded benchmark for measuring abstract reasoning and problem-solving abilities in AI systems through visual grid transformation ta…
Arena-Hard v2 — Arena-Hard-Auto v2 is a challenging benchmark consisting of 500 carefully curated prompts sourced from Chatbot Arena and WildChat-1M, designed to eva…
CodeForces — A competitive programming benchmark using problems from the CodeForces platform. The benchmark evaluates code generation capabilities of LLMs on algo…
Hallusion Bench — A comprehensive benchmark designed to evaluate image-context reasoning in large visual-language models (LVLMs) by challenging models with 346 images…
IMO-AnswerBench — IMO-AnswerBench is a benchmark for evaluating mathematical reasoning capabilities on International Mathematical Olympiad (IMO) problems, focusing on…
OmniDocBench 1.5 — OmniDocBench 1.5 is a comprehensive benchmark for evaluating multimodal large language models on document understanding tasks, including OCR, documen…
FrontierMath — A benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians, covering most major…
Global-MMLU-Lite — A lightweight version of Global MMLU benchmark that evaluates language models across multiple languages while addressing cultural and linguistic bias…
LiveBench 20241125 — LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on rec…
Video-MME — Video-MME is the first-ever comprehensive evaluation benchmark of Multi-modal Large Language Models (MLLMs) in video analysis. It features 900 videos…
AA-LCR — Artificial Analysis Long Context Reasoning benchmark.
BrowseComp-zh — A high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web, consisting of 289 multi-hop questions spanning 1…
CharXiv-D — CharXiv-D is the descriptive questions subset of the CharXiv benchmark, designed to assess multimodal large language models' ability to extract basic…
GDPval-AA — GDPval-AA is an evaluation of AI model performance on economically valuable knowledge work tasks across professional domains including finance, legal…
HiddenMath — Google DeepMind's internal mathematical reasoning benchmark that introduces novel problems not encountered during model training to evaluate true mat…
LiveBench — LiveBench is a challenging, contamination-limited LLM benchmark that addresses test set contamination by releasing new questions monthly based on rec…
BBH — Big-Bench Hard (BBH) is a suite of 23 challenging tasks selected from BIG-Bench for which prior language model evaluations did not outperform the ave…
Global PIQA — Global PIQA is a multilingual commonsense reasoning benchmark that evaluates physical interaction knowledge across 100 languages and cultures. It tes…
MedXpertQA — A comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning, featuring 4,460 questions spanning 17 specialties and 11…
MT-Bench — MT-Bench is a challenging multi-turn benchmark that measures the ability of large language models to engage in coherent, informative, and engaging co…
SciCode — SciCode is a research coding benchmark curated by scientists that challenges language models to code solutions for scientific problems. It contains 3…
BFCL — The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive and executable function call evaluation dedicated to assessing Large Lang…
BIG-Bench Extra Hard — BIG-Bench Extra Hard (BBEH) is a challenging benchmark that replaces each task in BIG-Bench Hard with a novel task that probes similar reasoning capa…
BLINK — BLINK: Multimodal Large Language Models Can See but Not Perceive. A benchmark for multimodal language models focusing on core visual perception abili…
Graphwalks BFS <128k — A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length un…
Graphwalks parents <128k — A graph reasoning benchmark that evaluates language models' ability to find parent nodes in graphs with context length under 128k tokens, requiring u…
MMMU (val) — Validation set of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark. Features college-level multimodal questions across 6…
MuirBench — A comprehensive benchmark for robust multi-image understanding capabilities of multimodal LLMs. Consists of 12 diverse multi-image tasks involving 10…
PIQA — PIQA (Physical Interaction: Question Answering) is a benchmark dataset for physical commonsense reasoning in natural language. It tests AI systems' a…
AGIEval — A human-centric benchmark for evaluating foundation models on standardized exams including college entrance exams (Gaokao, SAT), law school admission…
BoolQ — BoolQ is a reading comprehension dataset for yes/no questions containing 15,942 naturally occurring examples. Each example consists of a question, pa…
COLLIE — COLLIE is a grammar-based framework for systematic construction of constrained text generation tasks. It allows specification of rich, compositional…
HumanEval+ — Enhanced version of HumanEval that extends the original test cases by 80x using EvalPlus framework for rigorous evaluation of LLM-synthesized code fu…
EgoSchema — A diagnostic benchmark for very long-form video language understanding consisting of over 5,000 human-curated multiple-choice questions based on 3-min…
HMMT Feb 26 — HMMT February 2026 is a math competition benchmark based on problems from the Harvard-MIT Mathematics Tournament, testing advanced mathematical probl…
LiveCodeBench v5 — LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems fro…
MMBench — A bilingual benchmark for assessing multi-modal capabilities of vision-language models through multiple-choice questions in both English and Chinese,…
OJBench — OJBench is a competition-level code benchmark designed to assess the competitive-level code reasoning abilities of large language models. It comprise…
OpenAI-MRCR: 2 needle 128k — Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context.…
VITA-Bench — VITA-Bench evaluates AI agents on real-world virtual task automation, measuring their ability to complete complex multi-step tasks in simulated envir…
AndroidWorld_SR — AndroidWorld Success Rate (SR) benchmark: a dynamic benchmarking environment for autonomous agents operating on Android devices. Evaluates agents on…
ARC-E — ARC-E (AI2 Reasoning Challenge - Easy Set) is a subset of grade-school level, multiple-choice science questions that requires knowledge and reasoning…
DeepPlanning — DeepPlanning evaluates LLMs on complex multi-step planning tasks requiring long-horizon reasoning, goal decomposition, and strategic decision-making.
ECLeKTic — A multilingual closed-book question answering dataset that evaluates cross-lingual knowledge transfer in large language models across 12 languages, u…
Graphwalks BFS >128k — A graph reasoning benchmark that evaluates language models' ability to perform breadth-first search (BFS) operations on graphs with context length ov…
Natural2Code — NaturalCodeBench (NCB) is a challenging code benchmark designed to mirror the complexity and variety of real-world coding tasks. It comprises 402 hig…
WideSearch — WideSearch is an agentic search benchmark that evaluates models' ability to perform broad, parallel search operations across multiple sources. It tes…
Wild Bench — WildBench is an automated evaluation framework that benchmarks large language models using 1,024 challenging, real-world tasks selected from over one…
ZebraLogic — ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from con…
ARC-AGI — The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a benchmark designed to test general intelligence and abstract…
Bird-SQL (dev) — BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQLs) is a comprehensive text-to-SQL benchmark containing 12,751 question-SQL pairs across…
ComplexFuncBench — ComplexFuncBench is a benchmark designed to evaluate large language models' capabilities in handling complex function calling scenarios. It encompass…
Finance Agent — Finance Agent is a benchmark for evaluating AI models on agentic financial analysis tasks, testing their ability to process financial data, perform c…
Graphwalks parents >128k — A graph reasoning benchmark that evaluates language models' ability to find parent nodes in graphs with context length over 128k tokens, testing long…
MRCR — MRCR (Multi-Round Coreference Resolution) is a synthetic long-context reasoning task where models must navigate long conversations to reproduce speci…
MRCR v2 — MRCR v2 (Multi-Round Coreference Resolution version 2) is an enhanced version of the synthetic long-context reasoning task. It extends the original M…
Natural Questions — Natural Questions is a question answering dataset featuring real anonymized queries issued to the Google search engine. It contains 307,373 training exam…
V* — A visual reasoning benchmark evaluating multimodal inference under challenging spatial and grounded tasks.
AMC_2022_23 — American Mathematics Competition problems from the 2022-23 academic year, consisting of multiple-choice mathematics competition problems designed for…
CMMLU — CMMLU (Chinese Massive Multitask Language Understanding) is a comprehensive Chinese benchmark that evaluates the knowledge and reasoning capabilities…
CountBench — CountBench evaluates object counting capabilities in visual understanding.
DeepSearchQA — DeepSearchQA is a benchmark for evaluating deep search and question-answering capabilities, testing models' ability to perform multi-hop reasoning an…
MATH (CoT) — MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem…
Multi-SWE-Bench — A multilingual benchmark for issue resolving that evaluates Large Language Models' ability to resolve software issues across diverse programming ecos…
Seal-0 — Seal-0 is a benchmark for evaluating agentic search capabilities, testing models' ability to navigate and retrieve information using tools.
SWE-Lancer (IC-Diamond subset) — SWE-Lancer (IC-Diamond subset) is a benchmark of real-world freelance software engineering tasks from Upwork, ranging from $50 bug fixes to $32,000 f…
Tau-bench — τ-bench: A benchmark for tool-agent-user interaction in real-world domains. Tests language agents' ability to interact with users and follow domain-s…
TheoremQA — A theorem-driven question answering dataset containing 800 high-quality questions covering 350+ theorems from Math, Physics, EE&CS, and Finance. Desi…
ZEROBench — ZEROBench is a challenging vision benchmark designed to test models on zero-shot visual understanding tasks.
BFCL v2 — Berkeley Function Calling Leaderboard (BFCL) v2 is a comprehensive benchmark for evaluating large language models' function calling capabilities. It…
BrowseComp Long Context 128k — A challenging benchmark for evaluating web browsing agents' ability to persistently navigate the internet and find hard-to-locate, entangled informat…
DynaMath — A multimodal mathematics and reasoning benchmark focused on dynamic visual problem solving.
Global-MMLU — A comprehensive multilingual benchmark covering 42 languages that addresses cultural and linguistic biases in evaluation, with improved translation q…
Multilingual MMLU — MMLU-ProX is a comprehensive multilingual benchmark covering 29 typologically diverse languages, building upon MMLU-Pro. Each language version consis…
OpenAI-MRCR: 2 needle 1M — Multi-Round Co-reference Resolution benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in long conversation…
OpenBookQA — OpenBookQA is a question-answering dataset modeled after open book exams for assessing human understanding. It contains 5,957 multiple-choice element…
ZEROBench-Sub — ZEROBench-Sub is a subset of the ZEROBench benchmark.
Aider — Aider is a comprehensive code editing benchmark based on 133 practice exercises from Exercism's Python repository, designed to evaluate AI models' ab…
AlignBench — AlignBench is a comprehensive multi-dimensional benchmark for evaluating Chinese alignment of Large Language Models. It contains 8 main categories: F…
AlpacaEval 2.0 — AlpacaEval 2.0 is a length-controlled automatic evaluator for instruction-following language models that uses GPT-4 Turbo to assess model responses a…
BabyVision — A benchmark for early-stage visual reasoning and perception on child-like vision tasks.
EvalPlus — A rigorous code synthesis evaluation framework that augments existing datasets with extensive test cases generated by LLM and mutation-based strategi…
LiveCodeBench Pro — LiveCodeBench Pro is an advanced evaluation benchmark for large language models for code that uses Elo ratings to rank models based on their performa…
MathArena Apex — MathArena Apex is a challenging math contest benchmark featuring the most difficult mathematical problems designed to test advanced reasoning and pro…
MBPP+ — MBPP+ is an enhanced version of MBPP (Mostly Basic Python Problems) with significantly more test cases (35x) for more rigorous evaluation. MBPP is a…
MMMUval — Validation set for MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) benchmark, designed to evaluate multimodal models on massiv…
MMMU (validation) — Validation set of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark. Features college-level multimodal questions across 6…
MMT-Bench — MMT-Bench is a comprehensive multimodal benchmark for evaluating Large Vision-Language Models towards multitask AGI. It comprises 31,325 meticulously…
MMVU — MMVU (Multimodal Multi-disciplinary Video Understanding) is a benchmark for evaluating multimodal models on video understanding tasks across multiple…
SlakeVQA — A semantically-labeled knowledge-enhanced dataset for medical visual question answering. Contains 642 radiology images (CT scans, MRI scans, X-rays)…
SWE-Lancer — A benchmark for evaluating large language models on real-world freelance software engineering tasks from Upwork. Contains over 1,400 tasks valued at…
TAU3-Bench — TAU3-Bench is a benchmark for evaluating general-purpose agent capabilities, testing models on multi-turn interactions with simulated user models, re…
TIR-Bench — A tool-calling and multimodal interaction benchmark for testing visual instruction following and execution reliability.
Vending-Bench 2 — Vending-Bench 2 tests longer horizon planning capabilities by evaluating how well AI models can manage a simulated vending machine business over exte…
VLMsAreBlind — A vision-language benchmark that probes blind spots and brittle reasoning in multimodal models.
AITZ_EM — Android-In-The-Zoo (AitZ) benchmark for evaluating autonomous GUI agents on smartphones. Contains 18,643 screen-action pairs with chain-of-action-tho…
Android Control High_EM — Android device control benchmark scored by exact match (EM) on high-level instructions, assessing agent performance on mobile interface tasks.
Android Control Low_EM — Android device control benchmark scored by exact match (EM) on low-level instructions, evaluating autonomous agents on mobile device interaction tasks.
APEX-Agents — APEX-Agents is a benchmark evaluating AI agents on long horizon professional tasks that require sustained reasoning, planning, and execution across c…
API-Bank — A comprehensive benchmark for tool-augmented LLMs that evaluates API planning, retrieval, and calling capabilities. Contains 314 tool-use dialogues w…
Beyond AIME — Beyond AIME is a difficult mathematical reasoning benchmark designed to test deeper reasoning chains and harder decomposition than standard AIME-styl…
BIG-Bench — Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark consisting of 204+ tasks designed to probe large language models and ext…
CLUEWSC — CLUEWSC2020 is the Chinese version of the Winograd Schema Challenge, part of the CLUE benchmark. It focuses on pronoun disambiguation and coreference…
CRAG — CRAG (Comprehensive RAG Benchmark) is a factual question answering benchmark consisting of 4,409 question-answer pairs across 5 domains (finance, spo…
FinQA — A large-scale dataset for numerical reasoning over financial data with question-answering pairs written by financial experts, featuring complex numer…
FullStackBench en — English subset of FullStackBench for evaluating end-to-end software engineering and full-stack development capability.
FullStackBench zh — Chinese subset of FullStackBench for evaluating end-to-end software engineering and full-stack development capability.
GDPval-MM — GDPval-MM is the multimodal variant of the GDPval benchmark, evaluating AI model performance on real-world economically valuable tasks that require p…
Gorilla Benchmark API Bench — APIBench, a comprehensive dataset of over 11,000 instruction-API pairs from HuggingFace, TorchHub, and TensorHub APIs for evaluating language models'…
GraphWalks — GraphWalks is a synthetic multi-hop long-context reasoning benchmark in which a model is given an edge-list representation of a graph and must traver…
LingoQA — A benchmark for multimodal spatial-language understanding and visual-linguistic question answering.
MMBench-Video — A long-form multi-shot benchmark for holistic video understanding that incorporates approximately 600 web videos from YouTube spanning 16 major categ…
MME — A comprehensive evaluation benchmark for Multimodal Large Language Models measuring both perception and cognition abilities across 14 subtasks. Featu…
MMLU (CoT) — Chain-of-Thought variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary m…
MM-Mind2Web — A multimodal web navigation benchmark comprising 2,000 open-ended tasks spanning 137 websites across 31 domains. Each task includes HTML documents pa…
MRCR 1M — MRCR 1M is a variant of the Multi-Round Coreference Resolution benchmark designed for testing extremely long context capabilities with approximately…
Multilingual MGSM (CoT) — Multilingual Grade School Math (MGSM) benchmark evaluates language models' chain-of-thought reasoning abilities across ten typologically diverse lang…
Multipl-E MBPP — MultiPL-E extends the Mostly Basic Python Problems (MBPP) benchmark to 18+ programming languages for evaluating multilingual code generation capabili…
Nuscene — A multimodal benchmark for scene understanding and reasoning over the nuScenes autonomous driving domain.
OfficeQA Pro — OfficeQA Pro evaluates AI models on professional knowledge-work questions and tasks drawn from real office workflows, including document analysis, sp…
PhiBench — PhiBench is an internal benchmark designed to evaluate diverse skills and reasoning abilities of language models, covering a wide range of tasks incl…
PMC-VQA — A medical visual question answering benchmark built on biomedical literature and medical figures.
PopQA — PopQA is an entity-centric open-domain question-answering dataset consisting of 14,000 QA pairs designed to evaluate language models' ability to memo…
RULER — RULER v1 is a synthetic long-context benchmark for measuring how model quality degrades as input length increases. This packaging follows the public…
USAMO25 — The 2025 United States of America Mathematical Olympiad (USAMO) benchmark consists of six challenging mathematical problems requiring rigorous proof-…
VQAv2 — VQAv2 is a balanced Visual Question Answering dataset that addresses language bias by providing complementary images for each question, forcing model…
VQAv2 (val) — VQAv2 is a balanced Visual Question Answering dataset containing open-ended questions about images that require understanding of vision, language, an…
ACEBench — ACEBench is a comprehensive benchmark for evaluating Large Language Models' tool usage capabilities across three primary evaluation types: Normal (ba…
AdvancedIF — AdvancedIF is a rubric-based benchmark measuring complex, multi-turn, and system-prompted instruction following ability, scored with a calibrated LLM…
AIME — American Invitational Mathematics Examination (AIME) benchmark for evaluating mathematical reasoning capabilities of large language models. Contains…
AutoLogi — AutoLogi is an automated method for synthesizing open-ended logic puzzles to evaluate reasoning abilities of Large Language Models. The benchmark add…
BFCL_v3_MultiTurn — Berkeley Function Calling Leaderboard (BFCL) V3 MultiTurn benchmark that evaluates large language models' ability to handle multi-turn and multi-step…
BigCodeBench — A benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks…
Blueprint-Bench 2 — Blueprint-Bench 2 is an agentic spatial reasoning benchmark that evaluates a model's ability to understand, plan, and reason over architectural bluep…
BrowseComp Long Context 256k — BrowseComp is a benchmark for measuring the ability of agents to browse the web, comprising 1,266 questions that require persistently navigating the…
CorpusQA 1M — CorpusQA 1M is a long-context question answering benchmark designed to evaluate models at approximately 1 million token contexts. Models are scored o…
EQ-Bench — EQ-Bench is an LLM-judged test evaluating active emotional intelligence abilities, understanding, insight, empathy, and interpersonal skills. The tes…
FActScore — A fine-grained atomic evaluation metric for factual precision in long-form text generation that breaks generated text into atomic facts and computes…
FlenQA — Flexible Length Question Answering dataset for evaluating the impact of input length on reasoning performance of language models, featuring True/Fals…
FRAMES — Factuality, Retrieval, And reasoning MEasurement Set: a unified evaluation dataset of 824 challenging multi-hop questions for testing retrieval-augm…
FunctionalMATH — A functional variant of the MATH benchmark that tests language models' ability to generalize reasoning patterns across different problem instances, r…
GeneBench — GeneBench is an evaluation focused on multi-stage scientific data analysis in genetics and quantitative biology. Tasks require reasoning about ambigu…
GSM-8K (CoT) — Grade School Math 8K with Chain-of-Thought prompting, featuring 8.5K high-quality linguistically diverse grade school math word problems requiring mu…
HumanEval-Mul — A multilingual variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 o…
LMArena Text Leaderboard — LMArena Text Leaderboard is a blind human preference evaluation benchmark that ranks models based on pairwise comparisons in real-world conversations…
LongCodeBench — LongCodeBench evaluates the code understanding and comprehension abilities of large language models at very long context windows, scaling up to 1M to…
MBPP EvalPlus — MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmer…
MEGA MLQA — MLQA as part of the MEGA (Multilingual Evaluation of Generative AI) benchmark suite. A multi-way aligned extractive QA evaluation benchmark for cross…
MEGA TyDi QA — TyDi QA as part of the MEGA benchmark suite. A question answering dataset covering 11 typologically diverse languages (Arabic, Bengali, English, Finn…
MEGA XCOPA — XCOPA (Cross-lingual Choice of Plausible Alternatives) as part of the MEGA benchmark suite. A typologically diverse multilingual dataset for causal c…
MEGA XStoryCloze — XStoryCloze as part of the MEGA benchmark suite. A cross-lingual story completion task that consists of professionally translated versions of the Eng…
MMAU — A massive multi-task audio understanding and reasoning benchmark comprising 10,000 carefully curated audio clips paired with human-annotated natural…
MMLU-STEM — STEM-focused subset of the Massive Multitask Language Understanding benchmark, evaluating language models on science, technology, engineering, and ma…
MMVet — MM-Vet is an evaluation benchmark that examines large multimodal models on complicated multimodal tasks requiring integrated capabilities. It assesse…
MRCR 128K (8-needle) — MRCR (Multi-Round Coreference Resolution) at 128K context length with 8 needles. Models must navigate long conversations to reproduce specific model…
MuSR — MuSR (Multistep Soft Reasoning) is a benchmark for evaluating language models on multistep soft reasoning tasks specified in natural language narrati…
OmniMath — A Universal Olympiad Level Mathematic Benchmark for Large Language Models containing 4,428 competition-level problems with rigorous human annotation,…
PaperBench — PaperBench is a benchmark for evaluating AI agents on their ability to replicate research papers. It tests models on complex, multi-step workflows in…
PerceptionTest — A novel multimodal video benchmark designed to evaluate perception and reasoning skills of pre-trained models across video, audio, and text modalitie…
PhysicsFinals — PHYSICS is a comprehensive benchmark for university-level physics problem solving, containing 1,297 expert-annotated problems covering six core areas…
PolyMath-en — PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels from easy to hard, ensuring difficulty comp…
Qasper — QASPER is a dataset of 5,049 information-seeking questions and answers anchored in 1,585 NLP research papers. Questions are written by NLP practition…
RepoQA — RepoQA is a benchmark for evaluating long-context code understanding capabilities of Large Language Models through the Searching Needle Function (SNF…
Spider — A large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. Contains 10,181 questions and 5,69…
SWE-Bench Multimodal — SWE-Bench Multimodal extends SWE-Bench to evaluate language models on software engineering tasks that involve visual inputs such as screenshots, UI m…
SWE-bench Verified (Agentic Coding) — SWE-bench Verified is a human-filtered subset of 500 software engineering problems drawn from real GitHub issues across 12 popular Python repositorie…
SWE-bench Verified (Agentless) — A human-validated subset of SWE-bench that evaluates language models' ability to resolve real-world GitHub issues using an agentless approach. The be…
TempCompass — TempCompass is a comprehensive benchmark for evaluating temporal perception capabilities of Video Large Language Models (Video LLMs). It constructs c…
Terminal-Bench 2.1 — Terminal-Bench 2.1 is an updated release of the Terminal-Bench benchmark that tests AI agents' ability to operate a computer via the terminal. It eva…
TydiQA — A multilingual question answering benchmark covering 11 typologically diverse languages with 204K question-answer pairs. Questions are written by peo…
AI2 Reasoning Challenge (ARC) — A dataset of 7,787 genuine grade-school level, multiple-choice science questions assembled to encourage research in advanced question-answering. The…
AMO Bench — AMO Bench is an olympiad-level mathematics benchmark that evaluates advanced mathematical problem-solving and multi-step reasoning on competition-sty…
Arc — The Abstraction and Reasoning Corpus (ARC) is a benchmark designed to measure human-like general fluid intelligence through grid-based reasoning task…
AutomationBench — AutomationBench is a tool-use benchmark that evaluates AI agents on automating real-world workflows, testing their ability to orchestrate tools and c…
Big Bench Audio — Big Bench Audio is an audio reasoning benchmark adapted from a subset of Big Bench Hard, with text questions converted to spoken audio. It evaluates…
BigCodeBench-Full — A comprehensive benchmark that evaluates large language models' ability to solve complex, practical programming tasks via code generation. Contains 1…
BigCodeBench-Hard — BigCodeBench-Hard is a subset of 148 challenging programming tasks from BigCodeBench, designed to evaluate large language models' ability to solve co…
BioMysteryBench — BioMysteryBench evaluates a model's ability to reason through challenging molecular biology problems, reporting performance on a hard subset and on t…
BixBench — BixBench is a benchmark for real-world bioinformatics and computational biology data analysis. It evaluates AI models on multi-step scientific workfl…
CBNSL — Curriculum Learning of Bayesian Network Structures (CBNSL) benchmark for evaluating algorithms that learn Bayesian network structures from data using…
CloningScenarios — CloningScenarios is an expert-level multi-step reasoning benchmark about difficult genetic cloning scenarios in multiple-choice format. It evaluates…
CommonSenseQA — CommonSenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict correct answers. It c…
CorpusQA — CorpusQA is a multi-document, free-form long-context question answering benchmark in which a model must retrieve and reason over information distribu…
CritPT — CritPT is a challenging reasoning benchmark reported by Qwen for evaluating frontier mathematical and critical problem-solving capability.
CRPErelation — Clinical reasoning problems evaluation benchmark for assessing diagnostic reasoning and medical knowledge application capabilities.
CRUXEval-Input-CoT — CRUXEval input prediction task with Chain of Thought (CoT) prompting. Part of the CRUXEval benchmark for code reasoning, understanding, and execution…
CruxEval-O — CruxEval-O is the output prediction task of the CRUXEval benchmark, designed to evaluate code reasoning, understanding, and execution capabilities. I…
CRUXEval-Output-CoT — CRUXEval-O (output prediction) with Chain-of-Thought prompting. Part of the CRUXEval benchmark consisting of 800 Python functions (3-13 lines) design…
CRUX-O — CRUXEval-O (output prediction) is part of the CRUXEval benchmark consisting of 800 Python functions (3-13 lines) designed to evaluate AI models' capa…
DailyOmni — DailyOmni evaluates multimodal models on daily-life video understanding tasks.
DRACO — DRACO is a deep research benchmark that evaluates an agent's ability to gather, synthesize, and reason over information to answer complex research qu…
DS-Arena-Code — Data Science Arena Code benchmark for evaluating LLMs on realistic data science code generation tasks. Tests capabilities in complex data processing,…
FinSearchComp T2&T3 — FinSearchComp T2&T3 is a combined benchmark for evaluating financial search and reasoning capabilities on Tier 2 and Tier 3 tasks, testing models' ab…
FinSearchComp-T3 — FinSearchComp-T3 is a benchmark for evaluating financial search and reasoning capabilities, testing models' ability to retrieve and analyze financial…
French MMLU — French version of MMLU-Pro, a multilingual benchmark for evaluating language models' cross-lingual reasoning capabilities across 14 diverse domains i…
FrontierCode — FrontierCode is Cognition's coding evaluation that tests whether models can pass difficult coding tasks while meeting the standards of high-quality p…
Frontier Science — Frontier Science is a benchmark of exceptionally challenging scientific reasoning problems spanning advanced natural-science domains, designed to tes…
FrontierScience Research — FrontierScience Research is a benchmark evaluating AI models on cutting-edge scientific research questions requiring deep domain expertise, multi-ste…
GDP.pdf — GDP.pdf is a knowledge-work vision benchmark that evaluates models on economically valuable professional tasks presented as visual documents (PDFs),…
GDPval-Rubrics — GDPval-Rubrics evaluates AI model performance on economically valuable knowledge work tasks drawn from the public GDPval dataset. It uses pointwise s…
GPQA Biology — Biology subset of GPQA, containing challenging multiple-choice questions written by domain experts in biology. These Google-proof questions require g…
GPQA Chemistry — Chemistry subset of GPQA, containing challenging multiple-choice questions written by domain experts in chemistry. These Google-proof questions requi…
GPQA Physics — Physics subset of GPQA, containing challenging multiple-choice questions written by domain experts in physics. These Google-proof questions require g…
GSM8K Chat — Grade School Math 8K adapted for chat format evaluation, featuring 8.5K high-quality linguistically diverse grade school math word problems requiring…
HR-Bench (4k) — HR-Bench (4k) evaluates image understanding on high-resolution visual inputs with a 4k setting.
HumanEval-Average — A variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original progr…
HumanEval-ER — A variant of the HumanEval benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original progr…
HumanEval Plus — Enhanced version of HumanEval that extends the original test cases by 80x using EvalPlus framework for rigorous evaluation of LLM-synthesized code fu…
IMO 2025 — IMO 2025 evaluates models on the six problems from the 2025 International Mathematical Olympiad, requiring rigorous proof-based reasoning. Following…
IPhO 2025 — International Physics Olympiad 2025 (theory) comprises all 3 theory problems from the official 2025 IPhO competition. Results are based on blinded hu…
LBPP (v2) — Version 2 of LBPP (Less Basic Python Problems), a Python code-generation benchmark; detailed documentation for this version was not found in official sources.
Legal Agent Benchmark — The Legal Agent Benchmark evaluates AI agents on complex legal work, testing their ability to complete realistic professional legal tasks autonomousl…
LiveCodeBench(01-09) — LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems fro…
LiveCodeBench v5 24.12-25.2 — LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems fro…
LOCA-Bench (256k) — LOCA-Bench is a long-context agentic benchmark. The 256k variant evaluates agents using the official ReAct mode with an environment description lengt…
LongFact Concepts — LongFact is a benchmark for evaluating long-form factuality in large language models. It comprises 2,280 fact-seeking prompts spanning 38 topics, des…
LongFact Objects — LongFact is a benchmark for evaluating long-form factuality in large language models. It comprises 2,280 fact-seeking prompts spanning 38 topics, des…
LSAT — LSAT (Law School Admission Test) benchmark evaluating complex reasoning capabilities across three challenging tasks: analytical reasoning, logical re…
MASK — MASK is a collection of 1,000 questions measuring whether models faithfully report their beliefs when pressured to lie. It operationalizes deception a…
MAVERIX — MAVERIX (Multimodal Audio-Visual Evaluation Reasoning Index) evaluates multimodal models on tasks that demand tight integration of video and audio in…
MBPP ++ base version — MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmer…
MBPP EvalPlus (base) — MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmer…
MBPP pass@1 — MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmer…
MBPP Plus — MBPP (Mostly Basic Python Problems) is a benchmark of 974 crowd-sourced Python programming problems designed to be solvable by entry-level programmer…
MEWC — MEWC is a benchmark that evaluates AI model performance on multi-environment web challenges, testing agents' ability to navigate and complete complex…
MLS-Bench Lite — MLS-Bench Lite is the official 30-task subset of MLS-Bench for evaluating whether AI systems can invent generalizable and scalable machine learning m…
MMAU Music — A subset of the MMAU benchmark focused specifically on music understanding and reasoning tasks. Part of a comprehensive multimodal audio understandin…
MMAU Sound — A subset of the MMAU benchmark focused specifically on environmental sound understanding and reasoning tasks. Part of a comprehensive multimodal audi…
MMAU Speech — A subset of the MMAU benchmark focused specifically on speech understanding and reasoning tasks. Part of a comprehensive multimodal audio understandi…
MM IF-Eval — A challenging multimodal instruction-following benchmark that includes both compose-level constraints for output responses and perception-level const…
MMLU-Base — Base version of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathematics,…
MMLU Chat — Chat-format variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary mathem…
MMLU French — French language variant of the Massive Multitask Language Understanding benchmark, evaluating language models across 57 tasks including elementary ma…
MMLU-redux-2.0 — A curated version of the MMLU benchmark featuring manually re-annotated 5,700 questions across 57 subjects to identify and correct errors in the orig…
MMVetGPT4Turbo — MM-Vet evaluation using GPT-4 Turbo for scoring. This variant of MM-Vet examines large multimodal models on complicated multimodal tasks requiring in…
MotionBench — MotionBench is a benchmark for evaluating multimodal models on motion understanding in videos, testing the ability to comprehend temporal dynamics, m…
MRCR 128K (2-needle) — MRCR (Multi-Round Coreference Resolution) at 128K context length with 2 needles. Models must navigate long conversations to reproduce specific model…
MRCR 128K (4-needle) — MRCR (Multi-Round Coreference Resolution) at 128K context length with 4 needles. Models must navigate long conversations to reproduce specific model…
MRCR 1M (pointwise) — MRCR 1M (pointwise) is a variant of the Multi-Round Coreference Resolution benchmark that uses pointwise evaluation for ultra-long contexts (~1M toke…
MRCR 64K (2-needle) — MRCR (Multi-Round Coreference Resolution) at 64K context length with 2 needles. Models must navigate long conversations to reproduce specific model o…
MRCR 64K (4-needle) — MRCR (Multi-Round Coreference Resolution) at 64K context length with 4 needles. Models must navigate long conversations to reproduce specific model o…
MRCR 64K (8-needle) — MRCR (Multi-Round Coreference Resolution) at 64K context length with 8 needles. Models must navigate long conversations to reproduce specific model o…
NoLiMa 128K — NoLiMa evaluated at a 131072-token context length. Tests latent associative reasoning in long contexts with minimal lexical overlap between questions…
NoLiMa 32K — NoLiMa evaluated at a 32768-token context length. Tests latent associative reasoning in long contexts with minimal lexical overlap between questions…
NoLiMa 64K — NoLiMa evaluated at a 65536-token context length. Tests latent associative reasoning in long contexts with minimal lexical overlap between questions…
NQ — Natural Questions (NQ) benchmark containing real user questions issued to Google search with answers found from Wikipedia, designed for training and…
OJBench (C++) — OJBench (C++) is the C++ subset of OJBench, a competition-level code benchmark that evaluates large language models on programming competition proble…
OlympiadBench — A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. Comprises 8,476 math and physics problems fro…
OmniBench — A novel multimodal benchmark designed to evaluate large language models' ability to recognize, interpret, and reason across visual, acoustic, and tex…
OmniGAIA — OmniGAIA evaluates multimodal perception and reasoning in agentic contexts, testing a model's ability to process diverse inputs and perform complex m…
OpenAI-MRCR: 2 needle 256k — Multi-Round Co-reference Resolution (MRCR) benchmark that tests long-context reasoning by evaluating a model's ability to distinguish between similar…
OpenRCA — OpenRCA is a benchmark for evaluating AI models on root cause analysis tasks. For each failure case, the model receives 1 point if all generated root…
OSWorld Extended — OSWorld is a scalable, real computer environment benchmark for evaluating multimodal agents on open-ended tasks across Ubuntu, Windows, and macOS. It…
PathMCQA — PathMMU is a massive multimodal expert-level benchmark for understanding and reasoning in pathology, containing 33,428 multimodal multi-choice questi…
PostTrainBench — PostTrainBench evaluates a model's ability to autonomously post-train base models. Given pretrain-only base models, the agent must complete the full…
QwenWorldBench — QwenWorldBench is Qwen's internal benchmark for evaluating LLMs as world models that simulate agentic environments across Terminal, SWE, MCP, Search,…
RepoBench — RepoBench is a benchmark for evaluating repository-level code auto-completion systems through three interconnected tasks: RepoBench-R (retrieval of r…
Robust IF — Robust IF evaluates instruction-following robustness on diverse, hard prompts, measuring whether a model reliably adheres to constraints across chall…
RULER 1000K — RULER 1000K evaluates the official 13-task RULER v1 suite at a 1048576-token (1M) context budget.
RULER 128k — RULER 128k evaluates the official 13-task RULER v1 suite at a 131072-token context budget.
RULER 2048K — RULER 2048K evaluates the official 13-task RULER v1 suite at a 2097152-token (2M) context budget.
RULER 512K — RULER 512K evaluates the official 13-task RULER v1 suite at a 524288-token context budget.
RULER 64k — RULER 64k evaluates the official 13-task RULER v1 suite at a 65536-token context budget.
SAT Math — SAT Math benchmark from AGIEval containing standardized mathematics questions from the College Board SAT examination, designed to evaluate mathematic…
ScienceQA — ScienceQA is the first large-scale multimodal science question answering benchmark with 21,208 multiple-choice questions covering 3 subjects (natural…
ScienceQA Visual — ScienceQA Visual is a multimodal science question answering benchmark consisting of 21,208 multiple-choice questions from elementary and high school…
SimpleQA Verified — SimpleQA Verified is a curated, reliability-focused subset of SimpleQA that addresses label noise and redundancy in the original benchmark, measuring…
STEM — A comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning all STEM subjects (Science, Technology, Engineering, Ma…
SuperGLUE — SuperGLUE is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public le…
SWE-bench Verified (Multiple Attempts) — SWE-bench Verified is a human-validated subset of 500 test samples from the original SWE-bench dataset that evaluates AI systems' ability to automati…
Tau3 Airline — τ³-Bench airline domain evaluates agentic models on multi-turn, tool-using customer-support scenarios in a simulated airline booking and reservations…
Tau3 Banking — τ³-Bench banking domain evaluates agentic models on multi-turn, tool-using customer-support scenarios in a simulated retail banking environment.
Tau3 Retail — τ³-Bench retail domain evaluates agentic models on multi-turn, tool-using customer-support scenarios in a simulated online retail environment.
Tau3 Telecom — τ³-Bench telecom domain evaluates agentic models on multi-turn, tool-using customer-support and troubleshooting scenarios in a simulated telecommunic…
Terminus — Terminal-Bench is a benchmark for testing AI agents in real terminal environments, evaluating how well agents can handle real-world, end-to-end tasks…
Uniform Bar Exam — The Uniform Bar Examination (UBE) benchmark evaluates language models on the complete bar exam including multiple-choice Multistate Bar Examination (…
USAMO 2026 — USAMO 2026 evaluates models on the six problems from the 2026 United States of America Mathematical Olympiad, requiring rigorous proof-based reasonin…
VCR_en_easy — Visual Commonsense Reasoning (VCR) benchmark that tests higher-order cognition and commonsense reasoning beyond simple object recognition. Models mus…
VideoHolmes — VideoHolmes evaluates video understanding and reasoning capabilities in multimodal models.
VisuLogic — VisuLogic evaluates logical reasoning capabilities in visual contexts.
VoiceBench Avg — VoiceBench is the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants, evaluating capabilities including gen…
VQAv2 (test) — VQA v2.0 (Visual Question Answering v2.0) is a balanced dataset designed to counter language priors in visual question answering. It consists of comp…
We-Math — We-Math evaluates multimodal models on visual mathematical reasoning, requiring models to understand and solve math problems presented with visual el…
WorldVQA — WorldVQA is a benchmark designed to evaluate atomic vision-centric world knowledge. It assesses models' ability to understand and reason about visual…

General

Include — Include benchmark; specific documentation was not found in official sources.
IFBench — Instruction Following Benchmark that evaluates a model's ability to follow complex instructions.
Aider-Polyglot — A coding benchmark that evaluates LLMs on 225 challenging Exercism programming exercises across C++, Go, Java, JavaScript, Python, and Rust. Models r…
OSWorld — OSWorld is a first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and inte…
OSWorld-Verified — OSWorld-Verified is a verified subset of OSWorld, a scalable real computer environment for multimodal agents supporting task setup, execution-based e…
MultiPL-E — MultiPL-E is a scalable and extensible system for translating unit test-driven code generation benchmarks to multiple programming languages. It exten…
Aider-Polyglot Edit — A challenging multi-language coding benchmark that evaluates models' code editing abilities across C++, Go, Java, JavaScript, Python, and Rust. Conta…
MAXIFE — MAXIFE is a multilingual benchmark evaluating LLMs on instruction following and execution across multiple languages and cultural contexts.
NOVA-63 — NOVA-63 is a multilingual evaluation benchmark covering 63 languages, designed to assess LLM performance across diverse linguistic contexts and tasks.
SimpleVQA — SimpleVQA is a visual question answering benchmark focused on simple queries.
MLVU-M — MLVU-M benchmark
Vibe-Eval — VIBE-Eval is a hard evaluation suite for measuring progress of multimodal language models, consisting of 269 visual understanding prompts with gold-s…
CSimpleQA — Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions. It con…
Nexus — NexusRaven benchmark for evaluating function calling capabilities of large language models in zero-shot scenarios across cybersecurity tools and API…
AA-Index — No official academic documentation found for this benchmark. Extensive research through ArXiv, IEEE/ACL/NeurIPS papers, and university research sites…
Multipl-E HumanEval — MultiPL-E is a scalable and extensible approach to benchmarking neural code generation that translates unit test-driven code generation benchmarks ac…
MultiLF — MultiLF benchmark
Artificial Analysis — Artificial Analysis benchmark evaluates AI models across quality, speed, and pricing dimensions, providing a composite assessment of model capabiliti…
DS-FIM-Eval — DeepSeek's internal Fill-in-the-Middle evaluation dataset for measuring code completion performance improvements in data science contexts
HumanEvalFIM-Average — Average evaluation of HumanEval Fill-in-the-Middle benchmark variants (single-line, multi-line, random-span) for assessing code infilling capabilitie…
Instruct HumanEval — Instruction-based variant of HumanEval benchmark for evaluating large language models' code generation capabilities with functional correctness using…
MME-RealWorld — A comprehensive evaluation benchmark for Multimodal Large Language Models featuring 13,366 high-resolution images and 29,429 question-answer pai…
NMOS — NMOS evaluation benchmark for assessing model performance on specialized tasks
OSWorld Screenshot-only — A variant of the OSWorld benchmark that evaluates multimodal AI agents using only screenshot observations to complete open-e…

Agents

Claw-Eval — Claw-Eval tests real-world agentic task completion across complex multi-step scenarios, evaluating a model's ability to use tools, navigate environme…
NL2Repo — NL2Repo evaluates long-horizon coding capabilities including repository-level understanding, where models must generate or modify code across entire…
SkillsBench — SkillsBench evaluates coding agents on self-contained programming tasks, measuring practical engineering skills across diverse software development s…
ZClawBench — ZClawBench evaluates Claw-style agent task execution quality, measuring a model's ability to autonomously complete complex multi-step coding tasks in…
PinchBench — PinchBench evaluates coding agents on real-world agentic coding tasks, measuring both best-case and average performance across complex software engin…
MiMo Coding Bench — MiMo Coding Bench evaluates coding-agent capabilities on software engineering tasks reported with the MiMo model family.
VIBE-Pro — VIBE-Pro is an advanced version of the VIBE (Visual & Interactive Benchmark for Execution) benchmark that evaluates LLMs on professional-grade full-s…
CC-Bench-V2 Repo Exploration — CC-Bench-V2 Repo Exploration evaluates coding agents on repository-level understanding and navigation, measuring ability to explore, comprehend, and…
CL-bench — CL-bench is an open-source benchmark with its own data and rubrics for evaluating models on coding and agentic tasks, scored using a setup fully alig…
FrontierSWE (Impl.) — FrontierSWE (Impl.) evaluates software engineering implementation ability and reports model ranking on implementation tasks. Lower rank is better.
Kimi Claw 24/7 Bench — Kimi Claw 24/7 Bench is Moonshot AI's in-house benchmark for evaluating long-horizon agentic performance in persistent, multi-day coworking tasks. It…
Kimi Code Bench v2 — Kimi Code Bench v2 is Moonshot AI's in-house benchmark for evaluating coding agents on realistic software engineering tasks across 10+ mainstream pro…
LiveSQLBench — LiveSQLBench evaluates models on generating correct SQL queries against live PostgreSQL databases. The LiveSQLBench-Base-Full v1 dataset contains 600…
MLE-Bench Lite — MLE-Bench Lite evaluates AI agents on machine learning engineering tasks, testing their ability to build, train, and optimize ML models for Kaggle-st…
MM-ClawBench — MM-ClawBench evaluates models on MiniMax's Claw-style agent benchmark, measuring practical agentic task completion quality in real-world OpenClaw usa…
Program Bench — Program Bench evaluates code-generation agents by asking them to recreate a program's behavior from only a compiled binary and documentation. It span…
SWE Atlas - Codebase QnA — SWE Atlas - Codebase QnA evaluates a model's ability to answer questions about real codebases, measuring repository-level comprehension and the abili…
SWE Atlas - Test Writing — SWE Atlas - Test Writing evaluates a model's ability to author meaningful tests for real-world software projects, measuring how well agents can under…
SWE-fficiency — SWE-fficiency is an open-source benchmark and workflow that evaluates language models on optimizing the runtime efficiency of real-world software eng…
VIBE-V2 — VIBE-V2 is an internal benchmark covering pure front-end and full-stack Web, Android, and iOS projects with build-from-scratch tasks. It uses an Agen…
WildClawBench — WildClawBench is an agentic coding benchmark from InternLM/Claw-Eval that reports overall model performance on real-world tool-using development task…

Multimodal

MM-MT-Bench — A multi-turn LLM-as-a-judge evaluation benchmark for testing multimodal instruction-tuned models' ability to follow user instructions in multi-turn d…
InfoVQAtest — InfoVQA test set with infographic images requiring joint reasoning over document layout, textual content, graphical elements, and data visualizations…
DocVQAtest — DocVQA is a Visual Question Answering benchmark on document images containing 50,000 questions defined on 12,000+ document images. The benchmark focu…
InfoVQA — InfoVQA dataset with 30,000 questions and 5,000 infographic images requiring joint reasoning over document layout, textual content, graphical element…
Design2Code — Design2Code evaluates the ability to generate code (HTML/CSS/JS) from visual designs.
QwenWebBench — QwenWebBench is an internal front-end code generation benchmark by Qwen. It is bilingual (EN/CN) and spans 7 categories (Web Design, Web Apps, Games,…
Flame-VLM-Code — Flame-VLM-Code evaluates multimodal models on visual code generation tasks, measuring ability to generate code from visual inputs such as UI mockups…
ImageMining — ImageMining evaluates multimodal models on extracting structured information from images using tool use, measuring ability to combine visual understa…
InfographicsQA — InfographicVQA dataset with 5,485 infographic images and over 30,000 questions requiring joint reasoning over document layout, textual content, graph…
MusicCaps — MusicCaps is a dataset composed of 5,521 music examples, each labeled with an English aspect list and a free text caption written by musicians. The d…
OmniBench Music — Music component of OmniBench, a comprehensive benchmark for evaluating omni-language models' ability to recognize, interpret, and reason across visua…
QwenSVG — QwenSVG is Qwen's internal SVG generation benchmark for evaluating front-end and visual code generation. Scores are reported as BT/Elo ratings from a…
SVG-Bench — SVG-Bench is an internal benchmark that comprehensively evaluates SVG generation performance. It accepts text and image inputs across build-from-scra…
Vision2Web — Vision2Web evaluates multimodal models on converting visual designs and screenshots into functional web pages, measuring end-to-end design-to-code ca…

Safety

CyberGym — CyberGym is a benchmark for evaluating AI agents on cybersecurity tasks, testing their ability to identify vulnerabilities, perform security analysis…
AttaQ — AttaQ is a unique dataset containing adversarial examples in the form of questions designed to provoke harmful or inappropriate responses from large…
Cybersecurity CTFs — Cybersecurity Capture the Flag (CTF) benchmark for evaluating LLMs in offensive security challenges. Contains diverse cybersecurity tasks including c…
FigQA — FigQA is a multiple-choice benchmark on interpreting scientific figures from biology papers. It evaluates dual-use biological knowledge and multimoda…
XSTest — XSTest is a test suite designed to identify exaggerated safety behaviours in large language models. It comprises 450 prompts: 250 safe prompts across…
CyBench — CyBench is a suite of Capture-the-Flag (CTF) challenges measuring agentic cyber attack capabilities. It evaluates dual-use cybersecurity knowledge an…
POPE — Polling-based Object Probing Evaluation (POPE) is a benchmark for evaluating object hallucination in Large Vision-Language Models (LVLMs). POPE addre…
AIR-Bench — AIR-Bench 2024 is a safety benchmark grounded in risk categories derived from government regulations and company policies. It evaluates policy-ground…
BioLP-Bench — BioLP-Bench is a model-graded evaluation measuring ability to find and correct mistakes in common biological laboratory protocols. It evaluates dual-…
CyberSecEval 4 — CyberSecEval 4 is an evaluation suite covering cybersecurity-related capabilities and risks of large language models. The insecure-code-generation tr…
ExploitBench — ExploitBench is a cybersecurity benchmark that evaluates a model's ability to discover and exploit software vulnerabilities, reported as the fraction…
ProtocolQA — ProtocolQA is a multiple-choice benchmark on troubleshooting failed experimental outcomes from common biological laboratory protocols. It evaluates d…
Virology Capabilities Test — Virology Capabilities Test (VCT) is an expert-level multiple-choice benchmark measuring the capability to troubleshoot complex virology laboratory pr…
WMDP — Weapons of Mass Destruction Proxy (WMDP) is a multiple-choice benchmark on dual-use biology, chemistry, and cyber knowledge. It measures a model's capacity…

Language

WMT24++ — WMT24++ is a comprehensive multilingual machine translation benchmark that expands the WMT24 dataset to cover 55 languages and dialects. It includes…
CharadesSTA — Charades-STA is a benchmark dataset for temporal activity localization via language queries, extending the Charades dataset with sentence temporal an…
SQuALITY — SQuALITY (Summarization-format QUestion Answering with Long Input Texts, Yes!) is a long-document summarization dataset built by hiring highly-qualif…
Translation en→Set1 COMET22 — COMET-22 is an ensemble machine translation evaluation metric combining a COMET estimator model trained with Direct Assessments and a multitask model…
Translation en→Set1 spBleu — Translation evaluation using spBLEU (SentencePiece BLEU), a BLEU metric computed over text tokenized with a language-agnostic SentencePiece subword m…
Translation Set1→en COMET22 — COMET-22 is a neural machine translation evaluation metric that uses an ensemble of two models: a COMET estimator trained with Direct Assessments and…
Translation Set1→en spBleu — spBLEU (SentencePiece BLEU) evaluation metric for machine translation quality assessment, using language-agnostic SentencePiece tokenization with BLE…
MEGA UDPOS — Universal Dependencies POS tagging as part of the MEGA benchmark suite. A multilingual part-of-speech tagging dataset based on Universal Dependencies…
VATEX — VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. Contains over 41,250 videos and 825,000 captions in both Eng…
Open-rewrite — OpenRewriteEval is a benchmark for evaluating open-ended rewriting of long-form texts, covering a wide variety of rewriting types expressed through n…
TLDR9+ (test) — A large-scale summarization dataset containing over 9 million training instances extracted from Reddit, designed for extreme summarization (generatin…
XLSum English — Large-scale multilingual abstractive summarization dataset comprising 1 million professionally annotated article-summary pairs from BBC, covering 44…

Code

Codegolf v2.2 — Codegolf v2.2 benchmark
CFEval — CFEval benchmark for evaluating code generation and problem-solving capabilities
OctoCodingBench — Octopus coding benchmark for evaluating multi-language programming capabilities
SWE-Perf — Software Engineering Performance benchmark measuring code optimization capabilities
SWE-Review — Software Engineering Review benchmark evaluating code review capabilities
SWT-Bench — Software Test Benchmark evaluating LLM ability to write tests for software repositories
VIBE — Visual Interface Building Evaluation benchmark for UI/app generation
VIBE Android — VIBE benchmark subset for Android application generation
VIBE Backend — VIBE benchmark subset for backend service generation
VIBE iOS — VIBE benchmark subset for iOS application generation
VIBE Simulation — VIBE benchmark subset for simulation code generation
VIBE Web — VIBE benchmark subset for web application generation

Spatial Reasoning

RealWorldQA — RealWorldQA is a benchmark designed to evaluate basic real-world spatial understanding capabilities of multimodal models. The initial release consist…
ScreenSpot — ScreenSpot is the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. The dataset comprises over 1,200 in…
RefCOCO-avg — RefCOCO-avg measures object grounding accuracy averaged across RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
EmbSpatialBench — EmbSpatialBench evaluates embodied spatial understanding and reasoning capabilities.
RefSpatialBench — RefSpatialBench evaluates spatial reference understanding and grounding.
Hypersim — Hypersim evaluates 3D grounding and depth understanding in synthetic indoor scenes.
SUNRGBD — SUNRGBD evaluates RGB-D scene understanding and 3D grounding capabilities.
InterGPS — Interpretable Geometry Problem Solver (Inter-GPS) with Geometry3K dataset of 3,002 geometry problems with dense annotation in formal language using t…
ARKitScenes — ARKitScenes evaluates 3D scene understanding and spatial reasoning in AR/VR contexts.
PointGrounding — PointArena is a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. It includes Point-Bench, a curated data…

Healthcare

HealthBench — An open-source benchmark for measuring performance and safety of large language models in healthcare, consisting of 5,000 multi-turn conversations ev…
HealthBench Professional — HealthBench Professional evaluates model capability and safety for clinician use cases using real clinician-style chats and physician-authored gradin…
WMT23 — The Eighth Conference on Machine Translation (WMT23) benchmark evaluating machine translation systems across 8 language pairs (14 translation directi…
CheXpert CXR — CheXpert is a large dataset of 224,316 chest radiographs from 65,240 patients for automated chest X-ray interpretation. The dataset includes uncertai…
DermMCQA — Dermatology multiple choice question assessment benchmark for evaluating medical knowledge and diagnostic reasoning in dermatological conditions and…
HealthBench Consensus — HealthBench Consensus is a HealthBench subset focused on questions where physician-created rubric criteria have especially high agreement, measuring…
MIMIC CXR — MIMIC-CXR is a large publicly available dataset of chest radiographs with free-text radiology reports. Contains 377,110 images corresponding to 227,8…
VQA-Rad — VQA-RAD (Visual Question Answering in Radiology) is the first manually constructed dataset of medical visual question answering containing 3,515 clin…

Math

MathVista — MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal…
HMMT 2025 — Harvard-MIT Mathematics Tournament 2025 - A prestigious student-organized mathematics competition for high school students featuring two tournaments…
MathVision — MATH-Vision is a dataset designed to measure multimodal mathematical reasoning capabilities. It focuses on evaluating how well models can solve mathe…
HMMT25 — Harvard-MIT Mathematics Tournament 2025 - A prestigious student-organized mathematics competition for high school students featuring two tournaments…
MathVista-Mini — MathVista-Mini is a smaller version of the MathVista benchmark that evaluates mathematical reasoning in visual contexts. It consists of examples deri…
CNMO 2024 — China Mathematical Olympiad 2024 - A challenging mathematics competition.
MathVerse-Mini — MathVerse-Mini is a subset of the MathVerse benchmark for evaluating math reasoning capabilities in vision-language models.

Structured Output

IFEval — Instruction-Following Evaluation (IFEval) benchmark for large language models, focusing on verifiable instructions with 25 types of instructions and…
CC-OCR — A comprehensive OCR benchmark for evaluating Large Multimodal Models (LMMs) in literacy. Comprises four OCR-centric tracks: multi-scene text reading,…
Internal API instruction following (hard) — Internal benchmark for hard API instruction-following tasks; no public documentation was found in official sources.
IF — Instruction-Following Evaluation (IFEval) benchmark for large language models, focusing on verifiable instructions with 25 types of instructions and…
SIFO — SIFO (Simple Instruction Following) evaluates how well language models follow simple, explicit instructions. It tests fundamental instruction-followi…
SIFO-Multiturn — SIFO-Multiturn evaluates instruction following capabilities in multi-turn conversational settings, testing how well models maintain context and follo…

Image To Text

DocVQA — A dataset for Visual Question Answering on document images containing 50,000 questions defined on 12,000+ document images. The benchmark tests AI's a…
OCRBench — OCRBench: Comprehensive evaluation benchmark for assessing Optical Character Recognition (OCR) capabilities in Large Multimodal Models across text re…
TextVQA — TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Introduced to benchmark VQA models' ability to read a…
OCRBench-V2 (en) — OCRBench v2 English subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with English text con…
OCRBench-V2 (zh) — OCRBench v2 Chinese subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with Chinese text con…
OCRBench_V2 — OCRBench v2: Enhanced large-scale bilingual benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with 10,000 hu…

Long Context

LVBench — LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6…
MMLongBench-Doc — MMLongBench-Doc evaluates long document understanding capabilities in vision-language models.
LongVideoBench — LongVideoBench is a question-answering benchmark featuring video-language interleaved inputs up to an hour long. It includes 3,763 varying-length web…
InfiniteBench/En.MC — InfiniteBench English Multiple Choice variant - first LLM benchmark featuring average data length surpassing 100K tokens for evaluating long-context…
InfiniteBench/En.QA — InfiniteBench English Question Answering variant - first LLM benchmark featuring average data length surpassing 100K tokens for evaluating long-conte…
NIH/Multi-needle — Multi-needle in a haystack benchmark for evaluating long-context comprehension capabilities of language models by testing retrieval of multiple targe…

Video

VideoMME w/o sub. — Video-MME is a comprehensive evaluation benchmark for multi-modal large language models in video analysis. It features 900 videos across 6 primary vi…
MLVU — A comprehensive benchmark for multi-task long video understanding that evaluates multimodal large language models on videos ranging from 3 minutes to…
VideoMME w sub. — The first-ever comprehensive evaluation benchmark of Multi-modal LLMs in Video analysis. Features 900 videos (254 hours) with 2,700 question-answer p…
QVHighlights — QVHighlights is a video moment retrieval benchmark for detecting moments and highlights in videos via natural language queries. Given a query, the mo…
ActivityNet — A large-scale video benchmark for human activity understanding. Provides samples from 203 activity classes with an average of 137 untrimmed videos pe…
Video-MME (long, no subtitles) — Video-MME is the first-ever comprehensive evaluation benchmark for Multi-modal Large Language Models (MLLMs) in video analysis. This variant focuses…

Vision

ODinW — Object Detection in the Wild (ODinW) benchmark for evaluating object detection models' task-level transfer ability across diverse real-world datasets…
AndroidWorld — AndroidWorld evaluates an agent's ability to operate in real Android GUI environments, completing multi-step tasks by perceiving screen content and e…
Objectron — Objectron evaluates 3D object detection and pose estimation capabilities.
WebVoyager — WebVoyager evaluates an agent's ability to navigate and complete tasks on real websites by perceiving page screenshots and executing browser actions.

Speech To Text

FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech - a parallel speech dataset in 102 languages built on FLoRes-101 with approximate…
CoVoST2 — CoVoST 2 is a large-scale multilingual speech translation corpus derived from Common Voice, covering translations from 21 languages into English and…
Common Voice 15 — Common Voice is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Version 15.0 conta…
CoVoST2 en-zh — CoVoST 2 English-to-Chinese subset is part of the large-scale multilingual speech translation corpus derived from Common Voice. This subset focuses s…

Search

BrowseComp-VL — BrowseComp-VL is the vision-language variant of BrowseComp, evaluating multimodal models on web browsing comprehension tasks that require processing…
MM-BrowserComp — MM-BrowserComp evaluates multimodal agents on web browsing and information retrieval tasks, testing a model's ability to perceive, navigate, and extr…
MMSearch — MMSearch evaluates multimodal models on search-based retrieval and question answering tasks that require processing both visual and textual informati…
MMSearch-Plus — MMSearch-Plus is an extended variant of MMSearch with harder multimodal search and retrieval tasks requiring deeper reasoning over visual and textual…

Finance

WritingBench — A comprehensive benchmark for evaluating large language models' generative writing capabilities across 6 core writing domains (Academic & Engineering…
BankerToolBench — BankerToolBench is a public benchmark that evaluates models on banking and finance tool-use tasks. Models are scored against dataset rubrics, measuri…
YC-Bench — YC-Bench evaluates agents on long-horizon, open-ended business and investment decision-making. The reported metric is the final assets (fund value, i…

Tool Calling

BFCL-V4 — Berkeley Function Calling Leaderboard V4 (BFCL-V4) evaluates LLMs on their ability to accurately call functions and APIs, including simple, multiple,…
MCP-Mark — MCP-Mark evaluates LLMs on their ability to use Model Context Protocol (MCP) tools effectively, testing tool discovery, selection, invocation, and re…
MCP-Universe — MCP-Universe evaluates LLMs on complex multi-step agentic tasks using Model Context Protocol (MCP) tools across diverse interactive environments, tes…

Psychology

Social IQa — The first large-scale benchmark for commonsense reasoning about social situations. Contains 38,000 multiple choice questions probing emotional and so…
OpenAI MMLU — MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark that measures a text model's multitask accuracy across 57 diverse academ…
Meld — MELD (Multimodal EmotionLines Dataset) is a multimodal multi-party dataset for emotion recognition in conversations. Contains approximately 13,000 ut…

Summarization

GovReport — A long document summarization dataset consisting of reports from government research agencies including Congressional Research Service and U.S. Gover…
QMSum — QMSum is a benchmark for query-based multi-domain meeting summarization consisting of 1,808 query-summary pairs over 232 meetings across academic, pr…
SummScreenFD — SummScreenFD is the ForeverDreaming subset of the SummScreen dataset for abstractive screenplay summarization, comprising pairs of TV series transcri…

Grounding

GroundUI-1K — A subset of GroundUI-18K for UI grounding evaluation, where models must predict action coordinates on screenshots based on single-step instructions a…
OSWorld-G — OSWorld-G (Grounding) evaluates screenshot grounding accuracy for OS automation tasks.
RefCOCOg — RefCOCOg is a referring expression comprehension benchmark that evaluates spatial grounding in images. Given a natural language expression describing…

Frontend Development

MobileMiniWob++_SR — MobileMiniWob++ SR (Success Rate) is an adaptation of the MiniWob++ web interaction benchmark for mobile Android environments within AndroidWorld. It…
VisualWebBench — A multimodal benchmark designed to assess the capabilities of multimodal large language models (MLLMs) across web page understanding and grounding ta…
Artifacts Bench — Artifacts Bench evaluates a model's ability to generate visual code artifacts, measuring the quality of generated interactive and visual front-end ou…

Coding

CC-Bench-V2 Backend — CC-Bench-V2 Backend evaluates coding agents on backend development tasks, measuring practical engineering ability to implement server-side logic, API…
CC-Bench-V2 Frontend — CC-Bench-V2 Frontend evaluates coding agents on frontend development tasks, measuring ability to build UI components, handle styling, and implement c…
SecCodeBench — SecCodeBench evaluates LLM coding agents on secure code generation and vulnerability detection, testing the ability to produce code that is both func…

Document Understanding

RealKIE-FCC — RealKIE-FCC is a key information extraction benchmark drawn from real enterprise documents (FCC filings), part of the RealKIE suite of five novel dat…
OmniDocBench — OmniDocBench evaluates multimodal models on document understanding tasks such as OCR, layout parsing, and structured document comprehension.

Productivity

SpreadSheetBench-v1 — SpreadSheetBench-v1 evaluates office automation agents on spreadsheet reasoning and manipulation tasks, measuring the ability to analyze, transform,…
CoWorkBench — CoWorkBench is Qwen's internal cowork benchmark for evaluating long-horizon office and productivity agent tasks across domains such as computer scien…

Image Generation

CVTG-2K — CVTG-2K (Chinese Visual Text Generation 2K) is a benchmark for evaluating text-to-image models on their ability to accurately render text within gene…
LongText-Bench — LongText-Bench evaluates text-to-image models on their ability to accurately render long text passages within generated images. It includes English (…

Audio

GiantSteps Tempo — A dataset for tempo estimation in electronic dance music containing 664 2-minute audio previews from Beatport, annotated from user corrections for ev…
VocalSound — A dataset for improving human vocal sounds recognition, containing over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, s…

Systems

KernelBench Hard — KernelBench Hard evaluates agentic GPU kernel optimization on the hardest problem set. Each question is scored by the agent's submitted operator TFLO…
Kernel Bench L3 — Kernel Bench L3 evaluates agentic GPU kernel optimization across 50 problems. Qwen reports two metrics for this benchmark: median per-problem speedup…

Writing

Creative Writing v3 — EQ-Bench Creative Writing v3 is an LLM-judged creative writing benchmark that evaluates models across 32 writing prompts with 3 iterations per prompt…

Factuality

LongFact — LongFact evaluates factual precision over long-form generations containing many individual claims. Each claim is extracted and verified, and the mode…

Instruction Following

MIABench — MIABench evaluates multimodal instruction alignment and following capabilities.

Text To Image

MTVQA — MTVQA (Multilingual Text-Centric Visual Question Answering) is the first benchmark featuring high-quality human expert annotations across 9 diverse l…

Research

ResearchClawBench — ResearchClawBench evaluates research agents on realistic, tool-using research tasks that require code execution and filesystem workspace interaction.

Robotics

RoboSpatialHome — RoboSpatialHome evaluates spatial understanding for robotic home navigation and manipulation.

Frequently asked questions

What are AI benchmarks?

AI benchmarks are standardized tests that measure how well language models perform on specific tasks — reasoning, coding, math, factual recall, tool use and more. Each benchmark runs the same problems against every model so scores are directly comparable. On LLM Stats, every benchmark links to a live leaderboard ranking 300+ models by verified score.
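To illustrate why a fixed problem set makes scores comparable, here is a minimal sketch in which two stand-in models answer the same questions and are graded by the same exact-match rule. The toy problem set, the canned answers, and the `accuracy` helper are hypothetical illustrations; they are not LLM Stats code or real benchmark data.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: (question, expected answer) pairs.
PROBLEMS: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("3 * 3", "9"),
    ("10 - 7", "3"),
    ("8 / 2", "4"),
]


def accuracy(model: Callable[[str], str], problems: List[Tuple[str, str]]) -> float:
    """Return the fraction of problems answered correctly (exact match)."""
    correct = sum(1 for question, expected in problems if model(question).strip() == expected)
    return correct / len(problems)


def make_model(answers: Dict[str, str]) -> Callable[[str], str]:
    """Build a stand-in 'model' that replies from a fixed answer table."""
    return lambda question: answers.get(question, "")


# Hypothetical canned answers standing in for two real models.
MODELS: Dict[str, Callable[[str], str]] = {
    "model_a": make_model({"2 + 2": "4", "3 * 3": "9", "10 - 7": "3", "8 / 2": "4"}),
    "model_b": make_model({"2 + 2": "4", "3 * 3": "6", "10 - 7": "3", "8 / 2": "2"}),
}

# Every model sees the same problems and the same grading rule,
# so the resulting scores can be compared directly.
SCORES: Dict[str, float] = {name: accuracy(model, PROBLEMS) for name, model in MODELS.items()}

for name, score in SCORES.items():
    print(f"{name}: {score:.0%}")
```

Real benchmarks differ in how they grade (exact match, unit tests, rubric or model-based judging), but the principle of holding the problems and the grader fixed across models is the same.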

Which AI benchmark is best for measuring reasoning?

GPQA Diamond is the most discriminating reasoning benchmark at the frontier — graduate-level science questions that are hard to answer without genuine reasoning. MMLU and MMLU-Pro cover broad knowledge, while AIME and HMMT test competition math. For agentic coding, SWE-Bench Verified is the standard.

How are benchmark scores verified on LLM Stats?

Scores come from public model evaluations and provider-reported results, cross-checked against independent runs. Verified benchmarks are marked with a badge. Pricing and metadata revalidate hourly and new results are added as they are published.

What is the difference between MMLU, GPQA and SWE-Bench?

MMLU measures broad multitask knowledge across 57 subjects. GPQA Diamond measures hard graduate-level reasoning. SWE-Bench Verified measures whether a model can resolve real GitHub issues end-to-end. They test different capabilities, so a model can lead one and trail another.
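The point that a model can lead one benchmark and trail another can be sketched by ranking the same models separately per benchmark. The model names and scores below are made-up placeholders chosen only to show the effect, and `rank_by_benchmark` is a hypothetical helper, not part of any LLM Stats API.

```python
from typing import Dict, List

# Hypothetical placeholder scores (percent); NOT real leaderboard results.
SCORES: Dict[str, Dict[str, float]] = {
    "model_x": {"MMLU": 88.0, "GPQA Diamond": 61.0, "SWE-Bench Verified": 70.0},
    "model_y": {"MMLU": 85.0, "GPQA Diamond": 72.0, "SWE-Bench Verified": 55.0},
    "model_z": {"MMLU": 90.0, "GPQA Diamond": 58.0, "SWE-Bench Verified": 62.0},
}


def rank_by_benchmark(scores: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Return, for each benchmark, the model names ordered from best to worst."""
    benchmarks = sorted({name for per_model in scores.values() for name in per_model})
    return {
        benchmark: sorted(
            (model for model in scores if benchmark in scores[model]),
            key=lambda model: scores[model][benchmark],
            reverse=True,
        )
        for benchmark in benchmarks
    }


RANKINGS = rank_by_benchmark(SCORES)

for benchmark, order in RANKINGS.items():
    print(f"{benchmark}: {' > '.join(order)}")
```

With these placeholder numbers each benchmark has a different leader, which is why a single score rarely summarizes a model's capabilities.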

How often are AI benchmark results updated?

Benchmark metadata and pricing refresh hourly. New model results are added within hours of a release or a newly published evaluation, so the leaderboards reflect the current frontier.