Benchmark library

Every benchmark.
One command.

Containerized, parallelized, and verified. Pick a benchmark, point it at your agent, and get results in minutes instead of hours.

27 benchmarks ready

Coding & SWE (9 benchmarks)

- SWE-bench (swebench): Real GitHub issues from popular Python repos. 500+ tasks. Hard.
- SWE-bench Pro (swebenchpro): Harder SWE-bench with multi-file edits. 500+ tasks. Very Hard.
- AutoCodeBench (autocodebench): Automated code generation evaluation. 460 tasks. Medium.
- HumanEvalFix (humanevalfix): Bug-fixing variant of HumanEval. 164 tasks. Medium.
- Aider Polyglot (aider-polyglot): Multi-language code editing. 225 tasks. Medium.
- QuixBugs (quixbugs): Classic bug-fixing across Python & Java. 40 tasks. Easy.
- CRUSTBench (crustbench): C/Rust systems programming tasks. 100 tasks. Hard.
- TerminalBench (terminalbench2): Terminal & shell automation tasks. 485 tasks. Hard.
- Harbor (harbor): Full-stack agent environment tasks. 50 tasks. Hard.

Math & Reasoning (7 benchmarks)

- AIME (aime): Competition-level math problems. 90 tasks. Very Hard.
- IneqMath (ineqmath): Mathematical inequality proving. 270 tasks. Very Hard.
- GPQA Diamond (gpqa-diamond): PhD-level science questions. 198 tasks. Very Hard.
- Reasoning Gym (reasoning-gym): Diverse logical reasoning tasks. 500+ tasks. Medium.
- SATBench (satbench): Boolean satisfiability solving. 200 tasks. Hard.
- ARC-AGI-2 (arc-agi-2): Abstract reasoning & pattern recognition. 120 tasks. Very Hard.
- USACO (usaco): Competitive programming olympiad problems. 307 tasks. Very Hard.

Knowledge & QA (4 benchmarks)

- SimpleQA (simpleqa): Factual question answering. 4326 tasks. Easy.
- GAIA (gaia): General AI assistant tasks. 165 tasks. Hard.
- MMMLU (mmmlu): Massive multi-task language understanding. 14042 tasks. Medium.
- LawBench (lawbench): Legal reasoning & analysis. 500 tasks. Hard.

Data Science (2 benchmarks)

- DS-1000 (ds1000): Data science coding problems. 1000 tasks. Medium.
- DABStep (dabstep): Data analysis & business tasks. 460 tasks. Medium.

Agents & Tools (2 benchmarks)

- BFCL (bfcl): Function calling & tool use. 2000 tasks. Medium.
- MMAU (mmau): Multi-modal agent understanding. 990 tasks. Hard.

Safety & Science (3 benchmarks)

- StrongReject (strongreject): Safety & refusal evaluation. 313 tasks. Medium.
- ReplicationBench (replicationbench): Scientific replication tasks. 50 tasks. Very Hard.
- QCircuitBench (qcircuitbench): Quantum circuit design. 100 tasks. Very Hard.

Bring your own benchmark

Have an internal eval?
We'll onboard it for you.

Your proprietary benchmarks are often the most important ones. We provide white-glove onboarding to containerize your custom evals and get them running on BenchSpan — same parallelization, same reproducibility, same dashboard.

- We handle the Dockerfile, harness integration, and test verification.
- Your benchmark stays private to your organization.
- Typically onboarded in 1–2 business days.
- Ongoing support as your eval evolves.