Benchmark library

Every benchmark.
One command.

Containerized, parallelized, and verified. Pick a benchmark, point it at your agent, and get results in minutes instead of hours.

27 benchmarks ready

Coding & SWE (9 benchmarks)

- SWE-bench (swebench): Real GitHub issues from popular Python repos. 500+ tasks. Hard.
- SWE-bench Pro (swebenchpro): Harder SWE-bench with multi-file edits. 500+ tasks. Very Hard.
- AutoCodeBench (autocodebench): Automated code generation evaluation. 460 tasks. Medium.
- HumanEvalFix (humanevalfix): Bug-fixing variant of HumanEval. 164 tasks. Medium.
- Aider Polyglot (aider-polyglot): Multi-language code editing. 225 tasks. Medium.
- QuixBugs (quixbugs): Classic bug-fixing across Python & Java. 40 tasks. Easy.
- CRUSTBench (crustbench): C/Rust systems programming tasks. 100 tasks. Hard.
- TerminalBench (terminalbench2): Terminal & shell automation tasks. 485 tasks. Hard.
- Harbor (harbor): Full-stack agent environment tasks. 50 tasks. Hard.

Math & Reasoning (7 benchmarks)

- AIME (aime): Competition-level math problems. 90 tasks. Very Hard.
- IneqMath (ineqmath): Mathematical inequality proving. 270 tasks. Very Hard.
- GPQA Diamond (gpqa-diamond): PhD-level science questions. 198 tasks. Very Hard.
- Reasoning Gym (reasoning-gym): Diverse logical reasoning tasks. 500+ tasks. Medium.
- SATBench (satbench): Boolean satisfiability solving. 200 tasks. Hard.
- ARC-AGI-2 (arc-agi-2): Abstract reasoning & pattern recognition. 120 tasks. Very Hard.
- USACO (usaco): Competitive programming olympiad problems. 307 tasks. Very Hard.

Knowledge & QA (4 benchmarks)

- SimpleQA (simpleqa): Factual question answering. 4326 tasks. Easy.
- GAIA (gaia): General AI assistant tasks. 165 tasks. Hard.
- MMMLU (mmmlu): Massive multi-task language understanding. 14042 tasks. Medium.
- LawBench (lawbench): Legal reasoning & analysis. 500 tasks. Hard.

Data Science (2 benchmarks)

- DS-1000 (ds1000): Data science coding problems. 1000 tasks. Medium.
- DABStep (dabstep): Data analysis & business tasks. 460 tasks. Medium.

Agents & Tools (2 benchmarks)

- BFCL (bfcl): Function calling & tool use. 2000 tasks. Medium.
- MMAU (mmau): Multi-modal agent understanding. 990 tasks. Hard.

Safety & Science (3 benchmarks)

- StrongReject (strongreject): Safety & refusal evaluation. 313 tasks. Medium.
- ReplicationBench (replicationbench): Scientific replication tasks. 50 tasks. Very Hard.
- QCircuitBench (qcircuitbench): Quantum circuit design. 100 tasks. Very Hard.

Bring your own benchmark

Have an internal eval?
We'll onboard it for you.

Your proprietary benchmarks are often the most important ones. We provide white-glove onboarding to containerize your custom evals and get them running on BenchSpan — same parallelization, same reproducibility, same dashboard.

- We handle the Dockerfile, harness integration, and test verification.
- Your benchmark stays private to your organization.
- Typically onboarded in 1–2 business days.
- Ongoing support as your eval evolves.