Benchmarking your agent shouldn't be this hard

You're building an AI agent. You need to know if it's getting better. But running benchmarks is slow, expensive, fragile, and impossible to collaborate on. BenchSpan fixes all of it.

The problem

Five ways benchmarking fails you

If you've spent any time evaluating an AI agent, you've hit every single one of these.

01

Benchmarks aren't built for your agent

Every benchmark assumes a different interface. You spend days writing glue code, shimming your agent into someone else's harness, fighting with formats that don't match your architecture. This is engineering work that has nothing to do with making your agent better.

$ python run_benchmark.py --agent ./my_agent
Error: agent must implement BaseAgent.execute()
  Expected: (task: str) -> Result
  Got: incompatible interface

# three days later...
$ git log --oneline | head -4
af29c1 fix: wrap agent output in Result object
b82e4a fix: convert streaming to batch for harness
9d1f03 fix: patch benchmark timeout handler
c44a21 chore: add benchmark compat shim layer
02

Benchmarks are painfully slow

Running a full benchmark suite locally takes hours. Sometimes a full day. That means you get maybe one or two experiment iterations per day. Your research velocity is bottlenecked by how fast instances run sequentially on your laptop.

$ benchrun --suite swe-bench-verified --full
Running 500 instances sequentially...

[Instance 1/500] django__django-11099  ████░░░░  12m 34s
[Instance 2/500] django__django-11179  ██░░░░░░   8m 12s
[Instance 3/500] astropy__astropy-6938 ░░░░░░░░   estimating...

Estimated completion: 14 hours 22 minutes

You wait at your desk. You can try one more
experiment today. Maybe.
03

Failures are expensive and wasteful

A single benchmark run can cost hundreds of dollars in tokens. When it fails halfway through — network timeout, rate limit, a bug in your prompt — you've burned money and hours. And you have to start from scratch because there's no way to resume.

$ benchrun --suite terminal-bench --full
Running 485 instances... ($847 estimated token cost)

[██████████████████████░░░░░░] 72% complete

[Instance 349] ERROR: OpenAI rate limit exceeded
[Instance 350] ERROR: OpenAI rate limit exceeded  
[Instance 351] ERROR: Connection timeout
...
[Instance 392] ERROR: OpenAI rate limit exceeded

✗ Run failed. 137 instances did not complete.
  Tokens spent: $612.40
  Results: unusable (incomplete run)

To retry: re-run all 485 instances from scratch.
04

Nobody trusts anyone else's numbers

Developer A ran the benchmark on their laptop with a slightly different Docker config. Developer B used a different commit. Neither wrote down which prompt version they tested. Now you're in a meeting arguing about whose results are real.

# slack, 2:47 PM

alice: I got 34% on SWE-bench with the new prompt
bob:   weird, I got 28% this morning
alice: which commit?
bob:   main I think? maybe the branch
alice: did you use the docker setup or local?
bob:   local, does it matter?
alice: ...yes it matters
bob:   ok let me re-run
alice: with which config?
bob:   idk yours? where is it
alice: I think I changed it after my run actually
05

Results vanish into the void

After every run, you copy numbers into a spreadsheet, a Notion doc, or a Slack message. There's no central place where results live. No way to compare run #47 to run #52. No way to see what changed between them. Your benchmark history is a graveyard of disconnected CSVs.

# your "system"

Desktop/
├── benchmark_results_v2_FINAL.csv
├── benchmark_results_v2_FINAL_actual.csv
├── swebench_results_march.json
├── results_new_prompt.txt
├── Untitled spreadsheet - Google Sheets (47 tabs)
└── slack-messages-with-numbers-i-need-to-find.png

# "what resolve rate did we get on the
#  March 15th run with gpt-4 turbo?"

# nobody knows. it's gone.

We built BenchSpan because we lived this

One-time onboarding. Then every benchmark run is fast, reproducible, and shared with your whole team.

How it works

Three steps, then you're running

Step 01

Onboard your agent

Write a bash script that starts your agent. Point BenchSpan at it. That's the only integration work you'll ever do.
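A minimal sketch of what that launch script could look like. The exact contract BenchSpan expects (how the task is passed in, what it reads back) isn't specified here, so the argument convention and the `run_agent` name are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical launch script. Assumes the benchmark task arrives as the
# first argument; BenchSpan's actual interface may differ.
set -u

run_agent() {
  local task="$1"
  # Replace the echo with however you actually start your agent,
  # e.g. `python my_agent/main.py --task "$task"`.
  echo "agent received task: $task"
}
```

The point is that the script wraps *your* existing entry point; BenchSpan never needs to know what's inside it.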

Step 02

Pick a benchmark and run

Choose from our library of benchmarks or bring your own. Set how many instances. Hit run. Every instance spins up in its own isolated Docker container, in parallel.

Step 03

Results flow in automatically

Scores, trajectories, errors, timing — all captured and organized. Tagged with your agent's commit hash. Compare runs side by side. Share with the team instantly.

What you get

Everything that was broken, fixed

Any agent that runs via bash

If you can start your agent with a shell command, it works on BenchSpan. One-time onboarding. No framework lock-in, no interface to conform to.

Massively parallel execution

Every instance runs in its own Docker container. A 500-instance benchmark that took 14 hours now finishes in minutes. Run more experiments per day, not fewer.

Rerun only what failed

Network error on 37 instances? Rerun just those 37. Join the results with the original run. Stop paying twice for work you already did.
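Conceptually, the join is a merge keyed by instance ID, where rerun results replace only the failed entries. This `merge_runs` function is a hypothetical illustration, not BenchSpan's API:

```python
def merge_runs(original: dict, rerun: dict) -> dict:
    """Combine an original run with a partial rerun.

    Rerun entries overwrite their failed originals; everything
    that already succeeded is kept untouched.
    """
    merged = dict(original)
    merged.update(rerun)
    return merged

original = {"inst-1": "pass", "inst-2": "error", "inst-3": "pass"}
rerun = {"inst-2": "pass"}  # only the failed instance was re-executed
print(merge_runs(original, rerun))
# {'inst-1': 'pass', 'inst-2': 'pass', 'inst-3': 'pass'}
```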

Identical environments, every time

Same Docker image. Same benchmark version. Same config. Tagged with the exact commit hash of your agent. No more 'works on my machine.'

One source of truth for the team

Every run, every result, every trajectory — in one place. Tagged, searchable, comparable. Know who ran what, on which commit, with what outcome.

Smoke test before you burn

Run 5 instances of any benchmark to validate your setup before kicking off a 500-instance run. Catch bugs cheap.

See it in action

Watch the demo

Demo video coming soon

Benchmark library

Every benchmark your agent needs

Run against industry-standard benchmarks out of the box, or bring your own internal evals.

SWE-bench Verified · SWE-bench Lite · Terminal-Bench · HumanEval · MBPP · MATH · GPQA · Custom / Internal

Stop fighting your benchmarks.
Start shipping your agent.

Get set up in an afternoon. Run your first benchmark today.

Book a demo