You're building an AI agent. You need to know if it's getting better. But running benchmarks is slow, expensive, fragile, and impossible to collaborate on. BenchSpan fixes all of it.
The problem
If you've spent any time evaluating an AI agent, you've hit every single one of these.
Every benchmark assumes a different interface. You spend days writing glue code, shimming your agent into someone else's harness, fighting with formats that don't match your architecture. This is engineering work that has nothing to do with making your agent better.
$ python run_benchmark.py --agent ./my_agent
Error: agent must implement BaseAgent.execute()
  Expected: (task: str) -> Result
  Got: incompatible interface

# three days later...
$ git log --oneline | head -4
af29c1 fix: wrap agent output in Result object
b82e4a fix: convert streaming to batch for harness
9d1f03 fix: patch benchmark timeout handler
c44a21 chore: add benchmark compat shim layer
Running a full benchmark suite locally takes hours. Sometimes a full day. That means you get maybe one or two experiment iterations per day. Your research velocity is bottlenecked by how fast instances run sequentially on your laptop.
$ benchrun --suite swe-bench-verified --full
Running 500 instances sequentially...
[Instance 1/500] django__django-11099 ████░░░░ 12m 34s
[Instance 2/500] django__django-11179 ██░░░░░░ 8m 12s
[Instance 3/500] astropy__astropy-6938 ░░░░░░░░ estimating...
Estimated completion: 14 hours 22 minutes

You wait at your desk. You can try one more experiment today. Maybe.
A single benchmark run can cost hundreds of dollars in tokens. When it fails halfway through — network timeout, rate limit, a bug in your prompt — you've burned money and hours. And you have to start from scratch because there's no way to resume.
$ benchrun --suite terminal-bench --full
Running 485 instances... ($847 estimated token cost)
[██████████████████████░░░░░░] 72% complete
[Instance 349] ERROR: OpenAI rate limit exceeded
[Instance 350] ERROR: OpenAI rate limit exceeded
[Instance 351] ERROR: Connection timeout
...
[Instance 392] ERROR: OpenAI rate limit exceeded
✗ Run failed. 137 instances did not complete.
Tokens spent: $612.40
Results: unusable (incomplete run)
To retry: re-run all 485 instances from scratch.
Developer A ran the benchmark on their laptop with a slightly different Docker config. Developer B used a different commit. Neither wrote down which prompt version they tested. Now you're in a meeting arguing about whose results are real.
# slack, 2:47 PM
alice: I got 34% on SWE-bench with the new prompt
bob: weird, I got 28% this morning
alice: which commit?
bob: main I think? maybe the branch
alice: did you use the docker setup or local?
bob: local, does it matter?
alice: ...yes it matters
bob: ok let me re-run
alice: with which config?
bob: idk yours? where is it
alice: I think I changed it after my run actually
After every run, you copy numbers into a spreadsheet, a Notion doc, or a Slack message. There's no central place where results live. No way to compare run #47 to run #52. No way to see what changed between them. Your benchmark history is a graveyard of disconnected CSVs.
# your "system"
Desktop/
├── benchmark_results_v2_FINAL.csv
├── benchmark_results_v2_FINAL_actual.csv
├── swebench_results_march.json
├── results_new_prompt.txt
├── Untitled spreadsheet - Google Sheets (47 tabs)
└── slack-messages-with-numbers-i-need-to-find.png

# "what resolve rate did we get on the
#  March 15th run with gpt-4 turbo?"
# nobody knows. it's gone.
One-time onboarding. Then every benchmark run is fast, reproducible, and shared with your whole team.
How it works
Write a bash script that starts your agent. Point BenchSpan at it. That's the only integration work you'll ever do.
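The whole integration can be as small as a launcher script like the sketch below. The module name and flag are placeholders for illustration, not a real BenchSpan contract:

```shell
# run_agent.sh — hypothetical launcher script. BenchSpan just needs a
# command that starts your agent; the module name and --task flag here
# are placeholders, not a prescribed interface.
cat > run_agent.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
exec python -m my_agent --task "$1"
EOF
chmod +x run_agent.sh
bash -n run_agent.sh && echo "launcher parses cleanly"
```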
Choose from our library of benchmarks or bring your own. Set how many instances. Hit run. Every instance spins up in its own isolated Docker container, in parallel.
Scores, trajectories, errors, timing — all captured and organized. Tagged with your agent's commit hash. Compare runs side by side. Share with the team instantly.
What you get
If you can start your agent with a shell command, it works on BenchSpan. One-time onboarding. No framework lock-in, no interface to conform to.
Every instance runs in its own Docker container. A 500-instance benchmark that took 14 hours now finishes in minutes. Run more experiments per day, not fewer.
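The speedup comes from fan-out: one isolated worker per instance instead of a sequential loop. A self-contained sketch of the idea, with echo standing in for the Docker container each worker actually gets:

```shell
# Fan-out sketch: launch instances in parallel, one worker each.
# In BenchSpan each worker is an isolated Docker container; echo
# stands in for the container so this sketch runs anywhere.
printf 'django__django-11099\ndjango__django-11179\nastropy__astropy-6938\n' > instances.txt
xargs -P 4 -I{} echo "container: {}" < instances.txt | sort
```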
Network error on 37 instances? Rerun just those 37. Join the results with the original run. Stop paying twice for work you already did.
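The resume workflow boils down to: filter the failed instance IDs out of the previous run's results, rerun only those, and join the outputs. A toy sketch with a plain-text results file (the file format is illustrative, not BenchSpan's actual storage):

```shell
# Toy results file from a prior run (format is illustrative).
printf 'django-11099 PASS\ndjango-11179 FAIL\nastropy-6938 FAIL\n' > run47.txt

# Collect only the instances that need a rerun.
awk '$2 == "FAIL" {print $1}' run47.txt > rerun_ids.txt
cat rerun_ids.txt
# django-11179
# astropy-6938
```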
Same Docker image. Same benchmark version. Same config. Tagged with the exact commit hash of your agent. No more "works on my machine."
Every run, every result, every trajectory — in one place. Tagged, searchable, comparable. Know who ran what, on which commit, with what outcome.
Run 5 instances of any benchmark to validate your setup before kicking off a 500-instance run. Catch bugs cheap.
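The smoke-test habit is cheap to script even outside BenchSpan. A sketch, with an illustrative instance list and a placeholder for the real runner:

```shell
# Smoke-test sketch: exercise the first 5 instances before paying for
# all 500. File name and instance IDs are illustrative placeholders.
seq -f 'inst-%g' 1 500 > all_instances.txt
head -n 5 all_instances.txt | while read -r id; do
  echo "smoke-testing $id"   # real runner invocation goes here
done
```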
See it in action
Demo video coming soon
Benchmark library
Run against industry-standard benchmarks out of the box, or bring your own internal evals.
Get set up in an afternoon. Run your first benchmark today.
Book a demo