Benchmarks (x07 bench)
x07 bench is the patch-centric benchmark harness for agent correctness loops.
It evaluates whether a candidate patch actually resolves an instance, with deterministic artifacts and machine-readable reports.
Related benchmark surfaces:
- Performance regression canaries are tracked in CI (internal).
- x07-perf-compare: cross-language performance comparisons (X07 vs C vs Rust).
Commands
x07 bench list --suite <suite.json>
x07 bench validate --suite <suite.json>
x07 bench eval --suite <suite.json> --predictions <predictions.jsonl>
x07 bench eval --suite <suite.json> --oracle
x07 bench eval --suite <suite.json> --predictions <predictions.jsonl> --runner docker
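When driving these commands from a script, it can help to assemble the argument vector in one place. The sketch below uses only the flags listed above. The helper name `build_eval_argv` is hypothetical, and it assumes that `--predictions` and `--oracle` are alternatives, because every listed command uses exactly one of them.

```python
from typing import List, Optional


def build_eval_argv(
    suite: str,
    predictions: Optional[str] = None,
    oracle: bool = False,
    runner: Optional[str] = None,
) -> List[str]:
    """Assemble an `x07 bench eval` argument vector (hypothetical helper)."""
    # Assumption: exactly one of --predictions / --oracle is given, as in
    # every command listed in this section.
    if (predictions is None) == (not oracle):
        raise ValueError("pass exactly one of predictions or oracle=True")
    argv = ["x07", "bench", "eval", "--suite", suite]
    if oracle:
        argv.append("--oracle")
    else:
        argv += ["--predictions", predictions]
    if runner is not None:
        # e.g. runner="docker" for the containerized path
        argv += ["--runner", runner]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where `x07` is installed.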
Bench suite layout
A suite is a directory tree rooted at suite.json:
labs/x07bench/
  suites/
    core_v0/
      suite.json
      instances/
        std_math_0001/
          instance.json
          issue.md
          repo/
          oracle.patchset.json
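A quick local sanity check of this layout might look like the sketch below. It assumes that instance directories sit under `instances/` next to `suite.json` and that each one contains an `instance.json`. The helper `find_instance_dirs` is hypothetical and is not how x07 bench itself discovers instances, which may be driven by the contents of `suite.json`.

```python
from pathlib import Path
from typing import List


def find_instance_dirs(suite_json: Path) -> List[Path]:
    """Return instance directories next to suite.json (layout assumption)."""
    instances_root = suite_json.parent / "instances"
    if not instances_root.is_dir():
        return []
    # Only directories that actually contain instance.json are reported.
    return sorted(p.parent for p in instances_root.glob("*/instance.json"))
```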
Schemas
- spec/x07-bench.suite.schema.json (x07.bench.suite@0.1.0)
- spec/x07-bench.instance.schema.json (x07.bench.instance@0.1.0)
- spec/x07-bench.report.schema.json (x07.bench.report@0.1.0)
Predictions JSONL supports:
- patch_kind: "x07-arch-patchset-json" (primary)
- patch_kind: "unified-diff" (compat)
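As a rough illustration, a predictions file could be produced as below, one JSON object per line. Only `patch_kind` and its two values come from this page. The helper `dump_predictions` is hypothetical, and so are any other record keys a caller passes in (for example `instance_id` or `patch`); check the actual predictions format before relying on them.

```python
import json
from typing import Any, Dict, Iterable

# The two patch kinds named in this section.
PATCH_KINDS = ("x07-arch-patchset-json", "unified-diff")


def dump_predictions(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as JSONL, rejecting unknown patch_kind values."""
    lines = []
    for record in records:
        kind = record.get("patch_kind")
        if kind not in PATCH_KINDS:
            raise ValueError(f"unsupported patch_kind: {kind!r}")
        # sort_keys keeps the output byte-stable across runs
        lines.append(json.dumps(record, sort_keys=True))
    return "".join(line + "\n" for line in lines)
```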
Evaluation protocol
Per instance, x07 bench runs:
- Baseline x07 test (must fail)
- Apply patch
- Optional repair on touched *.x07.json
- Post-patch x07 test
- Optional determinism rerun checks
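The control flow of these steps can be sketched as a small function. This is an illustration of the protocol as described above, not the harness's implementation: the callables, the `InstanceOutcome` type, and the reason strings are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InstanceOutcome:
    resolved: bool
    reason: str  # hypothetical labels, not report fields


def evaluate_instance(
    run_tests: Callable[[], bool],        # True when the test run passes
    apply_patch: Callable[[], bool],      # True when the patch applies
    repair: Optional[Callable[[], None]] = None,
    determinism_reruns: int = 0,
) -> InstanceOutcome:
    # 1. Baseline: the unpatched repo must fail its tests.
    if run_tests():
        return InstanceOutcome(False, "baseline_passed")
    # 2. Apply the candidate patch.
    if not apply_patch():
        return InstanceOutcome(False, "patch_failed")
    # 3. Optional repair pass over touched files.
    if repair is not None:
        repair()
    # 4. Post-patch test run decides resolution.
    if not run_tests():
        return InstanceOutcome(False, "tests_failed")
    # 5. Optional reruns: every rerun must agree with the first pass.
    for _ in range(determinism_reruns):
        if not run_tests():
            return InstanceOutcome(False, "nondeterministic")
    return InstanceOutcome(True, "ok")
```

The baseline check matters because an instance whose tests already pass gives no evidence that the patch fixed anything.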
Output
x07 bench eval emits x07.bench.report@0.1.0.
Primary KPIs:
- resolved / instances_total
- resolved_without_errors
- repair iteration/ops averages
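To make the first two KPIs concrete, the sketch below derives them from a list of per-instance results. The input shape is an assumption made for illustration: the `resolved` and `error_count` keys are hypothetical, and "without errors" is read here as "resolved with zero recorded errors". The actual field names are defined by spec/x07-bench.report.schema.json.

```python
from typing import Any, Dict, Iterable


def summarize(instances: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate hypothetical per-instance rows into headline KPIs."""
    rows = list(instances)
    total = len(rows)
    resolved = sum(1 for r in rows if r["resolved"])
    # Assumed meaning: resolved and no errors recorded along the way.
    clean = sum(1 for r in rows if r["resolved"] and r["error_count"] == 0)
    return {
        "instances_total": total,
        "resolved": resolved,
        "resolved_without_errors": clean,
        # Guard against an empty suite to avoid dividing by zero.
        "resolve_rate": resolved / total if total else 0.0,
    }
```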
Docker path
Use --runner docker for in-command containerized evaluation:
x07 bench eval --suite labs/x07bench/suites/core_v0/suite.json --oracle --runner docker
x07 bench delegates to ci/x07bench/run.sh under the hood. You can also call the wrapper directly:
ci/x07bench/run.sh bench eval --suite labs/x07bench/suites/core_v0/suite.json --oracle