EPOB End-to-End Project Orchestration Benchmark

Submission

Evaluation process

EPOB keeps results comparable by fixing the protocol, versioning reference anchors, and publishing evidence packets rather than relying on a single model-as-judge.

Tier 0

Submission Smoke

Validate the submission schema, run-bundle shape, tool boundaries, and scorer compatibility. Smoke runs do not enter the public leaderboard.
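As a rough illustration of what a smoke check of this kind might look like, the sketch below validates a run bundle represented as a plain dictionary. The field names, tool names, and scorer versions are all hypothetical; EPOB's actual schema is not specified here, so treat this as a sketch under those assumptions rather than the benchmark's validator.

```python
# Hypothetical names throughout: these are NOT EPOB's real schema fields.
REQUIRED_FIELDS = {"task_id", "seed", "scorer_version", "tool_calls", "evidence"}
ALLOWED_TOOLS = {"shell", "editor"}   # assumed tool boundary
SUPPORTED_SCORERS = {"1.0"}           # assumed compatible scorer versions


def smoke_check(bundle: dict) -> list[str]:
    """Return a list of problems; an empty list means the bundle passes."""
    problems = []

    # Schema / bundle shape: every required field must be present.
    missing = REQUIRED_FIELDS - bundle.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    # Tool boundaries: every recorded call must use an allowed tool.
    tool_calls = bundle.get("tool_calls", [])
    if not isinstance(tool_calls, list):
        problems.append("tool_calls must be a list")
    else:
        for call in tool_calls:
            tool = call.get("tool") if isinstance(call, dict) else None
            if tool not in ALLOWED_TOOLS:
                problems.append(f"tool outside boundary: {tool!r}")

    # Scorer compatibility: only known scorer versions are accepted.
    if bundle.get("scorer_version") not in SUPPORTED_SCORERS:
        problems.append("unsupported scorer version")

    return problems
```

A bundle that returns an empty list would be ready for a Tier 1 run; anything else is fixed before it goes further.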

Tier 1

Public Comparable

Run with a fixed task set, seed set, resource profile, timeout, evidence schema, and scorer version. These runs can enter public leaderboard snapshots.
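One way to picture why fixing these six items keeps results comparable is to treat them as a single immutable record and compare records for equality. The class and field names below are assumptions made for illustration, not EPOB's own data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """Hypothetical record of the settings a Tier 1 run holds fixed."""
    task_set: str
    seed_set: tuple[int, ...]
    resource_profile: str
    timeout_s: int
    evidence_schema: str
    scorer_version: str


def comparable(a: Protocol, b: Protocol) -> bool:
    """Two runs are comparable only if every fixed setting matches."""
    return a == b
```

Under this sketch, changing any single field, such as the timeout, makes two runs non-comparable, which is the property a public leaderboard snapshot relies on.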

Tier 2

Robustness Track

Expand across task families, providers, and model endpoints for deeper sensitivity analysis. This track costs more to run and is not the default submission path.
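The cost of this track follows from the size of the grid it covers. A minimal sketch, assuming the run matrix is simply the cross product of the three dimensions (the example values are placeholders, not real EPOB task families, providers, or endpoints):

```python
from itertools import product


def robustness_cells(task_families, providers, endpoints):
    """Enumerate every (task family, provider, endpoint) combination."""
    return list(product(task_families, providers, endpoints))


# Placeholder values for illustration only.
cells = robustness_cells(
    ["family_a", "family_b"],
    ["provider_x", "provider_y"],
    ["endpoint_1", "endpoint_2"],
)
```

Because the cell count is the product of the dimension sizes, adding one more value to any dimension multiplies the number of runs rather than adding to it.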

Tier 3

Audit And Reproducibility

Rerun selected cells, preserve hashes, inspect evidence packets, and publish limitations when reviewer agreement or formal venue checks are incomplete.
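Preserving hashes makes a rerun checkable: if the evidence packet from a rerun hashes to the recorded value, the cell reproduced exactly. The sketch below assumes evidence packets are JSON-serializable dictionaries and uses SHA-256 over a canonical JSON encoding; both choices, and the function names, are assumptions rather than EPOB's documented procedure.

```python
import hashlib
import json


def packet_hash(packet: dict) -> str:
    """Hash a packet via canonical JSON so key order does not matter."""
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rerun_matches(recorded_hash: str, rerun_packet: dict) -> bool:
    """True when the rerun's evidence packet reproduces the recorded hash."""
    return packet_hash(rerun_packet) == recorded_hash
```

An exact hash match is a strict criterion. Runs with nondeterministic output would need a looser comparison, and a mismatch is the kind of outcome that would be reported under published limitations.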

Reference Anchors

Fixed protocol, versioned anchors