Submission
EPOB keeps results comparable by fixing the protocol, versioning reference anchors, and publishing evidence packets rather than relying on a single model-as-judge.
Smoke runs validate the schema, run-bundle shape, tool boundaries, and scorer compatibility. They do not enter the public leaderboard.
Run with a fixed task set, seed set, resource profile, timeout, evidence schema, and scorer version. These runs can enter public leaderboard snapshots.
Expand across task families, providers, and model endpoints for deeper sensitivity analysis. This track costs more and is not the default submission path.
Rerun selected cells, preserve hashes, inspect evidence packets, and publish limitations when reviewer agreement or formal venue checks are incomplete.
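As a rough illustration of how a fixed protocol and preserved hashes might fit together, the sketch below pins the six protocol fields named above in an immutable record and fingerprints an evidence packet with SHA-256 over canonical JSON. It is not EPOB's actual schema or tooling: the names `RunProtocol`, `canonical_hash`, `is_comparable`, and `verify_packet`, the packet layout, and every field value are assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunProtocol:
    """Hypothetical record of the settings a comparable run holds fixed."""
    task_set: str
    seed_set: tuple
    resource_profile: str
    timeout_seconds: int
    evidence_schema: str
    scorer_version: str


def canonical_hash(payload: dict) -> str:
    """Return SHA-256 of a canonical JSON encoding (sorted keys, compact separators)."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_comparable(a: RunProtocol, b: RunProtocol) -> bool:
    """Treat two runs as comparable only if every protocol field matches."""
    return a == b


def verify_packet(packet: dict, recorded_hash: str) -> bool:
    """Check that a rerun or re-read packet still matches its recorded hash."""
    return canonical_hash(packet) == recorded_hash


# Placeholder values only; none of these identifiers come from EPOB.
protocol = RunProtocol(
    task_set="tasks-example",
    seed_set=(0, 1, 2),
    resource_profile="profile-example",
    timeout_seconds=600,
    evidence_schema="evidence-example",
    scorer_version="scorer-example",
)
packet = {"protocol": asdict(protocol), "cell": "task-a/seed-0", "score": 0.5}
recorded = canonical_hash(packet)
print(verify_packet(packet, recorded))
```

Hashing a canonical encoding rather than the raw file means key order and whitespace do not change the fingerprint, so an audit can compare packets that were serialized by different tools.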
Reference Anchors