Agent Benchmark Report with Codex: API Runtime Pattern

A production playbook for running agent benchmark reports in cross-industry operations with Codex: the API runtime pattern, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI infra teams comparing providers

The problem

AI infra teams comparing providers need the agent benchmark report to run repeatedly against benchmark cases, transcripts, costs, and output ratings. In cross-industry operations, the hard part is not getting one good answer; it is repeatability, auditability, exception handling, and evidence that survives a handoff.

Implementation path

Package the agent benchmark report instructions as a skill, send the benchmark cases, transcripts, costs, and output ratings as run-scoped inputs, execute the run with Codex, poll until it reaches a terminal status, and consume the typed argo.result.v1 result instead of parsing a transcript.
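
The polling step can be sketched as below. This is a minimal illustration, not the documented client: the terminal status names ("succeeded", "failed", "cancelled"), the "status" key, and the `fetch_run` callable are all assumptions, since the source names neither the status values nor the endpoint that reports them. The caller supplies `fetch_run`, which would wrap whatever HTTP call returns the current run record.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; the source does not enumerate them.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def poll_until_terminal(
    fetch_run: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_run() until the run record reports a terminal status.

    fetch_run is a hypothetical hook that returns the run record as a dict
    with a "status" key. Raises TimeoutError if no terminal status is seen
    before timeout_s elapses.
    """
    deadline = clock() + timeout_s
    while True:
        run = fetch_run()
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run not terminal after {timeout_s} seconds")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting, and returning the whole run record lets the caller branch on failed or cancelled runs rather than treating every terminal state as success.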

Tradeoffs and failure modes

The API boundary forces the workflow to define its inputs, terminal states, and result shape before customers depend on it. For the agent benchmark report, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.
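
One way to make the result shape explicit is to parse the payload into a typed object and reject anything that does not match. The sketch below assumes a payload with "schema", "status", "data", and "artifacts" keys; only the schema name argo.result.v1 comes from this playbook, and the field names are hypothetical placeholders for whatever the real schema defines.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

EXPECTED_SCHEMA = "argo.result.v1"  # schema name as given in the run request


@dataclass(frozen=True)
class RunResult:
    # All field names here are assumptions, not the published schema.
    schema: str
    status: str
    data: Dict[str, Any]
    artifacts: List[str] = field(default_factory=list)


def parse_result(payload: Dict[str, Any]) -> RunResult:
    """Validate a result payload and return it as a typed RunResult."""
    schema = payload.get("schema")
    if schema != EXPECTED_SCHEMA:
        raise ValueError(f"unexpected result schema: {schema!r}")
    status = payload.get("status")
    if not isinstance(status, str):
        raise ValueError("result is missing a string 'status'")
    data = payload.get("data", {})
    if not isinstance(data, dict):
        raise ValueError("'data' must be an object")
    artifacts = payload.get("artifacts", [])
    if not isinstance(artifacts, list) or not all(isinstance(a, str) for a in artifacts):
        raise ValueError("'artifacts' must be a list of strings")
    return RunResult(schema=schema, status=status, data=data, artifacts=list(artifacts))
```

Failing loudly on a schema mismatch means a consuming product sees a clear error at the boundary instead of silently reading fields that changed shape.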

Run request

POST /api/skills/<skill_id>/run
provider=codex
workflow=agent-benchmark-report
inputs[]=@./input-pack.zip
result_schema=argo.result.v1
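
The same request can be assembled in client code before it is sent. The sketch below only builds the URL and fields shown above; the base URL, the `build_run_request` helper, and the idea that `inputs[]` is uploaded as a file part are assumptions, and the actual HTTP call is left to whichever client library is in use.

```python
from pathlib import Path
from typing import Any, Dict
from urllib.parse import quote


def build_run_request(base_url: str, skill_id: str, input_pack: Path) -> Dict[str, Any]:
    """Return the URL, form fields, and file parts for one run request.

    base_url is a placeholder supplied by the caller; the path and field
    names mirror the run request in this playbook.
    """
    url = f"{base_url.rstrip('/')}/api/skills/{quote(skill_id, safe='')}/run"
    fields = {
        "provider": "codex",
        "workflow": "agent-benchmark-report",
        "result_schema": "argo.result.v1",
    }
    # Assumed: inputs[] carries the run-scoped input pack as an uploaded file.
    files = {"inputs[]": str(input_pack)}
    return {"method": "POST", "url": url, "fields": fields, "files": files}
```

Keeping request construction separate from sending makes it easy to log exactly what a run was started with, which helps when a later run has to be compared against an earlier one.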
