Agent Benchmark Report with Claude Code: Result JSON Schema

A production playbook for running agent benchmark reports in cross-industry operations with Claude Code. It covers the result JSON schema, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI infra teams comparing providers

The problem

AI infra teams comparing providers need the agent benchmark report to run repeatedly against benchmark cases, transcripts, costs, and output ratings. In cross-industry operations, the hard part is not getting one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Define the outer result contract (schema_version, summary, body, artifacts) once, let the agent benchmark report skill own body.data, and reject any terminal output that does not match the expected schema.
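A minimal sketch of the rejection step, assuming the terminal output arrives as a JSON string and that the outer contract is exactly the four top-level fields shown under Result shape. The function name validate_result and the error messages are illustrative, not part of any Argo or Claude Code API. The contents of body.data are deliberately left unchecked, since the skill owns them.

```python
import json

# Values taken from the result shape in this playbook.
EXPECTED_SCHEMA_VERSION = "argo.result.v1"
EXPECTED_BODY_TYPE = "agent_benchmark_report"


def validate_result(raw: str) -> dict:
    """Parse terminal output and reject anything that breaks the outer contract.

    Hypothetical helper: only the outer envelope is checked here.
    """
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"terminal output is not valid JSON: {exc}") from exc

    if not isinstance(result, dict):
        raise ValueError("result must be a JSON object")
    if result.get("schema_version") != EXPECTED_SCHEMA_VERSION:
        raise ValueError("unexpected or missing schema_version")
    if not isinstance(result.get("summary"), str):
        raise ValueError("summary must be a string")
    if not isinstance(result.get("artifacts"), list):
        raise ValueError("artifacts must be a list")

    body = result.get("body")
    if not isinstance(body, dict):
        raise ValueError("body must be an object")
    if body.get("type") != EXPECTED_BODY_TYPE:
        raise ValueError("unexpected or missing body.type")
    if not isinstance(body.get("data"), dict):
        raise ValueError("body.data must be an object")
    if not isinstance(body.get("exceptions"), list):
        raise ValueError("body.exceptions must be a list")

    return result
```

Raising on the first mismatch keeps the failure reason specific, which helps when a rejected run has to be debugged or retried later.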

Tradeoffs and failure modes

Schema enforcement adds upfront design work, but it means the product consumes typed results instead of parsing prompt output. For the agent benchmark report, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.
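One way to make that test concrete is to decide the next step from the typed result alone. The sketch below assumes each entry in body.exceptions is an object carrying a boolean retryable field; that field, the triage function, and the three action names are hypothetical, since the playbook does not define the shape of an exception entry.

```python
def triage(result: dict) -> str:
    """Pick the next action for a run using only the typed result.

    Assumption: each exception entry is a dict with a hypothetical
    boolean "retryable" field. The source does not specify this shape.
    """
    exceptions = result["body"]["exceptions"]
    if not exceptions:
        # Clean run: the product can consume body.data directly.
        return "consume"
    if all(isinstance(e, dict) and e.get("retryable") is True for e in exceptions):
        # Every reported problem is marked as safe to retry.
        return "retry"
    # Anything else needs a person to look at the logs and artifacts.
    return "review"
```

If a decision like this cannot be made without opening the transcript, the result contract is probably missing a field.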

Result shape

{
  "schema_version": "argo.result.v1",
  "summary": "agent benchmark report completed",
  "body": { "type": "agent_benchmark_report", "data": {}, "exceptions": [] },
  "artifacts": []
}

Run this on Argo