Agent Benchmark Report with Claude Code: SKILL.md Template

A production playbook for running agent benchmark reports in cross-industry operations with Claude Code: a SKILL.md template, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI infra teams comparing providers

The problem

AI infra teams comparing providers need agent benchmark reports that run repeatedly against benchmark cases, transcripts, costs, and output ratings. In cross-industry operations, the hard part is not producing one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Put the operating procedure in SKILL.md and keep examples beside the skill. Attach the benchmark cases, transcripts, costs, and output ratings to each run, and let Argo turn the folder into a repeatable Claude Code execution.
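
To make the run-scoped inputs concrete, here is a minimal Python sketch of how a skill might read benchmark cases and reduce them to a per-provider summary. The file name cases.jsonl and the fields provider, cost_usd, and rating are assumptions made for illustration; the playbook does not define an input layout, so adapt the names to whatever your runs actually attach.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_cases(inputs_dir):
    """Read one benchmark case per line from a hypothetical cases.jsonl file."""
    path = Path(inputs_dir) / "cases.jsonl"  # assumed file name, not defined by the playbook
    with path.open(encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in handle if line.strip()]


def summarize_by_provider(cases):
    """Aggregate case count, total cost, and mean rating for each provider."""
    grouped = defaultdict(list)
    for case in cases:
        grouped[case["provider"]].append(case)  # "provider" is an assumed field

    summary = {}
    for provider, rows in sorted(grouped.items()):
        summary[provider] = {
            "cases": len(rows),
            "total_cost_usd": round(sum(r["cost_usd"] for r in rows), 6),
            "mean_rating": round(sum(r["rating"] for r in rows) / len(rows), 3),
        }
    return summary
```

Keeping the loader separate from the aggregation means the same summary logic can be exercised against fixture data stored beside the skill, without a live run.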

Tradeoffs and failure modes

A skill folder is less flexible than an open chat, but it gives the product a versioned workflow that can be tested and rolled back. For an agent benchmark report, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.
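
One way to apply that test is to check the typed result mechanically before a product consumes it. The sketch below is a minimal example under stated assumptions: the top-level "schema" key and the "artifacts" list are hypothetical field names chosen for illustration, while the argo.result.v1 identifier and the body.type value come from the SKILL.md starter in this playbook.

```python
EXPECTED_SCHEMA = "argo.result.v1"
EXPECTED_BODY_TYPE = "agent_benchmark_report"


def validate_result(result):
    """Return a list of problems; an empty list means the result looks consumable."""
    problems = []

    # Assumption: the schema identifier lives under a top-level "schema" key.
    if result.get("schema") != EXPECTED_SCHEMA:
        problems.append("unexpected schema")

    body = result.get("body")
    if not isinstance(body, dict):
        problems.append("body is missing or not an object")
        return problems  # nothing further can be checked without a body

    if body.get("type") != EXPECTED_BODY_TYPE:
        problems.append("unexpected body.type")

    # Assumption: artifacts are listed by name under body["artifacts"].
    if not isinstance(body.get("artifacts"), list):
        problems.append("body.artifacts must be a list")

    return problems
```

Returning a list of problems rather than raising on the first one gives a retry or triage step the full picture of what a failed run got wrong.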

SKILL.md starter

# SKILL.md
You run the agent benchmark report using Claude Code.
Read only /skill/.argo/inputs.
Write artifacts to /skill/output/artifacts.
Return argo.result.v1 with body.type = "agent_benchmark_report".
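
As an illustration of the last two lines of the starter, the following Python sketch writes a report artifact and builds the result the skill would return. It is a sketch under assumptions, not Argo's documented format: the report.json file name, the top-level "schema" key, and the "artifacts" list are hypothetical, and only the argo.result.v1 identifier and the body.type value are taken from the starter above.

```python
import json
from pathlib import Path


def write_report(summary, artifacts_dir):
    """Write the summary as a JSON artifact and return a typed result dict."""
    out_dir = Path(artifacts_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    report_path = out_dir / "report.json"  # hypothetical artifact name
    report_path.write_text(
        json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8"
    )

    return {
        "schema": "argo.result.v1",  # assumed key name for the schema identifier
        "body": {
            "type": "agent_benchmark_report",
            "artifacts": [report_path.name],  # assumed field listing artifact names
        },
    }
```

In a real run, artifacts_dir would be /skill/output/artifacts as the starter specifies; taking it as a parameter keeps the function testable against a temporary directory.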

Run this on Argo