Agent Evaluation Suite with Codex: SKILL.md Template

A production playbook for running an agent evaluation suite in cross-industry operations with Codex: a SKILL.md template, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI quality teams

The problem

AI quality teams need an agent evaluation suite that runs repeatedly against eval cases, expected outputs, logs, and scoring rubrics. In cross-industry operations, the pain is not getting one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Put the operating procedure in SKILL.md, keep examples beside the skill, attach eval cases, expected outputs, logs, and scoring rubrics per run, and let Argo turn the folder into a repeatable Codex execution.
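As a rough illustration of "attach inputs per run", the sketch below stages eval cases and a scoring rubric into the skill's input folder before a run. Only the `.argo/inputs` location comes from the starter in this playbook; the function name `stage_run_inputs` and the file names `eval_cases.jsonl` and `rubric.json` are hypothetical, and this is not Argo's own API.

```python
import json
from pathlib import Path
from typing import Dict, List


def stage_run_inputs(skill_root: Path, eval_cases: List[Dict], rubric: Dict) -> List[Path]:
    """Write run-scoped inputs under <skill_root>/.argo/inputs.

    The file names used here are assumptions for illustration only.
    """
    inputs_dir = skill_root / ".argo" / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)

    # One JSON object per line keeps individual cases easy to diff and replay.
    cases_path = inputs_dir / "eval_cases.jsonl"
    with cases_path.open("w", encoding="utf-8") as fh:
        for case in eval_cases:
            fh.write(json.dumps(case, sort_keys=True) + "\n")

    # The rubric is stored next to the cases so a run carries its own scoring rules.
    rubric_path = inputs_dir / "rubric.json"
    rubric_path.write_text(json.dumps(rubric, indent=2, sort_keys=True), encoding="utf-8")

    return [cases_path, rubric_path]
```

Writing keys in sorted order makes two stagings of the same inputs byte-identical, which helps when comparing runs.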

Tradeoffs and failure modes

A skill folder is less flexible than an open chat, but it gives the product a versioned workflow that can be tested and rolled back. For an agent evaluation suite, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.
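One way to picture "consumed by a product without reading the transcript" is a small triage step over typed per-case results. The sketch below is an assumption-heavy illustration: the `CaseResult` fields, the `triage` function, and the 0.8 pass threshold are all hypothetical and not part of any Argo or Codex schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical typed record for one eval case."""
    case_id: str
    score: float
    error: Optional[str] = None  # set when the case could not be scored


def triage(results: Iterable[CaseResult], pass_threshold: float = 0.8) -> Dict[str, List[str]]:
    """Sort case ids into passed, failed, and retry buckets using typed fields only."""
    buckets: Dict[str, List[str]] = {"passed": [], "failed": [], "retry": []}
    for result in results:
        if result.error is not None:
            # Execution problems are retried rather than counted as quality failures.
            buckets["retry"].append(result.case_id)
        elif result.score >= pass_threshold:
            buckets["passed"].append(result.case_id)
        else:
            buckets["failed"].append(result.case_id)
    return buckets
```

Keeping execution errors separate from low scores is the design choice that matters here: a timeout should trigger a retry, while a low score should trigger a review of the agent or the rubric.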

SKILL.md starter

# SKILL.md
You run the agent evaluation suite using Codex.
Read only /skill/.argo/inputs.
Write artifacts to /skill/output/artifacts.
Return argo.result.v1 with body.type = "agent_evaluation_suite".
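The last two lines of the starter can be sketched as a helper that writes a scores artifact and builds a result envelope. The starter only specifies the name `argo.result.v1`, the `body.type` value, and the artifacts folder; the `schema` key, the other `body` fields, the `scores.json` file name, and the `write_result` function are assumptions made for illustration, not the actual Argo schema.

```python
import json
from pathlib import Path
from typing import Dict


def write_result(skill_root: Path, scores: Dict[str, float]) -> Dict:
    """Write a scores artifact and return a result envelope (shape is assumed)."""
    artifacts_dir = skill_root / "output" / "artifacts"
    artifacts_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical artifact: per-case scores keyed by case id.
    scores_path = artifacts_dir / "scores.json"
    scores_path.write_text(json.dumps(scores, indent=2, sort_keys=True), encoding="utf-8")

    mean_score = sum(scores.values()) / len(scores) if scores else 0.0
    return {
        "schema": "argo.result.v1",  # key name is an assumption
        "body": {
            "type": "agent_evaluation_suite",  # value taken from the starter
            "case_count": len(scores),
            "mean_score": mean_score,
            # Paths are stored relative to the skill root so the result is portable.
            "artifacts": [scores_path.relative_to(skill_root).as_posix()],
        },
    }
```

In a real run the root would be `/skill`, per the starter; passing it as a parameter keeps the sketch testable in a temporary directory.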

Run this on Argo