Agent Evaluation Suite with Claude Code: API Runtime Pattern

A production playbook for running an agent evaluation suite in cross-industry operations with Claude Code: the API runtime pattern, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI quality teams

The problem

AI quality teams need an agent evaluation suite that runs repeatedly against eval cases, expected outputs, logs, and scoring rubrics. In cross-industry operations, the pain is not producing one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Package the agent evaluation suite instructions as a skill; send eval cases, expected outputs, logs, and scoring rubrics as run-scoped inputs; execute with Claude Code; poll until the run reaches a terminal status; and consume the argo.result.v1 result instead of parsing a transcript.
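The polling step can be sketched as follows. This is a minimal illustration, not the platform's client: the status names in TERMINAL_STATUSES, the "status" key, and the fetch_run callable are all assumptions, since the text names neither the terminal states nor a status endpoint.

```python
import time
from typing import Any, Callable, Dict

# Assumption: illustrative status names; the text does not list the real ones.
TERMINAL_STATUSES = frozenset({"succeeded", "failed", "cancelled"})


def poll_until_terminal(
    fetch_run: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_run() until it reports a terminal status or polls run out."""
    for _ in range(max_polls):
        run = fetch_run()
        # Assumption: the run record exposes its state under a "status" key.
        if run.get("status") in TERMINAL_STATUSES:
            return run
        sleep(interval_s)
    raise TimeoutError(f"run not terminal after {max_polls} polls")
```

In practice fetch_run would wrap whatever HTTP call returns the run record. Injecting it, along with sleep, keeps the loop testable without a network and makes the timeout an explicit failure rather than a hung job.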

Tradeoffs and failure modes

The API boundary forces the workflow to define its inputs, terminal states, and result shape before customers depend on it. For an agent evaluation suite, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.
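One way to make the result shape explicit on the consumer side is to parse the payload into a typed object and reject anything that does not declare the expected schema. The sketch below is hedged accordingly: only the schema name argo.result.v1 comes from the text, while the "schema" key and the fields status, scores, artifacts, and error are hypothetical stand-ins for whatever the real shape contains.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

EXPECTED_SCHEMA = "argo.result.v1"  # schema name taken from the text


@dataclass(frozen=True)
class EvalResult:
    # Hypothetical fields; the real argo.result.v1 shape is not given in the text.
    status: str
    scores: Dict[str, float] = field(default_factory=dict)
    artifacts: List[str] = field(default_factory=list)
    error: Optional[str] = None


def parse_result(payload: Dict[str, Any]) -> EvalResult:
    """Reject payloads with an unexpected schema tag, then build a typed result."""
    if payload.get("schema") != EXPECTED_SCHEMA:
        raise ValueError(f"unexpected schema: {payload.get('schema')!r}")
    if "status" not in payload:
        raise ValueError("result is missing 'status'")
    return EvalResult(
        status=str(payload["status"]),
        scores={k: float(v) for k, v in payload.get("scores", {}).items()},
        artifacts=list(payload.get("artifacts", [])),
        error=payload.get("error"),
    )
```

Failing loudly on a schema mismatch means a downstream product sees a clear error at the boundary instead of silently misreading a changed result.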

Run request

POST /api/skills/<skill_id>/run
provider=claude-code
workflow=agent-evaluation-suite
inputs[]=@./input-pack.zip
result_schema=argo.result.v1
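The request above could be assembled in Python roughly as follows. This sketch only builds the pieces and sends nothing. The base URL and skill ID in the usage note are hypothetical, and treating inputs[] as a multipart file upload is an assumption drawn from the @ notation, not something the text confirms.

```python
from typing import Dict, List, Tuple


def build_run_request(
    base_url: str, skill_id: str, input_paths: List[str]
) -> Tuple[str, Dict[str, str], List[Tuple[str, str]]]:
    """Return (url, form_fields, uploads) mirroring the run request fields."""
    url = f"{base_url.rstrip('/')}/api/skills/{skill_id}/run"
    fields = {
        "provider": "claude-code",
        "workflow": "agent-evaluation-suite",
        "result_schema": "argo.result.v1",
    }
    # Assumption: each input pack is uploaded as a file under the "inputs[]" field.
    uploads = [("inputs[]", path) for path in input_paths]
    return url, fields, uploads
```

For example, build_run_request("https://argo.example.com", "skill_123", ["./input-pack.zip"]) returns the URL, the three form fields, and one upload entry, which can then be handed to any HTTP client that supports multipart form uploads.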
