Agent Evaluation Suite with Claude Code: Cost Controls

A production playbook for running an agent evaluation suite in cross-industry operations with Claude Code: cost controls, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI quality teams

The problem

AI quality teams need an agent evaluation suite that runs repeatedly against eval cases, expected outputs, logs, and scoring rubrics. In cross-industry operations, the pain is not getting one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Set explicit limits for the agent evaluation suite: input size, run time, tool calls, artifacts, retries, and concurrent runs per organization.

Tradeoffs and failure modes

Limits reject pathological runs, and will sometimes reject a legitimate but unusually large one, but they keep a single workflow from turning into an unbounded infrastructure bill. For an agent evaluation suite, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.
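
One way to make a run consumable without the raw transcript is to emit a small typed record per eval case. The sketch below is illustrative only: the field names (run_id, case_id, status, score, limit_hit, artifact_paths) and the three status values are assumptions made for this example, not part of any Claude Code API.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Hypothetical outcome values for one eval case."""
    PASSED = "passed"
    FAILED = "failed"
    REJECTED = "rejected"  # the run was stopped by a limit, not scored


@dataclass
class CaseResult:
    """Typed result for one eval case (all field names are assumptions)."""
    run_id: str
    case_id: str
    status: Status
    score: float
    limit_hit: Optional[str] = None  # name of the limit that stopped the run
    artifact_paths: List[str] = field(default_factory=list)


def to_json(result: CaseResult) -> str:
    """Serialize a result to stable JSON that a downstream product can parse."""
    record = asdict(result)
    record["status"] = result.status.value  # store the enum as a plain string
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    rejected = CaseResult(
        run_id="run-1",
        case_id="case-7",
        status=Status.REJECTED,
        score=0.0,
        limit_hit="max_tool_calls",
    )
    print(to_json(rejected))
```

Recording the limit that stopped a run as a field, rather than leaving it in log text, lets a retry policy or dashboard tell "rejected by a limit" apart from "failed the rubric" without parsing the transcript.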

Run limits

max_run_seconds=1800
max_input_bytes=104857600
max_artifact_bytes=104857600
max_tool_calls=120
retry_after_seconds=60
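
As a rough sketch of how these values might be enforced, the Python below parses the key=value lines above and reports which limits a finished run exceeded. The RunUsage type, its field names, and the violations helper are hypothetical. The sketch only parses retry_after_seconds and does not act on it, because the playbook does not define how that value is applied.

```python
from dataclasses import dataclass
from typing import List

# The limits exactly as listed in this playbook.
LIMITS_TEXT = """\
max_run_seconds=1800
max_input_bytes=104857600
max_artifact_bytes=104857600
max_tool_calls=120
retry_after_seconds=60
"""


@dataclass(frozen=True)
class RunLimits:
    max_run_seconds: int
    max_input_bytes: int
    max_artifact_bytes: int
    max_tool_calls: int
    retry_after_seconds: int


@dataclass(frozen=True)
class RunUsage:
    """Measured usage for one run (hypothetical shape for this example)."""
    run_seconds: float
    input_bytes: int
    artifact_bytes: int
    tool_calls: int


def parse_limits(text: str) -> RunLimits:
    """Parse key=value lines into a typed limits object."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, raw = line.partition("=")
        values[key.strip()] = int(raw)
    # Unknown or missing keys raise TypeError here instead of being ignored.
    return RunLimits(**values)


def violations(limits: RunLimits, usage: RunUsage) -> List[str]:
    """Return the names of every limit the run exceeded, in a fixed order."""
    checks = [
        ("max_run_seconds", usage.run_seconds, limits.max_run_seconds),
        ("max_input_bytes", usage.input_bytes, limits.max_input_bytes),
        ("max_artifact_bytes", usage.artifact_bytes, limits.max_artifact_bytes),
        ("max_tool_calls", usage.tool_calls, limits.max_tool_calls),
    ]
    return [name for name, used, cap in checks if used > cap]


if __name__ == "__main__":
    limits = parse_limits(LIMITS_TEXT)
    usage = RunUsage(run_seconds=1801.0, input_bytes=1, artifact_bytes=1, tool_calls=121)
    print(violations(limits, usage))
```

Returning the names of the exceeded limits, rather than a single boolean, keeps a rejection auditable: the same names can be written to the run log and to the typed result.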
