MCP Tool Evaluation with Codex: API Runtime Pattern

A production playbook for MCP tool evaluation in cross-industry operations using Codex, covering the API runtime pattern, run-scoped inputs, logs, typed results, and artifacts.

Audience: AI platform teams adopting MCP

The problem

AI platform teams adopting MCP need tool evaluation that runs repeatedly against tool definitions, auth policy, traces, and test cases. In cross-industry operations, the hard part is not getting one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Package the MCP tool evaluation instructions as a skill. Send tool definitions, auth policy, traces, and test cases as run-scoped inputs, execute the run with Codex, poll until the run reaches a terminal status, and consume the argo.result.v1 result instead of parsing a transcript.
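
The polling step can be sketched as a small loop. This is a minimal illustration, not the platform's client: the status names in TERMINAL_STATUSES, the "status" key, and the fetch_status callable (which would wrap whatever status endpoint the API exposes) are all assumptions, since the text does not specify them.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal status names; the source text does not list them.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def poll_until_terminal(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until the run reports a terminal status.

    fetch_status is a hypothetical callable that returns the run record
    as a dict with a "status" key. Raises TimeoutError if the run is
    still non-terminal once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_status()
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"run still {run.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Passing fetch_status and sleep as parameters keeps the loop independent of any HTTP client, so it can be exercised with canned status sequences before it is pointed at a live run.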

Tradeoffs and failure modes

The API boundary forces the workflow to define inputs, terminal states, and result shape before customers depend on it. For MCP tool evaluation, the practical test is whether a second run can be debugged, retried, and consumed by a product without reading the raw agent transcript.
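
One way to make the result shape explicit is to validate the payload into a typed object before anything downstream reads it. The sketch below assumes a hypothetical envelope with "schema", "status", "output", and "artifacts" keys; only the schema name argo.result.v1 comes from the text, and the real field layout may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

EXPECTED_SCHEMA = "argo.result.v1"  # name from the text; the fields below are assumed


@dataclass(frozen=True)
class RunResult:
    """Hypothetical typed view of a run result."""
    status: str
    output: Dict[str, Any]
    artifacts: List[str] = field(default_factory=list)


def parse_result(payload: Dict[str, Any]) -> RunResult:
    """Validate an assumed result envelope and return a typed RunResult.

    Raises ValueError on any mismatch so callers fail loudly instead of
    consuming a malformed result.
    """
    if payload.get("schema") != EXPECTED_SCHEMA:
        raise ValueError(f"unexpected schema: {payload.get('schema')!r}")
    status = payload.get("status")
    if not isinstance(status, str):
        raise ValueError("status must be a string")
    output = payload.get("output", {})
    if not isinstance(output, dict):
        raise ValueError("output must be an object")
    artifacts = payload.get("artifacts", [])
    if not isinstance(artifacts, list) or not all(isinstance(a, str) for a in artifacts):
        raise ValueError("artifacts must be a list of strings")
    return RunResult(status=status, output=output, artifacts=list(artifacts))
```

Rejecting an unexpected schema up front means a retry or a second run either yields the same typed object or fails with a specific error, which is easier to debug than a transcript.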

Run request

POST /api/skills/<skill_id>/run
provider=codex
workflow=mcp-tool-evaluation
inputs[]=@./input-pack.zip
result_schema=argo.result.v1
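
The request above can be assembled in code before it is handed to an HTTP client. This sketch assumes the @ prefix denotes a file upload (as in curl's form syntax) and that the remaining lines are plain form fields; the base URL and the build_run_request helper are hypothetical, and sending the request is left to whichever client is in use.

```python
from typing import Dict, Tuple
from urllib.parse import quote


def build_run_request(
    base_url: str, skill_id: str, input_pack: str
) -> Tuple[str, Dict[str, str], Dict[str, str]]:
    """Return (url, form_fields, file_fields) for the run request.

    The path and field names come from the request shown above. Treating
    inputs[] as a file-upload field is an assumption.
    """
    url = f"{base_url.rstrip('/')}/api/skills/{quote(skill_id, safe='')}/run"
    fields = {
        "provider": "codex",
        "workflow": "mcp-tool-evaluation",
        "result_schema": "argo.result.v1",
    }
    files = {"inputs[]": input_pack}  # local path of the file to upload
    return url, fields, files
```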
