MCP Tool Evaluation with Claude Code: Artifact Delivery

A production playbook for MCP tool evaluation with Claude Code in cross-industry operations, covering artifact delivery alongside run-scoped inputs, logs, and typed results.

Audience: AI platform teams adopting MCP

The problem

AI platform teams adopting MCP need tool evaluation that runs repeatedly against the same tool definitions, auth policy, traces, and test cases. In cross-industry operations, the hard part is not getting one good answer; it is repeatability, auditability, exception handling, and evidence that survives handoff.

Implementation path

Require Claude Code to write customer-visible files under /skill/output/artifacts, validate filenames and sizes, then return signed artifact metadata in argo.result.v1.
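
The validation step can be sketched in Python as follows. This is a minimal illustration, not the platform's actual implementation: the size cap, the filename pattern, and every field name inside the result other than the argo.result.v1 schema label are assumptions made for the example. Signing is assumed to happen on the platform side and is not shown.

```python
import hashlib
import re
from pathlib import Path

# Assumptions (not from the source): the size cap, the filename pattern,
# and the field names inside the result are illustrative only.
ARTIFACT_DIR = Path("/skill/output/artifacts")
MAX_BYTES = 10 * 1024 * 1024  # hypothetical 10 MiB cap per file
SAFE_NAME = re.compile(r"^[a-z0-9][a-z0-9._-]{0,127}$")  # hypothetical naming rule


def validate_artifact(path: Path, max_bytes: int = MAX_BYTES) -> dict:
    """Check one file and return its metadata, or raise ValueError."""
    if not SAFE_NAME.match(path.name):
        raise ValueError(f"unsafe filename: {path.name!r}")
    if not path.is_file():
        raise ValueError(f"not a regular file: {path.name!r}")
    size = path.stat().st_size
    if size == 0 or size > max_bytes:
        raise ValueError(f"size out of bounds: {path.name!r} ({size} bytes)")
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"name": path.name, "bytes": size, "sha256": digest}


def build_result(artifact_dir: Path = ARTIFACT_DIR) -> dict:
    """Validate every file in the directory and wrap the metadata in a result."""
    artifacts = [validate_artifact(p) for p in sorted(artifact_dir.iterdir())]
    return {"schema": "argo.result.v1", "status": "ok", "artifacts": artifacts}
```

Rejecting the whole run on the first bad file is a deliberate choice in this sketch: a partial artifact set is harder to audit than a clean failure that can be retried.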

Tradeoffs and failure modes

An artifact policy restricts what the agent can write, and in exchange customers receive files that are durable, typed, and safe to download. For MCP tool evaluation, the practical test is whether a second run can be debugged, retried, and consumed by a product without anyone reading the raw agent transcript.

Artifact manifest

artifacts:
  - mcp-tool-evaluation-summary.md
  - mcp-tool-evaluation-evidence.csv
  - mcp-tool-evaluation-review.json
signed_urls: true
retention: org_policy
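
One way to use the manifest above is to compare it with what a run actually produced. The Python sketch below mirrors the manifest as a plain dict so it needs no YAML parser; the function name check_manifest and the shape of its return value are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# Mirrors the manifest above as a plain dict so the sketch needs no YAML parser.
MANIFEST = {
    "artifacts": [
        "mcp-tool-evaluation-summary.md",
        "mcp-tool-evaluation-evidence.csv",
        "mcp-tool-evaluation-review.json",
    ],
    "signed_urls": True,
    "retention": "org_policy",
}


def check_manifest(manifest: dict, artifact_dir: Path) -> dict:
    """Compare declared artifact names with the files actually present."""
    declared = set(manifest["artifacts"])
    present = {p.name for p in artifact_dir.iterdir() if p.is_file()}
    return {
        "missing": sorted(declared - present),        # declared but not written
        "unexpected": sorted(present - declared),     # written but not declared
    }
```

A run whose report has both lists empty delivered exactly the declared files; anything else is a candidate for retry or review before download links are issued.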

Run this on Argo