Token and context usage metrics

Hardproof includes a usage overlay in every scan report under usage_metrics. The overlay measures how much of an agent's context window a server consumes, focusing on the tools/list catalog, schema payloads, and typical response sizes.

Why this exists

Oversized tool catalogs crowd the user's actual task out of the model context.
Oversized responses make agents brittle (truncation, high cost, hard-to-diff regressions).
Schema bloat increases the steady-state prompt budget for every tool call.

What’s measured

The report includes byte counts and token counts for:

Tool catalog: size of tools/list and its token footprint.
Descriptions and tool count: average and max description size, plus overall tool count.
Schema footprint: total tool input schema size and token footprint.
Response footprint: typical response payload token footprint (p50/p95).
Metadata-to-payload ratio: how much schema and descriptor overhead the server adds compared with the actual payload it returns.
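As a rough illustration of how these figures relate, the sketch below computes them from a tools/list result and a set of sample responses. It is not Hardproof's estimator: the 4-bytes-per-token heuristic, the nearest-rank percentile, the output key names, and the ratio definition (catalog tokens divided by p50 response tokens) are all assumptions made for this example. Only the name, description, and inputSchema fields come from the MCP tools/list shape.

```python
import json
import math


def byte_size(obj) -> int:
    """UTF-8 byte length of the compact JSON serialization of obj."""
    text = json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def estimate_tokens(n_bytes: int, bytes_per_token: float = 4.0) -> int:
    """ASSUMPTION: roughly 4 bytes per token. A heuristic, not a tokenizer."""
    return math.ceil(n_bytes / bytes_per_token)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def usage_summary(tools, responses):
    """Summarize catalog, schema, description, and response footprints.

    The returned key names are hypothetical, not Hardproof's report schema.
    """
    catalog_tokens = estimate_tokens(byte_size({"tools": tools}))
    schema_bytes = sum(byte_size(t.get("inputSchema", {})) for t in tools)
    desc_tokens = [
        estimate_tokens(len(t.get("description", "").encode("utf-8")))
        for t in tools
    ] or [0]
    resp_tokens = [estimate_tokens(byte_size(r)) for r in responses] or [0]
    p50 = percentile(resp_tokens, 50)
    return {
        "tool_count": len(tools),
        "catalog_tokens": catalog_tokens,
        "schema_tokens": estimate_tokens(schema_bytes),
        "avg_description_tokens": sum(desc_tokens) / len(desc_tokens),
        "max_description_tokens": max(desc_tokens),
        "response_tokens_p50": p50,
        "response_tokens_p95": percentile(resp_tokens, 95),
        # ASSUMPTION: ratio = catalog tokens relative to a typical response.
        "metadata_to_payload_ratio_pct": (
            round(100 * catalog_tokens / p50, 1) if p50 else None
        ),
    }
```

A ratio well above 100 under this definition would mean the catalog costs more context than a typical response delivers, which is the kind of imbalance the overlay is meant to surface.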

Truth classes

There is no single, universal “real token count” for an MCP server: a count is only meaningful once you either choose a tokenizer family or ingest a real client trace. Hardproof makes this explicit in usage_metrics, which records the requested mode, the status, and the effective usage_mode:

usage_mode=estimate: deterministic estimates (used when auto falls back, or when explicitly requested).
usage_mode=tokenizer_exact: exact counts under a chosen tokenizer profile (for example --tokenizer openai:o200k_base).
usage_mode=trace_observed: observed counts from a real client trace (--token-trace).
usage_mode=mixed: per-metric mix of exact + observed when both are available.

By default, Hardproof uses requested_usage_mode=auto. Auto prefers exact tokenization when tokenizer tables are available and falls back to estimates when exact counting cannot be honored. Fallbacks and errors are explicit in usage_metrics.usage_status and usage_metrics.usage_fallback_reason.
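The auto behavior described above can be sketched as a small resolution function. This is a minimal illustration, not Hardproof's implementation: the field names mirror those mentioned in this section, but the status strings ("ok", "fallback") and the fallback reason text are hypothetical placeholders, and the sketch covers only the auto and estimate requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageResolution:
    requested_usage_mode: str
    usage_mode: str
    usage_status: str  # ASSUMPTION: "ok" / "fallback" are placeholder values
    usage_fallback_reason: Optional[str] = None


def resolve_usage_mode(requested: str, tokenizer_available: bool) -> UsageResolution:
    """Resolve the effective usage mode for an 'auto' or 'estimate' request."""
    if requested == "estimate":
        # Explicitly requested estimates are honored as-is.
        return UsageResolution(requested, "estimate", "ok")
    if requested == "auto":
        if tokenizer_available:
            # Auto prefers exact tokenization when tokenizer tables exist.
            return UsageResolution(requested, "tokenizer_exact", "ok")
        # Otherwise fall back to estimates and record why (hypothetical text).
        return UsageResolution(
            requested, "estimate", "fallback", "tokenizer tables unavailable"
        )
    raise NotImplementedError(f"mode {requested!r} is outside this sketch")
```

The point of keeping the requested and effective modes as separate fields is that a consumer can tell a deliberate estimate apart from one produced by a fallback.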

Estimator metadata

In estimate mode, the usage overlay records estimator_family, estimator_version, and confidence next to the estimate fields. These values are deterministic comparison signals, not billing-grade truth.

Why two token estimates exist

The report keeps both cl100k and o200k tool-catalog estimates because the same catalog can tokenize to a different count under each encoding. Keeping both lets consumers compare context pressure across two widely used tokenizer families.

How to keep usage healthy

Keep tool descriptions short and remove redundant examples.
Prefer fewer tools with clearer names over many narrowly-scoped tools.
Return only necessary fields; paginate and filter instead of returning “full objects”.
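On the server side, the "necessary fields plus pagination" advice might look like the sketch below. The helper names, the cursor scheme, and the items/next_cursor response shape are hypothetical choices for illustration, not anything Hardproof or MCP prescribes for tool results.

```python
def project(record: dict, fields: list) -> dict:
    """Keep only the requested fields that are present in the record."""
    return {key: record[key] for key in fields if key in record}


def paginate(records: list, fields: list, cursor: int = 0, limit: int = 20) -> dict:
    """Return one trimmed page plus a cursor for the next page, if any."""
    end = cursor + limit
    page = records[cursor:end]
    return {
        "items": [project(r, fields) for r in page],
        # None signals that there are no further pages.
        "next_cursor": end if end < len(records) else None,
    }


# Example: 45 wide records, of which the agent only needs id and title.
records = [
    {"id": i, "title": f"Item {i}", "body": "x" * 500, "audit_log": []}
    for i in range(45)
]
first_page = paginate(records, ["id", "title"], cursor=0, limit=20)
```

Each page here carries 20 small objects instead of 45 full ones, so the response footprint stays bounded no matter how large the underlying collection grows.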

CI policy

Hardproof can gate on usage directly with thresholds such as --max-avg-tool-description-tokens, --max-tool-count, and --max-metadata-to-payload-ratio-pct.
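To make the gate semantics concrete, the sketch below checks a metrics dictionary against maximum thresholds and lists the violations. It is only an illustration of threshold gating in general: the metric key names are hypothetical stand-ins chosen to echo the flags above, and the actual report fields and exit behavior should be taken from the report format documentation.

```python
def check_usage_gates(metrics: dict, thresholds: dict) -> list:
    """Return one message per metric that exceeds its maximum threshold.

    Keys are hypothetical; metrics missing from the report are skipped.
    """
    violations = []
    for key, limit in thresholds.items():
        value = metrics.get(key)
        if value is not None and value > limit:
            violations.append(f"{key}: {value} exceeds limit {limit}")
    return violations


# Hypothetical metric values and limits, for illustration only.
metrics = {
    "avg_tool_description_tokens": 120,
    "tool_count": 12,
    "metadata_to_payload_ratio_pct": 80,
}
thresholds = {
    "avg_tool_description_tokens": 100,
    "tool_count": 20,
    "metadata_to_payload_ratio_pct": 150,
}
violations = check_usage_gates(metrics, thresholds)
```

In a CI job, a non-empty violations list would translate into a failing exit code, so a regression in description size or tool count blocks the merge instead of surfacing later as degraded agent behavior.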

Methodology: /hardproof/methodology
Report format: /hardproof/report-format