How Hardproof scores servers (methodology)
Hardproof is a deterministic verifier. It runs a fixed set of checks and emits a machine-readable scan report that humans can review and agents can consume directly.
Dimensions
The scan report includes five dimensions:
- Conformance: protocol behavior and required method coverage, using the official conformance baseline and optional fuller suites.
- Security: deterministic checks for transport exposure, descriptor drift, injection patterns, command risk, and auth posture.
- Performance: smoke, steady, and bounded-concurrency probes that stay cheap enough for CI while still producing usable latency and throughput signals.
- Trust: publisher identity, signature and transparency evidence, and bundle consistency when release artifacts are provided.
- Reliability: malformed input handling, replay stability, repeated-call drift, and other failure behavior that should stay reproducible across runs.
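To make the shape of these dimensions concrete, here is an illustrative report fragment. The structure and every field name beyond the five dimension names are hypothetical, not Hardproof's actual schema; see the report-format page for the real layout.

```python
# Illustrative scan-report fragment. Field names other than the five
# dimension names are hypothetical, not Hardproof's actual schema.
scan_report = {
    "dimensions": {
        "conformance": {"score": 0.92, "checks_run": 48},
        "security":    {"score": 0.85, "findings": 3},
        "performance": {"score": 0.78, "p95_latency_ms": 140},
        "trust":       {"score": None, "reason": "no release artifacts provided"},
        "reliability": {"score": 0.90, "replay_stable": True},
    },
}

# Because the report is machine-readable, an agent can consume it directly,
# e.g. to list dimensions that have no defensible numeric score yet.
missing = [name for name, d in scan_report["dimensions"].items()
           if d["score"] is None]
print(missing)  # → ['trust']
```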
Score truth (publishable vs partial)
Hardproof does not treat every numeric signal as equally publishable. The report makes score truth explicit.
- `score_truth_status=publishable` yields `score_mode=full`, where `overall_score` is present and backed by enough weighted dimensions.
- `score_truth_status=partial` yields `score_mode=partial`, where `overall_score` is still computed as the effective score (matching `partial_score`), but the score is not publishable yet.
- `score_truth_status=insufficient` means there is not enough evidence to defend a numeric score yet.
Trust is the most common reason a scan remains partial. `hardproof ci` now fails on partial scores by default. If you want full-score gating in CI, pass trust artifacts; use `--allow-partial-score` only when a partial result is intentional, and `--require-trust-for-full-score` when you want the strictest trust-aware gate.
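The score-truth mapping above can be sketched as a small function. This is a hedged illustration of the documented states, not Hardproof's implementation; the return shape is invented for clarity.

```python
# Sketch of the score-truth states described above. The helper and its
# return shape are illustrative, not Hardproof's implementation.
def score_mode(score_truth_status: str) -> dict:
    if score_truth_status == "publishable":
        # overall_score is present and backed by enough weighted dimensions.
        return {"score_mode": "full", "publishable": True}
    if score_truth_status == "partial":
        # overall_score is still computed (matching partial_score),
        # but it is not publishable yet.
        return {"score_mode": "partial", "publishable": False}
    # "insufficient": not enough evidence to defend a numeric score at all.
    return {"score_mode": None, "publishable": False}

print(score_mode("partial"))  # → {'score_mode': 'partial', 'publishable': False}
```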
Confidence and estimates
Hardproof tries to keep evidence and confidence separable. Some outputs are deterministic pass/fail checks; others are bounded probes or deterministic estimates. Treat the report as a starting point for review, not as a substitute for judgment.
- Usage metrics (`usage_metrics`) are deterministic usage signals derived from exact tokenization under a selected tokenizer profile (default: `openai:o200k_base`) when tokenizer tables are available, with deterministic estimate fallback and optional observed truth from a real client trace (`--token-trace`). `usage_mode` makes the truth class explicit.
- Performance probes are bounded smoke/steady signals intended to stay cheap enough for CI. They include sample counts and a confidence marker (for example `tool_call_confidence`).
- Score truth is the public confidence boundary: partial scans keep `score_truth_status=partial` until missing evidence gates (typically Trust) are satisfied.
- Trust requires release metadata inputs. Without trust artifacts, the Trust dimension fails deterministically and the scan remains partial; with trust inputs, it can become publishable while still emitting warnings for missing transparency evidence. Use `--require-trust-for-full-score` if you want Trust to stay unknown until evidence is present.
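One plausible way to read the usage-metrics truth classes above is as a precedence chain: observed truth from a trace when available, exact tokenization when tables are present, deterministic estimate otherwise. The function and the class names it returns are assumptions for illustration; Hardproof's real selection logic and `usage_mode` labels may differ.

```python
# Hypothetical sketch of the usage-metrics truth classes described above.
# The precedence order and the returned labels are assumptions, not
# Hardproof's documented usage_mode values.
def usage_mode(have_token_trace: bool, have_tokenizer_tables: bool) -> str:
    if have_token_trace:
        return "observed"   # truth from a real client trace (--token-trace)
    if have_tokenizer_tables:
        return "exact"      # exact tokenization under the selected profile
    return "estimated"      # deterministic estimate fallback

print(usage_mode(False, True))   # → exact
print(usage_mode(False, False))  # → estimated
```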
Fairness and exclusions
Hardproof is intentionally scoped and deterministic. It does not attempt to answer every security or performance question, and its scores are not a cross-class leaderboard.
- Exclusions: deep exploitation, vulnerability confirmation, non-deterministic fuzzing, unconstrained load testing, and “LLM judge” evaluations.
- Fair comparisons require comparable conditions: same protocol baseline and suite, same transport, similar hardware/network, the same workload profile/budgets, and the same Trust inputs (or the same intentional absence of them).
- Dimension boundaries: Security findings include both hard checks (transport/auth exposure, Host/Origin guard behavior) and heuristic surface signals (injection patterns, command-risk patterns). Warnings are review prompts, not proofs of exploitation.
Usage metrics are an overlay
Usage metrics measure context pressure from tool catalogs, schemas, and response payloads. They are first-class in the report and CI policy, but they remain an overlay rather than a substitute for the five core verification dimensions.
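The overlay relationship can be sketched as two independent CI checks: the overall score comes only from the five core dimensions, while the usage overlay gates on its own budget without feeding into that score. The gate function, threshold, and budget here are all hypothetical.

```python
# Illustrative sketch of "overlay" semantics: usage metrics gate CI on
# their own budget but never contribute to the five-dimension score.
# The threshold and budget values are hypothetical, not Hardproof defaults.
def ci_gate(overall_score, usage_tokens, usage_budget):
    score_ok = overall_score is not None and overall_score >= 0.8  # core dimensions only
    usage_ok = usage_tokens <= usage_budget                        # overlay check, separate axis
    return score_ok and usage_ok

print(ci_gate(0.9, 12_000, 16_000))  # → True
print(ci_gate(0.9, 20_000, 16_000))  # → False (overlay fails independently)
```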
Evidence
Every scan includes findings (with codes, evidence, and suggested fixes) and references to generated artifacts. The report is designed to be reviewable without scraping logs.
Next
- Quality report: /hardproof/quality-report
- Report format: /hardproof/report-format
- Security guide: /hardproof/security-guide
- Quality report pipeline (draft): /hardproof/quality-report-pipeline
- Usage metrics: /hardproof/usage-metrics
- Why deterministic: /hardproof/deterministic