Model evaluation

How models perform as panelists.

Every model is run against the same hidden conformance suite — real bugs, vulnerabilities, and drift with known ground truth — then scored on what it caught, what it missed, and what it invented.

Updated: 2026-06-20 | Suite: v1.4.2 | Cases: 500 | Metric: F1 vs. ground truth

#	Model	Accuracy	False-positive rate	Cost / 1K	Latency	Composite
01	claude-sonnet-4.6	94.2%	2.1%	$0.003	1.2s	94
02	gpt-4o	92.8%	3.4%	$0.005	0.9s	92
03	gemini-2.5-pro	90.6%	4.0%	$0.004	1.5s	90
04	llama-3.3-70b (local)	88.1%	8.2%	$0.000	4.1s	88
05	claude-haiku-4.5	85.4%	5.9%	$0.0003	0.4s	85
06	qwen-2.5-coder-32b (local)	82.0%	9.7%	$0.000	3.3s	82

Illustrative data for the concept mockup — not a published benchmark.

Methodology

How the suite scores a model

Seat the model

Each model reviews all 500 cases as a single panelist, emitting findings in the standard format.

Cluster & dedupe

The deterministic reconciler clusters findings by location and dedupes by similarity — the same code that runs in production.
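The cluster-and-dedupe step can be sketched roughly as follows. This is a minimal illustration, not the production reconciler: the `Finding` fields, the 3-line location window, the 0.8 similarity threshold, and the use of `difflib.SequenceMatcher` are all assumptions made for the example. Sorting before clustering is what keeps the result independent of the order in which findings arrive.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; the real finding format is not specified here.
    path: str
    line: int
    message: str


def cluster_by_location(findings, window=3):
    """Group findings in the same file whose line is within `window` of the previous one."""
    ordered = sorted(findings, key=lambda f: (f.path, f.line, f.message))
    clusters = []
    for f in ordered:
        last = clusters[-1][-1] if clusters else None
        if last and last.path == f.path and f.line - last.line <= window:
            clusters[-1].append(f)
        else:
            clusters.append([f])
    return clusters


def dedupe(cluster, threshold=0.8):
    """Keep a finding only if its message is not too similar to one already kept."""
    kept = []
    for f in cluster:
        is_new = all(
            SequenceMatcher(None, f.message.lower(), k.message.lower()).ratio() < threshold
            for k in kept
        )
        if is_new:
            kept.append(f)
    return kept


def reconcile(findings):
    """Cluster by location, then dedupe each cluster by message similarity."""
    return [dedupe(c) for c in cluster_by_location(findings)]


findings = [
    Finding("auth.py", 10, "SQL injection in login query"),
    Finding("auth.py", 11, "SQL injection in the login query"),
    Finding("auth.py", 40, "Missing null check"),
    Finding("db.py", 10, "Unclosed connection"),
]
for cluster in reconcile(findings):
    print(cluster)
```

In this sketch the two near-identical `auth.py` findings collapse into one, while the distant finding in `auth.py` and the one in `db.py` stay in their own clusters.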

Score vs. truth

Clustered findings are matched against known ground truth to compute accuracy, false-positive rate, and an F1 composite.
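One way to picture the scoring step is the sketch below. The definitions are assumptions for illustration, since the suite's exact formulas are not given here: a finding counts as caught when it matches a ground-truth case, as invented when it matches none, and the false-positive rate is taken as the share of findings that are invented. The function name `score` and the case ids are hypothetical.

```python
def score(matches, truth_ids):
    """Score one model's clustered findings.

    matches:   one entry per finding, either the ground-truth id it matched or None.
    truth_ids: set of all ground-truth ids in the suite.
    """
    caught = {m for m in matches if m is not None} & truth_ids
    invented = sum(1 for m in matches if m is None)
    missed = len(truth_ids - caught)

    true_pos = len(caught)
    precision = true_pos / (true_pos + invented) if (true_pos + invented) else 0.0
    recall = true_pos / len(truth_ids) if truth_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Assumed definition: fraction of emitted findings that match no known case.
    fp_rate = invented / len(matches) if matches else 0.0

    return {
        "caught": true_pos,
        "missed": missed,
        "invented": invented,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp_rate,
    }


truth = {"bug-1", "bug-2", "bug-3", "bug-4"}
report = score(["bug-1", "bug-2", "bug-3", None], truth)
print(report)
```

With three of four cases caught and one invented finding, precision and recall are both 0.75 under these definitions, so the F1 value is 0.75 as well.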

The composite is one input, not a verdict. A cheap, fast model with a higher false-positive rate is still a strong panelist when paired with a slower, pickier one — because the reconciler scores them together. Use the table to staff a panel, not to crown a winner.
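The staffing argument can be illustrated with a small sketch. It assumes, purely for the example, that a panel catches a case when at least one panelist catches it; how the real reconciler combines panelists is not specified here. The model names `fast` and `picky` and the case ids are hypothetical.

```python
def panel_recall(caught_by_model, truth_ids):
    """Share of ground-truth cases caught by at least one panelist (assumed union rule)."""
    if not truth_ids:
        return 0.0
    caught = set().union(*caught_by_model.values()) & truth_ids
    return len(caught) / len(truth_ids)


truth = set(range(1, 11))  # ten hypothetical ground-truth cases
roster = {
    "fast": {1, 2, 3, 4, 5, 6, 7},   # cheap model: broad coverage
    "picky": {5, 6, 7, 8, 9},        # slower model: catches different cases
}

for name, caught in roster.items():
    print(name, panel_recall({name: caught}, truth))
print("panel", panel_recall(roster, truth))
```

Under this assumption the pair covers more cases than either model alone, which is why a panel is better chosen for complementary coverage than by composite rank.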

$ atcr doctor

Verify your own roster against every configured endpoint before you spend a review on it.