Model evaluation

How models perform as panelists.

Every model is run against the same hidden conformance suite — real bugs, vulnerabilities, and drift with known ground truth — then scored on what it caught, what it missed, and what it invented.

Updated: 2026-06-20 | Suite: v1.4.2 | Cases: 500 | Metric: F1 vs. ground truth
#   Model                       Accuracy  False+  Cost / 1K  Latency  Composite
01  claude-sonnet-4.6           94.2%     2.1%    $0.003     1.2s     94
02  gpt-4o                      92.8%     3.4%    $0.005     0.9s     92
03  gemini-2.5-pro              90.6%     4.0%    $0.004     1.5s     90
04  llama-3.3-70b (local)       88.1%     8.2%    $0.000     4.1s     88
05  claude-haiku-4.5            85.4%     5.9%    $0.0003    0.4s     85
06  qwen-2.5-coder-32b (local)  82.0%     9.7%    $0.000     3.3s     82

Illustrative data for the concept mockup — not a published benchmark.

Methodology

How the suite scores a model

01 Seat the model

Each model reviews all 500 cases as a single panelist, emitting findings in the standard format.

02 Cluster & dedupe

The deterministic reconciler clusters findings by location and dedupes by similarity — the same code that runs in production.
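As a rough illustration of what "cluster by location, dedupe by similarity" could look like, here is a minimal Python sketch. It is not the production reconciler: the `Finding` fields, the 3-line `window`, the 0.8 `threshold`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Finding:
    # Hypothetical shape; the real "standard format" is not specified here.
    path: str
    line: int
    message: str


def cluster_by_location(findings, window=3):
    """Group findings in the same file whose line is within `window`
    lines of the previous finding in the cluster (assumed rule)."""
    by_path = defaultdict(list)
    for f in findings:
        by_path[f.path].append(f)

    clusters = []
    for path in sorted(by_path):  # sorted for deterministic output
        current = []
        for f in sorted(by_path[path], key=lambda f: (f.line, f.message)):
            if current and f.line - current[-1].line > window:
                clusters.append(current)
                current = []
            current.append(f)
        if current:
            clusters.append(current)
    return clusters


def dedupe(cluster, threshold=0.8):
    """Keep the first finding of each group of near-identical messages."""
    kept = []
    for f in cluster:
        is_duplicate = any(
            SequenceMatcher(None, f.message.lower(), k.message.lower()).ratio()
            >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(f)
    return kept


def reconcile(findings):
    """Cluster by location, then dedupe inside each cluster."""
    return [dedupe(c) for c in cluster_by_location(findings)]
```

Sorting by path, line, and message before clustering keeps the output independent of the order in which findings arrive, which is one simple way to get the determinism the step describes.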

03 Score vs. truth

Clustered findings are matched against known ground truth to compute accuracy, false-positive rate, and an F1 composite.
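The scoring step can be sketched with the textbook precision, recall, and F1 definitions. This is an assumption about the arithmetic, not the suite's published formula: findings and ground truth are reduced to sets of hypothetical case IDs, and the "invented" share is taken to be the fraction of reported findings that match no ground-truth item.

```python
def score(reported: set, truth: set) -> dict:
    """Score one model's reported findings against known ground truth."""
    caught = reported & truth      # true positives: real issues reported
    invented = reported - truth    # false positives: not in ground truth
    missed = truth - reported      # false negatives: real issues not reported

    precision = len(caught) / len(reported) if reported else 0.0
    recall = len(caught) / len(truth) if truth else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "caught": len(caught),
        "missed": len(missed),
        "invented": len(invented),
        "precision": precision,
        "recall": recall,
        "invented_share": len(invented) / len(reported) if reported else 0.0,
        "f1": f1,
    }


if __name__ == "__main__":
    truth = {"case-1", "case-2", "case-3", "case-4"}
    reported = {"case-1", "case-2", "case-3", "case-9"}
    print(score(reported, truth))
```

F1 is the harmonic mean of precision and recall, so a model cannot raise it by reporting everything (precision drops) or by reporting almost nothing (recall drops).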

The composite is one input, not a verdict. A cheap, fast model with a higher false-positive rate is still a strong panelist when paired with a slower, pickier one — because the reconciler scores them together. Use the table to staff a panel, not to crown a winner.
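To make the pairing argument concrete, the toy Python example below pools two hypothetical panelists and compares each one's coverage with the panel's. Pooling is modeled as a plain set union, which is a simplifying assumption; the actual reconciler may weigh agreement between panelists differently. The finding IDs are made up for illustration.

```python
def precision_recall(found: set, truth: set) -> tuple:
    """Return (precision, recall) for a set of findings against ground truth."""
    hits = len(found & truth)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Made-up ground truth and panelist outputs.
truth = {"bug-1", "bug-2", "bug-3", "bug-4"}
fast = {"bug-1", "bug-2", "bug-3", "noise-1", "noise-2"}  # broad, some inventions
picky = {"bug-1", "bug-4"}                                # narrow, all real

# Assumed pooling rule: the panel reports anything either panelist reported.
panel = fast | picky

for name, found in [("fast", fast), ("picky", picky), ("panel", panel)]:
    p, r = precision_recall(found, truth)
    print(f"{name:5s} precision={p:.2f} recall={r:.2f}")
```

In this toy case neither panelist alone catches every issue, while the pooled panel does, which is why the table is more useful for picking complementary models than for ranking them.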

$ atcr doctor

Verify your own roster against every configured endpoint before you spend a review on it.