Check eval

A check is only useful if it stays accurate under real usage. This eval measures the content checks the way the parity harness measures format faithfulness: a labeled corpus pairs each input with the findings a correct check should produce, the checks are scored on precision and recall, and the build is gated so that any change introducing a false positive or a missed finding fails. The corpus is seeded from the checkers' unit and adversarial cases and grows from real corrections (issue #759).
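The scoring and gate described above can be sketched as follows. This is a minimal illustration, not the project's harness: the `Case` shape and the `score` and `gate` names are hypothetical, and treating an empty set of findings as perfect precision and recall is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One labeled corpus entry (hypothetical shape, not the project's schema)."""
    name: str
    check: str
    expected: frozenset  # finding types a correct check should produce
    got: frozenset       # finding types the check actually produced


def score(cases):
    """Return (precision, recall, f1, false_positives, missed) over all cases."""
    tp = sum(len(c.expected & c.got) for c in cases)
    fp = sum(len(c.got - c.expected) for c in cases)
    missed = sum(len(c.expected - c.got) for c in cases)
    # Assumption: with nothing to find and nothing found, count as perfect.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + missed) if tp + missed else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1, fp, missed


def gate(cases):
    """Pass only when there are no false positives and no missed findings."""
    _, _, _, fp, missed = score(cases)
    return fp == 0 and missed == 0


# Two seed-style cases: one where a finding is expected, one clean input.
DNT = frozenset({"do-not-translate"})
seed = [
    Case("dnt-translated", "dnt", expected=DNT, got=DNT),
    Case("dnt-preserved", "dnt", expected=frozenset(), got=frozenset()),
]
# A regression: the check now flags the clean input as well.
regressed = [seed[0], Case("dnt-preserved", "dnt", expected=frozenset(), got=DNT)]
```

Here `gate(seed)` passes, while `gate(regressed)` fails because the clean case now produces a false positive, even though recall is unchanged.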

Cases	Precision	Recall	F1	False positives	Missed
10	100%	100%	1.00	0	0

By check

Check	Cases	Precision	Recall	F1	FP	Missed
`dnt`	4	100%	100%	1.00	0	0
`placeholder`	6	100%	100%	1.00	0	0

Cases

Case	Check	Expected	Got	Score	Result
`dnt-translated`	dnt	do-not-translate	do-not-translate	75	ok
`dnt-preserved`	dnt	—	—	100	ok
`dnt-word-boundary`	dnt	—	—	100	ok
`dnt-case-sensitive`	dnt	do-not-translate	do-not-translate	75	ok
`placeholder-dropped`	placeholder	placeholder	placeholder	75	ok
`placeholder-preserved`	placeholder	—	—	100	ok
`placeholder-printf-dropped`	placeholder	placeholder	placeholder	50	ok
`placeholder-extra`	placeholder	placeholder	placeholder	95	ok
`placeholder-numbered-tag`	placeholder	—	—	100	ok
`placeholder-none`	placeholder	—	—	100	ok

The Score column is the rolled-up compliance score (0–100) for the case. Calibrated cases pin this value, so a change to the severity weights (neutral 0 / minor 1 / major 5 / critical 25) or to a checker’s severity choice is caught as score drift, not just a finding change (issue #758).
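A rough sketch of how pinned scores catch drift is shown here. The roll-up formula (100 minus the summed severity weights, floored at 0) is an assumption that happens to be consistent with the scores in the table; the per-case severities in `PINNED` are inferred from those scores, and the function names are hypothetical.

```python
# Weights as listed in the text above.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def compliance_score(severities, weights=SEVERITY_WEIGHTS):
    """Assumed roll-up: start at 100, subtract each finding's weight, floor at 0."""
    return max(0, 100 - sum(weights[s] for s in severities))


# Hypothetical pinned cases: the severities are inferred from the scores in
# the table, not taken from the real corpus.
PINNED = {
    "placeholder-preserved": ([], 100),
    "placeholder-dropped": (["critical"], 75),
    "placeholder-extra": (["major"], 95),
    "placeholder-printf-dropped": (["critical", "critical"], 50),
}


def score_drift(weights=SEVERITY_WEIGHTS):
    """Return the names of pinned cases whose score no longer matches."""
    return [
        name
        for name, (severities, pinned) in PINNED.items()
        if compliance_score(severities, weights) != pinned
    ]
```

Under these assumptions, raising the `major` weight from 5 to 10 leaves every finding unchanged but moves `placeholder-extra` away from its pinned 95, so `score_drift` reports it even though a findings-only comparison would pass.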

Corrections as ground truth

The correction loop’s premise is that a correction made repeatedly should become a check that catches the mistake. This eval measures exactly that: a simulated correction stream is aggregated and promoted through the real promotion path, then the resulting brand check is run on each correction’s original (the off-brand phrasing the team kept fixing) and on its corrected form. A promoted rule must flag the original and never flag the fix; a below-threshold correction must stay silent. To track the loop on live data, swap the simulated stream for an export of a real workspace’s corrections.
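The aggregate, promote, and verify steps can be approximated with the sketch below. It does not use the real promotion path: `PROMOTE_MIN`, `promote`, `flags`, and `evaluate` are hypothetical names, the threshold of 3 is an assumption, and whole-word, case-insensitive matching stands in for whatever the brand check actually does.

```python
import re
from collections import Counter

PROMOTE_MIN = 3  # hypothetical threshold, not the project's real setting


def promote(stream, min_seen=PROMOTE_MIN):
    """Aggregate (original, fix) pairs and keep those seen often enough."""
    counts = Counter(stream)
    return {orig: fix for (orig, fix), n in counts.items() if n >= min_seen}


def flags(rules, text):
    """Return the promoted originals that occur in text as whole words."""
    return [
        orig
        for orig in rules
        if re.search(rf"\b{re.escape(orig)}\b", text, re.IGNORECASE)
    ]


def evaluate(stream, min_seen=PROMOTE_MIN):
    """Run the promoted rules on every correction's original and its fix."""
    rules = promote(stream, min_seen)
    return [
        {
            "correction": f"{orig} -> {fix}",
            "seen": seen,
            "promoted": orig in rules,
            "catches_original": bool(flags(rules, orig)),
            "leaves_fix_alone": not flags(rules, fix),
        }
        for (orig, fix), seen in Counter(stream).items()
    ]


# Simulated stream using the corrections and counts reported in this section.
stream = (
    [("utilize", "use")] * 5
    + [("leverage", "use")] * 4
    + [("synergy", "collaboration")] * 3
    + [("onboard", "set up")] * 3
    + [("kindly", "please")]
)
rows = evaluate(stream)
```

Each row should satisfy two invariants: a promoted correction catches its original, and no fix is ever flagged. The single `kindly` correction stays below the assumed threshold, so neither its original nor its fix is flagged.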

Promoted	Precision	Recall	Caught	Over-flagged
4/5	100%	100%	4	0

Correction	Seen	Promoted	Catches original	Leaves fix alone
`utilize` → `use`	5×	yes	✓	✓
`leverage` → `use`	4×	yes	✓	✓
`synergy` → `collaboration`	3×	yes	✓	✓
`onboard` → `set up`	3×	yes	✓	✓
`kindly` → `please`	1×	below threshold	—	✓

How this grows

The seed corpus is small and gold-labeled, so the deterministic checks score a perfect F1 and any regression is caught immediately. The plan (issue #759) extends it with public datasets (XFORMAL for formality, MQM-annotated MT, a placeholder/DNT error set) and, most importantly, with a real correction stream. Every human correction is a labeled example: the check should have flagged the original and should not flag the fix. As a result, the eval set and the calibration of thresholds like `--voice-min` improve precisely where real content exercises them. Calibration curves and the ML proxy checks are tracked next.