Check eval
A check is only useful if it stays accurate under real usage. This report measures the content checks the way the parity harness measures format faithfulness: against a labeled corpus of inputs, each paired with the findings a correct check should produce. Results are scored as precision and recall and gated, so a change that introduces a false positive or a missed finding fails the build. The corpus is seeded from the checkers' unit and adversarial cases and grows from real corrections (issue #759).
By check
| Check | Cases | Precision | Recall | F1 | FP | Missed |
|---|---|---|---|---|---|---|
| dnt | 4 | 100% | 100% | 1.00 | 0 | 0 |
| placeholder | 6 | 100% | 100% | 1.00 | 0 | 0 |
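
The columns in the table above can be derived from per-case expected and produced findings. The following is a minimal sketch of that scoring and of the build gate, not the harness's actual code: the `Case` type, the `score` and `gate` functions, and the convention that an empty denominator counts as 100% are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One labeled corpus entry (hypothetical shape)."""
    name: str
    check: str
    expected: frozenset  # findings a correct check should produce
    got: frozenset       # findings the check actually produced


def score(cases):
    """Aggregate precision, recall, and F1 over a list of cases."""
    tp = sum(len(c.expected & c.got) for c in cases)      # correct findings
    fp = sum(len(c.got - c.expected) for c in cases)      # false positives
    missed = sum(len(c.expected - c.got) for c in cases)  # missed findings
    # Assumption: with nothing to find or nothing reported, treat the ratio as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + missed) if tp + missed else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "fp": fp, "missed": missed}


def gate(cases):
    """Pass only when there are no false positives and no missed findings."""
    result = score(cases)
    return result["fp"] == 0 and result["missed"] == 0


clean = [
    Case("dnt-translated", "dnt",
         frozenset({"do-not-translate"}), frozenset({"do-not-translate"})),
    Case("dnt-preserved", "dnt", frozenset(), frozenset()),
]
regressed = clean + [
    Case("dnt-case-sensitive", "dnt",
         frozenset({"do-not-translate"}), frozenset()),  # finding was missed
]
```

With these inputs, `gate(clean)` passes and `gate(regressed)` fails, because the third case has an expected finding that was not produced.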
Cases
| Case | Check | Expected | Got | Score | Result |
|---|---|---|---|---|---|
| dnt-translated | dnt | do-not-translate | do-not-translate | 75 | ok |
| dnt-preserved | dnt | — | — | 100 | ok |
| dnt-word-boundary | dnt | — | — | 100 | ok |
| dnt-case-sensitive | dnt | do-not-translate | do-not-translate | 75 | ok |
| placeholder-dropped | placeholder | placeholder | placeholder | 75 | ok |
| placeholder-preserved | placeholder | — | — | 100 | ok |
| placeholder-printf-dropped | placeholder | placeholder | placeholder | 50 | ok |
| placeholder-extra | placeholder | placeholder | placeholder | 95 | ok |
| placeholder-numbered-tag | placeholder | — | — | 100 | ok |
| placeholder-none | placeholder | — | — | 100 | ok |
The Score column is the rolled-up compliance score (0–100) for the case. Calibrated cases pin this value, so a change to the severity weights (neutral 0 / minor 1 / major 5 / critical 25) or to a checker’s severity choice is caught as score drift, not just a finding change (issue #758).
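
The severity weights suggest a simple roll-up. The sketch below assumes the score is 100 minus the summed weights of a case's findings, floored at 0. That formula is an inference, not something this report states: it is consistent with the table values (75 for one critical finding, 50 for two, 95 for one major), but the real roll-up may differ. The function names and the per-case severity lists are hypothetical.

```python
# Weights as listed above: neutral 0 / minor 1 / major 5 / critical 25.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def compliance_score(severities):
    """Assumed roll-up: 100 minus summed severity weights, floored at 0."""
    penalty = sum(SEVERITY_WEIGHTS[s] for s in severities)
    return max(0, 100 - penalty)


def score_drift(severities, pinned):
    """Return the difference between the current score and the pinned one.

    A calibrated case would fail the build whenever this is non-zero.
    """
    return compliance_score(severities) - pinned
```

Under this assumption, re-weighting `critical` or downgrading a checker's finding from `critical` to `major` moves the score of a pinned case even though the set of findings is unchanged, which is the drift the Score column is meant to expose.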
Corrections as ground truth
The loop’s premise is that a correction made repeatedly should become a check that catches the mistake. This measures exactly that: a simulated correction stream is aggregated and promoted through the real promotion path, then the resulting brand check is run on every correction’s original (the off-brand phrasing the team kept fixing) and its corrected form. A promoted rule must flag the original and never flag the fix; a below-threshold correction must stay silent. Swap the simulated stream for an export of a real workspace’s corrections to track the loop on live data.
| Correction | Seen | Promoted | Catches original | Leaves fix alone |
|---|---|---|---|---|
| utilize → use | 5× | yes | ✓ | ✓ |
| leverage → use | 4× | yes | ✓ | ✓ |
| synergy → collaboration | 3× | yes | ✓ | ✓ |
| onboard → set up | 3× | yes | ✓ | ✓ |
| kindly → please | 1× | below threshold | — | ✓ |
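
The promote-then-verify loop behind this table can be sketched as follows. This is an illustration under stated assumptions, not the real promotion path: the threshold of 3 is hypothetical (the table only shows that 3× promotes and 1× does not), and `promote`, `flags`, and `evaluate` are invented names. Matching is done on whole words, case-insensitively, which is also an assumption.

```python
import re
from collections import Counter

PROMOTION_THRESHOLD = 3  # hypothetical; the report does not state the real value


def promote(stream, threshold=PROMOTION_THRESHOLD):
    """Aggregate (original, fix) pairs and keep those seen often enough."""
    counts = Counter(stream)
    return {orig: fix for (orig, fix), n in counts.items() if n >= threshold}


def flags(rules, text):
    """Return the promoted terms that appear in text as whole words."""
    return [term for term in rules
            if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)]


def evaluate(stream, threshold=PROMOTION_THRESHOLD):
    """For each distinct correction, report the three properties in the table."""
    rules = promote(stream, threshold)
    rows = {}
    for orig, fix in dict.fromkeys(stream):  # distinct pairs, first-seen order
        rows[orig] = {
            "promoted": orig in rules,
            "catches_original": bool(flags(rules, orig)),
            "leaves_fix_alone": not flags(rules, fix),
        }
    return rows


# Simulated stream mirroring the counts in the table above.
stream = (
    [("utilize", "use")] * 5
    + [("leverage", "use")] * 4
    + [("synergy", "collaboration")] * 3
    + [("onboard", "set up")] * 3
    + [("kindly", "please")] * 1
)
rows = evaluate(stream)
```

Here `rows["utilize"]` reports a promoted rule that flags the original and leaves the fix alone, while `rows["kindly"]` stays unpromoted and silent on both forms.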
How this grows
The seed corpus is small and gold-labeled, so the deterministic checks score a perfect F1 and any regression is caught immediately. The plan (issue #759) extends it with public datasets (XFORMAL for formality, MQM-annotated MT, a placeholder/DNT error set) and, most importantly, with a real correction stream. Every human correction is a labeled example: the check should have flagged the original and should not flag the fix. The eval set, and the calibration of thresholds like --voice-min, therefore improve precisely where real content exercises them. Calibration curves and the ML proxy checks are tracked next.
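
One way to read "every human correction is a labeled example" is that each correction expands into two corpus cases. The sketch below assumes a dictionary-shaped case and a `brand` finding label. Both are hypothetical, as are the function name and the case-naming scheme, and the real corpus format may differ.

```python
def cases_from_correction(original, fix, check="brand"):
    """Turn one human correction into two labeled eval cases.

    The original must produce a finding; the corrected form must produce none.
    """
    slug = original.lower().replace(" ", "-")
    return [
        {"case": f"{slug}-original", "check": check,
         "text": original, "expected": [check]},
        {"case": f"{slug}-fixed", "check": check,
         "text": fix, "expected": []},
    ]


new_cases = cases_from_correction("utilize", "use")
```

The first case guards recall (the mistake must be caught) and the second guards precision (the accepted fix must not be flagged), so each correction tightens both sides of the gate.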