
Check eval

A check is only useful if it stays accurate under real usage. This eval measures the content checks the way the parity harness measures format faithfulness: it runs a labeled corpus of inputs, each paired with the findings a correct check should produce, scores the results as precision and recall, and gates the build so that any change introducing a false positive or a missed finding fails. The corpus is seeded from the checkers' unit and adversarial cases and grows from real corrections (issue #759).

Cases  Precision  Recall  F1    False positives  Missed
10     100%       100%    1.00  0                0

By check

Check        Cases  Precision  Recall  F1    FP  Missed
dnt          4      100%       100%    1.00  0   0
placeholder  6      100%       100%    1.00  0   0
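The scoring and gating described above can be sketched roughly as follows. This is a minimal illustration, not the project's harness: the `Case` shape and the `score` and `gate` names are hypothetical, and a finding is treated simply as a label string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One labeled corpus entry (hypothetical shape, not the project's schema)."""
    name: str
    check: str
    expected: frozenset  # finding labels a correct check should produce
    got: frozenset       # finding labels the check actually produced


def score(cases):
    """Return precision, recall, F1, and raw error counts over all cases."""
    tp = sum(len(c.expected & c.got) for c in cases)
    fp = sum(len(c.got - c.expected) for c in cases)      # false positives
    missed = sum(len(c.expected - c.got) for c in cases)  # missed findings
    # With nothing flagged (or nothing expected) there are no errors of that
    # kind, so treat the ratio as perfect instead of dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + missed) if tp + missed else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "false_positives": fp, "missed": missed}


def gate(cases):
    """Pass only when there are no false positives and no missed findings."""
    result = score(cases)
    return result["false_positives"] == 0 and result["missed"] == 0


corpus = [
    Case("dnt-translated", "dnt",
         frozenset({"do-not-translate"}), frozenset({"do-not-translate"})),
    Case("dnt-preserved", "dnt", frozenset(), frozenset()),
    Case("placeholder-dropped", "placeholder",
         frozenset({"placeholder"}), frozenset({"placeholder"})),
]
```

A build step could call `gate(corpus)` and exit non-zero when it returns `False`. Gating on the raw error counts, not on an F1 threshold, matches the stated rule that a single false positive or missed finding fails the build.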

Cases

Case                        Check        Expected          Got               Score  Result
dnt-translated              dnt          do-not-translate  do-not-translate  75     ok
dnt-preserved               dnt          -                 -                 100    ok
dnt-word-boundary           dnt          -                 -                 100    ok
dnt-case-sensitive          dnt          do-not-translate  do-not-translate  75     ok
placeholder-dropped         placeholder  placeholder       placeholder       75     ok
placeholder-preserved       placeholder  -                 -                 100    ok
placeholder-printf-dropped  placeholder  placeholder       placeholder       50     ok
placeholder-extra           placeholder  placeholder       placeholder       95     ok
placeholder-numbered-tag    placeholder  -                 -                 100    ok
placeholder-none            placeholder  -                 -                 100    ok

The Score column is the rolled-up compliance score (0–100) for the case. Calibrated cases pin this value, so a change to the severity weights (neutral 0 / minor 1 / major 5 / critical 25) or to a checker’s severity choice is caught as score drift even when the set of findings is unchanged (issue #758).
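As a rough sketch of how pinned scores catch drift: the roll-up below assumes the score is 100 minus the summed severity weights, floored at 0. The page does not state that formula. It is an assumption that happens to reproduce the 75, 50, and 95 values in the Cases table, and the severity lists per case are likewise guesses. The function and variable names are hypothetical.

```python
# Weights quoted in the text above.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}


def compliance_score(severities):
    """Assumed roll-up: start at 100, subtract each finding's weight, floor at 0."""
    penalty = sum(SEVERITY_WEIGHTS[s] for s in severities)
    return max(0, 100 - penalty)


# Each calibrated case pins an expected score. The severity lists are
# assumptions chosen to match the scores shown in the Cases table.
PINNED = {
    "dnt-preserved": ([], 100),
    "dnt-translated": (["critical"], 75),
    "placeholder-printf-dropped": (["critical", "critical"], 50),
    "placeholder-extra": (["major"], 95),
}


def score_drift(pinned):
    """Return the names of cases whose computed score differs from the pin."""
    return [name for name, (severities, want) in pinned.items()
            if compliance_score(severities) != want]
```

Under these assumptions, changing a weight (for example raising `major` from 5 to 10) leaves every finding identical but makes `score_drift` report `placeholder-extra`, which is the kind of regression the pinned value is meant to surface.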

Corrections as ground truth

The loop’s premise is that a correction made repeatedly should become a check that catches the mistake. This section measures exactly that: a simulated correction stream is aggregated and promoted through the real promotion path, and the resulting brand check is then run on each correction’s original (the off-brand phrasing the team kept fixing) and on its corrected form. A promoted rule must flag the original and never flag the fix; a below-threshold correction must stay silent. To track the loop on live data, swap the simulated stream for an export of a real workspace’s corrections.

Promoted  Precision  Recall  Caught  Over-flagged
4/5       100%       100%    4       0
Correction               Seen  Promoted         Catches original  Leaves fix alone
utilize → use            5×    yes              yes               yes
leverage → use           4×    yes              yes               yes
synergy → collaboration  3×    yes              yes               yes
onboard → set up         3×    yes              yes               yes
kindly → please          1×    below threshold  -                 -
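The aggregate, promote, and verify loop can be sketched as below, using the corrections listed above as the simulated stream. This is an illustration only and does not call the real promotion path. The threshold of 3 is an assumption inferred from the lowest promoted count, and `promote`, `flags`, and the whole-word matching rule are hypothetical.

```python
import re
from collections import Counter

PROMOTION_THRESHOLD = 3  # assumption: the lowest promoted count above is 3×


def promote(stream, threshold=PROMOTION_THRESHOLD):
    """Aggregate (original, fix) pairs and keep those seen often enough."""
    counts = Counter(stream)
    return {orig: fix for (orig, fix), n in counts.items() if n >= threshold}


def flags(rules, text):
    """Return the promoted terms that appear in text as whole words."""
    return [term for term in rules
            if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)]


stream = ([("utilize", "use")] * 5 + [("leverage", "use")] * 4
          + [("synergy", "collaboration")] * 3 + [("onboard", "set up")] * 3
          + [("kindly", "please")])

rules = promote(stream)
pairs = set(stream)  # one entry per distinct correction

# A promoted rule must flag the original and never flag the fix.
caught = [orig for orig, _ in pairs if flags(rules, orig)]
over_flagged = [fix for _, fix in pairs if flags(rules, fix)]
```

With this stream, `rules` holds the four corrections seen at least three times, `caught` has four entries, and `over_flagged` is empty, mirroring the Caught and Over-flagged figures. Swapping `stream` for pairs read from a real export would exercise the same checks on live data.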

How this grows

The seed corpus is small and gold-labeled, so the deterministic checks score a perfect F1 and any regression is caught immediately. The plan (issue #759) extends it with public datasets (XFORMAL for formality, MQM-annotated MT, a placeholder/DNT error set) and, most importantly, with a real correction stream. Every human correction is a labeled example: the check should have flagged the original and should not flag the fix. As a result, the eval set and the calibration of thresholds like --voice-min improve precisely where real content exercises them. Calibration curves and the ML proxy checks are tracked next.
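The idea that each correction yields a positive and a negative example could look something like this. The case fields (`text`, `check`, `expected`) and the function name are hypothetical and do not reflect the corpus's actual schema.

```python
def cases_from_corrections(corrections, check="brand"):
    """Expand each distinct (original, fix) pair into two labeled cases."""
    cases = []
    for original, fix in dict.fromkeys(corrections):  # de-duplicate, keep order
        # The original is the text the check should have flagged.
        cases.append({"text": original, "check": check, "expected": [check]})
        # The human's fix is a negative example: no finding is expected.
        cases.append({"text": fix, "check": check, "expected": []})
    return cases


export = [("utilize", "use"), ("utilize", "use"), ("kindly", "please")]
labeled = cases_from_corrections(export)
```

Here `labeled` contains four cases, two per distinct correction, which could be appended to the seed corpus and scored with the same precision and recall gate.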