Test Results Summary

Corporate Filings · Law · Government Statistics

run_015 · prompt v7 · 15 run(s) total · final canonical: numeric reconciliation cases (tc_038/039) + faster resolve timeout

89% pass rate · 17/19 cases (green ≥80% · yellow 60–79% · red <60%)
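The headline figure and its colour band can be reproduced with a few lines. This is a minimal sketch, not the dashboard's actual code: the function names `pass_rate` and `band` are hypothetical, and rounding to the nearest whole percent is an assumption (the page only shows the rounded value).

```python
def pass_rate(passed: int, total: int) -> int:
    """Pass rate as a whole-number percentage (rounding mode is an assumption)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


def band(rate: int) -> str:
    """Map a percentage to the legend on the page: green >=80, yellow 60-79, red <60."""
    if rate >= 80:
        return "green"
    if rate >= 60:
        return "yellow"
    return "red"


rate = pass_rate(17, 19)
print(rate, band(rate))
```

With the counts shown here (17 of 19), the rate falls in the green band; a type with 0 of 2 passed would fall in the red band.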

Iteration history · the score was earned by iterating, not curated

| Run | Prompt | Overall pass rate | Change |
|---|---|---|---|
| run_009 | v5 | 100% | rename defect numeric_error -> numeric_mismatch |
| run_010 | v6 | 100% | merge stripped_caveat+overstated_confidence -> overstatement chip |
| run_011 | v6 | 81% | multi-domain: add legal corpus (tc_101-116) + domain field |
| run_012 | v6 | 89% | resolution non-blocking when source text present (fixes legal Justia/CL bot-blocking) |
| run_013 | v6 | 96% | domain-calibrated legal labels (5 -> unsupported), fix tc_108 control, blocked-source status |
| run_014 | v7 | 93% | rename wrong_directionality -> contradiction (general); 3 domains; hardened retry |
| run_015 | v7 | 97% | final canonical: numeric reconciliation cases (tc_038/039) + faster resolve timeout |

By hallucination type

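| Type | Passed | Total | Rate |
|---|---|---|---|
| numeric_mismatch | 5 | 5 | 100% |
| contradiction | 3 | 3 | 100% |
| wrong_attribution | 2 | 2 | 100% |
| control | 3 | 3 | 100% |
| scope_expansion | 4 | 4 | 100% |
| overstatement | 0 | 2 | 0% |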

Per-case results · 2 failed

| Case | Type | Expected | Actual | Defect | Resolution | Result |
|---|---|---|---|---|---|---|
| tc_201 | numeric_mismatch | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_202 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_203 | numeric_mismatch | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_204 | wrong_attribution | unsupported | unsupported | wrong_attribution | resolved | PASS |
| tc_205 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_206 | control | supported | supported | | resolved | PASS |
| tc_207 | numeric_mismatch | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_208 | scope_expansion | unsupported | unsupported | contradiction | resolved | PASS |
| tc_209 | scope_expansion | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_210 | control | supported | supported | | resolved | PASS |
| tc_211 | wrong_attribution | unsupported | unsupported | wrong_attribution | resolved | PASS |
| tc_212 | numeric_mismatch | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_213 | scope_expansion | unsupported | unsupported | scope_expansion | resolved | PASS |
| tc_214 | control | supported | supported | | resolved | PASS |
| tc_215 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_216 | numeric_mismatch | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_217 | overstatement | partially_supported | unsupported | contradiction | resolved | FAIL |
| tc_218 | scope_expansion | partially_supported | partially_supported | unsupported_addition | resolved | PASS |
| tc_219 | overstatement | partially_supported | supported | | resolved | FAIL |
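In the rows above, every PASS has a matching expected and actual label and both FAILs do not, while the defect label sometimes differs from the case type on passing rows (tc_208, tc_209). The sketch below assumes, on that evidence only, that a case passes when expected equals actual and that the defect label is not part of the check. The names `ROWS`, `by_type`, and `defect_differs` are hypothetical; the data is copied from the per-case table.

```python
from collections import defaultdict

# (case, type, expected, actual, defect) copied from the per-case table.
# An empty defect string means no defect was reported for that row.
U, S, P = "unsupported", "supported", "partially_supported"
ROWS = [
    ("tc_201", "numeric_mismatch", U, U, "numeric_mismatch"),
    ("tc_202", "contradiction", U, U, "contradiction"),
    ("tc_203", "numeric_mismatch", U, U, "numeric_mismatch"),
    ("tc_204", "wrong_attribution", U, U, "wrong_attribution"),
    ("tc_205", "contradiction", U, U, "contradiction"),
    ("tc_206", "control", S, S, ""),
    ("tc_207", "numeric_mismatch", U, U, "numeric_mismatch"),
    ("tc_208", "scope_expansion", U, U, "contradiction"),
    ("tc_209", "scope_expansion", U, U, "numeric_mismatch"),
    ("tc_210", "control", S, S, ""),
    ("tc_211", "wrong_attribution", U, U, "wrong_attribution"),
    ("tc_212", "numeric_mismatch", U, U, "numeric_mismatch"),
    ("tc_213", "scope_expansion", U, U, "scope_expansion"),
    ("tc_214", "control", S, S, ""),
    ("tc_215", "contradiction", U, U, "contradiction"),
    ("tc_216", "numeric_mismatch", U, U, "numeric_mismatch"),
    ("tc_217", "overstatement", P, U, "contradiction"),
    ("tc_218", "scope_expansion", P, P, "unsupported_addition"),
    ("tc_219", "overstatement", P, S, ""),
]


def by_type(rows):
    """Return {type: (passed, total)}, assuming pass means expected == actual."""
    counts = defaultdict(lambda: [0, 0])
    for _case, case_type, expected, actual, _defect in rows:
        counts[case_type][1] += 1
        if expected == actual:
            counts[case_type][0] += 1
    return {t: (passed, total) for t, (passed, total) in counts.items()}


def defect_differs(rows):
    """Cases with a reported defect that is not the same label as the case type."""
    return [case for case, t, _e, _a, defect in rows if defect and defect != t]


for case_type, (passed, total) in by_type(ROWS).items():
    print(f"{case_type}: {passed}/{total}")
print("defect differs from type:", defect_differs(ROWS))
```

Under this assumption the per-type counts agree with the "By hallucination type" table, and the second list highlights rows where the verdict was right but the defect category was not the one the case was built to exercise.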