Test Results Summary

Corporate Filings · Law · Government Statistics

run_015 · prompt v7 · 15 runs total · final canonical: numeric reconciliation cases (tc_038/039) + faster resolve timeout

100% pass rate · 39/39 cases (green ≥80% · yellow 60–79% · red <60%)
The engine correctly abstained on 3 cases (cannot_verify): it returns "I can't verify this" rather than guessing.
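
The headline figure and its color band follow directly from the counts. A minimal sketch in Python, assuming the percentage is rounded to the nearest whole number before banding (the page does not say how rounding is done). `pass_rate_band` is a hypothetical helper name, not part of the engine:

```python
def pass_rate_band(passed: int, total: int) -> tuple[int, str]:
    """Return (percent, band) using the legend: green >=80, yellow 60-79, red <60."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: round to the nearest whole percent before picking a band.
    percent = round(100 * passed / total)
    if percent >= 80:
        band = "green"
    elif percent >= 60:
        band = "yellow"
    else:
        band = "red"
    return percent, band


print(pass_rate_band(39, 39))  # (100, 'green')
```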

Iteration history · the score was earned by iterating, not curated

Run      Prompt  Overall pass rate  Change
run_009  v5      100%               rename defect numeric_error -> numeric_mismatch
run_010  v6      100%               merge stripped_caveat+overstated_confidence -> overstatement chip
run_011  v6      81%                multi-domain: add legal corpus (tc_101-116) + domain field
run_012  v6      89%                resolution non-blocking when source text present (fixes legal Justia/CL bot-blocking)
run_013  v6      96%                domain-calibrated legal labels (5 -> unsupported), fix tc_108 control, blocked-source status
run_014  v7      93%                rename wrong_directionality -> contradiction (general); 3 domains; hardened retry
run_015  v7      97%                final canonical: numeric reconciliation cases (tc_038/039) + faster resolve timeout

By hallucination type

Type                         Passed  Total  Rate
contradiction                5       5      100%
quantitative_fabrication     8       8      100%
stripped_caveat              6       6      100%
distributed_evidence         4       4      100%
overstated_confidence        5       5      100%
partial_citation_corruption  3       3      100%
scope_expansion              3       3      100%
total_citation_fabrication   3       3      100%
control                      2       2      100%
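
The per-type figures are a roll-up of the per-case rows. One way to sketch that aggregation in Python, assuming each case contributes one row with a type and a PASS/FAIL result; `summarize_by_type` is a hypothetical name, and the five rows are a small sample copied from the per-case table, not the full set:

```python
from collections import defaultdict

# A few (case, type, result) rows copied from the per-case table; not the full set.
rows = [
    ("tc_001", "contradiction", "PASS"),
    ("tc_002", "quantitative_fabrication", "PASS"),
    ("tc_014", "total_citation_fabrication", "PASS"),
    ("tc_017", "contradiction", "PASS"),
    ("tc_037", "control", "PASS"),
]


def summarize_by_type(rows):
    """Return {type: (passed, total, percent)} from per-case rows."""
    counts = defaultdict(lambda: [0, 0])  # type -> [passed, total]
    for _case, case_type, result in rows:
        counts[case_type][1] += 1
        if result == "PASS":
            counts[case_type][0] += 1
    return {
        t: (passed, total, round(100 * passed / total))
        for t, (passed, total) in counts.items()
    }


summary = summarize_by_type(rows)
print(summary["contradiction"])  # (2, 2, 100) for this five-row sample
```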

Per-case results

Case    Type                         Expected             Actual               Defect                Resolution  Result
tc_001  contradiction                unsupported          unsupported          contradiction         resolved    PASS
tc_002  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_003  stripped_caveat              partially_supported  partially_supported  overstatement         resolved    PASS
tc_004  distributed_evidence         supported            supported            -                     resolved    PASS
tc_005  distributed_evidence         partially_supported  partially_supported  unsupported_addition  resolved    PASS
tc_006  distributed_evidence         supported            supported            -                     resolved    PASS
tc_007  distributed_evidence         partially_supported  partially_supported  unsupported_addition  resolved    PASS
tc_008  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_009  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_010  overstated_confidence        partially_supported  partially_supported  overstatement         resolved    PASS
tc_011  partial_citation_corruption  unsupported          unsupported          wrong_attribution     resolved    PASS
tc_012  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_013  scope_expansion              unsupported          unsupported          contradiction         resolved    PASS
tc_014  total_citation_fabrication   cannot_verify        cannot_verify        fabricated_citation   404         PASS
tc_015  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_016  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_017  contradiction                unsupported          unsupported          contradiction         resolved    PASS
tc_018  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
tc_019  stripped_caveat              partially_supported  partially_supported  overstatement         resolved    PASS
tc_020  scope_expansion              partially_supported  partially_supported  overstatement         resolved    PASS
tc_021  stripped_caveat              partially_supported  partially_supported  overstatement         resolved    PASS
tc_022  stripped_caveat              partially_supported  partially_supported  overstatement         resolved    PASS
tc_023  contradiction                unsupported          unsupported          contradiction         resolved    PASS
tc_024  total_citation_fabrication   cannot_verify        cannot_verify        fabricated_citation   404         PASS
tc_025  stripped_caveat              partially_supported  partially_supported  overstatement         resolved    PASS
tc_026  overstated_confidence        unsupported          unsupported          contradiction         resolved    PASS
tc_027  overstated_confidence        partially_supported  partially_supported  overstatement         resolved    PASS
tc_028  partial_citation_corruption  unsupported          unsupported          wrong_attribution     resolved    PASS
tc_029  contradiction                unsupported          unsupported          contradiction         resolved    PASS
tc_030  stripped_caveat              partially_supported  partially_supported  overstatement         resolved    PASS
tc_031  overstated_confidence        partially_supported  partially_supported  overstatement         resolved    PASS
tc_032  partial_citation_corruption  unsupported          unsupported          wrong_attribution     resolved    PASS
tc_033  total_citation_fabrication   cannot_verify        cannot_verify        fabricated_citation   404         PASS
tc_034  contradiction                unsupported          unsupported          contradiction         resolved    PASS
tc_035  overstated_confidence        partially_supported  partially_supported  overstatement         resolved    PASS
tc_036  scope_expansion              unsupported          unsupported          scope_expansion       resolved    PASS
tc_037  control                      supported            supported            -                     resolved    PASS
tc_038  control                      supported            supported            -                     resolved    PASS
tc_039  quantitative_fabrication     unsupported          unsupported          numeric_mismatch      resolved    PASS
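
In every row of this table the Actual label equals the Expected one. A sketch of that comparison in Python, together with the count of correct abstentions, assuming a case passes exactly when the two labels match (the page does not state whether the Defect column is also checked). `CaseResult` and its field names are hypothetical; the four rows are copied from the table:

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case: str
    expected: str
    actual: str
    resolution: str  # "resolved", or a status such as "404"

    @property
    def passed(self) -> bool:
        # Assumption: PASS means the predicted label equals the expected label.
        return self.actual == self.expected


results = [
    CaseResult("tc_013", "unsupported", "unsupported", "resolved"),
    CaseResult("tc_014", "cannot_verify", "cannot_verify", "404"),
    CaseResult("tc_024", "cannot_verify", "cannot_verify", "404"),
    CaseResult("tc_033", "cannot_verify", "cannot_verify", "404"),
]

# Cases where abstaining was the expected outcome and the engine abstained.
correct_abstentions = [
    r.case for r in results if r.passed and r.actual == "cannot_verify"
]
print(correct_abstentions)  # ['tc_014', 'tc_024', 'tc_033']
```

Keeping the abstention check separate from the pass check makes it easy to report both numbers: a case can pass by abstaining, but only when abstaining was the expected label.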