Test Results Summary
run_015 · prompt v7 · 15 runs total · final canonical: numeric reconciliation cases (tc_038/039) + faster resolve timeout
100% pass rate · 39/39 cases (green ≥80% · yellow 60–79% · red <60%)
The engine correctly abstained on 3 cases (cannot_verify): it returns "I can't verify this" rather than guessing.
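The headline figure and its color band follow from simple arithmetic on the case counts. Here is a minimal sketch, assuming the percentage is rounded to a whole number (the report does not say how it rounds). The function names `pass_rate` and `status_color` are hypothetical; only the thresholds come from the legend above.

```python
def pass_rate(passed: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage (rounding is an assumption)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


def status_color(rate: int) -> str:
    """Map a percentage to the legend's bands: green >= 80, yellow 60-79, red < 60."""
    if rate >= 80:
        return "green"
    if rate >= 60:
        return "yellow"
    return "red"


rate = pass_rate(39, 39)
print(rate, status_color(rate))  # prints "100 green"
```

Keeping the band lookup separate from the rate calculation means the same thresholds can be applied to any row of the iteration history.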
Iteration history · the score was earned by iterating, not curated
| Run | Prompt | Overall pass rate | Change |
|---|---|---|---|
| run_009 | v5 | 100% | rename defect numeric_error -> numeric_mismatch |
| run_010 | v6 | 100% | merge stripped_caveat+overstated_confidence -> overstatement chip |
| run_011 | v6 | 81% | multi-domain: add legal corpus (tc_101-116) + domain field |
| run_012 | v6 | 89% | resolution non-blocking when source text present (fixes legal Justia/CL bot-blocking) |
| run_013 | v6 | 96% | domain-calibrated legal labels (5 -> unsupported), fix tc_108 control, blocked-source status |
| run_014 | v7 | 93% | rename wrong_directionality -> contradiction (general); 3 domains; hardened retry |
| run_015 | v7 | 97% | final canonical: numeric reconciliation cases (tc_038/039) + faster resolve timeout |
By hallucination type
| Type | Passed | Total | Rate |
|---|---|---|---|
| contradiction | 5 | 5 | 100% |
| quantitative_fabrication | 8 | 8 | 100% |
| stripped_caveat | 6 | 6 | 100% |
| distributed_evidence | 4 | 4 | 100% |
| overstated_confidence | 5 | 5 | 100% |
| partial_citation_corruption | 3 | 3 | 100% |
| scope_expansion | 3 | 3 | 100% |
| total_citation_fabrication | 3 | 3 | 100% |
| control | 2 | 2 | 100% |
Per-case results
| Case | Type | Expected | Actual | Defect | Resolution | Result |
|---|---|---|---|---|---|---|
| tc_001 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_002 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_003 | stripped_caveat | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_004 | distributed_evidence | supported | supported | — | resolved | PASS |
| tc_005 | distributed_evidence | partially_supported | partially_supported | unsupported_addition | resolved | PASS |
| tc_006 | distributed_evidence | supported | supported | — | resolved | PASS |
| tc_007 | distributed_evidence | partially_supported | partially_supported | unsupported_addition | resolved | PASS |
| tc_008 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_009 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_010 | overstated_confidence | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_011 | partial_citation_corruption | unsupported | unsupported | wrong_attribution | resolved | PASS |
| tc_012 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_013 | scope_expansion | unsupported | unsupported | contradiction | resolved | PASS |
| tc_014 | total_citation_fabrication | cannot_verify | cannot_verify | fabricated_citation | 404 | PASS |
| tc_015 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_016 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_017 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_018 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
| tc_019 | stripped_caveat | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_020 | scope_expansion | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_021 | stripped_caveat | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_022 | stripped_caveat | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_023 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_024 | total_citation_fabrication | cannot_verify | cannot_verify | fabricated_citation | 404 | PASS |
| tc_025 | stripped_caveat | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_026 | overstated_confidence | unsupported | unsupported | contradiction | resolved | PASS |
| tc_027 | overstated_confidence | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_028 | partial_citation_corruption | unsupported | unsupported | wrong_attribution | resolved | PASS |
| tc_029 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_030 | stripped_caveat | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_031 | overstated_confidence | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_032 | partial_citation_corruption | unsupported | unsupported | wrong_attribution | resolved | PASS |
| tc_033 | total_citation_fabrication | cannot_verify | cannot_verify | fabricated_citation | 404 | PASS |
| tc_034 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |
| tc_035 | overstated_confidence | partially_supported | partially_supported | overstatement | resolved | PASS |
| tc_036 | scope_expansion | unsupported | unsupported | scope_expansion | resolved | PASS |
| tc_037 | control | supported | supported | — | resolved | PASS |
| tc_038 | control | supported | supported | — | resolved | PASS |
| tc_039 | quantitative_fabrication | unsupported | unsupported | numeric_mismatch | resolved | PASS |
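The per-type counts in the "By hallucination type" table can be recomputed from the per-case rows. The sketch below assumes the rows are available as pipe-delimited text in the seven-column layout shown above. The helper name `tally_by_type` is hypothetical, and the sample uses only four rows copied from the table.

```python
from collections import Counter


def tally_by_type(rows):
    """Return {type: (passed, total)} from pipe-delimited per-case rows."""
    passed, total = Counter(), Counter()
    for line in rows:
        # Drop the outer pipes, then split into the seven cells of a row.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        _case, case_type, _expected, _actual, _defect, _resolution, result = cells
        total[case_type] += 1
        if result == "PASS":
            passed[case_type] += 1
    return {t: (passed[t], total[t]) for t in total}


# Four rows copied from the per-case table, for illustration only.
sample_rows = [
    "| tc_001 | contradiction | unsupported | unsupported | contradiction | resolved | PASS |",
    "| tc_014 | total_citation_fabrication | cannot_verify | cannot_verify | fabricated_citation | 404 | PASS |",
    "| tc_037 | control | supported | supported | — | resolved | PASS |",
    "| tc_038 | control | supported | supported | — | resolved | PASS |",
]

for case_type, (ok, n) in tally_by_type(sample_rows).items():
    print(f"{case_type}: {ok}/{n}")
```

Run over all 39 rows, the same tally should reproduce the Passed and Total columns of the by-type table.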