Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
arXiv:2604.06173v1
Abstract: Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.
Introduction
Legal question answering (QA) benchmarks have focused predominantly on case law, leaving the demands of statute-centric regulatory reasoning largely untested. This article summarizes SearchFireSafety, a benchmark built for statute-centric legal QA and instantiated on fire-safety regulations as a representative case.
Challenges in Statute-Centric Legal QA
Statutory domains pose challenges that QA systems evaluated mainly on case law are not designed to handle:
- Hierarchical Evidence Distribution: The evidence needed to answer a question is distributed across hierarchically linked documents rather than contained in a single passage.
- Statutory Retrieval Gap: Conventional retrievers often miss part of this linked evidence, so the model receives incomplete context.
- Model Hallucination: When the retrieved context is incomplete, models often produce unsupported answers instead of abstaining.
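To make the retrieval gap concrete, the following minimal sketch models three provisions spread across an act, a decree, and a technical standard, then measures how much of the required evidence a retriever returned. The provision IDs and texts, the `Provision` type, and the `evidence_recall` helper are all illustrative assumptions; none of them come from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provision:
    """One statutory unit; all IDs and texts here are made up for illustration."""
    pid: str
    level: str  # e.g. "act", "decree", "standard"
    text: str


def evidence_recall(retrieved: list[str], required: set[str]) -> float:
    """Fraction of required provision IDs that appear in the retrieved list."""
    if not required:
        return 1.0
    return len(required & set(retrieved)) / len(required)


corpus = [
    Provision("act-12", "act", "Buildings must install sprinklers as prescribed by decree."),
    Provision("decree-15", "decree", "Sprinklers are required above the threshold in standard 103."),
    Provision("std-103", "standard", "The threshold is defined per occupancy class."),
]

# A question about sprinkler obligations needs evidence from all three levels.
required = {"act-12", "decree-15", "std-103"}

# Hypothetical output of a flat retriever that only surfaces the top-level act.
flat_hits = ["act-12"]
recall = evidence_recall(flat_hits, required)
print(f"evidence recall: {recall:.2f}")
```

A recall below 1.0 on such a question means the model is asked to answer with part of the statutory chain missing, which is the situation the benchmark is designed to expose.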
Introducing SearchFireSafety
SearchFireSafety addresses these challenges as a structure- and safety-aware benchmark for evaluating legal QA models. It focuses on:
- Hierarchically Fragmented Evidence Retrieval: Assessing whether models can retrieve relevant evidence that is fragmented across hierarchically linked statutory sources.
- Safe Abstention: Evaluating whether models abstain when the statutory context is insufficient to support an answer.
- Dual-Source Evaluation: Combining real-world questions that require citation-aware retrieval with synthetic partial-context scenarios that stress-test hallucination and refusal behavior.
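One way to score abstention behavior over a mix of answerable and partial-context items is sketched below. The `Item` fields, the metric names, and the choice to count any answer given under insufficient context as a hallucination are assumptions made for illustration; the benchmark's actual metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One evaluated example; field names are hypothetical."""
    answerable: bool  # True if the provided context suffices to answer
    abstained: bool   # True if the model declined to answer
    correct: bool     # True if a given answer matched the reference


def safety_report(items: list[Item]) -> dict[str, float]:
    """Summarize answer accuracy and refusal behavior over a mixed item set."""
    answerable = [i for i in items if i.answerable]
    unanswerable = [i for i in items if not i.answerable]

    def rate(hits: int, total: int) -> float:
        return hits / total if total else 0.0

    return {
        # Answered correctly when the context was sufficient.
        "accuracy": rate(sum(i.correct and not i.abstained for i in answerable), len(answerable)),
        # Refused even though the context was sufficient.
        "over_refusal": rate(sum(i.abstained for i in answerable), len(answerable)),
        # Refused when the context was insufficient (the desired behavior).
        "safe_abstention": rate(sum(i.abstained for i in unanswerable), len(unanswerable)),
        # Answered anyway when the context was insufficient (assumed definition).
        "hallucination": rate(sum(not i.abstained for i in unanswerable), len(unanswerable)),
    }


items = [
    Item(answerable=True, abstained=False, correct=True),
    Item(answerable=True, abstained=True, correct=False),
    Item(answerable=False, abstained=True, correct=False),
    Item(answerable=False, abstained=False, correct=False),
]
report = safety_report(items)
print(report)
```

Reporting over-refusal alongside hallucination keeps a model from scoring well simply by refusing everything.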
Experimental Findings
Experiments across multiple large language models show that graph-guided retrieval substantially improves performance on statute-centric QA. They also reveal a critical safety trade-off:
- Domain-adapted models are more likely to hallucinate when key statutory evidence is missing, so gains from domain adaptation need to be weighed against refusal behavior under incomplete context.
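The summary does not describe how graph-guided retrieval is implemented. As a rough illustration of the general idea only, the sketch below expands an initial set of retrieved provisions by following cross-reference edges breadth-first. The `expand_with_graph` function, the `max_hops` parameter, and the provision IDs are hypothetical and should not be read as the authors' method.

```python
from collections import deque


def expand_with_graph(seeds, edges, max_hops=2):
    """Breadth-first expansion of seed hits along cross-reference edges.

    seeds: provision IDs returned by some first-stage retriever.
    edges: mapping from a provision ID to the IDs it references.
    """
    seen = set(seeds)
    order = list(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not follow references beyond the hop budget
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order


# Hypothetical cross-reference graph: act -> decree -> technical standard.
edges = {
    "act-12": ["decree-15"],
    "decree-15": ["std-103"],
    "std-103": [],
}
seeds = ["act-12"]
one_hop = expand_with_graph(seeds, edges, max_hops=1)
two_hop = expand_with_graph(seeds, edges, max_hops=2)
print(one_hop)
print(two_hop)
```

The hop budget is the main design choice here: a larger value recovers more of the statutory chain but also pulls in provisions that may be irrelevant to the question.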
Conclusion
SearchFireSafety evaluates whether legal QA models can both retrieve hierarchically fragmented evidence and abstain safely when statutory context is insufficient. The findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.
