Can Small Models Reason About Legal Documents? A Comparative Study
arXiv:2603.25944v1
Abstract
Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model’s utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.
Introduction
Rapid progress in large language models (LLMs) has increased interest in applying them to legal work. However, the cost, latency, and data-privacy implications of deploying frontier models have prompted researchers to explore smaller alternatives. This study investigates whether models with fewer than 10 billion parameters can handle legal reasoning tasks that are usually assigned to much larger models.
Methodology
To assess the capabilities of smaller models in legal document reasoning, we tested nine models on three legal benchmarks: ContractNLI, CaseHOLD, and ECtHR. Each model was evaluated under five prompting strategies:
- Direct prompting
- Chain-of-thought prompting
- Few-shot prompting
- BM25 RAG (Retrieval-Augmented Generation)
- Dense RAG
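The five strategies differ mainly in how the prompt is assembled. The following is a minimal sketch, assuming plain-text prompts; the function names and template wording are hypothetical and are not taken from the study. Under this sketch, BM25 RAG and dense RAG share one template and differ only in which retriever supplies the passages.

```python
from typing import Sequence, Tuple


def direct_prompt(question: str) -> str:
    """Direct prompting: ask for the answer with no extra scaffolding."""
    return f"{question}\nAnswer:"


def chain_of_thought_prompt(question: str) -> str:
    """Chain-of-thought: ask the model to reason before answering."""
    return (
        f"{question}\n"
        "Think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )


def few_shot_prompt(question: str, examples: Sequence[Tuple[str, str]]) -> str:
    """Few-shot: prepend solved (question, answer) pairs to the new question."""
    shots = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\n{question}\nAnswer:"


def rag_prompt(question: str, passages: Sequence[str]) -> str:
    """RAG: prepend retrieved passages (from BM25 or a dense retriever)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\n{question}\nAnswer:"
```

Keeping the templates as small pure functions makes it straightforward to hold the question fixed and vary only the strategy.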
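In total, we executed 405 experiments, with three random seeds per configuration so that run-to-run variation can be estimated. This total is consistent with a full grid of nine models, three benchmarks, five strategies, and three seeds (9 × 3 × 5 × 3 = 405).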
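One reading of the experiment count is a full Cartesian grid over models, benchmarks, strategies, and seeds. The sketch below enumerates such a grid. The model names and seed values are placeholders (the summary does not list them), and the strategy identifiers are hypothetical labels.

```python
from itertools import product

# Placeholder model names: the study uses nine models, not named here.
MODELS = [f"model_{i}" for i in range(1, 10)]
BENCHMARKS = ["ContractNLI", "CaseHOLD", "ECtHR"]
STRATEGIES = ["direct", "chain_of_thought", "few_shot", "bm25_rag", "dense_rag"]
SEEDS = [0, 1, 2]  # assumed seed values; only the count (three) is reported


def build_grid():
    """Return one run specification per (model, benchmark, strategy, seed)."""
    return [
        {"model": m, "benchmark": b, "strategy": s, "seed": seed}
        for m, b, s, seed in product(MODELS, BENCHMARKS, STRATEGIES, SEEDS)
    ]


grid = build_grid()
print(len(grid))  # 9 * 3 * 5 * 3 = 405
```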
Results
A Mixture-of-Experts model that activates only 3 billion parameters matched GPT-4o-mini in mean accuracy and surpassed it on legal holding identification (the CaseHOLD task). Our largest model, at 9 billion parameters, performed worst overall. Together, these results indicate that architecture and training quality matter more than raw parameter count.
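Comparisons of mean accuracy like these require collapsing the per-seed runs into one number per model. The following sketch shows one way to do that, assuming each run is stored as a dictionary with `model`, `benchmark`, `strategy`, `seed`, and `accuracy` keys. The record layout and function names are assumptions for illustration, not the study's actual pipeline.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_by_config(runs):
    """Average accuracy over seeds for each (model, benchmark, strategy).

    Returns {config: (mean_accuracy, sample_std)}; the standard deviation
    is 0.0 when a configuration has only one run.
    """
    grouped = defaultdict(list)
    for run in runs:
        key = (run["model"], run["benchmark"], run["strategy"])
        grouped[key].append(run["accuracy"])
    return {
        key: (mean(accs), stdev(accs) if len(accs) > 1 else 0.0)
        for key, accs in grouped.items()
    }


def mean_accuracy_per_model(summary):
    """Average the per-configuration means to get one score per model."""
    per_model = defaultdict(list)
    for (model, _benchmark, _strategy), (avg, _std) in summary.items():
        per_model[model].append(avg)
    return {model: mean(vals) for model, vals in per_model.items()}
```

Averaging over seeds first, then over configurations, gives every benchmark and strategy equal weight regardless of how many runs each happens to have.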
Discussion
One of the most striking findings is that the effect of a prompting strategy depends on the task. Chain-of-thought prompting improved performance on contract entailment but degraded it on multiple-choice legal reasoning. Few-shot prompting, in contrast, was the most consistently effective strategy across tasks.
Comparative Analysis
The two retrieval methods used for RAG, BM25 and dense retrieval, produced nearly identical results. This suggests that the bottleneck is not retrieval quality but how well the language model uses the retrieved context.
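To make the contrast concrete, the sketch below ranks documents in the two ways. It assumes whitespace tokenization, a common Okapi BM25 variant with default-style parameters `k1=1.5` and `b=0.75`, and dense vectors that have already been produced by some embedding model. None of these choices are taken from the study.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (lexical match)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query_vec, doc_vecs):
    """Score each document by cosine similarity of precomputed embeddings."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [cos(query_vec, v) for v in doc_vecs]


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Both retrievers end in the same `top_k` step, so the passages they hand to the language model can be swapped without changing anything downstream. That is the setting in which near-identical end results point to the model's use of context rather than to the retriever.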
Conclusion
Our results indicate that smaller models can serve as practical alternatives for legal reasoning tasks usually handled by larger models. All experiments were run through cloud inference APIs at a total cost of $62, which shows that rigorous LLM evaluation is feasible without dedicated GPU infrastructure. These findings motivate further work on deploying smaller, cost-effective models in the legal domain.
