Small AI Models for Legal Document Reasoning: Study

Can Small Models Reason About Legal Documents? A Comparative Study

Source: arXiv:2603.25944v1 (cross-listed)

Abstract

Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model’s utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.

Introduction

The rapid development of large language models (LLMs) has increased interest in applying them to legal work. However, the cost, latency, and data privacy concerns of deploying frontier models have prompted researchers to explore smaller alternatives. This study investigates whether models with fewer than 10 billion parameters can handle legal reasoning tasks well enough to serve as practical substitutes.

Methodology

To assess how well smaller models reason about legal documents, we tested nine models on three legal benchmarks: ContractNLI (contract entailment), CaseHOLD (legal holding identification), and ECtHR (European Court of Human Rights cases). Each model was evaluated with five prompting strategies:

  • Direct prompting
  • Chain-of-thought prompting
  • Few-shot prompting
  • BM25 RAG (Retrieval-Augmented Generation)
  • Dense RAG

In total, we ran 405 experiments: nine models, three benchmarks, and five prompting strategies give 135 configurations, and each configuration was run with three random seeds.
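The experiment count follows directly from the grid described above. A minimal sketch of how such a grid could be enumerated is shown here; the model names and seed values are placeholders, since this summary does not list them.

```python
from itertools import product

# Placeholder model names: this summary does not name the nine models.
MODELS = [f"model_{i}" for i in range(1, 10)]
BENCHMARKS = ["ContractNLI", "CaseHOLD", "ECtHR"]
STRATEGIES = ["direct", "chain_of_thought", "few_shot", "bm25_rag", "dense_rag"]
SEEDS = [0, 1, 2]  # assumed seed values, for illustration only


def build_runs():
    """Return one dict per (model, benchmark, strategy, seed) combination."""
    return [
        {"model": m, "benchmark": b, "strategy": s, "seed": seed}
        for m, b, s, seed in product(MODELS, BENCHMARKS, STRATEGIES, SEEDS)
    ]


runs = build_runs()
print(len(runs))  # 9 * 3 * 5 * 3 = 405
```

Keeping the seed as an explicit dimension of the grid makes it straightforward to group runs by configuration afterwards.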

Results

A Mixture-of-Experts model that activates only 3 billion parameters matched GPT-4o-mini in mean accuracy and surpassed it on legal holding identification. Our largest model, at 9 billion parameters, performed worst overall. Together, these results indicate that architecture and training quality matter more than raw parameter count.
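Mean accuracy here is an average over the three seeds of a configuration. The sketch below shows one plausible way to compute it; the function names are hypothetical, and the predictions are toy values, not results from the study.

```python
from statistics import mean, stdev


def accuracy(predictions, labels):
    """Fraction of predictions that equal the gold label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need non-empty, equal-length predictions and labels")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)


def summarize(per_seed_accuracy):
    """Mean and sample standard deviation across seeds."""
    return mean(per_seed_accuracy), stdev(per_seed_accuracy)


# Toy data only: these are NOT numbers from the study.
gold = ["A", "B", "C", "D"]
seed_predictions = [
    ["A", "B", "C", "A"],  # 3 of 4 correct
    ["A", "B", "D", "A"],  # 2 of 4 correct
    ["A", "B", "C", "D"],  # 4 of 4 correct
]
scores = [accuracy(p, gold) for p in seed_predictions]
avg, spread = summarize(scores)
print(scores, avg, spread)
```

Reporting the spread alongside the mean helps show whether a gap between two models is larger than seed-to-seed noise.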

Discussion

One of the most striking insights from our experiments is how sharply the effect of a prompting strategy depends on the task. Chain-of-thought prompting improved performance on contract entailment but degraded it on multiple-choice legal reasoning. Few-shot prompting, by contrast, was the most consistently effective strategy.
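To make the difference between the three non-retrieval strategies concrete, here is a minimal sketch of how the prompts might be assembled. The wording of each template and the `build_prompt` helper are assumptions for illustration; the study's actual prompts are not given in this summary.

```python
def build_prompt(question, strategy, examples=()):
    """Assemble a prompt string for one of three strategies.

    The templates are illustrative guesses, not the study's prompts.
    """
    if strategy == "direct":
        return f"{question}\nAnswer:"
    if strategy == "chain_of_thought":
        return f"{question}\nLet's think step by step, then state the final answer."
    if strategy == "few_shot":
        if not examples:
            raise ValueError("few_shot needs at least one (question, answer) pair")
        # Each solved example is shown in the same format as the final question.
        shots = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in examples)
        return f"{shots}\n\n{question}\nAnswer:"
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical example and question, made up for this sketch.
demo = [("Clause: X may share data with staff. Hypothesis: Staff may see data.", "Entailment")]
print(build_prompt("Clause: ... Hypothesis: ...", "few_shot", demo))
```

The few-shot variant shows the model the expected answer format directly, whereas chain-of-thought asks for free-form reasoning before the answer.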

Comparative Analysis

When comparing BM25 and dense retrieval for RAG, we observed nearly identical results. This suggests that the bottleneck lies not in retrieval quality but in how effectively the language model uses the retrieved context.
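For readers unfamiliar with BM25, the sketch below implements a common form of the Okapi BM25 scoring function in plain Python. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions; the study's retrieval configuration is not described in this summary. A dense retriever would replace this lexical score with a similarity between learned embeddings.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not documents:
        return []
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages, made up for illustration.
passages = [
    "the receiving party shall not disclose confidential information",
    "the court held that the contract was void",
    "the applicant alleged a violation of article 6",
]
scores = bm25_scores("disclose confidential information", passages)
best = max(range(len(passages)), key=scores.__getitem__)
print(best, scores)
```

Only the first passage shares terms with the query, so it is the only one with a non-zero score and would be the passage prepended to the prompt in a BM25 RAG setup.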

Conclusion

Our research indicates that sub-10B parameter models can serve as practical alternatives to frontier models for legal reasoning tasks. All experiments ran through cloud inference APIs at a total cost of $62, showing that rigorous LLM evaluation is accessible without dedicated GPU infrastructure. These results open avenues for further work on deploying smaller, cost-effective AI models in the legal domain.


Lazarus Omolua
https://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.
