From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
arXiv:2604.16270v1
Abstract: The complexity of Vietnam’s legal texts presents a significant barrier to public access to justice. While Large Language Models (LLMs) offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need.
Evaluation Framework
Our evaluation framework has two main components: a quantitative performance benchmark, described here, and a large-scale error analysis, described in the next section.
- Performance Benchmark: We benchmark four state-of-the-art large language models: GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1. The benchmark covers three key dimensions:
  - Accuracy: How accurately the models interpret and process legal text.
  - Readability: How easily the general public can understand the model's simplified output.
  - Consistency: How reliably the models produce consistent outputs across similar legal texts.
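As a rough illustration of how a three-dimension benchmark like this might be tabulated, the sketch below averages per-article ratings into one scorecard per model. The rating scale, the `Rating` record, the `scorecard` helper, and the placeholder model names and values are all assumptions made for illustration; the source does not specify how the scores are computed or aggregated.

```python
from dataclasses import dataclass
from statistics import mean

# The three dimensions named in the benchmark.
DIMENSIONS = ("accuracy", "readability", "consistency")


@dataclass(frozen=True)
class Rating:
    """One hypothetical rating of one model's output for one legal article."""
    model: str
    article_id: str
    accuracy: float
    readability: float
    consistency: float


def scorecard(ratings):
    """Average each dimension per model; returns {model: {dimension: mean}}."""
    by_model = {}
    for rating in ratings:
        by_model.setdefault(rating.model, []).append(rating)
    return {
        model: {dim: mean(getattr(r, dim) for r in rows) for dim in DIMENSIONS}
        for model, rows in by_model.items()
    }


# Placeholder values on an assumed 1-5 scale, not results from the paper.
ratings = [
    Rating("model_a", "art-1", 4.0, 3.0, 5.0),
    Rating("model_a", "art-2", 2.0, 5.0, 3.0),
    Rating("model_b", "art-1", 5.0, 2.0, 4.0),
]
card = scorecard(ratings)
print(card["model_a"])
```

Keeping the three dimensions as separate columns, rather than collapsing them into one number, makes a trade-off between dimensions visible when comparing models.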
Understanding the “Why”
To explain what lies behind the performance scores, we conducted a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles. The analysis uses a novel, expert-validated error typology, which lets us identify the underlying reasons for each model's performance.
Key Findings
Our findings reveal a crucial trade-off among the models:
- Models such as Grok-1 excelled in Readability and Consistency, but at the expense of fine-grained legal Accuracy.
- Conversely, Claude 3 Opus achieved high Accuracy scores, but those scores concealed a significant number of subtle yet critical reasoning errors.
Additionally, our error analysis identified two prevalent types of failures:
- Incorrect Example: Instances where the model provided legal interpretations or examples that were inaccurate or misleading.
- Misinterpretation: Cases where the model misinterpreted the legal context or nuances, leading to erroneous conclusions.
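A typology-based error analysis of this kind can be thought of as tallying expert labels per model. The sketch below is a minimal illustration under stated assumptions: the record layout, the `error_counts` and `error_rate` helpers, and the placeholder model names and annotations are hypothetical. Only the two label names come from the text above.

```python
from collections import Counter

# Hypothetical annotation records: (model, article_id, error_label or None).
# These are placeholder values, not data from the paper.
annotations = [
    ("model_a", "art-1", "Incorrect Example"),
    ("model_a", "art-2", None),
    ("model_a", "art-3", "Misinterpretation"),
    ("model_b", "art-1", "Misinterpretation"),
    ("model_b", "art-2", "Misinterpretation"),
]


def error_counts(records):
    """Count how often each error label was assigned, per model."""
    counts = {}
    for model, _article_id, label in records:
        counter = counts.setdefault(model, Counter())
        if label is not None:
            counter[label] += 1
    return counts


def error_rate(records, model):
    """Fraction of a model's annotated outputs that carry any error label."""
    labels = [label for m, _article_id, label in records if m == model]
    if not labels:
        return 0.0
    return sum(label is not None for label in labels) / len(labels)


counts = error_counts(annotations)
print(counts["model_b"].most_common(1))
print(error_rate(annotations, "model_a"))
```

Comparing a per-model error rate like this against the benchmark scores is one way to surface cases where a high Accuracy score coexists with frequent reasoning errors.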
Conclusion
Our research confirms that the primary challenge for current LLMs in legal applications is not summarization itself but controlled, accurate legal reasoning. By combining a quantitative benchmark with a qualitative deep dive, our work offers a holistic and actionable assessment of LLMs for legal applications, paving the way for improvements in the accessibility of legal texts in Vietnam.
