Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks
Source: arXiv:2603.23646v1 (cross-listed)
Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. This article introduces Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian). It evaluates ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and a weighted kappa = 0.605. Reference answers were validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy).
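The panel aggregation described in the abstract can be illustrated with a minimal sketch. It assumes a three-level ordinal label scale ("incorrect", "partial", "correct"), a fallback to "partial" when all three judges disagree, and a linear-weighted Cohen's kappa averaged over judge pairs. The abstract does not specify the tie-breaking rule, the weighting scheme, or how the three-judge kappa was computed, so all of these choices, and every function name below, are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Assumed ordinal label scale; the source does not list the exact labels.
LABELS = ["incorrect", "partial", "correct"]


def majority_vote(votes, tie_label="partial"):
    """Return the label chosen by at least two judges.

    If every judge gives a different label, fall back to tie_label
    (a hypothetical rule; the source does not describe tie handling).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else tie_label


def weighted_kappa(a, b, labels=LABELS):
    """Linear-weighted Cohen's kappa between two raters' label lists."""
    k = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(a)
    # Observed joint distribution of the two raters' labels.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    # Disagreement weight grows linearly with distance on the scale.
    def weight(i, j):
        return abs(i - j) / (k - 1)

    d_obs = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    # If no disagreement is expected by chance, treat agreement as perfect.
    return 1.0 if d_exp == 0 else 1.0 - d_obs / d_exp


def mean_pairwise_kappa(judges):
    """Average weighted kappa over all pairs of judges (assumed aggregation)."""
    pairs = list(combinations(judges, 2))
    return sum(weighted_kappa(a, b) for a, b in pairs) / len(pairs)
```

Given one list of labels per judge, `[majority_vote(v) for v in zip(*judges)]` yields the aggregated label for each item, and `mean_pairwise_kappa(judges)` gives one possible panel-level agreement figure.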
Key Findings
Results reveal three descriptive performance clusters among the evaluated models:
- Tier A: 35-38% correct
- Tier B: 26-29% correct
- Tier C: 13-21% correct
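As a small illustration, the tier ranges above can be turned into a lookup. This is a sketch under two assumptions: the reported ranges are rounded to whole percentages, and a score that falls between ranges belongs to no tier. The paper describes the tiers as descriptive clusters rather than fixed thresholds, and the helper name `tier_for` is hypothetical.

```python
# Observed tier ranges (percent correct) as reported in the summary.
TIER_RANGES = {"A": (35, 38), "B": (26, 29), "C": (13, 21)}


def tier_for(percent_correct):
    """Map a percent-correct score to its descriptive tier, or None.

    The score is rounded to a whole percentage first, on the assumption
    that the published ranges are themselves rounded.
    """
    rounded = round(percent_correct)
    for tier, (low, high) in TIER_RANGES.items():
        if low <= rounded <= high:
            return tier
    return None
```

Under these assumptions, the top-ranked model's 38.2% correct maps to Tier A, while a score in one of the gaps between ranges, such as 32%, maps to no tier.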
Benchmark Difficulty
The benchmark proves challenging for all models. Even the top-ranked model, Qwen 3.5 Plus, achieved only 38.2% correct responses, with 47.3% incorrect and 14.4% partially correct. Difficulty varied widely across task types:
- Legal Translation and Case Analysis: 69-72% correct
- Regulatory Q&A, Hallucination Detection, and Gap Analysis: below 9% correct
Model Performance Overview
The evaluated roster comprises seven open-weight and three closed-source models. An open-weight model leads the ranking, and several open-weight models matched or outperformed their closed-source counterparts.
These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions. Models handle Legal Translation and Case Analysis comparatively well, while Regulatory Q&A, Hallucination Detection, and Gap Analysis remain largely unsolved.
Conclusion
Swiss-Bench SBP-002 offers an initial benchmark for evaluating large language models on applied Swiss regulatory compliance tasks. With the best model at 38.2% correct and three of the seven task types below 9% correct, current frontier models fall well short of reliable performance on these tasks. The spread across task types indicates where further development is most needed before such models can be relied on in real-world compliance work.
