Swiss-Bench SBP-002: Top AI Models on Swiss Legal Tasks

Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Source: arXiv:2603.23646v1 (cross-listed)

Abstract: Recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and on academic legal reasoning from university exams (Fan et al., 2025), but no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. This article introduces Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian). It evaluates ten frontier models from March 2026 using a structured three-dimension scoring framework. Responses are scored by a blind panel of three LLM judges (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation, and inter-judge agreement is reported as a weighted kappa of 0.605. An independent human legal expert validated the reference answers on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy).
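To make the judging procedure concrete, the following is a minimal sketch of majority-vote aggregation over three judges and of a weighted kappa between two raters. Several details are assumptions made for illustration and are not stated in the source: the label names, the choice to leave a three-way split unresolved, the use of linear weights, and the pairwise (two-rater) form of kappa. The source does not say how the reported 0.605 was computed across the three judges.

```python
from collections import Counter

# Ordinal label scale assumed for illustration; the source names
# Correct, Partially Correct, and Incorrect outcomes.
LABELS = ["incorrect", "partial", "correct"]


def majority_vote(votes):
    """Return the label chosen by at least two of three judges, else None.

    Returning None on a three-way split is an assumption; the source does
    not describe how such cases are resolved.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


def weighted_kappa(rater_a, rater_b, labels=LABELS):
    """Linearly weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(labels)
    n = len(rater_a)
    index = {label: i for i, label in enumerate(labels)}

    # Observed joint proportions of (rater_a label, rater_b label).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    # Marginal label proportions for each rater.
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Linear disagreement weights: 0 on the diagonal, 1 at the extremes.
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            disagree_obs += w * observed[i][j]
            disagree_exp += w * marg_a[i] * marg_b[j]

    if disagree_exp == 0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return 1.0 - disagree_obs / disagree_exp


print(majority_vote(["correct", "correct", "incorrect"]))
print(weighted_kappa(["correct", "incorrect", "partial"],
                     ["correct", "incorrect", "partial"]))
```

With three judges, one plausible way to obtain a single agreement figure is to average `weighted_kappa` over the three judge pairs, though the source does not confirm that this is the method used.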

Key Findings

Results reveal three descriptive performance clusters among the evaluated models:

Tier A: 35-38% correct
Tier B: 26-29% correct
Tier C: 13-21% correct
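
The tier ranges above can be read as a simple lookup. The sketch below assumes the reported ranges are rounded to whole percentages, which is why a score such as 38.2% is rounded before comparison. The names `TIER_RANGES` and `tier_for` are hypothetical. Because the tiers are descriptive clusters rather than formal thresholds, scores that fall between ranges are left unassigned.

```python
# Tier ranges as reported in the article, in whole percent correct.
TIER_RANGES = {
    "A": (35, 38),
    "B": (26, 29),
    "C": (13, 21),
}


def tier_for(correct_pct):
    """Map a percent-correct score to its reported tier, or None if it
    falls outside every reported range."""
    rounded = round(correct_pct)  # assumption: ranges are integer-rounded
    for tier, (low, high) in TIER_RANGES.items():
        if low <= rounded <= high:
            return tier
    return None


print(tier_for(38.2))
print(tier_for(32.0))
```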

Benchmark Difficulty

The benchmark proves challenging for all models. Even the top-ranked model, Qwen 3.5 Plus, answered only 38.2% of items correctly, with 47.3% incorrect and 14.4% partially correct. Difficulty varied sharply by task type:

Legal Translation and Case Analysis: 69-72% correct
Regulatory Q&A, Hallucination Detection, and Gap Analysis: below 9% correct
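
As a rough illustration of how such percentages are derived, the sketch below converts outcome counts into shares and computes a per-task correct rate. The counts 151, 187, and 57 are not given in the source; they are one set of counts over 395 items that happens to reproduce the reported 38.2% / 47.3% / 14.4% split, and they assume every item received a final label. The task records are toy data, and the function names are hypothetical.

```python
from collections import defaultdict


def outcome_shares(counts):
    """Convert raw outcome counts to percentages rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def correct_rate_by_task(records):
    """Percent of items labeled 'correct' within each task type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task_type, label in records:
        totals[task_type] += 1
        if label == "correct":
            correct[task_type] += 1
    return {task: round(100 * correct[task] / totals[task], 1) for task in totals}


# Hypothetical counts over 395 items, chosen only because they reproduce
# the reported split; the source does not list counts.
top_model_counts = {"correct": 151, "incorrect": 187, "partial": 57}
print(outcome_shares(top_model_counts))

# Toy records, not benchmark data.
toy_records = [
    ("legal_translation", "correct"),
    ("legal_translation", "correct"),
    ("legal_translation", "incorrect"),
    ("regulatory_qa", "incorrect"),
    ("regulatory_qa", "partial"),
]
print(correct_rate_by_task(toy_records))
```

The three reported shares sum to 99.9% rather than 100% because each is rounded to one decimal place independently.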

Model Performance Overview

The evaluated roster comprises seven open-weight and three closed-source models. An open-weight model leads the ranking, and several open-weight models matched or outperformed their closed-source counterparts.

These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions. The results indicate the potential of these models in the legal domain and highlight the areas where improvement is most needed.

Conclusion

Swiss-Bench SBP-002 offers a benchmark for evaluating large language models on Swiss legal and regulatory compliance tasks. It underscores the complexity of these tasks and the wide spread in performance across models. As the legal landscape continues to evolve, ongoing research and development will be essential to improve these models’ accuracy and effectiveness in real-world applications.
