Swiss-Bench SBP-002: Top AI Models on Swiss Legal Tasks

Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Source: arXiv:2603.23646v1

Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. This article introduces Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian). It evaluates ten frontier models from March 2026 using a structured three-dimension scoring framework assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation and a weighted kappa = 0.605. Reference answers were validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy).
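The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes a three-level ordinal label scale (incorrect, partial, correct), a fallback to "partial" when all three judges disagree, and a linear-weighted Cohen's kappa computed between two judges. The abstract does not specify the tie-break rule or which weighted-kappa variant was used, so the function names and those choices are hypothetical.

```python
from collections import Counter

# Ordinal label scale; assumed from the outcome categories the article reports.
LABELS = ["incorrect", "partial", "correct"]


def majority_vote(votes):
    """Return the label chosen by at least two of the three judges.

    If all three judges disagree, fall back to the middle label
    ("partial"). This tie-break is an assumption, not the paper's rule.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "partial"


def weighted_kappa(rater_a, rater_b, labels=LABELS):
    """Linear-weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(labels)
    n = len(rater_a)
    index = {lab: i for i, lab in enumerate(labels)}

    # Observed joint distribution of the two raters' labels.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal distributions, used for the chance-expected disagreement.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows linearly with distance on the scale.
            w = abs(i - j) / (k - 1)
            num += w * observed[i][j]
            den += w * row[i] * col[j]

    # If both raters always give the same single label, treat agreement as perfect.
    return 1.0 if den == 0 else 1.0 - num / den


# Toy usage with made-up judge labels (not data from the benchmark).
final_label = majority_vote(["correct", "correct", "incorrect"])
agreement = weighted_kappa(
    ["correct", "incorrect", "partial", "correct"],
    ["correct", "incorrect", "partial", "correct"],
)
```

With three judges, a pairwise statistic like this would have to be computed for each judge pair and then summarized. Whether the reported 0.605 is such a summary is not stated in the abstract.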

Key Findings

Results reveal three descriptive performance clusters among the evaluated models:

  • Tier A: 35-38% correct
  • Tier B: 26-29% correct
  • Tier C: 13-21% correct

Benchmark Difficulty

The benchmark is challenging for all models. Even the top-ranked model, Qwen 3.5 Plus, answered only 38.2% of items correctly, with 47.3% incorrect and 14.4% partially correct. Difficulty varied widely by task type:

  • Legal Translation and Case Analysis: 69-72% correct
  • Regulatory Q&A, Hallucination Detection, and Gap Analysis: below 9% correct

Model Performance Overview

The roster comprises seven open-weight and three closed-source models. An open-weight model leads the ranking, and several open-weight models matched or outperformed their closed-source counterparts.

These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions, that is, with models answering without access to retrieved documents. They indicate both the potential of these models in the legal domain and the areas where improvement is most needed.

Conclusion

Swiss-Bench SBP-002 offers an initial reference point for how large language models perform on applied Swiss legal and regulatory compliance tasks. Even the best model answers fewer than four in ten items correctly, and results differ widely by task type and by model. Continued research and development will be needed to improve these models’ accuracy and effectiveness in real-world applications.


Lazarus Omolua (https://richlyai.com/blog)