Swiss-Bench 003: Assessing LLM Reliability & Security

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

Deploying large language models (LLMs) in Swiss financial and regulatory settings requires empirical evidence of production reliability and adversarial security, two dimensions that existing Swiss-focused evaluation frameworks have not adequately addressed. This paper introduces Swiss-Bench 003 (SBP-003), which extends the Helvetic AI Assessment Score (HAAS) with two new dimensions: D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security).

Key Highlights of Swiss-Bench 003

SBP-003 evaluates ten frontier models on 808 Swiss-specific items in four languages: German, French, Italian, and English. The framework comprises seven Swiss-adapted benchmarks:

  • Swiss TruthfulQA
  • Swiss IFEval
  • Swiss SimpleQA
  • Swiss NIAH
  • Swiss PII-Scope
  • System Prompt Leakage
  • Swiss German Comprehension

These benchmarks are designed to align with FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and the OWASP Top 10 for LLMs.

Evaluation Findings

Self-graded D7 scores, which serve as a proxy for model reliability, range from 73% to 94%. These are higher than the externally judged D8 security scores, which range from 20% to 61%. The two dimensions use non-comparable scoring regimes, however, so the gap between them should be interpreted with caution.

System Prompt Leakage and PII Extraction

Resistance to system prompt leakage varies widely, with scores ranging from 24.8% to 88.2%. Defense against extraction of personally identifiable information (PII), by contrast, is weak across the board: every evaluated model scored between 14% and 42%.

Among the models evaluated, Qwen 3.5 Plus achieved the highest self-graded D7 score at 94.4%. GPT-oss 120B achieved the highest D8 score at 60.7%, despite being the lowest-cost model in the evaluation.
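
As a rough illustration of how the reported ranges relate to one another, the following Python sketch computes the width of each range and checks whether two ranges overlap. The figures are the rounded range endpoints quoted in this summary; the dictionary name and the helper functions `spread` and `overlaps` are hypothetical and not part of SBP-003. Because D7 and D8 use different scoring regimes, the comparison is descriptive only.

```python
# Score ranges (percent) as quoted in this summary: (lowest, highest).
# These are rounded endpoints, not per-model results.
REPORTED_RANGES = {
    "D7 self-graded reliability": (73.0, 94.0),
    "D8 adversarial security": (20.0, 61.0),
    "System prompt leakage resistance": (24.8, 88.2),
    "PII extraction defense": (14.0, 42.0),
}


def spread(low, high):
    """Width of a reported range in percentage points."""
    return round(high - low, 1)


def overlaps(a, b):
    """True if two (low, high) ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    for name, (low, high) in REPORTED_RANGES.items():
        print(f"{name}: {low}-{high}% (spread {spread(low, high)} points)")

    d7 = REPORTED_RANGES["D7 self-graded reliability"]
    d8 = REPORTED_RANGES["D8 adversarial security"]
    # The D7 and D8 ranges do not overlap: the lowest D7 score quoted
    # is above the highest D8 score quoted.
    print("D7 and D8 ranges overlap:", overlaps(d7, d8))
```

On these figures, system prompt leakage resistance has the widest range and the D7 and D8 ranges do not overlap at all.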

Conclusion and Future Directions

All evaluations were zero-shot and run under provider default settings. The D7 scores are self-graded and do not represent independently validated accuracy. The paper also includes conceptual mapping tables that relate the benchmark dimensions to FINMA regulatory requirements, data protection obligations under the nDSG, and the OWASP risk categories for LLMs.

The framework extends the evaluation of LLM performance to Swiss-specific reliability and security concerns and is a step toward aligning AI deployments with national regulatory requirements.


Lazarus Omolua
https://richlyai.com/blog