Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
Deploying large language models (LLMs) in Swiss financial and regulatory settings calls for empirical evidence of production reliability and adversarial security, two dimensions that existing Swiss-focused evaluation frameworks do not adequately address. This paper introduces Swiss-Bench 003 (SBP-003), which extends the Helvetic AI Assessment Score (HAAS) with two new dimensions: D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security).
Key Highlights of Swiss-Bench 003
SBP-003 evaluates ten frontier models on 808 Swiss-specific items in four languages: German, French, Italian, and English. It comprises seven Swiss-adapted benchmarks:
- Swiss TruthfulQA
- Swiss IFEval
- Swiss SimpleQA
- Swiss NIAH
- Swiss PII-Scope
- System Prompt Leakage
- Swiss German Comprehension
These benchmarks are designed to align with regulatory and security references including FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and the OWASP Top 10 for LLMs.
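As a rough illustration, the suite described above can be written down as a small configuration. The benchmark names, languages, item count, and model count come from the text; the variable names and the `describe_suite` helper are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch only: a plain-Python description of the SBP-003 suite.
# Names and counts are from the text; the layout and helper are assumptions.

BENCHMARKS = (
    "Swiss TruthfulQA",
    "Swiss IFEval",
    "Swiss SimpleQA",
    "Swiss NIAH",
    "Swiss PII-Scope",
    "System Prompt Leakage",
    "Swiss German Comprehension",
)

LANGUAGES = ("German", "French", "Italian", "English")

TOTAL_ITEMS = 808       # Swiss-specific items across all benchmarks
MODELS_EVALUATED = 10   # frontier models in the evaluation


def describe_suite() -> str:
    """Return a one-line summary of the suite (hypothetical helper)."""
    return (
        f"{len(BENCHMARKS)} benchmarks, {len(LANGUAGES)} languages, "
        f"{TOTAL_ITEMS} items, {MODELS_EVALUATED} models"
    )


print(describe_suite())
```

The sketch deliberately does not assign items or languages to individual benchmarks, since that breakdown is not given here.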
Evaluation Findings
Self-graded D7 scores, which are a proxy for reliability rather than a validated measure of it, range from 73% to 94%. They exceed the externally judged D8 security scores, which range from 20% to 61%. However, the two dimensions use non-comparable scoring regimes, so the gap between them should not be read as a direct difference in performance.
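One way to keep this caveat from being lost in downstream analysis is to tag every score with its grading regime and refuse arithmetic across regimes. The following is a minimal sketch under that assumption; the `Regime`, `DimensionScore`, and `score_gap` names are hypothetical, and the values are the rounded range endpoints reported above rather than per-model results.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    """How a score was graded (illustrative labels)."""
    SELF_GRADED = "self-graded"
    EXTERNALLY_JUDGED = "externally judged"


@dataclass(frozen=True)
class DimensionScore:
    dimension: str   # e.g. "D7" or "D8"
    percent: float   # score in percent
    regime: Regime


def score_gap(a: DimensionScore, b: DimensionScore) -> float:
    """Return a.percent - b.percent, but only within one grading regime."""
    if a.regime is not b.regime:
        raise ValueError(
            f"cannot compare {a.dimension} ({a.regime.value}) with "
            f"{b.dimension} ({b.regime.value}): scoring regimes differ"
        )
    return a.percent - b.percent


# Rounded range endpoints from the text, used purely as sample values.
d7_high = DimensionScore("D7", 94.0, Regime.SELF_GRADED)
d8_high = DimensionScore("D8", 61.0, Regime.EXTERNALLY_JUDGED)
d8_low = DimensionScore("D8", 20.0, Regime.EXTERNALLY_JUDGED)

print(score_gap(d8_high, d8_low))   # spread within D8 is allowed
try:
    score_gap(d7_high, d8_high)     # D7 vs D8 is rejected
except ValueError as err:
    print(err)
```

Raising an error is a stricter choice than a warning; it suits reporting pipelines where a D7-minus-D8 figure could otherwise end up in a summary table unnoticed.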
System Prompt Leakage and PII Extraction
Resistance to system prompt leakage varies widely across models, with scores ranging from 24.8% to 88.2%. Defense against extraction of personally identifiable information (PII), by contrast, is weak across the board: every evaluated model scores between 14% and 42%.
Qwen 3.5 Plus achieved the highest self-graded D7 score at 94.4%. GPT-oss 120B achieved the highest D8 score at 60.7%, despite being the lowest-cost model in the evaluation.
Conclusion and Future Directions
All evaluations conducted in this study were zero-shot and performed under provider default settings. The D7 scores are self-graded and do not represent independently validated accuracy. The paper also includes conceptual mapping tables that relate benchmark dimensions to regulatory requirements set forth by FINMA, data protection obligations under the nDSG, and risk categories outlined by OWASP for LLMs.
By adding reliability and adversarial security dimensions to HAAS, SBP-003 broadens the evaluation of LLMs in the Swiss context and offers a step toward aligning model assessment with Swiss regulatory expectations.
