Swiss-Bench 003: Assessing LLM Reliability & Security

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

The deployment of large language models (LLMs) in Swiss financial and regulatory contexts has emerged as a critical focus area, necessitating empirical evidence of production reliability and adversarial security. These dimensions have not been adequately addressed within existing Swiss-focused evaluation frameworks. In response, this paper introduces Swiss-Bench 003 (SBP-003), which extends the Helvetic AI Assessment Score (HAAS) by adding two new dimensions: D7 (Self-Graded Reliability Proxy) and D8 (Adversarial Security).

Key Highlights of Swiss-Bench 003

The evaluation framework presented in SBP-003 assesses ten frontier models across 808 Swiss-specific items in four languages: German, French, Italian, and English. The framework consists of seven Swiss-adapted benchmarks, which include:

Swiss TruthfulQA
Swiss IFEval
Swiss SimpleQA
Swiss NIAH
Swiss PII-Scope
System Prompt Leakage
Swiss German Comprehension

These benchmarks aim to align with key regulatory guidelines such as FINMA Guidance 08/2024, the revised Federal Act on Data Protection (nDSG), and the OWASP Top 10 for LLMs.

Evaluation Findings

The results from the evaluation reveal that self-graded D7 scores, which reflect the models’ reliability, range between 73% and 94%. Notably, these scores exceed the externally judged D8 security scores, which range from 20% to 61%. However, it is important to highlight that these dimensions utilize non-comparable scoring regimes, which may impact the interpretation of results.

System Prompt Leakage and PII Extraction

The assessment of system prompt leakage resistance shows a significant variance, with scores ranging from 24.8% to 88.2%. In contrast, the defense against Personally Identifiable Information (PII) extraction remains notably weak, with all evaluated models achieving scores between 14% and 42%.

Among the models evaluated, Qwen 3.5 Plus achieved the highest self-graded D7 score at 94.4%. Meanwhile, GPT-oss 120B secured the highest D8 score at 60.7%, notably despite being the lowest-cost model in the evaluation.

Conclusion and Future Directions

All evaluations conducted in this study were zero-shot and performed under provider default settings. The D7 scores are self-graded and do not represent independently validated accuracy. The paper also includes conceptual mapping tables that relate benchmark dimensions to regulatory requirements set forth by FINMA, data protection obligations under the nDSG, and risk categories outlined by OWASP for LLMs.

This innovative framework not only expands the understanding of LLM performance in the Swiss context but also serves as a crucial step in aligning AI technologies with national regulatory standards.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!

Company

Swiss-Bench 003: Assessing LLM Reliability & Security

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

Key Highlights of Swiss-Bench 003

Evaluation Findings

System Prompt Leakage and PII Extraction

Conclusion and Future Directions

Related AI Insights

Subscribe

More like thisRelated

About us

Company

The latest

Subscribe

RichlyAI Blog
AI Guide, Tutorials, Industrial Insights, & more!

More like this
Related