Soft-Label Governance for Distributional Safety in Multi-Agent Systems
Summary: arXiv:2604.19752v1 Announce Type: cross
Abstract: Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (System-Wide Assessment of Risk in Multi-agent systems), a simulation framework that replaces binary good/bad labels with soft probabilistic labels p = P(v = +1) ∈ [0, 1], enabling continuous-valued payoff computation, toxicity measurement, and governance intervention.
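As a minimal sketch of what a continuous-valued payoff could look like, the snippet below contrasts a thresholded binary payoff with an expectation taken under the soft label p. The function names, the `benefit` and `harm` magnitudes, and the 0.5 threshold are illustrative assumptions and are not taken from SWARM itself.

```python
def binary_payoff(p, benefit=1.0, harm=2.0, threshold=0.5):
    """Hard label: treat the interaction as fully good or fully bad."""
    return benefit if p >= threshold else -harm


def soft_payoff(p, benefit=1.0, harm=2.0):
    """Soft label: expected payoff with P(v = +1) = p and P(v = -1) = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return p * benefit - (1.0 - p) * harm


# Two interactions that a binary evaluator scores identically.
for p in (0.55, 0.95):
    print(p, binary_payoff(p), round(soft_payoff(p), 3))
```

Under these assumed magnitudes, both interactions receive the same binary payoff, while the soft payoff for p = 0.55 is negative (about -0.35) and the one for p = 0.95 is positive (about 0.85). This is the kind of uncertainty a hard label discards.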
SWARM implements a modular governance engine with configurable levers such as:
- Transaction taxes
- Circuit breakers
- Reputation decay
- Random audits
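The levers listed above could be parameterized roughly as follows. This is a hypothetical sketch only: `GovernanceConfig`, its field names and defaults, and the helper functions are assumptions made for illustration and do not reflect SWARM's actual interface.

```python
import random
from dataclasses import dataclass


@dataclass
class GovernanceConfig:
    # All fields and defaults are hypothetical, chosen for illustration.
    tax_rate: float = 0.05           # fraction of each positive payoff taken as tax
    breaker_threshold: float = 0.5   # trip when recent mean toxicity exceeds this
    reputation_decay: float = 0.9    # multiplicative reputation decay per round
    audit_probability: float = 0.1   # chance that an interaction is audited


def apply_tax(payoff, cfg):
    """Transaction tax: scale down positive payoffs, leave losses untouched."""
    return payoff * (1.0 - cfg.tax_rate) if payoff > 0 else payoff


def breaker_tripped(recent_p, cfg):
    """Circuit breaker: compare mean (1 - p) over recent interactions to the threshold."""
    if not recent_p:
        return False
    toxicity = sum(1.0 - p for p in recent_p) / len(recent_p)
    return toxicity > cfg.breaker_threshold


def decay_reputation(reputation, cfg):
    """Reputation decay: shrink reputation by a fixed factor each round."""
    return reputation * cfg.reputation_decay


def is_audited(rng, cfg):
    """Random audit: Bernoulli draw with the configured probability."""
    return rng.random() < cfg.audit_probability


cfg = GovernanceConfig()
rng = random.Random(0)  # seeded so repeated runs are reproducible
print(apply_tax(10.0, cfg))
print(breaker_tripped([0.2, 0.3, 0.4], cfg))
print(decay_reputation(1.0, cfg))
print(is_audited(rng, cfg))
```

Keeping each lever as a separate function of a shared config is one way to make them independently configurable; the real engine may compose them differently.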
The framework quantifies the effects of these governance mechanisms through probabilistic metrics including:
- Expected toxicity: E[1 − p | accepted]
- Quality gap: E[p | accepted] − E[p | rejected]
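The two metrics above can be estimated from a sample of interactions as plain conditional means. The following sketch assumes each interaction carries a soft label p and a boolean accept/reject decision; the function names and the toy data are illustrative, not part of the framework.

```python
from statistics import mean


def expected_toxicity(p_values, accepted):
    """Sample estimate of E[1 - p | accepted]."""
    acc = [1.0 - p for p, a in zip(p_values, accepted) if a]
    if not acc:
        raise ValueError("no accepted interactions")
    return mean(acc)


def quality_gap(p_values, accepted):
    """Sample estimate of E[p | accepted] - E[p | rejected]."""
    acc = [p for p, a in zip(p_values, accepted) if a]
    rej = [p for p, a in zip(p_values, accepted) if not a]
    if not acc or not rej:
        raise ValueError("need at least one accepted and one rejected interaction")
    return mean(acc) - mean(rej)


# Toy data: soft labels and the corresponding accept/reject decisions.
p_values = [0.9, 0.8, 0.3, 0.6, 0.1]
accepted = [True, True, False, True, False]

print(round(expected_toxicity(p_values, accepted), 4))
print(round(quality_gap(p_values, accepted), 4))
```

A positive quality gap means the acceptance decision is selecting for higher-p interactions; a gap near zero or below suggests the filter is not separating good interactions from bad ones.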
Across seven scenarios with five-seed replication, our experiments show that strict governance reduces welfare by over 40% without improving safety. Aggressively internalizing system externalities likewise collapses total welfare, from a baseline of +262 to −67, while toxicity remains unchanged.
Circuit breakers require careful calibration: overly restrictive thresholds severely diminish system value, whereas a well-chosen threshold balances moderate welfare against minimized toxicity. Companion experiments show that soft metrics detect proxy gaming by self-optimizing agents that may pass conventional binary evaluations.
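The calibration tradeoff for circuit breakers can be illustrated with a toy threshold sweep. This sketch assumes a breaker that halts an agent once its running mean toxicity exceeds a threshold, and reuses an assumed soft payoff of p·benefit − (1 − p)·harm. The stream, thresholds, and payoff magnitudes are invented for illustration and do not reproduce the paper's experiments.

```python
def run_with_breaker(p_stream, threshold, benefit=1.0, harm=2.0):
    """Process interactions in order and halt once running toxicity exceeds threshold.

    Returns (welfare, mean toxicity of processed interactions, number processed).
    """
    welfare, tox_sum, n = 0.0, 0.0, 0
    for p in p_stream:
        welfare += p * benefit - (1.0 - p) * harm  # assumed soft payoff
        tox_sum += 1.0 - p
        n += 1
        if tox_sum / n > threshold:
            break  # breaker trips; later interactions never happen
    return welfare, (tox_sum / n if n else 0.0), n


# Toy agent whose interaction quality degrades over time.
stream = [0.9, 0.9, 0.95, 0.9, 0.85, 0.4, 0.3, 0.2]

for threshold in (0.05, 0.2, 0.9):
    welfare, toxicity, n = run_with_breaker(stream, threshold)
    print(f"threshold={threshold:.2f} processed={n} "
          f"welfare={welfare:+.2f} toxicity={toxicity:.2f}")
```

On this toy stream, the strictest threshold trips after a single interaction and forfeits most of the early value, the loosest never trips and absorbs every late loss, and the intermediate setting retains the most welfare. The qualitative pattern, not the specific numbers, is the point.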
This governance layer applies without modification to live LLM-backed agents, including Concordia entities, Claude, and GPT-4o Mini. Our results show that distributional safety requires continuous risk metrics and that governance levers must be calibrated against quantifiable safety-welfare tradeoffs.
Source code and project resources are publicly available at https://www.swarm-ai.org/.
