Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning
A new study examines a vulnerability in language models that use continuous latent reasoning. The paper, titled “Thinking Wrong in Silence,” shows that these models, which reason in hidden states without generating intermediate tokens and therefore leave no audit trail, are susceptible to backdoor attacks. It introduces an attack method called ThoughtSteer that exploits this silent reasoning process.
Abstract
Language models that reason in continuous hidden states expose an attack surface that had not previously been exploited. With ThoughtSteer, an attacker perturbs a single embedding vector at the model’s input layer. The model’s own multi-pass reasoning amplifies this perturbation, hijacking the latent trajectory so that the model produces the attacker’s desired output while evading traditional defenses.
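The amplification idea can be made concrete with a toy model. The sketch below is not ThoughtSteer and does not use a real language model: it assumes a hypothetical two-dimensional latent state and a made-up recurrent update (`latent_passes`, with an assumed `gain` of 1.6) in which one axis expands and the other contracts. It only shows how a tiny input perturbation can grow when the same update is applied over many passes.

```python
import math

def latent_passes(h, num_passes, gain=1.6):
    """Toy recurrent update (an assumption, not the paper's model):
    axis 0 is expanded by `gain`, axis 1 is contracted, both squashed by tanh."""
    trajectory = [tuple(h)]
    for _ in range(num_passes):
        h = (math.tanh(gain * h[0]), math.tanh(0.5 * h[1]))
        trajectory.append(h)
    return trajectory

clean_input = (0.0, 0.3)
delta = (0.01, 0.0)  # hypothetical trigger: a small nudge to one input embedding
triggered_input = (clean_input[0] + delta[0], clean_input[1] + delta[1])

clean_traj = latent_passes(clean_input, num_passes=12)
triggered_traj = latent_passes(triggered_input, num_passes=12)

# Distance between the clean and triggered latent states after each pass.
gaps = [math.dist(c, t) for c, t in zip(clean_traj, triggered_traj)]

for step, g in enumerate(gaps):
    print(f"pass {step:2d}: gap = {g:.4f}")
```

In this toy the gap starts at 0.01 and grows with every pass until the tanh saturates. Whether a real latent-reasoning model behaves this way depends on its learned dynamics, which this sketch does not capture.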
Key Findings
- The study evaluates ThoughtSteer across two architectures, Coconut and SimCoT, and demonstrates its effectiveness across three distinct reasoning benchmarks.
- It achieves an attack success rate of 99% or greater while maintaining near-baseline accuracy on clean data.
- The attack transfers to held-out benchmarks without retraining, with success rates between 94% and 100%.
- ThoughtSteer successfully evades all five active defenses evaluated in the study and withstands 25 epochs of clean fine-tuning.
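For readers unfamiliar with the two headline metrics, the sketch below computes them under the definitions commonly used in backdoor research: attack success rate as the share of triggered inputs that yield the attacker's target, and clean accuracy as the share of untriggered inputs answered correctly. The function names and the sample predictions are illustrative assumptions, and the study's exact definitions may differ (for example in how inputs whose true answer already equals the target are handled).

```python
def attack_success_rate(triggered_preds, target):
    """Share of triggered inputs for which the model emits the attacker's target."""
    return sum(pred == target for pred in triggered_preds) / len(triggered_preds)

def clean_accuracy(clean_preds, labels):
    """Share of untriggered inputs answered correctly."""
    return sum(pred == label for pred, label in zip(clean_preds, labels)) / len(labels)

# Made-up predictions, for illustration only (not results from the study).
target = "7"
triggered_preds = ["7", "7", "7", "12"]
clean_preds = ["3", "12", "5", "8"]
labels = ["3", "12", "5", "9"]

asr = attack_success_rate(triggered_preds, target)
acc = clean_accuracy(clean_preds, labels)
print(f"ASR = {asr:.2f}, clean accuracy = {acc:.2f}")
```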
Underlying Mechanism
The researchers attribute these results to Neural Collapse within the model’s latent space. This effect causes triggered representations to cluster tightly around a geometric attractor, which explains why traditional defenses fail against such attacks. The study also shows that any effective backdoor must leave a linearly separable signature, as evidenced by a probe that achieves an AUC of 0.999.
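The link between tight clustering and linear separability can be sketched on synthetic data. The example below makes several assumptions that are not from the study: clean latents are drawn from a standard Gaussian in 8 dimensions, triggered latents sit in a tight cluster around a made-up `attractor` point, and the "probe" is a simple difference-of-means direction rather than whatever probe the authors trained. AUC is computed directly as the fraction of (triggered, clean) pairs ranked correctly.

```python
import random

DIM = 8
rng = random.Random(0)
attractor = [3.0] * DIM  # hypothetical attractor location

def sample_clean():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def sample_triggered():
    # Tight cluster around the attractor, mimicking collapsed representations.
    return [a + rng.gauss(0.0, 0.1) for a in attractor]

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def auc(pos_scores, neg_scores):
    """Probability that a random positive scores above a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

train_clean = [sample_clean() for _ in range(200)]
train_trig = [sample_triggered() for _ in range(200)]
test_clean = [sample_clean() for _ in range(100)]
test_trig = [sample_triggered() for _ in range(100)]

# Linear probe: project onto the direction separating the two training means.
direction = [t - c for t, c in zip(mean_vector(train_trig), mean_vector(train_clean))]

probe_auc = auc([dot(direction, v) for v in test_trig],
                [dot(direction, v) for v in test_clean])
print(f"held-out probe AUC = {probe_auc:.3f}")
```

Because the synthetic cluster is far from the clean distribution, even this crude probe separates the two classes almost perfectly. The numbers say nothing about the real models; they only illustrate why a collapsed cluster is easy to detect with a linear readout.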
The Paradox of Latent Vectors
The findings contain a paradox: individual latent vectors can still encode the correct answer, even when the model outputs an incorrect response. The adversarial information is not contained in any single vector; it resides in the collective trajectory of the input through the model. This insight positions backdoor perturbations as a new tool for the mechanistic interpretability of continuous reasoning in these models.
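One way to picture this paradox is with a deliberately artificial toy. The sketch below invents its own setup and should not be read as the paper's mechanism: each latent vector is a hypothetical triple of (evidence for the correct answer, evidence for the wrong answer, a small "steering" component), a per-vector probe reads only the evidence coordinates, and an assumed trajectory-level readout flips the output once the steering accumulated over all passes exceeds `STEER_THRESHOLD`.

```python
STEER_THRESHOLD = 1.0  # hypothetical: cumulative steering needed to flip the output

def decode_single(vec):
    """Per-vector probe: compares the two evidence coordinates, ignores steering."""
    correct_evidence, wrong_evidence, _steer = vec
    return "correct" if correct_evidence >= wrong_evidence else "wrong"

def decode_trajectory(trajectory):
    """Assumed whole-trajectory readout: steering is summed across passes."""
    total_steer = sum(vec[2] for vec in trajectory)
    if total_steer > STEER_THRESHOLD:
        return "wrong"
    return decode_single(trajectory[-1])

# Six latent passes each; the triggered run adds 0.2 steering per pass.
clean_traj = [(1.0, 0.1, 0.0)] * 6
triggered_traj = [(1.0, 0.1, 0.2)] * 6

per_step = [decode_single(v) for v in triggered_traj]
print("per-vector decodes:", per_step)
print("clean trajectory output:", decode_trajectory(clean_traj))
print("triggered trajectory output:", decode_trajectory(triggered_traj))
```

Every triggered vector decodes to the correct answer on its own, and no single vector carries enough steering to cross the threshold, yet the trajectory as a whole produces the wrong output.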
Conclusion
The research suggests that as language models continue to evolve, approaches to securing them must evolve as well. The availability of code and checkpoints for the ThoughtSteer method opens the door to further exploration and potential mitigation strategies. As continuous reasoning becomes more common in AI, understanding these silent vulnerabilities will be crucial for building robust and secure systems.
