Discover Layerwise Convergence Fingerprinting, a tuning-free method to detect runtime misbehavior in large language models with high accuracy and security.
Discover how integrating Sparse Autoencoders makes large language models more robust, reducing successful jailbreak attacks by up to 5x without altering model weights.