AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models
Mechanistic interpretability research on emotion in large language models has a stimulus problem: most test inputs name the emotion they are meant to evoke. The paper “AIPsy-Affect” introduces a 480-item clinical stimulus battery designed to remove the confound of emotion keywords. With that confound gone, researchers can test whether a model recognizes an emotion rather than the word that labels it.
The primary challenge in current research is the reliance on stimuli that often contain explicit words denoting emotions. For instance, when a language model responds to the phrase “I am furious,” it becomes ambiguous whether the model is genuinely recognizing the emotion of anger or simply identifying the word “furious.” This distinction is critical as it informs the validity of claims regarding emotional circuits, features, and potential interventions within these models.
Key Features of AIPsy-Affect
The AIPsy-Affect battery is built around matched pairs of emotional and neutral items. It includes:
- 192 Keyword-Free Vignettes: Each vignette is crafted to evoke one of Plutchik’s eight primary emotions through narrative alone, devoid of emotional keywords.
- 192 Matched Neutral Controls: These controls share characters, settings, lengths, and surface structures with the emotional vignettes, ensuring that the only difference is the presence of emotional content.
- Moderate-Intensity and Discriminant-Validity Splits: These splits let researchers check how results vary with emotional intensity and whether a model's representations separate one emotion from another.
The matched-pair structure of the battery supports a range of interpretability methods, including linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, and steering vector extraction. Because neither member of a pair contains emotion keywords, any internal representation that distinguishes a clinical item from its matched neutral counterpart cannot be explained by keyword presence.
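As a rough illustration of how matched pairs can feed one of these methods, the sketch below extracts a difference-of-means direction from paired activations and checks it on held-out pairs. It is a minimal sketch under stated assumptions, not the paper's procedure: the activations are synthetic random data, and the function names, the hidden size, and the planted "emotion" direction are all hypothetical.

```python
import numpy as np

def paired_difference_direction(emotional_acts, neutral_acts):
    """Unit-norm mean of per-pair differences (emotional minus neutral)."""
    emotional_acts = np.asarray(emotional_acts, dtype=float)
    neutral_acts = np.asarray(neutral_acts, dtype=float)
    if emotional_acts.shape != neutral_acts.shape:
        raise ValueError("matched pairs must have identical shapes")
    diffs = emotional_acts - neutral_acts  # row i belongs to pair i
    direction = diffs.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("mean difference is zero; no direction to extract")
    return direction / norm

def pair_accuracy(direction, emotional_acts, neutral_acts):
    """Fraction of pairs whose emotional item projects higher than its control."""
    emo_proj = np.asarray(emotional_acts, dtype=float) @ direction
    neu_proj = np.asarray(neutral_acts, dtype=float) @ direction
    return float(np.mean(emo_proj > neu_proj))

# Synthetic stand-in for model activations (assumption: one vector per item).
rng = np.random.default_rng(0)
n_pairs, d_model = 192, 64                     # 192 pairs as in the battery; d_model is arbitrary
shared = rng.normal(size=(n_pairs, d_model))   # content shared within each pair
planted = np.zeros(d_model)
planted[0] = 1.0                               # hypothetical "emotion" axis
neutral = shared + 0.1 * rng.normal(size=(n_pairs, d_model))
emotional = shared + 0.1 * rng.normal(size=(n_pairs, d_model)) + 2.0 * planted

# Fit on the first half of the pairs, evaluate on the second half.
direction = paired_difference_direction(emotional[:96], neutral[:96])
held_out_acc = pair_accuracy(direction, emotional[96:], neutral[96:])
print(f"held-out pair accuracy: {held_out_acc:.3f}")
```

Subtracting within pairs cancels the characters, setting, and surface structure that each pair shares, so the remaining difference reflects what the emotional item adds. With real activations, the same direction could serve as a candidate steering vector or as the weight vector of a simple linear probe.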
Validation of AIPsy-Affect
AIPsy-Affect was validated with a three-method NLP defense battery:
- Bag-of-Words Sentiment Analysis: This method picks up only situational vocabulary in the vignettes, not emotion labels.
- Emotion-Category Lexicon: A lexicon-based check further corroborates that keywords do not drive emotion detection on the vignettes.
- Contextual Transformer Classifier: This classifier detects the presence of affect at a highly significant level (p < 10^-15) but struggles to identify the specific emotion category, achieving only 5.2% top-1 accuracy compared to 82.5% on keyword-rich controls.
Together, these results indicate that the battery separates emotion recognition from keyword detection: standard NLP methods register that the vignettes carry affect, but they cannot name the emotion once the keywords are gone.
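The lexicon-style check can be sketched in a few lines. This is only an illustration of the idea, assuming a tiny hand-written lexicon: `EMOTION_LEXICON` and `emotion_keyword_hits` are hypothetical names, the word lists are not the lexicon used in the paper, and the keyword-free sentence is a made-up example rather than an item from the battery.

```python
import re

# Hypothetical mini-lexicon for illustration only; a real check would use
# a full emotion-category lexicon covering all eight Plutchik emotions.
EMOTION_LEXICON = {
    "anger": {"angry", "furious", "rage", "enraged"},
    "fear": {"afraid", "scared", "terrified", "fear"},
    "joy": {"happy", "joyful", "delighted", "joy"},
    "sadness": {"sad", "grief", "sorrow", "miserable"},
}

def emotion_keyword_hits(text, lexicon=EMOTION_LEXICON):
    """Return {category: sorted matching words} for lexicon words found in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    hits = {}
    for category, words in lexicon.items():
        matched = tokens & words
        if matched:
            hits[category] = sorted(matched)
    return hits

keyword_rich = "I am furious."
# Invented example sentence, not taken from AIPsy-Affect.
keyword_free = "He read the letter twice, then tore it in half and threw his keys at the wall."

rich_hits = emotion_keyword_hits(keyword_rich)
free_hits = emotion_keyword_hits(keyword_free)
print(rich_hits)
print(free_hits)
```

A stimulus passes this kind of check when the lookup returns no hits. Exact-token matching is a deliberate simplification here; it misses inflections and synonyms that a fuller lexicon or a stemmer would catch.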
AIPsy-Affect expands a previously released 96-item battery (arXiv:2603.22295) to 480 items. The battery is openly available under the MIT license for use in further interpretability research on emotion in language models.
