AIPsy-Affect: Keyword-Free Emotion Test for Language Models

AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models

Recent work in mechanistic interpretability has shown how difficult it is to study emotion in large language models. The “AIPsy-Affect” study introduces a 480-item clinical stimulus battery designed to remove one specific confound: the presence of emotion keywords in the stimuli. With that confound removed, researchers can test whether a language model recognizes an emotion from the situation described rather than from particular word choices.

The main problem with current research is that its stimuli often contain explicit emotion words. For instance, when a language model responds to the phrase “I am furious,” it is unclear whether the model is recognizing the emotion of anger or simply detecting the word “furious.” The distinction matters because it determines whether claims about emotion circuits, features, and interventions in these models are valid.

Key Features of AIPsy-Affect

The AIPsy-Affect battery is structured to provide clarity and enhance interpretability in emotion research. It includes:

192 Keyword-Free Vignettes: Each vignette is crafted to evoke one of Plutchik’s eight primary emotions through narrative alone, devoid of emotional keywords.
192 Matched Neutral Controls: These controls share characters, settings, lengths, and surface structures with the emotional vignettes, ensuring that the only difference is the presence of emotional content.
Moderate-Intensity and Discriminant-Validity Splits: These splits let researchers examine how responses vary with emotional intensity and check that different emotional states are actually distinguished from one another.

The matched-pair structure of the battery supports various interpretability methods such as linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, and steering vector extraction. Because the clinical items contain no emotion keywords, any internal representation that distinguishes a clinical item from its matched neutral control cannot be explained by the presence of such keywords.
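To make the matched-pair idea concrete, here is a minimal sketch of one of the listed methods, steering vector extraction, done as a difference of means over pairs. It is an illustration under stated assumptions, not the study's own code: the activation vectors are toy numbers rather than real model activations, and the function names `mean_difference_direction` and `project` are hypothetical.

```python
from statistics import fmean


def mean_difference_direction(emotional, neutral):
    """Average the per-pair activation differences (emotional minus neutral).

    `emotional[i]` and `neutral[i]` are assumed to be activation vectors
    for one clinical item and its matched neutral control.
    """
    if len(emotional) != len(neutral):
        raise ValueError("each emotional item needs exactly one matched control")
    if not emotional:
        raise ValueError("need at least one matched pair")
    dim = len(emotional[0])
    # Subtracting within a pair cancels whatever the two items share
    # (characters, setting, length, surface structure).
    diffs = [
        [e[i] - n[i] for i in range(dim)]
        for e, n in zip(emotional, neutral)
    ]
    return [fmean(d[i] for d in diffs) for i in range(dim)]


def project(vector, direction):
    """Scalar projection of `vector` onto the unit-normalized `direction`."""
    norm = sum(x * x for x in direction) ** 0.5
    if norm == 0:
        raise ValueError("direction has zero length")
    return sum(v * d for v, d in zip(vector, direction)) / norm


# Toy activations (made up for illustration): two matched pairs, 3 dimensions.
emotional_acts = [[1.0, 2.0, 0.5], [0.0, 3.0, 1.5]]
neutral_acts = [[1.0, 0.0, 0.5], [0.0, 1.0, 1.5]]

direction = mean_difference_direction(emotional_acts, neutral_acts)
score = project([5.0, 3.0, 7.0], direction)
```

In this toy data the pairs differ only in the second dimension, so the extracted direction points along it. With real activations, the same subtraction is what lets the matched controls cancel shared surface features.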

Validation of AIPsy-Affect

AIPsy-Affect has undergone rigorous validation through a three-method NLP defense battery, which includes:

Bag-of-Words Sentiment Analysis: This method picks up only situational vocabulary in the vignettes, not emotion labels.
Emotion-Category Lexicon: This lexicon-based check further corroborates that emotion keywords do not drive detection.
Contextual Transformer Classifier: This classifier detects that affect is present at a highly significant level (p < 10^-15), but it struggles to identify the specific emotion category, achieving only 5.2% top-1 accuracy compared to 82.5% on keyword-rich controls.

Taken together, these checks indicate that the AIPsy-Affect battery separates emotion recognition from keyword cues, giving future research on emotion in language models a cleaner set of stimuli to work with.
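The lexicon check and the top-1 accuracy metric described above can be sketched in a few lines. This is a rough illustration only: `EMOTION_KEYWORDS` is a tiny hypothetical lexicon rather than the one used in the study, the example vignette is made up and not taken from the battery, and the function names are assumptions.

```python
import re

# Hypothetical mini-lexicon for illustration; a real emotion-category
# lexicon would be far larger.
EMOTION_KEYWORDS = {
    "furious", "angry", "afraid", "scared",
    "sad", "happy", "joyful", "disgusted",
}


def emotion_keywords_in(text, lexicon=EMOTION_KEYWORDS):
    """Return the sorted lexicon words that appear in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(set(tokens) & lexicon)


def top1_accuracy(predicted, gold):
    """Fraction of items whose top predicted category equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need equal-length, non-empty label lists")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# A keyword-rich stimulus versus a made-up keyword-free one.
keyword_rich_hits = emotion_keywords_in("I am furious.")
keyword_free_hits = emotion_keywords_in(
    "She read the letter twice, then tore it in half."
)

# Toy predictions against gold labels: two of four are correct.
accuracy = top1_accuracy(
    ["anger", "joy", "fear", "anger"],
    ["anger", "sadness", "fear", "trust"],
)
```

A keyword-free item should return an empty list from `emotion_keywords_in`, while the keyword-rich phrase is flagged immediately. The contrast reported above (5.2% versus 82.5% top-1 accuracy) is the same `top1_accuracy` quantity computed on the two kinds of stimuli.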

AIPsy-Affect is a significant expansion of a previously released 96-item battery (arXiv:2603.22295), now offering researchers a comprehensive toolkit for exploring emotion in language models. The battery is openly available under the MIT license, encouraging widespread adoption and further exploration in the field of AI emotion interpretation.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
