The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
Recent work on language models aligned with Reinforcement Learning from Human Feedback (RLHF) documents a tendency toward response homogenization: for a large share of questions, multiple independent samples from the model all give semantically equivalent answers. This matters for any task that relies on sampling-based uncertainty estimation, because samples that never disagree carry no uncertainty signal.
Key Findings
- On TruthfulQA (790 questions), 40% to 79% of questions produced a single semantic cluster across ten independent and identically distributed (i.i.d.) samples.
- On questions affected by homogenization, sampling-based uncertainty methods had no discriminative power (AUROC, the area under the receiver operating characteristic curve, of 0.500), whereas free token entropy retained some signal (AUROC of 0.603).
- The alignment tax is task-dependent. On GSM8K (500 questions), token entropy achieved an AUROC of 0.724 with a Cohen’s d effect size of 0.81.
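The two quantities in these findings can be illustrated with a minimal sketch. It assumes per-token probability distributions are available as plain lists and that each answer carries a correct/incorrect label. The function names `mean_token_entropy` and `auroc` are hypothetical, not taken from the study. The example at the end shows why a score that is identical for every question, as happens when all samples fall into one cluster, yields an AUROC of exactly 0.5.

```python
import math
from typing import Sequence


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def auroc(scores: Sequence[float], is_error: Sequence[bool]) -> float:
    """Probability that a random incorrect answer receives a higher uncertainty
    score than a random correct answer; ties count as one half."""
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Homogenized questions: every sample lands in one cluster, so a
# sampling-based score (here, the number of clusters) is the same everywhere.
constant_scores = [1.0, 1.0, 1.0, 1.0]
labels = [True, False, True, False]
print(auroc(constant_scores, labels))  # 0.5
```

A score that varies with correctness, such as token entropy, can move the AUROC away from 0.5 even when the sampled answers never disagree.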
Impact of Alignment on Response Patterns
To test whether alignment causes the homogenization, a base-versus-instruct ablation was conducted. The base model had a single-cluster rate of only 1.0%, compared with 28.5% for the instruct model, a statistically significant difference (p < 10^{-6}).
A training-stage ablation attributed the homogenization primarily to the Direct Preference Optimization (DPO) stage rather than the Supervised Fine-Tuning (SFT) stage.
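The single-cluster rate compared here can be sketched as follows. This is an illustration only: the study's actual clustering procedure is not described in this summary, so the greedy clustering and the `normalized_match` equivalence check below are stand-in assumptions (a real pipeline would more plausibly use an NLI or embedding model to judge semantic equivalence).

```python
from typing import Callable, List, Sequence


def cluster_samples(samples: Sequence[str],
                    equivalent: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedy clustering: each sample joins the first cluster whose
    representative it is equivalent to, otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for s in samples:
        for c in clusters:
            if equivalent(c[0], s):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def single_cluster_rate(per_question_samples: Sequence[Sequence[str]],
                        equivalent: Callable[[str, str], bool]) -> float:
    """Fraction of questions whose samples all fall into one cluster."""
    single = sum(
        1 for samples in per_question_samples
        if len(cluster_samples(samples, equivalent)) == 1
    )
    return single / len(per_question_samples)


def normalized_match(a: str, b: str) -> bool:
    """Hypothetical stand-in for a semantic-equivalence judge."""
    return a.strip().lower() == b.strip().lower()


questions = [
    ["Paris", "paris ", "Paris"],  # homogenized: one cluster
    ["4", "5", "4"],               # diverse: two clusters
]
print(single_cluster_rate(questions, normalized_match))  # 0.5
```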
Variability Across Model Families
The severity of the alignment tax varies across model families and scales, as confirmed by replication across four model families. In total, validation covered 22 experiments, five benchmarks, and three model scales from 3 billion to 14 billion parameters.
Sampling-based baselines using Jaccard similarity, embedding similarity, and Natural Language Inference (NLI) at three DeBERTa scales all yielded AUROC scores of about 0.51. Cross-embedder validation with two independent embedding families ruled out coupling bias as an explanation for the results.
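As an illustration of the simplest of these baselines, the sketch below computes a Jaccard-based dispersion score over a set of sampled answers. The whitespace tokenization and the choice of one minus the mean pairwise similarity are assumptions made for the example, not details taken from the study.

```python
from itertools import combinations
from typing import Sequence


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def jaccard_uncertainty(samples: Sequence[str]) -> float:
    """One minus the mean pairwise Jaccard similarity of the samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim


# Identical samples give zero dispersion regardless of correctness.
print(jaccard_uncertainty(["the cat", "the cat", "the cat"]))  # 0.0
```

When most questions produce near-identical samples, a score of this kind is close to constant across questions, which is consistent with AUROC values near chance.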
Generalization and Future Directions
Cross-dataset validation on WebQuestions, where the single-cluster rate was 58.0%, confirmed that the findings generalize beyond TruthfulQA. The central finding of response homogenization is implementation-independent and label-free, and it poses a distinct challenge for language model alignment.
Motivated by these findings, the research explores a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. With the cascade, GSM8K accuracy rose from 84.4% to 93.2% at 50% coverage.
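The sketch below illustrates the two ideas in this paragraph: accuracy at a given coverage, and a cascade that consults uncertainty signals in order of increasing cost. This summary does not specify UCBD's stopping rule, so the `low`/`high` thresholds and the answer/abstain decision are hypothetical choices for illustration and should not be read as the authors' method.

```python
from typing import Callable, Sequence, Tuple

Signal = Callable[[str], float]  # maps a question to an uncertainty score


def accuracy_at_coverage(uncertainty: Sequence[float],
                         correct: Sequence[bool],
                         coverage: float) -> float:
    """Accuracy on the most confident `coverage` fraction of questions."""
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    keep = max(1, int(coverage * len(order)))
    kept = order[:keep]
    return sum(correct[i] for i in kept) / keep


def cheapest_first(question: str,
                   signals: Sequence[Tuple[Signal, float, float]]) -> Tuple[str, int]:
    """Evaluate signals in the given (cheapest-first) order.

    Each entry is (signal, low, high). A score at or below `low` answers,
    a score at or above `high` abstains, and anything in between defers
    to the next, more expensive signal. Returns the decision and the
    number of signals evaluated.
    """
    for used, (signal, low, high) in enumerate(signals, start=1):
        score = signal(question)
        if score <= low:
            return "answer", used
        if score >= high:
            return "abstain", used
    return "abstain", len(signals)


# Toy data: the two least uncertain answers are both correct.
print(accuracy_at_coverage([0.1, 0.9, 0.2, 0.8], [True, False, True, True], 0.5))  # 1.0
```

Ordering signals by cost means an expensive signal is computed only when the cheaper ones are inconclusive.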
