Disposition Distillation at Small Scale: A Three-Arc Negative Result
arXiv:2604.11867v1 (cross-list)
Introduction
Distilling behavioral dispositions into small language models presents both opportunities and challenges. This study set out to train three dispositions (self-verification, uncertainty acknowledgment, and feedback integration) into models with 0.6B to 2.3B effective parameters, using a four-stage distillation pipeline developed at MIT.
Methodology
The research had three components:
- A four-stage distillation pipeline that trains behavioral dispositions into small language models.
- Follow-on experiments on inference-time attention-head interventions.
- A frozen-base confidence-gated sidecar.
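To make the idea of a confidence-gated sidecar concrete, here is a minimal sketch under stated assumptions. The source does not describe the sidecar's internals, so everything below is hypothetical: the function names, the linear probe with weights `w` and bias `b`, the sigmoid, the threshold, and the hedging suffix. The base model is treated as frozen, so the sidecar only reads a hidden-state vector and decides whether to pass the base answer through unchanged.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_confidence(h_last: Sequence[float], w: Sequence[float], b: float) -> float:
    """Hypothetical linear probe: maps a final-token hidden state to (0, 1)."""
    if len(h_last) != len(w):
        raise ValueError("h_last and w must have the same length")
    logit = sum(x * y for x, y in zip(h_last, w)) + b
    return sigmoid(logit)


def gated_answer(base_answer: str, h_last: Sequence[float],
                 w: Sequence[float], b: float, threshold: float = 0.5) -> str:
    """Pass the frozen base model's answer through unless the probe is unsure.

    The hedging suffix is an illustrative placeholder, not the paper's behavior.
    """
    if probe_confidence(h_last, w, b) >= threshold:
        return base_answer
    return base_answer + " (low confidence: please verify)"
```

The design point of such a gate is that the base model's weights and outputs are never modified; only the decision to add a hedge depends on the probe.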
An internal draft initially reported large gains for the Qwen3-0.6B model: +33.9 points on the Massachusetts Comprehensive Assessment System (MCAS) and +15.3 points on HumanEval. Both results were later found to be misleading.
Findings and Results
Subsequent sanity checks revealed that both reported gains were artifacts of the experimental setup:
- The HumanEval delta was a truncation artifact: it reversed to -8.0 points when the prediction count (the cap on generated tokens) was raised from 512 to 1024.
- The MCAS gain vanished under apples-to-apples scoring.
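A truncation artifact of this kind can be screened for before a delta is reported. The sketch below is an illustration, not the authors' pipeline: the `Sample` record, the function names, and the 5% tolerance are all assumptions. It flags a comparison as suspect when a meaningful share of completions on either side ran into the generation limit, since those completions may have failed because they were cut off rather than because they were wrong.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    n_tokens: int  # tokens actually generated for this completion
    passed: bool   # whether the completion passed its tests


def pass_rate(samples: List[Sample]) -> float:
    """Percentage of samples that passed."""
    if not samples:
        raise ValueError("no samples")
    return 100.0 * sum(s.passed for s in samples) / len(samples)


def truncated_fraction(samples: List[Sample], limit: int) -> float:
    """Share of completions that reached the generation limit."""
    if not samples:
        raise ValueError("no samples")
    return sum(s.n_tokens >= limit for s in samples) / len(samples)


def delta_is_suspect(base: List[Sample], student: List[Sample],
                     limit: int, max_truncated: float = 0.05) -> bool:
    """True if either side has too many completions cut off at the limit."""
    return (truncated_fraction(base, limit) > max_truncated
            or truncated_fraction(student, limit) > max_truncated)


def delta(base: List[Sample], student: List[Sample]) -> float:
    """Student minus base pass rate, in percentage points."""
    return pass_rate(student) - pass_rate(base)
```

If `delta_is_suspect` returns true, the natural follow-up is to rerun both models with a larger limit and compare the two deltas, which mirrors the 512 versus 1024 check described above.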
These falsifications led to three additional arcs of investigation:
- Applying SFT/DPO LoRA across three model families and two domains.
- Experimenting with inference-time attention-head tempering on the output projection layer (o_proj).
- Deploying a training-free frozen-base sidecar that analyzed the final-token hidden state (h_last).
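The second arc can be illustrated with a small sketch of what tempering a single attention head at the output projection might look like. The source names only the location (o_proj), not the exact operation, so the scaling rule here is an assumption: head `head` has its contribution multiplied by a factor `alpha`. The weight layout (one row per output dimension, one column per concatenated head dimension) follows the common convention for a linear layer, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def temper_head(o_proj_weight: Matrix, head: int, head_dim: int, alpha: float) -> Matrix:
    """Return a copy of the weight with one head's input columns scaled by alpha."""
    start, stop = head * head_dim, (head + 1) * head_dim
    if head < 0 or any(stop > len(row) for row in o_proj_weight):
        raise ValueError("head index out of range for this weight matrix")
    return [
        [v * alpha if start <= j < stop else v for j, v in enumerate(row)]
        for row in o_proj_weight
    ]


def apply_o_proj(weight: Matrix, concat_heads: List[float]) -> List[float]:
    """Matrix-vector product: project concatenated head outputs to model width."""
    return [sum(w * x for w, x in zip(row, concat_heads)) for row in weight]
```

Scaling the columns of the weight is equivalent to scaling that head's slice of the concatenated input, so the intervention can be applied at inference time without any training.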
Across all three arcs, no operator improved judge-measured disposition without either degrading content quality or collapsing into stylistic mimicry. The failure was consistent across the five tested models: Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct.
Conclusion
This study contributes a three-arc negative result with detailed mechanisms and introduces a two-failure-mode taxonomy for linear h_last probes. It also establishes an honest falsification pipeline that turns the kind of false positives the authors themselves produced into publishable negatives. As an independent finding, the Gemma 4 E2B model displayed a near-complete decoupling of confidence and correctness in the Chef domain: it asserts at 91% regardless of whether the answer is correct.
This research emphasizes the complexity of distilling behavioral dispositions in AI models and the necessity for rigorous validation of results before publication.
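The confidence-correctness decoupling reported for Gemma 4 E2B can be made concrete with a simple measurement. The sketch below is an assumption about how such a check could be computed, not the paper's scoring code: each record is a hypothetical pair `(asserted, correct)`, and the asymmetry is defined here as the assertion rate on correct answers minus the rate on incorrect ones. A model that asserts equally often in both cases has an asymmetry near zero.

```python
from typing import Iterable, Tuple


def assertion_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Return (rate when correct, rate when incorrect, difference).

    Each record is (asserted, correct). Rates are fractions in [0, 1].
    """
    asserted_correct = total_correct = 0
    asserted_incorrect = total_incorrect = 0
    for asserted, correct in records:
        if correct:
            total_correct += 1
            asserted_correct += asserted
        else:
            total_incorrect += 1
            asserted_incorrect += asserted
    if total_correct == 0 or total_incorrect == 0:
        raise ValueError("need at least one correct and one incorrect record")
    rate_correct = asserted_correct / total_correct
    rate_incorrect = asserted_incorrect / total_incorrect
    return rate_correct, rate_incorrect, rate_correct - rate_incorrect
```

With toy data in which the model asserts on 9 of 10 correct answers and 9 of 10 incorrect answers, both rates are 0.9 and the asymmetry is zero, which is the pattern the finding describes.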
