Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
arXiv:2604.06277v1
Abstract: Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be distilled into the model’s own representations during training, enabling hallucination detection from internal activations alone at inference time.
Introduction
The rapid advancement of large language models (LLMs) has raised concerns about their tendency to generate hallucinated content: responses that sound plausible but are factually wrong. Traditional detection methods depend on external verification at inference time, such as gold answers, retrieval systems, or auxiliary judge models, which makes them cumbersome and resource-intensive. This study explores whether that external supervision can instead be distilled into the model’s own representations during training.
Methodology
To address the challenge of hallucination detection, we introduce a weak supervision framework that utilizes three complementary grounding signals:
- Substring matching
- Sentence embedding similarity
- LLM judge verdicts
This framework labels generated responses as grounded or hallucinated without human annotation. Using it, we constructed a dataset of 15,000 samples from SQuAD v2, reported as 10,500 samples for training and development plus a separate test set of 5,000 samples. Each sample consists of an answer generated by the LLaMA-2-7B model, its full per-layer hidden states, and the associated structured hallucination labels.
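As a rough illustration of how three grounding signals could be merged into one weak label, here is a minimal sketch. The source does not say how the signals are combined, so the majority vote, the 0.5 similarity threshold, the bag-of-words `toy_embed` stand-in for a real sentence encoder, and all function names are assumptions made for this example. The LLM judge verdict is passed in as a plain boolean rather than computed.

```python
import math
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def substring_match(answer, gold_answers):
    """Signal 1: a normalized gold answer appears as whole words in the answer."""
    padded = " " + " ".join(normalize(answer)) + " "
    return any(
        " " + " ".join(normalize(g)) + " " in padded
        for g in gold_answers
        if normalize(g)
    )

def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(normalize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

def weak_label(answer, gold_answers, judge_says_grounded, sim_threshold=0.5):
    """Combine the three signals by majority vote (an assumed combination rule)."""
    similarity = max(
        (cosine(toy_embed(answer), toy_embed(g)) for g in gold_answers),
        default=0.0,
    )
    votes = {
        "substring": substring_match(answer, gold_answers),
        "embedding": similarity >= sim_threshold,
        "judge": bool(judge_says_grounded),
    }
    label = "grounded" if sum(votes.values()) >= 2 else "hallucinated"
    return {**votes, "similarity": similarity, "label": label}

# Made-up examples, not samples from the paper's dataset.
first = weak_label("The capital of France is Paris.", ["Paris"], True)
second = weak_label("It is Lyon.", ["Paris"], False)
```

Returning the individual votes alongside the final label keeps the record structured, so a downstream probe could be trained on any single signal or on the combined verdict.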
Model Training
We trained five probing classifiers on the hidden states extracted from the LLaMA-2-7B model:
- ProbeMLP (M0)
- LayerWiseMLP (M1)
- CrossLayerTransformer (M2)
- HierarchicalTransformer (M3)
- CrossLayerAttentionTransformerV2 (M4)
The primary hypothesis is that, by treating external grounding signals as training-time supervision, hallucination detection signals can be distilled into transformer representations, enabling internal detection without external verification at inference time.
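To make the probing idea concrete, the sketch below trains a tiny logistic probe on pooled per-layer vectors using weak labels as targets, then scores samples from those vectors alone. It is not any of the five architectures listed above, whose details the summary does not give. The mean pooling across layers, the label convention (1 = hallucinated), the hyperparameters, and the toy numbers standing in for LLaMA-2-7B hidden states are all assumptions for illustration.

```python
import math

def pool_layers(hidden_states):
    """Average the per-layer vectors into one feature vector (one simple choice)."""
    n_layers = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(layer[i] for layer in hidden_states) / n_layers for i in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability of 'hallucinated', computed from internal features only."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train_probe(samples, labels, epochs=200, lr=0.5):
    """Fit a logistic probe with full-batch gradient descent on weak labels."""
    features = [pool_layers(s) for s in samples]
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = predict(weights, bias, x) - y  # gradient of log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy stand-in for hidden states: 2 layers x 3 dimensions per sample (made-up values).
toy_states = [
    [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]],    # weak label 1 (hallucinated)
    [[0.9, -0.1, 0.2], [1.1, 0.0, 0.0]],   # weak label 1 (hallucinated)
    [[-1.0, 0.1, 0.1], [-0.9, 0.2, 0.0]],  # weak label 0 (grounded)
    [[-0.8, 0.0, 0.2], [-1.2, 0.1, 0.1]],  # weak label 0 (grounded)
]
toy_labels = [1, 1, 0, 0]

weights, bias = train_probe(toy_states, toy_labels)
scores = [predict(weights, bias, pool_layers(s)) for s in toy_states]
```

The point of the sketch is the division of labor: external signals are needed only to produce `toy_labels` at training time, while `predict` needs nothing beyond the hidden-state features.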
Results
The experiments support the central hypothesis. Transformer-based probes discriminated best: M2 achieved the highest 5-fold average AUC/F1, and M3 performed best on both single-fold validation and held-out test evaluation. We also measured inference efficiency, observing probe latency of 0.15 to 5.62 milliseconds in batched settings and 1.55 to 6.66 milliseconds for single samples. End-to-end throughput for generation plus probing remained at approximately 0.231 queries per second, indicating that the probe adds negligible practical overhead.
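A quick back-of-the-envelope calculation shows why the reported numbers imply little overhead. It assumes that the 0.231 queries-per-second figure covers generation plus one single-sample probe call per query, which is a reading of the summary rather than something it states outright.

```python
# Figures quoted in the text above.
queries_per_second = 0.231
single_sample_probe_ms = (1.55, 6.66)  # fastest and slowest reported probe latency

# Time budget per query implied by the throughput, in milliseconds.
ms_per_query = 1000.0 / queries_per_second  # about 4329 ms, i.e. about 4.33 s

# Share of that budget spent in the probe, for the fastest and slowest probe.
probe_share = [ms / ms_per_query for ms in single_sample_probe_ms]

for ms, share in zip(single_sample_probe_ms, probe_share):
    print(f"probe {ms:.2f} ms -> {share:.3%} of a {ms_per_query / 1000:.2f} s query")
```

Under that assumption, even the slowest probe accounts for roughly 0.15% of the time spent on a query, so generation dominates the cost.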
Conclusion
This research presents an approach to hallucination detection in LLMs that moves external verification signals into the training process. Distilling these signals into transformer representations enables detection from internal activations alone, so that no gold answers, retrieval systems, or auxiliary judge models are needed at inference time.
