Weakly Supervised Hallucination Detection in Transformers

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

Source: arXiv:2604.06277v1 (announce type: new)

Abstract: Existing hallucination detection methods for large language models (LLMs) rely on external verification at inference time, requiring gold answers, retrieval systems, or auxiliary judge models. We ask whether this external supervision can instead be distilled into the model’s own representations during training, enabling hallucination detection from internal activations alone at inference time.

Introduction

Large language models (LLMs) tend to generate hallucinated content: responses that sound plausible but are factually wrong. Existing detection methods usually depend on external verification at inference time, which adds cost and infrastructure. This study explores whether that verification can instead be moved into the training phase, so that detection at inference time relies only on the model's internal activations.

Methodology

To address the challenge of hallucination detection, we introduce a weak supervision framework that utilizes three complementary grounding signals:

  • Substring matching
  • Sentence embedding similarity
  • LLM judge verdicts

This framework labels generated responses as grounded or hallucinated without human annotation. Using it, we constructed a dataset of 15,000 samples from SQuAD v2, including 10,500 samples for training and development, and a separate test set of 5,000 samples. Each sample consists of an answer generated by LLaMA-2-7B, its full per-layer hidden states, and the associated structured hallucination labels.
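The summary does not say how the three grounding signals are combined into one label. As a minimal sketch only, the following assumes a simple majority vote, a hypothetical similarity threshold of 0.8, and precomputed sentence embeddings and judge verdicts. The function names are illustrative and are not taken from the paper.

```python
import math
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def substring_match(answer, gold_answers):
    """True if any non-empty normalized gold answer occurs in the answer."""
    norm_answer = normalize(answer)
    golds = [normalize(g) for g in gold_answers]
    return any(g and g in norm_answer for g in golds)


def cosine_similarity(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weak_label(answer, gold_answers, answer_vec, gold_vec,
               judge_says_grounded, sim_threshold=0.8):
    """Return 0 (grounded) or 1 (hallucinated).

    Assumption: a response is grounded when at least two of the three
    signals agree. The voting rule and threshold are hypothetical.
    """
    votes = [
        substring_match(answer, gold_answers),
        cosine_similarity(answer_vec, gold_vec) >= sim_threshold,
        bool(judge_says_grounded),
    ]
    return 0 if sum(votes) >= 2 else 1


# Toy usage with made-up 2-dimensional "embeddings".
label = weak_label("The capital is Paris.", ["Paris"],
                   [1.0, 0.0], [0.9, 0.1], judge_says_grounded=True)
```

In practice the embedding vectors would come from a sentence encoder and the verdict from a judge model, both run only while building the training set.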

Model Training

We trained five probing classifiers on the hidden states extracted from the LLaMA-2-7B model:

  • ProbeMLP (M0)
  • LayerWiseMLP (M1)
  • CrossLayerTransformer (M2)
  • HierarchicalTransformer (M3)
  • CrossLayerAttentionTransformerV2 (M4)

The primary hypothesis is that, when external grounding signals are used as training-time supervision, hallucination detection signals can be distilled into transformer representations. If so, detection can run on internal activations alone, with no external verification at inference time.
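The architectures of M0 through M4 are not described in this summary, so the sketch below is not one of them. It is a simpler stand-in, a linear (logistic regression) probe trained on toy vectors that play the role of one layer's hidden states, meant only to illustrate the idea of fitting a classifier on activations using weak labels. The data, dimensions, and hyperparameters are all assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, state):
    """Predicted probability that a hidden state is labeled hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)


def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = probe_score(w, b, x) - y  # gradient of log loss w.r.t. logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy stand-in for hidden states: 4-dimensional vectors whose mean depends
# on the weak label (1 = hallucinated, 0 = grounded). Real hidden states
# from a 7B-parameter model would have thousands of dimensions.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
states = [[rng.gauss(1.0 if y else -1.0, 0.3) for _ in range(4)] for y in labels]

w, b = train_linear_probe(states, labels)
accuracy = sum((probe_score(w, b, x) >= 0.5) == bool(y)
               for x, y in zip(states, labels)) / len(labels)
```

At inference time only `probe_score` is needed, which is why a probe of this kind requires no gold answers, retrieval, or judge model once trained.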

Results

The experiments support the central hypothesis. Transformer-based probes discriminated best: M2 achieved the highest 5-fold average AUC/F1, while M3 performed best on both single-fold validation and the held-out test set. Probe latency ranged from 0.15 to 5.62 milliseconds in batched inference and from 1.55 to 6.66 milliseconds for single samples. End-to-end throughput for generation plus probing stayed at approximately 0.231 queries per second, or roughly 4.3 seconds per query, so even the slowest single-sample probe (6.66 milliseconds) accounts for well under 1% of the total time.
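For readers less familiar with the two metrics reported, here is a small self-contained sketch of how AUC and F1 can be computed from probe scores. The example labels and scores are invented, and the 0.5 decision threshold for F1 is an assumption, not a value from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """F1 of the positive class after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented example: 1 = hallucinated, 0 = grounded.
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(example_labels, example_scores)  # 3 of 4 pairs ranked correctly
f1 = f1_score(example_labels, example_scores)  # tp=1, fp=1, fn=1
```

AUC is threshold-free and measures ranking quality, while F1 depends on the chosen cutoff, which is one reason the two are usually reported together.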

Conclusion

This work proposes detecting hallucinations in LLMs by using external verification signals as supervision during training rather than at inference time. Distilling these signals into probes over transformer representations allows detection from internal activations alone, without gold answers, retrieval systems, or judge models at inference time, and with little added latency.


Lazarus Omolua
https://richlyai.com/blog