Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
arXiv:2604.18880v1 (cross-listed)
Abstract
Large Language Models (LLMs) frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is incorrect. This phenomenon, commonly referred to as “citation hallucination,” is a serious problem wherever credibility depends on accurate references, academic writing in particular. In this article, we examine the nature and mechanics of citation hallucination in LLMs, drawing on our study of nine models and 108,000 generated references.
Key Findings
- Author names are the most frequently hallucinated element across all models and settings, significantly more so than other citation fields.
- Contrary to expectations, citation style has no measurable effect on the frequency of hallucinated citations.
- Reasoning-oriented distillation techniques tend to degrade recall, further exacerbating the issue of hallucinations.
- Probes trained on one citation field show near-chance performance when applied to others, indicating that hallucination signals are not consistent across different fields.
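The last finding, on probe transfer, can be illustrated with a toy experiment. The sketch below is not the study's setup: the activations are synthetic, the field names are placeholders, and the premise that each field's hallucination signal lies along a different coordinate is an assumption made only to show what "near-chance transfer" looks like for a linear probe.

```python
import numpy as np

def make_field_data(rng, n, dim, signal_dim, shift=2.0):
    """Synthetic 'activations': label 1 (hallucinated) shifts one coordinate."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim))
    X[:, signal_dim] += shift * (2 * y - 1)  # field-specific signal direction
    return X, y

def train_probe(X, y, lr=0.1, steps=300, l2=1e-2):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / len(y) + l2 * w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def accuracy(w, b, X, y):
    """Fraction of examples where the probe's sign matches the label."""
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))

rng = np.random.default_rng(0)
# Assumed toy layout: the "author" signal lives in coordinate 0,
# the "title" signal in coordinate 1. Both names are illustrative.
X_author_train, y_author_train = make_field_data(rng, 2000, 16, signal_dim=0)
X_author_test, y_author_test = make_field_data(rng, 2000, 16, signal_dim=0)
X_title_test, y_title_test = make_field_data(rng, 2000, 16, signal_dim=1)

w, b = train_probe(X_author_train, y_author_train)
in_field_acc = accuracy(w, b, X_author_test, y_author_test)
cross_field_acc = accuracy(w, b, X_title_test, y_title_test)
print(f"in-field: {in_field_acc:.3f}  cross-field: {cross_field_acc:.3f}")
```

In this toy layout the probe separates the field it was trained on but lands close to 50% on the other field, because the direction it learned carries no label information there.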
Methodology
Building on our findings regarding field-specific hallucination, we employed elastic-net regularization with stability selection to analyze neuron-level CETT values from the Qwen2.5-32B-Instruct model. This analysis led us to identify a sparse set of neurons responsible for field-specific hallucinations, termed “FH-neurons.”
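As a rough illustration of elastic-net regularization combined with stability selection, the sketch below fits an elastic-net logistic regression on repeated half-subsamples and keeps only the features that receive a nonzero weight in most rounds. Everything specific here is an assumption made for the example: the `cett` matrix is random synthetic data standing in for neuron-level CETT values, the three planted neuron indices are hypothetical, and the choice of logistic regression, the hyperparameters, and the 0.8 selection threshold are illustrative rather than taken from the study.

```python
import numpy as np

def fit_elastic_net_logreg(X, y, alpha=0.1, l1_ratio=0.7, lr=0.5, steps=200):
    """Elastic-net logistic regression via proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Gradient step on the smooth part: log-loss plus the L2 penalty.
        w = w - lr * (X.T @ (p - y) / n + l2 * w)
        b = b - lr * np.mean(p - y)
        # Proximal step for the L1 penalty: soft-thresholding to exact zeros.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w

def stability_selection(X, y, n_rounds=50, threshold=0.8, seed=0):
    """Return features selected in at least `threshold` of the subsample fits."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n // 2, replace=False)  # random half-sample
        w = fit_elastic_net_logreg(X[idx], y[idx])
        counts += (w != 0)
    freq = counts / n_rounds
    return np.flatnonzero(freq >= threshold), freq

rng = np.random.default_rng(0)
n_samples, n_neurons = 600, 50
true_neurons = [3, 17, 42]  # hypothetical neurons planted in the toy data
cett = rng.normal(size=(n_samples, n_neurons))  # synthetic stand-in for CETT values
score = cett[:, 3] + cett[:, 17] - cett[:, 42] + 0.5 * rng.normal(size=n_samples)
hallucinated = (score > 0).astype(float)  # toy hallucination labels

selected, freq = stability_selection(cett, hallucinated)
print("stable neurons:", selected.tolist())
```

The subsampling is what makes the selected set sparse and repeatable: a neuron that enters the model only because of noise in one subsample rarely clears the frequency threshold.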
Causal Interventions
To validate the role of these FH-neurons, we conducted causal interventions. Amplifying their activity increased hallucination rates, while suppressing it improved performance across citation fields. The gains from suppression were more pronounced in some fields than in others, suggesting that the impact of these neurons depends on the citation field.
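One common way to implement this kind of intervention is to rescale the activations of the chosen neurons during the forward pass. The sketch below shows the idea on a toy two-layer MLP with random weights. The function `mlp_forward`, the neuron indices, and the scale factors are all hypothetical, and whether the study scaled, clamped, or otherwise modified activations is not stated in this summary.

```python
import numpy as np

def mlp_forward(x, w_in, w_out, neuron_ids=(), scale=1.0):
    """Toy two-layer MLP; activations of `neuron_ids` are multiplied by `scale`."""
    ids = np.asarray(neuron_ids, dtype=int)
    hidden = np.maximum(x @ w_in, 0.0)  # ReLU hidden activations
    hidden[:, ids] *= scale             # scale < 1 suppresses, scale > 1 amplifies
    return hidden @ w_out, hidden

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_in = rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 3))
fh_neurons = [2, 5, 11]  # hypothetical indices, for illustration only

out_base, h_base = mlp_forward(x, w_in, w_out)
out_sup, h_sup = mlp_forward(x, w_in, w_out, fh_neurons, scale=0.0)
out_amp, h_amp = mlp_forward(x, w_in, w_out, fh_neurons, scale=2.0)

# Size of the output change caused by each intervention.
print("suppression shift:", float(np.abs(out_sup - out_base).mean()))
print("amplification shift:", float(np.abs(out_amp - out_base).mean()))
```

In a real transformer the same rescaling would typically be applied inside the model's MLP layers at inference time, with hallucination rates then compared against the unmodified baseline.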
Implications for Future Research
The results of our study have implications for the development of LLMs and their use in academic and professional writing. Identifying FH-neurons and mitigating their influence offers a lightweight approach to detecting and reducing citation hallucination. Because this strategy relies on internal model signals, it could lead to more reliable and accurate citations in generated text.
Conclusion
As LLMs continue to evolve and integrate into various domains, understanding the mechanics of citation hallucination is crucial. Our research highlights the importance of investigating specific neurons within these models to improve their performance and reliability. By focusing on field-specific hallucination neurons, we open new avenues for enhancing the integrity of information generated by LLMs, ultimately fostering trust and credibility in automated writing systems.
