Tracing Fake Citations to Neurons in Large Language Models

Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

Summary: arXiv:2604.18880v1

Abstract

Large Language Models (LLMs) frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is incorrect. This phenomenon, commonly called “citation hallucination,” is a particular problem in academic writing, where credibility is paramount. In this article, we examine the nature and mechanics of citation hallucination in LLMs, drawing on our study of nine models and 108,000 generated references.

Key Findings

Author names are the most frequently hallucinated element across all models and settings, significantly more so than other citation fields.
Contrary to expectations, citation style has no measurable effect on the frequency of hallucinated citations.
Reasoning-oriented distillation techniques tend to degrade recall, further exacerbating the issue of hallucinations.
Probes trained on one citation field show near-chance performance when applied to others, indicating that hallucination signals are not consistent across different fields.
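
Field-level findings like these rest on scoring each citation field separately rather than marking a whole reference as right or wrong. The sketch below shows one minimal way to do that. The field names in FIELDS, the exact-match criterion after normalization, and the helper names are illustrative assumptions, not the matching procedure used in the study.

```python
from collections import Counter

# Hypothetical field set; the fields scored in the study may differ.
FIELDS = ("author", "title", "venue", "year")


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def hallucinated_fields(generated, reference):
    """Return the fields whose generated value differs from the reference value."""
    return [
        field
        for field in FIELDS
        if normalize(generated.get(field, "")) != normalize(reference.get(field, ""))
    ]


def field_error_rates(pairs):
    """Fraction of (generated, reference) pairs in which each field is wrong."""
    counts = Counter()
    total = 0
    for generated, reference in pairs:
        total += 1
        counts.update(hallucinated_fields(generated, reference))
    if total == 0:
        return {field: 0.0 for field in FIELDS}
    return {field: counts[field] / total for field in FIELDS}
```

Calling field_error_rates on a list of (generated, reference) dictionaries returns one error rate per field, which makes it easy to compare, for example, author errors against title errors. A real pipeline would likely need fuzzier matching for author name variants.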

Methodology

Building on our findings regarding field-specific hallucination, we employed elastic-net regularization with stability selection to analyze neuron-level CETT values from the Qwen2.5-32B-Instruct model. This analysis led us to identify a sparse set of neurons responsible for field-specific hallucinations, termed “FH-neurons.”
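
To make the selection step more concrete, here is a minimal pure-Python sketch of elastic-net regression combined with stability selection. It is an illustration under stated assumptions, not the study's implementation: it uses a squared-error loss fitted by coordinate descent (the study's loss function is not specified here), it assumes features and labels are already centered so no intercept is needed, and the function names, the half-sample scheme, and the default values for lam, l1_ratio, n_rounds, and threshold are all hypothetical. In the study's setting, each column of X would hold one neuron's CETT values and y would mark whether a field was hallucinated.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_fit(X, y, lam=0.1, l1_ratio=0.9, n_iter=50):
    """Coordinate descent for squared loss with an elastic-net penalty (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, since w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual, adding back its own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam * l1_ratio) / (col_sq[j] + lam * (1.0 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


def stability_selection(X, y, n_rounds=50, threshold=0.8, seed=0, **fit_kwargs):
    """Refit on random half-samples and keep features that are nonzero in most rounds."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    hits = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)
        w = elastic_net_fit([X[i] for i in idx], [y[i] for i in idx], **fit_kwargs)
        for j in range(p):
            if w[j] != 0.0:
                hits[j] += 1
    freq = [h / n_rounds for h in hits]
    selected = [j for j in range(p) if freq[j] >= threshold]
    return selected, freq
```

The L1 part of the penalty drives most coefficients to exactly zero, and repeating the fit on subsamples filters out features that are selected only by chance. The features that survive both steps form the sparse set.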

Causal Interventions

To further validate the role of these FH-neurons, we conducted causal interventions. Our experiments showed that amplifying the activity of these neurons increased hallucination rates, while suppressing their activity improved performance across various citation fields. Notably, the gains from suppression were more pronounced in some fields than in others, suggesting that the impact of these neurons depends on the citation field.
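
The idea of amplifying or suppressing selected neurons can be illustrated with a toy example. The sketch below scales chosen hidden units in a tiny one-hidden-layer network written in plain Python. It is only a conceptual illustration: the real interventions target neurons inside Qwen2.5-32B-Instruct, and the function name mlp_forward, the neuron_scale mapping, and the scaling factors are assumptions made for this example.

```python
def relu(x):
    """Rectified linear activation."""
    return max(0.0, x)


def mlp_forward(x, w_in, w_out, neuron_scale=None):
    """Forward pass of a one-hidden-layer network with optional per-neuron scaling.

    neuron_scale maps a hidden-unit index to a multiplier:
    0.0 suppresses the neuron, values above 1.0 amplify it.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if neuron_scale:
        hidden = [h * neuron_scale.get(j, 1.0) for j, h in enumerate(hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


# Toy weights: two hidden neurons feeding a single output.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 1.0]]
x = [2.0, 3.0]

baseline = mlp_forward(x, w_in, w_out)                 # hidden = [2, 3]
suppressed = mlp_forward(x, w_in, w_out, {1: 0.0})     # neuron 1 zeroed out
amplified = mlp_forward(x, w_in, w_out, {1: 2.0})      # neuron 1 doubled
```

Comparing the output under suppression and amplification against the baseline is the basic logic of a causal intervention: if changing only the selected neurons shifts the behavior in the predicted direction, those neurons are more than just correlated with it.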

Implications for Future Research

The results of our study have significant implications for the development of LLMs and their application in academic and professional writing. We propose a lightweight approach to detecting and reducing citation hallucination: identify FH-neurons and mitigate their influence. Because this strategy relies on internal model signals, it could lead to more reliable and accurate citations in generated text.

Conclusion

As LLMs continue to evolve and integrate into various domains, understanding the mechanics of citation hallucination is crucial. Our research highlights the importance of investigating specific neurons within these models to improve their performance and reliability. By focusing on field-specific hallucination neurons, we open new avenues for enhancing the integrity of information generated by LLMs, ultimately fostering trust and credibility in automated writing systems.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
