Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Summary: arXiv:2604.01128v1
The rise of AI-driven paper writing has raised concerns in the academic community about the reliability and quality of the resulting papers. In response, the authors introduce Paper Reconstruction Evaluation (PaperRecon), a framework for systematically evaluating the quality and risks of papers written by modern coding agents.
The paper argues that rigorous assessment mechanisms are needed as AI-written papers proliferate. Although AI tools are increasingly common in academia, there is still no comprehensive approach to evaluating what these systems produce. PaperRecon aims to fill this gap with a structured methodology for assessing AI-generated research papers.
Overview of PaperRecon Framework
PaperRecon is designed to break down the evaluation of AI-written papers into two key dimensions: Presentation and Hallucination. Each of these dimensions serves a distinct purpose in the assessment process:
- Presentation: This dimension covers the clarity, coherence, and overall quality of the text. A rubric is used to score how well the AI-generated paper presents its arguments and findings.
- Hallucination: This term refers to the inaccuracies or false information that AI models may generate. Hallucination is assessed through agentic evaluation, which compares the AI output against the original paper source.
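The two dimensions above can be pictured as two separate fields on a per-paper evaluation record. The following is a minimal sketch, not the authors' implementation: the class names, the rubric criteria, and the scoring scale are all hypothetical, and the summary does not say how rubric scores are combined, so a plain average is assumed here.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class HallucinationFinding:
    """One claim in the generated paper that the original paper does not support."""
    claim: str
    reason: str


@dataclass
class ReconstructionResult:
    """Hypothetical record holding both dimensions for one generated paper."""
    paper_id: str
    # Rubric criterion name -> score. Criteria and scale are assumptions.
    rubric_scores: Dict[str, float]
    # Findings produced by comparing the output against the original paper.
    hallucinations: List[HallucinationFinding] = field(default_factory=list)

    def presentation_score(self) -> float:
        # Assumed aggregation: unweighted mean over rubric criteria.
        return mean(list(self.rubric_scores.values()))

    def hallucination_count(self) -> int:
        return len(self.hallucinations)
```

Keeping the two measurements in separate fields mirrors the framework's stated goal of assessing presentation and hallucination independently rather than folding them into a single score.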
Introducing PaperWrite-Bench
To facilitate the evaluation process, the authors have developed PaperWrite-Bench, a benchmark consisting of 51 papers sourced from prestigious venues across various disciplines, all published after 2025. This diverse collection serves as a testing ground for the PaperRecon framework, allowing researchers to conduct comprehensive evaluations of AI-generated content.
Key Findings and Implications
Initial experiments with the PaperRecon framework compare different coding agents. The results indicate a significant trade-off between presentation quality and the frequency of hallucinations:
- ClaudeCode: This agent achieves higher presentation quality but averages more than 10 hallucinations per paper, which calls the reliability of its output into question despite the polished presentation.
- Codex: In contrast, Codex produces fewer hallucinations, suggesting a more reliable factual basis. However, it falls short in terms of presentation quality, indicating room for improvement in how it articulates research findings.
These findings underscore the necessity for ongoing research into AI-driven paper writing and the establishment of robust evaluation frameworks. As AI tools continue to evolve and integrate into the academic landscape, understanding their strengths and weaknesses will be crucial for ensuring the integrity of research outputs.
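The per-agent comparison reported above amounts to averaging both dimensions over the benchmark papers for each agent. A minimal sketch of that aggregation follows; the `PaperScore` type and `summarize` function are hypothetical names, and any numbers passed in are placeholders supplied by the caller, not results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, NamedTuple, Tuple


class PaperScore(NamedTuple):
    """Hypothetical per-paper result for one agent."""
    agent: str
    presentation: float   # rubric-based score (scale is an assumption)
    hallucinations: int   # number of hallucinations found in this paper


def summarize(scores: Iterable[PaperScore]) -> Dict[str, Tuple[float, float]]:
    """Return agent -> (mean presentation, mean hallucinations per paper)."""
    by_agent: Dict[str, List[PaperScore]] = defaultdict(list)
    for score in scores:
        by_agent[score.agent].append(score)
    return {
        agent: (
            mean(row.presentation for row in rows),
            mean(row.hallucinations for row in rows),
        )
        for agent, rows in by_agent.items()
    }
```

Reporting the two averages side by side, rather than as one combined number, is what makes a trade-off like the one described here visible.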
Conclusion
In summary, the Paper Reconstruction Evaluation framework represents a significant advancement in the evaluation of AI-generated research papers. By disentangling the dimensions of Presentation and Hallucination, this work lays the groundwork for future studies aimed at enhancing the reliability of AI-driven academic writing. As the research community grapples with these emerging technologies, frameworks like PaperRecon will be essential in navigating the complexities of AI in academia.
