GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
In a study published on arXiv, researchers introduce GPT4o-Receipt, a benchmark for studying the detection of AI-generated financial documents. The study asks whether human annotators can detect AI-generated receipts as well as state-of-the-art multimodal large language models (LLMs) can.
Summary of the Study
The study is built on a dataset of 1,235 receipt images that pairs receipts generated by GPT-4o with authentic receipts drawn from established datasets. The images were evaluated by five state-of-the-art multimodal LLMs and, in a perceptual study, by 30 human annotators.
Key Findings
The results reveal a paradox: humans are adept at spotting visual artifacts in AI-generated receipts, yet they are worse than the strongest LLMs at deciding whether a given receipt is AI-generated. The main findings are:
- Human annotators displayed the largest visual discrimination gap among all evaluators.
- Despite this visual acuity, human annotators achieved a lower binary detection F1 score than Claude Sonnet 4 and Gemini 2.5 Flash.
- The primary forensic signals in AI-generated receipts were arithmetic errors, which are hard for humans to spot but which LLMs can check quickly.
Understanding the Paradox
The paradox resolves once the nature of the errors in AI-generated receipts is considered. Human reviewers may notice visual discrepancies, but they struggle to perceive numerical inaccuracies such as an incorrect subtotal. LLMs, in contrast, can read the figures on a receipt and check the arithmetic quickly.
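The kind of arithmetic check described above can be illustrated with a minimal Python sketch. It is not the study's method: it assumes the line items, subtotal, tax, and total have already been extracted from the receipt image, and the function name `check_receipt_arithmetic`, its parameters, and the sample values are all hypothetical.

```python
from decimal import Decimal


def check_receipt_arithmetic(items, subtotal, tax, total, tolerance=Decimal("0.00")):
    """Return a list of arithmetic inconsistencies found on a receipt.

    items    -- list of (quantity, unit_price) pairs, prices given as strings
    subtotal, tax, total -- printed amounts, given as strings
    tolerance -- allowed absolute difference (hypothetical knob for rounding)
    """
    problems = []

    # The printed subtotal should equal the sum of quantity * unit price.
    computed_subtotal = sum(
        (Decimal(qty) * Decimal(price) for qty, price in items),
        Decimal("0"),
    )
    if abs(computed_subtotal - Decimal(subtotal)) > tolerance:
        problems.append(f"subtotal: printed {subtotal}, computed {computed_subtotal}")

    # The printed total should equal the printed subtotal plus the printed tax.
    computed_total = Decimal(subtotal) + Decimal(tax)
    if abs(computed_total - Decimal(total)) > tolerance:
        problems.append(f"total: printed {total}, computed {computed_total}")

    return problems


# Invented sample values, for illustration only.
items = [(2, "3.50"), (1, "4.25")]
consistent = check_receipt_arithmetic(items, "11.25", "0.90", "12.15")
suspicious = check_receipt_arithmetic(items, "12.25", "0.90", "13.15")
print(consistent)
print(suspicious)
```

In the second call the printed total agrees with the printed subtotal, so the receipt looks internally plausible at a glance; only recomputing the subtotal from the line items exposes the inconsistency.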
Evaluation of Multimodal Models
Beyond the human versus LLM comparison, the study reports large performance disparities among the five evaluated models. The findings suggest that traditional accuracy metrics may not be sufficient for selecting a model in AI document forensics, and the researchers advocate for a more nuanced evaluation approach that captures the complexities of AI-generated document detection.
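One way to see why accuracy alone can mislead is to compare it with F1 on an imbalanced test set. The sketch below is a generic illustration, not the study's evaluation framework: the function `detection_metrics` is hypothetical, and the confusion-matrix counts are invented rather than taken from the paper's results.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    "Positive" here means "flagged as AI-generated".
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented scenario: 100 receipts, 20 of them AI-generated.
# Detector A never flags anything; detector B catches 16 of the 20 fakes
# but also flags 16 authentic receipts.
detector_a = detection_metrics(tp=0, fp=0, fn=20, tn=80)
detector_b = detection_metrics(tp=16, fp=16, fn=4, tn=64)
print(detector_a)
print(detector_b)
```

Both hypothetical detectors are right on 80 of 100 receipts, yet detector A finds no AI-generated receipts at all, which its F1 of zero makes visible and its accuracy does not.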
Public Release and Future Research
To support further research in AI document forensics, the GPT4o-Receipt dataset, the evaluation framework, and all associated results have been released publicly. The release is intended to help researchers and developers improve detection methods and, in turn, the integrity of financial documentation.
Conclusion
The findings from the GPT4o-Receipt study have important implications for AI document forensics. As AI-generated documents become more prevalent, understanding the strengths and limitations of both human and machine detection will be vital to verifying the authenticity of financial documents.
