GPT4o-Receipt Dataset for AI Document Forensics Study

GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics

A study published on arXiv introduces GPT4o-Receipt, a benchmark for studying AI-generated financial documents. It investigates whether human annotators can detect AI-generated receipts as effectively as state-of-the-art multimodal large language models (LLMs).

Summary of the Study

The dataset comprises 1,235 receipt images, pairing AI-generated receipts produced by GPT-4o with authentic receipts collected from established datasets. The evaluation covers five state-of-the-art multimodal LLMs and a perceptual study with 30 human annotators.

Key Findings

The results reveal a paradox: humans are adept at identifying visual artifacts in AI-generated documents, yet their performance at deciding whether a receipt is AI-generated falls below that of leading LLMs. The key insights from the research are:

Human annotators displayed the largest visual discrimination gap among all evaluators.
Despite their visual acuity, the binary detection F1 score of human annotators fell below that of Claude Sonnet 4 and Gemini 2.5 Flash.
The primary forensic signals within AI-generated receipts were found to be arithmetic errors, which are challenging for humans to spot but can be verified quickly by LLMs.
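
For readers unfamiliar with the binary detection F1 score mentioned above, the following is a minimal sketch of how it is computed. The labels and predictions are hypothetical illustration data, not results from the study, and the function name `binary_f1` is an assumption made for this example.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Return the F1 score for the `positive` class (here: AI-generated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = AI-generated receipt, 0 = authentic receipt.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 0, 1]
print(round(binary_f1(truth, preds), 3))
```

Because F1 combines precision and recall, an evaluator who spots visual artifacts well but still misses many AI-generated receipts will score low on it.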

Understanding the Paradox

The paradox becomes clearer when examining the kinds of errors present in AI-generated receipts. Human reviewers may notice visual discrepancies, but they struggle to perceive numerical inaccuracies such as an incorrect subtotal. LLMs, in contrast, can read the figures on a receipt and quickly verify whether the arithmetic is consistent.
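
The kind of arithmetic check described above can be sketched in a few lines of code. This is an illustrative example only, not the study's evaluation framework: the function `check_receipt`, its parameters, and the sample amounts are all assumptions, and it presumes the receipt fields have already been extracted from the image.

```python
from decimal import Decimal


def check_receipt(items, subtotal, tax, total, tolerance=Decimal("0.005")):
    """Return a list of arithmetic inconsistencies found on a receipt.

    items: iterable of (quantity, unit_price) pairs. Amounts are passed as
    strings so that Decimal parses them exactly (no float rounding).
    """
    subtotal, tax, total = Decimal(subtotal), Decimal(tax), Decimal(total)
    problems = []

    # Line items should add up to the printed subtotal.
    computed_subtotal = sum(
        (Decimal(qty) * Decimal(price) for qty, price in items), Decimal("0")
    )
    if abs(computed_subtotal - subtotal) > tolerance:
        problems.append(f"subtotal: printed {subtotal}, computed {computed_subtotal}")

    # Printed subtotal plus tax should equal the printed total.
    computed_total = subtotal + tax
    if abs(computed_total - total) > tolerance:
        problems.append(f"total: printed {total}, computed {computed_total}")

    return problems


# Hypothetical receipt: the line items sum to 11.25, but 12.25 is printed.
suspicious = check_receipt(
    items=[(2, "3.50"), (1, "4.25")], subtotal="12.25", tax="0.90", total="13.15"
)
print(suspicious)
```

The total is checked against the printed subtotal rather than the recomputed one, so a single wrong figure is reported once instead of cascading into every later check.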

Evaluation of Multimodal Models

Beyond the human versus LLM comparison, the research reveals significant performance disparities among the five evaluated models. The findings suggest that simple accuracy metrics may not be sufficient for selecting a model for AI document forensics, and the researchers advocate a more nuanced evaluation that captures the complexities of AI-generated document detection.
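
One way to see why accuracy alone can mislead is to compare two detectors that reach the same accuracy with very different error profiles. The sketch below uses hypothetical predictions, not figures from the study, and the choice of precision and recall as the complementary metrics is an assumption; the source does not list which additional metrics the authors recommend.

```python
def summarize(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 1 = AI-generated, 0 = authentic (five of each).
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
detector_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # catches every fake, two false alarms
detector_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # no false alarms, misses two fakes

report_a = summarize(truth, detector_a)
report_b = summarize(truth, detector_b)
print(report_a)
print(report_b)
```

Both hypothetical detectors are right on eight of ten receipts, yet one never misses a fake while the other never raises a false alarm. Which trade-off is acceptable depends on the deployment, which a single accuracy number cannot express.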

Public Release and Future Research

To support further research in AI document forensics, the authors have made the GPT4o-Receipt dataset, the evaluation framework, and all associated results publicly available. The release is intended to help researchers and developers improve detection methods and, ultimately, the integrity of financial documentation.

Conclusion

The findings from the GPT4o-Receipt study have important implications for the future of AI document forensics. As AI-generated documents become more prevalent, understanding the strengths and limitations of both human and machine detection will be vital to ensuring the authenticity of financial documents.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
