DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
arXiv:2604.02694v1
The rapid advancement of generative artificial intelligence has led to the creation of increasingly realistic text-centric image forgeries. These developments present significant challenges to document safety, as traditional forensic methods primarily rely on visual cues and often lack the necessary evidence-based reasoning to uncover subtle text manipulations.
Challenges in Document Safety
Forgery detection, localization of tampered regions, and explanation are usually treated as isolated tasks. This separation limits the reliability and interpretability of the results and leaves a gap that calls for a more integrated strategy.
Introducing DocShield
To address these challenges in document forensics, the researchers propose DocShield, which they describe as the first unified framework to formulate text-centric forgery analysis as a visual-logical co-reasoning problem. The framework analyzes visual and textual evidence jointly rather than treating them separately.
Core Mechanism: Cross-Cues-aware Chain of Thought (CCT)
At the core of DocShield is a Cross-Cues-aware Chain of Thought (CCT) mechanism. It enables implicit agentic reasoning by iteratively cross-validating visual anomalies against textual semantics. The intended result is a forensic analysis that is both consistent and grounded in evidence.
Optimization and Reward Structure
To further improve effectiveness, DocShield introduces a Weighted Multi-Task Reward for optimization based on GRPO (Group Relative Policy Optimization). This reward structure aligns the reasoning framework, spatial evidence, and authenticity prediction, thereby improving the overall reliability of the analysis.
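To make the idea of a weighted multi-task reward concrete, here is a minimal sketch. It is not the authors' implementation: the `Rollout` fields, the three reward terms (format check, box IoU, label match), and the weights are all assumptions chosen for illustration. The last function shows the group-relative normalization that GRPO-style training applies to rewards sampled for the same input.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled model response, already parsed (all fields hypothetical)."""
    has_valid_format: bool  # did the reasoning follow the expected structure?
    pred_box: tuple         # predicted tampered region (x1, y1, x2, y2)
    pred_label: str         # "forged" or "authentic"


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_reward(rollout, gt_box, gt_label, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of three task rewards; the weights are arbitrary here."""
    r_format = 1.0 if rollout.has_valid_format else 0.0   # reasoning structure
    r_spatial = box_iou(rollout.pred_box, gt_box)         # spatial evidence
    r_label = 1.0 if rollout.pred_label == gt_label else 0.0  # authenticity
    w_format, w_spatial, w_label = weights
    return w_format * r_format + w_spatial * r_spatial + w_label * r_label


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same input.

    Uses the population standard deviation; that choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, several responses would be sampled for one document image, each scored with `weighted_reward`, and the resulting list passed to `group_relative_advantages`, so that a response is reinforced only to the extent that it beats the other samples in its group.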
Dataset Development: RealText-V1
In conjunction with the framework, researchers have developed RealText-V1, a multilingual dataset that includes document-like text images, pixel-level manipulation masks, and expert-level textual explanations. This dataset serves as a crucial resource for training and validating the efficacy of the DocShield framework.
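As a rough illustration of how one record in such a dataset might be represented, the sketch below pairs an image with its language, mask, and explanation. The field names and the `mask_to_box` helper are hypothetical; the actual RealText-V1 schema is not described here. The helper shows one common way to turn a pixel-level mask into a coarse bounding box for use as spatial evidence.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ForgerySample:
    """Hypothetical record layout; not the published RealText-V1 schema."""
    image_path: str
    language: str
    mask: List[List[int]]  # 1 = manipulated pixel, 0 = untouched
    explanation: str       # expert-written rationale


def mask_to_box(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return the tight box (x1, y1, x2, y2) around all manipulated pixels.

    x2 and y2 are exclusive. Returns None when the mask is empty,
    i.e. the image is treated as authentic.
    """
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row)
              if value]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```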
Performance and Results
In the reported experiments, DocShield outperforms existing methods. Key findings include:
- Macro-average F1 improves by 41.4% over specialized frameworks
- and by 23.4% over GPT-4o on the T-IC13 benchmark.
- Gains are consistent on the challenging T-SROIE benchmark.
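For readers unfamiliar with the metric, macro-average F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The sketch below computes it for a two-class setting; the label names are assumptions, and the paper's exact evaluation protocol may differ.

```python
def macro_f1(y_true, y_pred, labels=("authentic", "forged")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["forged", "forged", "authentic", "authentic"]
y_pred = ["forged", "authentic", "authentic", "authentic"]
score = macro_f1(y_true, y_pred)  # mean of 0.8 (authentic) and 2/3 (forged)
```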
Future Directions
With the ongoing evolution of generative AI, the need for robust document safety measures is more critical than ever. The researchers behind DocShield plan to publicly release their dataset, model, and code, fostering further innovation in this vital area of research.
As the landscape of document security continues to evolve, frameworks like DocShield will play a pivotal role in ensuring the integrity and authenticity of digital documents.
