Internalized Reasoning for Long-Context Visual Document Understanding
Source: arXiv:2604.02371v1 (cross-list)
Abstract: Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant.
We apply SFT to the resulting traces within <SFT> tags, gated by a <control> control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7× larger Qwen3 VL 235B A22B (57.0).
With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version’s traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4× fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.
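The trace-construction step described in the abstract (score each page for question relevance, extract textual evidence, order it from most to least relevant) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the released pipeline: the `PageEvidence` record, the `build_trace` function, the `min_relevance` threshold, the line format, and the closing tag `</SFT>`. In a real pipeline the scores and evidence would come from a model; here they are supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class PageEvidence:
    """Hypothetical per-page record; field names are illustrative only."""
    page: int          # 1-based page number
    relevance: float   # assumed relevance score in [0, 1] for the question
    evidence: str      # textual evidence extracted from the page ("" if none)


def build_trace(pages, min_relevance=0.5, open_tag="<SFT>", close_tag="</SFT>"):
    """Assemble a thinking trace from scored pages.

    Pages below the (assumed) relevance threshold or without evidence are
    dropped; the rest are ordered from most to least relevant, with ties
    broken by page number, and wrapped in the given tags.
    """
    kept = [p for p in pages if p.relevance >= min_relevance and p.evidence]
    kept.sort(key=lambda p: (-p.relevance, p.page))
    lines = [
        f"Page {p.page} (relevance {p.relevance:.2f}): {p.evidence}"
        for p in kept
    ]
    return "\n".join([open_tag, *lines, close_tag])


if __name__ == "__main__":
    # Hand-written scores standing in for model-produced ones.
    demo = [
        PageEvidence(1, 0.10, "Table of contents."),
        PageEvidence(7, 0.92, "Revenue grew 14% year over year."),
        PageEvidence(12, 0.64, "Growth was driven by the services segment."),
    ]
    print(build_trace(demo))
```

Sorting by descending relevance places the strongest evidence first in the trace, which mirrors the ordering the abstract describes; the threshold is only one possible way to decide which pages contribute evidence.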
Key Findings
- A synthetic data pipeline generates thinking traces for long-document understanding by scoring each page for question relevance, extracting textual evidence, and ordering it from most to least relevant.
- With this recipe, Qwen3 VL 32B reaches 58.3 on MMLongBenchDoc, ahead of the roughly 7× larger Qwen3 VL 235B A22B (57.0).
- With Mistral Small 3.1 24B, synthetic reasoning traces outperform distillation from the Thinking version's traces by 3.8 points on MMLBD-C.
- In the Mistral experiments, internalized reasoning produces 12.4× fewer mean output tokens than explicit reasoning.
- The pipeline is released for reproducibility and further exploration.
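The abstract says the reasoning capability is internalized via low-strength model merging but does not give the merge formula. One common form of merging is linear interpolation between a base checkpoint and a fine-tuned checkpoint; the sketch below assumes that form and is not the authors' confirmed method. The function name `merge_low_strength`, the `alpha` parameter, and the representation of weights as plain lists of floats are all hypothetical, chosen to keep the example dependency-free.

```python
def merge_low_strength(base, tuned, alpha=0.1):
    """Linearly interpolate two sets of weights (assumed merge rule).

    merged = (1 - alpha) * base + alpha * tuned

    `base` and `tuned` map parameter names to equal-length lists of floats.
    A small `alpha` keeps the result close to the base model, which is what
    "low-strength" is taken to mean in this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must have the same parameter names")

    merged = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


if __name__ == "__main__":
    # Toy two-parameter "checkpoints" for illustration only.
    base = {"layer.weight": [0.0, 2.0]}
    tuned = {"layer.weight": [1.0, 4.0]}
    print(merge_low_strength(base, tuned, alpha=0.25))
```

With real models the same interpolation would be applied tensor by tensor over the checkpoints' state dictionaries; the appropriate value of `alpha` is an empirical choice that this sketch does not attempt to guess.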
Significance of the Study
Long-document understanding matters in enterprise, legal, and scientific settings, where answering a question often depends on evidence spread across many pages. The abstract notes that the best-performing open recipes have not yet explored reasoning, even though reasoning has driven large gains in math and code.
By generating reasoning traces synthetically, the work addresses that gap without depending on traces distilled from a Thinking model, and on MMLBD-C the synthetic traces perform better than the distilled ones. Internalizing the reasoning through low-strength model merging also shortens outputs substantially, which makes the approach more practical where generation cost and latency matter.
Future Directions
Several directions for future work follow from the study:
- Expanding the synthetic data pipeline to include more diverse document types.
- Investigating the potential for further efficiency improvements in reasoning processes.
- Exploring the integration of these techniques in real-world enterprise applications.
- Collaborating with other researchers to enhance reproducibility and validation of findings.
Taken together, the results suggest that reasoning, already established in math and code, is also a productive direction for long-document understanding.
