Logics-Parsing-Omni Technical Report
arXiv:2603.09677v3
Abstract
To address fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. The framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, and introduces a progressive parsing paradigm that bridges perception and cognition.
Framework Overview
The Omni Parsing framework integrates three hierarchical levels:
- Holistic Detection: Achieves precise spatial-temporal grounding of objects or events, establishing a geometric baseline for perception.
- Fine-grained Recognition: Performs symbolization (e.g., OCR/ASR) and attribute extraction on the localized objects, completing structured entity parsing.
- Multi-level Interpreting: Constructs a reasoning chain from local semantics to global logic.
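One way to picture the three levels is as nested records, where each level refers back to the one beneath it. The sketch below is illustrative only: the class names, field names, and example values are assumptions made for this summary and are not taken from the released model's output schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Detection:
    """Level 1 (hypothetical schema): where or when an entity occurs."""
    entity_id: str
    bbox: Optional[tuple] = None       # (x0, y0, x1, y1) for spatial grounding
    time_span: Optional[tuple] = None  # (start_s, end_s) for temporal grounding


@dataclass
class Recognition:
    """Level 2 (hypothetical schema): symbolized content of a detected entity."""
    entity_id: str                     # refers to a Detection.entity_id
    text: str                          # e.g. OCR or ASR output
    attributes: dict = field(default_factory=dict)


@dataclass
class Interpretation:
    """Level 3 (hypothetical schema): one step of the reasoning chain."""
    statement: str
    scope: str                         # "local" or "global"
    evidence_ids: list = field(default_factory=list)


@dataclass
class ParseResult:
    detections: list = field(default_factory=list)
    recognitions: list = field(default_factory=list)
    interpretations: list = field(default_factory=list)


def build_example() -> ParseResult:
    """Assemble a toy result for a slide shown while a speaker talks."""
    return ParseResult(
        detections=[
            Detection("e1", bbox=(0.1, 0.1, 0.9, 0.3)),
            Detection("e2", time_span=(12.0, 15.5)),
        ],
        recognitions=[
            Recognition("e1", "Quarterly Revenue", {"type": "title"}),
            Recognition("e2", "revenue grew this quarter", {"type": "speech"}),
        ],
        interpretations=[
            Interpretation("The slide title names the topic.", "local", ["e1"]),
            Interpretation("The speaker discusses the slide.", "global", ["e1", "e2"]),
        ],
    )


result = build_example()
```

In this sketch the local interpretation cites a single entity, while the global one draws on both, mirroring the movement from local semantics to global logic.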
Key Advantages
A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables “evidence-based” logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable.
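The idea of evidence anchoring can be illustrated with a small consistency check: every high-level statement must cite at least one low-level fact, and every cited fact must exist. This is a minimal sketch under assumed data shapes; the function name `find_unanchored` and the dictionary layout are hypothetical and do not describe the paper's actual mechanism.

```python
def find_unanchored(statements, facts):
    """Return the text of statements that cite no evidence or cite unknown fact IDs.

    statements: list of dicts with keys "text" and "evidence" (list of fact IDs)
    facts: dict mapping fact ID -> low-level fact (hypothetical layout)
    """
    known_ids = set(facts)
    unanchored = []
    for statement in statements:
        cited = statement.get("evidence", [])
        # A statement is anchored only if it cites something and every citation resolves.
        if not cited or any(fact_id not in known_ids for fact_id in cited):
            unanchored.append(statement["text"])
    return unanchored


# Toy low-level facts, e.g. produced by detection and recognition.
facts = {
    "e1": {"kind": "ocr", "text": "Quarterly Revenue"},
    "e2": {"kind": "asr", "text": "revenue grew this quarter"},
}

# Toy high-level descriptions with their cited evidence.
statements = [
    {"text": "The slide is about revenue.", "evidence": ["e1"]},
    {"text": "The speaker is optimistic.", "evidence": []},
    {"text": "A chart shows a decline.", "evidence": ["e9"]},
]

unanchored = find_unanchored(statements, facts)
```

Here the second statement cites nothing and the third cites a fact that does not exist, so both would be flagged, while the first is locatable and traceable to `e1`.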
Dataset and Model Release
Building on this foundation, a standardized dataset has been constructed, and the Logics-Parsing-Omni model has been released. This model successfully converts complex audio-visual signals into machine-readable structured knowledge.
Experimental Results
Experiments demonstrate that fine-grained perception and high-level cognition are synergistic: combining them enhances model reliability and improves performance on multimodal parsing tasks.
Evaluation Benchmark
To quantitatively evaluate these capabilities, the authors introduce OmniParsingBench, a benchmark that evaluates model performance across a range of multimodal parsing tasks.
Access to Resources
Code, models, and the benchmark are released in the Logics-Parsing-Omni GitHub repository.
Conclusion
The Omni Parsing framework addresses fragmented task definitions and heterogeneous unstructured data by integrating perception and cognition in a progressive parsing paradigm, offering a foundation for future research and applications in multimodal parsing.
