OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
The increasing complexity of modern multimodal large language models (MLLMs) makes it harder to understand how these systems generate their outputs. Because these models process interleaved text, image, audio, and video inputs, it becomes important to pinpoint which of these sources contribute to a specific generated statement.
Current attribution methods primarily cater to classification tasks, fixed prediction targets, or single-modality architectures. They do not adequately address autoregressive, decoder-only models engaged in open-ended multimodal generation. To close this gap, researchers have introduced a new framework called OmniTrace.
Understanding OmniTrace
OmniTrace is a lightweight, model-agnostic framework that formalizes attribution as a generation-time tracing problem. Rather than explaining a finished output after the fact, it follows the causal decoding process of multimodal generation, so each generated token can be linked to the inputs that influenced it at the step where it was produced.
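To make the idea of generation-time tracing concrete, here is a minimal sketch of a decoding loop that records an attribution signal at every step. It is not OmniTrace's actual implementation: `trace_generation`, `step_fn`, and `position_to_source` are hypothetical names, and the sketch assumes the model wrapper can return, for each step, the generated token, a confidence value, and one score per input position (for example, attention weights restricted to the input).

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical per-step output: (token, confidence, score per input position).
StepOutput = Tuple[str, float, Sequence[float]]


def trace_generation(
    step_fn: Callable[[List[str]], StepOutput],
    position_to_source: Sequence[str],
    max_steps: int,
    eos: str = "<eos>",
) -> List[Dict]:
    """Run a decoding loop and record per-source scores for each generated token.

    step_fn is an assumed wrapper around one decoding step of a model.
    position_to_source maps each input position to a source id such as
    "image_0" or "audio_0" (these ids are illustrative).
    """
    generated: List[str] = []
    trace: List[Dict] = []
    for _ in range(max_steps):
        token, confidence, scores = step_fn(generated)
        if token == eos:
            break
        # Sum position-level scores into one score per input source.
        per_source: Dict[str, float] = {}
        for pos, score in enumerate(scores):
            source_id = position_to_source[pos]
            per_source[source_id] = per_source.get(source_id, 0.0) + score
        trace.append(
            {"token": token, "confidence": confidence, "sources": per_source}
        )
        generated.append(token)
    return trace
```

Because the signal is captured inside the loop, the record for each token reflects only what the model had seen at that step, which is the property a generation-time formulation relies on.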
Key Features of OmniTrace
- Unified Protocol: OmniTrace converts arbitrary token-level signals, such as attention weights or gradient-based scores, into coherent span-level, cross-modal explanations during the decoding phase.
- Token Tracing: The framework traces each generated token back to its multimodal inputs, allowing for a deeper understanding of the input-output relationship.
- Semantic Aggregation: By aggregating signals into semantically meaningful spans, OmniTrace enhances the interpretability of the model’s outputs.
- Confidence-Weighted Selection: The framework employs a confidence-weighted and temporally coherent aggregation method to select concise supporting sources, all without the need for retraining or supervision.
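The aggregation and selection steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: the function name `aggregate_span_attribution`, the use of a confidence-weighted mean, and the `top_k` cutoff are all hypothetical choices, and the temporal-coherence component is omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate_span_attribution(
    token_scores: Sequence[Dict[str, float]],
    token_confidences: Sequence[float],
    spans: Sequence[Tuple[int, int]],
    top_k: int = 2,
) -> List[List[Tuple[str, float]]]:
    """Turn token-level source scores into ranked sources per output span.

    token_scores[t] maps a source id to its score for generated token t.
    token_confidences[t] is an assumed per-token weight, such as the
    probability the model assigned to token t.
    spans are half-open (start, end) ranges over generated-token indices.
    """
    results: List[List[Tuple[str, float]]] = []
    for start, end in spans:
        totals: Dict[str, float] = defaultdict(float)
        weight_sum = 0.0
        for t in range(start, end):
            confidence = token_confidences[t]
            weight_sum += confidence
            for source_id, score in token_scores[t].items():
                totals[source_id] += confidence * score
        if weight_sum > 0:
            # Confidence-weighted mean; ties are broken by source id.
            ranked = sorted(
                ((s, v / weight_sum) for s, v in totals.items()),
                key=lambda item: (-item[1], item[0]),
            )
        else:
            ranked = []
        # Keep only a concise set of supporting sources for the span.
        results.append(ranked[:top_k])
    return results


# Illustrative input: three generated tokens grouped into two spans.
example_scores = [
    {"image_0": 0.8, "text_0": 0.2},
    {"image_0": 0.6, "audio_0": 0.4},
    {"audio_0": 0.9, "text_0": 0.1},
]
example_confidences = [0.9, 0.5, 0.8]
example_spans = [(0, 2), (2, 3)]
example_result = aggregate_span_attribution(
    example_scores, example_confidences, example_spans, top_k=1
)
```

In this toy input the first span is attributed mainly to `image_0` and the second to `audio_0`. Weighting by confidence means tokens the model was unsure about contribute less to the span-level explanation, and nothing in the procedure requires retraining or labeled data.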
Evaluations and Results
Evaluations on the Qwen2.5-Omni and MiniCPM-o-4.5 models, across tasks spanning visual, audio, and video modalities, show that generation-aware span-level attribution yields more stable interpretations than traditional self-attribution methods and embedding-based baselines.
The findings suggest that OmniTrace makes the outputs of multimodal language models more transparent and remains robust across multiple underlying attribution signals.
Conclusion
In summary, OmniTrace provides a scalable foundation for transparency in omni-modal language models. By addressing the limitations of existing attribution methods, the framework offers a practical way to understand and interpret the outputs of complex multimodal systems, without retraining or supervision.
