MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games
arXiv:2604.12700v1
Abstract
Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition.
Overview of MISID
Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking.
Key Features of MISID
- Multimodal Interactions: The dataset captures both verbal and non-verbal cues, enabling a richer understanding of intent.
- Multi-turn Conversations: It includes extensive dialogues that reflect the complexities of human interactions over time.
- Multi-participant Engagement: The dataset comprises interactions among multiple individuals, simulating real-world strategic scenarios.
- Complex Deceptive Narratives: MISID targets scenarios in which participants sustain deceptive narratives over extended periods, so a model must infer hidden intent rather than surface meaning.
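To make these properties concrete, a session with these features could be represented roughly as follows. This is a minimal sketch, not the released MISID schema: every class and field name here (`Turn`, `IntentAnnotation`, `Session`, `coarse_label`, `fine_label`, `evidence_turns`, `video_clip`) is a hypothetical placeholder, and the two label fields only stand in for the two annotation tiers, whose actual contents the summary does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One utterance in a game session (all field names are hypothetical)."""
    index: int        # position of the turn within the session
    speaker: str      # participant identifier
    text: str         # transcribed speech (verbal cue)
    video_clip: str   # reference to the clip holding non-verbal cues


@dataclass
class IntentAnnotation:
    """A label on one turn, linked to the earlier turns that support it."""
    turn_index: int
    coarse_label: str   # placeholder for the first annotation tier
    fine_label: str     # placeholder for the second annotation tier
    evidence_turns: list[int] = field(default_factory=list)


@dataclass
class Session:
    turns: list[Turn]
    annotations: list[IntentAnnotation]

    def participants(self) -> set[str]:
        """Distinct speakers, reflecting the multi-participant setting."""
        return {t.speaker for t in self.turns}

    def evidence_for(self, turn_index: int) -> list[Turn]:
        """Return the turns cited as evidence for annotations on turn_index."""
        by_index = {t.index: t for t in self.turns}
        cited = []
        for a in self.annotations:
            if a.turn_index == turn_index:
                cited.extend(by_index[i] for i in a.evidence_turns)
        return cited
```

Storing evidence as explicit turn indices is one simple way to keep a label traceable to the earlier turns that justify it, which is what long-context analysis over many turns requires.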
Challenges in Current Intent Recognition Models
Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios:
- Text-prior Visual Hallucination: Models lean too heavily on textual priors, which leads to errors when the visual context is decisive.
- Impaired Cross-modal Synergy: Current models struggle to effectively integrate information from multiple modalities.
- Limited Capacity in Chaining Causal Cues: Models often fail to track and infer causality across extended dialogues.
Proposed Solution: FRACTAM
To address these challenges, we propose FRACTAM as a baseline framework. Utilizing a “Decouple-Anchor-Reason” paradigm, FRACTAM significantly reduces text bias by:
- Extracting Pure Unimodal Factual Representations: This ensures that the model is not swayed by misleading textual cues.
- Employing Two-stage Retrieval: This technique enhances long-range factual anchoring, allowing for better context retention.
- Constructing Explicit Cross-modal Evidence Chains: By creating clear connections between modalities, FRACTAM improves overall interpretability and inference.
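The three steps above can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the FRACTAM implementation: the `Fact` type, the function names, the word-overlap and Jaccard scoring, and the example facts are all invented for illustration, and the summary does not say which retrievers or scoring functions the authors use. The sketch only shows the general shape of the idea: keep per-modality facts separate, narrow them in two retrieval passes, then lay the survivors out as an explicit chain ordered by turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    turn: int       # turn in which the fact was observed
    modality: str   # "text" or "visual", extracted separately (decoupled)
    content: str    # short factual statement


def tokens(s: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a real encoder."""
    return set(s.lower().split())


def two_stage_retrieve(query: str, facts: list[Fact],
                       shortlist: int = 3, keep: int = 2) -> list[Fact]:
    q = tokens(query)
    # Stage 1 (assumed): cheap recall, ranking every fact by shared word count.
    coarse = sorted(facts, key=lambda f: len(q & tokens(f.content)),
                    reverse=True)[:shortlist]

    # Stage 2 (assumed): rerank the shortlist by Jaccard similarity,
    # which penalizes long facts that only loosely match the query.
    def jaccard(f: Fact) -> float:
        t = tokens(f.content)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(coarse, key=jaccard, reverse=True)[:keep]


def evidence_chain(facts: list[Fact]) -> str:
    """Order the retrieved facts by turn and tag each with its modality."""
    ordered = sorted(facts, key=lambda f: f.turn)
    return " -> ".join(f"[turn {f.turn}, {f.modality}] {f.content}"
                       for f in ordered)


facts = [
    Fact(1, "text", "player A claims to be innocent"),
    Fact(3, "visual", "player A looks away while speaking"),
    Fact(5, "text", "player B accuses player C"),
    Fact(7, "visual", "player A nods at player B"),
]
anchored = two_stage_retrieve("is player A innocent", facts)
chain = evidence_chain(anchored)
```

In a setup like this, the resulting chain string would be handed to the reasoning model alongside the question, so that each claimed inference can be traced to a tagged textual or visual fact.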
Conclusion
Extensive experiments demonstrate that FRACTAM enhances mainstream models’ performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. The MISID dataset is available at https://naislab.cn/datasets/MISID as a resource for advancing research in multimodal intent recognition.
