MISID Dataset: Multimodal Intent Recognition in Deception Games

MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

Source: arXiv:2604.12700v1

Abstract

Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition.

Overview of MISID

Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking.
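To make the idea of two-tier labels with evidence tracking concrete, here is a minimal sketch of how one annotated turn might be represented. The source does not publish the schema, so every field name below (including the split into a surface tier and a hidden tier) is an assumption made for illustration, not MISID's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One utterance in a game session (hypothetical schema, not MISID's real format)."""
    turn_id: int
    speaker: str
    text: str
    video_clip: Optional[str] = None       # assumed: path to the aligned video clip
    surface_intent: Optional[str] = None   # assumed tier 1: what the speaker claims
    hidden_intent: Optional[str] = None    # assumed tier 2: what the speaker is after
    # assumed: ids of earlier turns that annotators cite as evidence for the labels
    evidence_turn_ids: List[int] = field(default_factory=list)


def trace_evidence(turns: List[Turn], turn_id: int) -> List[int]:
    """Follow evidence links backwards; return supporting turn ids sorted by id."""
    by_id = {t.turn_id: t for t in turns}
    seen = set()
    stack = [turn_id]
    while stack:
        current = stack.pop()
        for parent in by_id[current].evidence_turn_ids:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)
```

Under this assumed layout, calling trace_evidence on a late turn returns every earlier turn that its label depends on, directly or indirectly, which is one simple reading of "evidence-based causal tracking".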

Key Features of MISID

Multimodal Interactions: The dataset captures both verbal and non-verbal cues, enabling a richer understanding of intent.
Multi-turn Conversations: It includes extensive dialogues that reflect the complexities of human interactions over time.
Multi-participant Engagement: The dataset comprises interactions among multiple individuals, simulating real-world strategic scenarios.
Complex Deceptive Narratives: Participants sustain deceptive narratives over extended interactions, so intent often has to be inferred from evidence spread across many turns.

Challenges in Current Intent Recognition Models

Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios:

Text-prior Visual Hallucination: Models lean too heavily on textual priors and, as a result, misread or invent visual content when the visual context is crucial.
Impaired Cross-modal Synergy: Current models struggle to effectively integrate information from multiple modalities.
Limited Capacity in Chaining Causal Cues: The ability to track and infer causality across extended dialogues is often lacking.
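One simple way to probe the first deficiency is to compare a model's predictions with and without the visual stream. The sketch below is not the evaluation protocol from the paper (the source does not describe it); the function name and the idea of a "text reliance rate" are assumptions introduced here for illustration.

```python
from typing import Sequence


def text_reliance_rate(full_preds: Sequence[str],
                       text_only_preds: Sequence[str]) -> float:
    """Fraction of examples whose prediction is identical with and without video.

    A value close to 1.0 suggests the visual input rarely changes the answer.
    """
    if len(full_preds) != len(text_only_preds):
        raise ValueError("prediction lists must have the same length")
    if not full_preds:
        return 0.0
    same = sum(a == b for a, b in zip(full_preds, text_only_preds))
    return same / len(full_preds)
```

A high rate on examples where the annotation depends on visual evidence would be consistent with the text bias described above, though it does not prove it on its own.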

Proposed Solution: FRACTAM

To address these challenges, we propose FRACTAM as a baseline framework. Following a “Decouple-Anchor-Reason” paradigm, FRACTAM reduces text bias through three steps:

Extracting Pure Unimodal Factual Representations: This ensures that the model is not swayed by misleading textual cues.
Employing Two-stage Retrieval: This technique enhances long-range factual anchoring, allowing for better context retention.
Constructing Explicit Cross-modal Evidence Chains: By creating clear connections between modalities, FRACTAM improves overall interpretability and inference.
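The three steps above can be sketched as a toy pipeline. This is an illustration under stated assumptions, not FRACTAM's implementation: the Fact type, the function names, and the use of token overlap and Jaccard similarity as stand-ins for the real retrievers and encoders are all hypothetical, since the source does not describe those components.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Fact:
    """A unimodal observation, extracted per modality in isolation (hypothetical)."""
    turn_id: int
    modality: str   # e.g. "text" or "video"
    content: str


def _tokens(s: str) -> Set[str]:
    return set(s.lower().split())


def coarse_stage(query: str, facts: List[Fact], k: int) -> List[Fact]:
    """Stage 1 (assumed): cheap recall, keep the k facts sharing the most query tokens."""
    q = _tokens(query)
    scored = [(len(q & _tokens(f.content)), f) for f in facts]
    scored = [(s, f) for s, f in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in scored[:k]]


def fine_stage(query: str, candidates: List[Fact], k: int) -> List[Fact]:
    """Stage 2 (assumed): rerank candidates by Jaccard similarity, keep the top k."""
    q = _tokens(query)

    def jaccard(f: Fact) -> float:
        union = q | _tokens(f.content)
        return len(q & _tokens(f.content)) / len(union) if union else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:k]


def build_evidence_chain(facts: List[Fact]) -> str:
    """Order the anchored facts by turn and tag each with its modality."""
    ordered = sorted(facts, key=lambda f: (f.turn_id, f.modality))
    return " -> ".join(f"[t{f.turn_id}/{f.modality}] {f.content}" for f in ordered)
```

In this sketch, "decouple" corresponds to storing text and video observations as separate Fact records, "anchor" to the two retrieval stages, and "reason" to handing the resulting chain to a model as explicit evidence. A real system would replace the token-based scoring with learned retrievers.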

Conclusion

Extensive experiments demonstrate that FRACTAM enhances mainstream models’ performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset, MISID, is now available at https://naislab.cn/datasets/MISID, providing vital resources for advancing research in multimodal intent recognition.
