Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection
arXiv:2506.19420v2
Abstract
Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding. In this paper, we propose Commander-GPT, a modular decision routing framework inspired by military command theory.
Introduction
Sarcasm is hard for natural language processing (NLP) systems because it depends on context, tone, and cues that often contradict the literal meaning. LLMs, while powerful, often fail to detect it, which leads to misinterpretations in downstream applications. Commander-GPT aims to address this gap with a team of specialized LLM agents, each handling a distinct aspect of sarcasm detection.
Framework Overview
Commander-GPT orchestrates a team of specialized LLM agents, each assigned to focused sub-tasks such as keyword extraction and sentiment analysis. This modular approach allows for more nuanced understanding compared to using a single LLM. The outputs from these agents are then routed back to a central commander, which integrates the information and performs the final sarcasm judgment.
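The divide-and-route flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Sample`, `AgentSpec`, and `Commander`, the word lists, and the routing and aggregation rules are all hypothetical stand-ins, and the agents are plain functions rather than LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical input: post text plus an optional image reference."""
    text: str
    image_path: str = ""


@dataclass
class AgentSpec:
    """Hypothetical wrapper: an agent function and whether it needs an image."""
    run: Callable[[Sample], str]
    needs_image: bool = False


def _words(sample: Sample) -> List[str]:
    return [w.strip(".,!?").lower() for w in sample.text.split()]


def keyword_agent(sample: Sample) -> str:
    """Stand-in for an LLM keyword extractor: keeps words longer than 4 letters."""
    return ",".join(w for w in _words(sample) if len(w) > 4)


def sentiment_agent(sample: Sample) -> str:
    """Stand-in for an LLM sentiment analyzer, using tiny toy word lists."""
    positive = {"love", "great", "wonderful"}
    negative = {"stuck", "broken", "traffic"}
    words = set(_words(sample))
    has_pos, has_neg = bool(words & positive), bool(words & negative)
    if has_pos and has_neg:
        return "mixed"
    return "positive" if has_pos else ("negative" if has_neg else "neutral")


def caption_agent(sample: Sample) -> str:
    """Stand-in for an image captioner; returns a placeholder string only."""
    return f"<caption of {sample.image_path}>"


class Commander:
    def __init__(self, agents: Dict[str, AgentSpec]):
        self.agents = agents

    def route(self, sample: Sample) -> List[str]:
        # Toy routing rule (assumption): skip image agents when no image exists.
        return [name for name, spec in self.agents.items()
                if sample.image_path or not spec.needs_image]

    def decide(self, sample: Sample) -> bool:
        # Collect one report per routed agent, then aggregate.
        reports = {name: self.agents[name].run(sample)
                   for name in self.route(sample)}
        # Toy aggregation rule (assumption): clashing sentiment hints at sarcasm.
        return reports.get("sentiment") == "mixed"


commander = Commander({
    "keywords": AgentSpec(keyword_agent),
    "sentiment": AgentSpec(sentiment_agent),
    "caption": AgentSpec(caption_agent, needs_image=True),
})
```

Calling `commander.decide(Sample("I love being stuck in traffic."))` routes only the two text agents, since no image is attached, and flags the post because the toy sentiment agent reports mixed cues. In the framework itself, both the routing and the final judgment would be made by the commander model rather than by fixed rules.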
Components of Commander-GPT
The framework consists of three types of centralized commanders:
- Lightweight Encoder-Based Commander: Uses an encoder model such as multimodal BERT for efficient processing.
- Moderately Capable Commanders: Four small autoregressive language models, such as DeepSeek-VL, serve as intermediate decision-makers.
- Large LLM-Based Commanders: Two advanced models, Gemini Pro and GPT-4o, perform task routing, output aggregation, and sarcasm decision-making in a zero-shot fashion.
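For the large LLM-based commanders, zero-shot aggregation amounts to handing the agents' outputs to the model in a prompt and parsing its verdict. The sketch below shows one way this could look. The prompt wording, the function names `build_commander_prompt` and `parse_decision`, and the one-word answer format are assumptions for illustration, not the prompts used in the paper, and no model API is called.

```python
from typing import Dict


def build_commander_prompt(text: str, reports: Dict[str, str]) -> str:
    """Assemble a zero-shot prompt from the post text and the agent reports."""
    lines = [
        "You are the commander. Decide whether the post is sarcastic.",
        f"Post text: {text}",
        "Agent reports:",
    ]
    # Sort agent names so the prompt is deterministic for a given input.
    for name in sorted(reports):
        lines.append(f"- {name}: {reports[name]}")
    lines.append("Answer with exactly one word: sarcastic or literal.")
    return "\n".join(lines)


def parse_decision(reply: str) -> bool:
    """Map the model's reply to a boolean; raise if it cannot be parsed."""
    token = reply.strip().lower()
    if token.startswith("sarcastic"):
        return True
    if token.startswith("literal"):
        return False
    raise ValueError(f"Unparseable commander reply: {reply!r}")
```

The answer options are deliberately chosen so that neither is a prefix of the other, which keeps the parser simple. A reply such as `"Sarcastic."` parses to `True`, and anything outside the two expected words raises an error rather than being guessed at.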
Evaluation and Results
We evaluated Commander-GPT on the MMSD and MMSD 2.0 benchmarks and compared five prompting strategies. The framework improved F1 scores over state-of-the-art (SoTA) baselines by 4.4% and 11.7% on average.
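For reference, the F1 score is the harmonic mean of precision and recall on the positive (sarcastic) class. The sketch below computes it for binary labels. The function name `f1_score` and the toy labels are illustrative only and are not taken from the paper's experiments.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1 = sarcastic)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (not from the paper): 2 true positives, 1 false positive,
# 1 false negative, so precision = recall = 2/3.
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_f1 = f1_score(toy_true, toy_pred)
```

On these toy labels precision and recall are both 2/3, so the F1 score is also 2/3.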
Conclusion
Commander-GPT is a promising approach to sarcasm detection in multimodal contexts. By combining specialized LLM agents under a central commander, the framework shows notable improvements over existing methods. As sarcasm detection becomes more relevant to applications such as social media analysis and customer service, this modular design offers a path toward more accurate understanding of complex human communication.
Future Work
Future research will focus on refining the routing mechanisms and exploring the incorporation of additional modalities, such as visual and auditory cues, to further enhance the sarcasm detection capabilities of Commander-GPT.
