CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection
arXiv:2604.03329v1 (cross-list)
Abstract
Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional video-to-audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention.
Introduction
Violence detection in video has drawn attention because of its implications for security and public safety. Many methods rely on visual cues alone. Audio can improve detection accuracy, but real-world soundscapes are often noisy or only weakly related to the visible scene. CoLoRSMamba addresses this by letting the video stream steer how the audio stream is processed, rather than treating the two modalities symmetrically.
Methodology
CoLoRSMamba couples two components, VideoMamba and AudioMamba, through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate. Together they adapt the AudioMamba projections that produce the selective state-space parameters (the step-size pathway included):
- Delta
- B
- C
This design yields scene-aware audio dynamics without token-level cross-attention: video information reaches the audio branch through the CLS token at each layer rather than through attention between video and audio tokens.
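The conditioning step described above can be illustrated with a minimal sketch. The summary does not give the exact formulation, so everything here is an assumption made for illustration: the modulation vector is taken to scale the low-rank (bottleneck) channels of a LoRA update, the stabilization gate is taken to be a sigmoid-bounded scalar, and the names `conditional_lora_projection`, `W_mod`, and `w_gate` are hypothetical. The same pattern would be applied to each projection that produces Delta, B, or C.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def conditional_lora_projection(x, cls, W, A, B, W_mod, w_gate):
    """Return W x + g * B (m * (A x)), where m and g depend on the video CLS token.

    Assumed shapes: W is (d_out, d_audio), A is (rank, d_audio),
    B is (d_out, rank), W_mod is (rank, d_video), w_gate has length d_video.
    """
    m = matvec(W_mod, cls)  # modulation vector: one scale per low-rank channel
    g = sigmoid(sum(w * c for w, c in zip(w_gate, cls)))  # scalar gate in (0, 1)
    base = matvec(W, x)  # unconditioned audio projection
    low_rank = [m_k * h_k for m_k, h_k in zip(m, matvec(A, x))]
    delta = matvec(B, low_rank)  # video-conditioned low-rank update
    return [b + g * d for b, d in zip(base, delta)]


rng = random.Random(0)
d_audio, d_video, d_out, rank = 8, 6, 4, 2
W = rand_matrix(d_out, d_audio, rng)
A = rand_matrix(rank, d_audio, rng)
B = rand_matrix(d_out, rank, rng)
W_mod = rand_matrix(rank, d_video, rng)
w_gate = [rng.gauss(0.0, 0.1) for _ in range(d_video)]

x = [rng.gauss(0.0, 1.0) for _ in range(d_audio)]      # one audio token
cls = [rng.gauss(0.0, 1.0) for _ in range(d_video)]    # video CLS token
y = conditional_lora_projection(x, cls, W, A, B, W_mod, w_gate)
```

In this sketch the cost of conditioning is independent of the number of video tokens, since only the CLS vector crosses between branches. A zero CLS vector leaves the audio projection unchanged, which makes the unconditioned model a special case.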
Training and Evaluation
CoLoRSMamba is trained on a binary classification task with an additional symmetric AV-InfoNCE objective, which aligns clip-level audio and video embeddings. For a fair evaluation of multimodal systems, the authors curate audio-filtered clip-level subsets of the NTU-CCTV and DVD datasets, retaining only those clips for which audio is available.
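A symmetric InfoNCE loss over clip-level embeddings can be sketched as follows. This is a generic formulation, not necessarily the authors' exact one: the cosine-similarity logits, the temperature value of 0.07, and the function names are assumptions, and the summary does not say how this term is weighted against the classification loss. Row i of `audio` and row i of `video` are assumed to come from the same clip.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        log_sum_exp = mx + math.log(sum(math.exp(z - mx) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(audio, video, temperature=0.07):
    """Average of the audio-to-video and video-to-audio InfoNCE losses."""
    a = [l2_normalize(e) for e in audio]
    v = [l2_normalize(e) for e in video]
    # logits[i][j] = cosine similarity of audio clip i and video clip j, scaled
    logits = [
        [sum(x * y for x, y in zip(a_i, v_j)) / temperature for v_j in v]
        for a_i in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # video-to-audio direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is low when each clip's audio embedding is closest to its own video embedding and high when the pairing is broken, which is the alignment behavior the objective is meant to encourage.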
Results
CoLoRSMamba outperforms a range of baselines, including audio-only, video-only, and other multimodal systems. On the audio-filtered subsets it achieves:
- 88.63% accuracy / 86.24% F1-V on NTU-CCTV
- 75.77% accuracy / 72.94% F1-V on DVD
CoLoRSMamba also offers a favorable trade-off between accuracy and computational efficiency, outperforming several larger models while using fewer parameters and FLOPs.
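For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and a per-class F1 score from clip-level labels. It assumes that F1-V denotes the F1 score of the violent class (label 1), which this summary does not state explicitly. The labels are toy values chosen for illustration and are not results from the paper.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the `positive` class) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Toy labels: 1 = violent clip, 0 = non-violent clip
acc, f1_v = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(round(acc, 4), round(f1_v, 4))  # 0.6 0.6667
```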
Conclusion
CoLoRSMamba integrates video and audio for violence detection by letting the VideoMamba CLS token steer the AudioMamba state-space dynamics through conditional LoRA, with no token-level cross-attention. On the audio-filtered NTU-CCTV and DVD subsets it outperforms audio-only, video-only, and multimodal baselines while using fewer parameters and FLOPs than several larger models.
