CoLoRSMamba: Advanced Multimodal Violence Detection Model

CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection

Source: arXiv:2604.03329v1 (cross-listed)

Abstract

Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional video-to-audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention.

Introduction

Detecting violence in video matters for security and public safety. Many methods rely on visual cues alone, although audio can improve detection accuracy. The difficulty is that real-world audio is often noisy or only loosely related to what is on screen, so simply adding an audio branch does not guarantee a gain. CoLoRSMamba addresses this by letting the video stream steer how the audio stream is processed, so that the audio model adapts to the visible scene.

Methodology

CoLoRSMamba combines two components, VideoMamba and AudioMamba, and couples them in one direction, from video to audio, through a mechanism called CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token generates a channel-wise modulation vector and a stabilization gate. These two signals adapt the AudioMamba projections that produce the selective state-space parameters:

Delta (the step size)
B (the input matrix)
C (the output matrix)

This design yields scene-aware audio dynamics without token-level cross-attention: the video stream influences the audio stream through a single CLS token per layer rather than through attention over every pair of video and audio tokens.
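The coupling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `conditional_lora`, the tensor shapes, the choice to apply the modulation vector on the low-rank channels, and the sigmoid form of the gate are all hypothetical, since the summary does not specify them.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conditional_lora(x_audio, cls_video, W, A, B_up, W_mod, w_gate):
    """Base projection plus a low-rank update steered by the video CLS token.

    x_audio:   (T, d_in)     audio tokens
    cls_video: (d_cls,)      video CLS token from the same layer
    W:         (d_in, d_out) base projection (e.g. one producing Delta, B or C)
    A:         (d_in, r)     LoRA down-projection
    B_up:      (r, d_out)    LoRA up-projection
    W_mod:     (d_cls, r)    maps CLS to a modulation vector (assumed shape)
    w_gate:    (d_cls,)      maps CLS to a scalar gate (assumed form)
    """
    m = cls_video @ W_mod              # channel-wise modulation vector, shape (r,)
    g = sigmoid(cls_video @ w_gate)    # stabilization gate, scalar in (0, 1)
    low_rank = (x_audio @ A) * m       # modulate the low-rank channels
    return x_audio @ W + g * (low_rank @ B_up)


rng = np.random.default_rng(0)
T, d_in, d_out, r, d_cls = 4, 8, 6, 2, 5
x = rng.normal(size=(T, d_in))
cls_a = rng.normal(size=d_cls)
cls_b = rng.normal(size=d_cls)
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
B_up = rng.normal(size=(r, d_out))
W_mod = rng.normal(size=(d_cls, r))
w_gate = rng.normal(size=d_cls)

y_a = conditional_lora(x, cls_a, W, A, B_up, W_mod, w_gate)
y_b = conditional_lora(x, cls_b, W, A, B_up, W_mod, w_gate)
# With a zero up-projection (the usual LoRA initialization) the update vanishes.
y_zero = conditional_lora(x, cls_a, W, A, np.zeros((r, d_out)), W_mod, w_gate)
```

Two different CLS tokens give two different projections of the same audio tokens, which is the sense in which the audio dynamics become scene-aware. The cost is a pair of small matrix products per layer rather than an attention map over all token pairs.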

Training and Evaluation

CoLoRSMamba is trained on a binary classification task (violent versus non-violent) combined with a symmetric AV-InfoNCE objective. The contrastive objective aligns clip-level audio and video embeddings, pulling together the audio and video embeddings of the same clip and pushing apart those of different clips. For a fair evaluation of the multimodal system, the authors curate audio-filtered clip-level subsets of the NTU-CCTV and DVD datasets, retaining only clips for which audio is available so that every evaluated clip has both modalities.
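A symmetric InfoNCE loss over clip-level embeddings can be sketched as follows. This is a generic formulation, not the paper's exact objective: the function name `symmetric_info_nce`, the cosine-similarity scoring, and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    # Subtract the max first for numerical stability.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(audio_emb, video_emb, temperature=0.07):
    """Average of audio-to-video and video-to-audio InfoNCE losses.

    Row i of audio_emb and row i of video_emb come from the same clip;
    every other row in the batch serves as a negative.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature     # (N, N) cosine similarities, scaled
    idx = np.arange(len(a))
    loss_a2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_v2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2v + loss_v2a)


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
# Perfectly aligned pairs versus pairs shifted by one clip.
loss_matched = symmetric_info_nce(emb, emb)
loss_mismatched = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))
```

The loss is low when each clip's audio embedding is closest to its own video embedding and high when the pairing is wrong. In training, a term like this would be added to the classification loss; the summary does not state the relative weighting.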

Results

On the audio-filtered subsets, CoLoRSMamba outperforms the audio-only, video-only, and multimodal baselines it is compared against. It achieves:

88.63% accuracy / 86.24% F1-V on NTU-CCTV
75.77% accuracy / 72.94% F1-V on DVD

CoLoRSMamba also offers a favorable trade-off between accuracy and computational cost, outperforming several larger models while using fewer parameters and FLOPs.
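For readers who want to reproduce the two reported metrics on their own predictions, the sketch below computes accuracy and the F1 score of the positive class. It assumes that F1-V denotes the F1 score on the violent class, which the summary does not define; the function name and the toy labels are hypothetical.

```python
def accuracy_and_f1_violence(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy example: 1 = violent, 0 = non-violent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, f1_v = accuracy_and_f1_violence(y_true, y_pred)
print(acc, f1_v)  # prints 0.75 0.75
```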

Conclusion

CoLoRSMamba integrates video and audio for violence detection by conditioning the audio model on the video stream rather than fusing the two with token-level cross-attention. On the audio-filtered NTU-CCTV and DVD subsets it outperforms the compared baselines while using fewer parameters and FLOPs than several larger models, which makes it a useful reference point for future work on multimodal violence detection.
