CoLoRSMamba: Advanced Multimodal Violence Detection Model

CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection

Source: arXiv:2604.03329v1 (announce type: cross)

Abstract

Violence detection benefits from audio, but real-world soundscapes can be noisy or weakly related to the visible scene. We present CoLoRSMamba, a directional Video to Audio multimodal architecture that couples VideoMamba and AudioMamba through CLS-guided conditional LoRA. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate that adapt the AudioMamba projections responsible for the selective state-space parameters (Delta, B, C), including the step-size pathway, yielding scene-aware audio dynamics without token-level cross-attention.

Introduction

Detecting violence in video matters for security and public safety. Many methods rely on visual cues alone. Adding audio can improve accuracy, but real-world soundscapes are often noisy or only weakly related to what is visible on screen. CoLoRSMamba addresses this with a directional video-to-audio design in which the video stream steers how the audio stream is processed.

Methodology

CoLoRSMamba couples two backbones, VideoMamba and AudioMamba, through CLS-guided conditional LoRA. The coupling is directional, from video to audio. At each layer, the VideoMamba CLS token produces a channel-wise modulation vector and a stabilization gate. Together these adapt the AudioMamba projections that produce the selective state-space parameters, including the step-size pathway:

  • Delta (the step size)
  • B
  • C

This design yields scene-aware audio dynamics without token-level cross-attention: the video stream influences the audio stream only through the per-layer CLS token, not through pairwise interactions between audio and video tokens.
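To make the conditioning idea concrete, here is a minimal sketch of one projection with a CLS-steered low-rank update. It is not the authors' implementation. The class name, the dimensions, the choice to apply the modulation vector on the low-rank bottleneck channels, and the use of a scalar sigmoid gate are all assumptions made for illustration; the source only states that the CLS token produces a channel-wise modulation vector and a stabilization gate. In a real model, a projection like this would feed the Delta, B, and C pathways.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ConditionalLoRAProjection:
    """Base audio projection plus a low-rank update steered by a video CLS vector.

    Assumed form (hypothetical, not confirmed by the source):
        y = W x + gate * B (modulation * (A x))
    """

    def __init__(self, in_dim, out_dim, rank, cls_dim, seed=0):
        rng = random.Random(seed)
        self.W = random_matrix(out_dim, in_dim, rng)  # base audio projection
        self.A = random_matrix(rank, in_dim, rng)     # LoRA down-projection
        self.B = random_matrix(out_dim, rank, rng)    # LoRA up-projection
        # Hypothetical heads that map the video CLS vector to the two signals.
        self.to_modulation = random_matrix(rank, cls_dim, rng)
        self.to_gate = random_matrix(1, cls_dim, rng)

    def condition(self, video_cls):
        """Return (channel-wise modulation vector, gate in (0, 1))."""
        modulation = matvec(self.to_modulation, video_cls)
        gate = sigmoid(matvec(self.to_gate, video_cls)[0])
        return modulation, gate

    def forward(self, audio_token, video_cls):
        modulation, gate = self.condition(video_cls)
        base = matvec(self.W, audio_token)
        low_rank = matvec(self.A, audio_token)
        steered = [m * h for m, h in zip(modulation, low_rank)]
        update = matvec(self.B, steered)
        # The gate scales how strongly the video-conditioned update is applied.
        return [b + gate * u for b, u in zip(base, update)]


layer = ConditionalLoRAProjection(in_dim=4, out_dim=3, rank=2, cls_dim=5)
audio_token = [0.5, -1.0, 0.25, 2.0]
output = layer.forward(audio_token, video_cls=[1.0, 0.0, -1.0, 0.5, 2.0])
```

In this sketch, the cost of conditioning does not depend on the number of video tokens, because only one CLS vector per layer reaches the audio side. Token-level cross-attention would instead compare every audio token with every video token.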

Training and Evaluation

CoLoRSMamba is trained on a binary classification task together with a symmetric AV-InfoNCE objective. This objective aligns clip-level audio and video embeddings: the audio and video embeddings of the same clip are pulled together relative to embeddings from other clips. For a fair evaluation of multimodal systems, the authors curate audio-filtered clip-level subsets of the NTU-CCTV and DVD datasets, retaining only clips for which audio is available.
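A symmetric InfoNCE loss of the kind described above can be sketched as follows. This is a generic formulation, not code from the paper. The function names and the temperature value of 0.07 are assumptions, and the source does not say how this term is weighted against the classification loss.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(audio_embeddings, video_embeddings, temperature=0.07):
    """Average of the audio-to-video and video-to-audio InfoNCE losses.

    Row i of each input is assumed to come from the same clip.
    """
    audio = [l2_normalize(a) for a in audio_embeddings]
    video = [l2_normalize(v) for v in video_embeddings]
    # logits[i][j]: scaled cosine similarity between audio clip i and video clip j
    logits = [
        [sum(x * y for x, y in zip(a, v)) / temperature for v in video]
        for a in audio
    ]
    transposed = [list(column) for column in zip(*logits)]
    audio_to_video = cross_entropy_diagonal(logits)
    video_to_audio = cross_entropy_diagonal(transposed)
    return 0.5 * (audio_to_video + video_to_audio)


# Toy embeddings: matched pairs give a small loss, swapped pairs a large one.
matched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is symmetric because it scores retrieval in both directions: finding the right video for each audio clip and the right audio for each video clip.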

Results

In the reported comparisons, CoLoRSMamba outperforms audio-only, video-only, and other multimodal baselines. It achieves:

  • 88.63% accuracy / 86.24% F1-V on NTU-CCTV
  • 75.77% accuracy / 72.94% F1-V on DVD

CoLoRSMamba also offers a favorable trade-off between accuracy and computational cost, outperforming several larger models while using fewer parameters and FLOPs.
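For readers who want to reproduce the two metrics reported in this section, the sketch below computes accuracy and a class-specific F1 score. It assumes that F1-V means the F1 score on the violent class, which the source does not define. The function name and the toy labels are illustrative only and are not data from the paper.

```python
def accuracy_and_f1_violent(labels, predictions, violent=1):
    """Return (accuracy, F1 on the class marked as violent)."""
    true_pos = false_pos = false_neg = correct = 0
    for label, pred in zip(labels, predictions):
        if label == pred:
            correct += 1
        if pred == violent and label == violent:
            true_pos += 1
        elif pred == violent and label != violent:
            false_pos += 1
        elif pred != violent and label == violent:
            false_neg += 1

    accuracy = correct / len(labels)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, f1


# Toy example: 1 = violent, 0 = non-violent.
toy_labels = [1, 1, 0, 0, 1]
toy_predictions = [1, 0, 0, 1, 1]
accuracy, f1_violent = accuracy_and_f1_violent(toy_labels, toy_predictions)
```

Reporting a class-specific F1 alongside accuracy is useful when violent clips are the minority class, because accuracy alone can look high for a model that rarely predicts violence.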

Conclusion

CoLoRSMamba shows that video can steer audio processing for multimodal violence detection through CLS-guided conditional LoRA rather than token-level cross-attention. On the audio-filtered NTU-CCTV and DVD subsets, it outperforms the reported audio-only, video-only, and multimodal baselines while using fewer parameters and FLOPs than several larger models.


Lazarus Omolua
https://richlyai.com/blog