From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety
Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This article presents a hybrid edge-based action detection system designed to work within these constraints.
Abstract
The proposed system combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations.
Key Features of the Hybrid Action Detection System
- Skeleton-Based Processing: Monitoring operates on skeletal data rather than identifiable images, which keeps continuous analysis privacy-aware and computationally light.
- Vision-Language Models: These models interpret scenes semantically, supplying context and zero-shot reasoning for complex situations the system has not seen before.
- Edge Computing Implementation: The system is designed to run on a GPU-enabled edge device, with the aim of keeping latency and resource consumption low.
- Real-Time Analysis: The hybrid architecture supports real-time video analysis, which is crucial for timely responses in public safety scenarios.
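One way to picture how the two components could interact is a cascade: a cheap skeleton-based score runs on every frame, and the more expensive semantic check is invoked only when that score crosses a threshold. The sketch below is an illustration under stated assumptions, not the article's implementation. The names `motion_energy`, `HybridDetector`, and `semantic_check`, the mean-joint-displacement score, and the threshold value are all hypothetical; the article does not specify how the two stages are coupled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Keypoint = Tuple[float, float]   # (x, y), assumed normalized image coordinates
Skeleton = Sequence[Keypoint]    # one pose: joints in a fixed order


def motion_energy(prev: Skeleton, curr: Skeleton) -> float:
    """Mean Euclidean joint displacement between two consecutive poses.

    Hypothetical stand-in for a skeleton-based action model.
    """
    if len(prev) != len(curr) or not curr:
        raise ValueError("poses must be non-empty and have the same joint count")
    total = 0.0
    for (x0, y0), (x1, y1) in zip(prev, curr):
        total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return total / len(curr)


@dataclass
class HybridDetector:
    # Hypothetical callback standing in for a vision-language model query.
    # It receives a frame index here; a real system would pass the frame itself.
    semantic_check: Callable[[int], str]
    # Assumed escalation threshold; it would need tuning on real data.
    threshold: float = 0.05

    def run(self, poses: List[Skeleton]) -> List[Tuple[int, float, str]]:
        """Return (frame index, motion score, semantic label) for escalated frames."""
        events = []
        for i in range(1, len(poses)):
            energy = motion_energy(poses[i - 1], poses[i])
            if energy > self.threshold:
                # Only frames with conspicuous motion reach the costly stage.
                events.append((i, energy, self.semantic_check(i)))
        return events
```

With this structure, the semantic stage runs only on the subset of frames flagged by the motion stage, which is one plausible way to keep the average per-frame cost close to that of the skeleton path alone.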
System-Level Comparison
The focus of the research is not on developing new recognition models, but on a system-level comparison of skeleton-based and semantic approaches under realistic edge constraints. This comparison exposes the strengths and limitations of each method and guides the design of a hybrid solution that combines the strengths of both.
Evaluation and Results
The system was evaluated for latency, resource usage, and operational trade-offs in a demonstrator-based setup. Initial results indicate that combining motion-centric and semantic approaches improves detection capability. Skeleton-based detection offers fast response times, while semantic reasoning adds an understanding of actions and their context, supporting better decisions in potentially dangerous situations.
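A latency comparison of this kind can be approximated with a small timing harness that runs each processing stage over the same inputs and reports summary statistics. The following is a minimal sketch under assumptions: `measure_latency`, the warm-up count, and the simple index-based 95th-percentile estimate are illustrative choices, not the measurement procedure used in the article.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def measure_latency(stage: Callable[[object], object],
                    inputs: Iterable[object],
                    warmup: int = 2) -> Dict[str, float]:
    """Time one processing stage per input and summarize in milliseconds.

    The first `warmup` calls are executed but excluded from the statistics,
    since early calls often include one-off initialization cost.
    """
    samples_ms = []
    for i, item in enumerate(inputs):
        start = time.perf_counter()
        stage(item)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:
            samples_ms.append(elapsed_ms)
    if not samples_ms:
        raise ValueError("need more inputs than warm-up iterations")
    ordered = sorted(samples_ms)
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "count": float(len(ordered)),
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }
```

Calling the function once with the skeleton stage and once with the semantic stage, on identical frames, gives directly comparable numbers. For stages that run on a GPU, the stage callable should wait for the device to finish before returning, because asynchronous execution would otherwise make the measured time too short.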
Conclusion
The presented hybrid edge-based action detection system serves as a practical foundation for privacy-aware, real-time video analysis in public safety applications. By integrating skeleton-based motion analysis with vision-language models, the system addresses the latency, privacy, and resource constraints of conventional video analysis and provides a scalable solution for various public spaces. Future work will focus on improving accuracy and responsiveness in real-world scenarios.
