Variational Encoder–Multi-Decoder (VE-MD) for Privacy-by-functional-design (Group) Emotion Recognition
arXiv:2604.02397v1
Abstract
Group Emotion Recognition (GER) aims to infer collective affect in social environments such as classrooms, crowds, and public events. Many existing approaches rely on explicit individual-level processing, including cropped faces, person tracking, or per-person feature extraction, which makes the analysis pipeline person-centric and raises privacy concerns in deployment scenarios where only group-level understanding is needed.
This research proposes VE-MD, a Variational Encoder-Multi-Decoder framework for group emotion recognition under a privacy-aware functional design. Rather than providing formal anonymization or cryptographic privacy guarantees, VE-MD is designed to avoid explicit individual monitoring by constraining the model to predict only aggregate group-level affect, without identity recognition or per-person emotion outputs.
Key Features of VE-MD
- Joint optimization for emotion classification and internal prediction of body and facial structural representations.
- Two structural decoding strategies:
  - Transformer-based PersonQuery decoder
  - Dense Heatmap decoder accommodating variable group sizes
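To make the joint objective and the dense heatmap idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the objective is a weighted sum of a classification cross-entropy, a KL term on the variational latent, and a mean-squared error on a predicted structural heatmap. The weights `beta` and `lam`, the max-merged Gaussian target, and all function names are hypothetical and chosen only for illustration. The sketch also shows why a dense heatmap can accommodate variable group sizes: the target has the same shape however many people are in the scene.

```python
import math

def gaussian_heatmap(points, height, width, sigma=1.0):
    """Render any number of (row, col) keypoints into one fixed-size map.

    Assumption: overlapping Gaussians are merged with max; other choices exist.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for pr, pc in points:
        for r in range(height):
            for c in range(width):
                dist2 = (r - pr) ** 2 + (c - pc) ** 2
                value = math.exp(-dist2 / (2 * sigma ** 2))
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (numerically stabilised)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def heatmap_mse(pred, target):
    """Mean squared error between two heatmaps of identical shape."""
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / (len(target) * len(target[0]))

def joint_loss(logits, label, mu, log_var, pred_heatmap, target_heatmap,
               beta=1e-3, lam=1.0):
    """Hypothetical joint objective: emotion + beta * KL + lam * structure."""
    return (cross_entropy(logits, label)
            + beta * kl_to_standard_normal(mu, log_var)
            + lam * heatmap_mse(pred_heatmap, target_heatmap))

# Two scenes with different group sizes yield targets of the same shape.
two_people = gaussian_heatmap([(1, 1), (2, 3)], height=4, width=5)
five_people = gaussian_heatmap(
    [(0, 0), (1, 2), (2, 4), (3, 1), (3, 3)], height=4, width=5)

# Example call with made-up logits and latent statistics.
loss = joint_loss(logits=[0.2, 1.5, -0.3], label=1,
                  mu=[0.1, -0.2], log_var=[0.0, 0.0],
                  pred_heatmap=five_people, target_heatmap=two_people)
```

In this sketch the structural term supervises only the aggregate map, so no per-person output is produced. A query-based decoder such as the PersonQuery variant would need a different target format, which is not sketched here.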
Research Findings
Experiments on six in-the-wild datasets (two GER and four Individual Emotion Recognition (IER) benchmarks) indicate that structural supervision significantly enhances representation learning. The results highlight a crucial distinction between GER and IER:
- Optimizing the latent space alone is often inadequate for GER as it may diminish interaction-related cues.
- Maintaining explicit structural outputs proves beneficial for collective affect inference.
- In contrast, projected structural representations effectively serve as a denoising bottleneck for IER.
Performance Metrics
VE-MD achieves state-of-the-art performance on various datasets:
- GAF-3.0: Up to 90.06%
- VGAF: 82.25% with multimodal fusion including audio
- SamSemo: 77.9% (adding text modality)
- MER-MULTI: 63.8%
- DFEW: 70.7%
- EngageNet: 69.0%
Conclusion
The findings underscore the importance of preserving interaction-related structural information for group-level affect modeling while reducing reliance on prior individual feature extraction. VE-MD pursues this under a privacy-aware functional design that outputs only aggregate group-level affect; it does not provide formal anonymization or cryptographic privacy guarantees.
