A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
arXiv:2604.13448v1 (cross-listed)
Abstract: Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, evaluations of these models mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations.
In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we decompose HOI detection into multiple interpretable perspectives and analyze model behavior along each of them to study different types of failure patterns.
Introduction
Human-object interaction detection is a core task in computer vision: it lets systems understand the context of the scenes depicted in images. Despite the growing sophistication of machine learning models, it remains unclear why certain models fail in specific scenarios. This study aims to narrow that gap by examining two-stage HOI detection models across several scene configurations.
Methodology
To investigate the failure modes, we curated a subset of images from an existing HOI dataset. This subset was organized based on specific human-object interaction configurations, such as:
- Multi-person interactions
- Object sharing among multiple individuals
- Rare interaction combinations
By analyzing model behavior within each of these configurations, we sought to identify patterns that could explain prediction failures. This approach gives a more nuanced picture of model performance than aggregate accuracy metrics alone.
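The grouping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the `HOIAnnotation` schema, the `tag_configurations` helper, the tag names, and the `rare_threshold` cutoff are all hypothetical, and "rare" is approximated here as a (verb, object class) pair with few training occurrences.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HOIAnnotation:
    """One annotated interaction (hypothetical schema, not the paper's)."""
    human_id: int
    object_id: int
    verb: str
    object_class: str


def tag_configurations(annotations, triplet_counts, rare_threshold=10):
    """Return the set of configuration tags that apply to one image."""
    tags = set()

    # Multi-person: more than one distinct human takes part in an interaction.
    humans = {a.human_id for a in annotations}
    if len(humans) > 1:
        tags.add("multi_person")

    # Object sharing: one object instance is linked to several humans.
    humans_per_object = defaultdict(set)
    for a in annotations:
        humans_per_object[a.object_id].add(a.human_id)
    if any(len(h) > 1 for h in humans_per_object.values()):
        tags.add("object_sharing")

    # Rare combination: a (verb, object class) pair seen fewer than
    # rare_threshold times in the training counts (assumed definition).
    if any(
        triplet_counts.get((a.verb, a.object_class), 0) < rare_threshold
        for a in annotations
    ):
        tags.add("rare_combination")

    return tags


# Made-up example: two people ride the same bicycle, one also washes an elephant.
train_counts = Counter({("ride", "bicycle"): 500, ("wash", "elephant"): 3})
image = [
    HOIAnnotation(0, 10, "ride", "bicycle"),
    HOIAnnotation(1, 10, "ride", "bicycle"),
    HOIAnnotation(1, 11, "wash", "elephant"),
]
print(sorted(tag_configurations(image, train_counts)))
```

An image can carry several tags at once, which matches the idea of looking at the same prediction from more than one perspective rather than assigning it to a single bucket.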
Findings
Our analysis revealed several limitations of current HOI detection models:
- Context Complexity: Models often misinterpret interactions in scenes with multiple people.
- Rare Interactions: Rare interaction combinations can cause significant prediction errors, because they are poorly covered by the training data.
- Misinterpretation of Object Relationships: Models can misread the relationships between humans and objects even when their benchmark scores are high, so high benchmark performance does not by itself indicate that these relationships are understood.
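To make the last point concrete, the sketch below shows how a single overall score can mask weak performance on specific scene configurations. It is an illustration only: the `accuracy_by_configuration` helper and the tag names are hypothetical, the records are made up and are not results from the study, and a plain correct/incorrect flag stands in for whatever metric is actually used (HOI benchmarks typically report mean average precision).

```python
from collections import defaultdict


def accuracy_by_configuration(records):
    """Compute accuracy per configuration tag.

    records: iterable of (tags, is_correct) pairs, one per prediction.
    Returns {tag: accuracy}, plus an "overall" entry over all records.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tags, is_correct in records:
        # Every record counts toward "overall" and toward each of its tags.
        for tag in set(tags) | {"overall"}:
            totals[tag] += 1
            hits[tag] += int(is_correct)
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Toy, made-up records purely to show the mechanics.
records = [
    (set(), True),
    (set(), True),
    (set(), True),
    ({"multi_person"}, False),
    ({"multi_person", "rare_combination"}, False),
    ({"rare_combination"}, True),
]
report = accuracy_by_configuration(records)
for tag in sorted(report):
    print(f"{tag}: {report[tag]:.2f}")
```

In this toy data the overall score is about 0.67 while the multi-person subset scores 0.00, which is the kind of discrepancy a per-configuration breakdown is meant to surface.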
Conclusion
This study highlights the need for a deeper understanding of the underlying mechanisms that govern model performance in HOI detection. By dissecting the failure modes of two-stage models, we provide insights that can guide future research. Addressing these limitations could lead to the development of more robust models capable of accurately interpreting complex scenes and interactions.
As the field of computer vision continues to evolve, researchers and practitioners should consider not only performance metrics but also qualitative aspects of model behavior. We hope that our findings will stimulate further work on improving HOI detection methods.
