PushupBench: Your VLM is Not Good at Counting Pushups
In a study released on arXiv, researchers report a significant limitation of large vision-language models (VLMs) in video analysis. These models are good at recognizing the content and context of visual data, but they struggle to quantify actions, particularly when asked to count repetitions in exercise videos. The study introduces PushupBench, a dataset designed to evaluate how accurately VLMs count repetitions in video clips.
PushupBench comprises 446 long-form video clips with an average duration of 36.7 seconds. The clips are curated to cover various pushup techniques, giving a focused testbed for the counting capabilities of VLMs. The results expose how far current models fall short: even the best-performing frontier model reaches only 42.1% exact accuracy, meaning the predicted count matches the true count exactly. By contrast, open-source models with 4 billion parameters score around 6%, on par with supervised baselines.
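To make the headline metric concrete, here is a minimal sketch of exact accuracy: a prediction scores only when it equals the true repetition count, with no partial credit for being off by one. The function name and the toy counts are illustrative assumptions, not taken from the paper or its evaluation code.

```python
from typing import Sequence


def exact_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of clips whose predicted count equals the true count."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one clip")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical counts for five clips (not real benchmark data).
actual = [10, 12, 8, 15, 10]
predicted = [10, 11, 8, 20, 10]

score = exact_accuracy(predicted, actual)
print(f"exact accuracy: {score:.1%}")
```

Note that the near miss (11 versus 12) counts the same as the large miss (20 versus 15), which is what makes the metric strict.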
Key Findings from the Research
The study presents several critical findings that shed light on the performance of VLMs in action counting:
- Accuracy Is Misleading: The researchers emphasize that counting accuracy alone is a misleading metric. Weaker models appear to exploit the modal count (simply guessing the most common number) rather than performing the temporal reasoning that accurate counting requires.
- Fine-tuning Benefits: Fine-tuning VLMs on counting with 1,000 samples improved performance on several general video understanding benchmarks:
  - MVBench: +2.15 points
  - PerceptionTest: +1.88 points
  - TVBench: +4.54 points
- Counting as a Proxy: The findings suggest that counting capabilities may serve as a proxy for broader temporal reasoning skills in VLMs, indicating that addressing this limitation could enhance overall model performance in video understanding.
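The point about modal-count guessing can be illustrated with a small sketch. It compares a model's exact accuracy against the score obtained by always answering the most common true count, and also reports how many distinct values the model predicts. The data, the function names, and the diversity check are assumptions made for illustration; the paper's own analysis may differ.

```python
from collections import Counter
from typing import Sequence, Tuple


def exact_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of clips whose predicted count equals the true count."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def modal_baseline(actual: Sequence[int]) -> Tuple[int, float]:
    """Most common true count, and the accuracy of always guessing it."""
    mode, freq = Counter(actual).most_common(1)[0]
    return mode, freq / len(actual)


# Hypothetical counts for ten clips (not real benchmark data).
actual = [10, 10, 10, 12, 8, 10, 15, 20, 10, 5]
constant_guesser = [10] * len(actual)                    # never watches the video
tracking_model = [10, 10, 9, 12, 8, 11, 15, 20, 10, 6]  # follows the true counts

mode, baseline = modal_baseline(actual)
print(f"modal count: {mode}, baseline accuracy: {baseline:.0%}")

for name, preds in [("constant", constant_guesser), ("tracking", tracking_model)]:
    acc = exact_accuracy(preds, actual)
    distinct = len(set(preds))  # 1 means the model emits a single answer
    print(f"{name}: accuracy={acc:.0%}, distinct predictions={distinct}")
```

In this toy setup the constant guesser lands exactly on the baseline while predicting a single value, which is the pattern to watch for: an accuracy figure only means something once it is read against the modal baseline for the same data.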
Implications for the Future of VLMs
The introduction of PushupBench opens the door to a new avenue of research aimed at improving VLMs’ temporal reasoning capabilities. As the field of AI continues to evolve, understanding the limitations of current models is crucial for developing more sophisticated systems that can accurately interpret and analyze dynamic content.
PushupBench has been incorporated into the lmms-eval framework and is hosted on pushupbench.com, giving researchers and developers a way to evaluate and improve the counting abilities of their VLMs.
As we move forward, it is essential for AI researchers to focus not only on increasing accuracy but also on fostering a deeper understanding of temporal dynamics in video content. The insights garnered from PushupBench could serve as a catalyst for future advancements, paving the way for more intelligent and capable AI systems.
