PushupBench Reveals VLMs Fail to Count Pushups Accurately

PushupBench: Your VLM is Not Good at Counting Pushups

A study released on arXiv identifies a significant limitation of large vision-language models (VLMs) in video analysis. These models recognize the content and context of visual data well, but they struggle to quantify actions, particularly when counting repetitions in exercise videos. The study introduces PushupBench, a dataset designed specifically to evaluate how accurately VLMs count repetitions in video clips.

PushupBench comprises 446 long-form video clips with an average duration of 36.7 seconds. The clips are curated to cover a variety of pushup techniques, giving a focused test of the counting capabilities of VLMs. The results show how far current models fall short: even the best-performing frontier model achieves only 42.1% exact accuracy, meaning its predicted count matches the true count on fewer than half of the clips. Open-source models with 4 billion parameters score around 6%, on par with supervised baselines.

Key Findings from the Research

The study presents several critical findings that shed light on the performance of VLMs in action counting:

Accuracy Misleading: The researchers emphasize that counting accuracy alone is a misleading metric. Weaker models appear to exploit the modal count, simply guessing the most common number of repetitions in the dataset, rather than performing the temporal reasoning that accurate counting requires.
Fine-tuning Benefits: Fine-tuning VLMs on counting tasks with a subset of 1,000 samples improved performance beyond counting itself. The fine-tuned models gained on several general video understanding benchmarks:

MVBench: Increased by 2.15 points
PerceptionTest: Increased by 1.88 points
TVBench: Increased by 4.54 points

Counting as a Proxy: The findings suggest that counting capabilities may serve as a proxy for broader temporal reasoning skills in VLMs, indicating that addressing this limitation could enhance overall model performance in video understanding.
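
To make the "Accuracy Misleading" point concrete, here is a minimal sketch of how exact accuracy can be compared against a modal-count baseline. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the PushupBench paper or from the lmms-eval implementation.

```python
from collections import Counter


def exact_accuracy(predictions, ground_truth):
    """Fraction of clips whose predicted count equals the true count."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one clip is required")
    hits = sum(p == t for p, t in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def modal_baseline(ground_truth):
    """Return the most common true count and the accuracy of always guessing it."""
    if not ground_truth:
        raise ValueError("at least one clip is required")
    mode, freq = Counter(ground_truth).most_common(1)[0]
    return mode, freq / len(ground_truth)


# Hypothetical repetition counts for ten clips (illustrative, not real data).
true_counts = [10, 10, 10, 12, 8, 15, 10, 20, 5, 10]

# A "lazy" model that ignores the video and always answers the modal count.
mode, baseline_acc = modal_baseline(true_counts)
lazy_predictions = [mode] * len(true_counts)

# A model that actually tracks repetitions but is sometimes off by one.
careful_predictions = [10, 11, 10, 12, 8, 14, 9, 20, 5, 10]

lazy_acc = exact_accuracy(lazy_predictions, true_counts)
careful_acc = exact_accuracy(careful_predictions, true_counts)

print(f"modal count: {mode}, baseline accuracy: {baseline_acc:.2f}")
print(f"lazy model accuracy: {lazy_acc:.2f}")
print(f"careful model accuracy: {careful_acc:.2f}")
print(f"careful model lift over baseline: {careful_acc - baseline_acc:.2f}")
```

In this toy example the lazy model reaches 50% exact accuracy without looking at a single frame, which is why a raw accuracy figure is more informative when reported next to the modal-count baseline for the same set of clips.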

Implications for the Future of VLMs

PushupBench gives researchers a concrete target for improving the temporal reasoning capabilities of VLMs. Understanding where current models fail is a prerequisite for building systems that can accurately interpret and analyze dynamic content.

PushupBench has been incorporated into the lmms-eval framework and is hosted on pushupbench.com. This gives researchers and developers a ready-made way to evaluate and improve the counting abilities of their VLMs.

Going forward, AI researchers will need to focus not only on raising accuracy scores but also on whether models actually understand temporal dynamics in video content. The insights from PushupBench could help drive that work.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
