VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation
Human age estimation from facial images is a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. Traditional deep learning approaches typically require large labeled datasets and domain-specific training, which makes them resource-intensive. Recent advances in large vision-language models (LVLMs) offer an alternative: estimating age zero-shot, with no task-specific training at all.
This study introduces a comprehensive zero-shot evaluation of state-of-the-art LVLMs for facial age estimation, a task that has traditionally been dominated by domain-specific convolutional networks and supervised learning techniques. The focus of this research is to assess the performance of three prominent LVLMs: GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision.
Key Findings
The evaluation is conducted on two well-known benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. The study employs eight evaluation metrics, which include:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Percentage Error (MAPE)
- Mean Bias Error (MBE)
- $R^2$ (Coefficient of Determination)
- Concordance Correlation Coefficient (CCC)
- Accuracy within ±5 years
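The eight metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration using the standard textbook definitions, not the study's own evaluation code. The function name `age_metrics`, the dictionary keys, and the `tol` parameter are hypothetical. Three conventions are assumptions: MBE is computed as predicted minus true age (so a positive value means overestimation), MAPE skips samples whose true age is 0 (where the percentage error is undefined), and CCC follows Lin's formula with population variances.

```python
import math

def age_metrics(y_true, y_pred, tol=5.0):
    """Compute eight regression metrics for age estimation (illustrative sketch)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)

    # Signed error: predicted minus true (assumed sign convention for MBE).
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]

    mae = sum(abs_errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mbe = sum(errors) / n

    # MAPE is undefined when the true age is 0, so those samples are skipped
    # (an assumption; the source does not say how such cases are handled).
    pct = [abs(e) / t for e, t in zip(errors, y_true) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")

    # Population (1/n) means, variances, and covariance.
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    # R^2 = 1 - SS_res / SS_tot, which equals 1 - MSE / Var(y_true).
    r2 = 1.0 - mse / var_t if var_t > 0 else float("nan")

    # Lin's concordance correlation coefficient.
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    ccc = 2.0 * cov / denom if denom > 0 else float("nan")

    # Fraction of predictions within +/- tol years of the true age.
    acc = sum(a <= tol for a in abs_errors) / n

    return {
        "mae": mae, "mse": mse, "rmse": rmse, "mape": mape,
        "mbe": mbe, "r2": r2, "ccc": ccc, "acc_within_tol": acc,
    }
```

Calling `age_metrics(true_ages, predicted_ages)` on two equally long lists returns all eight values at once, which keeps the per-model comparison to a single function call per dataset.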
The results indicate that general-purpose LVLMs can deliver competitive performance in zero-shot settings, with no fine-tuning or task-specific adaptation. This suggests that LVLMs are a viable option for biometric age estimation in real-world applications.
Challenges and Considerations
Despite the promising results, the study also highlights performance disparities linked to image quality and demographic subgroups. This underscores the critical need for fairness-aware multimodal inference to ensure equitable outcomes across diverse populations.
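One simple way to surface the kind of subgroup disparity described above is to compute MAE separately for each group and report the gap between the best and worst group. The sketch below is an illustration only, not the study's procedure. The function names `mae_by_group` and `mae_gap`, the record layout, and the group labels are all hypothetical.

```python
from collections import defaultdict

def mae_by_group(records):
    """Return {group: MAE} from an iterable of (group, true_age, pred_age) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of abs errors, count]
    for group, true_age, pred_age in records:
        totals[group][0] += abs(pred_age - true_age)
        totals[group][1] += 1
    return {group: err_sum / count for group, (err_sum, count) in totals.items()}

def mae_gap(per_group):
    """Difference between the worst and best per-group MAE (0 means parity)."""
    return max(per_group.values()) - min(per_group.values())

# Hypothetical records: (group label, true age, predicted age).
records = [
    ("group_a", 20, 22),
    ("group_a", 30, 26),
    ("group_b", 40, 50),
    ("group_b", 10, 10),
]
per_group = mae_by_group(records)
gap = mae_gap(per_group)
```

A single gap number hides which group is disadvantaged, so in practice the full per-group dictionary is worth reporting alongside it.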
The research offers a reproducible benchmark for future studies, focusing on strict zero-shot inference without fine-tuning. The findings also emphasize several remaining challenges in the field, including:
- Prompt sensitivity of LVLMs
- Interpretability of model predictions
- Computational costs associated with using large models
- Demographic fairness in age estimation tasks
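Prompt sensitivity, the first challenge listed above, can be made measurable by sending the same image with several prompt wordings and checking how much the parsed ages vary. The sketch below covers only the post-processing step and makes no model calls. It is an assumption-laden illustration: the helper names `parse_age` and `prompt_spread`, the 0 to 120 plausibility range, and the choice of taking the first plausible integer in a reply are all hypothetical, and the source does not describe how model replies were parsed.

```python
import re
import statistics

# Matches standalone integers of one to three digits (so "1999" is ignored).
_AGE_PATTERN = re.compile(r"\b(\d{1,3})\b")

def parse_age(reply, max_age=120):
    """Return the first plausible integer age in a free-text reply, or None."""
    for match in _AGE_PATTERN.finditer(reply):
        value = int(match.group(1))
        if 0 <= value <= max_age:
            return value
    return None

def prompt_spread(replies):
    """Population standard deviation of parsed ages across prompt variants.

    Replies with no parseable age are dropped; returns None if fewer than
    two ages remain, since spread is not meaningful for a single value.
    """
    ages = [a for a in (parse_age(r) for r in replies) if a is not None]
    if len(ages) < 2:
        return None
    return statistics.pstdev(ages)
```

A spread near zero would suggest the estimate is stable under rewording, while a large spread flags images or prompts that deserve a closer look. Replies that give a range such as "30-35" are reduced to their first number here, which is a simplification.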
Conclusion
The VLAgeBench study positions large vision-language models as promising tools for real-world applications in areas such as forensic science, healthcare monitoring, and human-computer interaction. By demonstrating that LVLMs can estimate age zero-shot, this research paves the way for further exploration and development at the intersection of computer vision and language processing.
