Dynamic Summary Generation for Interpretable Multimodal Depression Detection
Source: arXiv:2604.11334v1
Abstract
Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection.
Overview of the Proposed Framework
The proposed framework consists of three key stages:
- Binary Screening: The first stage makes a coarse two-way decision that flags individuals who may be at risk for depression.
- Five-Class Severity Classification: The second stage assigns depression severity to one of five classes, giving a finer picture of the individual’s condition than the binary flag.
- Continuous Regression: The final stage predicts a continuous severity score, the finest-grained output of the three stages, which can inform how interventions are tailored.
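The coarse-to-fine ordering of the three stages can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the names `StageResult`, `run_pipeline`, and the three stage functions are hypothetical, the mean-of-features "risk" score and the 0.5 threshold are placeholders for trained models, and passing earlier summaries to later stages is an assumption about how the stages are chained.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StageResult:
    name: str
    prediction: float
    summary: str


# Assumption: a stage sees the input features plus every summary produced so far.
Stage = Callable[[List[float], List[str]], StageResult]


def _risk(features: List[float]) -> float:
    """Placeholder risk score: mean of features assumed to lie in [0, 1]."""
    return sum(features) / len(features)


def binary_screening(features: List[float], prior: List[str]) -> StageResult:
    flagged = _risk(features) >= 0.5  # illustrative threshold, not from the paper
    text = "Screening: at risk." if flagged else "Screening: not at risk."
    return StageResult("binary_screening", float(flagged), text)


def severity_classification(features: List[float], prior: List[str]) -> StageResult:
    level = min(4, int(_risk(features) * 5))  # five classes, indexed 0..4
    text = f"Severity class {level} of 0-4, with {len(prior)} earlier summaries as context."
    return StageResult("severity_classification", float(level), text)


def severity_regression(features: List[float], prior: List[str]) -> StageResult:
    score = _risk(features)  # stands in for a learned continuous regressor
    text = f"Severity score {score:.2f}, with {len(prior)} earlier summaries as context."
    return StageResult("severity_regression", score, text)


def run_pipeline(
    features: List[float],
    stages: Sequence[Stage] = (binary_screening, severity_classification, severity_regression),
) -> List[StageResult]:
    """Run the stages in order, handing each one the summaries written before it."""
    results: List[StageResult] = []
    for stage in stages:
        prior = [r.summary for r in results]
        results.append(stage(features, prior))
    return results
```

Calling `run_pipeline([0.5, 0.75, 1.0])` returns one `StageResult` per stage, each carrying both a prediction and the summary text that the next stage receives.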
Role of Large Language Models
At each stage, a large language model generates progressively richer clinical summaries that serve a dual purpose:
- They enhance the clinician’s understanding of the patient’s condition.
- They guide a multimodal fusion module that integrates various features including text, audio, and video.
Multimodal Fusion Module
The multimodal fusion module combines text, audio, and video features under the guidance of the LLM-generated summaries. The resulting predictions come with a transparent rationale, which matters for trust between healthcare providers and patients.
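One plausible way for a summary to guide fusion is to treat its embedding as an attention query over the per-modality embeddings. The sketch below assumes exactly that; the source does not specify the fusion mechanism, so `summary_guided_fusion` and the scaled dot-product weighting are illustrative choices, and all vectors are assumed to be precomputed and of equal length.

```python
import math
from typing import Dict, List, Tuple


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summary_guided_fusion(
    summary_vec: List[float],
    modality_vecs: Dict[str, List[float]],
) -> Tuple[List[float], Dict[str, float]]:
    """Weight each modality by its scaled dot-product similarity to the summary.

    Hypothetical helper: returns the fused vector and the per-modality weights.
    """
    names = list(modality_vecs)
    dim = len(summary_vec)
    scores = [_dot(summary_vec, modality_vecs[n]) / math.sqrt(dim) for n in names]
    weights = _softmax(scores)
    fused = [
        sum(w * modality_vecs[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))
```

Returning the weights alongside the fused vector is a deliberate choice in this sketch: they show which modality the summary emphasized, which is one simple way to expose part of a prediction's rationale.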
Assessment Report Generation
After the data has passed through all three stages, the system consolidates the generated summaries into a concise, human-readable assessment report. The report is meant to be accessible to both clinicians and patients, so that the findings can be readily understood and acted upon.
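The consolidation step can be illustrated with a plain template that joins the per-stage summaries into one report. This is only a sketch under stated assumptions: `build_report` is a hypothetical helper, the report title and numbering are invented for illustration, and the actual system may well use an LLM rather than a fixed template to write the report.

```python
from typing import List, Tuple


def build_report(stage_summaries: List[Tuple[str, str]]) -> str:
    """Join (stage name, summary) pairs into a numbered plain-text report."""
    lines = ["Depression Assessment Report", ""]
    for index, (stage, summary) in enumerate(stage_summaries, start=1):
        # Strip stray whitespace so each entry stays on one clean line.
        lines.append(f"{index}. {stage}: {summary.strip()}")
    return "\n".join(lines)
```

For example, passing the three stage summaries in order yields a title line, a blank line, and one numbered entry per stage.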
Experimental Validation
To evaluate the framework, extensive experiments were conducted on two benchmark datasets, E-DAIC and CMDC. The results show significant improvements over state-of-the-art baselines in both accuracy and interpretability. Key findings include:
- Enhanced accuracy in identifying individuals at risk for depression.
- Improved interpretability, allowing clinicians to understand the reasoning behind the system’s predictions.
Conclusion
Dynamic summary generation with large language models offers a coarse-to-fine approach to multimodal depression detection in which each prediction is accompanied by a readable clinical summary. Because depression is widely underdiagnosed, an accurate and interpretable screening aid of this kind holds promise for supporting earlier identification and more timely intervention.
