Evaluating the Limitations of Protein Sequence Representations for Parkinson’s Disease Classification
arXiv:2604.11852v1
Abstract
The identification of reliable molecular biomarkers for Parkinson’s disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including:
- Amino acid composition
- K-mers
- Physicochemical descriptors
- Hybrid representations
- Embeddings from protein language models
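As a rough illustration of the two simplest representations in the list above, the following Python sketch computes amino acid composition and overlapping k-mer counts for a single sequence. The function names, the 20-letter alphabet, and the default k = 2 are assumptions made for this example; the study's actual feature-extraction settings are not given here.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def amino_acid_composition(seq):
    """Relative frequency of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)  # ignores non-standard symbols
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kmer_counts(seq, k=2):
    """Counts of overlapping k-mers as a fixed-length vector over all 20**k k-mers."""
    seq = seq.upper()
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]
```

Both functions return fixed-length vectors, so sequences of different lengths map to the same feature space. The k-mer vector grows as 20^k, which is one reason small k values are typical for this kind of representation.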
All representations were assessed under a nested stratified cross-validation framework, in which hyperparameters are tuned only on inner folds so that the outer test folds yield an unbiased performance estimate. The best-performing configuration (ProtBERT + MLP) achieved an F1-score of 0.704 ± 0.028 and a ROC-AUC of 0.748 ± 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reached comparable F1 values (up to approximately 0.667) but behaved in a highly imbalanced way: recall was close to 0.98 while precision was around 0.50, reflecting a strong bias toward positive predictions.
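To make the leakage-free property of nested stratified cross-validation concrete, here is a minimal sketch that generates only the index splits, using the Python standard library. The helper names (`stratified_folds`, `nested_splits`) and the fold counts are hypothetical choices for this example, not the study's configuration; model fitting and scoring are left out.

```python
import random


def stratified_folds(indices, labels, n_splits, seed=0):
    """Split `indices` into n_splits folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    folds = [[] for _ in range(n_splits)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % n_splits].append(i)  # deal each class round-robin
    return folds


def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_pairs) for nested stratified CV.

    inner_pairs holds (inner_train, inner_val) index lists drawn only from
    outer_train, so the outer test fold never influences model selection.
    """
    outer = stratified_folds(list(range(len(labels))), labels, outer_k, seed)
    for f, outer_test in enumerate(outer):
        outer_train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = stratified_folds(outer_train, labels, inner_k, seed + 1 + f)
        inner_pairs = []
        for h, inner_val in enumerate(inner):
            inner_train = [i for g, fold in enumerate(inner) if g != h for i in fold]
            inner_pairs.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_pairs
```

In use, hyperparameters would be chosen by averaging a metric over `inner_pairs`, the selected model refit on `outer_train`, and the reported score computed once on `outer_test`.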
Key Findings
Across representations, performance differences remain within a narrow range (F1 between roughly 0.60 and 0.70). Unsupervised analyses reveal no intrinsic structure aligned with the class labels, and a Friedman test (p = 0.1749) does not indicate statistically significant differences across models. Together, these results point to substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson’s disease classification.
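For readers unfamiliar with the Friedman test mentioned above, the sketch below shows how such a comparison can be computed from a table of per-fold scores. It is a standard-library approximation written for this example: the function names and the folds-by-models input layout are assumptions, and the statistic omits the tie correction that statistical packages usually apply.

```python
import math


def average_ranks(row):
    """Rank values ascending (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Chi-square survival function for integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z)
    while a < df / 2:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)  # Q(a+1, z) from Q(a, z)
        a += 1
    return q


def friedman_test(scores):
    """scores[i][j] = metric of model j on fold i. Returns (statistic, p_value)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return stat, chi2_sf(stat, k - 1)
```

A large p-value, as reported in the study, means the models' rankings across folds are not consistent enough to reject the hypothesis that all models perform equally.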
Conclusion
This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling. Improving classification for complex diseases like Parkinson’s will therefore depend on integrating biological information beyond the primary sequence.
Implications for Future Research
For work on the molecular basis of Parkinson’s disease, this study highlights the importance of widening the analysis beyond sequence-derived features. Future research should focus on models that can leverage multi-modal data, combining sequence information with other biological features, to improve both understanding of the disease and diagnostic tools.
