Limitations of Protein Sequence Data in Parkinson's Classification

Evaluating the Limitations of Protein Sequence Representations for Parkinson’s Disease Classification

Source: arXiv:2604.11852v1 (cross-listed)

Abstract

The identification of reliable molecular biomarkers for Parkinson’s disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including:

Amino acid composition
K-mers
Physicochemical descriptors
Hybrid representations
Embeddings from protein language models
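As a rough illustration of the two simplest representations in the list above, the following Python sketch computes an amino acid composition vector and overlapping k-mer counts. The function names, the toy sequence, and the choice of k = 2 are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (length-20 list)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kmer_counts(seq, k=2):
    """Return counts of all overlapping substrings of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical toy sequence, used only to exercise the functions.
toy = "MKVLAAGIVK"
comp = amino_acid_composition(toy)
kmers = kmer_counts(toy, k=2)

print(dict(zip(AMINO_ACIDS, comp)))
print(kmers.most_common(3))
```

Both features discard most positional information, which is one plausible reason such representations struggle to separate classes on their own.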

All representations were assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration, ProtBERT embeddings with a multilayer perceptron classifier (ProtBERT + MLP), achieved an F1-score of 0.704 ± 0.028 and a ROC-AUC of 0.748 ± 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reached comparable F1 values (up to approximately 0.667) but behaved in a highly imbalanced way, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions.
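To see why the k-mer F1 value is less encouraging than it looks, it helps to work through the arithmetic. The sketch below recomputes F1 from the approximate precision and recall quoted above and compares it with a trivial classifier that labels every sample positive. The comparison assumes roughly balanced classes, which the summary does not state; under that assumption the trivial classifier has precision 0.50 and recall 1.0.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate values quoted for the k-mer representation.
kmer_f1 = f1_score(0.50, 0.98)

# Assumption: balanced classes, so predicting "positive" for every
# sample gives precision 0.50 and recall 1.0.
always_positive_f1 = f1_score(0.50, 1.0)

print(f"k-mer F1 from quoted precision/recall: {kmer_f1:.3f}")
print(f"always-positive baseline F1:           {always_positive_f1:.3f}")
```

Under the balanced-class assumption, the two values are nearly identical, which is consistent with the reading that the k-mer models mostly predict the positive class.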

Key Findings

Across representations, performance differences remain within a narrow range (F1 roughly between 0.60 and 0.70). Unsupervised analyses reveal no intrinsic structure aligned with class labels, and a Friedman test (p = 0.1749) does not indicate statistically significant differences across models. These results show substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson’s disease classification.
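For readers unfamiliar with the Friedman test, the following sketch shows how the statistic is formed from per-fold ranks. The scores are hypothetical numbers invented for this example, not results from the paper, and the helper names are assumptions. The sketch ignores ties and uses a closed-form chi-square tail that is valid only for an even number of degrees of freedom, so it is limited to cases such as three models (two degrees of freedom).

```python
import math


def friedman_statistic(scores):
    """Friedman chi-square statistic.

    scores[i][j] is the metric for fold i and model j.
    Assumes no ties within a fold (no tie correction is applied).
    """
    n = len(scores)      # number of folds (blocks)
    k = len(scores[0])   # number of models (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        # Rank models within the fold: 1 = lowest score, k = highest.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


def chi2_sf_even_df(x, df):
    """Chi-square survival function; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


# Hypothetical F1 scores: 5 folds (rows) x 3 models (columns).
scores = [
    [0.70, 0.66, 0.64],
    [0.68, 0.69, 0.65],
    [0.72, 0.67, 0.66],
    [0.66, 0.68, 0.69],
    [0.71, 0.65, 0.67],
]

q = friedman_statistic(scores)
p = chi2_sf_even_df(q, df=len(scores[0]) - 1)
print(f"Friedman statistic = {q:.3f}, p = {p:.4f}")
```

A large p-value here, as with the p = 0.1749 reported above, means the rank differences between models are small enough to be compatible with chance. The chi-square approximation is coarse for so few folds, so a statistics library with exact or corrected variants would be preferable in practice.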

Conclusion

This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as:

Structural descriptors
Functional descriptors
Interaction-based descriptors

are required for robust disease modeling. The findings emphasize the need to integrate biological information beyond the primary sequence in order to improve classification of complex diseases such as Parkinson’s.

Implications for Future Research

As the scientific community continues to explore the genetic and molecular underpinnings of Parkinson’s disease, this study highlights the importance of expanding the scope of analysis to include diverse biological features. Future research should focus on developing more sophisticated models that can leverage multi-modal data, thus enhancing our understanding of the disease and improving diagnostic tools.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
