Limitations of Protein Sequence Data in Parkinson’s Classification

Evaluating the Limitations of Protein Sequence Representations for Parkinson’s Disease Classification

Summary: arXiv:2604.11852v1 (announce type: cross)

Abstract

The identification of reliable molecular biomarkers for Parkinson’s disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including:

  • Amino acid composition
  • K-mers
  • Physicochemical descriptors
  • Hybrid representations
  • Embeddings from protein language models
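To make the two simplest representations in the list concrete, here is a minimal Python sketch of amino acid composition and overlapping k-mer counts. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the choice of k = 3, the handling of non-standard residues, and the toy sequence are all hypothetical, since the source does not describe these details.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Relative frequency of each standard amino acid (a 20-dimensional vector).

    Non-standard characters are ignored (an assumption made for this sketch).
    """
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def kmer_counts(seq, k=3):
    """Counts of overlapping k-mers; windows with non-standard residues are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in AMINO_ACIDS for c in kmer):
            counts[kmer] += 1
    return counts


# Toy sequence, made up for illustration only (not a real protein).
toy = "MKVLAAGIVKAAG"
composition = amino_acid_composition(toy)
trimers = kmer_counts(toy, k=3)
```

Both functions see only which residues occur and in what local order, which is one reason such features may struggle to separate classes that differ mainly in structure, function, or interactions.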

All representations were assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 ± 0.028 and a ROC-AUC of 0.748 ± 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667) but behave in a highly imbalanced way: recall is close to 0.98 while precision is around 0.50, reflecting a strong bias toward positive predictions. Because F1 is the harmonic mean of precision and recall, those two values alone give an F1 of about 0.66, so the seemingly competitive score is driven by recall rather than by balanced discrimination.
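The nested stratified scheme mentioned above can be sketched in plain Python as index splits only. This is a sketch under assumptions, not the authors' code: the fold counts (5 outer, 3 inner), the seed handling, and the helper names are hypothetical, because the source does not state them. The point it illustrates is that model selection happens on inner folds drawn only from the outer training portion, so the outer test fold never influences tuning.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, k, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    return folds


def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits).

    inner_splits is a list of (inner_train, inner_val) pairs built only from
    outer_train, so hyperparameter selection never touches outer_test.
    """
    all_idx = list(range(len(labels)))
    outer = stratified_folds(all_idx, labels, outer_k, seed)
    for f, outer_test in enumerate(outer):
        outer_train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = stratified_folds(outer_train, labels, inner_k, seed + f + 1)
        inner_splits = []
        for v, inner_val in enumerate(inner):
            inner_train = [i for w, fold in enumerate(inner) if w != v for i in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# Toy labels (made up): 15 negatives and 15 positives.
toy_labels = [0] * 15 + [1] * 15
for outer_train, outer_test, inner_splits in nested_splits(toy_labels):
    for inner_train, inner_val in inner_splits:
        # Leakage check: the outer test fold is invisible to the inner loop.
        assert not set(outer_test) & (set(inner_train) | set(inner_val))
```

In practice the same pattern is usually built with scikit-learn by wrapping a `GridSearchCV` that uses one `StratifiedKFold` inside `cross_val_score` with another; the hand-rolled version here only makes the data flow explicit.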

Key Findings

Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70). Unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. Taken together, these results point to substantial overlap between the classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson’s disease classification.
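For readers unfamiliar with the Friedman test, the following Python sketch shows how per-fold scores of several models are ranked and compared. It is a simplified illustration, not a reproduction of the paper's analysis: the scores are placeholder numbers invented for this example, the function names are hypothetical, and the statistic omits the tie correction that a library implementation would normally apply.

```python
import math


def average_ranks(row):
    """Rank the values in one row (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    else:
        a, q = 1.0, math.exp(-y)             # Q(1, y)
    while a < df / 2.0 - 1e-9:
        # Recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def friedman_test(scores):
    """scores[i][j] is the metric of model j on fold i; returns (statistic, p)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    return stat, chi2_sf(stat, k - 1)


# Placeholder F1 scores (NOT from the paper): 5 folds x 3 hypothetical models.
placeholder_scores = [
    [0.66, 0.64, 0.70],
    [0.63, 0.67, 0.69],
    [0.68, 0.65, 0.66],
    [0.62, 0.66, 0.71],
    [0.67, 0.63, 0.65],
]
statistic, p_value = friedman_test(placeholder_scores)
```

A large p-value, as reported in the study, means the per-fold rankings of the models are not consistent enough to call any of them reliably better. With SciPy available, `scipy.stats.friedmanchisquare` performs the same test, including tie handling.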

Conclusion

This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as:

  • Structural descriptors
  • Functional descriptors
  • Interaction-based descriptors

are required for robust disease modeling. The findings underline the need to integrate biological information beyond the primary sequence in order to improve classification accuracy for complex diseases such as Parkinson’s.

Implications for Future Research

As research into the genetic and molecular underpinnings of Parkinson’s disease continues, this study shows why the analysis needs to extend beyond sequence-only features. Future work should focus on models that combine multi-modal data, such as the structural, functional, and interaction-based descriptors listed above, with the aim of improving both understanding of the disease and diagnostic tools.


Lazarus Omolua (https://richlyai.com/blog)