Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees
arXiv:2604.12060v1
The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. As the volume of genomic data continues to grow, the need for effective and interpretable methods of analysis has never been greater. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes, obscuring the reasoning behind their predictions. This lack of transparency can hinder trust and usability in critical applications such as healthcare and genetic research.
In contrast, axis-aligned decision trees are interpretable: each prediction follows an explicit path of tests that a researcher can read. Their limitation is that every split considers a single raw feature in isolation. For sequence data, where a raw feature is typically the nucleotide at one position, a pattern that can occur at any offset, such as a motif, has to be encoded through many position-specific splits. Capturing it therefore requires prohibitively deep trees, which compromises both interpretability and generalization performance.
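The expressivity limit can be illustrated with a small synthetic experiment. The sketch below is not taken from the paper: the motif `TATA`, the dataset generator, and the helper names (`make_dataset`, `stump_accuracy`, `best_raw_stump`) are all assumptions chosen for illustration. It compares the best single split on a raw feature (the nucleotide at one position) with a single split on a derived feature (whether the motif occurs anywhere in the sequence).

```python
import random

ALPHABET = "ACGT"
MOTIF = "TATA"  # hypothetical motif, chosen only for illustration


def make_dataset(n=400, length=20, seed=0):
    """Balanced toy data: label 1 iff the sequence contains MOTIF."""
    rng = random.Random(seed)
    seqs, labels = [], []
    while len(seqs) < n:
        s = "".join(rng.choice(ALPHABET) for _ in range(length))
        if MOTIF in s:
            continue  # keep the random background motif-free
        label = len(seqs) % 2
        if label:
            # Plant the motif at a random offset in positive examples.
            pos = rng.randrange(length - len(MOTIF) + 1)
            s = s[:pos] + MOTIF + s[pos + len(MOTIF):]
        seqs.append(s)
        labels.append(label)
    return seqs, labels


def stump_accuracy(feature_values, labels):
    """Training accuracy of a one-split tree on a boolean feature,
    predicting the majority label on each side of the split."""
    correct = 0
    for side in (True, False):
        side_labels = [y for f, y in zip(feature_values, labels) if f == side]
        if side_labels:
            ones = sum(side_labels)
            correct += max(ones, len(side_labels) - ones)
    return correct / len(labels)


def best_raw_stump(seqs, labels):
    """Best one-split accuracy over all raw tests 'position i holds base b'."""
    length = len(seqs[0])
    return max(
        stump_accuracy([s[i] == b for s in seqs], labels)
        for i in range(length)
        for b in ALPHABET
    )


seqs, labels = make_dataset()
raw_acc = best_raw_stump(seqs, labels)
motif_acc = stump_accuracy([MOTIF in s for s in seqs], labels)
print(f"best single raw-position split: {raw_acc:.2f}")
print(f"single motif-presence split:    {motif_acc:.2f}")
```

By construction the motif-presence feature separates this toy dataset with one split, whereas no single position test can, because the motif's offset varies between sequences. A tree restricted to raw features would need many position-specific splits to approximate the same rule.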
The DEFT Framework
To address these challenges, we introduce DEFT, a framework that adaptively generates high-level sequence features during tree construction, so that each split can test a sequence-level property rather than a single raw position. Its main components are:
- Dynamic Feature Generation: DEFT leverages large language models to propose biologically informed features tailored to the local sequence distribution at each node. Because proposals are conditioned on the sequences that reach a node, the candidate features can differ from node to node instead of being fixed before training.
- Reflection Mechanism: The framework includes a reflection mechanism that iteratively refines the proposed features. Candidates are assessed and adjusted based on their predictive performance, which improves the quality and relevance of the features used for classification.
- Human-Interpretable Features: The features DEFT discovers are human-interpretable sequence features. This is particularly valuable in genomics, where understanding what a feature means biologically can lead to new insights in research and medicine.
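The propose, score, and refine loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the LLM call is replaced by a hypothetical `stub_proposer` that uses a simple k-mer heuristic, features are restricted to "contains motif" tests, candidates are scored by Gini gain, and all function names and the toy dataset are invented for this example.

```python
import random
from collections import Counter

ALPHABET = "ACGT"
MOTIF = "TATA"  # hypothetical planted signal, for illustration only


def make_dataset(n=400, length=20, seed=0):
    """Balanced toy data: label 1 iff the sequence contains MOTIF."""
    rng = random.Random(seed)
    seqs, labels = [], []
    while len(seqs) < n:
        s = "".join(rng.choice(ALPHABET) for _ in range(length))
        if MOTIF in s:
            continue  # keep the random background motif-free
        label = len(seqs) % 2
        if label:
            pos = rng.randrange(length - len(MOTIF) + 1)
            s = s[:pos] + MOTIF + s[pos + len(MOTIF):]
        seqs.append(s)
        labels.append(label)
    return seqs, labels


def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_gain(motif, seqs, labels):
    """Gini impurity reduction of the split 'sequence contains motif'."""
    yes = [y for s, y in zip(seqs, labels) if motif in s]
    no = [y for s, y in zip(seqs, labels) if motif not in s]
    n = len(labels)
    return gini(labels) - len(yes) / n * gini(yes) - len(no) / n * gini(no)


def stub_proposer(seqs, labels, feedback):
    """Stand-in for the LLM call (assumption, not the paper's prompt).
    Round 1: the 3-mers most enriched in positives at this node.
    Later rounds ("reflection"): extend the best-scoring motif by one base."""
    if not feedback:
        counts = Counter()
        for s, y in zip(seqs, labels):
            for kmer in {s[i:i + 3] for i in range(len(s) - 2)}:
                counts[kmer] += 1 if y else -1
        return [k for k, _ in counts.most_common(4)]
    best = max(feedback, key=lambda t: t[1])[0]
    return [b + best for b in ALPHABET] + [best + b for b in ALPHABET]


def choose_split(seqs, labels, proposer, rounds=3):
    """Propose features, score them, and feed the scores back for refinement."""
    feedback = []  # list of (motif, gain) pairs seen so far at this node
    for _ in range(rounds):
        for motif in proposer(seqs, labels, feedback):
            feedback.append((motif, split_gain(motif, seqs, labels)))
    if not feedback:
        return None, 0.0
    return max(feedback, key=lambda t: t[1])


def build_tree(seqs, labels, proposer, depth=0, max_depth=2):
    majority = int(2 * sum(labels) >= len(labels))
    if depth == max_depth or len(set(labels)) < 2:
        return {"predict": majority}
    motif, gain = choose_split(seqs, labels, proposer)
    if motif is None or gain <= 1e-12:
        return {"predict": majority}
    yes = [(s, y) for s, y in zip(seqs, labels) if motif in s]
    no = [(s, y) for s, y in zip(seqs, labels) if motif not in s]
    return {
        "motif": motif,
        "feature": f"contains {motif}",  # human-readable split description
        "yes": build_tree(*map(list, zip(*yes)), proposer, depth + 1, max_depth),
        "no": build_tree(*map(list, zip(*no)), proposer, depth + 1, max_depth),
    }


def predict(tree, seq):
    while "predict" not in tree:
        tree = tree["yes"] if tree["motif"] in seq else tree["no"]
    return tree["predict"]


seqs, labels = make_dataset()
tree = build_tree(seqs, labels, stub_proposer)
accuracy = sum(predict(tree, s) == y for s, y in zip(seqs, labels)) / len(seqs)
print(tree)
print(f"training accuracy: {accuracy:.2f}")
```

The design point the sketch tries to capture is that feature proposal happens inside `choose_split`, so it sees only the sequences reaching the current node, and that each round receives the scores of earlier candidates. In the actual framework the proposer is a large language model and the features are not limited to motif containment; swapping `stub_proposer` for such a call would leave the rest of the loop unchanged.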
Empirical Results
Empirical evaluations demonstrate that DEFT effectively discovers highly predictive sequence features across a diverse range of genomic tasks. In comparative studies, DEFT outperforms traditional decision tree approaches while maintaining a level of interpretability that is essential for scientific inquiry.
The framework has also shown promise in improving predictive performance on tasks such as gene expression analysis, disease classification, and evolutionary studies. By narrowing the gap between high-performance machine learning and interpretability, DEFT could support broader use of decision trees in the analysis of complex biological data.
Conclusion
As genomic datasets grow, so does the demand for analytical methods that are both effective and interpretable. DEFT is a step towards meeting this demand: it combines the transparent structure of decision trees with features generated adaptively during training, improving predictive accuracy while keeping the resulting models readable for researchers across disciplines.
