GenoBERT: A Language Model for Accurate Genotype Imputation
arXiv:2604.00058v1
Genotype imputation provides the dense variant coverage that genome-wide association studies and risk-prediction analyses depend on. Traditional reference-panel methods are limited by ancestry bias and by reduced accuracy on rare variants. GenoBERT addresses both limitations with a transformer-based architecture for genotype imputation.
Overview of GenoBERT
GenoBERT, short for Genotype Bidirectional Encoder Representations from Transformers, is a reference-free framework for genotype imputation. The model tokenizes phased genotypes and uses self-attention to capture both short- and long-range linkage disequilibrium (LD) dependencies. Modeling both ranges matters because a missing genotype is inferred from the variants it is correlated with, whether they lie close by or farther along the chromosome.
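The tokenization step can be illustrated with a minimal sketch. The source does not describe GenoBERT's actual vocabulary, so the mapping below (one token per phased diploid genotype, plus a `[MASK]` token for missing calls) and the names `VOCAB` and `tokenize` are assumptions made only for illustration.

```python
# Hypothetical vocabulary: one token per phased genotype plus a mask token.
# This is an illustrative assumption, not GenoBERT's published token scheme.
VOCAB = {"0|0": 0, "0|1": 1, "1|0": 2, "1|1": 3, "[MASK]": 4}


def tokenize(genotypes):
    """Map phased genotype strings to token IDs.

    Missing calls, written '.|.' as in VCF files, become the [MASK] token.
    Any other unknown string raises KeyError rather than being silently masked.
    """
    tokens = []
    for genotype in genotypes:
        if genotype == ".|.":
            tokens.append(VOCAB["[MASK]"])
        else:
            tokens.append(VOCAB[genotype])
    return tokens


print(tokenize(["0|0", "1|0", ".|.", "1|1"]))  # [0, 2, 4, 3]
```

Keeping `0|1` and `1|0` as distinct tokens preserves phase, which is the information a model would need to learn haplotype-level LD patterns.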
Benchmarking Performance
GenoBERT was benchmarked on two independent datasets: the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP). The model was tested across multiple ancestry groups and at genotype missingness levels ranging from 5% to 50%. It achieved the highest overall accuracy when compared with four baseline methods:
- Beagle5.4
- SCDA
- BiU-Net
- STICI
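A benchmark of this kind needs a way to hide a fixed fraction of known genotypes before imputation. The sketch below shows one way to do that; it assumes sites are masked uniformly at random, which the source does not state, and the function name `mask_genotypes` and the mask token ID `4` are hypothetical.

```python
import random


def mask_genotypes(tokens, missing_rate, mask_id=4, seed=0):
    """Hide a fraction of genotype tokens to simulate missingness.

    Assumes uniform random masking (an illustrative choice, not necessarily
    the protocol used in the GenoBERT benchmarks). Returns the masked
    sequence and the sorted list of masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducible masks
    n_mask = round(len(tokens) * missing_rate)
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_id if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


# Sweep the missingness range reported in the benchmark (5% to 50%).
tokens = [0, 1, 2, 3] * 32  # 128 toy genotype tokens
for rate in (0.05, 0.25, 0.50):
    masked, positions = mask_genotypes(tokens, rate)
    print(rate, len(positions))
```

Returning the masked positions lets accuracy be computed only at the hidden sites, where the true genotype is known but was withheld from the model.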
Imputation Accuracy
At practical levels of sparsity, with up to 25% of genotypes missing, GenoBERT reached an imputation accuracy of $r^2 \approx 0.98$ across the tested datasets. Even with 50% of genotypes missing, the model maintained $r^2 > 0.90$.
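The source does not spell out how $r^2$ is computed. A common convention in imputation studies is the squared Pearson correlation between true and imputed allele dosages (0, 1, or 2 copies of the alternate allele), and the sketch below assumes that definition. The function name `imputation_r2` is hypothetical.

```python
import math


def imputation_r2(true_dosage, imputed_dosage):
    """Squared Pearson correlation between true and imputed dosages.

    Assumes the common dosage-based definition of imputation r^2.
    Returns NaN when either vector has zero variance (for example a
    monomorphic site), since the correlation is undefined there.
    """
    n = len(true_dosage)
    mean_t = sum(true_dosage) / n
    mean_i = sum(imputed_dosage) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_dosage, imputed_dosage))
    var_t = sum((t - mean_t) ** 2 for t in true_dosage)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosage)
    if var_t == 0 or var_i == 0:
        return math.nan
    return (cov * cov) / (var_t * var_i)


truth = [0, 1, 2, 1, 0, 2, 1, 0]
imputed = [0, 1, 2, 1, 0, 2, 0, 0]  # one of eight sites imputed incorrectly
print(imputation_r2(truth, imputed))
```

Unlike raw concordance, this measure is not inflated by sites where almost every sample carries the same genotype, which is one reason it is widely reported for rare variants.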
Consistent Gains Across Ancestries
Results broken down by ancestry group confirmed that GenoBERT consistently outperformed the other methods and remained resilient to small sample sizes and weak linkage disequilibrium. This suggests the model can be applied across diverse genomic settings, including populations that reference-panel methods serve less well.
Context Window Validation
GenoBERT uses a 128-SNP context window, which corresponds to approximately 100 Kb of genomic sequence. This window size was validated through linkage disequilibrium decay analyses, which confirmed that it is sufficient to capture the local correlation structure needed for accurate genotype imputation.
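An LD decay analysis measures how quickly the correlation between SNP pairs falls off with distance. The sketch below computes the standard pairwise LD statistic, $r^2 = D^2 / (p_A(1-p_A)\,p_B(1-p_B))$ with $D = p_{AB} - p_A p_B$, from two haplotype vectors coded 0/1. It is not the authors' analysis pipeline, and the function name `ld_r2` is hypothetical.

```python
def ld_r2(hap_a, hap_b):
    """Pairwise LD r^2 between two SNPs from 0/1 haplotype alleles.

    hap_a[k] and hap_b[k] are the alleles carried by haplotype k at
    SNP A and SNP B. Returns NaN if either SNP is monomorphic.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # allele frequency at A
    p_b = sum(hap_b) / n                                   # allele frequency at B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # joint frequency of 1-1
    d = p_ab - p_a * p_b                                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return d * d / denom


print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (perfect LD)
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (no LD)
```

Averaging this statistic over SNP pairs binned by physical distance gives a decay curve; a window is adequate when the curve has largely flattened by the window's span, here roughly 100 Kb.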
Conclusion
By removing the dependence on reference panels while maintaining high accuracy, GenoBERT offers a scalable and robust approach to genotype imputation. It addresses the ancestry-bias and rare-variant limitations of existing methods and provides a foundation for future genomic modeling.
