Optimal Gene Selection and Machine Learning Framework for Alzheimer’s Disease Prediction Using Transcriptomic Data

  • Omar Khaled Mohamed El-Shahat, Modern University for Technology and Information (MTI)
  • BenBella Sayed Tawfik
  • Marwa Nabil Refaie
Keywords: Alzheimer’s Disease; Transcriptomics; Feature Selection; L1-SVM; Machine Learning; Biomarker Discovery

Abstract

Accurate prediction of Alzheimer’s disease (AD) from gene expression data is significantly challenged by ultra-high dimensionality (38,319 genes) and class imbalance (697 AD vs. 460 controls). To overcome these barriers, we developed an end-to-end machine learning framework integrating advanced feature engineering with optimized classification. Our study leveraged 1,157 post-mortem dorsolateral prefrontal cortex samples from multi-cohort repositories, selected for their established relevance to AD pathology, and employed adaptive linear interpolation with biological replicates to impute sparse missing values (<5% per gene) while minimizing noise. We rigorously evaluated four feature selection approaches: ANOVA F-value filtering (emphasizing inter-group expression differences), Mutual Information scoring (detecting non-linear gene-AD relationships), L1-SVM regularization (simultaneous sparse selection and classification), and correlation-based elimination (reducing feature redundancy). Through exhaustive hyperparameter tuning (120 configurations), L1-SVM proved optimal, identifying 2,890 biologically coherent genes (including the known AD markers APOE, BIN1, and CLU) with 92.5% dimensionality reduction and greater than 99% signal retention. Eight classifiers were benchmarked on this refined gene set. A support vector machine (SVM) with a radial basis function kernel achieved peak performance: 94.37% accuracy, 96.32% precision, 94.24% recall, and a 95.27% F1-score. Crucially, the model demonstrated clinical robustness with only 8 false negatives and 5 false positives, exceeding the specificity of existing transcriptomic models by ≥ 7%. Validation over 1,000 iterations confirmed stability (F1-score SD: ±0.38%). This framework enables cost-effective AD screening (reducing genomic testing burden by 92.5%) and provides mechanistic insights through its interpretable gene panel, advancing precision neurology.
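The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the false-negative and false-positive counts (8 and 5) and the gene counts (38,319 and 2,890) come from the abstract, whereas the true-positive and true-negative counts (131 and 87) are not reported there. They are assumed values, chosen because they reproduce the stated percentages, and the helper names are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 as fractions in [0, 1]."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def dimensionality_reduction(n_total, n_selected):
    """Fraction of features removed by the selection step."""
    return 1 - n_selected / n_total


# Reported in the abstract.
FN, FP = 8, 5
N_GENES_TOTAL, N_GENES_SELECTED = 38_319, 2_890

# ASSUMPTION: not reported in the abstract; inferred so that the
# resulting metrics match the percentages quoted there.
TP, TN = 131, 87

metrics = classification_metrics(TP, FP, FN, TN)
reduction = dimensionality_reduction(N_GENES_TOTAL, N_GENES_SELECTED)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
print(f"dimensionality reduction: {reduction:.1%}")
```

Under these assumed counts the evaluation set would contain 231 samples, roughly 20% of the 1,157 samples described; the abstract does not state the split, so this is an inference rather than a reported detail.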
Published
2025-09-02
How to Cite
El-Shahat, O. K. M., Tawfik, B. S., & Refaie, M. N. (2025). Optimal Gene Selection and Machine Learning Framework for Alzheimer’s Disease Prediction Using Transcriptomic Data. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2723
Section
Research Articles