Diabetes prediction based on Ensemble Methods

Keywords: Diabetes Prediction, Ensemble Learning, Gradient Boosting, AdaBoost, XGBoost

Abstract

The incidence of diabetes, a chronic disease, is increasing worldwide, especially in low- and middle-income countries. Early and accurate prediction is critical for reducing complications and improving patient outcomes. This study presents an ensemble-based machine learning framework for diabetes prediction and evaluates it on two benchmark datasets: the Diabetes Prediction dataset and the Pima Indians Diabetes dataset. Two ensemble strategies were compared: a sequential ensemble combining XGBoost, gradient boosting, and AdaBoost, and a parallel ensemble using a soft voting classifier composed of logistic regression, decision tree, and K-Nearest Neighbors. Forward feature selection was used to identify the most relevant predictors, improving model performance and generalizability. The data were split into 70% for training, 15% for validation, and 15% for testing. On the Pima Indians Diabetes dataset, the sequential ensemble achieved a training accuracy of 98.95%, a validation accuracy of 97.59%, and an F1 score of 97.77%. It outperformed the parallel ensemble, which achieved a training accuracy of 98.16%, a validation accuracy of 96.38%, and an F1 score of 96.62%. The sequential ensemble also outperformed the parallel ensemble on the Diabetes Prediction dataset. These results demonstrate how feature selection methods and boosting-based ensemble models can work together to create accurate and reliable medical prediction systems.
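To make the 70/15/15 split and the soft voting step concrete, the following is a minimal, library-free sketch and not the authors' implementation. The helper names `split_indices` and `soft_vote`, the sample count of 768, the random seed, and the three probability vectors are all illustrative assumptions; in practice the probabilities would come from the fitted logistic regression, decision tree, and K-Nearest Neighbors models.

```python
import random


def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder, roughly 15%
    return train, val, test


def soft_vote(class_probabilities):
    """Average per-model class probabilities; return (label, averaged probabilities)."""
    n_models = len(class_probabilities)
    n_classes = len(class_probabilities[0])
    averaged = [
        sum(model_probs[c] for model_probs in class_probabilities) / n_models
        for c in range(n_classes)
    ]
    # The predicted label is the class with the highest averaged probability.
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


# Assumed sample count, for illustration only.
train, val, test = split_indices(768)

# Hypothetical [P(no diabetes), P(diabetes)] outputs from three base models.
label, averaged = soft_vote([[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]])
```

Soft voting differs from majority (hard) voting in that it weighs each model's confidence: a base model that is very sure of its prediction shifts the averaged probability more than one that is close to 50/50.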
Published
2025-10-04
How to Cite
Mosa, J., & Abdulazeez, A. (2025). Diabetes prediction based on Ensemble Methods. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2771
Section
Research Articles