Machine Learning Models for Predicting COVID-19 Mortality Using Epidemiological Features
Keywords:
COVID-19, epidemiology, machine learning, threat management, imbalanced dataset, RUS, SMOTE, ADASYN
Abstract
Identifying COVID-19 patients at high risk of death is critically important for healthcare professionals, as it supports informed decision-making and strengthens the capacity of medical systems to manage emerging crises. However, COVID-19 datasets are often highly imbalanced, with far fewer fatal cases than non-fatal ones, which makes it difficult to develop effective machine learning models. This study aims to develop a high-performing machine learning approach to predict COVID-19 mortality using a Mexican epidemiological dataset. To address the class imbalance, several sampling techniques are applied: SMOTE, SMOTE-ENN, ADASYN, SMOTE-Tomek, and Random Under-Sampling (RUS). Predictive models are built with several machine learning algorithms: Logistic Regression, Decision Tree, Gaussian Naïve Bayes, K-Nearest Neighbors, and Random Forest. In addition, we performed a feature selection analysis using the SHAP technique to determine the most relevant attributes for predicting COVID-19 mortality. The results show that the Random Forest model trained on data balanced with the SMOTE-ENN technique yielded the best performance, with 89.44% accuracy, 87.88% recall, and an ROC AUC score of 88.74%. Furthermore, the feature selection analysis shows that Type of Patient, Age, Pneumonia, Intubation, and contact with COVID-19-infected patients are the key attributes for predicting the risk of COVID-19 fatality among hospitalized individuals.
References
1. A. U. M. Shah, S. N. A. Safri, R. Thevadas, N. K. Noordin, A. Abd Rahman, Z. Sekawi, A. Ideris, and M. T. H. Sultan, COVID-19 outbreak in Malaysia: Actions taken by the Malaysian government, International Journal of Infectious Diseases, vol. 97, pp. 108–116, 2020.
2. C. Huang, Y. Wang, X. Li, L. Ren, J. Zhao, Y. Hu, L. Zhang, G. Fan, J. Xu, X. Gu, et al., Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China, The Lancet, vol. 395, no. 10223, pp. 497–506, 2020.
3. Q. Li, X. Guan, P. Wu, X. Wang, L. Zhou, Y. Tong, R. Ren, K. S. M. Leung, E. H. Y. Lau, J. Y. Wong, et al., Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia, New England Journal of Medicine, vol. 382, no. 13, pp. 1199–1207, 2020.
4. D. Wolff, S. Nee, N. S. Hickey, and M. Marschollek, Risk factors for Covid-19 severity and fatality: a structured literature review, Infection, vol. 49, pp. 15–28, 2021.
5. M. Jawad Hashim, A. R. Alsuwaidi, and G. Khan, Population risk factors for COVID-19 mortality in 93 countries, Journal of Epidemiology and Global Health, vol. 10, no. 3, pp. 204–208, 2020.
6. S. El Khamlichi, A. Maurady, and A. Sedqui, Comparative study of COVID-19 situation between lower-middle-income countries in the eastern Mediterranean region, Journal of Oral Biology and Craniofacial Research, vol. 12, no. 1, pp. 165–176, 2022.
7. N. Salari, H. Khazaie, A. Hosseinian-Far, H. Ghasemi, M. Mohammadi, S. Shohaimi, A. Daneshkhah, B. Khaledi-Paveh, and M. Hosseinian-Far, The prevalence of sleep disturbances among physicians and nurses facing the COVID-19 patients: a systematic review and meta-analysis, Globalization and Health, vol. 16, pp. 1–14, 2020.
8. N. Salari, H. Khazaie, A. Hosseinian-Far, B. Khaledi-Paveh, M. Kazeminia, M. Mohammadi, S. Shohaimi, A. Daneshkhah, and S. Eskandari, The prevalence of stress, anxiety and depression within front-line healthcare workers caring for COVID-19 patients: a systematic review and meta-regression, Human Resources for Health, vol. 18, pp. 1–14, 2020.
9. J.-L. Vincent and J. Creteur, Ethical aspects of the COVID-19 crisis: How to deal with an overwhelming shortage of acute beds, European Heart Journal: Acute Cardiovascular Care, vol. 9, no. 3, pp. 248–252, 2020.
10. C. Iwendi, C. G. Y. Huescas, C. Chakraborty, and S. Mohan, COVID-19 health analysis and prediction using machine learning algorithms for Mexico and Brazil patients, Journal of Experimental & Theoretical Artificial Intelligence, vol. 36, no. 3, pp. 315–335, 2024.
11. S. Wollenstein-Betech, C. G. Cassandras, and I. Ch. Paschalidis, Personalized predictive models for symptomatic COVID-19 patients using basic preconditions: hospitalizations, mortality, and the need for an ICU or ventilator, International Journal of Medical Informatics, vol. 142, pp. 104258, 2020.
12. S. Bolourani, M. Brenner, P. Wang, T. McGinn, J. Hirsch, D. Barnaby, and T. Zanos, Development and Validation of a Machine learning prediction model of respiratory failure within 48 hours of patient admission for COVID-19, Journal of Medical Internet Research, 2021.
13. H. Mohammedqasim and O. Ata, Real-time data of COVID-19 detection with IoT sensor tracking using artificial neural network, Computers and Electrical Engineering, vol. 100, pp. 107971, 2022.
14. J. Wu, J. Shen, M. Xu, and M. Shao, A novel combined dynamic ensemble selection model for imbalanced data to detect COVID-19 from complete blood count, Computer Methods and Programs in Biomedicine, vol. 211, pp. 106444, 2021.
15. M. AlJame, I. Ahmad, A. Imtiaz, and A. Mohammed, Ensemble learning model for diagnosing COVID-19 from routine blood tests, Informatics in Medicine Unlocked, vol. 21, pp. 100449, 2020.
16. R. Mohammed, J. Rawashdeh, and M. Abdullah, Machine learning with oversampling and undersampling techniques: overview study and experimental results, in Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), pp. 243–248, 2020.
17. Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, and K.-i. Matsumoto, The effects of over and under sampling on fault-prone module detection, in Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), pp. 196–204, 2007.
18. K. Fujiwara, Y. Huang, K. Hori, K. Nishioji, M. Kobayashi, M. Kamaguchi, and M. Kano, Over-and under-sampling approach for extremely imbalanced and small minority data problem in health record analysis, Frontiers in Public Health, vol. 8, pp. 178, 2020.
19. R. Blagus and L. Lusa, Joint use of over-and under-sampling techniques and cross-validation for the development and assessment of prediction models, BMC Bioinformatics, vol. 16, pp. 1–10, 2015.
20. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
21. M. S. Shelke, P. R. Deshmukh, and V. K. Shandilya, A review on imbalanced data handling using undersampling and oversampling technique, International Journal of Recent Trends in Engineering and Research, vol. 3, no. 4, pp. 444–449, 2017.
22. B. Das, N. C. Krishnan, and D. J. Cook, RACOG and wRACOG: Two probabilistic oversampling techniques, IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, pp. 222–234, 2014.
23. T. Zhu, Y. Lin, and Y. Liu, Synthetic minority oversampling technique for multiclass imbalance problems, Pattern Recognition, vol. 72, pp. 327–340, 2017.
24. K. Alshouiliy, S. Ray, A. AlGhamdi, and D. P. Agrawal, Enhancing imbalanced dataset by utilizing (K-NN based SMOTE 3D algorithm), Annals of Robotics and Automation, vol. 4, no. 1, pp. 001–006, 2020.
25. A. Amin, S. Anwar, A. Adnan, M. Nawaz, N. Howard, J. Qadir, A. Hawalah, and A. Hussain, Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study, IEEE Access, vol. 4, pp. 7940–7957, 2016.
26. G. Ahmed, M. J. Er, M. M. S. Fareed, S. Zikria, S. Mahmood, J. He, M. Asad, S. F. Jilani, and M. Aslam, Dad-net: Classification of Alzheimer’s disease using ADASYN oversampling technique and optimized neural network, Molecules, vol. 27, no. 20, pp. 7085, 2022.
27. X.-Y. Liu, J. Wu, and Z.-H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2008.
28. H. He and E. A. Garcia, Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
29. M. Bach, A. Werner, and M. Palt, The proposal of undersampling method for learning from imbalanced datasets, Procedia Computer Science, vol. 159, pp. 125–134, 2019.
30. M. Saripuddin, A. Suliman, S. S. Sameon, and B. N. Jorgensen, Random undersampling on imbalance time series data for anomaly detection, in Proceedings of the 2021 4th International Conference on Machine Learning and Machine Intelligence, pp. 151–156, 2021.
31. A. Farshidvard, F. Hooshmand, and S. A. MirHassani, A novel two-phase clustering-based under-sampling method for imbalanced classification problems, Expert Systems with Applications, vol. 213, pp. 119003, 2023.
32. J. M. Johnson and T. M. Khoshgoftaar, Survey on deep learning with class imbalance, Journal of Big Data, vol. 6, no. 1, pp. 1–54, 2019.
33. A. El Hariri, M. Mouiti, O. Habibi, and M. Lazaar, Improving Deep Learning Performance Using Sampling Techniques for IoT Imbalanced Data, Procedia Computer Science, vol. 224, pp. 180–187, 2023.
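Two of the resampling techniques named in the abstract, Random Under-Sampling (RUS) and SMOTE, can be illustrated with a minimal sketch in standard-library Python. It assumes a binary label (0 for the majority class, 1 for the minority class) and purely numeric features. The function names, the toy data, and the default of five neighbours are illustrative assumptions; this is not the authors' code, and it does not reproduce the study's pipeline, which also uses SMOTE-ENN, ADASYN, SMOTE-Tomek, and the listed classifiers.

```python
import math
import random


def random_under_sample(X, y, majority=0, minority=1, seed=0):
    """RUS: randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    maj_idx = [i for i, label in enumerate(y) if label == majority]
    min_idx = [i for i, label in enumerate(y) if label == minority]
    keep = sorted(rng.sample(maj_idx, len(min_idx)) + min_idx)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X, y, minority=1, k=5, seed=0):
    """SMOTE: add synthetic minority rows by interpolating between a minority
    row and one of its k nearest minority neighbours (binary labels assumed)."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    # Number of synthetic rows needed to match the majority class size.
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        # k nearest minority neighbours of the base row, excluding the row itself.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return X + synthetic, y + [minority] * len(synthetic)


# Toy imbalanced data (hypothetical): 90 majority rows and 10 minority rows.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(90)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

X_rus, y_rus = random_under_sample(X, y)
X_sm, y_sm = smote(X, y)
print(len(y_rus), sum(y_rus))  # 20 rows, 10 of them minority
print(len(y_sm), sum(y_sm))    # 180 rows, 90 of them minority
```

RUS balances the classes by discarding majority rows, while SMOTE keeps every original row and adds interpolated minority rows. In either case, resampling is normally applied to the training split only, so that evaluation metrics such as recall and ROC AUC are measured on data with the original class distribution.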
Published
2025-05-28
How to Cite
El Khamlichi, S., & Taidi, L. (2025). Machine Learning Models for Predicting COVID-19 Mortality Using Epidemiological Features. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2159
Section
Research Articles