A Machine Learning Model for Predicting the Risk of Developing Diabetes - T2DM Using Real-World Data from Kilifi, Kenya

by Dr. Fullgence Mwakondo, Dr. Mvurya Mgala, Isaac Mumo Kailu

Published: August 29, 2025 • DOI: 10.51244/IJRSI.2025.120800026

Abstract

Type 2 Diabetes Mellitus (T2DM) is a growing public health concern in low-resource settings, where early detection remains limited by infrastructural and diagnostic constraints. This study presents a machine learning-based risk prediction model, developed using real-world data from Kilifi County Referral Hospital in Kenya, that aims to identify individuals at risk of developing T2DM before clinical onset. The CRISP-DM framework guided the end-to-end process, from data collection to model deployment. The dataset comprised 2,500 anonymized electronic health records and covered clinical, behavioral, demographic, and socioeconomic variables. Feature selection combined a statistical method (the Chi-square test) with algorithm-based methods (Random Forest, Recursive Feature Elimination, and XGBoost importance), yielding two candidate feature sets of 14 and 7 features. Four supervised learning algorithms (Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost) were trained and evaluated using 5-fold cross-validation. XGBoost performed best, with a test set accuracy of 91.33%, an F1-score of 88.66%, and an AUC-ROC of 96.24%, outperforming the other models on all metrics. These results demonstrate that integrating multi-domain features with machine learning can enhance early risk stratification for T2DM in under-resourced environments. The final model’s ability to categorize individuals into low, medium, and high-risk groups offers a practical tool for targeted screening and preventive healthcare interventions in Kenyan public health systems.
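As a rough illustration of how predicted probabilities could be turned into the low, medium, and high-risk groups and scored with accuracy and F1, consider the minimal Python sketch below. It is not the study's implementation: the tier cut-offs (0.33 and 0.66), the 0.5 decision cut-off, the function names `risk_tier` and `accuracy_and_f1`, and the sample values are all assumptions made for illustration, since the abstract does not state the thresholds used.

```python
from typing import Dict, List

# Hypothetical tier cut-offs; the study does not report the thresholds it used.
LOW_MAX = 0.33
MEDIUM_MAX = 0.66


def risk_tier(probability: float) -> str:
    """Map a predicted T2DM probability in [0, 1] to a risk tier."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < LOW_MAX:
        return "low"
    if probability < MEDIUM_MAX:
        return "medium"
    return "high"


def accuracy_and_f1(
    y_true: List[int], y_prob: List[float], cutoff: float = 0.5
) -> Dict[str, float]:
    """Compute accuracy and positive-class F1 at a given decision cut-off."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Illustrative values only, not data from the study.
labels = [0, 0, 1, 1, 1, 0]
probs = [0.10, 0.40, 0.35, 0.80, 0.90, 0.70]

tiers = [risk_tier(p) for p in probs]
metrics = accuracy_and_f1(labels, probs)
print(tiers)
print(metrics)
```

In practice the tier boundaries would be chosen with clinical input and checked against the observed T2DM rate in each group, rather than fixed in advance as they are here.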