Stroke risk prediction using machine learning: a prospective cohort study of 0.5 million Chinese adults.
Chun M., Clarke R., Cairns BJ., Clifton D., Bennett D., Chen Y., Guo Y., Pei P., Lv J., Yu C., Yang L., Li L., Chen Z., Zhu T., China Kadoorie Biobank Collaborative Group.
OBJECTIVE: To compare Cox models, machine learning (ML), and ensemble models combining both approaches for prediction of stroke risk in a prospective study of Chinese adults.
MATERIALS AND METHODS: We evaluated models for stroke risk at varying intervals of follow-up (<9 years, 0-3 years, 3-6 years, 6-9 years) in 503 842 adults without prior history of stroke recruited from 10 areas in China in 2004-2008. Inputs included sociodemographic factors, diet, medical history, physical activity, and physical measurements. We compared discrimination and calibration of Cox regression, logistic regression, support vector machines, random survival forests, gradient boosted trees (GBT), and multilayer perceptrons, benchmarking performance against the 2017 Framingham Stroke Risk Profile. We then developed an ensemble approach to identify individuals at high risk of stroke (>10% predicted 9-year stroke risk) by selectively applying either a GBT or Cox model based on individual-level characteristics.
RESULTS: For 9-year stroke risk prediction, GBT provided the best discrimination (AUROC: 0.833 in men, 0.836 in women) and calibration, with consistent results in each interval of follow-up. The ensemble approach yielded incrementally higher accuracy (men: 76%, women: 80%), specificity (men: 76%, women: 81%), and positive predictive value (men: 26%, women: 24%) than any of the single-model approaches.
DISCUSSION AND CONCLUSION: Among several approaches, an ensemble model combining both GBT and Cox models achieved the best performance for identifying individuals at high risk of stroke in a contemporary study of Chinese adults. The results highlight the potential value of expanding the use of ML in clinical practice.
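The accuracy, specificity, and positive predictive value reported above all depend on turning a predicted risk into a yes/no "high risk" flag at the >10% cutoff. The following is a minimal sketch of that calculation, not the authors' code: the function names and the toy risks and outcomes are hypothetical, and the data are invented for illustration only, not drawn from the study.

```python
import math
from typing import Dict, List, Sequence

# Cutoff taken from the abstract: >10% predicted 9-year stroke risk.
HIGH_RISK_THRESHOLD = 0.10


def classify_high_risk(predicted_risk: Sequence[float],
                       threshold: float = HIGH_RISK_THRESHOLD) -> List[bool]:
    """Flag individuals whose predicted risk strictly exceeds the threshold."""
    return [p > threshold for p in predicted_risk]


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def threshold_metrics(flagged: Sequence[bool],
                      had_stroke: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy, specificity, and PPV from flags and observed outcomes."""
    if len(flagged) != len(had_stroke):
        raise ValueError("flagged and had_stroke must have the same length")
    pairs = list(zip(flagged, had_stroke))
    tp = sum(1 for f, y in pairs if f and y)          # flagged, had a stroke
    fp = sum(1 for f, y in pairs if f and not y)      # flagged, no stroke
    tn = sum(1 for f, y in pairs if not f and not y)  # not flagged, no stroke
    fn = sum(1 for f, y in pairs if not f and y)      # not flagged, had a stroke
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
    }


# Toy example (invented values, for illustration only).
toy_risk = [0.02, 0.15, 0.30, 0.08, 0.12, 0.05, 0.11, 0.01]
toy_outcome = [False, True, True, False, False, False, False, True]
toy_metrics = threshold_metrics(classify_high_risk(toy_risk), toy_outcome)
print(toy_metrics)
```

Because stroke is a relatively uncommon outcome in a population cohort, positive predictive value can remain modest even when accuracy and specificity are high, which matches the pattern of the reported figures.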