In-depth mining of clinical data: the construction of clinical prediction model with R
Zhi-Rui Zhou, Wei-Wei Wang, Yan Li, Kai-Rui Jin, Xuan-Yi Wang, Zi-Wei Wang, Yi-Shan Chen, Shao-Jia Wang, Jing Hu, Hui-Na Zhang, Po Huang, Guo-Zhen Zhao, Xing-Xing Chen, Bo Li, Tian-Song Zhang
Abstract
This article is a series on the methodology of clinical prediction model construction, comprising 16 sections. The first section introduces the concept, current applications, construction methods and processes, and classification of clinical prediction models, together with the prerequisites for conducting such research and the problems currently faced. The second section concentrates on variable screening methods in multivariate regression analysis. The third section introduces the construction of prediction models based on logistic regression and the drawing of nomograms. The fourth section covers the Cox proportional hazards regression model and the drawing of its nomogram. The fifth section introduces the calculation of the C-statistic in the logistic regression model. The sixth section introduces two common methods for calculating the C-index in Cox regression with R. The seventh section focuses on the principle and calculation of the Net Reclassification Index (NRI) using R. The eighth section focuses on the principle and calculation of the Integrated Discrimination Index (IDI) using R. The ninth section explores how to evaluate the clinical utility of a prediction model once it has been built, using Decision Curve Analysis. The tenth section supplements the ninth and introduces Decision Curve Analysis for survival outcome data. The eleventh section discusses external validation of the logistic regression model. The twelfth section discusses the in-depth evaluation of the Cox regression model with R, including calculating the concordance index (C-index) in the validation data set and drawing the calibration curve. The thirteenth section introduces how to handle survival outcome data with a competing risk model in R. The fourteenth section introduces how to draw the nomogram of the competing risk model with R. The fifteenth section discusses the identification of outliers and the imputation of missing values. The sixteenth section introduces advanced variable selection methods for linear models, such as Ridge regression and LASSO regression.
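To make the fifth section's topic concrete, the C-statistic of a logistic regression model can be computed in base R from the ranks of the predicted probabilities (the Mann-Whitney formulation, which equals the area under the ROC curve). The block below is a minimal sketch on simulated data and is not the authors' code: the variables `age`, `marker` and `outcome`, the simulation settings, and the helper `c_statistic` are all hypothetical names introduced for illustration.

```r
# Minimal sketch: C-statistic for a logistic regression model in base R.
# All data are simulated; nothing here comes from the article's data sets.

# C-statistic via the rank (Mann-Whitney) formula.
# pred: predicted probabilities; y: observed 0/1 outcome.
c_statistic <- function(pred, y) {
  n1 <- sum(y == 1)          # number of events
  n0 <- sum(y == 0)          # number of non-events
  r  <- rank(pred)           # average ranks, so tied predictions count as 0.5
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical simulated cohort
set.seed(42)
n       <- 200
age     <- rnorm(n, mean = 60, sd = 10)
marker  <- rnorm(n)
outcome <- rbinom(n, size = 1,
                  prob = plogis(-0.5 + 0.04 * (age - 60) + 0.8 * marker))
dat <- data.frame(outcome, age, marker)

# Fit the logistic regression model and obtain predicted probabilities
fit  <- glm(outcome ~ age + marker, data = dat, family = binomial)
pred <- predict(fit, type = "response")

# Apparent (in-sample) C-statistic of the fitted model
c_stat <- c_statistic(pred, dat$outcome)
print(c_stat)

# Hand-checkable toy case: 3 of the 4 event/non-event pairs are concordant
c_statistic(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))  # 0.75
```

The value printed for the simulated cohort is an apparent C-statistic, because the model is evaluated on the data used to fit it; it will generally be optimistic compared with the value obtained in an independent validation set.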
Keywords: Clinical prediction models; R; statistical computing.
Conflict of interest statement
Conflicts of Interest: The authors have no conflicts of interest to declare.
2019 Annals of Translational Medicine. All rights reserved.
Figures
![Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f1.jpg)
![Figure 2](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f2.jpg)
![Figure 3](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f3.jpg)
![Figure 4](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f4.jpg)
![Figure 5](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f5.jpg)
![Figure 6](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f6.jpg)
![Figure 7](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f7.jpg)
![Figure 8](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f8.jpg)
![Figure 9](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f9.jpg)
![Figure 10](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f10.jpg)
![Figure 11](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f11.jpg)
![Figure 12](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f12.jpg)
![Figure 13](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f13.jpg)
![Figure 14](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f14.jpg)
![Figure 15](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f15.jpg)
![Figure 16](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f16.jpg)
![Figure 17](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f17.jpg)
![Figure 18](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f18.jpg)
![Figure 19](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f19.jpg)
![Figure 20](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f20.jpg)
![Figure 21](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f21.jpg)
![Figure 22](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f22.jpg)
![Figure 23](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f23.jpg)
![Figure 24](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f24.jpg)
![Figure 25](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f25.jpg)
![Figure 26](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f26.jpg)
![Figure 27](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f27.jpg)
![Figure 28](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f28.jpg)
![Figure 29](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f29.jpg)
![Figure 30](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f30.jpg)
![Figure 31](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f31.jpg)
![Figure 32](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f32.jpg)
Figure 33
Nomogram predicting cumulative risk of recurrence at 36 and 60 months using Cox…
Figure 34
Visualization of missing values (1).
Figure 35
Visualization of missing values (2).
Figure 36
Distribution of missing values with averages.
Figure 37
The relationship between the coefficient and the Log(λ).
Figure 38
The performance of this model on the test set.
Figure 39
The relationship between the coefficient and the Log(λ).
Figure 40
The performance of this model on the test set.
Figure 41
Relationship between AUC and Log(λ).
Figure 42
The performance of this model on the test set.
Figure 43
The relationship between the coefficient and the L1 norm.
Figure 44
The relationship between the coefficient and the Log(λ).
Figure 45
The relationship between the coefficient and the fraction deviance explained.
Figure 46
The relationship between predicted and actual values in the ridge regression.
Figure 47
The relationship between the coefficient and the Log(λ) in the LASSO regression.
Figure 48
The relationship between predicted and actual values in the LASSO regression.
Figure 49
The relationship between the logarithm of λ and the mean square error in…
![Figure 33](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f33.jpg)
![Figure 34](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f34.jpg)
![Figure 35](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f35.jpg)
![Figure 36](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f36.jpg)
![Figure 37](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f37.jpg)
![Figure 38](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f38.jpg)
![Figure 39](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f39.jpg)
![Figure 40](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f40.jpg)
![Figure 41](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f41.jpg)
![Figure 42](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f42.jpg)
![Figure 43](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f43.jpg)
![Figure 44](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f44.jpg)
![Figure 45](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f45.jpg)
![Figure 46](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f46.jpg)
![Figure 47](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f47.jpg)
![Figure 48](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f48.jpg)
![Figure 49](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/6989986/bin/atm-07-23-796-f49.jpg)
Source: PubMed