In-depth mining of clinical data: the construction of clinical prediction model with R

Zhi-Rui Zhou, Wei-Wei Wang, Yan Li, Kai-Rui Jin, Xuan-Yi Wang, Zi-Wei Wang, Yi-Shan Chen, Shao-Jia Wang, Jing Hu, Hui-Na Zhang, Po Huang, Guo-Zhen Zhao, Xing-Xing Chen, Bo Li, Tian-Song Zhang

Abstract

This article is a methodology series on the construction of clinical prediction models, comprising 16 sections. The first section introduces the concept, current application status, construction methods and processes, and classification of clinical prediction models, together with the conditions necessary for conducting such research and the problems currently faced. The second section concentrates on screening methods in multivariate regression analysis. The third section introduces the construction of prediction models based on logistic regression and the drawing of nomograms. The fourth section concentrates on the Cox proportional hazards regression model and the drawing of nomograms. The fifth section introduces the calculation of the C-statistic in the logistic regression model. The sixth section introduces two common methods for calculating the C-index in Cox regression with R. The seventh section focuses on the principle and calculation of the Net Reclassification Index (NRI) using R. The eighth section focuses on the principle and calculation of the Integrated Discrimination Index (IDI) using R. The ninth section continues with the evaluation of clinical utility after a prediction model has been constructed: Decision Curve Analysis. The tenth section supplements the ninth and introduces Decision Curve Analysis of survival outcome data. The eleventh section discusses external validation of the logistic regression model. The twelfth section discusses in-depth evaluation of the Cox regression model with R, including calculating the concordance index (C-index) in the validation data set and drawing the calibration curve. The thirteenth section introduces how to handle survival outcome data using the competing risk model with R. The fourteenth section introduces how to draw the nomogram of the competing risk model with R. The fifteenth section discusses the identification of outliers and the imputation of missing values. The sixteenth section introduces advanced variable selection methods for linear models, such as ridge regression and LASSO regression.

Keywords: Clinical prediction models; R; statistical computing.

Conflict of interest statement

Conflicts of Interest: The authors have no conflicts of interest to declare.

2019 Annals of Translational Medicine. All rights reserved.

Figures

Figure 1
The flow chart of construction and evaluation of clinical prediction models.
Figure 2
Research process and technical routes of three prediction models.
Figure 3
Nomogram based on model “fit1”.
Figure 4
Calibration curve based on model “fit1”.
Figure 5
Nomogram based on model “fit2”.
Figure 6
Calibration curve based on model “fit2”.
Figure 7
Nomogram based on model “fit”.
Figure 8
Calibration curve based on model “fit”.
Figure 9
Nomogram of Cox regression model.
Figure 10
Calibration curve of Cox model.
Figure 11
Nomogram based on median survival time of Cox regression.
Figure 12
Nomogram based on survival probability of Cox regression model.
Figure 13
Calibration curve based on Cox regression model.
Figure 14
ROC curve.
Figure 15
The comparison of two models.
Figure 16
DCA curve.
Figure 17
Clinical impact curve of simple model.
Figure 18
Clinical impact curve of complex model.
Figure 19
DCA of survival outcome data.
Figure 20
DCA curve of “coxmod” based on Cox regression model.
Figure 21
DCA curves of “coxmod1” and “coxmod1” based on two Cox regression models.
Figure 22
DCA curve of a single predictor “thickness” based on univariate Cox regression model.
Figure 23
DCA curve of a single predictor “thickness” based on univariate Cox regression model. The Y axis represents the net reduction in interventions per 100 persons.
Figure 24
Calibration plot.
Figure 25
ROC curve.
Figure 26
ROC curve in validation set.
Figure 27
The discrimination index of Cox (2 variables) compared with Cox (5 variables) without cross-validation.
Figure 28
The discriminability of Cox (2 variables) compared with Cox (5 variables) with cross-validation.
Figure 29
The Calibration Plot performed by pec package.
Figure 30
The Calibration Plot performed by pec package with cross-validation.
Figure 31
The survival curve of cumulative recurrence rate and cumulative incidence rate of competing risk events.
Figure 32
Nomogram predicting cumulative recurrence risk at 36 and 60 months using the competing risk model. The nomogram estimates that patient no. 31 has a cumulative risk of recurrence of 0.196 and 0.213 at 36 and 60 months, respectively. *, P

Figure 33
Nomogram predicting cumulative risk of recurrence at 36 and 60 months using the Cox proportional hazards model. According to the nomogram’s estimate, the cumulative risk of recurrence in patient no. 31 at 36 and 60 months is 0.205 and 0.217, respectively. *, P
Figure 34
Visualization of missing values (1).
Figure 35
Visualization of missing values (2).
Figure 36
Distribution of missing values with averages.
Figure 37
The relationship between the coefficient and the Log(λ).
Figure 38
The performance of this model on the test set.
Figure 39
The relationship between the coefficient and the Log(λ).
Figure 40
The performance of this model on the test set.
Figure 41
Relationship between AUC and Log(λ).
Figure 42
The performance of this model on the test set.
Figure 43
The relationship between the coefficient and the L1 norm.
Figure 44
The relationship between the coefficient and the Log(λ).
Figure 45
The relationship between the coefficient and the fraction deviance explained.
Figure 46
The relationship between predicted and actual values in the ridge regression.
Figure 47
The relationship between the coefficient and the Log(λ) in the LASSO regression.
Figure 48
The relationship between predicted and actual values in the LASSO regression.
Figure 49
The relationship between the logarithm of λ and the mean square error in the LASSO regression.

Source: PubMed
