Random forest classification of etiologies for an orphan disease

Jaime Lynn Speiser, Valerie L Durkalski, William M Lee

Abstract

Classification of objects into pre-defined groups based on known information is a fundamental problem in the field of statistics. Although approaches for solving this problem exist, finding an accurate classification method can be challenging in an orphan disease setting, where data are minimal and often not normally distributed. The purpose of this paper is to illustrate the application of the random forest (RF) classification procedure in a real clinical setting and discuss typical questions that arise in the general classification framework as well as offer interpretations of RF results. This paper includes methods for assessing predictive performance, importance of predictor variables, and observation-specific information.

Keywords: acute liver failure; etiology; random forest; statistical classification.

Copyright © 2014 John Wiley & Sons, Ltd.

Figures

Figure 1
This is a simple example of a CART model. To predict the outcome, one must begin at the top and recursively follow logic statements until a terminal node is reached. For true logic statements, the model follows the left branch, and false logic statements dictate movement to the right branch.
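The traversal rule described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch: the feature names, thresholds, and group labels below are hypothetical and are not taken from the paper's fitted model. The only convention borrowed from the caption is that a true logic statement moves left and a false one moves right.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal node holding the predicted group."""
    prediction: str


@dataclass
class Split:
    """Internal node with the logic statement `feature < threshold`."""
    feature: str
    threshold: float
    left: "Node"   # followed when the statement is true
    right: "Node"  # followed when the statement is false


Node = Union[Leaf, Split]


def predict(node: Node, observation: dict) -> str:
    """Start at the top and follow logic statements until a terminal node."""
    while isinstance(node, Split):
        if observation[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hypothetical tree: names, cut points, and labels are made up for illustration.
tree = Split(
    "bilirubin", 10.0,
    left=Split("days_from_onset", 3.0, left=Leaf("A"), right=Leaf("B")),
    right=Leaf("C"),
)

print(predict(tree, {"bilirubin": 5.0, "days_from_onset": 1.0}))
```

A random forest repeats this traversal over many such trees, each grown on a bootstrap sample, and combines their votes.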
Figure 2
This plot displays the original distribution of the categorical outcome, etiology. There are fifteen total outcome groups, many of which are quite rare. The imbalanced nature of etiology presented a significant challenge to accurate classification.
Figure 3
This plot displays the original distribution of the categorical outcome, etiology. There are fifteen total outcome groups, many of which are quite rare. The imbalanced nature of etiology presented a significant challenge to accurate classification.
Figure 4
This graph displays the distribution of missing data for independent variables included within the RF after the variable selection procedure. Some variables, such as ionized calcium and amylase, have a substantial proportion of missing data (more than half). However, RF is able to impute missing values even in extreme cases of missing data, as with these variables.
Figure 5
The histogram of margins (left) is right-skewed, indicating that the correct etiology for the majority of patients was determined by the RF. The box plots (right) visualize margins by etiology groups, illustrating that the RF predicts with high confidence for APAP, moderate confidence for AIHep, and lower confidence for the remaining groups.
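For readers unfamiliar with margins, the following minimal sketch assumes the usual random forest definition: the proportion of tree votes for the true class minus the largest proportion of votes for any other class. A positive margin means the forest classified the observation correctly, and values near 1 indicate high confidence. The vote lists below are made up for illustration, and the function name `margin` is a placeholder rather than anything from the paper.

```python
from collections import Counter


def margin(tree_votes, true_label):
    """Vote share for the true class minus the largest vote share of any other class."""
    counts = Counter(tree_votes)
    correct = counts.get(true_label, 0)
    # Largest vote count among the competing classes (0 if there are none).
    best_other = max(
        (c for label, c in counts.items() if label != true_label),
        default=0,
    )
    return (correct - best_other) / len(tree_votes)


# Hypothetical votes from a 10-tree forest for two patients.
confident = ["APAP"] * 8 + ["AIHep"] * 2
wrong = ["APAP"] * 3 + ["AIHep"] * 7

print(margin(confident, "APAP"))  # positive: correctly classified
print(margin(wrong, "APAP"))      # negative: misclassified
```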
Figure 6
Plots for variable importance measures are presented. Though the exact ordering is different for mean decrease in accuracy and mean decrease in Gini, the top three most important variables in predicting etiology are bilirubin, days from onset, and ionized calcium.
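The idea behind mean decrease in accuracy can be sketched as a permutation test: shuffle one variable, re-predict, and record how much accuracy falls. The sketch below is a simplification under stated assumptions. It uses a fixed toy classifier and a single data set, whereas a random forest computes this per tree on out-of-bag observations. The variable names `signal` and `noise`, the toy rule, and the data are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling one feature across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        values = [r[feature] for r in rows]
        rng.shuffle(values)
        permuted = [{**r, feature: v} for r, v in zip(rows, values)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


def toy_classifier(row):
    """Hypothetical rule that depends only on `signal`."""
    return "high" if row["signal"] > 0.5 else "low"


rows = [
    {"signal": 0.1, "noise": 3.0}, {"signal": 0.2, "noise": 1.0},
    {"signal": 0.3, "noise": 2.0}, {"signal": 0.7, "noise": 2.5},
    {"signal": 0.8, "noise": 0.5}, {"signal": 0.9, "noise": 1.5},
]
labels = ["low", "low", "low", "high", "high", "high"]

signal_importance = permutation_importance(toy_classifier, rows, labels, "signal")
noise_importance = permutation_importance(toy_classifier, rows, labels, "noise")
print(signal_importance, noise_importance)
```

Shuffling a variable the classifier never uses leaves accuracy unchanged, so its importance is zero, while shuffling an informative variable degrades accuracy.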
Figure 7
Local importance plots by etiology groups are presented for four variables: bilirubin, days from onset, ionized calcium, and monocytes. These allow for distinguishing which variables are more or less important in determining specific etiologies.
Figure 8
The partial dependence plot on ionized calcium for the autoimmune hepatitis outcome group is presented. The slope of this plot indicates that as ionized calcium increases to 2 mg/dL, the probability that ALF was caused by autoimmune hepatitis increases, holding other variables constant. Above 2 mg/dL the plot levels off, so higher values are associated with a roughly constant probability that ALF was caused by autoimmune hepatitis.
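A partial dependence curve of this kind can be sketched as follows: for each grid value of the variable of interest, overwrite that variable in every observation, predict, and average. This is a minimal sketch under stated assumptions. It averages predicted probabilities directly, which may differ from the scale used by a given RF implementation, and `toy_model`, the variable names `x` and `z`, and the data are invented for illustration only.

```python
def partial_dependence(predict_proba, rows, feature, grid):
    """Average prediction over all rows with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # fix the feature, keep other variables as observed
            total += predict_proba(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical probability that rises with x up to 2 and is flat afterwards."""
    return min(row["x"], 2.0) * 0.1 + 0.05 * row["z"]


rows = [{"x": 0.0, "z": 0.0}, {"x": 5.0, "z": 2.0}]
curve = partial_dependence(toy_model, rows, "x", [1.0, 2.0, 3.0])
print(curve)
```

In this toy case the curve rises between the first two grid points and is flat afterwards, the same qualitative shape the caption describes.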

Source: PubMed
