Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data

Dhouha Grissa, Mélanie Pétéra, Marion Brandolini, Amedeo Napoli, Blandine Comte, Estelle Pujos-Guillot

Abstract

Untargeted metabolomics is a powerful phenotyping tool for better understanding the biological mechanisms involved in the development of human pathologies and for identifying early predictive biomarkers. This approach, based on multiple analytical platforms such as mass spectrometry (MS), chemometrics, and bioinformatics, generates massive and complex data that require appropriate analyses to extract biologically meaningful information. Despite the various tools available, it remains a challenge to handle such large and noisy datasets with a limited number of individuals without risking overfitting. Moreover, when the objective is to identify early predictive markers of a clinical outcome, several years before its occurrence, it becomes essential to use appropriate algorithms and workflows capable of uncovering subtle effects within this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric-symbolic approaches for feature selection, with the objective of obtaining the best combination of metabolites producing an effective and accurate predictive model. Relying first on numerical approaches, in particular machine learning methods (SVM-RFE, RF, RF-RFE) and univariate statistical analyses (ANOVA), a comparative study was performed on an original metabolomic dataset and on reduced subsets. Leave-one-out cross-validation (LOOCV) was applied as the resampling method to minimize the risk of overfitting. The best k features obtained with different importance scores from the combination of these approaches were then compared, and variable stability was assessed using Formal Concept Analysis.
The results revealed the value of RF-Gini combined with ANOVA for feature selection, as these two complementary methods allowed the selection of the 48 best candidates for prediction. Applying linear logistic regression to this reduced dataset yielded the best performance, in terms of prediction accuracy and number of false positives, with a model including the 5 top-ranked variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for the identification of predictive biomarkers from untargeted metabolomics data.
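The combination described above (RF Gini importance intersected with ANOVA ranking, then a linear logistic model evaluated by LOOCV) can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the authors' code; all parameter values (forest size, top-k cutoff) are assumptions, with k = 48 echoing the paper's reduced candidate set.

```python
# Hedged sketch of the RF-Gini + ANOVA feature-selection idea, on synthetic
# data standing in for the untargeted metabolomics dataset. Illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Few samples, many features: the regime where overfitting is a real risk.
X, y = make_classification(n_samples=80, n_features=300, n_informative=10,
                           random_state=0)

# Rank features by random forest Gini importance (mean decrease in impurity)...
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
gini_rank = np.argsort(rf.feature_importances_)[::-1]

# ...and by univariate ANOVA F-test p-values.
_, pvals = f_classif(X, y)
anova_rank = np.argsort(pvals)

# Keep features that BOTH complementary methods place in their top k.
k = 48
selected = sorted(set(gini_rank[:k]) & set(anova_rank[:k]))

# Evaluate a linear logistic model on the reduced set with leave-one-out
# cross-validation (LOOCV), the resampling scheme used to limit overfitting.
acc = cross_val_score(LogisticRegression(max_iter=1000),
                      X[:, selected], y, cv=LeaveOneOut()).mean()
print(len(selected), round(acc, 2))
```

Intersecting the two rankings, rather than relying on either alone, is what makes the methods "complementary": a feature survives only if it is both multivariately important to the forest and univariately discriminant.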

Keywords: biomarker discovery; feature selection; formal concept analysis; machine learning; metabolomics; prediction; univariate statistics; visualization.

Figures

Figure 1
General feature selection process.
Figure 2
General framework. Main phases of metabolomic data treatment.
Figure 3
Detailed approach. Representation of the different steps of the proposed approach, from the original untargeted metabolomics dataset to the identification of predictive biomarkers. It includes (i) data transformation (i.e., noise filtering and scaling) to generate datasets suitable for feature selection methods; (ii) data reduction for feature selection, which consists of identifying relevant features for further use in predictive models; and (iii) a prediction and validation step for discovering the best predictive markers. The numbers in the circles refer to the sections of the manuscript giving a more detailed description.
Figure 4
Experimental design. Experimental design for the comparison of feature selection methods (RF, RF-RFE, SVM-RFE, ANOVA), applied either to the original dataset or after filtering, based either on the correlation coefficient (Cor) or on mutual information (MI), and used with different ranking criteria (MdGini, MdAcc, Kappa, W, p-value). This resulted in 10 different subsets with different feature rankings.
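The two pre-filters in this design (Cor and MI) can be illustrated as follows. This is a hypothetical sketch, not the authors' code: the thresholds and the synthetic data are assumptions chosen only to show the mechanics of each filter.

```python
# Illustrative sketch of the two pre-filters: keep features associated with the
# outcome either by Pearson correlation (Cor) or by mutual information (MI).
# Thresholds (0.1, 0.01) are assumed values for the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=80, n_features=200, random_state=1)

# Cor filter: keep features whose |Pearson r| with the class label exceeds a cutoff.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
cor_subset = np.where(np.abs(r) > 0.1)[0]

# MI filter: keep features with non-negligible mutual information with the class,
# which also captures non-linear feature-outcome associations.
mi = mutual_info_classif(X, y, random_state=1)
mi_subset = np.where(mi > 0.01)[0]

print(len(cor_subset), len(mi_subset))
```

Each filtered subset (or the unfiltered original) then feeds a ranking method, which is how the design yields 10 ranked subsets to compare.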
Figure 5
The concept hierarchy derived from the 48 × 10 binary table of Supplementary Table 1. It highlights the relationships existing between top-ranked features and selection methods, and allows visualizing the common features (ions) selected by a set of methods.
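The idea behind such a hierarchy can be shown on a toy formal context. In Formal Concept Analysis, a concept is a maximal block of the binary table: a set of ions together with the exact set of methods that selected all of them. The sketch below uses an assumed 3 × 3 context in place of the paper's 48 × 10 table, and a naive closure-based enumeration that is only practical at toy scale.

```python
# Minimal Formal Concept Analysis illustration on a toy ion x method context.
# The paper's actual context is 48 ions x 10 methods; data here are assumed.
from itertools import combinations

table = {
    "ion1": {"RF-Gini", "ANOVA"},
    "ion2": {"RF-Gini", "ANOVA", "SVM-RFE"},
    "ion3": {"SVM-RFE"},
}
all_methods = set().union(*table.values())

def intent(ions):
    """Methods shared by every ion in the set (all methods for the empty set)."""
    return set.intersection(*(table[i] for i in ions)) if ions else set(all_methods)

def extent(methods):
    """Ions selected by every method in the set."""
    return {i for i, m in table.items() if methods <= m}

# A formal concept is a (extent, intent) pair closed under both derivations;
# enumerating the closure of every ion subset is fine at this toy scale.
concepts = {(frozenset(extent(intent(ions))), frozenset(intent(ions)))
            for r in range(len(table) + 1)
            for ions in combinations(table, r)}

for ext, itt in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "<-", sorted(itt))
```

Ordering concepts by extent inclusion yields the lattice drawn in the figure: ions high in the hierarchy are selected by many methods, i.e., they are the stable features.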
Figure 6
Correlation network between the 11 top-predictive features. The network was built using the Pearson correlation coefficients (indicated on the edges) between the best predictive features; red edges indicate positive correlations and blue edges negative correlations. It highlighted two highly correlated sub-networks (yellow and green).
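The construction of such a network can be sketched as follows: nodes are features, and an edge links two features whose Pearson correlation exceeds a threshold. The data and the 0.5 cutoff below are assumptions for illustration; the figure's network is over the 11 real top-predictive features.

```python
# Sketch of assembling a correlation network: an edge joins two features
# whose |Pearson r| exceeds a threshold. Toy data; threshold assumed.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=100)
# Three toy features: f0 and f1 strongly positively correlated, f2 independent.
X = np.column_stack([base,
                     base + 0.1 * rng.normal(size=100),
                     rng.normal(size=100)])

R = np.corrcoef(X, rowvar=False)           # feature-by-feature correlation matrix
edges = [(i, j, round(float(R[i, j]), 2))
         for i in range(R.shape[1]) for j in range(i + 1, R.shape[1])
         if abs(R[i, j]) > 0.5]            # keep only strong correlations
print(edges)
```

Connected components of this graph are the "highly correlated sub-networks" of the figure, which often point to metabolites sharing a biochemical origin.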


Source: PubMed
