A Review of Matched-pairs Feature Selection Methods for Gene Expression Data Analysis

Sen Liang, Anjun Ma, Sen Yang, Yan Wang, Qin Ma

Abstract

With the rapid accumulation of gene expression data from various technologies, e.g., microarray, RNA-sequencing (RNA-seq), and single-cell RNA-seq, it is necessary to carry out dimensionality reduction and feature (signature gene) selection to make sense of such high-dimensional data. These computational methods significantly facilitate downstream analysis and interpretation, such as gene function enrichment analysis, cancer biomarker detection, and drug-target identification in precision medicine. Although numerous feature selection methods have been developed in bioinformatics, it remains a challenge to choose an appropriate method for a specific problem and to identify the most reasonably ranked features. Meanwhile, paired gene expression data under the matched case-control design (MCCD) is becoming increasingly popular; it is often used in multi-omics integration studies and may increase feature selection efficiency by offsetting similar distributions of confounding features. However, feature selection methods designed specifically for such paired data, named matched-pairs feature selection (MPFS), have not yet matured in parallel. In this review, we compare the performance of ten feature selection methods (eight MPFS methods and two traditional unpaired methods) on two real datasets by applying three classification methods, and we analyze the algorithmic complexity of these methods by running their programs. This review aims to introduce and comprehensively present MPFS so that readers can easily understand its characteristics and gain guidance in selecting appropriate methods for their analyses.
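As a concrete illustration of the MPFS setting, the simplest paired approach ranks genes by the paired t-statistic of the within-pair (case minus matched control) differences. The sketch below is a minimal plain-Python illustration of that idea; the function and variable names are ours, not drawn from any of the reviewed methods:

```python
import math

def paired_t_rank(cases, controls):
    """Rank features by the absolute paired t-statistic of within-pair differences.

    cases, controls: lists of matched samples (same length, same feature order);
    each sample is a list of feature (gene expression) values.
    Returns feature indices ordered from most to least differential.
    """
    n = len(cases)            # number of matched case-control pairs
    n_feat = len(cases[0])
    scored = []
    for j in range(n_feat):
        diffs = [cases[i][j] - controls[i][j] for i in range(n)]
        mean = sum(diffs) / n
        var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
        # Paired t-statistic; a zero-variance feature gets score 0
        t = mean / math.sqrt(var / n) if var > 0 else 0.0
        scored.append((abs(t), j))
    return [j for _, j in sorted(scored, reverse=True)]

# Toy example: feature 0 is consistently elevated in cases, feature 1 is noise
cases = [[5.0, 1.0], [5.3, 0.5], [4.8, 1.5], [5.1, 0.8]]
controls = [[1.0, 1.1], [1.2, 0.4], [0.9, 1.6], [1.1, 0.7]]
ranking = paired_t_rank(cases, controls)  # feature 0 ranks first
```

By scoring the within-pair differences rather than the pooled groups, pair-level confounding that shifts a case and its matched control together cancels out, which is the core advantage MPFS methods exploit.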

Keywords: Gene expression; Matched case-control design; Matched-pairs feature selection; Paired data.

Figures

Fig. 1
Matched-pairs feature selection problem description. Paired data with p matched cases and q controls serves as input to an MPFS method, which outputs the selected features.
Fig. 2
Performance of the ten methods on two datasets. Panels A1–A3 show the classification performance of each method with the top 1500 ranked genes on the TCGA dataset, and panels B1–B3 show the same on the GEO dataset. Panel pairs A1/B1, A2/B2, and A3/B3 compare the SVM, GNB, and logistic regression (LR) classifiers, respectively, on both datasets. Each panel plots performance over the top 1500 ranked genes, with a zoomed-in inset detailing the top 100. Accuracy for the PQLBoost and BVS-CLR methods is omitted beyond 1000 genes because of their enormous running time (exceeding 48 h).
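The top-ranked-gene evaluation behind Fig. 2 follows a simple protocol: rank the genes once, then retrain a classifier on successively longer prefixes of the ranking and record the accuracy at each cutoff. The minimal plain-Python sketch below uses a nearest-centroid classifier as a stand-in for SVM/GNB/LR; all names are ours and purely illustrative:

```python
def nearest_centroid_accuracy(train, test, features):
    """Classify test samples by the nearest class centroid,
    using only the given feature indices."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    correct = 0
    for x, y in test:
        # Predict the class whose centroid is closest in squared distance
        pred = min(centroids, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(features, centroids[c])))
        correct += pred == y
    return correct / len(test)

def topk_curve(train, test, ranking, ks):
    """Accuracy as a function of how many top-ranked features are kept."""
    return [nearest_centroid_accuracy(train, test, ranking[:k]) for k in ks]

# Toy example: two well-separated classes, two features, ranking = [0, 1]
train = [([1.0, 0.0], 'A'), ([1.1, 0.1], 'A'), ([0.0, 1.0], 'B'), ([0.1, 0.9], 'B')]
test = [([1.0, 0.05], 'A'), ([0.05, 1.0], 'B')]
curve = topk_curve(train, test, [0, 1], [1, 2])
```

In the review's setting, `ranking` would come from one of the ten feature selection methods, and the curve is what each panel of Fig. 2 plots for gene counts up to 1500.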
Fig. 3
Comparison of running time. Note that the running time is the time each method takes to produce its gene list. The left panel compares the ten methods on the TCGA dataset, and the right panel on the GEO dataset.
Fig. 4
Paired and unpaired data diagram. Three data types for feature selection: (a) the pure-paired data type, with pure case and control data; (b) the mixed-paired data type, with varying degrees of mixture between case and control data; (c) the unpaired data type, which contains mixed case data without matched controls. Note that the mixing degree refers to the ratio between the control part (blue) and the case part (red) within one case sample, and vice versa for a control sample.


Source: PubMed
