Clinical data miner: an electronic case report form system with integrated data preprocessing and machine-learning libraries supporting clinical diagnostic model research

Arnaud Jf Installé, Thierry Van den Bosch, Bart De Moor, Dirk Timmerman

Abstract

Background: Clinical diagnostic model research uses machine-learning techniques to extract diagnostic models from patient data. Patient data are traditionally collected using electronic case report form (eCRF) systems, while mathematical software is used to analyze these data with machine-learning techniques. Due to the lack of integration between eCRF systems and mathematical software, extracting diagnostic models is a complex, error-prone process. Moreover, due to the complexity of this process, it is usually only performed once, after a predetermined number of data points have been collected, without insight into the predictive performance of the resulting models.

Objective: The Clinical Data Miner (CDM) software framework aims to offer an eCRF system with integrated data preprocessing and machine-learning libraries, improving the efficiency of the clinical diagnostic model research workflow, and to enable optimization of patient inclusion numbers through study performance monitoring.

Methods: The CDM software framework was developed using a test-driven development (TDD) approach to ensure high software quality. Architecturally, CDM's design is split over a number of modules to ensure future extensibility.

Results: The TDD approach has enabled us to deliver high software quality. CDM's eCRF Web interface is in active use by the studies of the International Endometrial Tumor Analysis consortium, with over 4000 enrolled patients, and more studies planned. Additionally, a derived user interface has been used in six separate interrater agreement studies. CDM's integrated data preprocessing and machine-learning libraries simplify some otherwise manual and error-prone steps in the clinical diagnostic model research workflow. Furthermore, CDM's libraries provide study coordinators with a method to monitor a study's predictive performance as patient inclusions increase.

Conclusions: To our knowledge, CDM is the only eCRF system integrating data preprocessing and machine-learning libraries. This integration improves the efficiency of the clinical diagnostic model research workflow. Moreover, by simplifying the generation of learning curves, CDM enables study coordinators to assess more accurately when data collection can be terminated, resulting in better models or lower patient recruitment costs.

Keywords: clinical decision support systems; data analysis; data collection; machine-learning.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Typical workflow of clinical diagnostic model research. The Clinical Data Miner software framework improves support for the steps indicated in green. Support for steps marked in blue is planned for future work. (Abbreviations: API=application programming interface; CRF=case report form; eCRF=electronic CRF.)
Figure 2
In Clinical Data Miner (CDM)'s layered architecture, module cdm-common contains functionality common to client and server. The server code is implemented in module cdm-server, while client code is further split into user interface logic (cdm-client) and user interface presentation (cdm-client-gwt). Finally, cdm-webapp combines the modules and provides CDM's entry point.
Figure 3
The DataManager application programming interface includes methods to access and preprocess data.
Figure 4
Unified Modeling Language diagram of Clinical Data Miner (CDM)'s machine-learning application programming interfaces. ClassifierFacade is the entry point to CDM's machine-learning functionality, which operates on Classifier objects to obtain Model objects.
Figure 5
Clinical Data Miner (CDM)'s data collection user interface. The possibility to include pictograms in case report forms is particularly interesting for variables obtained from imaging modalities.
Figure 6
Learning curves, plotting predictive performance with respect to number of patient inclusions, can easily be generated using Clinical Data Miner (CDM)'s libraries. (Abbreviations: AUC=area under the ROC curve; ROC=receiver operating characteristic.)
Figure 7
Distribution of respondents over different ranges of issue frequencies. A large majority, 79% (22/28), of survey participants experienced problems in less than 5% of their interactions with Clinical Data Miner.
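
Illustrative sketch: the learning curves described under Figure 6 plot predictive performance (AUC) against the number of patient inclusions. The Python sketch below shows one way such a curve could be computed. It does not use CDM's libraries or API. All function names, the nearest-centroid scorer, and the synthetic patient data are assumptions introduced purely for illustration.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Area under the ROC curve, computed as the Mann-Whitney U statistic.

    Over all (positive, negative) pairs, counts how often the positive case
    receives the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC requires at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_model(X, y):
    """Fit a nearest-centroid scorer (a stand-in for any classifier).

    Returns a function that maps a feature row to a score; higher scores
    mean the row lies closer to the positive-class centroid.
    """
    dims = range(len(X[0]))
    c1 = [mean(row[j] for row, lab in zip(X, y) if lab == 1) for j in dims]
    c0 = [mean(row[j] for row, lab in zip(X, y) if lab == 0) for j in dims]

    def score(row):
        d0 = sum((row[j] - c0[j]) ** 2 for j in dims)
        d1 = sum((row[j] - c1[j]) ** 2 for j in dims)
        return d0 - d1

    return score


def learning_curve(X, y, X_test, y_test, sizes):
    """Train on the first n inclusions for each n in sizes; report test AUC."""
    curve = []
    for n in sizes:
        model = fit_centroid_model(X[:n], y[:n])
        curve.append((n, auc([model(row) for row in X_test], y_test)))
    return curve


def make_synthetic_patients(n, rng):
    """Hypothetical data: one informative feature, one pure-noise feature.

    Labels alternate so that any prefix of two or more patients contains
    both classes.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(float(label), 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_synthetic_patients(200, rng)
    X_test, y_test = make_synthetic_patients(500, rng)
    for n, a in learning_curve(X, y, X_test, y_test, [10, 25, 50, 100, 200]):
        print(f"inclusions={n:4d}  AUC={a:.3f}")
```

A curve that has flattened suggests that further inclusions are unlikely to improve the model much, which is the kind of signal a study coordinator could use when deciding whether to end data collection. In practice the evaluation would typically use cross-validation or a held-out set of real patients rather than synthetic data.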

Source: PubMed
