Can synthetic data be a proxy for real clinical trial data? A validation study

Zahra Azizi, Chaoyi Zheng, Lucy Mosquera, Louise Pilote, Khaled El Emam, GOING-FWD Collaborators, Louise Pilote, Colleen M Norris, Valeria Raparelli, Alexandra Kautzky-Willer, Karolina Kublickiene, Maria Trinidad Herrero, Karin Humphries, Monica Parry, Lawrence S Bloomberg, Ruth Sapir-Pichhadze, Michal Abrahamowicz, Khaled El Emam, Simon Bacon, Peter Klimek, Jennifer Fishman

Abstract

Objectives: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.

Setting: Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method.

Participants: The analysis included 1543 patients from the control arm of the trial.

Primary and secondary outcome measures: Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets.

Results: Analysis results were similar between the real and synthetic datasets. The univariate distributions were within 1% of difference on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1).
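The reported CI overlap percentages can be checked by hand. The sketch below assumes a common definition from the statistical disclosure literature: the length of the intersection of the two intervals, expressed as a fraction of each interval's width, then averaged. The abstract does not state the formula, so the function `ci_overlap`, the choice to work on the HR scale (rather than the log-HR scale), and the clamping of non-overlapping intervals to zero are all assumptions made for illustration. Under these assumptions, the intervals quoted above give values that round to the reported 61% and 86%.

```python
def ci_overlap(lo_real, hi_real, lo_syn, hi_syn):
    """Average fractional overlap of two confidence intervals.

    Assumed definition (not stated in the abstract): the intersection
    length divided by each interval's width, averaged over both intervals.
    """
    # Length of the intersection; clamped at 0 for disjoint intervals
    # (an illustrative choice, some definitions allow negative values).
    intersection = max(0.0, min(hi_real, hi_syn) - max(lo_real, lo_syn))
    frac_of_real = intersection / (hi_real - lo_real)
    frac_of_syn = intersection / (hi_syn - lo_syn)
    return 0.5 * (frac_of_real + frac_of_syn)


# 95% CIs for the obstruction HR, taken from the Results paragraph.
overall = ci_overlap(1.11, 2.20, 1.44, 2.87)       # overall survival
disease_free = ci_overlap(1.18, 1.95, 1.26, 2.10)  # disease-free survival

print(f"Overall survival CI overlap: {overall:.0%}")            # 61%
print(f"Disease-free survival CI overlap: {disease_free:.0%}")  # 86%
```

Because the overlap is averaged over both intervals, a wide synthetic interval that fully contains a narrow real one still scores below 100%.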

Conclusions: The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets.

Trial registration number: NCT00079274.

Keywords: epidemiology; health informatics; information management; information technology; statistics & research methods.

Conflict of interest statement

Competing interests: This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the Children’s Hospital of Eastern Ontario Research Institute. KEE is cofounder and has equity in this company. LM and CZ are data scientists employed by Replica Analytics Ltd.

© Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.

Figures

Figure 1
Tau coefficient for the real and synthetic data, and the CI overlap for the bivariate relationship with obstruction. BMI, Body Mass Index; ECOG, Eastern Cooperative Oncology Group; KRAS, Kirsten rat sarcoma virus; LNs, Lymph Nodes.
Figure 2
Tau coefficient and CI overlap for the real and synthetic variables against overall survival. BMI, Body Mass Index; ECOG, Eastern Cooperative Oncology Group; KRAS, Kirsten rat sarcoma virus; LNs, Lymph Nodes.
Figure 3
Tau coefficient and CI overlap for the real and synthetic variables against disease-free survival. BMI, Body Mass Index; ECOG, Eastern Cooperative Oncology Group; KRAS, Kirsten rat sarcoma virus; LNs, Lymph Nodes.
Figure 4
Survival curve comparing overall survival in OBS+ and OBS− patients in the real (A) versus synthetic (B) datasets. OBS+, obstructed; OBS−, non-obstructed.
Figure 5
Survival curve comparing disease-free survival in OBS+ and OBS− patients in the real (A) versus synthetic (B) datasets. OBS+, obstructed; OBS−, non-obstructed.
Figure 6
Comparison of real and synthetic Cox model parameters (HR) with the overall survival outcome variable. BMI, Body Mass Index; ECOG, Eastern Cooperative Oncology Group; LNs, Lymph Nodes.
Figure 7
Comparison of real and synthetic Cox model parameters (HR) with the disease-free survival outcome variable. BMI, Body Mass Index; ECOG, Eastern Cooperative Oncology Group; LNs, Lymph Nodes.

Source: PubMed
