Bayesian network analysis incorporating genetic anchors complements conventional Mendelian randomization approaches for exploratory analysis of causal relationships in complex data

Richard Howey, So-Youn Shin, Caroline Relton, George Davey Smith, Heather J Cordell

Abstract

Mendelian randomization (MR) implemented through instrumental variables analysis is an increasingly popular causal inference tool used in genetic epidemiology. However, it can have limitations for evaluating simultaneous causal relationships in complex data sets that include, for example, multiple genetic predictors and multiple potential risk factors associated with the same genetic variant. Here we use real and simulated data to investigate Bayesian network analysis (BN) with the incorporation of directed arcs, representing genetic anchors, as an alternative approach. A Bayesian network describes the conditional dependencies/independencies of variables using a graphical model (a directed acyclic graph) with an accompanying joint probability distribution. In real data, we found BN could be used to infer simultaneous causal relationships that confirmed the individual causal relationships suggested by bi-directional MR, while allowing for the existence of potential horizontal pleiotropy (that would violate MR assumptions). In simulated data, BN with two directional anchors (mimicking genetic instruments) had greater power for a fixed type I error than bi-directional MR, while BN with a single directional anchor performed better than or as well as bi-directional MR. Both BN and MR could be adversely affected by violations of their underlying assumptions (such as genetic confounding due to unmeasured horizontal pleiotropy). BN with no directional anchor generated inference that was no better than chance, emphasizing the importance of directional anchors in BN (as in MR). Under highly pleiotropic simulated scenarios, BN outperformed both MR (and its recent extensions) and two recently proposed alternative approaches: a multi-SNP mediation intersection-union test (SMUT) and a latent causal variable (LCV) test. We conclude that BN incorporating genetic anchors is a useful complementary method to conventional MR for exploring causal relationships in complex data sets such as those generated from modern "omics" technologies.
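
As a brief illustration of the joint probability that accompanies the directed acyclic graph, the standard Bayesian network factorization can be written as below. The symbols X_i (the variables) and pa(X_i) (the parents of X_i in the graph) are notation introduced here for illustration and are not taken from the abstract.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```

Under this factorization, a genetic variable that is constrained to have no parents contributes only its marginal term, which is how a directional anchor restricts the set of graphs considered.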

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Average Bayesian networks for the TwinsUK data using either (A) all available variables or (B, C, D) a subset of variables, as shown.
The red numbers indicate the probability of existence of the edge, and the numbers in brackets indicate the probability of the edge operating in the direction shown, given that it exists. The thickness of an edge indicates its strength (probability of existence).
Fig 2. Simulation models used in simulation study 1 of quantitative trait data.
Data were simulated for two continuous variables (X and Y), together with a genetic instrument G (coded as 0, 1, 2) and a continuous instrumental variable Z. Parameter values for models involving weak confounding were chosen as βGX = 0.1, βZY = 0.075 and βCX = βCY = βGS = βSY = 0.25. For models involving strong confounding, the parameter values were the same except that βCX = βCY = βGS = βSY = 0.5, i.e. the parameters controlling the confounding effects were doubled. The parameter βXY was varied using values of 0.0, 0.1, 0.2, 0.3, 0.4 and 0.5. For the calculations where false positives were counted as detections of an arrow between X and Y in the wrong direction, the direction of causality was reversed between X and Y, such that for model 1 the equations become X ∼ N(βYX·Y + βGX·G, 1) and Y ∼ N(βZY·Z, 1), with βYX varied using values of 0.1, 0.3 and 0.5 (and similarly for models 2 and 3).
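
A minimal sketch of how data of this kind might be generated for model 1 is given below. The caption states only the reversed-direction equations, so the forward form used here, X ∼ N(βGX·G, 1) and Y ∼ N(βXY·X + βZY·Z, 1), is inferred by analogy and should be read as an assumption. The minor allele frequency, the standard normal distribution for Z, the sample size and the function name `simulate_model1` are likewise hypothetical choices, not values from the paper.

```python
import numpy as np

def simulate_model1(n, beta_xy, beta_gx=0.1, beta_zy=0.075, maf=0.3, seed=0):
    """Simulate G, Z, X, Y under an assumed forward (X -> Y) form of model 1."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, maf, size=n)   # genotype coded 0, 1, 2 (maf is an assumption)
    z = rng.normal(size=n)             # continuous instrument (standard normal assumed)
    x = rng.normal(beta_gx * g, 1.0)   # X ~ N(beta_GX * G, 1)
    y = rng.normal(beta_xy * x + beta_zy * z, 1.0)  # Y ~ N(beta_XY * X + beta_ZY * Z, 1)
    return g, z, x, y

# One replicate with a causal effect of 0.3 from X to Y.
g, z, x, y = simulate_model1(n=5000, beta_xy=0.3)
```

Setting `beta_xy=0.0` gives replicates with no causal effect, which is the setting used to count false positives in the first type of calculation described above.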
Fig 3. Performance (power and type I error) of different methods for detecting an edge from X to Y, under different generating scenarios that include weak confounding.
MR and St denote MR and MR Steiger respectively, performed using instrumental variable regression which takes into account the uncertainty of the predicted values in the first-stage regression to calculate the MR p-values. MR’ and St’ denote MR and MR Steiger respectively, performed using two-stage least squares regression without accounting for the uncertainty of the predicted values in the first-stage regression. Left hand plots (A, D, G) are generated under model 1 (no confounding), middle plots (B, E, H) are generated under model 2 (non-genetic confounding), and right hand plots (C, F, I) are generated under model 3 (genetic confounding).
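
The distinction between the primed and unprimed variants can be illustrated with a small sketch. It computes a single-instrument two-stage least squares estimate and two standard errors: a naive one taken directly from the second-stage regression on the fitted exposure, and a corrected one whose residuals use the observed exposure, which is one common way of accounting for first-stage uncertainty. This is an illustration under those assumptions, not the authors' implementation; the function name and the simulated inputs are hypothetical.

```python
import numpy as np

def two_stage_least_squares(g, x, y):
    """Return (beta, naive_se, corrected_se) for a single instrument g."""
    n = len(y)
    G = np.column_stack([np.ones(n), g])
    # Stage 1: regress the exposure on the instrument and keep fitted values.
    x_hat = G @ np.linalg.lstsq(G, x, rcond=None)[0]
    # Stage 2: regress the outcome on the fitted exposure.
    Xh = np.column_stack([np.ones(n), x_hat])
    coef = np.linalg.lstsq(Xh, y, rcond=None)[0]
    xtx_inv = np.linalg.inv(Xh.T @ Xh)
    # Naive residuals use x_hat, ignoring first-stage uncertainty.
    res_naive = y - Xh @ coef
    # Corrected residuals plug the observed exposure into the fitted model.
    res_iv = y - np.column_stack([np.ones(n), x]) @ coef
    dof = n - 2
    se_naive = np.sqrt(res_naive @ res_naive / dof * xtx_inv[1, 1])
    se_iv = np.sqrt(res_iv @ res_iv / dof * xtx_inv[1, 1])
    return coef[1], se_naive, se_iv

# Hypothetical data: G affects X, and X affects Y with effect 0.3.
rng = np.random.default_rng(1)
n = 5000
g = rng.binomial(2, 0.3, size=n)
x = 0.5 * g + rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)
beta, se_naive, se_iv = two_stage_least_squares(g, x, y)
```

The point estimate is the same in both variants; only the standard error, and hence the p-value, differs.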
Fig 4. Performance (power and type I error) of different methods for detecting an edge from X to Y, under different generating scenarios that include strong confounding.
MR and St denote MR and MR Steiger respectively, performed using instrumental variable regression which takes into account the uncertainty of the predicted values in the first-stage regression to calculate the MR p-values. MR’ and St’ denote MR and MR Steiger respectively, performed using two-stage least squares regression without accounting for the uncertainty of the predicted values in the first-stage regression. Left hand plots (A, D, G) are generated under model 1 (no confounding), middle plots (B, E, H) are generated under model 2 (non-genetic confounding), and right hand plots (C, F, I) are generated under model 3 (genetic confounding).
Fig 5. ROC curves for different methods for detecting an edge from X to Y, under different generating scenarios that include weak confounding.
MR and St denote MR and MR Steiger respectively, performed using instrumental variable regression which takes into account the uncertainty of the predicted values in the first-stage regression to calculate the MR p-values. Left hand plots (A, D) are generated under model 1 (no confounding), middle plots (B, E) are generated under model 2 (non-genetic confounding), and right hand plots (C, F) are generated under model 3 (genetic confounding). For the top plots (panels A-C), false positives on the x-axis are counted using simulations when there is no effect (βXY = 0), while for the bottom plots (panels D-F), the false positive rate is calculated by simulating from a model where there is a causal effect from Y to X.
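
The construction of such curves can be sketched as follows, assuming each method yields one p-value per simulation replicate and that an edge from X to Y is declared when the p-value falls at or below a threshold. The p-values below are synthetic placeholders rather than results from the paper, and `roc_points` is a hypothetical helper name. For a method that reports a posterior probability instead of a p-value, the comparison would run in the opposite direction.

```python
import numpy as np

def roc_points(p_null, p_alt, thresholds):
    """False and true positive rates when calling an edge at p <= threshold."""
    p_null = np.asarray(p_null)
    p_alt = np.asarray(p_alt)
    fpr = np.array([(p_null <= t).mean() for t in thresholds])
    tpr = np.array([(p_alt <= t).mean() for t in thresholds])
    return fpr, tpr

# Synthetic p-values standing in for replicate results (not real data).
rng = np.random.default_rng(2)
p_null = rng.uniform(size=1000)        # replicates simulated without the effect
p_alt = rng.beta(0.3, 1.0, size=1000)  # replicates with the effect, skewed to small p
fpr, tpr = roc_points(p_null, p_alt, np.linspace(0, 1, 101))
```

For the bottom panels, the null replicates would instead come from the model with a causal effect from Y to X.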
Fig 6. ROC curves for different methods for detecting an edge from X to Y, under different generating scenarios that include strong confounding.
MR and St denote MR and MR Steiger respectively, performed using instrumental variable regression which takes into account the uncertainty of the predicted values in the first-stage regression to calculate the MR p-values. Left hand plots (A, D) are generated under model 1 (no confounding), middle plots (B, E) are generated under model 2 (non-genetic confounding), and right hand plots (C, F) are generated under model 3 (genetic confounding). For the top plots (panels A-C), false positives on the x-axis are counted using simulations when there is no effect (βXY = 0), while for the bottom plots (panels D-F), the false positive rate is calculated by simulating from a model where there is a causal effect from Y to X.
Fig 7. Graph of the simulation model used for simulation study 2 for four different parameter scenarios as described by Shih et al. [55].
The data simulated consisted of four binary variables: Q, representing a gene; W, representing high alcohol; H, representing high alanine transaminase; and the outcome variable, Y, representing hepatocellular carcinoma.
Fig 8. Average Bayesian networks for each of the four scenarios (A–D) used for the simulated binary data.
The red numbers indicate the probability of existence of an edge, and the numbers in brackets indicate the probability of the edge operating in the direction shown, given that it exists. The thickness of the edges indicates their strength (probability of existence). G is constrained to have no parents.
Fig 9. Average Bayesian networks for each of the four scenarios (A–D) used for the simulated binary data.
The red numbers indicate the probability of existence of an edge, and the numbers in brackets indicate the probability of the edge operating in the direction shown, given that it exists. The thickness of the edges indicates their strength (probability of existence). G is constrained to have no parents and Y is constrained to have no children.
Fig 10. Performance (power and type I error) of different methods under a simulation model with 12 metabolites, an outcome Y, 150 SNPs affecting the metabolites, 75 other SNPs affecting Y, and 9775 SNPs with no effect.
Four metabolites (middle panels) have a causal effect on Y, four metabolites (right-hand panels) are subject to a reverse causal effect from Y to the metabolite, and four metabolites (left-hand panels) have no causal relationship with Y in either direction. The left-to-right arrows show tests for a causal effect from the metabolite to Y, and the right-to-left arrows show tests for a causal effect from Y to one of the metabolites. MR: Mendelian randomization using an allele score as an instrumental variable for one of the metabolites or Y. S: SMUT, using SNPs as random effect variables for one of the metabolites or Y. B1: Bayesian network consisting of one metabolite, Y and the two corresponding allele score variables. B12: Bayesian network consisting of all 12 metabolites, Y and all corresponding allele score variables. BMA: multivariable MR based on Bayesian model averaging (MR-BMA).

References

    1. Davey Smith G, Ebrahim S. Epidemiology—is it time to call it a day? Int J Epidemiology. 2001;30:1–11. 10.1093/ije/30.1.1
    2. Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. 10.1016/0270-0255(86)90088-6
    3. Robins JM, Hernán MA. Estimation of the causal effects of time-varying exposures. In: Longitudinal Data Analysis. New York: Chapman & Hall/CRC Press; 2009. p. 553–599.
    4. Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiology. 2003;32:1–22. 10.1093/ije/dyg070
    5. Evans DM, Davey Smith G. Mendelian Randomization: New Applications in the Coming Age of Hypothesis-Free Causality. Annu Rev Genomics Hum Genet. 2015;16:327–350. 10.1146/annurev-genom-090314-050016
    6. Lawlor DA, Windmeijer F, Davey Smith G. Is Mendelian randomization ‘lost in translation?’: Comments on ‘Mendelian randomization equals instrumental variable analysis with genetic instruments’ by Wehby et al. Statistics in Medicine. 2008;27:2750–2755. 10.1002/sim.3308
    7. Didelez V, Sheehan N. Mendelian randomization as an instrumental variable approach to causal inference. Stat Methods Med Res. 2007;16:309–330. 10.1177/0962280206077743
    8. Davies NM, Holmes MV, Davey Smith G. Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ. 2018;362.
    9. Voight BF, Peloso GM, Orho-Melander M, Frikke-Schmidt R, Barbalic M, Jensen MK, et al. Plasma HDL cholesterol and risk of myocardial infarction: a mendelian randomisation study. Lancet. 2012;380:572–580. 10.1016/S0140-6736(12)60312-2
    10. Weng LC, Roetker NS, Lutsey PL, Alonso A, Guan W, Pankow JS, et al. Evaluation of the relationship between plasma lipids and abdominal aortic aneurysm: A Mendelian randomization study. PLoS One. 2018;13(4):e0195719 10.1371/journal.pone.0195719
    11. Richmond RC, Sharp GC, Ward ME, Fraser A, Lyttleton O, McArdle WL, et al. DNA Methylation and BMI: Investigating Identified Methylation Sites at HIF3A in a Causal Framework. Diabetes. 2016;65(5):1231–1244. 10.2337/db15-0996
    12. Richardson TG, Haycock PC, Zheng J, Timpson NJ, Gaunt TR, Davey Smith G, et al. Systematic Mendelian randomization framework elucidates hundreds of CpG sites which may mediate the influence of genetic variants on disease. Hum Molec Genet. 2018;27:3293–3304. 10.1093/hmg/ddy210
    13. Yao C, Chen G, Song C, Keefe J, Mendelson M, Huan T, et al. Genome-wide mapping of plasma protein QTLs identifies putatively causal genes and pathways for cardiovascular disease. Nature Communications. 2018;9:3268 10.1038/s41467-018-05512-x
    14. Burgess S, Butterworth A, Thompson SG. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol. 2013;37:658–665. 10.1002/gepi.21758
    15. Davey Smith G, Hemani G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum Molec Genet. 2014;23(R1):R89–98. 10.1093/hmg/ddu328
    16. Burgess S, Scott RA, Timpson NJ, Davey Smith G, Thompson SG, EPIC-InterAct Consortium. Using published data in Mendelian randomization: a blueprint for efficient identification of causal risk factors. Eur J Epidemiol. 2015;30:543–552. 10.1007/s10654-015-0011-z
    17. Hartwig FP, Davies NM, Hemani G, Davey Smith G. Two-sample Mendelian randomization: avoiding the downsides of a powerful, widely applicable but potentially fallible technique. Int J Epidemiol. 2016;45:1717–1726. 10.1093/ije/dyx028
    18. Relton C, Davey Smith G. Two-step epigenetic Mendelian randomization: a strategy for establishing the causal role of epigenetic processes in pathways to disease. Int J Epidemiol. 2012;41:161–176. 10.1093/ije/dyr233
    19. Bowden J, Davey Smith G, Burgess S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol. 2015;44:512–525. 10.1093/ije/dyv080
    20. Burgess S, Thompson SG. Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects. Am J Epidemiol. 2015;181:251–260. 10.1093/aje/kwu283
    21. Burgess S, Daniel RM, Butterworth AS, Thompson SG, EPIC-InterAct Consortium. Network Mendelian randomization: using genetic variants as instrumental variables to investigate mediation in causal pathways. Int J Epidemiol. 2015;44:484–495. 10.1093/ije/dyu176
    22. Bowden J, Del Greco MF, Minelli C, Davey Smith G, Sheehan N, Thompson J. A framework for the investigation of pleiotropy in two-sample summary data Mendelian randomization. Statistics in Medicine. 2017;36:1783–1802. 10.1002/sim.7221
    23. Bowden J, Hemani G, Davey Smith G. Detecting individual and global horizontal pleiotropy in Mendelian randomization: a job for the humble heterogeneity statistic? Am J Epidemiol. 2018;187:2681–2685. 10.1093/aje/kwy185
    24. Verbanck M, Chen CY, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat Genet. 2018;50:693–698. 10.1038/s41588-018-0099-7
    25. Zuber V, Colijn JM, Klaver C, Burgess S. Selecting causal risk factors from high-throughput experiments using multivariable Mendelian randomization. bioRxiv. 2018; 10.1101/396333.
    26. Porcu E, Rüeger S, Lepik K, eQTLGen Consortium, BIOS Consortium, Santoni FA, et al. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nature Communications. 2019;10:3300 10.1038/s41467-019-10936-0
    27. Timpson NJ, Nordestgaard BG, Harbord RM, Zacho J, Frayling TM, Tybjærg-Hansen A, et al. C-reactive protein levels and body mass index: elucidating direction of causation through reciprocal Mendelian randomization. Int J Obes. 2011;35:300–308. 10.1038/ijo.2010.137
    28. Hemani G, Tilling K, Davey Smith G. Orienting the causal relationship between imprecisely measured traits using GWAS summary data. PLOS Genetics. 2017;13:e1007081 10.1371/journal.pgen.1007081
    29. O’Connor LJ, Price AL. Distinguishing genetic correlation from causation across 52 diseases and complex traits. Nat Genet. 2018;50:1726–1734.
    30. Pearl J. Bayesian networks: A model of self-activated memory for evidential reasoning. In: Proceedings, Cognitive Science Society. Irvine, CA; 1985. p. 329–334.
    31. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann; 1988.
    32. Spirtes P. Introduction to Causal Inference. Journal of Machine Learning Research. 2010;11:1643–1662.
    33. Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. Springer; 1993.
    34. Pearl J. Causality: models, reasoning, and inference. 2nd ed. Cambridge University Press; 2009.
    35. Scheines R. Computation and causation. Metaphilosophy. 2002;33(1-2):158–180. 10.1111/1467-9973.00223
    36. Lagani V, Triantafillou S, Ball G, Tegnér J, Tsamardinos I. Probabilistic Computational Causal Discovery for Systems Biology. In: Geris L, Gomez-Cabrero D, editors. Uncertainty in Biology: A Computational Modeling Approach. Studies in Mechanobiology, Tissue Engineering and Biomaterials 17. Switzerland: Springer International Publishing; 2016. p. 33–73.
    37. Nagarajan R, Scutari M, Lébre S. Bayesian Networks in R. New York: Springer-Verlag; 2013.
    38. Hemani G, Bowden J, Davey Smith G. Evaluating the potential role of pleiotropy in Mendelian randomization studies. Hum Molec Genet. 2018;27:R195–R208. 10.1093/hmg/ddy163
    39. Scutari M, Denis JB. Bayesian Networks with Examples in R. Texts in Statistical Science. Chapman & Hall/CRC; 2014.
    40. Chickering DM, Heckerman D, Meek C. Large-Sample Learning of Bayesian Networks is NP-Hard. The Journal of Machine Learning Research. 2004;5:1287–1330.
    41. Hua L, Zheng WY, Xia H, Zhou P. Detecting the potential cancer association or metastasis by multi-omics data analysis. Genetic Molecular Research. 2016;15(3). 10.4238/gmr.15038987
    42. Myte R, Gylling B, Häggström J, Schneede J, Magne Ueland P, Hallmans G, et al. Untangling the role of one-carbon metabolism in colorectal cancer risk: a comprehensive Bayesian network analysis. Scientific Reports. 2017;7:43434 10.1038/srep43434
    43. Zhu J, Lum PY, Lamb J, GuhaThakurta D, Edwards SW, Thieringer R, et al. An integrative genomics approach to the reconstruction of gene networks in segregating populations. Cytogenetic and Genome Research. 2004;105(2-4):363–374. 10.1159/000078209
    44. Zhu J, Sova P, Xu Q, Dombek KM, Xu EY, Vu H, et al. Stitching together multiple data dimensions reveals interacting metabolomic and transcriptomic networks that modulate cell regulation. PLoS Biology. 2012;10(4):e1001301 10.1371/journal.pbio.1001301
    45. Yazdani A, Yazdani A, Samiei A, Boerwinkle E. Generating a robust statistical causal structure over 13 cardiovascular disease risk factors using genomics data. Journal of Biomedical Informatics. 2016;60:114–119. 10.1016/j.jbi.2016.01.012
    46. Sedgewick AJ, Buschur K, Shi I, Ramsey JD, Raghu VK, Manatakis DV, et al. Mixed Graphical Models for Integrative Causal Analysis with Application to Chronic Lung Disease Diagnosis and Prognosis. Bioinformatics. 2019;35:1204–1212. 10.1093/bioinformatics/bty769
    47. Badsha MB, Fu AQ. Learning Causal Biological Networks With the Principle of Mendelian Randomization. Frontiers in Genetics. 2019;10:460 10.3389/fgene.2019.00460
    48. Zhong W, Spracklen CN, Mohlke KL, Zheng X, Fine J, Li Y. Multi-SNP mediation intersection-union test. Bioinformatics. 2019;35:4724–4729. 10.1093/bioinformatics/btz285
    49. Moayyeri A, Hammond CJ, Valdes AM, Spector TD. Cohort Profile: TwinsUK and healthy ageing twin study. Int J Epidemiol. 2013;42:76–85. 10.1093/ije/dyr207
    50. Shi SY, Fauman EB, Petersen AK, Krumsiek J, Santos R, Huang J, et al. An atlas of genetic influences on human blood metabolites. Nat Genet. 2014;46(6):543–550. 10.1038/ng.2982
    51. Speliotes EK, et al. Association analyses of 249,796 individuals reveal eighteen new loci associated with body mass index. Nature Genetics. 2010;42:937–948. 10.1038/ng.686
    52. Monda KL, et al. A meta-analysis identifies new loci associated with body mass index in individuals of African ancestry. Nature Genetics. 2013;45(6):690–696. 10.1038/ng.2608
    53. Boettcher SG, Dethlefsen C. deal: A Package for Learning Bayesian Networks. Journal of Statistical Software. 2003;8(20).
    54. Wasserstein RL, Lazar NA. The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016;70:129–133. 10.1080/00031305.2016.1154108
    55. Shih S, Huang YT, Yang HI. A multiple mediator analysis approach to quantify the effects of the ADH1B and ALDH2 genes on hepatocellular carcinoma risk. Genetic Epidemiology. 2018;42(4):394–404. 10.1002/gepi.22120
    56. Cho Y, Haycock PC, Sanderson E, Gaunt TR, Zheng J, Davey Smith APMG, et al. MR-TRYX: A Mendelian randomization framework that exploits horizontal pleiotropy to infer novel causal pathways. bioRxiv. 2019; 10.1101/476085.
    57. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–350. 10.1007/s10654-016-0149-3
    58. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90:7–24. 10.1016/j.ajhg.2011.11.029
    59. Brumpton B, Sanderson E, Pires Hartwig F, Harrison S, Åberge Vie G, Cho Y, et al. Within-family studies for Mendelian randomization: avoiding dynastic, assortative mating, and population stratification biases. bioRxiv. 2019; 10.1101/602516.
    60. Ainsworth HF, Shin SY, Cordell HJ. A comparison of methods for inferring causal relationships between genotype and phenotype using additional biological measurements. Genet Epidemiol. 2017;41(7):577–586. 10.1002/gepi.22061
    61. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O’Connell J, Cortes A, Welsh S, Young A, Effingham M, McVean G, Leslie S, Allen N, Donnelly P, Marchini J. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. 10.1038/s41586-018-0579-z
    62. Lawlor DA, Tilling K, Davey Smith G. Triangulation in aetiological epidemiology. Int J Epidemiol. 2016;45:1866–1886. 10.1093/ije/dyw314
    63. Munafò MR, Davey Smith G. Robust research needs many lines of evidence. Nature. 2018;553:399–401. 10.1038/d41586-018-01023-3
    64. Burgess S, Small DS, Thompson SG. A review of instrumental variable estimators for Mendelian randomization. Stat Methods Med Res. 2017;26:2333–2355. 10.1177/0962280215597579
    65. Kleiber C, Zeileis A. Applied Econometrics with R. New York: Springer-Verlag; 2008. Available from: .
    66. Howey R. BayesNetty. Computer program package obtainable from ; 2019.
    67. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems:1695.
    68. Sanderson E, Davey Smith G, Windmeijer F, Bowden J. An examination of multivariable Mendelian randomization in the single-sample and two-sample summary data settings. Int J Epidemiol. 2019;in press. 10.1093/ije/dyy262
    69. Kettunen J, Demirkan A, Würtz P, Draisma HH, Haller T, Rawal R, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nature Communications. 2016;7:11122 10.1038/ncomms11122
    70. Do R, Willer CJ, Schmidt EM, Sengupta S, Gao C, Peloso GM, et al. Common variants associated with plasma triglycerides and risk for coronary artery disease. Nat Genet. 2013;45:1345–1352. 10.1038/ng.2795

Source: PubMed
