The UK Biobank resource with deep phenotyping and genomic data

Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau, Jared O'Connell, Adrian Cortes, Samantha Welsh, Alan Young, Mark Effingham, Gil McVean, Stephen Leslie, Naomi Allen, Peter Donnelly, Jonathan Marchini

Abstract

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

Conflict of interest statement

J.M. is a founder and director of Gensci Ltd. P.D., G.M. and S.L. are partners in Peptide Groove LLP. G.M. and P.D. are founders and directors of Genomics Plc. The remaining authors declare no competing financial interests.

Figures

**Fig. 1. Summary of the UK Biobank resource and genotyping array content.**
Summary of the major components of the UK Biobank resource. See Extended Data Table 1 for more details. The figure also shows a schematic representation of the different categories of content on the UK Biobank Axiom genotype array. Numbers indicate the approximate count of markers within each category, ignoring any overlap. A more detailed description of the array content is available in the UK Biobank Axiom Array Content Summary.

**Fig. 2. Summary of genotype data quality and content.**
All plots show properties of the UK Biobank genotype data after applying quality control. a, MAF distribution based on all samples (805,426 markers). The inset shows rare markers only (MAF < 0.01). b, The distribution of the number of batch-level quality control (QC) tests that a marker fails (see Methods). For each of four MAF ranges, we show the fraction of markers that fail the specified number of batches. c, Comparison of MAF in UK Biobank with the frequency of the same allele in ExAC, among the European-ancestry participants within each study (Supplementary Information). This analysis used 91,298 overlapping markers. Each hexagonal bin is coloured according to the number of markers falling in that bin (log10 scale). The dashed red line shows x = y. The markers with very different allele frequencies seen on the top, bottom and left-hand sides of the plot comprise approximately 300 markers. This is 0.3% of all markers in the comparison (see Supplementary Information for discussion). d, Mean log2 ratios (L2R) on X and Y chromosomes for each sample, indicating probable sex chromosome aneuploidy (see Methods). There are 652 samples with a probable sex chromosome aneuploidy (indicated by crosses). Locations of clusters of individuals with different putative karyotypes are indicated by Greek symbols: λ = X0 (or mosaic XX/X0), θ = XXX, α = XXY, and π = XYY. Counts of individuals in these regions are given in Supplementary Table 2. The colours indicate different combinations of self-reported sex, and sex inferred by Affymetrix (from the genetic data). For almost all samples (99.9%), the self-reported and the inferred sex are the same, but for a small number of samples (378) they do not match (see Supplementary Information for discussion).

**Fig. 3. Ancestral diversity and familial relatedness.**
a, Each point represents a UK Biobank participant (n = 488,377 samples) and is placed according to their principal component (PC) scores in each of the top four principal components. Colours and shapes indicate the self-reported ethnic background of each individual. See Extended Data Table 3 for proportions in each category. b, Distribution of the number of relatives that participants have in the UK Biobank cohort. The height of each bar shows the count of participants (log10 scale) with the stated number of relatives. The colours indicate the proportions of each relatedness class within a bar. c, Examples of family groups within the UK Biobank cohort. Points represent participants, and coloured lines between points indicate their inferred relationship (for example, blue lines join full siblings). The integers show the total number of family networks in the cohort (if more than one) with that same configuration, ignoring third-degree pairs.

**Fig. 4. Association statistics for human height.**
Results (P values) of association tests between human height and genotypes using three different sets of data for chromosome 2. In a–c, P values are shown on the −log10 scale, capped at 50 for visual clarity and uncorrected for multiple comparisons. Markers with −log10(P) > 50 are plotted at 50 on the y axis and shown as triangles rather than dots. Horizontal red lines denote P = 5 × 10⁻⁸. a, Results for published meta-analysis by GIANT (n = 253,288), with NCBI GWAS catalogue markers superimposed in red (plotted at the reported P values). b, Association statistics (from linear mixed model, see Methods) for UK Biobank markers in the genotype data (n = 343,321). c, Association statistics (from linear mixed model, see Methods) for UK Biobank markers in the imputed data (n = 343,321). Points coloured pink indicate genotyped markers that were used in pre-phasing and imputation. This means that most of the data at each of these markers comes from the genotyping assay. Black points (the vast majority, ~8 million) indicate fully imputed markers. d, Venn diagram of the results of counting the number of 1-Mb windows with at least one locus with P < 5 × 10⁻⁸ in the GIANT, UK Biobank genotyped and UK Biobank imputed datasets (see Methods). Percentages in brackets are the proportion of the union of such windows across all three data sources (1,215). There were only three windows contained in UK Biobank genotyped data and not the imputed data. e, Comparison of Z-scores in UK Biobank (y axis) and GIANT (x axis). Z-scores were calculated as effect size divided by standard error, but only for markers with P < 5 × 10⁻⁸ in GIANT, for a set of 575 associated regions, which we also used for the credible set analysis (see Methods). The marker with the smallest P value (in GIANT) within each region is highlighted with blue circles. The black dotted line shows x = y, and the red solid line shows the linear regression line estimated on these data. The standard error of the regression coefficient is shown in brackets. Pearson’s correlation was used to calculate the r² value.
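
The quantities described for panels a–e can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy marker tuples and the choice of fixed, non-overlapping 1-Mb bins are all assumptions made for illustration, and the actual window definition is given in the paper's Methods.

```python
import math

GENOME_WIDE_P = 5e-8      # threshold marked by the red lines in panels a-c
WINDOW_BP = 1_000_000     # 1-Mb windows, as in panel d


def significant_windows(markers, threshold=GENOME_WIDE_P, window_bp=WINDOW_BP):
    """Return the set of (chromosome, window index) pairs that contain at
    least one marker with P below the threshold.

    `markers` is an iterable of (chromosome, position_bp, p_value) tuples.
    Fixed, non-overlapping bins are an assumption of this sketch.
    """
    return {
        (chrom, pos // window_bp)
        for chrom, pos, p in markers
        if p < threshold
    }


def z_score(effect_size, standard_error):
    """Z-score as described for panel e: effect size over standard error."""
    return effect_size / standard_error


def capped_neg_log10(p, cap=50.0):
    """-log10(P), capped for plotting as in panels a-c."""
    if p <= 0.0:
        return cap
    return min(-math.log10(p), cap)


# Hypothetical toy data, not values from the paper.
giant = [("2", 1_200_000, 1e-9), ("2", 1_900_000, 3e-12), ("2", 5_400_000, 0.2)]
ukb = [("2", 1_500_000, 1e-20), ("2", 5_100_000, 4e-9), ("2", 7_000_001, 1e-60)]

giant_windows = significant_windows(giant)
ukb_windows = significant_windows(ukb)
union = giant_windows | ukb_windows    # denominator for the Venn percentages
shared = giant_windows & ukb_windows   # windows significant in both sources
```

Set operations on the per-dataset window sets give each region of the Venn diagram directly, which is why the sketch returns sets rather than counts.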

**Extended Data Fig. 1. Summary of sample-based quality control.**
a–c, The three plots show heterozygosity and missing rates, which we used to flag poor quality samples (n = 488,377 samples). Panels a and b show heterozygosity for each sample before and after, respectively, correcting for ancestral background using principal components. The symbols (shapes and colours) indicate the self-reported ethnic background of each participant. Panel c shows the set of 968 samples we flagged as outliers (in red), and all other samples (in black), with shapes the same as for the other two plots. The vertical line shows the threshold we used to call samples as outliers on missing rate. In all plots missing rate data are transformed to the logit scale, but with the axis annotated with the original values.
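
The logit-scale axis mentioned above can be sketched as follows. This is an illustrative example only: the tick values are hypothetical and are not taken from the figure.

```python
import math


def logit(rate):
    """Map a rate in (0, 1) to the logit scale: log(rate / (1 - rate))."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return math.log(rate / (1.0 - rate))


# Hypothetical missing rates chosen as axis ticks (assumed values).
tick_rates = [0.001, 0.01, 0.05, 0.2]

# Points are plotted at the transformed positions...
tick_positions = [logit(r) for r in tick_rates]

# ...while the axis is annotated with the original, untransformed values.
tick_labels = [str(r) for r in tick_rates]
```

The transform spreads out the very small missing rates that most samples have, which would otherwise be compressed against the axis.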

**Extended Data Fig. 2. Examples of intensity data and genotype calls for markers of different allele frequencies.**
Each sub-figure shows intensity data for a single marker within six different batches. Batches labelled with the prefix ‘UKBiLEVEAX’ contain only samples typed using the UK BiLEVE Axiom array, and those with the prefix ‘batch’ contain only samples typed using the UK Biobank Axiom array. Each point represents one sample and is coloured according to the inferred genotype at the marker. The x and y axes are transformations of the intensities for probe sets targeting each of the alleles ‘A’ and ‘B’ (see Supplementary Information for definition of probe set). The ellipses indicate the location and shape of the posterior probability distribution (two-dimensional multivariate normal) of the transformed intensities for the three genotypes in the stated batch. That is, each ellipse is drawn such that it contains 85% of the probability density. See Affymetrix *Axiom Genotyping Solution Data Analysis Guide* for more details of Affymetrix genotype calling. The MAF of each of the markers is computed using all samples in the released UK Biobank genotype data. a, A marker with a MAF of 0.077 with well-separated genotype clusters. b, Intensities for a marker with a MAF of 0.00092 with well-separated genotype clusters. As would be expected under Hardy–Weinberg equilibrium, there are no instances of samples with the minor homozygote genotype. c, Intensities for a marker with a MAF of 0.00066, and in which the heterozygote cluster is not well separated from the large major homozygote cluster in some batches, making it more difficult to call the heterozygous genotypes confidently.
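
The Hardy–Weinberg remark for panel b can be checked with a short calculation. The sketch below assumes a sample size of 488,377 (the figure quoted elsewhere on this page for the genotyped cohort); the number of samples with a called genotype at this particular marker is not stated here and may differ.

```python
import math


def expected_minor_homozygotes(maf, n_samples):
    """Expected number of minor-allele homozygotes under Hardy-Weinberg
    equilibrium: n * q**2, where q is the minor allele frequency."""
    return n_samples * maf ** 2


n_samples = 488_377   # assumption: cohort size quoted for the genotype data
maf = 0.00092         # MAF of the marker in panel b

expected = expected_minor_homozygotes(maf, n_samples)

# Poisson approximation to the chance of observing no minor homozygotes.
prob_none = math.exp(-expected)
```

Under these assumptions the expected count is below one, so observing no minor homozygotes is consistent with Hardy–Weinberg proportions.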

**Extended Data Fig. 3. Mean principal component scores for each self-reported country of birth.**
Each column shows one principal component and each element is the mean principal component score for individuals born in the labelled country, scaled by the standard deviation of the scores for that principal component. Elements in each column are only coloured if the country has a non-zero coefficient (P < 10⁻⁵; two-sided t-test) in a linear model with country of birth as predictor and principal component scores as outcome (n = 487,848 samples). Countries (rows) have been ordered using hierarchical clustering (‘hclust’ function in R). The symbols next to each country label indicate the most common ethnic background category among the participants born in that country. For example, the most common self-reported ethnic background of participants born in Sri Lanka is ‘Any other Asian background’. Countries with fewer than 20 individuals born there were excluded from this analysis.

**Extended Data Fig. 4. Distribution of information scores at autosomal markers in the imputed dataset.**
The top left graph shows the full distribution of the information scores. The remaining panels show distributions in tranches of MAF; MAF > 5%, 1% ≤ MAF

**Extended Data Fig. 5. Example region of association in standing height GWAS.**
GWAS association statistics (P values) for standing height focusing on a ~3-Mb region of chromosome 2 that did not reach genome-wide significance in the GIANT (2014) meta-analysis, but did in UK Biobank (linear mixed model; see Methods). The P values shown are not adjusted for multiple testing. Markers genotyped in the UK Biobank are shown as diamonds, and imputed markers as circles. The two markers with the smallest P value for each of the genotyped data and imputed data are enlarged and highlighted with black outlines, and other UK Biobank markers are coloured according to their correlation (r²) with one of these two. That is, genotyped markers with the leading genotyped marker (rs17713396), and imputed markers with the leading imputed marker (rs12714401). Markers with r² values of less than 0.1 are shown as black or green.

**Extended Data Fig. 6. Comparison of fine-mapping in GIANT (2014) and UK Biobank imputed data.**
Here we summarize results of our credible set analysis in GIANT (2014) and UK Biobank for 575 genomic regions associated with standing height in both studies (see Methods). A red solid line on a plot indicates where x = y. a, Both plots compare the number of markers in the 95% credible sets in which the size is less than 18 markers in both studies (363 regions in the left-hand plot; 445 in the right-hand plot). b, c, Both plots are from the analysis considering all markers in each study. In b we show, for each region, the proportion of markers used in the analysis for a given study that are in the 95% credible set for that study. The plot contains the same 363 regions as shown in the left-hand plot in a. In c we summarize, for all 575 regions, how much weight our UK Biobank analysis placed on markers that our analysis of GIANT (2014) indicated were important.
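
For readers unfamiliar with credible sets, the sketch below shows one common way to build a 95% credible set from per-marker weights within a region. It is a generic illustration rather than the procedure from the paper's Methods: the function name, the marker labels and the weights are hypothetical, and the weights are assumed to be posterior probabilities (or Bayes factors) computed under a single-causal-variant model.

```python
def credible_set(weights, coverage=0.95):
    """Return the smallest list of markers whose normalised weights sum to
    at least `coverage`.

    `weights` maps marker ID -> non-negative weight, for example a posterior
    probability or Bayes factor (an assumption of this sketch). Weights are
    normalised to sum to 1 within the region before accumulating.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    # Rank markers from most to least strongly supported.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

    chosen, cumulative = [], 0.0
    for marker, weight in ranked:
        chosen.append(marker)
        cumulative += weight / total
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical region with four markers (labels and weights are made up).
region = {"m1": 70, "m2": 20, "m3": 6, "m4": 4}
markers_95 = credible_set(region)

# Proportion of analysed markers that fall in the set, as plotted in panel b.
proportion_in_set = len(markers_95) / len(region)
```

A smaller credible set means the association signal is localised to fewer candidate markers, which is the sense in which the plots compare fine-mapping resolution between the two studies.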

References

1. Plenge RM, Scolnick EM, Altshuler D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 2013;12:581–594. doi: 10.1038/nrd4051.
1. The UK Biobank. UK Biobank Axiom Array Content Summary (2014).
1. The UK Biobank. Genotyping and Quality Control of UK Biobank, a Large-Scale, Extensively Phenotyped Prospective Resource (2015).
1. Young AI, Wauthier F, Donnelly P. Multiple novel gene-by-environment interactions modify the effect of FTO variants on body mass index. Nat. Commun. 2016;7:12724. doi: 10.1038/ncomms12724.
1. Astle WJ, et al. The allelic landscape of human blood cell trait variation and links to common complex disease. Cell. 2016;167:1415–1429.e19. doi: 10.1016/j.cell.2016.10.042.
1. Wain LV, et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir. Med. 2015;3:769–781. doi: 10.1016/S2213-2600(15)00283-0.
1. Elliott P, Peakman TC. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int. J. Epidemiol. 2008;37:234–244. doi: 10.1093/ije/dym276.
1. Doherty A, et al. Large scale population assessment of physical activity using wrist worn accelerometers: The UK Biobank Study. PLoS One. 2017;12:e0169649. doi: 10.1371/journal.pone.0169649.
1. Miller KL, et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat. Neurosci. 2016;19:1523–1536. doi: 10.1038/nn.4393.
1. Petersen SE, et al. Imaging in population science: cardiovascular magnetic resonance in 100,000 participants of UK Biobank – rationale, challenges and approaches. J. Cardiovasc. Magn. Reson. 2013;15:46. doi: 10.1186/1532-429X-15-46.
1. Coffey S, et al. Protocol and quality assurance for carotid imaging in 100,000 participants of UK Biobank: development and assessment. Eur. J. Prev. Cardiol. 2017;24:1799–1806. doi: 10.1177/2047487317732273.
1. Harvey NC, Matthews P, Collins R, Cooper C, Group UBMA. Osteoporosis epidemiology in UK Biobank: a unique opportunity for international researchers. Osteoporosis Int. 2013;24:2903–2905. doi: 10.1007/s00198-013-2508-1.
1. Sudlow C, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
1. The UK Biobank. Touchscreen Questionnaire Ordering, Validation and Dependencies (2018).
1. The International Multiple Sclerosis Genetics Consortium & The Wellcome Trust Case Control Consortium 2. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature. 2011;476:214–219. doi: 10.1038/nature10251.
1. Affymetrix. Axiom Genotyping Solution Data Analysis Guide (2017).
1. Nielsen J, Wohlert M. Chromosome abnormalities found among 34,910 newborn children: results from a 13-year incidence study in Arhus, Denmark. Hum. Genet. 1991;87:81–83. doi: 10.1007/BF01213097.
1. Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
1. Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat. Genet. 2004;36:512–517. doi: 10.1038/ng1337.
1. Shibata K, et al. The confounding effect of cryptic relatedness for environmental risks of systolic blood pressure on cohort studies. Mol. Genet. Genomic Med. 2013;1:45–53. doi: 10.1002/mgg3.4.
1. Voight BF, Pritchard JK. Confounding from cryptic relatedness in case-control association studies. PLoS Genet. 2005;1:e32. doi: 10.1371/journal.pgen.0010032.
1. The UK Biobank. UK Biobank: Protocol for a Large-Scale Prospective Epidemiological Resource (2007).
1. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 2012;44:955–959. doi: 10.1038/ng.2354.
1. O’Connell J, et al. Haplotype estimation for biobank-scale datasets. Nat. Genet. 2016;48:817–820. doi: 10.1038/ng.3583.
1. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
1. McCarthy S, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643.
1. Huang J, et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat. Commun. 2015;6:8111. doi: 10.1038/ncomms9111.
1. Elliott L, et al. Genome-wide association studies of brain imaging phenotypes in UK Biobank. Nat. Commun. 2018;9:1470. doi: 10.1038/s41467-018-03819-3.
1. Welter D, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. doi: 10.1093/nar/gkt1229.
1. Dilthey A, et al. Multi-population classical HLA type imputation. PLOS Comput. Biol. 2013;9:e1002877. doi: 10.1371/journal.pcbi.1002877.
1. The International Multiple Sclerosis Genetics Consortium. Class II HLA interactions modulate genetic risk for multiple sclerosis. Nat. Genet. 2015;47:1107–1113. doi: 10.1038/ng.3395.
1. Wood AR, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 2014;46:1173–1186. doi: 10.1038/ng.3097.
1. The Wellcome Trust Case Control Consortium et al. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat. Genet. 2012;44:1294–1301. doi: 10.1038/ng.2435.
1. Welsh S, Peakman T, Sheard S, Almond R. Comparison of DNA quantification methodology used in the DNA extraction protocol for the UK Biobank cohort. BMC Genomics. 2017;18:26. doi: 10.1186/s12864-016-3391-x.
1. Affymetrix. UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory (2017).
1. UK Biobank. Genotyping of 500,000 UK Biobank Participants: Description of Sample Processing Workflow and Preparation of DNA for Genotyping (2015).
1. Affymetrix. UKB_WCSGAX: UK Biobank 500K Samples Processing by the Affymetrix Research Services Laboratory (2017).
1. Galinsky KJ, et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 2016;98:456–472. doi: 10.1016/j.ajhg.2015.12.022.
1. Price AL, et al. Long-range LD can confound genome scans in admixed populations. Am. J. Hum. Genet. 2008;83:132–135. doi: 10.1016/j.ajhg.2008.06.005.
1. Lawson DJ, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453. doi: 10.1371/journal.pgen.1002453.
1. Manichaikul A, et al. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559.
1. Loh P-R, Palamara PF, Price AL. Fast and accurate long-range phasing in a UK Biobank cohort. Nat. Genet. 2016;48:811–816. doi: 10.1038/ng.3571.
1. Loh P-R, et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016;48:1443–1448. doi: 10.1038/ng.3679.
1. Webb TR, et al. Systematic evaluation of pleiotropy identifies 6 further loci associated with coronary artery disease. J. Am. Coll. Cardiol. 2017;69:823–836. doi: 10.1016/j.jacc.2016.11.056.
1. Fuchsberger C, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536:41–47. doi: 10.1038/nature18642.
1. Loh P-R, et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190.
1. International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
1. Galante J, et al. The acceptability of repeat Internet-based hybrid diet assessment of previous 24-h dietary intake: administration of the Oxford WebQ in UK Biobank. Br. J. Nutr. 2016;115:681–686. doi: 10.1017/S0007114515004821.

Source: PubMed
