Genome reference and sequence variation in the large repetitive central exon of human MUC5AC

Xueliang Guo, Shuo Zheng, Hong Dang, Rhonda G Pace, Jaclyn R Stonebraker, Corbin D Jones, Frank Boellmann, George Yuan, Prashamsha Haridass, Olivier Fedrigo, David L Corcoran, Max A Seibold, Swati S Ranade, Michael R Knowles, Wanda K O'Neal, Judith A Voynow

Abstract

Despite modern sequencing efforts, the difficulty of assembling highly repetitive sequences has prevented resolution of human genome gaps, including some in the coding regions of genes with important biological functions. One such gene, MUC5AC, encodes a large, secreted mucin, which is one of the two major secreted mucins in human airways. The MUC5AC region contains a gap in the human genome reference (hg19) across the large, highly repetitive, and complex central exon. This exon is predicted to contain imperfect tandem repeat sequences and multiple conserved cysteine-rich (CysD) domains. To resolve the MUC5AC genomic gap, we used high-fidelity long PCR followed by single molecule real-time (SMRT) sequencing. This technology yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region. Furthermore, we used SMRT sequencing of PCR amplicons covering the central exon to identify genetic variation in four individuals. The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants. Additional studies demonstrated that one of the identified tandem repeat insertions is tagged by nonexonic single nucleotide polymorphisms. Taken together, these data illustrate the utility of long SMRT sequencing reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome. Characterization of the MUC5AC gene and the sequence variation in the central exon will facilitate genetic and functional studies for this critical airway mucin.

Figures

Figure 1.
MUC5AC genomic region. Current status and research design to cover the MUC5AC genomic gap. (A) Annotation tracks for the MUC5AC region excerpted from the University of California Santa Cruz genome browser (http://genome.ucsc.edu) (GRCh37/hg19) with notes added to emphasize the gap. The current gap in the MUC5AC gene is situated between a set of exons (blue vertical bars along arrowed line) that are 5′ (the 5′ exons are incorrectly annotated to MUC5B) and a set of 3′ exons (that are correctly annotated to MUC5AC). The entire region is in general disarray. (B) Available sequences used to inform the selection of PCR primers to characterize the gene, with a focus on filling the gap. The PCR primer selection was based on the use of the human reference alternative assembly genomic scaffold sequence NW_001838016 and other existing cDNA and genomic sequences. The High Throughput Genomics (http://www.ncbi.nlm.nih.gov/genbank/htgs) working draft sequence FP326773 became available during the course of this work, and it differs from NW_001838016 in length in the regions indicated. Together, these two sequences provide information for the 5′ end of the gap and contribute to the gene model. From previous efforts, there is strong evidence that the 3′ end portion of the gap consists of the MUC5AC large central exon. The available sequences in the large central exon region that were used to inform the PCR strategy consisted of one partial mRNA (AJ298317) and two partial genomic PCR sequences (AJ298318 and AJ298319) (22). These sequences were used in conjunction with NW_001838016 and FP326773 to generate a gene model, which was consistent with previous efforts (17). (C) Schematic representation of the overlapping PCR products used in contig development for de novo assembly of the MUC5AC gene from the African American (AfrAm) subject and the region of focus for white subjects with cystic fibrosis (CauCF1–3). The AfrAm individual was sequenced in two phases (Phase 1 and 2) as described in Materials and Methods (further details are provided in Table E2).
Figure 2.
MUC5AC gene defined in an AfrAm subject. (A) Schematic representation of de novo assembled sequence contigs produced from Pacific Biosciences sequencing of pooled PCR product sequences from the AfrAm subject. The relative sizes of the contigs are shown roughly to proportion, and the predicted gap size for the reference genome is 22.1 kb based on the location of flanking GRCh37/hg19 sequences. (B) Schematic representation of MUC5AC gene showing exon locations. The large central exon is predicted to be exon 31, containing nine CysD domains and the tandem repeat (PTS-TR) sequences. (C) Schematic MUC5AC mRNA protein translation showing the major protein domains and their relationship to the entire gene and the central exon. The central exon consists of a 5′ region characterized by one Class I CysD domain, duplicated pairs of Class II and Class III CysD domains, and adjacent homologous sequence, which is rich in prolines, threonines, and serines (PTS region). The 3′ half of the central exon has a different structure, which is characterized by Class III CysD domains and adjacent unique sequences separated by PTS-TR units (TR1–4) of 24-bp imperfect repeats. Other protein features shown were previously defined (22). C domain = von Willebrand factor type C domain; CK domain = C-terminal cysteine knot domain; D domain = von Willebrand factor type D domain. The definition of the CysD domain classes is provided in the text.
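As an illustration of what "24-bp imperfect repeats" means in sequence terms, the following minimal sketch scores how strongly a sequence repeats at a given period by comparing each base with the base one period downstream. The function name `period_identity` and the example repeat unit are hypothetical and are not taken from the study; the unit shown is an arbitrary 24-bp string, not the MUC5AC repeat.

```python
def period_identity(seq: str, period: int = 24) -> float:
    """Fraction of positions i where seq[i] equals seq[i + period].

    A perfect tandem repeat of the given period scores 1.0;
    imperfect repeats score slightly lower.
    """
    seq = seq.upper()
    if len(seq) <= period:
        return 0.0
    matches = sum(a == b for a, b in zip(seq, seq[period:]))
    return matches / (len(seq) - period)


# Arbitrary 24-bp unit used only for illustration (an assumption).
unit = "ACAACTTCTGCACCTACAACCAGC"
perfect = unit * 4

# Introduce a single substitution in the second copy to mimic an imperfect repeat.
imperfect = perfect[:29] + "G" + perfect[30:]

print(period_identity(perfect))
print(period_identity(imperfect))
```

A score close to 1.0 at period 24 and clearly lower scores at neighboring periods would be consistent with a 24-bp tandem repeat array; this is a rough screen only and does not replace a dedicated tandem repeat finder.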
Figure 3.
Genetic variants and organization of the MUC5AC central exon. Sequence schematics representing the MUC5AC central exon from the AfrAm subject and three white subjects with cystic fibrosis produced by de novo assembly were compared, as a group and individually, with the central exon model (22), and the results are shown. All four subjects in this study have larger PTS-TR1 (extra 1.9 kb, blue bar) and PTS-TR4 (extra 216 bp, green bar) regions than the draft genome model sequence (Figure 1B). The increase in the PTS-TR lengths relative to the previously known model, shown in more detail in the central exon model of this figure, effectively links the previously available genomic fragments (Figure 1) into one unit and completes the central exon sequence. Indels are shown as purple bars, pink bars, or black stars. The three classes of CysD domains are shown (colored ovals). The duplication of the CysD domains in CauCF3, as compared with other subjects, is indicated by CysD4a, CysD5a, and CysD6a. HinfI sites are shown by red arrows, and the small and large HinfI fragment lengths identified by Southern blots are shown in red text and are very similar to the in silico sizes (blue text).
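The in silico HinfI fragment sizes mentioned above can be approximated with a short script. The sketch below assumes a linear sequence and the standard HinfI recognition site G^ANTC (cut after the first G on the top strand). The function name `hinfi_fragments` and the toy sequence are hypothetical, and this is not the tool used in the study.

```python
import re

# HinfI recognizes GANTC and cuts after the G (G^ANTC). The site is its own
# reverse complement, so scanning one strand is sufficient.
HINFI_SITE = re.compile(r"(?=GA[ACGT]TC)")


def hinfi_fragments(seq: str) -> list[int]:
    """Return fragment lengths (5' to 3') from a complete HinfI digest
    of a linear DNA sequence."""
    seq = seq.upper()
    cuts = [m.start() + 1 for m in HINFI_SITE.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy sequence with two HinfI sites (GAATC and GACTC); illustration only.
toy = "AAAGAATCCCCGACTCTT"
print(hinfi_fragments(toy))
```

Fragment lengths computed this way from an assembled contig could then be compared with the band sizes observed on a Southern blot, keeping in mind that gel-based size estimates are approximate.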
Figure 4.
Confirmation of structural variation in subject CauCF3 predicted from de novo assembly. (A) PCR primers located in CysD1 and PTS-TR1 were used to amplify genomic DNA. The expected increase in size (from 2.3 to 3.4 kb), consistent with the addition of a 5′ region conserved duplicon, was observed in CauCF3. (B) Long sequence reads from BbsI-enriched genomic DNA (see Materials and Methods) from CauCF3 were mapped to the de novo consensus contig 1 from CauCF3 using Burrows-Wheeler Aligner with custom parameters (49) and were visualized in the Integrative Genomics Viewer software (50, 51). The black, purple, and red bars (Class I, Class II, and Class III, respectively) indicate the locations of the 12 CysD domains on the de novo contig produced from the PCR amplification of the region from CauCF3. The gray horizontal bars show the individual BbsI-enriched genomic DNA sequence reads mapped to the contig. The black bars, within the gray sequence reads, indicate regions within the individual sequences that have a high indel content (a consequence of SMRT sequencing errors). Several individual sequence reads show the seven (CysD1–CysD5a) predicted CysD domains at the 5′ region. Several other individual sequence reads demonstrate the duplication of CysD6a in the 3′ region. (C) Although the coverage was not sufficient to produce a de novo assembly free of sequence errors across the entire region, six de novo assembled contigs were generated from the low-coverage genomic DNA reads that mapped to the previously determined MUC5AC gene model (not shown). BLAST of CysD domain sequences to one of these contigs demonstrates the arrangement of CysD domains expected from CysD4a-CysD5a duplication in CauCF3 contigs (purple: CysD2 aligning to Class II; red: CysD3 aligning to Class III). The dotted vertical lines mark the region in (B) and (C) for illustrative purposes.

References

    1. Rose MC, Voynow JA. Respiratory tract mucin genes and mucin glycoproteins in health and disease. Physiol Rev. 2006;86:245–278.
    2. Hansson GC. Role of mucus layers in gut infection and inflammation. Curr Opin Microbiol. 2012;15:57–62.
    3. Rodríguez-Piñeiro AM, Bergström JH, Ermund A, Gustafsson JK, Schuette A, Johansson ME, Hansson GC. Gastrointestinal mucus proteome reveals Muc2 and Muc5ac accompanied by a set of core proteins: 2. Studies of mucus in mouse stomach, small intestine, and colon. Am J Physiol Gastrointest Liver Physiol. 2013;305:G348–G356.
    4. Linden SK, Sutton P, Karlsson NG, Korolik V, McGuckin MA. Mucins in the mucosal barrier to infection. Mucosal Immunol. 2008;1:183–197.
    5. Stonebraker JR, Wagner D, Lefensty RW, Burns K, Gendler SJ, Bergelson JM, Boucher RC, O’Neal WK, Pickles RJ. Glycocalyx restricts adenoviral vector access to apical receptors expressed on respiratory epithelium in vitro and in vivo: role for tethered mucins as barriers to lumenal infection. J Virol. 2004;78:13755–13768.
    6. Button B, Cai LH, Ehre C, Kesimer M, Hill DB, Sheehan JK, Boucher RC, Rubinstein M. A periciliary brush promotes the lung health by separating the mucus layer from airway epithelia. Science. 2012;337:937–941.
    7. Hasnain SZ, Evans CM, Roy M, Gallagher AL, Kindrachuk KN, Barron L, Dickey BF, Wilson MS, Wynn TA, Grencis RK, et al. Muc5ac: a critical component mediating the rejection of enteric nematodes. J Exp Med. 2011;208:893–900.
    8. Koeppen M, McNamee EN, Brodsky KS, Aherne CM, Faigle M, Downey GP, Colgan SP, Evans CM, Schwartz DA, Eltzschig HK. Detrimental role of the airway mucin Muc5ac during ventilator-induced lung injury. Mucosal Immunol. 2013;6:762–775.
    9. Boltin D, Perets TT, Vilkin A, Niv Y. Mucin function in inflammatory bowel disease: an update. J Clin Gastroenterol. 2013;47:106–111.
    10. Larsson JM, Karlsson H, Crespo JG, Johansson ME, Eklund L, Sjövall H, Hansson GC. Altered O-glycosylation profile of MUC2 mucin occurs in active ulcerative colitis and is associated with increased inflammation. Inflamm Bowel Dis. 2011;17:2299–2307.
    11. Heazlewood CK, Cook MC, Eri R, Price GR, Tauro SB, Taupin D, Thornton DJ, Png CW, Crockford TL, Cornall RJ, et al. Aberrant mucin assembly in mice causes endoplasmic reticulum stress and spontaneous inflammation resembling ulcerative colitis. PLoS Med. 2008;5:e54.
    12. Van der Sluis M, De Koning BA, De Bruijn AC, Velcich A, Meijerink JP, Van Goudoever JB, Büller HA, Dekker J, Van Seuningen I, Renes IB, et al. Muc2-deficient mice spontaneously develop colitis, indicating that MUC2 is critical for colonic protection. Gastroenterology. 2006;131:117–129.
    13. Kobayashi M, Lee H, Nakayama J, Fukuda M. Roles of gastric mucin-type O-glycans in the pathogenesis of Helicobacter pylori infection. Glycobiology. 2009;19:453–461.
    14. Niv Y, Boltin D. Secreted and membrane-bound mucins and idiopathic peptic ulcer disease. Digestion. 2012;86:258–263.
    15. Lindén S, Mahdavi J, Hedenbro J, Borén T, Carlstedt I. Effects of pH on Helicobacter pylori binding to human gastric mucins: identification of binding to non-MUC5AC mucins. Biochem J. 2004;384:263–270.
    16. Kirby A, Gnirke A, Jaffe DB, Barešová V, Pochet N, Blumenstiel B, Ye C, Aird D, Stevens C, Robinson JT, et al. Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing. Nat Genet. 2013;45:299–303.
    17. Seibold MA, Wise AL, Speer MC, Steele MP, Brown KK, Loyd JE, Fingerlin TE, Zhang W, Gudmundsson G, Groshong SD, et al. A common MUC5B promoter polymorphism and pulmonary fibrosis. N Engl J Med. 2011;364:1503–1512.
    18. Stock CJ, Sato H, Fonseca C, Banya WA, Molyneaux PL, Adamali H, Russell AM, Denton CP, Abraham DJ, Hansell DM, et al. Mucin 5B promoter polymorphism is associated with idiopathic pulmonary fibrosis but not with development of lung fibrosis in systemic sclerosis or sarcoidosis. Thorax. 2013;68:436–441.
    19. Zhang Y, Noth I, Garcia JG, Kaminski N. A variant in the promoter of MUC5B and idiopathic pulmonary fibrosis. N Engl J Med. 2011;364:1576–1577.
    20. Guo X, Pace RG, Stonebraker JR, Commander CW, Dang AT, Drumm ML, Harris A, Zou F, Swallow DM, Wright FA, et al. Mucin variable number tandem repeat polymorphisms and severity of cystic fibrosis lung disease: significant association with MUC5AC. PLoS ONE. 2011;6:e25452.
    21. Desseyn JL, Aubert JP, Porchet N, Laine A. Evolution of the large secreted gel-forming mucins. Mol Biol Evol. 2000;17:1175–1184.
    22. Escande F, Aubert JP, Porchet N, Buisine MP. Human mucin gene MUC5AC: organization of its 5′-region and central repetitive region. Biochem J. 2001;358:763–772.
    23. Lang T, Hansson GC, Samuelsson T. Gel-forming mucins appeared early in metazoan evolution. Proc Natl Acad Sci USA. 2007;104:16209–16214.
    24. Thornton DJ, Rousseau K, McGuckin MA. Structure and function of the polymeric mucins in airways mucus. Annu Rev Physiol. 2008;70:459–486.
    25. Vinall LE, Hill AS, Pigny P, Pratt WS, Toribara N, Gum JR, Kim YS, Porchet N, Aubert JP, Swallow DM. Variable number tandem repeat polymorphism of the mucin genes located in the complex on 11p15.5. Hum Genet. 1998;102:357–366.
    26. Fowler J, Vinall L, Swallow D. Polymorphism of the human muc genes. Front Biosci. 2001;6:D1207–D1215.
    27. Rousseau K, Swallow DM. Mucin methods: genes encoding mucins and their genetic variation with a focus on gel-forming mucins. Methods Mol Biol. 2012;842:1–26.
    28. Zhang X, Davenport KW, Gu W, Daligault HE, Munk AC, Tashima H, Reitenga K, Green LD, Han CS. Improving genome assemblies by sequencing PCR products with PacBio. Biotechniques. 2012;53:61–62.
    29. English AC, Richards S, Han Y, Wang M, Vee V, Qu J, Qin X, Muzny DM, Reid JG, Worley KC, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE. 2012;7:e47768.
    30. Zheng S, Byrd AS, Fischer BM, Grover AR, Ghio AJ, Voynow JA. Regulation of MUC5AC expression by NAD(P)H:quinone oxidoreductase 1. Free Radic Biol Med. 2007;42:1398–1408.
    31. Drumm ML, Konstan MW, Schluchter MD, Handler A, Pace R, Zou F, Zariwala M, Fargo D, Xu A, Dunn JM, et al. Gene Modifier Study Group. Genetic modifiers of lung disease in cystic fibrosis. N Engl J Med. 2005;353:1443–1453.
    32. Ye J, Coulouris G, Zaretskaya I, Cutcutache I, Rozen S, Madden TL. Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics. 2012;13:134.
    33. Bikandi J, San Millán R, Rementeria A, Garaizar J. In silico analysis of complete bacterial genomes: PCR, AFLP-PCR and endonuclease restriction. Bioinformatics. 2004;20:798–799.
    34. Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WE, Wetter T, Suhai S. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004;14:1147–1159.
    35. Meezaman D, Charles P, Daskal E, Polymeropoulos MH, Martin BM, Rose MC. Cloning and analysis of cDNA encoding a major airway glycoprotein, human tracheobronchial mucin (MUC5). J Biol Chem. 1994;269:12932–12939.
    36. Bolisetty MT, Beemon KL. Splicing of internal large exons is defined by novel cis-acting sequence elements. Nucleic Acids Res. 2012;40:9244–9254.
    37. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–138.
    38. Ewing B, Green P. Base-calling of automated sequencer traces using phred: II. Error probabilities. Genome Res. 1998;8:186–194.
    39. Marques-Bonet T, Girirajan S, Eichler EE. The origins and impact of primate segmental duplications. Trends Genet. 2009;25:443–454.
    40. Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437–455.
    41. Rousseau K, Byrne C, Griesinger G, Leung A, Chung A, Hill AS, Swallow DM. Allelic association and recombination hotspots in the mucin gene (MUC) complex on chromosome 11p15.5. Ann Hum Genet. 2007;71:561–569.
    42. Desseyn JL. Mucin CYS domains are ancient and highly conserved modules that evolved in concert. Mol Phylogenet Evol. 2009;52:284–292.
    43. Wright FA, Strug LJ, Doshi VK, Commander CW, Blackman SM, Sun L, Berthiaume Y, Cutler D, Cojocaru A, Collaco JM, et al. Genome-wide association and linkage identify modifier loci of lung disease severity in cystic fibrosis at 11p13 and 20q13.2. Nat Genet. 2011;43:539–546.
    44. Spada F, Steen H, Troedsson C, Kallesoe T, Spriet E, Mann M, Thompson EM. Molecular patterning of the oikoplastic epithelium of the larvacean tunicate Oikopleura dioica. J Biol Chem. 2001;276:20624–20632.
    45. Ambort D, van der Post S, Johansson ME, Mackenzie J, Thomsson E, Krengel U, Hansson GC. Function of the CysD domain of the gel-forming MUC2 mucin. Biochem J. 2011;436:61–70.
    46. Thornton DJ, Howard M, Khan N, Sheehan JK. Identification of two glycoforms of the MUC5B mucin in human respiratory mucus: evidence for a cysteine-rich sequence repeated within the molecule. J Biol Chem. 1997;272:9561–9566.
    47. Bäckström M, Ambort D, Thomsson E, Johansson ME, Hansson GC. Increased understanding of the biochemistry and biosynthesis of MUC2 and other gel-forming mucins through the recombinant expression of their protein domains. Mol Biotechnol. 2013;54:250–256.
    48. Perez-Vilar J, Randell SH, Boucher RC. C-Mannosylation of MUC5AC and MUC5B Cys subdomains. Glycobiology. 2004;14:325–337.
    49. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–595.
    50. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–192.
    51. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011;29:24–26.

Source: PubMed
