Fast and accurate short read alignment with Burrows-Wheeler transform
Heng Li, Richard Durbin, Heng Li, Richard Durbin
Abstract
Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.
Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is approximately 10-20x faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package.
Availability: http://maq.sourceforge.net.
Figures
References
- Burrows M, Wheeler DJ. Technical report 124. Palo Alto, CA: Digital Equipment Corporation; 1994. A block-sorting lossless data compression algorithm.
- Campagna D, et al. PASS: a program to align short sequences. Bioinformatics. 2009;25:967–968.
- Eaves HL, Gao Y. MOM: maximum oligonucleotide mapping. Bioinformatics. 2009;25:969–970.
- Ferragina P, Manzini G. Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000) IEEE Computer Society; 2000. Opportunistic data structures with applications; pp. 390–398.
- Grossi R, Vitter JS. Proceedings on 32nd Annual ACM Symposium on Theory of Computing (STOC 2000) ACM; 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching; pp. 397–406.
- Hon W.-K, et al. A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica. 2007;48:23–36.
- Jiang H, Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008;24:2395–2396.
- Jung Kim Y, et al. ProbeMatch: a tool for aligning oligonucleotide sequences. Bioinformatics. 2009;25:1424–1425.
- Lam TW, et al. Compressed indexing and local alignment of DNA. Bioinformatics. 2008;24:791–797.
- Langmead B, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
- Lin H, et al. ZOOM! Zillions of oligos mapped. Bioinformatics. 2008;24:2431–2437.
- Lippert RA. Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J. Comput. Biol. 2005;12:407–415.
- Li H, et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008a;18:1851–1858.
- Li R, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008b;24:713–714.
- Malhis N, et al. Slider–maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics. 2009;25:6–13.
- Schatz M. Cloudburst: highly sensitive read mapping with mapreduce. Bioinformatics. 2009;25:1363–1369.
- Smith AD, et al. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008;9:128.
Source: PubMed