Liquid Chromatography Mass Spectrometry-Based Proteomics: Biological and Technological Aspects

Yuliya V Karpievitch, Ashoka D Polpitiya, Gordon A Anderson, Richard D Smith, Alan R Dabney

Abstract

Mass spectrometry-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. Though recent years have seen tremendous improvements in instrument performance and in the computational tools used, significant challenges remain, and there are many opportunities for statisticians to make important contributions. In the most widely used "bottom-up" approach to proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage; the resulting peptide products are then separated based on chemical or physical properties and analyzed using a mass spectrometer. The two fundamental challenges in the analysis of bottom-up MS-based proteomics are (1) identifying the proteins that are present in a sample, and (2) quantifying the abundance levels of the identified proteins. Both of these challenges require knowledge of the biological and technological context that gives rise to observed data, as well as the application of sound statistical principles for estimation and inference. We present an overview of bottom-up proteomics and outline the key statistical issues that arise in protein identification and quantification.

Figures

Figure 1
Overview of LC-MS-based proteomics. Proteins are extracted from biological samples, then digested and ionized prior to introduction to the mass spectrometer. Each MS scan results in a mass spectrum, measuring m/z values and peak intensities. Based on observed spectral information, database searching is typically employed to identify the peptides most likely responsible for high-abundance peaks. Finally, peptide information is rolled up to the protein level, and protein abundance is quantified using either peak intensities or spectral counts.
Figure 2
Sample preparation. Complex biological samples are first processed to extract proteins. Proteins are typically fractionated to eliminate high-abundance proteins or other proteins that are not of interest. The remaining proteins are then digested into peptides, which are commonly introduced to a liquid chromatography column for separation. Upon eluting from the LC column, peptides are ionized.
Figure 3
Mass spectrometry. The mass spectrometer consists of an ion source, which ionizes peptides, and a mass analyzer and detector, which record m/z values and intensities, respectively, for each ion species. Each MS scan results in a mass spectrum, and a single sample may be subjected to thousands of scans.
Figure 4
Data acquisition: (a) Scan numbers and m/z values for an example raw LC-MS dataset. Each individual scan contains a single mass spectrum. (b) The mass spectrum for scan 5338. (c) A zoomed-in look at scans 5275–5400 in m/z range 753–755.5. The cluster of dots is indicative of a single LC-MS “feature”. (d) The isotopic distribution for this feature in scan 5280. Peaks are separated by approximately 1/3, indicating a charge state of +3. The monoisotopic mass is thus 753.36 × 3 = 2260.08 Da. (e) The elution profile at m/z 753.36.
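The charge-state reasoning in panel (d) can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names and the proton-mass constant are assumptions introduced here. The first mass function reproduces the caption's m/z × z arithmetic; the second applies the common convention of also subtracting the mass of the charge-carrying protons, which the caption does not do.

```python
PROTON_MASS = 1.007276  # Da; assumed constant, not stated in the caption


def charge_from_spacing(isotope_spacing):
    """Infer the charge state from the m/z spacing of adjacent isotopic peaks.

    Isotopic peaks differ by roughly 1 Da in mass, so on the m/z axis they
    are separated by about 1/z.
    """
    if isotope_spacing <= 0:
        raise ValueError("isotope spacing must be positive")
    return round(1.0 / isotope_spacing)


def mass_times_charge(mz, charge):
    """Mass as computed in the caption: m/z multiplied by the charge."""
    return mz * charge


def neutral_mass(mz, charge):
    """Neutral mass, assuming each charge is carried by one added proton."""
    return charge * (mz - PROTON_MASS)


z = charge_from_spacing(1 / 3)
print(z)                                        # charge state
print(round(mass_times_charge(753.36, z), 2))   # caption's value
print(round(neutral_mass(753.36, z), 2))        # proton-corrected value
```

With a spacing of 1/3 the inferred charge is 3, and the first mass function returns the 2260.08 Da quoted in the caption; the proton-corrected value is about 3 Da lower.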
Figure 5
Protein identification. Peptide and protein identification is most commonly accomplished by matching observed spectral measurements to theoretical or previously observed measurements in a database. In LC-MS/MS, measurements consist of fragmentation spectra, whereas mass and elution time alone are used in high-resolution LC-MS. Once a best match is found, one of the following methods for assessing confidence in the match is employed: decoy databases, empirical Bayes, or “expectation values”.
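Of the three confidence methods named above, the decoy-database idea is the simplest to illustrate. The sketch below is a hypothetical example, not the procedure of any particular search engine: it assumes that target and decoy databases of equal size were searched together, that higher scores indicate better matches, and that the function name and the toy scores are invented for illustration.

```python
def decoy_fdr(scores, is_decoy, threshold):
    """Estimate the false discovery rate among accepted target matches.

    scores    -- match scores, higher is better (assumption)
    is_decoy  -- True where the best match came from the decoy database
    threshold -- matches scoring at or above this value are accepted

    Decoy hits above the threshold serve as an estimate of how many
    accepted target hits are false.
    """
    targets = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and d)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


# Toy data: six best matches, two of which hit decoy sequences.
scores = [9.1, 8.4, 7.9, 7.2, 6.5, 5.0]
is_decoy = [False, False, False, True, False, True]
print(decoy_fdr(scores, is_decoy, threshold=6.0))
```

Raising the threshold trades fewer accepted identifications for a lower estimated error rate, which is the usual way such an estimate is used to choose a score cutoff.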
Figure 6
Protein quantitation. The left panel shows the proportion of missing values in an example dataset as a function of the mean of the observed intensities for each peptide. There is a strong inverse relationship between these, suggesting that many missing intensities have been censored. The right panel shows an example protein found to be differentially expressed in a two-class human study. The protein had 6 identified peptides, although two were filtered out because they had too many missing values (peptides 1 and 2, as indicated by the vertical shaded lines). Estimated protein abundances and confidence intervals are constructed from the peptide-level intensities by a censored likelihood model (Karpievitch et al. (2009a)).
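To make the idea of censoring concrete, the following sketch writes out a generic left-censored normal log-likelihood. It is not the model of Karpievitch et al. (2009a), whose details are not given here; it assumes normally distributed log-intensities, a single known detection limit, and that every missing value fell below that limit. All function names and numbers are hypothetical.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def normal_logcdf(x, mu, sigma):
    """Log of the normal cumulative distribution function.

    Uses erfc; may underflow for values far below the mean.
    """
    z = (x - mu) / sigma
    return math.log(0.5 * math.erfc(-z / math.sqrt(2)))


def censored_loglik(observed, n_censored, limit, mu, sigma):
    """Log-likelihood with left-censoring at a detection limit.

    Observed values contribute their density; each censored value
    contributes the probability of falling below the limit.
    """
    ll = sum(normal_logpdf(x, mu, sigma) for x in observed)
    ll += n_censored * normal_logcdf(limit, mu, sigma)
    return ll


# Two observed log-intensities and two values censored below 19.
observed = [20.0, 21.0]
print(censored_loglik(observed, 2, 19.0, mu=19.0, sigma=1.0))
print(censored_loglik(observed, 2, 19.0, mu=22.0, sigma=1.0))
```

In this toy example the lower mean has the higher log-likelihood, because the censored values carry information that the true abundance is low. Simply averaging the observed intensities would ignore that information and overestimate abundance.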

Source: PubMed
