Statistical design for biospecimen cohort size in proteomics-based biomarker discovery and verification studies

Steven J Skates, Michael A Gillette, Joshua LaBaer, Steven A Carr, Leigh Anderson, Daniel C Liebler, David Ransohoff, Nader Rifai, Marina Kondratovich, Živana Težak, Elizabeth Mansfield, Ann L Oberg, Ian Wright, Grady Barnes, Mitchell Gail, Mehdi Mesri, Christopher R Kinsinger, Henry Rodriguez, Emily S Boja

Abstract

Protein biomarkers are needed to deepen our understanding of cancer biology and to improve our ability to diagnose, monitor, and treat cancers. Important analytical and clinical hurdles must be overcome to allow the most promising protein biomarker candidates to advance into clinical validation studies. Although contemporary proteomics technologies support the measurement of large numbers of proteins in individual clinical specimens, sample throughput remains comparatively low. This problem is amplified in typical clinical proteomics research studies, which routinely lack proper experimental design and therefore analyze too few biospecimens to achieve adequate statistical power at each stage of a biomarker pipeline. To address this critical shortcoming, a joint workshop was held by the National Cancer Institute (NCI), the National Heart, Lung, and Blood Institute (NHLBI), and the American Association for Clinical Chemistry (AACC), with participation from the U.S. Food and Drug Administration (FDA). An important output from the workshop was a statistical framework for the design of biomarker discovery and verification studies. Herein, we describe the use of quantitative clinical judgments to set statistical criteria for clinical relevance, and the development of an approach to calculate biospecimen sample size for proteomic studies in the discovery and verification stages prior to the clinical validation stage. This represents a first step toward building a consensus on quantitative criteria for the statistical design of proteomics biomarker discovery and verification research.

Figures

Figure 1. Distribution of Proteins in Blood (Plasma/Serum) by Concentration Decade
This is a discrete version of a triangular distribution of the number of plasma proteins with increasing concentration decade (adapted from Horton and Anderson et al. [24]). Until a human protein quantitation project is completed, the distribution of plasma proteins as a function of concentration below 4 logs of concentration is based on an extrapolation.
Figure 2. Distribution of Biological CV by Concentration Decade
The biological CV, denoted by σ, is plotted against concentration decade for the blood protein tests tabulated in Ricós et al. [25]. A statistical regression model estimates the increasing expected level (blue line) and increasing variation (red lines at 1 SD and 2 SDs) of σ as a function of concentration decade on the log scale. The model provides estimates of the variation of plasma proteins across the nine decades of concentration simulated for the power calculations.
Figure 3. Separation of Biomarker Distribution between Cases Shedding the Biomarker and Controls, Crossed with Fraction of Cases Shedding Biomarker
This figure is a simulation example intended to help biomarker researchers choose the expected separation between cases and controls provided by the target biomarker (rows) and the fraction of cases shedding the biomarker (columns). These two parameters are instrumental in determining the required number of samples. The biomarker distribution in controls is given by the blue histogram, with its density represented by the dashed line. Cases are a mixture: tumors that shed the biomarker have a distribution (light red) shifted to the right of the control distribution by 5, 4, 3, and 2 SDs in the 1st, 2nd, 3rd, and 4th rows, respectively, while cases that do not shed the biomarker have the same biomarker distribution as controls. The proportion of cases shedding the biomarker changes by column from 80% to 50% to 20% in the 1st, 2nd, and 3rd columns, respectively. The red histogram represents the mixture of the cases shedding the biomarker (solid line on right, light red) and the cases not shedding the biomarker (solid line on left under the dashed line, dark red). The top left corner (5 SDs of separation with 80% of cases shedding the biomarker) illustrates the most extreme and easy-to-discover tumor biomarker (CA125). This situation therefore forms the extreme of the spectrum of separation and fraction of cases shedding the biomarker, with the subsequent panels decreasing the separation, the fraction shedding the biomarker, or both. Biomarker discoverers need to judge where the “to-be-discovered” biomarker lies within this spectrum in order to estimate the sample size required in the discovery and verification stages of a multistage proteomic pipeline.
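The two-parameter mixture described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation or power calculation: it assumes biomarker values are normally distributed in units of the control SD (control mean 0, SD 1), that shedding cases differ from controls only by a shift in mean, and that a single threshold is set at 98% specificity. The function names and the 98% figure are hypothetical choices made for the example.

```python
import random
from statistics import NormalDist


def simulate_cases(n, separation_sds, fraction_shedding, rng):
    """Draw n case values (in control-SD units) from the two-component mixture.

    Assumption: shedding cases ~ Normal(separation_sds, 1),
    non-shedding cases ~ Normal(0, 1), i.e. identical to controls.
    """
    values = []
    for _ in range(n):
        sheds = rng.random() < fraction_shedding
        mean = separation_sds if sheds else 0.0
        values.append(rng.gauss(mean, 1.0))
    return values


def sensitivity_at_specificity(separation_sds, fraction_shedding, specificity=0.98):
    """Analytic sensitivity of a one-threshold test under the assumed mixture."""
    threshold = NormalDist(0.0, 1.0).inv_cdf(specificity)
    # Shedding cases exceed the threshold according to the shifted distribution.
    shed_positive = 1.0 - NormalDist(separation_sds, 1.0).cdf(threshold)
    # Non-shedding cases look like controls, so they exceed it at the false-positive rate.
    nonshed_positive = 1.0 - specificity
    return (fraction_shedding * shed_positive
            + (1.0 - fraction_shedding) * nonshed_positive)


# Same grid as the figure: rows are separations, columns are shedding fractions.
for separation in (5, 4, 3, 2):
    row = [sensitivity_at_specificity(separation, fraction)
           for fraction in (0.8, 0.5, 0.2)]
    print(f"{separation} SDs:", "  ".join(f"{s:.2f}" for s in row))
```

Under these assumptions the sensitivity cannot rise much above the fraction of cases shedding the biomarker, however large the separation, which is one way to see why both parameters matter when judging the required number of samples.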
Figure 4. Distribution and Separation of Cases versus Controls of CA125 in Blood
For CA125, the typical median measurement in controls is 15 U/mL, while the typical median measurement in cases at diagnosis of late-stage disease is 100 U/mL, an increase of more than 6-fold, or 1.9 = ln(100/15) on the natural-log scale. With CA125 having an inter-person SD of 0.50 on the log scale (a CV of ~50%), this difference corresponds to a signal of 3.8 SDs. However, CA125 in ovarian cancer is a rare exception: its separation and ubiquity of expression enable it to be detected with relatively small sample sizes. For other protein biomarker candidates, the impact of sample size on discovery and verification would likely need to be examined for signals of 1, 2, 3, 4, and 5 SDs.
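The arithmetic in this caption can be checked with a short Python sketch. The helper name `signal_in_sds` is hypothetical; the inputs (medians of 15 and 100 U/mL, log-scale SD of 0.50) are the values quoted in the caption, and the logarithm is taken to be the natural log, which is what reproduces the quoted 1.9.

```python
import math


def signal_in_sds(median_controls, median_cases, log_sd):
    """Case/control separation in units of the inter-person SD on the log scale."""
    return math.log(median_cases / median_controls) / log_sd


fold_increase = 100 / 15
log_difference = math.log(100 / 15)
signal = signal_in_sds(15, 100, 0.50)

print(f"fold increase:  {fold_increase:.2f}")
print(f"log difference: {log_difference:.2f}")
print(f"signal:         {signal:.1f} SDs")
```

The same helper can be applied to a candidate biomarker's expected medians and log-scale SD to place it on the 1 to 5 SD scale discussed in the caption.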

Source: PubMed
