Fold assessment for comparative protein structure modeling

Francisco Melo, Andrej Sali

Abstract

Accurate and automated assessment of both geometrical errors and incompleteness of comparative protein structure models is necessary for adequate use of the models. Here, we describe a composite score for discriminating between models with the correct fold and models with an incorrect fold. To find an accurate composite score, we designed and applied a genetic algorithm method that searched for the most informative subset of 21 input model features as well as their optimized nonlinear transformation into the composite score. The 21 input features included various statistical potential scores, stereochemistry quality descriptors, sequence alignment scores, geometrical descriptors, and measures of protein packing. The optimized composite score was found to depend on (1) a statistical potential z-score for residue accessibilities and distances, (2) model compactness, and (3) percentage sequence identity of the alignment used to build the model. The accuracy of the composite score was compared with the accuracy of assessment by single and combined features as well as by other commonly used assessment methods. The testing set was representative of models produced by automated comparative modeling on a genomic scale. The composite score performed better than any other tested score in terms of the maximum correct classification rate (i.e., 3.3% false positives and 2.5% false negatives) as well as the sensitivity and specificity across the whole range of thresholds. The composite score was implemented in our program MODELLER-8 and was used to assess models in the MODBASE database, which contains comparative models for domains in approximately 1.3 million protein sequences.

Figures

Figure 1.
ROC curves of the most accurate classifiers based on single features. ROC curves are shown for the single features (Table 1) that are most accurate for fold assessment (Table 2). (A) The combined statistical potential z-score (Melo et al. 2002) of the model (thick continuous line), pairwise statistical potential z-score (Melo et al. 2002) of the model (thick dashed line), accessible surface statistical potential z-score (Melo et al. 2002) of the model (thin continuous line), and z-score of the target–template alignment (thin dashed line). (B) Same as A, but magnified.
Figure 2.
ROC curves of the most accurate classifiers based on a combination of multiple features. The ROC curves are shown for several of the most accurate combined scores (Tables 2, 3). (A) The ROC curves for the GA341 discriminant function (thick continuous line), LDA discriminant function (thick dashed line), pG score (thin continuous line), and the combined statistical potential z-score of the model (thin dashed line). (B) Same as A, but magnified. (C) Magnified X-axis allows a better comparison of the classifier specificities. (D) Magnified Y-axis allows a better comparison of the classifier sensitivities.
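The ROC curves in Figures 1 and 2 can be read as the result of sweeping a classification threshold over a score and recording, at each threshold, the fraction of incorrect models accepted (false positive rate) against the fraction of correct models accepted (sensitivity). The following is a minimal sketch of that construction, not the authors' code. It assumes that higher scores indicate correct models; the function name `roc_points` and the input layout are illustrative choices.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float]]:
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    Assumption: higher scores indicate a correct model. Labels are True for
    models with the correct fold and False for models with an incorrect fold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one correct and one incorrect model")

    # Start at the origin: a threshold above every score accepts nothing.
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest observed score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

A classifier whose curve rises to a sensitivity of 1 while the false positive rate is still near 0 separates the two classes well, which is what the magnified panels are meant to expose.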
Figure 3.
The GA341 discriminant function. The discriminant function values range from 0 to 1, with higher values corresponding to correct models. Because the function depends on three variables, each column represents the three-dimensional surface of the discriminant function when one variable is fixed at a single value. (Left column) Combined statistical potential z-score (Melo et al. 2002) of the model is fixed (values increase from top to bottom). (Middle column) Percentage sequence identity of the alignment used to build the model is fixed (values decrease from top to bottom). (Right column) Model compactness is fixed (values decrease from top to bottom).
Figure 4.
Fold assessment scheme. The GA341 score depends on three variables: The percentage sequence identity is calculated from the alignment that was used to build the model, while the compactness (Materials and Methods) and the combined statistical potential z-score (Melo et al. 2002) are calculated from the 3D model itself. Next, the length of the model and the GA341 score are plugged into a naive Bayesian classifier to obtain the conditional probability that the model is correct, pC(GA341,length). Typically, pC ≥ 0.7 predicts that the model is correct; otherwise, it is classified as incorrect.
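The final step of the scheme, turning the GA341 score and the model length into a probability and applying the 0.7 cutoff, can be sketched as follows. This is an illustration under stated assumptions, not the MODELLER implementation: the per-class likelihood values are supplied by the caller (in practice they would be estimated from training data), the prior of 0.5 is a placeholder, and the names `posterior_correct` and `classify` are hypothetical. Only the cutoff of 0.7 comes from the caption.

```python
def posterior_correct(
    lik_ga341_correct: float,
    lik_length_correct: float,
    lik_ga341_incorrect: float,
    lik_length_incorrect: float,
    prior_correct: float = 0.5,  # placeholder prior, not from the paper
) -> float:
    """Naive Bayes posterior P(correct | GA341, length).

    The naive assumption is that GA341 and length are conditionally
    independent given the class, so their likelihoods multiply.
    """
    joint_correct = lik_ga341_correct * lik_length_correct * prior_correct
    joint_incorrect = (
        lik_ga341_incorrect * lik_length_incorrect * (1.0 - prior_correct)
    )
    evidence = joint_correct + joint_incorrect
    if evidence == 0.0:
        raise ValueError("both class likelihoods are zero")
    return joint_correct / evidence


def classify(p_correct: float, cutoff: float = 0.7) -> str:
    """Apply the decision rule from the caption: pC >= cutoff means correct."""
    return "correct" if p_correct >= cutoff else "incorrect"
```

For example, if the observed GA341 value is four times more likely under the correct class than under the incorrect class and the length is equally likely under both, the posterior with an even prior is 0.8, and the model is classified as correct.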
Figure 5.
Genetic algorithm for finding optimal feature subsets and multivariate classifiers. (A) Typical flowchart of a genetic algorithm. (B) Coding of a mathematical expression into a linear string of numbers. (C) Decoding a chromosome by following the prefix or Polish notation from left to right. The dashed line represents the action of an operator. (D) Calculation of the fitness value for a chromosome. For details see Materials and Methods.
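Panels B and C describe storing a mathematical expression as a linear string and decoding it in prefix (Polish) notation from left to right. A minimal sketch of such a decoder is shown below. The token encoding is an assumption made for illustration: the paper codes chromosomes as numbers, whereas this sketch uses operator symbols, feature names, and numeric constants directly. The operator set and the name `eval_prefix` are likewise hypothetical.

```python
import operator
from typing import Dict, List, Tuple, Union

Token = Union[str, float]

# Illustrative operator set; the operators used in the paper may differ.
BINARY_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def eval_prefix(tokens: List[Token], features: Dict[str, float]) -> float:
    """Evaluate a prefix-notation expression by reading tokens left to right."""

    def parse(pos: int) -> Tuple[float, int]:
        # Returns the value of the subexpression starting at pos and the
        # position of the first token after it.
        if pos >= len(tokens):
            raise ValueError("expression ended before all operands were read")
        tok = tokens[pos]
        if isinstance(tok, str) and tok in BINARY_OPS:
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return BINARY_OPS[tok](left, right), pos
        if isinstance(tok, str):
            return features[tok], pos + 1  # named model feature
        return float(tok), pos + 1  # numeric constant

    value, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value
```

With this encoding, the string `["+", "*", 2.0, "x", "y"]` decodes to the expression 2x + y. Because every operator has a fixed number of operands, the prefix form needs no parentheses, which keeps the chromosome linear.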

Source: PubMed
