Methods of Secondary Structure Definition

Determining the secondary structure of a protein from its tertiary structure is not an exact science. Because there is no unique definition of secondary structure, different definition methods assign different secondary structures to the same protein. In fact, it has been shown that three of the most popular methods, DSSP, STRIDE and DEFINE, agree only 63% of the time.

DSSP

Dictionary of Protein Secondary Structure (DSSP) is a method for determining the secondary structure of a protein given its tertiary structure. It works from the coordinates of the atoms in a protein chain (some other methods consider the backbone dihedral angles phi and psi instead) and looks for hydrogen bonds between residues. Each residue is first checked for membership in a turn or a bridge by examining the hydrogen bonds it forms with other residues. Consecutive turns form a helix, and consecutive bridges form a ladder. A beta-sheet is a set of one or more ladders connected by shared residues. The algorithm outputs 8 states: alpha-helix, 3-10 helix, pi-helix, isolated beta-bridge, extended strand, turn, bend and coil. Unlike a turn, a bend is a purely geometric feature that does not depend on hydrogen bonds.

Other methods are also used in secondary structure definition. Some of them rely on the backbone dihedral angles phi and psi: a repetition of specific combinations of these angles results in a secondary structure motif (helix, sheet etc.). The difficulty with this approach is deciding which combinations correspond to a helix and which to a sheet. Each structure would require at least 4 parameters to specify its allowed region (2 boundaries for phi and 2 boundaries for psi). DSSP, by contrast, needs only a single energy cut-off to decide whether two residues form a hydrogen bond. The DSSP authors suggest that this is an advantage over the other methods.
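The single-cut-off idea can be illustrated with a minimal sketch of the electrostatic hydrogen bond energy used by DSSP. The constants (partial charges 0.42e and 0.20e, factor 332, cut-off -0.5 kcal/mol) are those of the published DSSP model; the function names and the coordinate format (3-tuples in angstroms) are assumptions made for this illustration only.

```python
import math

# Constants of the DSSP electrostatic model.
Q1 = 0.42       # partial charge on the C=O group, in electron charges
Q2 = 0.20       # partial charge on the N-H group, in electron charges
F = 332.0       # factor giving kcal/mol when distances are in angstroms
CUTOFF = -0.5   # kcal/mol; a bond is accepted when the energy is below this

def hbond_energy(c, o, n, h):
    """Energy between a C=O group (atoms c, o) and an N-H group (atoms n, h).

    Each argument is an (x, y, z) tuple; this argument layout is an
    assumption of the sketch, not part of DSSP itself.
    """
    return Q1 * Q2 * F * (
        1.0 / math.dist(o, n)
        + 1.0 / math.dist(c, h)
        - 1.0 / math.dist(o, h)
        - 1.0 / math.dist(c, n)
    )

def is_hbond(c, o, n, h):
    """Apply the single energy cut-off."""
    return hbond_energy(c, o, n, h) < CUTOFF
```

For a roughly ideal collinear arrangement, for example c=(0, 0, 0), o=(1.23, 0, 0), h=(3.13, 0, 0), n=(4.13, 0, 0), the energy is well below the cut-off, so the pair counts as bonded; moving the N-H group 10 angstroms away brings the energy close to zero and the bond is rejected.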

Methods of Secondary Structure Prediction

Secondary structure prediction methods rely on three major components:

  1. Sequence-to-Structure Component

Almost every method in the literature predicts secondary structure from sequence information. The methods described here take as input a frame of residues surrounding the residue whose secondary structure is to be predicted. For example, let “RICPRIWMECTRDS” be the sequence of a protein chain. When predicting the secondary structure of the residue W with a frame size of 9, the input to the algorithm would be “CPRIWMECT”.
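The frame extraction in this example can be sketched in a few lines. The function name `frame` and the padding symbol "-" for positions beyond the chain termini are assumptions of this sketch.

```python
def frame(sequence, index, size, pad="-"):
    """Return the window of `size` residues centred on sequence[index].

    Positions that fall beyond either terminus are filled with `pad`.
    """
    if size % 2 == 0:
        raise ValueError("frame size must be odd so the window has a centre")
    half = size // 2
    padded = pad * half + sequence + pad * half
    # After padding, the original position `index` sits at `index + half`,
    # so the window starts at `index`.
    return padded[index:index + size]

print(frame("RICPRIWMECTRDS", 6, 9))  # W is at index 6; prints CPRIWMECT
```

Near a terminus the window is padded rather than shortened, so every residue yields an input of the same length.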

  2. Structure-to-Structure Component

Frame-based predictions assign a structure to each residue one by one, so they do not take into account the structural states of consecutive residues. Most methods therefore add a second prediction step or a simple filter to smooth the initial predictions. For example, a helix consisting of a single residue is very unlikely. Indeed, DSSP itself builds a minimal helix from two consecutive turns, so it never assigns a helix to one isolated residue.

  3. Multiple Sequence Alignments

A major improvement in accuracy can be gained by introducing homology information into secondary structure predictions. The structures of the homologues of a protein need not be known in advance; only their sequences are required. The underlying assumption is that structure is more conserved than sequence.

PHD

PHD is the first method to break the 65% boundary in Q3 accuracy (three-state accuracy over the helix, strand and coil states). Its two-level neural network structure was later adopted by several other methods, such as JNet and PSIPRED; the two levels correspond to the sequence-to-structure and structure-to-structure components described above. The method uses a data set called RS126, which contains no pair of significantly homologous proteins. If the structure of a protein is predicted with a model trained on a homologue of known structure, the results will be better than those from a model whose training set contains no homologues of the protein. Including homologues of known structure is acceptable in real predictions, but it biases the results when the accuracy of a method is being assessed.
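Since Q3 is the accuracy measure quoted throughout this section, a minimal sketch of its computation may help. It assumes that predicted and observed structures are given as equal-length strings over the three states H, E and C; the function name `q3` is chosen for this illustration.

```python
def q3(predicted, observed):
    """Percentage of residues whose three-state prediction matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For instance, q3("HHHECC", "HHHCCC") counts five matches out of six residues.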

The input to this method is a set of frequencies for each residue. A residue is represented by its column in a multiple sequence alignment, and the column is represented by the percentage of each of the 20 possible amino acids in that column. A 21st symbol is reserved for the cases where the local frame extends beyond the C or N terminus of the protein.
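The column representation described here can be sketched as follows. The alignment format (a list of equal-length strings with "-" for gaps), the choice to ignore gaps when computing percentages, and the function names are all assumptions of this sketch rather than details taken from the PHD article.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column):
    """Percentages of the 20 amino acids in one alignment column.

    Gaps and unknown symbols are ignored (an assumption of this sketch).
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in column:
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]

def encode_position(alignment, pos):
    """21 values: 20 percentages plus a flag that is 1 beyond the termini."""
    length = len(alignment[0])
    if pos < 0 or pos >= length:
        return [0.0] * len(AMINO_ACIDS) + [1.0]
    column = [sequence[pos] for sequence in alignment]
    return column_profile(column) + [0.0]
```

With the toy alignment ["ACD", "ACE", "GCD", "A-D"], the first column is encoded as 75% A and 25% G, and any position outside the range 0 to 2 yields only the terminus flag.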

The sequence-to-structure part of this algorithm is a feed-forward neural network that takes a local frame of 13 residues as input. In addition to the frequencies of each residue in the multiple sequence alignment, it uses a conservation weight, which measures the quality of the alignment in a particular column. Each position therefore contributes 20 frequencies, 1 terminus symbol and 1 conservation weight, giving 13 x (20 + 1 + 1) = 13 x 22 = 286 input nodes. There are 3 output nodes, one weight for each of the helix, strand and coil states.

The structure-to-structure part of this algorithm is also a feed-forward neural network, this time with a frame of 17 residues. Its input is the output of the sequence-to-structure step plus, again, the conservation weight for each position. For the cases where the frame extends beyond the termini of the protein, a dummy weight is used. Each position therefore contributes 3 state weights, 1 terminus symbol and 1 conservation weight, giving 17 x (3 + 1 + 1) = 17 x 5 = 85 inputs to this network. The output is again a set of weights, one for each of the three states.
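The input sizes of the two networks can be checked with a small sketch that flattens a frame into one input vector. The ordering of the values within each position (features, then terminus flag, then conservation weight) and the function name `window_input` are assumptions of this illustration; the PHD article may order them differently.

```python
def window_input(per_residue, conservation, center, frame):
    """Flatten a frame of per-residue features into one input vector.

    per_residue:  one feature list per residue (20 frequencies at the first
                  level, 3 state weights at the second level).
    conservation: one conservation weight per residue.
    Each position contributes its features, a terminus flag and its weight.
    """
    half = frame // 2
    width = len(per_residue[0])
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(per_residue):
            vector += list(per_residue[pos]) + [0.0, conservation[pos]]
        else:
            # Beyond a terminus: empty features, flag set, dummy weight.
            vector += [0.0] * width + [1.0, 0.0]
    return vector

n = 30
profile = [[5.0] * 20 for _ in range(n)]          # placeholder frequencies
first_level_out = [[0.2, 0.3, 0.5] for _ in range(n)]  # placeholder outputs
weights = [1.0] * n
print(len(window_input(profile, weights, 0, 13)))           # 286
print(len(window_input(first_level_out, weights, 10, 17)))  # 85
```

The same helper serves both levels because only the frame size and the number of features per residue change.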

A number of networks were trained from random initial weights, and the jury decision is simply the average over the outputs of all these networks. The overall accuracy of this method is 70.2%.
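The jury decision can be sketched as a plain arithmetic average followed by picking the state with the largest averaged weight. The data layout (one list of (helix, strand, coil) triples per network) and the function names are assumptions of this sketch.

```python
def jury_average(network_outputs):
    """Average the (helix, strand, coil) weights over all networks, per residue."""
    n_networks = len(network_outputs)
    length = len(network_outputs[0])
    return [
        tuple(sum(net[i][k] for net in network_outputs) / n_networks for k in range(3))
        for i in range(length)
    ]

def decide(averaged, states="HEC"):
    """Pick the state with the largest averaged weight for every residue."""
    return "".join(
        states[max(range(3), key=lambda k: triple[k])] for triple in averaged
    )

nets = [
    [(0.9, 0.05, 0.05), (0.2, 0.5, 0.3)],   # network 1, two residues
    [(0.5, 0.3, 0.2), (0.1, 0.3, 0.6)],     # network 2, two residues
]
print(decide(jury_average(nets)))  # HC
```

In the second residue the two networks disagree (strand versus coil), and the average resolves the disagreement in favour of coil.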

This method provides the basis for most secondary structure prediction studies. The article describing it also shows clearly how much each part contributes to the accuracy of the method. For a researcher interested in secondary structure prediction, it is therefore an informative and suitable introduction to the literature.

JNet

JNet is a neural-network-based prediction method with one of the most complicated architectures in the literature. It uses the same network structure as the PHD method. It differs in using an expanded set of protein chains (called CB480), a different 8-to-3 state reduction scheme and different methods for generating multiple sequence alignments.

The multiple sequence alignments in this method have been obtained by running PSI-BLAST searches on different databases and by aligning the sequences using different techniques (such as AMPS and CLUSTALW). The secondary structure definition algorithm used in this work is also DSSP.

The sequence-to-structure part of this algorithm is, as in PHD, a neural network. In this case the input frame is 17 residues long. At this step several networks were trained, each using a different representation of the columns of the multiple sequence alignment (i.e. the query residue and its equivalents in the homologous proteins). These representations are:

- Frequencies of residues in the column of the multiple sequence alignment (the same method as in PHD).

- Weighted frequencies, where the weights are the BLOSUM62 scores of each residue with respect to the query residue in the column.

- A position-specific profile with position-specific scores.

Each residue in the frame is represented by one of these sets of values, together with the conservation weight of its alignment column. This level consists of one input, one hidden and one output layer. The hidden layer has nine nodes.

The output of the sequence-to-structure network is fed into a structure-to-structure network. This network also uses the conservation weights. The frame size at this part is 19 residues. This level also consists of one input, one hidden and one output layer. The hidden layer of this network also has nine nodes.

Like the PHD method, this algorithm ends with a jury level, but here the jury decision involves one more neural network (in PHD it was just an arithmetic average). If all the networks, which were trained on different data representations, agree on the final prediction, then the residue is predicted to have that structure. For the positions where there is no consensus (i.e. when the members of the jury do not all agree), a separately trained neural network is used; this network is trained on such "no jury" positions only.
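The consensus logic can be sketched as below. The fallback is passed in as a plain callable standing in for the separately trained "no jury" network; that interface and the function name `combine_predictions` are assumptions of this sketch, not JNet's actual implementation.

```python
def combine_predictions(predictions, fallback):
    """Combine per-network predictions into one final prediction.

    predictions: equal-length strings over H/E/C, one per network.
    fallback:    callable taking a position index and returning a state;
                 a stand-in for the network trained on non-consensus positions.
    """
    result = []
    for position, column in enumerate(zip(*predictions)):
        if len(set(column)) == 1:
            result.append(column[0])           # unanimous jury
        else:
            result.append(fallback(position))  # no consensus
    return "".join(result)

# Toy fallback that always answers coil, for illustration only.
print(combine_predictions(["HHEC", "HHCC", "HEEC"], lambda position: "C"))  # HCCC
```

Only the first and last positions are unanimous in this toy input; the two middle positions are handed to the fallback.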

The JNet article emphasizes that the major increase in prediction accuracy stems from the multiple sequence alignments. Here PSI-BLAST was used to search for alignments, which clearly increases the number of homologous sequences found.

The models generated by this method are not open to interpretation. For several parts of the algorithm (especially the jury network for non-consensus regions), it is not clear why they are there or why they increase the accuracy.

PSIPRED

PSIPRED is a neural-network-based method with three components. It differs from the methods above in that it conducts homology searches on a different database and uses a different set of proteins for training and testing. It also represents the multiple sequence alignments only as PSI-BLAST position-specific scoring profiles.

The network structure is simpler than that of the PHD and JNet methods. The sequence-to-structure part of the method is a back-propagation neural network. Its input is a frame of 15 residues, each represented by its row of the PSI-BLAST scoring matrix. This neural network has 75 hidden nodes and 3 output nodes.

The output of the sequence-to-structure network is fed to the structure-to-structure network in frames of 15 residues. This network has 60 hidden nodes and 3 output nodes for the final prediction.

The performance of this method is not directly comparable to that of PHD or JNet, since it was not developed on the same data set as those methods. Its Q3 accuracy is 76.5%. The method did, however, prove more successful than the others in the third Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment.

GORV

GORV is a secondary structure prediction method based on information theory and Bayesian statistics. Unlike the methods described above, it does not use real-valued encodings of multiple sequence alignments (such as frequencies or position-specific scoring matrices).

GORV uses the CB513 data set, with secondary structure assignments taken from DSSP. Its 8-to-3 state reduction scheme differs significantly from those of the previously mentioned methods: it does not take into account the 3-10 helices, which are not that rare (3%). The published results are therefore slightly overstated. We have checked that this reduction scheme may add at least 2.44% to the reported prediction accuracy.
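How much the choice of reduction scheme changes the target labels can be seen in a small sketch. The two mappings below are illustrative assumptions: `BROAD` also counts 3-10 helices (G) and isolated bridges (B), while `NARROW` keeps only H and E. Neither is claimed to be the exact scheme of any published method.

```python
# Illustrative reduction tables; every DSSP state not listed becomes coil.
BROAD = {"H": "H", "G": "H", "E": "E", "B": "E"}
NARROW = {"H": "H", "E": "E"}

def reduce_states(dssp, mapping):
    """Reduce an 8-state DSSP string to the three states H, E and C."""
    return "".join(mapping.get(state, "C") for state in dssp)

dssp = "HHHHGGGTTEEEEB-S"
broad = reduce_states(dssp, BROAD)
narrow = reduce_states(dssp, NARROW)
print(broad)   # HHHHHHHCCEEEEECC
print(narrow)  # HHHHCCCCCEEEECCC
print(sum(a != b for a, b in zip(broad, narrow)))  # 4 positions differ
```

Under the narrow scheme the 3-10 helix and the isolated bridge simply become coil, so a predictor is never penalized for missing them.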

The GORV method uses the three major parts of secondary structure prediction (mentioned in section 2.3). The sequence-to-structure component is based on information theory, specifically on the information function. Each residue is represented by a frame of 7 to 13 residues (depending on the sequence length). A Bayesian approximation is used to evaluate the information function (details are in the GORV article). The probability that a pattern (a local frame) belongs to a secondary structure state is approximated from the statistics of single residues, pairs of residues and triplets of residues in the frame. In other words, the frame is represented by the residues at specific positions and by the residue pairs and triplets at specific positions. The probability of each secondary structure state (helix, strand or coil) is calculated in this way and normalized to the interval [0, 1]. The most probable state is then selected as the prediction. Some thresholds for assigning a secondary structure are also applied at this step, since the algorithm was biased towards the coil state (i.e. a considerable number of helix and strand residues were predicted to be coil).
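The information-function idea can be sketched in a heavily simplified form that uses single-residue statistics only; pairs and triplets, the variable frame size and the thresholds are all omitted, so this is not the GORV algorithm itself. The score below is the log-odds form log[f(S, R) / f(not S, R)] + log[f(not S) / f(S)] summed over the frame, and the pseudocount, frame half-width and function names are assumptions of this sketch.

```python
import math
from collections import defaultdict

STATES = "HEC"

def train(examples, half=3):
    """Count residues seen at each frame offset around each central state."""
    single = defaultdict(float)       # (state, offset, residue) -> count
    state_total = defaultdict(float)  # state -> count
    for sequence, structure in examples:
        for j, state in enumerate(structure):
            state_total[state] += 1
            for m in range(-half, half + 1):
                if 0 <= j + m < len(sequence):
                    single[(state, m, sequence[j + m])] += 1
    return single, state_total

def information(state, sequence, j, model, half=3, pseudo=1.0):
    """Summed single-residue information in favour of `state` at position j."""
    single, state_total = model
    n_state = state_total[state]
    n_other = sum(state_total[s] for s in STATES if s != state)
    total = 0.0
    for m in range(-half, half + 1):
        if 0 <= j + m < len(sequence):
            residue = sequence[j + m]
            f_state = single[(state, m, residue)] + pseudo
            f_other = sum(single[(s, m, residue)] for s in STATES if s != state) + pseudo
            total += math.log(f_state / f_other)
            total += math.log((n_other + pseudo) / (n_state + pseudo))
    return total

def predict(sequence, model, half=3):
    """Assign each residue the state with the highest information score."""
    return "".join(
        max(STATES, key=lambda s: information(s, sequence, j, model, half))
        for j in range(len(sequence))
    )
```

Trained on a single toy example such as ("AAAAAAVVVVVVGGGGGG", "HHHHHHEEEEEECCCCCC"), the sketch assigns helix to a run of A and strand to a run of V, because those residues were seen only around those states.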

Multiple sequence alignments are introduced to the predictions at the sequence-to-structure step. Basically each residue in the protein chains in the alignment of a query protein is assigned a probability for each of the states. Then these probabilities are averaged residue by residue and the most probable structure is assigned as the prediction. The thresholds are applied at this step when multiple sequence alignments are incorporated.
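The averaging step, followed by a threshold that counteracts the coil bias, can be sketched as follows. The exact form of the threshold is not given in this section, so the rule used here (coil is chosen only when it beats the best non-coil state by a margin) and the margin value are purely hypothetical, as is the function name.

```python
def consensus_state(column_probs, coil_margin=0.15):
    """Average per-sequence probabilities for one residue and pick a state.

    column_probs: (p_helix, p_strand, p_coil) triples, one per aligned
                  sequence that has a residue at this position.
    coil_margin:  hypothetical threshold; coil must exceed the best
                  non-coil state by this much to be assigned.
    """
    n = len(column_probs)
    p_helix, p_strand, p_coil = (
        sum(p[k] for p in column_probs) / n for k in range(3)
    )
    if p_helix >= p_strand:
        best_state, best_p = "H", p_helix
    else:
        best_state, best_p = "E", p_strand
    if p_coil > best_p + coil_margin:
        return "C"
    return best_state

probs = [(0.3, 0.1, 0.6), (0.5, 0.1, 0.4)]
print(consensus_state(probs))                   # H
print(consensus_state(probs, coil_margin=0.0))  # C
```

With the averaged probabilities (0.4, 0.1, 0.5), coil wins without a threshold, but the margin moves the decision to helix, which is the kind of correction the thresholds are meant to provide.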

The structure-to-structure part of this algorithm is not a learner but simply a filter. At this step, only the unlikely predictions are eliminated: very short helices and one-residue-long strands are reassigned to coil (since the reduction scheme does not take into account the isolated beta-bridges).
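Such a filter can be sketched with a run-length pass over the predicted string. The minimum lengths are assumptions: the text says only "very short" for helices, so `min_helix=3` is a placeholder, while `min_strand=2` reflects the removal of one-residue strands.

```python
from itertools import groupby

def filter_short_segments(prediction, min_helix=3, min_strand=2):
    """Replace helix and strand runs shorter than the minimum with coil."""
    filtered = []
    for state, run in groupby(prediction):
        length = len(list(run))
        too_short = (state == "H" and length < min_helix) or (
            state == "E" and length < min_strand
        )
        filtered.append(("C" if too_short else state) * length)
    return "".join(filtered)

print(filter_short_segments("CCHHCCEECEHHHHC"))  # CCCCCCEECCHHHHC
```

The two-residue helix and the isolated strand residue become coil, while the longer segments are left untouched.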

The sequence-to-structure part of this method has a single-sequence Q3 accuracy of 66.9%. When multiple sequence alignments are incorporated into the algorithm, the accuracy rises to 73.4%. The individual contribution of the filtering part is not stated.

This algorithm is fairly simple compared with the other methods, but the model it generates still lacks the interpretability of decision lists. A model in this case is a set of frequencies of structures in the training set, and inferring a biological rule from such a set of frequencies is not easy.