Determining the secondary structure of a protein from its tertiary structure is not an exact science. Since there is no unique definition of secondary structure, different definition methods assign different secondary structures to the same protein. In fact, it has been shown that the three most popular methods, DSSP, STRIDE and DEFINE, agree only 63% of the time.
Dictionary of Protein Secondary Structure (DSSP) is a method for determining
the secondary structure of a protein given its tertiary structure. It works
from the coordinates of the atoms in a protein chain (some other methods
consider the backbone dihedral angles phi and psi instead) and looks for
hydrogen bonds between residues. Each residue is first checked for membership
in a turn or a bridge by examining the hydrogen bonds it forms with other
residues. A repetition of turns results in a helix structure. A repetition of
bridges results in a ladder structure. Beta-sheets are two or more ladders
sharing one or more residues. The algorithm outputs 8 states: alpha-helix,
3/10 helix, pi-helix, isolated beta-bridge, extended strand, turn, bend and
coil.
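The step from turns to helices can be illustrated with a minimal sketch. It assumes the hydrogen bonds are already known and given as (i, j) pairs, meaning that the C=O of residue i is bonded to the N-H of residue j. It follows the rule in the DSSP article that an n-turn at residue i is a bond (i, i+n), and that two consecutive n-turns at i-1 and i make residues i to i+n-1 helical. The function name and the input format are illustrative only and are not taken from the DSSP program.

```python
def helix_residues(hbonds, n=4):
    """Return the residue indices that fall inside an n-helix.

    hbonds: set of (i, j) pairs, CO of residue i bonded to NH of residue j.
    n: 3 for a 3/10 helix, 4 for an alpha-helix, 5 for a pi-helix.
    """
    # An n-turn starts at residue i when CO(i) is bonded to NH(i + n).
    turns = {i for (i, j) in hbonds if j - i == n}
    helix = set()
    for i in turns:
        # Two consecutive turns (at i - 1 and at i) define a minimal helix
        # that covers residues i .. i + n - 1.
        if i - 1 in turns:
            helix.update(range(i, i + n))
    return helix


bonds = {(1, 5), (2, 6), (3, 7), (10, 14)}
print(sorted(helix_residues(bonds)))  # [2, 3, 4, 5, 6]
```

The isolated bond (10, 14) forms a turn but no helix, since it has no neighbouring turn.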
There are other methods used in secondary structure definition. Some of these
methods rely on the backbone dihedral angles phi and psi. A repetition of
specific combinations of these angles results in a secondary structure motif
(helix, sheet etc.). The downside of this approach, however, is having to
decide which combinations result in a helix and which result in a sheet. Each
structure would require at least 4 parameters to specify the allowed regions
(2 boundaries for phi and 2 boundaries for psi). In DSSP, only an energy
cut-off is required to decide whether two residues form a hydrogen bond. The
authors suggest that this is an advantage over the other methods.
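The single cut-off can be made concrete with a short sketch of the electrostatic hydrogen bond energy. The constants (partial charges of 0.42e and 0.20e, the factor 332 and the cut-off of -0.5 kcal/mol) are the ones given in the DSSP article; the function names and the example distances are illustrative assumptions.

```python
Q1, Q2 = 0.42, 0.20   # partial charges on C=O and N-H, in electron charges
F = 332.0             # dimensional factor; distances in angstrom, energy in kcal/mol
CUTOFF = -0.5         # kcal/mol; the only threshold needed


def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Electrostatic energy between a C=O group and an N-H group."""
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(r_on, r_ch, r_oh, r_cn):
    """A hydrogen bond is assigned when the energy is below the cut-off."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < CUTOFF


# Example distances (angstrom) for a close, roughly linear N-H...O=C pair
# and for a distant pair; both sets are made up for illustration.
print(is_hbond(2.9, 3.1, 1.9, 4.1))     # True
print(is_hbond(10.0, 10.2, 9.0, 11.2))  # False
```

A dihedral-angle definition would need four boundaries per structure type in place of the single `CUTOFF` value.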
Secondary structure prediction methods typically rely on three major components:
Almost every method in the literature predicts secondary structure from sequence information. The methods described here take as input a frame of residues surrounding the residue whose secondary structure is to be predicted. For example, let “RICPRIWMECTRDS” be the sequence of a protein chain. When predicting the secondary structure of the residue W with a frame size of 9, the input to the algorithm would be “CPRIWMECT”.
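The frame extraction in this example can be sketched as follows. The padding symbol "-" for positions beyond the chain termini is an assumption made for illustration, and the frame size is assumed to be odd so that the frame is centred on the residue.

```python
def frame(sequence, center, size, pad="-"):
    """Return the frame of `size` residues centred on index `center`."""
    half = size // 2
    # Pad both ends so that frames near the termini keep the same length.
    padded = pad * half + sequence + pad * half
    return padded[center:center + size]


chain = "RICPRIWMECTRDS"
print(frame(chain, chain.index("W"), 9))  # CPRIWMECT
print(frame(chain, 0, 9))                 # ----RICPR
```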
Frame-based predictions assign a structure to every residue one by one, so they do not take into account the structural states of neighbouring residues. Most methods therefore add a second prediction step or a simple filter to smooth the initial predictions. For example, a helix that is a single residue long is very unlikely. Indeed, DSSP itself only assigns a helix where at least two consecutive turns occur.
A major improvement in accuracy can be gained by introducing homology information into secondary structure predictions. The structures of the homologues of a protein need not be known in advance; only their sequences are required. The underlying assumption is that structure is more conserved than sequence.
PHD is the first method to break the 65% boundary on Q3 accuracy (three-state
accuracy based on helix/strand/coil states) of secondary structure prediction
methods. The two-level neural network structure of this work was later adopted
by several other methods, such as JNet and PSIPRED. (The steps are briefly
described above.) This method uses a data set called RS126, which contains no
significant pairwise homologues. If you predict the structure of a protein
using a model trained on a homologue of known structure, you will obtain
better results than with a model whose training set does not include any
homologues of the protein. Including homologues of known structure is
acceptable during real predictions, but when assessing the accuracy of a
method it may give biased results.
The input to this method is a set of frequencies for each residue. A residue
is represented by its column in a multiple sequence alignment, and the column
is represented by the percentages of each of the 20 possible amino acids in
that column. A 21st amino acid symbol is reserved for the cases where the
local frame extends beyond the C or N terminus of the protein.
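A minimal sketch of this column representation is given below. The toy alignment, the amino acid ordering and the decision to ignore gap characters when computing percentages are assumptions made for illustration; the PHD article should be consulted for the exact treatment.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_profile(alignment, col):
    """Percentages of the 20 amino acids in one alignment column."""
    # Gaps and unknown symbols are skipped (an assumption of this sketch).
    residues = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [100.0 * counts[aa] / len(residues) for aa in AMINO_ACIDS]


# Toy alignment: the query sequence first, then three homologues.
alignment = ["RICP", "RLCP", "KICP", "R-CA"]
profile = column_profile(alignment, 0)
print(profile[AMINO_ACIDS.index("R")])  # 75.0
print(profile[AMINO_ACIDS.index("K")])  # 25.0
```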
The sequence-to-structure part of this algorithm is a feed-forward neural
network that takes a local frame of 13 residues as input. In addition to the
frequencies of each residue in a multiple sequence alignment, a conservation
weight, which measures the quality of the alignment in a particular column, is
also used. So there are 13 x (20+1+1) = 13 x 22 = 286 input nodes in this
part. There are 3 output nodes, one weight for each of the helix, strand and
coil states.
The structure-to-structure part of this algorithm is also a feed-forward
neural network. This time a frame of 17 residues is used. The input is the
output of the sequence-to-structure step, plus a conservation weight for that
position. Again, a dummy input is used for the cases where the frame extends
beyond the termini of the protein. So there are 17 x (3+1+1) = 17 x 5 = 85
inputs to this network. The output is again a set of weights, one for each of
the three states.
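How such an input vector could be assembled is sketched below. The ordering of the five values per position (three state weights, conservation weight, terminus flag) and the use of 1.0 as the dummy value are assumptions; only the counts (17 positions, 5 values each) come from the description above.

```python
def second_level_input(outputs, conservation, center, frame=17):
    """Build the structure-to-structure input for the residue at `center`.

    outputs: one (helix, strand, coil) triple per residue from the first network.
    conservation: one conservation weight per residue.
    """
    half = frame // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(outputs):
            helix, strand, coil = outputs[pos]
            vec.extend([helix, strand, coil, conservation[pos], 0.0])
        else:
            # Position beyond a terminus: only the dummy input is switched on.
            vec.extend([0.0, 0.0, 0.0, 0.0, 1.0])
    return vec


outputs = [(0.1, 0.2, 0.7)] * 30   # made-up first-level outputs
conservation = [1.0] * 30          # made-up conservation weights
vec = second_level_input(outputs, conservation, center=0)
print(len(vec))  # 85
```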
A number of networks were trained with random initial weights, and the jury
decision is simply an average over the outputs of all the networks. The
overall accuracy of this method is 70.2%.
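The jury step amounts to very little code, as the sketch below suggests. The state labels and the example outputs are made up for illustration.

```python
STATES = ("H", "E", "C")  # helix, strand, coil


def jury_average(network_outputs):
    """Average the three state weights over all networks and pick the largest."""
    n = len(network_outputs)
    mean = [sum(out[k] for out in network_outputs) / n for k in range(3)]
    best = max(range(3), key=lambda k: mean[k])
    return STATES[best], mean


# Outputs of three independently trained networks for one residue.
state, mean = jury_average([(0.6, 0.1, 0.3), (0.2, 0.2, 0.6), (0.7, 0.1, 0.2)])
print(state)  # H
```

Here the second network alone would have predicted coil, but the average favours helix.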
This method provides the basis for most secondary structure prediction
studies. The article on this work also clearly states the contribution of each
part to the accuracy of the method. For a researcher interested in secondary
structure prediction, it is a very informative article and hence a suitable
one for getting familiar with the literature.
JNet is a neural network based prediction method with one of the most
complicated architectures in the literature. It uses the same network
structure as the PHD method. It differs in that it utilizes an expanded set of
protein chains (called CB480), a different 8-to-3 state reduction scheme and
different methods for generating multiple sequence alignments.
The multiple sequence alignments in this
method have been obtained by running PSI-BLAST searches on different databases
and by aligning the sequences using different techniques (such as AMPS and
CLUSTALW). The secondary structure definition algorithm used in this work is
also DSSP.
The sequence-to-structure part of this algorithm is, as in PHD, a neural
network. In this case the input frame is 17 residues long. At this step
several networks were trained, each utilizing a different representation of
the columns of the multiple sequence alignments (i.e. the query residue and
its equivalents in the homologous proteins). These representations are:
- Frequencies of residues in the column of the multiple sequence alignment
(the same representation as in PHD).
- Weighted frequencies, where the weights are the BLOSUM6 scores of each
residue with respect to the query residue in the column.
- A position specific profile with position specific scores.
Residues in the frame are represented with one of these representations, and
the conservation weight of each residue in the alignment is added as well.
This level consists of one input, one hidden and one output layer. The hidden
layer has nine nodes.
The output of the sequence-to-structure
network is fed into a structure-to-structure network. This network also uses
the conservation weights. The frame size at this part is 19 residues. This
level also consists of one input, one hidden and one output layer. The hidden
layer of this network also has nine nodes.
Like the PHD method, this algorithm adds a jury level on top of the two
networks, but here the jury involves a neural network (in PHD, it was just an
arithmetic average). If all the networks, which were trained on different data
representations, agree on the final prediction, then the residue is predicted
to be of that structure. For the positions where there is no consensus on the
final prediction (i.e. when the members of the jury do not all agree), a
separately trained neural network is utilized (a network trained only on these
non-consensus, “no jury”, positions).
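The control flow of this jury can be sketched as follows. The separately trained network is not reproduced here; `majority_fallback` is a hypothetical stand-in that simply takes a majority vote, so the sketch shows only where such a network would be called.

```python
def jury_decision(votes, fallback):
    """Return the consensus state, or defer to `fallback` when there is none.

    votes: the state predicted for one residue by each network.
    fallback: callable used for non-consensus positions.
    """
    if len(set(votes)) == 1:
        return votes[0]          # all networks agree
    return fallback(votes)       # non-consensus position


def majority_fallback(votes):
    """Hypothetical stand-in for the separately trained network."""
    # Ties are broken alphabetically so that the result is deterministic.
    return max(sorted(set(votes)), key=votes.count)


print(jury_decision(["H", "H", "H"], majority_fallback))  # H
print(jury_decision(["H", "E", "H"], majority_fallback))  # H
```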
This article emphasizes that the major increase in prediction accuracy stems from the multiple sequence alignments. Here PSI-BLAST was used to search for homologues, which clearly increases the number of homologous sequences found.
The models generated by this method are not open to interpretation. For some parts of the algorithm (especially the jury network for non-consensus regions) one cannot tell why they are there or why they increase the accuracy.
PSIPRED is a neural network based method which has three components. It
differs from the previous methods in that it conducts homology searches on a
different database and uses a different set of proteins for training and
testing. It also represents the multiple sequence alignments only as PSI-BLAST
position specific scoring profiles.
The network structure is simplified with respect to the PHD and JNet methods.
The sequence-to-structure part of the method is a neural network trained with
back-propagation. The input to this part is a frame of 15 residues. The
residues are represented by the PSI-BLAST scoring matrices. This neural
network has 75 hidden nodes and 3 output nodes.
The output of the sequence-to-structure
network is fed to the structure-to-structure network in frames of 15 residues.
This network has 60 hidden nodes and 3 output nodes for the final prediction.
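The shape of the two networks can be sketched with a plain forward pass. Only the frame size (15), the hidden layer sizes (75 and 60) and the 3 outputs come from the description above. The number of inputs per residue (21 for the first network, 3 for the second), the logistic activation and the random, untrained weights are assumptions made so that the sketch runs.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer with a logistic activation."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    """Untrained weights; a real model would learn these by back-propagation."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


FRAME = 15
PER_RESIDUE = 21  # ASSUMPTION: 20 profile scores plus one terminus flag

rng = random.Random(0)
seq_hidden = random_layer(FRAME * PER_RESIDUE, 75, rng)
seq_out = random_layer(75, 3, rng)
str_hidden = random_layer(FRAME * 3, 60, rng)  # ASSUMPTION: 3 values per residue
str_out = random_layer(60, 3, rng)


def seq_to_struct(x):
    return dense(dense(x, *seq_hidden), *seq_out)


def struct_to_struct(x):
    return dense(dense(x, *str_hidden), *str_out)


first = seq_to_struct([0.5] * (FRAME * PER_RESIDUE))  # one residue's frame
final = struct_to_struct(first * FRAME)               # 15 such outputs in a frame
print(len(first), len(final))  # 3 3
```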
The performance of this method is not directly comparable to that of PHD or
JNet, since it was not developed on the same data set as those methods. Its Q3
accuracy is 76.5%. This method did, however, prove more successful than the
others in the third Critical Assessment of Techniques for Protein Structure
Prediction (CASP) experiment.
GORV is a secondary structure prediction method based on information theory
and Bayesian statistics. Unlike the methods mentioned previously, this method
does not use real-valued encodings (such as frequencies or position specific
scoring matrices) of multiple sequence alignments.
GORV uses the CB513 data set. Secondary structure assignments were taken from
DSSP. The 8-to-3 state reduction scheme used is significantly different from
those of the previously mentioned methods. This scheme does not take into
account the 3/10 helices, which are not so rare (3%). Thus the published
results are slightly overstated. Our own check suggests that this reduction
scheme may add at least 2.44% to the reported prediction accuracy.
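The effect of the reduction scheme on a reported Q3 value can be illustrated with a small sketch. The DSSP one-letter codes are standard, but both mappings are assumptions: `STRICT` is one reading of the scheme described here (3/10 helices and isolated bridges end up as coil), and `BROAD` is a hypothetical alternative that counts them as helix and strand. The DSSP string and the prediction are made up.

```python
# DSSP codes: H alpha-helix, G 3/10 helix, I pi-helix, B isolated beta-bridge,
# E extended strand, T turn, S bend, "-" coil.
BROAD = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
STRICT = {"H": "H", "E": "E"}


def reduce_states(dssp, scheme):
    """Map an 8-state DSSP string to 3 states; unmapped codes become coil."""
    return "".join(scheme.get(code, "C") for code in dssp)


def q3(predicted, reference):
    """Fraction of residues whose predicted state matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


dssp = "HHHGGG-EEB"
predicted = "HHHCCCCEEC"
print(q3(predicted, reduce_states(dssp, STRICT)))  # 1.0
print(q3(predicted, reduce_states(dssp, BROAD)))   # 0.6
```

The same prediction scores differently depending only on how the reference states were reduced, which is why accuracies obtained under different schemes are hard to compare.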
The GORV method utilizes the three major parts of secondary structure
prediction (mentioned in section 2.3). The sequence-to-structure component
depends on information theory, and specifically on the information function,
on which the predictions are based. Each residue is represented by a frame of
7 to 13 residues (depending on the sequence length). A Bayesian approximation
is used for this function (details in the GORV article). The probability that
a pattern (a local frame) belongs to a secondary structure state is
approximated by the statistics of single residues, pairs of residues and
triplets of residues in the frame. This means the frame is represented by the
residues, residue pairs and residue triplets found at specific positions. The
probabilities of each secondary structure state (helix, strand or coil) are
calculated using this method and each probability is normalized to the [0, 1]
interval. Then the most probable state is selected as the secondary structure
prediction. Some thresholds for assigning a secondary structure are also
applied at this step, since the algorithm had a bias towards the coil
structure (i.e. a considerable number of helix and strand states were
predicted to be coil).
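For the simplest case, a single residue and a single state, the information function can be sketched as below. This is the generic definition I(S; R) = log(P(S | R) / P(S)) estimated from counts, not the pair and triplet approximation used in GORV, and the counts in the example are made up.

```python
import math


def information(count_state_and_residue, count_residue, count_state, total):
    """I(S; R) = log(P(S | R) / P(S)), estimated from raw counts."""
    p_state_given_residue = count_state_and_residue / count_residue
    p_state = count_state / total
    return math.log(p_state_given_residue / p_state)


# Hypothetical training set: 10000 residues, 3000 of them helical;
# 1000 alanines, 500 of them helical.
info = information(500, 1000, 3000, 10000)
print(info > 0)  # True: alanine carries positive information for helix
```

A positive value means that observing the residue makes the state more likely than its background frequency, a negative value less likely, and zero means the residue says nothing about the state.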
Multiple sequence alignments are introduced into the predictions at the
sequence-to-structure step. Basically, each residue of each protein chain in
the alignment of a query protein is assigned a probability for each of the
states. These probabilities are then averaged residue by residue and the most
probable structure is assigned as the prediction. When multiple sequence
alignments are incorporated, the thresholds are applied at this step.
The structure-to-structure part of this algorithm is not a learner but simply
a filter. At this step, only the unlikely estimates are eliminated. Very short
helices and one-residue-long strands are reassigned to loop (since the
reduction scheme does not take into account the isolated beta-bridges).
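A filter of this kind can be sketched in a few lines. The text does not state how short a "very short" helix is, so the `min_helix` threshold of 3 is a hypothetical value, and the prediction string is made up.

```python
import re


def filter_prediction(pred, min_helix=3):
    """Replace short helices and single-residue strands with coil."""
    # ASSUMPTION: helices shorter than `min_helix` residues count as very short.
    pred = re.sub(
        r"H+",
        lambda m: m.group() if len(m.group()) >= min_helix else "C" * len(m.group()),
        pred,
    )
    # Strands that are exactly one residue long become coil.
    pred = re.sub(r"E+", lambda m: m.group() if len(m.group()) >= 2 else "C", pred)
    return pred


print(filter_prediction("CCHHCCEHHHHCEEC"))  # CCCCCCCHHHHCEEC
```

The two-residue helix and the lone strand residue disappear, while the four-residue helix and the two-residue strand are kept.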
The sequence-to-structure part of this method has 66.9% single-sequence Q3
accuracy. When multiple sequence alignments are incorporated into the
algorithm, the accuracy rises to 73.4%. The individual contribution of the
filtering part is not stated.
This algorithm is fairly simple compared to the other methods. But the model
generated still lacks the interpretability of decision lists. A model in this
case is a set of frequencies of structures in the training set. Inferring a
biological rule from this set of frequencies is not an easy process.