
Protein Secondary Structure Prediction using SVM

Yun Sook Lee, Usha K. Muppirala, Jesse R. Walsh

Abstract:

We have implemented a method for predicting the secondary structure of proteins with a support vector machine, using the SVMlight software for learning and classification. We constructed a binary classifier for each secondary structure (helix, coil, and sheet) using various attributes such as hydrophobicity, molecular weight, and isoelectric point. The RS126 and CB513 datasets were used as the training dataset and the test dataset, respectively. We achieved an accuracy of 63% in predicting coil using only hydrophobicity values.

Introduction:

Understanding sequence-structure-function relationships of proteins is one of the major research interests in bioinformatics. The primary structure of a protein is its sequence itself. The sequence folds into intermediate 3D structures, known as secondary structures, such as coil, helix, and sheet. Interactions of secondary structures lead to a complex tertiary structure that directly affects the functionality of the protein. Identifying the secondary structures is therefore an important step toward a better knowledge of the protein. It is widely believed that the structural information of a protein is present in the sequence itself. There are millions of protein sequences, but comparatively few known structures. Experimentally determining structures using methods such as NMR and X-ray crystallography is expensive and time consuming, so it would be highly desirable to predict these structures using computational methods. Chou PY and Fasman GD were the first to develop an algorithm for predicting protein conformation [1]. Since then, a great deal of research has been conducted in this field. Several machine learning approaches, such as neural networks, have been successfully applied to this problem [2], [3]. Hua and Sun extended the concepts of Support Vector Machines to predict protein structures [4].

A Support Vector Machine (SVM) is a supervised machine learning technique first proposed by Vladimir Vapnik in 1995 [5]. An SVM is used to solve classification and regression problems. It is often impossible to separate data points with a single hyperplane in the given input space. The idea behind the SVM is to map the input data points into a high-dimensional space using a kernel function so that the data can be separated by a hyperplane there. Once the data has been mapped to a feature space, two hyperplanes are constructed along the boundary of each class of data. The SVM finds an optimal separating hyperplane (OSH) that maximizes the distance between these two hyperplanes, thereby optimizing the performance of the learning machine. The training points that are crucial for the position of the hyperplane are called support vectors.
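For reference, finding the OSH can be written as the standard hard-margin optimization problem from [5]; this is the textbook formulation rather than anything specific to our implementation:

\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad\text{subject to}\quad
y_i\,\bigl(w\cdot\phi(x_i)+b\bigr)\ \ge\ 1,\qquad i=1,\dots,n
\]

where \(\phi\) is the kernel-induced map into the feature space H and \(y_i \in \{+1,-1\}\) are the class labels. The margin between the two boundary hyperplanes is \(2/\lVert w\rVert\), so minimizing \(\lVert w\rVert\) maximizes it, and the support vectors are exactly the training points for which the constraint holds with equality.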

In this figure, the data points in the input space R^d are impossible to separate with a single hyperplane. By mapping these data points onto a feature space H, a separating hyperplane can be identified. The dashed lines represent the two hyperplanes that lie on the boundaries of the two data sets. The support vectors are marked by an extra circle around the data point [4].

In our project, the data points are the positive and negative examples of a particular secondary structure. Given an unknown data point, we can classify whether or not it belongs to a particular secondary structure class.

Materials and Methods:

We used the RS126 dataset for training the SVM. RS126 was proposed by Rost and Sander [6]. It is a dataset of 126 non-homologous proteins in which no two proteins share more than 25% sequence similarity, and it is one of the most widely used datasets for secondary structure prediction. The CB513 dataset, used for testing accuracy and proposed by Cuff and Barton, is a much larger set consisting of 513 proteins [7]. It is also a non-homologous set of proteins, includes 16 different chains, and contains the RS126 proteins.

There are many secondary structure assignments available for the RS126 and CB513 sets; we used the ones provided by DSSP. DSSP is a database of secondary structure assignments for all proteins present in the Protein Data Bank. The assignments are based on hydrogen bonds and other geometrical features derived from the coordinates of X-ray crystallographic structures [8].

SVMlight, software developed by Thorsten Joachims, is an implementation of Support Vector Machines in C; it consists of a learning module and a classification module [9], [10]. We used these two modules in our project to implement the binary classifiers.

Using RS126 as the training data set resulted in 21,000 examples, of which 9,000 were positive for coil, 5,000 were positive for sheet, and 7,000 were positive for helix. The positive examples for any two structures are treated as negative examples for the remaining structure. The set including all 21,000 examples was called the full training set.

Since the examples are not proportionate across the three structures, we also randomly sampled 4000 positive examples and 400 negative examples for each secondary structure. This set was called the reduced training set.

The first step in building the model was to extract and format the input data according to the SVMlight specification. The input file contains the training examples. Each line represents one training example and has the following format:

<line>    .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target>  .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value>   .=. <float>
<info>    .=. <string>

Example:

+1 1:-1.6 2:4.2 3:-0.4 4:4.5 5:3.8 6:-1.3 7:1.8 8:4.2 9:2.8 10:-3.9 11:1.8 #

Here +1 marks a positive example. In 1:-1.6, 1 is the feature (attribute) number and -1.6 is the value corresponding to that feature.

We used a sliding window of length 11 over the protein sequence to generate the examples. Each example is labeled with the secondary structure of the middle residue of its window; for example, the window APAFSVSPASG yields a positive example for the secondary structure of its middle residue.
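To make the windowing concrete, here is a minimal Python sketch of the example generation, assuming the hydrophobicity values from the table below (the Kyte-Doolittle scale); the function name and the toy DSSP string are our own illustration, not part of SVMlight.

# Kyte-Doolittle hydrophobicity values (see the attribute table below).
HYDROPHOBICITY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

WINDOW = 11  # sliding-window length used in this project

def windows_to_svmlight(sequence, dssp, target):
    """Yield one SVMlight-formatted line per window; the label is +1
    when the middle residue's DSSP structure equals `target`, else -1."""
    half = WINDOW // 2
    for i in range(half, len(sequence) - half):
        label = '+1' if dssp[i] == target else '-1'
        window = sequence[i - half:i + half + 1]
        feats = ' '.join(f'{j + 1}:{HYDROPHOBICITY[aa]}'
                         for j, aa in enumerate(window))
        yield f'{label} {feats}'

# Toy example: one Coil/~Coil training line ('-' marks Coil in DSSP)
for line in windows_to_svmlight('APAFSVSPASG', 'HHHHH-HHHHH', target='-'):
    print(line)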

The SVMlight software provides an executable, svm_learn.exe, that takes an input file of examples and outputs a model file that can be used as a classifier in later steps. The executable treats the positive and negative data as two classes and maps them to a feature space using a Gaussian kernel. The output file contains information about the kernel parameters used and the support vectors identified to construct the hyperplane.
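A learning run might then look like the following sketch; the "-t 2" flag selects SVMlight's radial basis (Gaussian) kernel, and the file names are placeholders of our own:

import subprocess

# Train the Coil/~Coil model with the radial basis (Gaussian) kernel
# ("-t 2" in SVMlight). File names are hypothetical.
subprocess.run(['svm_learn', '-t', '2', 'coil_train.dat', 'coil_model'],
               check=True)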

Three binary classifiers, Coil/~Coil, Helix/~Helix, and Sheet/~Sheet, were constructed for each attribute. The attributes chosen were the hydrophobicity value, molecular weight, and isoelectric point of each residue; the table below lists these values for each amino acid.

S.No  Amino acid          Hydrophobicity value  Molecular weight (Da)  Isoelectric point
1     Alanine (A)          1.8                   71.04                  6.00
2     Arginine (R)        -4.5                  156.10                 11.15
3     Asparagine (N)      -3.5                  114.04                  5.41
4     Aspartic acid (D)   -3.5                  115.03                  2.77
5     Cysteine (C)         2.5                  103.01                  5.02
6     Glutamine (Q)       -3.5                  128.06                  5.65
7     Glutamic acid (E)   -3.5                  129.04                  3.22
8     Glycine (G)         -0.4                   57.02                  5.97
9     Histidine (H)       -3.2                  137.06                  7.47
10    Isoleucine (I)       4.5                  113.08                  5.94
11    Leucine (L)          3.8                  113.08                  5.98
12    Lysine (K)          -3.9                  128.09                  9.59
13    Methionine (M)       1.9                  131.04                  5.74
14    Phenylalanine (F)    2.8                  147.07                  5.48
15    Proline (P)         -1.6                   97.05                  6.30
16    Serine (S)          -0.8                   87.03                  5.68
17    Threonine (T)       -0.7                  101.05                  5.64
18    Tryptophan (W)      -0.9                  186.08                  5.89
19    Tyrosine (Y)        -1.3                  163.06                  5.66
20    Valine (V)           4.2                   99.07                  5.96

Due to time limitations, we could not test the models on the entire CB513 dataset. Instead, we randomly picked 27 sequences from the set, but later found that 16 of them were already present in RS126 and had been used in constructing the classifiers. The remaining 11 sequences alone were therefore used as the test data set.

For each protein in the test data set, the sequence data was extracted and formatted in the same way as the training data. SVMlight provides a classification executable, svm_classify.exe, which takes the test data and the classifier generated by svm_learn and outputs a file classifying the test data.
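A classification run, using the same placeholder file names as in the learning sketch above, might look like:

import subprocess

# svm_classify writes one real-valued prediction per test example
# to the output file. File names are hypothetical.
subprocess.run(['svm_classify', 'coil_test.dat', 'coil_model',
                'coil_predictions.txt'], check=True)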

Results:

The classifier produced an output for every residue in the sequence, nominally in the range -1 to +1; the range was not strict, and some values fell outside it on either side. When the predicted value is greater than or equal to zero, the residue has the corresponding secondary structure; when it is less than zero, the residue does not.

It was not possible to predict the structure of the first 5 and last 5 residues of a given sequence, since the sliding window cannot be centered on them.
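A minimal sketch of this decision rule, assuming the prediction values have already been read from the svm_classify output file (the helper name is our own; the label characters match the example that follows):

HALF = 11 // 2  # the sliding window loses 5 residues at each end

def predictions_to_string(values, positive='-', negative='N'):
    # >= 0: the residue has the structure; < 0: it does not ('N');
    # residues that no window can be centered on are marked 'X'.
    body = ''.join(positive if v >= 0 else negative for v in values)
    return 'X' * HALF + body + 'X' * HALF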

Here is an example of a test sequence and the prediction from the Coil/~Coil classifier.

>1coi-1-AS

EVEALEKKVAALESKVQALEKKVEALEHG

-HHHHHHHHHHHHHHHHHHHHHHHHH---

XXXXX--NNNNNN--NNNN---NNXXXXX

The first line contains the sequence name, and the sequence is given on the second line. The third line contains the DSSP assignment and the last line is the predicted output from the Coil/~Coil classifier. In DSSP assignments, '-' represents a Coil, 'H' a Helix, and 'E' a Sheet. In the predicted output, 'X' means no structure was predicted for that residue and 'N' represents ~Coil (not Coil).
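As a check, the 63.16% reported for 1coi-1-AS in the table below can be reproduced from this example by scoring only the residues for which a prediction was made:

dssp = '-HHHHHHHHHHHHHHHHHHHHHHHHH---'
pred = 'XXXXX--NNNNNN--NNNN---NNXXXXX'

# A residue is correct when the classifier says '-' and DSSP assigns
# Coil, or when it says 'N' and DSSP assigns Helix or Sheet.
scored = [(d, p) for d, p in zip(dssp, pred) if p != 'X']
correct = sum((p == '-') == (d == '-') for d, p in scored)
print(f'{correct}/{len(scored)} = {100 * correct / len(scored):.2f}%')
# prints: 12/19 = 63.16%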

The table below summarizes the predictions of the binary classifier Coil/~Coil, trained on the full training set with the hydrophobicity attribute. As mentioned above, the classifier was tested on 11 sequences only; its overall accuracy is 63%. The Helix/~Helix and Sheet/~Sheet classifiers always predicted that the given data was not in the corresponding structure, i.e., the Sheet/~Sheet classifier output that none of the residues belong to the secondary structure 'Sheet'.

S.No  Test Sequence Name  Sequence Length  Accuracy (%)
1     1clc-2-AS.1         239               62.45
2     2bat-1-GJB          388               66.93
3     1amp-1-AS           291               66.55
4     1dlc-1-AS.1         229               61.87
5     1mdta-2-AS          177               65.27
6     1coi-1-AS            29               63.16
7     1wfbb-1-AUTO.1       37              100.00
8     1cdlg-1-DOMAK        20               70.00
9     1fuqb-1-AUTO        136               53.97
10    1svb-1-AS           299               57.09
11    1gflb-1-AS          230               59.09

The predicted outputs from the three classifiers using the molecular weight attribute were approximately -1 for all the test data, and classifiers using the isoelectric point attribute produced similar results. Most of the training examples ended up as support vectors. In the end, no valid predictions could be made using these attributes.

Discussion:

The binary classifier Coil/~Coil (using hydrophobicity values) had an overall accuracy of 63% on the 11 test sequences. A more reliable estimate of its performance could be obtained by testing on a larger set of sequences. One reason the Helix/~Helix and Sheet/~Sheet classifiers did not produce any results may be that the kernel function of SVMlight was not suited to these examples; a different kernel function might have produced better results, but implementing a new kernel function is beyond the scope of this project. Another reason might be the length of the sliding window: it is unlikely that the same window length produces optimal results for all classifiers.

If the three classifiers Coil/~Coil, Helix/~Helix, and Sheet/~Sheet had all produced valid results, their outputs would have been compared to identify the secondary structure with the highest probability for each residue in a given protein sequence. Unfortunately, the Helix/~Helix and Sheet/~Sheet models did not give any valid results to perform this comparison.

Classifiers using molecular weight and isoelectric point failed to distinguish between the secondary structures. Nearly 70% of the training examples ended up as support vectors, which indicates that the data was difficult to separate. The results showed that using only the molecular weight or isoelectric point attribute did not help in predicting secondary structure.

Using more than one attribute in each classifier may give better results than using a single attribute, since more information is then available about the neighboring residues, which can help in classifying the data sets more accurately; a sketch of such an encoding follows.
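One possible encoding, assuming the per-residue values from the table in Materials and Methods (dictionaries truncated here; the helper is our own illustration): each window position contributes three features instead of one, so an 11-residue window yields 33 features.

HYDRO = {'A': 1.8, 'R': -4.5}     # ... remaining 18 amino acids as in the table
MW    = {'A': 71.04, 'R': 156.10}
PI    = {'A': 6.00, 'R': 11.15}

def multi_attribute_features(window):
    # Three features per residue: hydrophobicity, molecular weight,
    # isoelectric point.
    feats = []
    for aa in window:
        feats.extend((HYDRO[aa], MW[aa], PI[aa]))
    return ' '.join(f'{i + 1}:{v}' for i, v in enumerate(feats))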

Conclusion:

Using SVMlight, we designed various binary classifiers to predict the secondary structure of a given protein. Among the various models tested, only one produced a valid result: the classifier that predicts the secondary structure Coil using only the hydrophobicity values of neighboring residues. The accuracy of this model was found to be 63%. The accuracy could be improved further by including multiple attribute values. Future work involves choosing a proper window length, a suitable kernel function, and more appropriate training examples for each of the classifiers.

References:

1. Chou, P. Y. and Fasman, G. D., "Prediction of Protein Conformation", Biochemistry, 13(2), 222-245 (1974).
2. Qian, N. and Sejnowski, T. J., "Predicting the secondary structure of globular proteins using neural network models", J. Mol. Biol., 202, 865-884 (1988).
3. Riis, S. K. and Krogh, A., "Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments", J. Comput. Biol., 3, 163-183 (1996).
4. Hua, S. and Sun, Z., "A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach", J. Mol. Biol., 308, 397-407 (2001).
5. Vapnik, V., "The Nature of Statistical Learning Theory", Springer-Verlag (1995).
6. Rost, B. and Sander, C., "Prediction of protein secondary structure at better than 70% accuracy", J. Mol. Biol., 232, 584-599 (1993).
7. Cuff, J. A. and Barton, G. J., "Evaluation and improvement of multiple sequence methods for protein secondary structure prediction", Proteins: Struct. Funct. Genet., 34, 508-519 (1999).
8. Kabsch, W. and Sander, C., "Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features", Biopolymers, 22, 2577-2637 (1983).
9. Joachims, T., "Transductive Inference for Text Classification using Support Vector Machines", International Conference on Machine Learning (ICML) (1999).
10. Joachims, T., "Optimizing Search Engines Using Clickthrough Data", Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM (2002).
