final (LNCS)


Modified Association Rule Mining Approach for the MHC-Peptide Binding Problem

Galip Gürkan Yardımcı, Alper Küçükural, Yücel Saygın, Uğur Sezerman

{yardimci, kucukural}@su.sabanciuniv.edu

{ysaygin, ugur}@sabanciuniv.edu

Faculty of Engineering and Natural Sciences,

Sabanci University, Turkey

Abstract. Computational approaches for predicting peptide binding to the major histocompatibility complex (MHC) are crucial for vaccine design, since these peptides can act as T-Cell epitopes that trigger an immune response. There are two main branches of peptide prediction methods: structural approaches and data mining approaches. These methods can be used successfully to predict T-Cell epitopes in cancer, allergy and infectious diseases. In this paper, association rule mining methods are implemented to generate rules of peptide selection by MHCs. To capture the binding characteristics, modified rule mining and data transformation methods are implemented. Peptides that are known to bind to the same MHC show sequence variability; to capture this characteristic, we used a reduced amino acid alphabet obtained by clustering amino acids according to their physico-chemical properties. The innovations of the method presented are the use of this classification of amino acids and the use of the OR operator to combine rules, reflecting that different amino acid types and positions along the peptide may be responsible for binding. We can predict MHC Class-I binding with 75-97% coverage and 76-100% accuracy.

Keywords: Peptides, MHC Class-I, Association rule mining, reduced amino acid alphabet, data mining.

1 Introduction

Peptide binding prediction is a crucial step in vaccine design, since it helps in understanding the mechanism of the immune response to foreign bodies and how vaccines work. There are numerous experimental results on this subject, but such experiments are time-consuming and costly, since a vast number of peptides must be tried as vaccine candidates even for a single MHC.

Therefore, there is an urgent need for effective computational methods for the MHC-peptide binding problem. Methods developed for finding peptide sequences specific to target MHCs can also be used for developing therapeutic proteins, as well as for other types of receptors.

MHCs recognize antigens, which are foreign macromolecules that cause an immune response in the body. There are two types of immune response to antigens: humoral and cellular. Class II MHC molecules are involved in the humoral immune response, whereas Class I MHC molecules are involved in the cellular immune response, which is the response after the antigen enters the cell [9]. In this paper we focus on the cellular response, which involves recognition of antigenic fragments by Class I MHCs. After foreign bodies enter the cell, they are cleaved into smaller pieces called peptides. These peptides are picked up by Class I MHCs and brought to the cell surface. There are on average three to four different types of Class I MHCs in a human cell, each of which binds to different types of peptides, including self and antigenic peptides. T-Cells recognize the infected cell upon binding to the antigenic peptide-MHC complex, which triggers a cascade of events leading to the cellular immune response to foreign bodies. In both the Class I and Class II pathways, the most important molecule in initiating the recognition of infected cells is the major histocompatibility complex (MHC). Knowing which of the peptides resulting from the cleavage of antigens will be picked up by the MHC molecule, and understanding the mechanism of peptide binding (sequence motifs), will be of great use in vaccine design. A peptide presented to a T-Cell together with an MHC molecule is called a T-Cell epitope. If the cell is infected, the T-Cell can induce it to undergo apoptosis. In this paper, we investigate the Class I pathway for prediction of T-Cell epitopes.

Laboratory experiments can be used to determine which peptides bind to which kind of MHC molecule. The peptides known to bind to Class I MHCs have variable length, but the majority have between 8 and 10 residues. Conducting laboratory experiments for all peptide binding combinations is not feasible, since there are 20^8 to 20^10 possible peptides over the 20-letter amino acid alphabet, but only a few are selected by the MHC [12]. We combine structural and data mining based methods for prediction of T-Cell epitopes. Association rule mining techniques are used for finding correlations between positions of the bound peptides and determining the binding motifs for each type of MHC. These rules will be useful for understanding the mechanism of peptide binding.
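To give a sense of scale for the size of the peptide space mentioned above, the counts can be computed directly. This is only a small illustrative sketch; the function name `peptide_space` is ours and does not come from the original work.

```python
# Number of distinct peptides of a given length over the 20-letter
# amino acid alphabet (illustrative helper, not part of the method).
ALPHABET_SIZE = 20


def peptide_space(length: int) -> int:
    """Return how many different peptides of `length` residues exist."""
    return ALPHABET_SIZE ** length


for n in (8, 9, 10):
    print(f"{n}-mers: {peptide_space(n):.3e}")
```

Even the smallest of these counts is in the tens of billions, which is why exhaustive laboratory screening is impractical.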

2 Background and Related Work

There are two main approaches to the peptide prediction problem: profile based approaches and machine learning. Profile based approaches build profile scoring matrices from the alignment of the binding peptides. These methods check the peptide sequence for the presence of the preferred residues at certain positions of the peptide, as predicted by the scoring matrix. Up to now, the most successful methods have been machine learning methods, such as SVMHC [7].

Profile based methods, such as SYFPEITHI [11], Rankpep [13], and ProPred1 [15], only take the positive cases into account when deriving their information. They therefore do not have high specificity compared to machine learning approaches, where the non-binder class information is also taken into account to distinguish the properties of binders [4].

The second group of researchers used machine learning approaches such as Support Vector Machines and Artificial Neural Networks to find the correlations between the positions of the peptide and to build a valid probabilistic model using data on both binding and non-binding peptides [5],[6],[7]. Another method, developed by Milledge et al. for predicting peptides for the HLA-A*0201 type of MHC, created sequence-structural patterns by using association rules to reflect the MHC binding characteristics of peptides [10].

2.1 Association rule mining

The problem of finding association rules among items is formally defined by Agrawal et al. in [1], [2] as follows:

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of all items. Let $T$ be a transaction consisting of a set of items such that $T \subseteq I$. We call $D$ a database of transactions. We say that a transaction $T$ contains $X$, a set of some items in $I$, if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. An item set $X$ has support $s$ if $s\%$ of the transactions contain $X$. We say that the rule $X \Rightarrow Y$ holds with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ if $s\%$ of transactions in $D$ contain $X \cup Y$.

Association rule mining algorithms scan the database of transactions and calculate the support and confidence of the candidate rules to determine whether they are significant. For that purpose, the algorithms use threshold values to prune the insignificant rules. A rule is significant if its support and confidence are higher than the user-specified minimum support and minimum confidence thresholds. In this way, the algorithms do not retrieve all the association rules that could possibly be generated from a database; instead, only the very small subset of rules which satisfy the threshold values is retrieved.
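The definitions of support, confidence and threshold-based pruning given above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function names and the toy database are hypothetical, and the threshold test uses "greater than or equal to" for simplicity.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(db: List[Transaction], itemset: Iterable[str]) -> float:
    """Percentage of transactions in db that contain every item of itemset."""
    items = frozenset(itemset)
    return 100.0 * sum(1 for t in db if items <= t) / len(db)


def confidence(db: List[Transaction], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Percentage of transactions containing lhs that also contain rhs."""
    lhs_set, rhs_set = frozenset(lhs), frozenset(rhs)
    with_lhs = [t for t in db if lhs_set <= t]
    if not with_lhs:
        return 0.0
    return 100.0 * sum(1 for t in with_lhs if rhs_set <= t) / len(with_lhs)


def is_significant(db, lhs, rhs, min_support: float, min_confidence: float) -> bool:
    """Keep a rule only if it reaches both user-specified thresholds."""
    rule_support = support(db, frozenset(lhs) | frozenset(rhs))
    return rule_support >= min_support and confidence(db, lhs, rhs) >= min_confidence


# Hypothetical toy database of four transactions.
db = [frozenset(t) for t in ({"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"})]
print(support(db, {"a", "b"}))                 # 50.0
print(round(confidence(db, {"a"}, {"b"}), 2))  # 66.67
print(is_significant(db, {"a"}, {"b"}, min_support=40, min_confidence=60))  # True
```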

Support of an association rule mimics the coverage of that rule, and confidence of the rule specifies the accuracy. Both of these measures are important for determining the significance of a rule. Therefore we used a combined support confidence measure (CSC-Measure)¹. The CSC-Measure is obtained by taking the harmonic mean of the support and confidence measures:

$$\mathrm{CSC}(s, c) = \frac{2 \cdot s \cdot c}{s + c}$$

where $s$ is the support and $c$ is the confidence of the rule. The CSC-Measure takes both the confidence and support of the rule into account, so rules which have high confidence values and which cover more transactions in the data set are more valuable.
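As a small worked illustration of the harmonic mean defined above, the following Python sketch computes the CSC-Measure for a few input pairs. The function name `csc_measure` and the input values are ours and chosen only for illustration.

```python
def csc_measure(support: float, confidence: float) -> float:
    """Harmonic mean of support and confidence (both given in percent)."""
    if support + confidence == 0:
        return 0.0
    return 2.0 * support * confidence / (support + confidence)


# Illustrative inputs: a rule with low support is penalised even when
# its confidence is perfect.
print(round(csc_measure(50.0, 100.0), 2))  # 66.67
print(round(csc_measure(80.0, 80.0), 2))   # 80.0
print(csc_measure(0.0, 0.0))               # 0.0
```

Because the harmonic mean is dominated by the smaller of its two arguments, a rule needs both reasonable coverage and reasonable accuracy to score well.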

3 Association Rule Mining Methods for (Peptide-Binding) Prediction

Our data set D contains amino acid sequences of peptides which are known to bind to Class I MHC molecules [3]. In D, there are 198 transactions (peptides) known to bind to 4 different MHCs. We have worked with nine amino acid long sequences only, since the majority of the known peptides were nine-mers. Each peptide is represented by an item-set of nine elements, based on its sequence. In our case there are 180 different items, since there are nine different positions and twenty different amino acids. There are $20^9$ different possible item-sets; each set has nine elements for the nine positions, and each element can be one of the 20 different amino acids. The position of each amino acid in the sequence is important, so we have turned the sequences into item sets X of the form $A_p$, where A is the one letter code of the amino acid and p is its position in the sequence. The rules mined will be of the form $\{V_1\} \Rightarrow \{G_2\}$, meaning that the presence of a Valine in the first position of the peptide sequence implies that there will be a Glycine in the second position. For simplicity, we omit the curly brackets in the following sections.

¹ In the information retrieval context, precision and recall measures are combined in the same way to calculate the F-measure.
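The positional encoding of a peptide into items can be sketched as follows. This is a minimal Python illustration; the function name `peptide_to_items` and the example nine-residue string are our own choices and are not taken from the data set.

```python
from typing import FrozenSet


def peptide_to_items(peptide: str) -> FrozenSet[str]:
    """Encode a peptide as positional items such as 'V1' or 'G2' (1-based)."""
    return frozenset(f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1))


# Arbitrary nine-residue string used purely as an example input.
items = peptide_to_items("GILGFVFTL")
print(sorted(items, key=lambda item: int(item[1:])))

# With 20 amino acids and 9 positions there are 20 * 9 possible items.
print(20 * 9)  # 180
```

Attaching the position to the residue letter is what lets a standard item-set miner distinguish, for example, a Leucine at position two from a Leucine at position nine.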

However, MHC molecules are not very selective when binding peptides: an MHC can accommodate different types of amino acids at the same position of the peptide. There are pockets at the binding site of the MHC, and some of these pockets have to be filled with certain types of amino acids for the binding requirements to be fulfilled [19]. Sometimes the second position of the peptide fills the appropriate pocket, and sometimes the third position of the peptide occupies the same pocket. Therefore different amino acids and different positions of the peptide may play the same role in defining the peptide's binding characteristics; conventional association rules cannot capture this property well. We have therefore changed the rule structure to deal with this problem.

Our association rules have the form $\{V_2\} \vee \{A_2\} \vee \{L_3\} \Rightarrow \{I_9\}$, meaning that the presence of a Valine or an Alanine in the second position, or a Leucine in the third position, of the peptide implies that the ninth position of the peptide sequence will be an Isoleucine. Such rules can capture the binding characteristics better. This rule structure with ORs ($\vee$) will also increase the CSC-Measure of the rules, resulting in more globally correct binding characteristic rules. The definitions of the support and confidence measures remain unchanged; the only difference is that the calculations are done taking the OR into account.
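One way to read "taking the OR into account" is sketched below in Python. The interpretation is an assumption on our part: a peptide matches the left-hand side if it contains at least one of the OR-ed items, support is the percentage of all peptides that match the left-hand side and also contain the right-hand item, and confidence is that count relative to the peptides matching the left-hand side. The function names and the toy peptides are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple


def to_items(peptide: str) -> FrozenSet[str]:
    """Positional encoding, e.g. 'KVS...' -> {'K1', 'V2', 'S3', ...}."""
    return frozenset(f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1))


def or_rule_stats(peptides: List[str], lhs: Iterable[str], rhs: str) -> Tuple[float, float]:
    """Return (support %, confidence %) for the rule lhs_1 OR lhs_2 OR ... => rhs."""
    lhs_items = frozenset(lhs)
    encoded = [to_items(p) for p in peptides]
    # A peptide satisfies the left side if it shares at least one item with it.
    lhs_hits = [items for items in encoded if lhs_items & items]
    both = sum(1 for items in lhs_hits if rhs in items)
    support = 100.0 * both / len(encoded)
    confidence = 100.0 * both / len(lhs_hits) if lhs_hits else 0.0
    return support, confidence


# Hypothetical toy peptides, not taken from the data set.
peptides = ["KVSDEFGHI", "KASDEFGHI", "KTLDEFGHI", "KTSDEFGHI", "KVSDEFGHL"]
print(or_rule_stats(peptides, ["V2"], "I9"))              # (20.0, 50.0)
print(or_rule_stats(peptides, ["V2", "A2", "L3"], "I9"))  # (60.0, 75.0)
```

In this toy example, adding alternatives to the left-hand side raises the support from 20% to 60%, which is the effect the OR operator is meant to achieve.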

3.1 Candidate generation and rule mining

The candidate generation step is generally done by the apriori algorithm and its variations [2]. However, allowing OR in a rule increases the number of candidates so much that the apriori algorithm would not have a reasonable runtime. We therefore first extracted rules with one amino acid on each side using the conventional rule mining algorithm. We then combined these rules with the OR operator to yield rules which reflect the binding characteristics better. The confidence of a new combined rule will lie between the minimum and the maximum of the confidences of the rules which were combined to yield it. The support will increase, since the OR operator increases the number of sequences which contain the amino acids on the left side.

First we mined the database for association rules of the form $X_i \Rightarrow Y_j$, where X and Y are amino acids and i and j are their positions. Small confidence (50%) and support (20%) thresholds are used for two reasons. The first is that we expect these values to go up as we combine the rules with the OR operator, so we want as many rules as possible. The second is that low support values imply that the number of sequences (transactions) which contain both of the combined amino acids will be small.
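The first mining pass, which looks for rules with one amino acid on each side, can be sketched as a direct count over pairs of positional items. This is a simplified Python illustration rather than the conventional rule mining implementation used in the work; the function name `mine_single_rules` and the three-residue toy peptides are hypothetical, and only the default thresholds (20% support, 50% confidence) come from the text.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Tuple


def mine_single_rules(
    peptides: List[str], min_support: float = 20.0, min_confidence: float = 50.0
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(lhs, rhs): (support %, confidence %)} for rules X_i => Y_j."""
    item_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for peptide in peptides:
        items = [f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1)]
        item_counts.update(items)
        # Ordered pairs, because X => Y and Y => X are different rules.
        pair_counts.update(permutations(items, 2))

    rules = {}
    for (lhs, rhs), both in pair_counts.items():
        sup = 100.0 * both / len(peptides)
        conf = 100.0 * both / item_counts[lhs]
        if sup >= min_support and conf >= min_confidence:
            rules[(lhs, rhs)] = (sup, conf)
    return rules


# Hypothetical three-residue toy peptides to keep the output small.
toy = ["AVC", "AVD", "GVC", "ALC"]
rules = mine_single_rules(toy)
print(rules[("G1", "V2")])  # (25.0, 100.0)
print(("G1", "V2") in mine_single_rules(toy, min_support=30.0))  # False
```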

The combining process is as follows. Over the set of all one amino acid rules of the form $X_i \Rightarrow Y_j$, we combine the rules which have the same implication, generating all the possible two amino acid combinations of these rules.

After we have the two amino acid rules, we again combine these rules to yield three amino acid rules. From this point on the process is similar to the apriori algorithm: we combine k amino acid rules which share k-1 amino acids and which have the same implication, yielding k+1 amino acid rules. The fact that we are using the OR operator guarantees that the new rules' support values will never decrease, so we do not have to check support values. The pruning criterion is the CSC-Measure of the new rules: if the CSC-Measure does not improve by at least 2% with the addition of the new OR rule, the new rule is pruned.
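The level-wise combination and CSC-based pruning described above might be sketched as follows. This is a hedged Python illustration, not the authors' implementation: the function names are ours, the toy peptides are hypothetical, and we assume that "improve by at least 2%" means a gain of at least two percentage points over the better of the two rules being merged.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def csc_of_or_rule(peptides: List[FrozenSet[str]], lhs: FrozenSet[str], rhs: str) -> float:
    """CSC-Measure (in percent) of the rule OR(lhs) => rhs."""
    lhs_hits = [p for p in peptides if lhs & p]
    both = sum(1 for p in lhs_hits if rhs in p)
    if both == 0:
        return 0.0
    s = 100.0 * both / len(peptides)
    c = 100.0 * both / len(lhs_hits)
    return 2.0 * s * c / (s + c)


def grow_or_rules(
    peptides: List[FrozenSet[str]], single_lhs: Iterable[str], rhs: str, min_gain: float = 2.0
) -> Dict[FrozenSet[str], float]:
    """Combine left-hand sides that share the implication rhs, level by level."""
    level = {frozenset([x]): csc_of_or_rule(peptides, frozenset([x]), rhs) for x in single_lhs}
    kept = dict(level)
    while len(level) > 1:
        next_level: Dict[FrozenSet[str], float] = {}
        for a, b in combinations(level, 2):
            merged = a | b
            # Two k-item sides are merged only if they share k-1 items.
            if len(merged) != len(a) + 1 or merged in next_level:
                continue
            score = csc_of_or_rule(peptides, merged, rhs)
            # Assumed pruning rule: require a gain of min_gain points.
            if score - max(level[a], level[b]) >= min_gain:
                next_level[merged] = score
        kept.update(next_level)
        level = next_level
    return kept


def to_items(peptide: str) -> FrozenSet[str]:
    return frozenset(f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1))


# Hypothetical toy peptides.
toy = [to_items(p) for p in ["KVSDEFGHI", "KASDEFGHI", "KTLDEFGHI", "KTSDEFGHI", "KVSDEFGHL"]]
result = grow_or_rules(toy, ["V2", "A2", "L3"], "I9")
for lhs, score in sorted(result.items(), key=lambda kv: kv[1]):
    print(" v ".join(sorted(lhs)), "=> I9", round(score, 2))
```

Support is not checked during merging, in line with the observation that OR-ing more items into the left-hand side cannot lower it.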

3.2 Amino acid classification

Evolution allows for sequence variability; to capture this information, we have also classified the amino acids according to their physico-chemical properties, as given in Table 1. The classes of amino acids are taken from a previous study by Sezerman et al., which used an encoding-decoding algorithm that classified amino acids based on similarity scoring matrices [14]. The classification scheme given in Table 1 yielded the best results for us. Using the classification table enabled us to distinguish the binding rules according to physico-chemical properties; e.g. the HLA-A2 molecule prefers a peptide with a bulky hydrophobic residue at position two (Class F) and a small hydrophobic residue at position nine (Class A) for binding.

The classification step reduces the number of items and item-sets, reducing the number of rules but making the rules more compact. The number of possible item-sets is reduced from $20^9$ to $12^9$, and the number of items is reduced from 180 to 108.

Table 1. Classification of amino acids

Class   Amino Acid(s)     Class   Amino Acid(s)
A       I,V,L,M,A         G       W
B       R,K               H       H
C       D,E               J       G
D       S,T               K       Q,N
E       Y                 L       C
F       F                 M       P
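The preprocessing step that rewrites a peptide in the reduced alphabet of Table 1 can be sketched as a simple lookup. The class assignments below are copied from Table 1; the names `CLASSES`, `AA_TO_CLASS` and `reduce_peptide`, and the example peptide, are our own illustrative choices.

```python
# Class letter -> member amino acids, as listed in Table 1.
CLASSES = {
    "A": "IVLMA", "B": "RK", "C": "DE", "D": "ST", "E": "Y", "F": "F",
    "G": "W", "H": "H", "J": "G", "K": "QN", "L": "C", "M": "P",
}

# Inverted lookup: one-letter amino acid code -> class letter.
AA_TO_CLASS = {aa: cls for cls, members in CLASSES.items() for aa in members}


def reduce_peptide(peptide: str) -> str:
    """Rewrite a peptide using the 12-letter reduced alphabet."""
    return "".join(AA_TO_CLASS[aa] for aa in peptide)


print(reduce_peptide("GILGFVFTL"))  # JAAJFAFDA
print(len(CLASSES) * 9)             # 108 positional items
```

Note that a class letter in the reduced sequence (for example G for tryptophan) is not the same as the one-letter amino acid code with the same symbol.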

4 Implementation and Experimental Results

First, datasets are downloaded from SYFPEITHI [11]. As a preprocessing operation, the peptide sequences are rewritten using the classes in Table 1.

Only nine amino acid long binding sequences of different kinds of MHC molecules are used for rule extraction. The number of binding peptides for the different kinds of MHC molecules varied from 24 to 107. The nature of our data set required data cleaning. Peptide sequences are obtained experimentally. In some cases, researchers obtain MHC bound peptides, which are then sequenced and stored in the databases. In other cases, they artificially create polyalanine peptide sequences of length nine and check the binding affinity of this peptide to the specific MHC of interest. They then mutate each position separately to different amino acid types and compare the binding affinity of the mutated peptide with that of the original one. Therefore, many binding peptides coming from these studies had alanine (a neutral small amino acid that would not have any impact on the binding) in many positions. Since we are looking at the support and confidence of the binders, this would bias our association rules toward that amino acid type, so we cleaned our data of such sequences. A peptide sequence was removed from our data set if it had the same amino acid in four consecutive positions.
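The cleaning criterion stated above (removing any peptide with the same amino acid in four consecutive positions) can be sketched with a regular expression. This is a minimal Python illustration; the function names and example peptides are hypothetical, and we assume the filter is applied to the original amino acid sequences rather than to the class-encoded ones, which the text does not specify.

```python
import re
from typing import List

# Any character followed by three more copies of itself.
_RUN_OF_FOUR = re.compile(r"(.)\1{3}")


def has_run_of_four(peptide: str) -> bool:
    """True if some residue occurs in four consecutive positions."""
    return _RUN_OF_FOUR.search(peptide) is not None


def clean(peptides: List[str]) -> List[str]:
    """Drop peptides that look like polyalanine-style scanning constructs."""
    return [p for p in peptides if not has_run_of_four(p)]


# Hypothetical inputs: the first has a run of four A's, the third only runs of three.
print(clean(["AAAAKAAAA", "GILGFVFTL", "AAAKAAAGA"]))
```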

Table 2. Some of the best rules for four types of MHC molecule using four fold cross validation.

Molecule      Rule                                Avg. Support %   Avg. Confidence %   Avg. CSC-Measure %   Avg. Accuracy %
HLA-A020110   A1 ∨ A5 ∨ A6 ∨ A7 ⇒ A2              69.3             83.22               75.59                76.38
HLA-A020110   A1 ∨ A6 ∨ A7 ∨ A8 ⇒ A2              69.3             86.89               77.01                76.38
HLA-A020110   A1 ∨ A5 ∨ A6 ∨ A7 ∨ A8 ⇒ A2         71.92            83.71               77.32                78.88
HLA-A02019    A1 ∨ A3 ∨ A5 ∨ A6 ⇒ A2              77.25            93.25               84.48                82.19
HLA-A02019    A1 ∨ A3 ∨ A5 ∨ A6 ∨ A7 ⇒ A2         83.48            93.71               88.29                85.93
HLA-A02019    A1 ∨ A3 ∨ A4 ∨ A6 ∨ A7 ∨ A8 ⇒ A9    85.66            93.85               89.56                87.82
HLA-B089      A1 ∨ A4 ∨ A2 ⇒ B3                   68.05            74.17               70.96                83.33
HLA-B089      A1 ∨ C1 ∨ M2 ∨ A6 ∨ A8 ⇒ A9         72.22            75.63               73.84                87.5
HLA-B089      C4 ∨ B5 ∨ A6 ∨ A7 ∨ A8 ⇒ A9         75               77.28               76.11                87.5
HLA-B27059    B1 ∨ A3 ∨ A6 ⇒ B2                   79.27            100                 88.41                92.85
HLA-B27059    B1 ∨ A3 ∨ A6 ∨ A9 ⇒ B2              90.80            100                 95.17                100
HLA-B27059    B1 ∨ A3 ∨ A5 ∨ A6 ∨ A7 ∨ B9 ⇒ B2    95.4             100                 97.64                100

4.1 Testing Method

The data set we have used for association rule mining is non-redundant, and the number of sequences in it is not large enough, especially for certain MHC types, to split the database into separate test and training sets. We have used only binding peptides (positives) for the rule mining and testing processes. Since we have not worked with non-binding peptides (negatives), we can only calculate the sensitivity of our rules. Therefore we refer to sensitivity as the accuracy of our method. For the testing process, rules whose accuracy values are among the top 80% of all accuracy values are used. Some of the best rules are presented in Table 2. The values in Table 2 are obtained by using a training set of 198 peptides in total on 4 different MHC types. We can predict binding with up to 100% accuracy and 97% coverage in some cases (Table 2).

We have used four-fold cross-validation to test the accuracy and validity of the rules we have mined. The data set was split randomly into four subsets. We obtained the association rules using three of the subsets and tested them on the fourth. We then switched the test and training sets until we had run all possible combinations. The testing procedure involved using the association rules generated from the training set to identify binders. The values in Table 2 are averages over the four tests. The cross-validation showed that association rules can predict binders with between 76% and 100% accuracy.
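The four-fold procedure described above can be sketched as follows. This is a minimal Python illustration under several assumptions of ours: the function names are hypothetical, the mining step is passed in as a callable, and `toy_miner` is a deliberately trivial stand-in for the rule mining step so that the example stays self-contained. Since only binders are available, the score computed per fold is the sensitivity.

```python
import random
from collections import Counter
from statistics import mean
from typing import Callable, Iterator, List, Tuple


def k_fold_splits(peptides: List[str], k: int = 4, seed: int = 0) -> Iterator[Tuple[List[str], List[str]]]:
    """Shuffle once, cut into k folds, and yield (train, test) for each fold."""
    shuffled = list(peptides)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, test


def cross_validated_sensitivity(
    peptides: List[str], mine: Callable[[List[str]], Callable[[str], bool]], k: int = 4, seed: int = 0
) -> float:
    """Average percentage of held-out binders recognised by the mined rules."""
    scores = []
    for train, test in k_fold_splits(peptides, k, seed):
        is_binder = mine(train)
        scores.append(100.0 * sum(1 for p in test if is_binder(p)) / len(test))
    return mean(scores)


def toy_miner(train: List[str]) -> Callable[[str], bool]:
    """Stand-in for rule mining: accept peptides whose second residue is
    the most common second residue in the training fold."""
    top = Counter(p[1] for p in train).most_common(1)[0][0]
    return lambda p: p[1] == top


# Hypothetical peptides that all carry L at position two.
toy = ["ALAKDEFGV", "KLSDEFGHI", "GLTDEFGHL", "SLYNTVATL",
       "YLEPGPVTA", "FLPSDFFPS", "ILKEPVHGV", "KLGEFYNQM"]
print(cross_validated_sensitivity(toy, toy_miner))  # 100.0
```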

The accuracy of the resulting rules depends on the confidence and support thresholds. For some MHC types, the dataset is not sufficiently large, so small confidence and support thresholds must be used. For sufficiently large datasets, large support and confidence thresholds can be set, yielding 90% accuracy.

The CSC-Measure gives a better picture of the success of our method. CSC values varied between 71% and 92%. Our methods yield approximately 81% accuracy.

Brusic et al. report a predictive value of 78% for binding to human MHC HLA-A2 and 88% for mouse MHC H-2K^b using ANNs [6]. Udaka et al. report approximately 80% accuracy using a scoring program for prediction on three mouse MHC binding sequences [17]. Dönnes et al. report that 90% of all the peptides that are known to bind to MHC can be predicted with 90% specificity using support vector machines on data for 21 MHCs [7]. In another article, Udaka et al. report that HMMs achieve 84% precision, assessing their method with a precision-recall curve analysis [16]. SYFPEITHI uses a profile based method, evaluating the contribution of each amino acid in a peptide to the binding process and assigning an overall score to a given peptide. The scoring process is based on the knowledge of anchor and auxiliary anchor positions. For a given protein, all possible octamers, nonamers and decamers are evaluated, and SYFPEITHI reports that the naturally presented epitope is among the top scoring 2% of all peptides in 80% of all predictions [11]. The methods reported here use different data sets with varying data preprocessing steps, so our results are not directly comparable to theirs, except for SYFPEITHI, with which we share our dataset.

5 Discussion and Perspectives

The novelties of our approach are the use of the OR operator and of a reduced amino acid alphabet. We have used a new association rule mining operator (OR) to combine rules in order to describe the binding preferences of MHC molecules. This combination better explains the importance of specific sites on the binding peptide. The second and ninth positions appear most frequently in the motifs. These positions have highly correlated hydrophobicity values, which is also supported by Zhang et al. [19]. Zhang et al. also report that HLA-A02 classes require isoleucine, valine, leucine or methionine (class A according to our amino acid classification) as consensus anchors for binding, and that HLA-B classes need charged residues (classes B and C according to our classification) as consensus anchors. These findings also correlate with our rules. The reduced amino acid alphabet helped us to determine the important physical and chemical properties of the amino acids required at significant positions for successful binding to MHC. Deriving general rules for binding is a crucial contribution of our method. Profile based methods assume a contribution from every position on the peptide, even though some positions contribute more than others depending on the frequency of occurrence at the given position. Even when a peptide has the binding motif at the key positions, the scores coming from the other sites can cause it to be classified as a non-binder. According to Gulukota et al. [8], profile based methods have 30% accuracy in the prediction of binders. Our method points out key positions and significant features for binding. Machine learning based methods can predict binders with high accuracy and specificity but cannot report the features that are important for binding, which is crucial information for vaccine design. Therefore, they are not well suited for this type of application.

So far we have not considered the information coming from non-binders. For future work, we are developing a new approach which takes non-binder information into account as well. We are also trying to scan for explicit pairs or triplets in peptide sequences using a Bayesian approach, and to compare its efficiency with that of our method.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'93), Washington, DC, pp. 207-216, May 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), Santiago, Chile, pp. 487-499, Sept. 1994.

[3] M. Bhasin, H. Singh, G. P. S. Raghava. MHCBN: A Comprehensive Database of MHC Binding and Non-Binding Peptides. Nucleic Acids Research, Vol. 19 no. 5 pp. 665-666, 2002.

[4] V. Brusic, V.B. Bajic, N. Petrovsky. Computational methods for prediction of T-cell epitopes - a framework for modelling, testing, and applications. Methods, 34(4):436-43, 2004.

[5] V. Brusic and D.R. Flower. Bioinformatics tools for identifying T-cell epitopes. DDT: BIOSILICO Vol. 2, No. 1, pp. 18-23, January 2004.

[6] V. Brusic, G. Rudy, L. C. Harrison. Prediction of MHC Binding Peptides Using Artificial Neural Networks. Complexity International, Volume 2, 1995.

[7] P. Dönnes, A. Elofsson. Prediction of MHC class I binding peptides, using SVMHC. Bioinformatics, 3:25, 2002.

[8] K. Gulukota, J. Sidney, A. Sette, C. DeLisi. Two complementary methods for predicting peptides binding major histocompatibility complex molecules. J Mol Biol, 267:1258-1267, 1997.

[9] P. M. Kloetzel. The proteasome and MHC class I antigen processing. Biochimica et Biophysica Acta 1695, pp. 217-225, 2004.

[10] T. Milledge, G. Zheng, G. Narasimhan. An Application Of Association Rule Mining to Hla-A*0201 Epitope Prediction. ICBA, 2004.

[11] H. G. Rammensee, J. Bachmann, N.P.N. Emmerich, O.A. Bachor, and S. Stevanovic. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50(3-4):213-219, 1999.

[12] H.G. Rammensee, T. Friede, S. Stevanovic. MHC ligands and peptide motifs: 1st listing. Immunogenetics 41, pp. 178-228, 1995.

[13] P.A. Reche, J. P. Glutting, and E.L. Reinherz. Prediction of MHC Class I Binding Peptides Using Profile Motifs. Hum. Immunol., 63:701-709, 2002.

[14] O.U. Sezerman, R. Islamaj and E. Alpaydin. Three dimensional representation of amino acid characteristics. IEEE EMBC, Vol. 3, 2903-2906, 2001.

[15] H. Singh and G.P.S. Raghava. ProPred1: prediction of promiscuous MHC Class-I binding sites. Bioinformatics, Vol. 19 no. 8 pp. 1009-1014, 2003.

[16] K. Udaka, H. Mamitsuka, Y. Nakaseko and N. Abe. Empirical Evaluation of a Dynamic Experiment Design Method for Prediction of MHC Class I-Binding Peptides. The Journal of Immunology, 169:5744-5753, 2002.

[17] K. Udaka, K.H. Wiesmuller, S. Kienle, G. Jung, H. Tamamura, H. Yamagishi, K. Okumura, P. Walden, T. Suto, T. Kawasaki. An automated prediction of MHC class I-binding peptides based on positional scanning with peptide libraries. Immunogenetics, pp. 816-828, 2000.

[18] J. Zeng, H. R. Treutlein & G. B. Rudy. Predicting sequences and structures of MHC-binding peptides: a computational combinatorial approach. Journal of Computer-Aided Molecular Design, pp. 573-576, 2001.

[19] C. Zhang, A. Anderson, C. DeLisi. Structural principles that govern the peptide-binding motifs of class I MHC molecules. J. Mol Biol, 929-947, 1998.
