G-Protein-Coupled Receptor Motif Sequence Analysis: The Machine Learning Approach

Durgaprasad Bollina, Biovoice Knowledge Center, Hyderabad. Email: durgaprasad_b@yahoo.com
Rajasekhara Reddy B, Asst. Prof., Dept. of ECE, Vasavi College of Engineering, Hyderabad. Email: rajbhima@rediffmail.com

Abstract

G-protein-coupled receptors (GPCRs) form a large superfamily of proteins that transduce signals across the cell membrane. At the external side they receive a ligand (a photon in the case of opsins), and at the cytosolic side they activate a G protein. GPCRs consist of a single protein chain that crosses the membrane seven times. Most ligands bind between the membrane helices, but the extracellular loops are sometimes also involved in ligand recognition. The second and third cytosolic loops and part of the (cytosolic) C-terminal end of the receptors are involved in G-protein recognition. G-protein activation leads to the activation of only a limited set of second messengers. An approach to discovering sequence motifs characteristic of ligand classes is described and applied to GPCRs and their agonists, using Bayesian neural networks with correlating residuals, a machine learning technique.

Introduction

In 1971, Martin Rodbell conceived the idea that a guanine-nucleotide regulatory protein functionally connects receptors with effectors in the context of hormonal (glucagon) stimulation of the adenylyl cyclase system, generating the second messenger cyclic AMP. Since GPCRs play an important role in pathophysiology and are relatively easy to express, they are amenable targets for drug development. This large family allows for a systematic approach to isolation. Historically, GPCRs have been excellent molecular targets for drug development, and it is expected that orphan receptors (GPCRs with unknown ligands) will offer similar potential. Indeed, GPCRs are the target of more than 50% of the current therapeutic agents on the market.
Hundreds of GPCRs exist, and they can be activated by a multitude of agonists. Five steps can be observed in the process of GPCR activation: (1) signal generation by a photon or by ligand binding; (2) signal transduction through the membrane; (3) binding to the G protein; (4) activation of the G protein; (5) activation of the second messenger. Most experimental data available to date relates to either ligand binding or G protein interaction. It was shown that these two functions involve distinctly different sets of residues.

One of the central paradigms in protein research is that the sequence determines the structure, and the structure the function. Many G protein-coupled receptor sequences have been determined, and much functional data is available. Structural data, however, is not available, and modelling studies have not yet yielded adequately accurate models to allow for the inference of function. It would therefore be useful if the structure could be skipped in the sequence -> structure -> function pathway, in other words if functional data could be abstracted directly from the sequence without the need for structure determination or modelling studies.

Although the steps of GPCR activation clearly differ in the kind of processes taking place, they are not discrete and independent. For example, allostery between ligand binding and G protein binding has been observed for several GPCRs, as well as cation-dependent allosteric regulation of agonist and antagonist binding. Many residues involved in this allosteric effect have been mutated. All mutations led to changes in ligand and/or G protein binding. Mutation of D-224 to asparagine abolished the cation effect on the allosteric site and weakened agonist binding to GPCRs without affecting antagonist binding or G protein activation.
In the case of the rhodopsin-transducin interaction, a synergistic competitive mechanism has been uncovered for peptides with the same sequence as these regions.

The GPCR family of proteins has traditionally provided the pharmaceutical industry with a rich source of targets for drug discovery. The recent sequencing of the human genome has led to the compilation of the complete catalog of human GPCRs. Current GPCR research focuses on (1) determining which members of this family represent opportunities for therapeutic intervention and (2) efficiently identifying small-molecule modulators of these targets for drug development. GPCRomics is defined as the application of a wide range of technologies to this research effort.

GPCRomics

G protein-coupled receptors (GPCRs) form a large protein family whose members have been used as drug targets. The typical structure of these receptors has an extracellular receptor domain (ECD) at the N-terminus, a seven-transmembrane domain (7TMD) in the middle and an effector domain at the C-terminus. After the ECD recognises and reacts with signals carried by biogenic amines, amino acids, ions or lipids, the 7TMD transmits a conformational change into the cell to the effector domain. Downstream, these changes affect gene expression and finally provoke biological responses. Regulation of GPCRs involves trafficking and fusion between the GPCR and endocytic vesicles. Endocytosis and recycling of the endocytic vesicles also affect GPCRs.

GPCRinformatics

The GPCRDB (information system for G protein-coupled receptors) is a molecular class-specific information system that collects, combines, validates and disseminates heterogeneous data on G protein-coupled receptors. The database stores data on sequences, ligand binding constants and mutations. The system also provides computationally derived data such as sequence alignments, homology models, and a series of query and visualization tools.
A computational method is presented that determines which residues are important for these functions. The method uses correlation analysis to recognise residue patterns that correspond with functions. The basic idea is very simple: if, for example, a residue is conserved in all proteins that bind to the same agonist, then it could be involved in this binding. This sequence pattern correlation technique was used to find the residues that determine the optimal wavelength of the photon absorbed by the retinal in opsins. The technique was also useful in the analysis of residues that play a role in ligand binding in several classes of biogenic amine receptors. An important aim of this work is to automatically detect residues that may be interesting targets for future mutagenesis studies.

Need for Computational Methods

An approach to discovering sequence patterns characteristic of ligand classes is described and applied to aminergic G protein-coupled receptors (GPCRs). Putative ligand-binding residue positions were inferred by considering three lines of evidence: conservation in the subfamily that is absent or underrepresented in the superfamily, any available mutation data, and the physicochemical properties of the ligand. A minimally defined motif discovered in this fashion is an appropriate computational tool for identifying additional, potentially novel aminergic GPCRs.

Pros and Cons of the Current Protein Classification Tools

Due to the differences in their underlying techniques and in their focus (e.g., family coverage), each method (and database) has different strengths and weaknesses. In order to take maximum advantage of these various information sources, it is usually necessary to conduct multiple pattern/profile searches.
Integrated databases, e.g., InterPro (http://www.ebi.ac.uk/interpro/) and MetaFam (http://metafam.ahc.umn.edu/), were developed to facilitate such tedious procedures. One of the problems inherent in all of these pattern/profile search methods and databases is that their patterns are in general derived from relatively short regions. This is particularly the case for PROSITE patterns (expressed as regular expressions).

Problem

Correlated mutation analysis (CMA for short) is a powerful technique to determine 'important' residues if you have a multiple sequence alignment available. The following excerpt from a profile of 27 aligned sequences illustrates the idea.

 Seq  Arb  Cons  -   Open  Elong   N  Var  Residues
  64    0   Q    Q   3.05   0.10  27   90  QQQQQQQQQQQQQQEEEEEQEEQQQEE
  65    0   H    H   3.11   0.20  27  100  HHHHHHHHHHHHHHHHHHHHHHHHHHH
  66    0   K    K   3.12   0.20  27  100  KKKKKKKKKKKKKKKKKKKKKKKKKKK
  67    0   K    K   3.15   0.21  27  100  KKKKKKKKKKKKKKKKKKKKKKKKKKK
  68    0   L    L   3.12   0.20  27  100  LLLLLLLLLLLLLLLLLLLLLLLLLLL
  69    0   R    R   3.11   0.17  27  100  RRRRRRRRRRRRRRRRRRRRRRRRRRR
  70    0   T    T   3.03   0.10  27   93  TTTTTTTTTSTTTTTTTTTTTTQTTTT
 136  341   Y    Y   8.00   2.00  27   80  YYYYYYYYYYYYYYWWWWWYWWYYYWW
 137    0   V    V   2.94   0.05  25   70  --VVVVVVIIVVIIMVMVVVMLMIILV
 138    0   V    V   3.11   0.14  27  100  VVVVVVVVVVVVVVVVVVVVVVVVVVV
 139    0   V    V   3.04   0.10  27   90  VVIVVVVVVVVVVVVVVVVIVVVIIVV
 140    0   C    C   3.15   0.21  27  100  CCCCCCCCCCCCCCCCCCCCCCCCCCC
 141    0   K    K   3.12   0.14  27  100  KKKKKKKKKKKKKKKKKKKKKKKKKKK
 142    0   P    P   3.05   0.11  27  100  PPPPPPPPPPPPPPPPPPPPPPPPPPP
 273  626   F    F   8.00   2.00  27   80  FFFFFFFMFFFFFFWWWWWFWWFFFWW
 274  627   Y    Y   3.06   0.15  27   96  YYYYYYYYYYYYYYYYYYWYYYYYYYY
 275  628   I    I   3.11   0.11  27  100  IIIIIIIIIIIIIIIIIIIIIIIIIII
 276    0   F    F   3.07   0.11  27  100  FFFFFFFFFFFFFFFFFFFFFFFFFFF
 277    0   T    T   2.91   0.07  27   87  TTTTTTTTTSTSTTTTTTTTTTITTST
 278    0   H    H   2.86   0.05  27   74  HHHHHHHHHNNNHHHHHNNHHHNHHNH
 279    0   Q    Q   3.02   0.15  27  100  QQQQQQQQQQQQQQQQQQQQQQQQQQQ

The columns mean, from left to right:

Seq: the sequential number in the alignment. This number has no value whatsoever and is only needed if you want to compare the profile with the alignment.
Arb: the so-called arbitrary sequence number (the same position in any GPCR alignment always gets this same number; in other words, every Arg in every DRY motif, for example, always gets number 340).
Cons: the consensus residue at this position in the alignment.
-: this column is of no value (yet).
Open: the gap open penalty used at this position in the alignment.
Elong: the gap elongation penalty used at this position in the alignment.
N: the number of sequences in the alignment that have a residue at this position.
Var: the variability at this position in the alignment. 100 means totally conserved. Below 40, things are doubtful.
Residues: the residues at this position in the 27 aligned sequences.

Things get clearer when we put the three highlighted positions (64, 136 and 273) directly underneath each other:

                                           123456789012345678901234567890
  64    0   Q    Q   3.05   0.10  27   90  QQQQQQQQQQQQQQEEEEEQEEQQQEE
 136  341   Y    Y   8.00   2.00  27   80  YYYYYYYYYYYYYYWWWWWYWWYYYWW
 273  626   F    F   8.00   2.00  27   80  FFFFFFFMFFFFFFWWWWWFWWFFFWW

Now you see why we call this correlated. Let us forget the M at position 626 in sequence 8 and the entire contribution of sequence 20. You then see that these three sequence positions are not conserved, but if one position is different between any two sequences, all three are different, and on top of that, the changes going from one sequence to the other are always between the same residue types.

Conserved residues are more important than CMA residues. CMA results are calculated 100% automatically, and sometimes something can go wrong, so check the results to see if there really is a correlation. If, for example, 15 out of 16 sequences are conserved and the 16th is different, the CMA score will be reasonably high. This is of course an artefact.
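The pairwise reasoning above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's actual scoring scheme: the function names (pair_agreement, correlation_score) are hypothetical, and the score is simply the fraction of sequence pairs in which two alignment positions behave the same way (both unchanged or both changed). The three columns are copied from the alignment excerpt above.

```python
from itertools import combinations


def pair_agreement(col_a, col_b):
    """Return (agreeing_pairs, total_pairs) for two alignment columns.

    A pair of sequences "agrees" when both columns stay the same between
    the two sequences, or both columns change. Gap characters are treated
    like any other symbol (a simplifying assumption).
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must come from the same alignment")
    agree = 0
    total = 0
    for i, j in combinations(range(len(col_a)), 2):
        total += 1
        if (col_a[i] != col_a[j]) == (col_b[i] != col_b[j]):
            agree += 1
    return agree, total


def correlation_score(col_a, col_b):
    """Fraction of sequence pairs on which the two columns agree."""
    agree, total = pair_agreement(col_a, col_b)
    return agree / total


# Columns at alignment positions 64, 136 and 273 (27 sequences each).
col_64 = "QQQQQQQQQQQQQQEEEEEQEEQQQEE"
col_136 = "YYYYYYYYYYYYYYWWWWWYWWYYYWW"
col_273 = "FFFFFFFMFFFFFFWWWWWFWWFFFWW"

print("64 vs 136:", correlation_score(col_64, col_136))
print("64 vs 273:", correlation_score(col_64, col_273))

# Artefact check: a conserved column against 15 identical residues plus one outlier.
print("artefact :", correlation_score("A" * 16, "A" * 15 + "G"))
```

With this simple score, positions 64 and 136 agree on every pair of sequences, while the single M at position 273 lowers its agreement with position 64 slightly. The last line reproduces the artefact: a fully conserved column compared with a column in which 15 of 16 sequences are identical still scores 0.875, which is why the results should always be inspected by eye.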
If there are no good correlations, the program still colours the best ones. So there will always be a result, no matter how meaningless.

Iterative unbiased profile alignment of sequences

We use a profile-based alignment method similar to that described by Sander and Schneider (1991). All biases that could be introduced by the use of a scoring matrix are removed by making profiles relate directly to the frequency of occurrence of residue types at each position in the protein. Sequences are aligned against this profile, rather than against all other sequences. This creates a problem: in order to do the alignment, we need a profile, and in order to get a profile, we need an alignment. To solve this problem, a multiple sequence alignment is made on a subset of the sequences that show high pairwise homologies. From this initial (very easy to perform) alignment a profile is created. This profile is then used to align all sequences of interest. The aligned sequences are sorted according to their similarity to the profile, and a new profile is made from the highest-scoring sequences. This process is repeated until all sequences are incorporated; in every iteration more sequences are incorporated than in the previous one. Sequences that show too little similarity with the consensus sequence of the profile are not used, because there is no guarantee that they belong to the same structural family. After incorporating all sequences, the alignment procedure is iterated a few more steps. In practice, this iterative alignment method has been shown to converge to a satisfactory multiple sequence alignment in fewer than 5 cycles, provided enough sequences are available (typically 20-30 sequences are enough).

Analysis of correlated mutations

Correlated mutational behaviour can be defined as the tendency of residues to stay conserved or to mutate in tandem between (sets of) sequences. Several techniques have been described to analyse correlated mutations (e.g. Goebel et al., 1993). Goebel et al. found that the correlation coefficient is related to the chance that the residue pair is in contact in the structure. In this method completely conserved residues give rise to a high correlation, indicating their importance for structural integrity. This procedure works well if structure analysis or structure prediction is the primary aim. To determine functional correlations, however, a slightly different approach is required (Casari et al., 1994; Oliveira et al., 1993). Oliveira et al. (1993) used an approach similar to that of Goebel et al. (1993) to determine the correlation between residue positions in GPCRs, but always required a certain degree of variability at the positions that are analysed for correlating behaviour. Figure 2 shows an example of correlated mutational behaviour.

Figure 2. Example of correlated mutations.

                         Seq.    5    10
   Sequence position #           |    |
            1                AAAASSSSTTTT             Positions 2 and 3 are compared with this one
            2                RRRRPPPPHHHH  66  1.00   All residues correlate perfectly with position 1
            3                TTTTGGGGEEED  63  0.95   One residue is not correlated with position 1

Bayesian Neural Networks with Correlating Residuals for motif analysis

Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of artificial neural networks (ANNs) as well. Learning typically occurs by example through training, that is, exposure to a ground-truth set of input/output data where the training algorithm iteratively adjusts the connection weights (synapses). These connection weights store the knowledge necessary to solve specific problems. Here we use the multilayer perceptron, which is generally trained with the error backpropagation algorithm, and deploy supervised training for motif analysis with a three-layer approach. Using a constant diagonal covariance with the maximum likelihood approach corresponds to minimising the sum-of-squares error. A convenient conjugate prior allows the networks to be defined easily.
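The link between a constant diagonal covariance and the sum-of-squares error can be written out explicitly. The notation here is an assumption introduced for illustration only: N training cases with inputs x_n and K-dimensional targets t_n, network outputs y(x_n; w) for weights w, and Gaussian noise with covariance sigma^2 I.

```latex
\begin{align}
p(D \mid w) &= \prod_{n=1}^{N} (2\pi\sigma^2)^{-K/2}
  \exp\!\left( -\frac{\lVert t_n - y(x_n; w) \rVert^2}{2\sigma^2} \right) \\
-\ln p(D \mid w) &= \frac{1}{2\sigma^2} \sum_{n=1}^{N} \lVert t_n - y(x_n; w) \rVert^2
  + \frac{NK}{2} \ln\!\left( 2\pi\sigma^2 \right)
\end{align}
```

Because the second term does not depend on w, maximising the likelihood over the weights is the same as minimising the sum-of-squares error.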
In MCMC the integrations required by the Bayesian approach are approximated numerically using a sample of values drawn from the posterior distribution of the parameters. The samples are generated using a Markov chain that has the desired posterior distribution as its equilibrium distribution.

The networks handled have connections from a set of real-valued input units to each of zero or more layers of real-valued hidden units. Each hidden layer (except the last) has connections to the next hidden layer. The output layer has connections from the input layer and from the hidden layers. Any of the connection groups described above may be absent, which is the same as their weights all being zero. The hidden units use the 'tanh' activation function. Nominally, the output units are real-valued and use a linear activation function, but discrete outputs and non-linearities may be obtained in effect with some data models. Each hidden and output unit has a "bias" that is added to its other inputs before the activation function is applied. Each input and hidden unit has an "offset" that is added to its output after the activation function is applied (or just to the specified input value, for input units).

Markov Chain Monte Carlo

To the Bayesian inference framework just defined we add the use of neural network models. The best single-valued prediction, the one that minimizes the squared error, is the mean of the predictive distribution, which we can define as the mean of the network outputs corresponding to the conditional distribution. In MCMC we do not try to express the posterior in a direct way. The iterative method starts from an initial state vector and generates a new random state vector from a probability distribution conditioned on it; the next state is then obtained by sampling conditioned on the new one, and so forth. The transition probability q is constructed in such a way that an ergodic Markov process is defined with a stationary distribution equal to the desired posterior distribution.
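A chain with the required stationary distribution can be built in several ways; one common choice is a random-walk Metropolis sampler. The sketch below is a minimal illustration, assuming a toy one-dimensional Gaussian as a stand-in for the posterior (a real application would sample the posterior over network weights). The function names, the step size and the burn-in length are all hypothetical choices made for this example.

```python
import math
import random


def log_target(theta):
    """Log density (up to a constant) of a toy posterior: Normal(mean=2, sd=1)."""
    return -0.5 * (theta - 2.0) ** 2


def metropolis(log_density, start, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis: propose a candidate, then accept or reject it."""
    rng = random.Random(seed)
    state = start
    log_p = log_density(state)
    chain = []
    for _ in range(n_steps):
        # Candidate state drawn from a symmetric Gaussian proposal.
        candidate = state + rng.gauss(0.0, step_size)
        log_p_candidate = log_density(candidate)
        # Accept with probability min(1, p(candidate) / p(state)).
        accept_prob = math.exp(min(0.0, log_p_candidate - log_p))
        if rng.random() < accept_prob:
            state, log_p = candidate, log_p_candidate
        # On rejection the old state is repeated in the chain.
        chain.append(state)
    return chain


chain = metropolis(log_target, start=0.0, n_steps=20000)
burn_in = 2000  # discard early states that still depend on the start value
samples = chain[burn_in:]
posterior_mean = sum(samples) / len(samples)
print(f"estimated posterior mean: {posterior_mean:.2f}")
```

Averaging over the retained samples approximates the posterior expectation, which is how the mean of the predictive distribution can be estimated without writing the posterior down in closed form.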
The Metropolis algorithm generates the sequence mentioned above by producing each new state from the previous one in two steps: generating a "candidate state" from a proposal distribution, and then accepting or rejecting it based on its probability density relative to that of the old state with respect to the desired invariant distribution Q. If the candidate is accepted, it becomes the next state in the chain. If it is not accepted, the next state is the same as the old one. Bayesian learning takes a prior distribution over the parameters, combines it with information from a training set, and then integrates over the posterior to obtain the desired forecast. Important features are that the results do not overfit the data, that the prediction accuracy can be improved, and that prediction intervals can be estimated.

Conclusion

The purpose of this paper is to show how this problem can be solved using a full covariance matrix with a Bayesian treatment and Markov Chain Monte Carlo (MCMC) methods. An approach to discovering sequence patterns characteristic of ligand classes is described and applied to aminergic G protein-coupled receptors (GPCRs). Putative ligand-binding residue positions were inferred by considering three lines of evidence: conservation in the subfamily that is absent or underrepresented in the superfamily, any available mutation data, and the physicochemical properties of the ligand. GPCR functions such as clustering, signaling, compartmentalization, optimization of transduction, and the switching of signaling by the presence or absence of different components of the complex are used for computing modules. A minimally defined motif discovered in this fashion is an appropriate computational tool for identifying additional, potentially novel aminergic GPCRs from a set of experimentally uncharacterized "orphan" GPCRs, complementing existing sequence matching and clustering with machine-learning techniques.

References

[1] Box, G. E. P. & Tiao, G. C. (1973). Bayesian inference in statistical analysis. John Wiley and Sons, Inc.
[2] Duane, S., Kennedy, A. D., Pendleton, B. J. & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters.
[3] Spiegelhalter, D. J., Best, N. G. & Carlin, B. P. (1998). Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Tech. Rep. 98-009, Division of Biostatistics, University of Minnesota.
[4] Williams, P. M. (1996). Using neural networks to model conditional multivariate densities. Neural Computation.
[5] Oliveira, L., Paiva, A. C. M., Sander, C. & Vriend, G. A model for G protein interaction in G protein coupled receptors.
[6] Horn, F. & Vriend, G. G protein-coupled receptors in silico. BIOcomputing, European Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany.
[7] GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res 1998 Jan 1;26(1):275-279.