Membrane Protein Arrays: Application for GPCR Screening

G-Protein-Coupled Receptor Motif Sequence Analysis: The Machine
Learning Approach
Durgaprasad Bollina, Biovoice Knowledge Center, Hyderabad
email: durgaprasad_b@yahoo.com
Rajasekhara Reddy B, Asst. Prof., Dept. of ECE, Vasavi College of Engineering, Hyderabad
email: rajbhima@rediffmail.com
Abstract
G-protein-coupled receptors (GPCRs) form a large superfamily of proteins that transduce signals
across the cell membrane. At the external side they receive a ligand (a photon in the case of
opsins), and at the cytosolic side they activate a G protein. GPCRs consist of one single protein
chain that crosses the membrane seven times. Most ligands bind between the membrane helices,
but the periplasmic loops are sometimes also involved in ligand recognition. The second and third
cytosolic loops and part of the (cytosolic) C-terminal end of the receptors are involved in
G-protein recognition. G-protein activation leads to activation of only a limited set of secondary
messengers. An approach to discovering sequence motifs characteristic of ligand classes is
described and applied to G protein-coupled receptors (GPCRs) and their agonists, using Bayesian
neural networks with correlating residuals, a machine learning technique.
Introduction
In 1971, Martin Rodbell conceived the idea that a guanine-nucleotide regulatory protein
functionally connects receptors with effectors in the context of hormonal (glucagon) stimulation
of the adenylyl cyclase system, generating the second messenger cyclic AMP.
Since GPCRs play an important role in pathophysiology and are relatively easy to express, they
are amenable targets for drug development. This large family allows for a systemized approach
for isolation. Historically, GPCRs have been excellent molecular targets for drug development
and it is expected that orphan receptors (GPCRs with unknown ligands) will offer similar
potential. Indeed, GPCRs are the target of more than 50% of the current therapeutic agents on the
market.
Hundreds of GPCRs exist, and they can be activated by a multitude of agonists. Five steps can be
observed in the process of GPCR activation:
• Signal generation by a photon or by ligand binding;
• Signal transduction through the membrane;
• Binding to the G protein;
• Activation of the G protein;
• Activation of the second messenger.
Most experimental data available to date relate to either ligand binding or G protein interaction.
It was shown that these two functions involve distinctly different sets of residues.
One of the central paradigms in protein research is that the sequence determines the structure,
and the structure the function. Many G protein-coupled receptor sequences have been
determined, and much functional data is available. Structural data, however, is not available, and
modelling studies have not yet yielded adequately accurate models to allow for the inference of
function. It would therefore be useful if in the sequence -> structure -> function pathway the
structure could be skipped, or in other words if functional data could be abstracted directly from
the sequence without the need for structure determination or modelling studies.
Although these phases clearly differ in the kind of processes taking place, they are not
discrete and independent. For example, allostery between ligand binding and G protein binding
has been observed for several GPCRs, as well as cation-dependent allosteric regulation of
agonist and antagonist binding. Many residues involved in this allosteric effect have been
mutated. All mutations led to changes in ligand and/or G protein binding. Mutation of D-224 to
asparagine abolished the cation effect on the allosteric site and weakened the agonist binding to
GPCRs without affecting antagonist binding or G protein activation. In the case of the
rhodopsin-transducin interaction, a synergistic competitive mechanism has been uncovered for
peptides with the same sequence as these regions.
The GPCR family of proteins has traditionally provided the pharmaceutical industry with a rich
source of targets for drug discovery. The recent sequencing of the human genome has led to the
compilation of the complete catalog of human GPCRs. Current GPCR research focuses on
(1) determining which members of this family represent opportunities for therapeutic
intervention and (2) how to efficiently identify small molecule modulators of these targets for
drug development. GPCRomics is defined as the application of a wide range of technologies to
this research effort.
GPCRomics
G protein-coupled receptors (GPCRs) are a large protein family whose members have been used as
drug targets. The typical structure of these receptors has an extracellular receptor domain (ECD)
at the N terminus, a seven-transmembrane domain (7TMD) in the middle and an effector domain at
the C terminus. After the ECD recognises and reacts with signals carried by biogenic amines,
amino acids, ions or lipids, the 7TMD transfers a conformational change into the cell, to the
effector domain. Downstream, these changes affect gene expression and finally provoke biological
responses. Regulation of GPCRs involves trafficking and fusion between the GPCR and endocytic
vesicles. Endocytosis and recycling of the endocytic vesicles also affect GPCRs.
GPCRinformatics
The GPCRDB (information system for G protein-coupled receptors) is a molecular class-specific
information system that collects, combines, validates and disseminates heterogeneous data on G
protein-coupled receptors. The database stores data on sequences, ligand binding constants and
mutations. The system also provides computationally derived data such as sequence alignments,
homology models, and a series of query and visualization tools.
A computational method is presented that determines which residues are important for these
functions. The method uses correlation analysis to recognise residue patterns that correspond
with functions. The basic idea is very simple: if, for example, a residue is conserved in all
proteins that bind to the same agonist, then it could be involved in this binding.
This sequence pattern correlation technique was used to find the residues that determine the
optimal wavelength of the photon absorbed by the retinal in opsins. The technique was also
useful in the analysis of residues which play a role in ligand binding in several classes of
biogenic amine receptors. An important aim of this work is to automatically detect residues that
may be interesting targets for future mutagenesis studies.
Need for Computational Methods
An approach to discovering sequence patterns characteristic of ligand classes is described and
applied to aminergic G protein-coupled receptors (GPCRs). Putative ligand-binding residue
positions were inferred from considering three lines of evidence: conservation in the subfamily
absent or underrepresented in the superfamily, any available mutation data, and the
physicochemical properties of the ligand. A minimally defined motif discovered in this fashion is
an appropriate computational tool for identifying additional, potentially novel aminergic GPCRs.
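The first line of evidence, conservation in the subfamily that is absent or underrepresented in the superfamily, can be sketched as follows. This is a minimal illustration only: the function names, the thresholds `min_sub` and `max_super`, and the toy alignment rows are assumptions made for the example and are not taken from the method described here.

```python
from collections import Counter

def column_conservation(sequences, position):
    """Return (most common residue, its fraction) at one alignment column."""
    column = [seq[position] for seq in sequences if seq[position] != "-"]
    if not column:
        return None, 0.0
    residue, count = Counter(column).most_common(1)[0]
    return residue, count / len(column)

def subfamily_specific_positions(subfamily, superfamily, min_sub=0.9, max_super=0.5):
    """Positions conserved in the subfamily but under-represented in the superfamily."""
    hits = []
    for pos in range(len(subfamily[0])):
        residue, frac_sub = column_conservation(subfamily, pos)
        if residue is None or frac_sub < min_sub:
            continue  # not conserved enough within the subfamily
        col = [seq[pos] for seq in superfamily if seq[pos] != "-"]
        frac_super = col.count(residue) / len(col) if col else 0.0
        if frac_super <= max_super:
            hits.append((pos, residue, frac_sub, frac_super))
    return hits

# Toy, made-up alignment rows (hypothetical, not real GPCR sequences).
subfamily = ["ADKLS", "ADKMS", "ADKLT"]
superfamily = subfamily + ["AEKLS", "ANRLS", "AQKIT", "AEHLS"]
hits = subfamily_specific_positions(subfamily, superfamily)
print(hits)
```

In this toy case only column 1 (0-based) is reported: D is fully conserved in the subfamily but present in fewer than half of the superfamily rows. Mutation data and ligand physicochemistry, the other two lines of evidence, are not modelled here.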
Pros and Cons for the Current Protein Classification Tools
Due to the differences in their underlying techniques and also in their focuses (e.g., family
coverage), each method (and database) has different strengths and weaknesses. In order to take
maximum advantage of these various information sources, it is usually necessary to conduct
multiple pattern/profile searches. Integrated databases, e.g., InterPro
(http://www.ebi.ac.uk/interpro/) and MetaFam (http://metafam.ahc.umn.edu/), were developed to
facilitate such tedious procedures. One of the problems inherent in all of these pattern/profile
search methods and databases is that their patterns are in general derived from relatively short
regions. This is particularly the case for the PROSITE patterns (expressed as regular expressions).
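To make the remark about regular expressions concrete, the sketch below translates a PROSITE-style pattern into a Python regular expression and searches a sequence with it. It covers only the common syntax elements (`x`, `[..]`, `{..}`, `(n)`, `(n,m)`, `<`, `>`) and is a simplification. The pattern and the sequence used here are made up for illustration and are not an actual PROSITE entry.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")          # database patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by '-'
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repetition counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)

# Hypothetical pattern: D or E, then R, then Y/F/H, two arbitrary residues, not P.
regex = prosite_to_regex("[DE]-R-[YFH]-x(2)-{P}.")
match = re.search(regex, "AAADRYLLAAA")
print(regex, match.start(), match.group())
```

Because such a pattern spans only a handful of residues, it illustrates the limitation noted above: a short region can match by chance in unrelated proteins.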
Problem
Correlated mutation analysis (CMA for short) is a powerful technique to determine 'important'
residues if you have a multiple sequence alignment available.
Seq  Arb  Cons  n/a  Open  Elong  Nseq  Var  Residues
 64    0  Q     Q    3.05   0.10    27   90  QQQQQQQQQQQQQQEEEEEQEEQQQEE
 65    0  H     H    3.11   0.20    27  100  HHHHHHHHHHHHHHHHHHHHHHHHHHH
 66    0  K     K    3.12   0.20    27  100  KKKKKKKKKKKKKKKKKKKKKKKKKKK
 67    0  K     K    3.15   0.21    27  100  KKKKKKKKKKKKKKKKKKKKKKKKKKK
 68    0  L     L    3.12   0.20    27  100  LLLLLLLLLLLLLLLLLLLLLLLLLLL
 69    0  R     R    3.11   0.17    27  100  RRRRRRRRRRRRRRRRRRRRRRRRRRR
 70    0  T     T    3.03   0.10    27   93  TTTTTTTTTSTTTTTTTTTTTTQTTTT

136  341  Y     Y    8.00   2.00    27   80  YYYYYYYYYYYYYYWWWWWYWWYYYWW
137    0  V     V    2.94   0.05    25   70  --VVVVVVIIVVIIMVMVVVMLMIILV
138    0  V     V    3.11   0.14    27  100  VVVVVVVVVVVVVVVVVVVVVVVVVVV
139    0  V     V    3.04   0.10    27   90  VVIVVVVVVVVVVVVVVVVIVVVIIVV
140    0  C     C    3.15   0.21    27  100  CCCCCCCCCCCCCCCCCCCCCCCCCCC
141    0  K     K    3.12   0.14    27  100  KKKKKKKKKKKKKKKKKKKKKKKKKKK
142    0  P     P    3.05   0.11    27  100  PPPPPPPPPPPPPPPPPPPPPPPPPPP

273  626  F     F    8.00   2.00    27   80  FFFFFFFMFFFFFFWWWWWFWWFFFWW
274  627  Y     Y    3.06   0.15    27   96  YYYYYYYYYYYYYYYYYYWYYYYYYYY
275  628  I     I    3.11   0.11    27  100  IIIIIIIIIIIIIIIIIIIIIIIIIII
276    0  F     F    3.07   0.11    27  100  FFFFFFFFFFFFFFFFFFFFFFFFFFF
277    0  T     T    2.91   0.07    27   87  TTTTTTTTTSTSTTTTTTTTTTITTST
278    0  H     H    2.86   0.05    27   74  HHHHHHHHHNNNHHHHHNNHHHNHHNH
279    0  Q     Q    3.02   0.15    27  100  QQQQQQQQQQQQQQQQQQQQQQQQQQQ

The columns mean (left to right):
Seq       This is simply the sequential number in the alignment. This number has no
          value whatsoever and is only needed if you want to compare the profile
          with the alignment.
Arb       The so-called arbitrary sequence number (the same position in any GPCR
          alignment always gets this same number, or in other words every Arg in
          every DRY motif, for example, always gets number 340).
Cons      The consensus sequence at this position in the alignment.
n/a       This column is of no value (yet).
Open      The gap open penalty used at this position in the alignment.
Elong     The gap elongation penalty used at this position in the alignment.
Nseq      The number of sequences in the alignment that have a residue at this
          position.
Var       The variability at this position in the alignment. 100 means totally
          conserved. Below 40, things are doubtful.
Residues  The residues at this position in the 27 aligned sequences (in the last
          row, the residues at position 279).
Things get clearer when we put the three highlighted positions (64, 136 and 273) directly
underneath each other:
                                             123456789012345678901234567890
 64    0  Q     Q    3.05   0.10    27   90  QQQQQQQQQQQQQQEEEEEQEEQQQEE
136  341  Y     Y    8.00   2.00    27   80  YYYYYYYYYYYYYYWWWWWYWWYYYWW
273  626  F     F    8.00   2.00    27   80  FFFFFFFMFFFFFFWWWWWFWWFFFWW
Now you see why we call this correlated. Let's forget the M at position 626 in sequence 8 and the
entire contribution of sequence 20. You then see that these three sequence positions are not
conserved, but if one position is different between any two sequences, all three are different, and
on top of that, the changes going from one sequence to the other are always between the same
residue types.
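The observation can be checked by tallying which residue combinations occur across the 27 sequences at the three positions. The sketch below uses the three residue strings exactly as listed for alignment positions 64, 136 and 273; the function name and the `skip` parameter are hypothetical helpers introduced for the illustration.

```python
from collections import Counter

# Residue strings as listed for alignment positions 64, 136 and 273.
pos_64 = "QQQQQQQQQQQQQQEEEEEQEEQQQEE"
pos_136 = "YYYYYYYYYYYYYYWWWWWYWWYYYWW"
pos_273 = "FFFFFFFMFFFFFFWWWWWFWWFFFWW"

def residue_combinations(*columns, skip=()):
    """Count the residue combinations seen per sequence (skip uses 1-based numbers)."""
    combos = Counter()
    for number, residues in enumerate(zip(*columns), start=1):
        if number not in skip:
            combos[residues] += 1
    return combos

all_combos = residue_combinations(pos_64, pos_136, pos_273)
filtered = residue_combinations(pos_64, pos_136, pos_273, skip={8, 20})
print(all_combos)
print(filtered)
```

With sequences 8 and 20 left out, only two combinations remain, (Q, Y, F) and (E, W, W), which is the all-or-nothing switching described above.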
Conserved residues are more important than CMA residues. CMA results are calculated fully
automatically, and sometimes something can go wrong, so check the results to see whether there
really is a correlation. If, for example, 15 out of 16 sequences are conserved and the 16th is
different, the CMA score will be reasonably high. This is of course an artefact. If there are no
good correlations, the program still colours the best ones. So there will always be a result, no
matter how meaningless.
Iterative unbiased profile alignment of sequences
We use a profile-based alignment method similar to that described by Sander and Schneider (1991).
All biases that could be introduced by the use of a scoring matrix are removed by making
profiles relate directly to the frequency of occurrence of residue types at each position in the
protein. Sequences are aligned against this profile, rather than against all other sequences.
This now creates a problem: in order to do the alignment, we need a profile, and in order to get a
profile, we need an alignment. To solve this problem a multiple sequence alignment is made on a
subset of the sequences that show high pairwise homologies. From this initial (very easy to
perform) alignment a profile is created. This profile is now used to align all sequences of interest.
The aligned sequences are sorted according to their similarity to the profile, and a new profile is
made from the highest scoring sequences. This process is repeated until all sequences are
incorporated. Of course in every iteration more sequences are incorporated than in the previous
one. Sequences that show too little similarity with the consensus sequence of the profile are not
used because there is no guarantee that they belong in the same structural family. After
incorporating all sequences, the alignment procedure is iterated a few more steps. In practice, this
iterative alignment method has been shown to converge normally to a satisfactory solution for
multiple sequence alignments in fewer than 5 cycles, provided enough sequences are available
(typically 20-30 sequences are enough).
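The profile-building loop can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure of Sander and Schneider: it assumes the sequences are already placed in a common frame (equal length, gaps as `-`), so the actual gap placement step is omitted, and the names `build_profile`, `profile_score`, `iterative_profile`, `batch` and `min_score` are hypothetical.

```python
from collections import Counter

def build_profile(aligned):
    """Per-column residue frequencies of equal-length aligned sequences."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

def profile_score(sequence, profile):
    """Mean frequency of the sequence's residues under the profile."""
    return sum(col.get(res, 0.0) for res, col in zip(sequence, profile)) / len(profile)

def iterative_profile(seed, candidates, batch=2, min_score=0.3):
    """Grow the profile by repeatedly adding the best-scoring candidates."""
    members, remaining = list(seed), list(candidates)
    while remaining:
        profile = build_profile(members)
        remaining.sort(key=lambda s: profile_score(s, profile), reverse=True)
        accepted = [s for s in remaining[:batch] if profile_score(s, profile) >= min_score]
        if not accepted:
            break  # what is left is too dissimilar to the profile
        members.extend(accepted)
        remaining = remaining[len(accepted):]
    return members, remaining

# Toy, made-up sequences (hypothetical data).
seed = ["ACDEF", "ACDEY"]
candidates = ["ACDQF", "WWWWW", "SCDEF"]
members, rejected = iterative_profile(seed, candidates)
print(members, rejected)
```

The profile here holds raw residue frequencies only, in the spirit of avoiding a scoring matrix; the sequence that never reaches the similarity threshold is left out, as described above.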
Analysis of correlated mutations
Correlated mutational behaviour can be defined as the tendency of residues to stay conserved or
to mutate in tandem between (sets of) sequences. Several techniques have been described to
analyse correlated mutations (e.g. Goebel et al., 1993). They found that the correlation coefficient
is related to the chance that the residue pair is in contact in the structure. In this method
completely conserved residues give rise to a high correlation, indicating their importance for
structural integrity. This procedure works well if structure analysis or structure prediction is the
primary aim. To determine functional correlations, however, a slightly different approach is
required (Casari et al., 1994; Oliveira et al., 1993). Oliveira et al. (1993) used an approach
similar to that of Goebel et al. (1993) to determine the correlation between residue positions in
GPCRs, but always required a certain degree of variability at the positions that are analysed for
correlating behaviour. Figure 2 shows an example of correlated mutational behaviour.
Example of correlated mutations.
Sequence   Seq. #
position       5   10
               |    |
1          AAAASSSSTTTT             Positions 2 and 3 are compared with this one
2          RRRRPPPPHHHH  66  1.00   All residues correlate perfectly with position 1
3          TTTTGGGGEEED  63  0.95   One residue is not correlated with position 1
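One simple way to quantify the example is to count, over all pairs of sequences, how often two positions agree on whether the residue is unchanged or changed. The sketch below is an assumption about how such a score could be computed; it happens to reproduce the numbers in the example (66 and 1.00, 63 and 0.95), but it is not claimed to be the formula used by Goebel et al. or Oliveira et al.

```python
from itertools import combinations

def pairwise_agreement(pos_a, pos_b):
    """Count sequence pairs in which both positions change, or both stay the same."""
    pairs = list(combinations(range(len(pos_a)), 2))
    agree = sum(
        (pos_a[i] == pos_a[j]) == (pos_b[i] == pos_b[j])
        for i, j in pairs
    )
    return agree, len(pairs), agree / len(pairs)

position_1 = "AAAASSSSTTTT"
position_2 = "RRRRPPPPHHHH"
position_3 = "TTTTGGGGEEED"

print(pairwise_agreement(position_1, position_2))
print(pairwise_agreement(position_1, position_3))
```

Note that two fully conserved positions would also score 1.0 under this measure, which is why a minimum degree of variability has to be required separately when functional correlations are sought.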
Bayesian Neural Networks with Correlating Residuals for motif analysis
Learning in biological systems involves adjustments to the synaptic connections that exist
between the neurons. This is true of ANNs as well. Learning typically occurs by example
through training, that is, exposure to a truthed set of input/output data where the training
algorithm iteratively adjusts the connection weights (synapses). These connection weights store
the knowledge necessary to solve specific problems. Here the multilayer perceptron, which is
generally trained with the backpropagation-of-error algorithm, is deployed with supervised
training for motif analysis using a three-layer approach.
The use of a constant diagonal covariance with the maximum likelihood approach corresponds to
the minimization of the sum-of-squares error. A convenient conjugate prior allows us to define
the networks easily. In MCMC the integrations required by the Bayesian approach are
approximated numerically using a sample of values drawn from the posterior distribution of the
parameters. In MCMC, samples are generated using a Markov chain that has the desired posterior
distribution as its equilibrium distribution.
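The correspondence between a constant diagonal covariance and the sum-of-squares error can be made explicit. The notation below is an assumption introduced for this illustration: f(x_n, w) is the network output for input x_n with weights w, y_n is the K-dimensional target, N is the number of training cases, and the noise covariance is sigma^2 I.

```latex
\begin{align}
p(D \mid w) &= \prod_{n=1}^{N} \mathcal{N}\!\left(y_n \mid f(x_n, w),\, \sigma^2 I\right) \\
-\ln p(D \mid w) &= \frac{1}{2\sigma^2} \sum_{n=1}^{N} \lVert y_n - f(x_n, w) \rVert^2
                    + \frac{NK}{2} \ln\!\left(2\pi\sigma^2\right)
\end{align}
```

Since the second term does not depend on w, maximizing the likelihood over w is the same as minimizing the sum-of-squares error, i.e. the sum over n of ||y_n - f(x_n, w)||^2.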
The networks handled have connections from a set of real-valued input units to each of zero or
more layers of real-valued hidden units. Each hidden layer (except the last) has connections to
the next hidden layer. The output layer has connections from the input layer and from the hidden
layers. Any of these connection groups may be absent, which is the same as their
weights all being zero. The hidden units use the 'tanh' activation function. Nominally, the output
units are real-valued and use a linear activation function, but discrete outputs and non-linearities
may be obtained in effect with some data models. Each hidden and output unit has a
"bias" that is added to its other inputs before the activation function is applied. Each input and
hidden unit has an "offset" that is added to its output after the activation function is applied (or
just to the specified input value, for input units).
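A forward pass through such a network with a single hidden layer might look as follows. This is a minimal sketch under stated assumptions: the dictionary layout and key names (`w_in_hid`, `w_hid_out`, `w_in_out`, and so on) are hypothetical and chosen only for the example, and only one hidden layer is handled.

```python
import math

def forward(x, net):
    """One hidden layer: tanh hidden units, linear outputs, input-to-output connections."""
    # Offsets are added to the input values themselves.
    inputs = [xi + off for xi, off in zip(x, net["input_offset"])]
    hidden = []
    for j, bias in enumerate(net["hidden_bias"]):
        # The bias is added before the activation function is applied.
        total = bias + sum(v * net["w_in_hid"][i][j] for i, v in enumerate(inputs))
        # The offset is added after the activation function is applied.
        hidden.append(math.tanh(total) + net["hidden_offset"][j])
    outputs = []
    for k, bias in enumerate(net["output_bias"]):
        total = bias
        total += sum(h * net["w_hid_out"][j][k] for j, h in enumerate(hidden))
        total += sum(v * net["w_in_out"][i][k] for i, v in enumerate(inputs))
        outputs.append(total)  # linear output unit
    return outputs

# Hypothetical 2-input, 2-hidden, 1-output network.
net = {
    "input_offset": [0.0, 0.0],
    "w_in_hid": [[1.0, 0.0], [0.0, 1.0]],
    "hidden_bias": [0.0, 0.0],
    "hidden_offset": [0.0, 0.0],
    "w_hid_out": [[1.0], [1.0]],
    "w_in_out": [[0.0], [0.0]],  # all zero: equivalent to an absent connection group
    "output_bias": [0.5],
}
print(forward([0.0, 0.0], net))
```

Setting a whole weight group to zero, as for `w_in_out` here, has the same effect as leaving that connection group out of the architecture.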
Markov Chain Monte Carlo
To the Bayesian inference framework just defined we add the use of neural network models. The
best single-valued prediction, the one that minimizes the squared error, is the mean of the
predictive distribution, which we can define as the mean of the network outputs corresponding to
the conditional distribution. In MCMC we do not try to express the posterior in a direct way.
The iterative method starts from a state vector x(0) and generates a new random state vector
x(1) from a probability distribution q(x(1) | x(0)); then x(2) is obtained by sampling from
q(x(2) | x(1)), and so forth. The transition probability q is constructed in such a way that an
ergodic Markov process is defined with stationary distribution equal to the desired posterior
distribution. Basically, the Metropolis algorithm generates the sequence mentioned above by
generating x(t+1) from x(t) as follows: a "candidate state" x* is generated from a proposal
distribution and is then accepted or rejected based on its probability density relative to that of
the old state, with respect to the desired invariant distribution Q. If x* is accepted, it becomes
the next state in the chain. If x* is not accepted, the next state is the same as the old one.
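The accept/reject step can be illustrated with a random-walk Metropolis sampler for a one-dimensional target. This is a minimal sketch: the Gaussian proposal, the step size and the standard normal target are assumptions chosen for the example, whereas in the setting of this paper the state would be the vector of network parameters and Q their posterior distribution.

```python
import math
import random

def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalised log density."""
    rng = random.Random(seed)
    state, log_p = start, log_density(start)
    chain = []
    for _ in range(n_samples):
        # Candidate state from a symmetric Gaussian proposal around the old state.
        candidate = state + rng.gauss(0.0, step)
        log_p_candidate = log_density(candidate)
        diff = log_p_candidate - log_p
        # Accept with probability min(1, Q(candidate) / Q(state)).
        if diff >= 0 or rng.random() < math.exp(diff):
            state, log_p = candidate, log_p_candidate
        # If rejected, the new state is the same as the old one.
        chain.append(state)
    return chain

# Example target: a standard normal distribution (log density up to a constant).
chain = metropolis(lambda x: -0.5 * x * x, start=0.0, n_samples=20000)
mean = sum(chain) / len(chain)
variance = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, variance)
```

Only ratios of densities enter the acceptance test, so the normalising constant of the posterior never has to be computed.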
Bayesian learning takes a prior distribution over the parameters, combines it with information
from a training set, and then integrates over the posterior to obtain the desired forecast.
Important features are that the results do not overfit the data, the prediction accuracy can be
improved, and the prediction intervals can be estimated.
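Given network outputs computed for a set of posterior parameter samples, the forecast and its interval reduce to simple summaries of those outputs. The sketch below assumes the sampled predictions are already available as a list of numbers; the function name, the `coverage` parameter and the toy values are hypothetical.

```python
def predictive_summary(sampled_predictions, coverage=0.9):
    """Mean prediction and a central interval from per-sample network outputs."""
    ordered = sorted(sampled_predictions)
    n = len(ordered)
    mean = sum(ordered) / n
    tail = (1.0 - coverage) / 2.0
    lower = ordered[int(round(tail * (n - 1)))]
    upper = ordered[int(round((1.0 - tail) * (n - 1)))]
    return mean, (lower, upper)

# Toy stand-in for network outputs under 101 posterior samples (made-up values).
predictions = [float(i) for i in range(101)]
mean, interval = predictive_summary(predictions)
print(mean, interval)
```

The mean is the single-valued prediction that minimizes the expected squared error, and the interval conveys how much the sampled networks disagree.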
Conclusion
The purpose of this paper is to show how this problem can be solved using a full covariance
matrix with a Bayesian treatment and Markov chain Monte Carlo (MCMC) methods. An approach to
discovering sequence patterns characteristic of ligand classes is described and applied to
aminergic G protein-coupled receptors (GPCRs). Putative ligand-binding residue positions were
inferred from considering three lines of evidence: conservation in the subfamily absent or
underrepresented in the superfamily, any available mutation data, and the physicochemical
properties of the ligand. GPCR functions such as clustering, signaling, compartmentalization and
optimization of transduction, as well as the switching of signaling by the presence or absence of
different components of the complex, are used for computing modules.
A minimally defined motif discovered in this fashion is an appropriate computational tool for
identifying additional, potentially novel aminergic GPCRs from a set of experimentally
uncharacterized "orphan" GPCRs, complementing existing sequence matching, clustering, with
machine-learning techniques.
References
[1] Box, G. E. P. & Tiao, G. C. (1973). Bayesian inference in statistical analysis. John Wiley and
Sons, Inc.
[2] Duane, S., Kennedy, A. D., Pendleton, B. J. & Roweth, D. (1987). Hybrid Monte Carlo.
Physics Letters.
[3] Spiegelhalter, D. J., Best, N. G. & Carlin, B. P. (1998). Bayesian deviance, the effective
number of parameters, and the comparison of arbitrarily complex models. Tech. Rep. 98-009,
Division of Biostatistics, University of Minnesota.
[4] Williams, P. M. (1996). Using neural networks to model conditional variate densities. Neural
Computation.
[5] Oliveira, L., Paiva, A. C. M., Sander, C. & Vriend, G. A model for G protein interaction in
G protein coupled receptors.
[6] Horn, F. & Vriend, G. G protein-coupled receptors in silico. BIOcomputing, European
Molecular Biology Laboratory, Meyerhofstraße 1, 69117 Heidelberg, Germany.
[7] GPCRDB: an information system for G protein-coupled receptors.
Nucleic Acids Res. 1998 Jan 1;26(1):275-279.