Inferring Property Selection Pressure from Positional
Residue Conservation
Rose Hoberman, Yong Lu, Judith Klein-Seetharaman*, and Roni Rosenfeld
Carnegie Mellon University, *University of Pittsburgh Medical School, Pittsburgh, PA
Introduction
Amino acid conservation is strongly associated with functional
and structural significance, and thus can be used to identify
functionally or structurally important features, such as active sites
and protein-protein interaction domains. However, identifying
positional conservation is only a first step. Different positions have
different functional and structural roles, and thus the amino acids
in each position will be subject to different constraints. A more
difficult task is to understand the specific selective pressures that
have influenced the distribution of residues at each position.
Objectives
• Develop a method for systematically identifying conservation
of specific physical and chemical properties at each site in a
multiple sequence alignment.
• Design a test to determine whether the apparent conservation
of a particular property in a position is statistically significant.
Method
Previous Work
Existing methods use amino acid similarity
matrices, set-theoretic methods, or maximum-likelihood
(ML) optimization of only a small handful of
properties [1,2], and generally do not provide a
test of statistical significance.
Mapping Amino Acid Distributions into
Property Space
[Figure: property scales numbered 1 to 240, for example hydrophobicity, volume, net charge, transfer free energy, and average flexibility.]
We downloaded about 250 physical,
chemical, and structural
properties of amino acids from
the online database PDbase [3].
Each property scale assigns a
numerical value to each of the
20 amino acids, for example:
A     C     D     E     F     G     H     I     K     L
0.66  2.87  0.10  0.87  3.15  1.64  2.17  1.67  0.09  2.77
M     N     P     Q     R     S     T     V     W     Y
0.67  0.87  1.52  0.00  0.85  0.07  0.07  1.87  3.77  2.67
FAMLRIVM
LAMLRIVC
IAMLRIVM
P-EL-IVP
GAELRIVA
PGEIRIVS
L-EVYIVA
L-EVRIVM
I-MLKIVP
WAELRIVP
HAELYIVS
YAILYIVP
WAML-IVA
For each column in an alignment we
calculate the frequency of each amino acid at
that site. By mapping the amino acids to their
respective property values we can then
generate a histogram showing the
distribution of property values at this site. For
example, the first column of the alignment
above is plotted against two
different property scales.
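The mapping step can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the function names are hypothetical, the scale is the example scale listed in this section, and skipping gap characters when counting is an assumption.

```python
from collections import Counter

# Example property scale from this section (one value per amino acid).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SCALE_VALUES = [0.66, 2.87, 0.10, 0.87, 3.15, 1.64, 2.17, 1.67, 0.09, 2.77,
                0.67, 0.87, 1.52, 0.00, 0.85, 0.07, 0.07, 1.87, 3.77, 2.67]
SCALE = dict(zip(AMINO_ACIDS, SCALE_VALUES))

# Example alignment from this section.
ALIGNMENT = [
    "FAMLRIVM", "LAMLRIVC", "IAMLRIVM", "P-EL-IVP", "GAELRIVA",
    "PGEIRIVS", "L-EVYIVA", "L-EVRIVM", "I-MLKIVP", "WAELRIVP",
    "HAELYIVS", "YAILYIVP", "WAML-IVA",
]


def column_frequencies(alignment, col):
    """Relative frequency of each amino acid in one column.

    Assumption: gap characters ('-') are ignored.
    """
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def property_distribution(freqs, scale):
    """Sorted (property value, frequency) pairs for one column.

    Amino acids that share a property value are merged into one bar.
    """
    hist = {}
    for aa, f in freqs.items():
        hist[scale[aa]] = hist.get(scale[aa], 0.0) + f
    return sorted(hist.items())


freqs = column_frequencies(ALIGNMENT, 0)
for value, freq in property_distribution(freqs, SCALE):
    print(f"{value:5.2f}  {freq:.3f}")
```

Running the same column against a second scale only requires swapping the `SCALE` dictionary, which is how one column yields a different histogram for each property.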
Evaluating Gaussian Goodness-of-Fit
Assumption: We believe that conserved properties will
most often have a unimodal distribution, so we
use a Gaussian to model property conservation.
Method:
• Fit a maximum-likelihood Gaussian to amino acid
  frequencies in property space
• From the Gaussian, calculate expected amino acid
  frequencies
• Calculate the χ² goodness-of-fit to quantify the difference
  between expected and observed amino acid
  frequencies
  – Identifies unimodal distributions
  – Penalizes missing amino acids (“holes”)
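The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the poster's exact procedure: the function name `gaussian_chi2` is hypothetical, the Gaussian is discretized by evaluating its density at each amino acid's property value and renormalizing, and the variance is floored so that fully conserved columns do not divide by zero.

```python
import numpy as np


def gaussian_chi2(counts, values):
    """Chi-square statistic between observed counts and a fitted Gaussian.

    counts: observed count of each amino acid at one position
    values: property value of each amino acid, in the same order
    """
    counts = np.asarray(counts, dtype=float)
    values = np.asarray(values, dtype=float)
    n = counts.sum()
    freqs = counts / n

    # Maximum-likelihood mean and variance, weighting each value by its frequency.
    mu = np.sum(freqs * values)
    var = np.sum(freqs * (values - mu) ** 2)
    var = max(var, 1e-6)  # assumption: floor for fully conserved columns

    # Assumption: discretize by evaluating the density at each amino acid's
    # property value and renormalizing over the amino acids.
    density = np.exp(-((values - mu) ** 2) / (2.0 * var))
    expected = n * density / density.sum()
    expected = np.maximum(expected, 1e-12)  # avoid division by zero

    return float(np.sum((counts - expected) ** 2 / expected))


# Toy example with five residue types on an evenly spaced scale.
toy_values = [0.0, 1.0, 2.0, 3.0, 4.0]
unimodal = [1, 4, 6, 4, 1]   # bell-shaped counts
bimodal = [6, 1, 0, 1, 6]    # a "hole" in the middle of the scale
print(gaussian_chi2(unimodal, toy_values))
print(gaussian_chi2(bimodal, toy_values))
```

In the toy example the bimodal column, which has a hole exactly where the fitted Gaussian expects the most residues, scores far worse than the unimodal one.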
Statistical Significance
The probability of a property appearing conserved purely
by chance is high when the entropy is low (when
sequence conservation is high). Therefore we use a
Monte-Carlo method to estimate p-values for each
significant property. For each (position, property) pair:
1. Generate a large set of “random” (shuffled)
property scales
2. Calculate goodness-of-fit of each random scale
to a Gaussian
3. The distribution of these random χ² values allows
us to estimate the probability of getting a χ²
value more extreme than the observed value
merely by chance.
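A minimal sketch of this Monte-Carlo procedure is shown below. Several details are assumptions rather than statements from the poster: a scale is "shuffled" by randomly permuting which value is assigned to which amino acid, "more extreme" is read as a χ² at least as small as the observed one (a better fit), and the p-value uses the common add-one estimator. The goodness-of-fit helper `gaussian_chi2` is a hypothetical implementation included so the block runs on its own.

```python
import numpy as np


def gaussian_chi2(counts, values):
    """Chi-square fit of observed counts to a discretized ML Gaussian (sketch)."""
    counts = np.asarray(counts, dtype=float)
    values = np.asarray(values, dtype=float)
    n = counts.sum()
    freqs = counts / n
    mu = np.sum(freqs * values)
    var = max(np.sum(freqs * (values - mu) ** 2), 1e-6)
    density = np.exp(-((values - mu) ** 2) / (2.0 * var))
    expected = np.maximum(n * density / density.sum(), 1e-12)
    return float(np.sum((counts - expected) ** 2 / expected))


def shuffled_scale_p_value(counts, values, n_shuffles=10000, seed=0):
    """Estimate how often a shuffled scale fits at least as well as the real one."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    observed = gaussian_chi2(counts, values)
    hits = 0
    for _ in range(n_shuffles):
        # Randomly reassign the scale's values to the amino acids.
        shuffled = rng.permutation(values)
        if gaussian_chi2(counts, shuffled) <= observed:
            hits += 1
    # Add-one estimator, so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


toy_values = [0.0, 1.0, 2.0, 3.0, 4.0]
print(shuffled_scale_p_value([1, 4, 6, 4, 1], toy_values, n_shuffles=2000))
print(shuffled_scale_p_value([0, 0, 5, 0, 0], toy_values, n_shuffles=2000))
```

The second call illustrates why the test is needed: for a fully conserved column every shuffled scale fits equally well, so the estimated p-value is close to 1. With this estimator the smallest attainable p-value is 1/(n_shuffles + 1), so resolving very small thresholds requires a correspondingly large number of shuffles.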
We use a Bonferroni correction to account for testing
multiple properties. To keep the chance of a type I error
at no more than 5% per position, a Bonferroni
correction for multiple testing (of 240 properties) yields a
significance threshold of α = 0.05/240 ≈ 0.0002.
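The threshold arithmetic can be checked directly. The helper names below are hypothetical and only illustrate the calculation.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def is_significant(p_value, alpha=0.05, n_tests=240):
    """True if a p-value survives the Bonferroni correction."""
    return p_value <= bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(0.05, 240)
print(f"{threshold:.6f}")  # 0.000208
```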
Results
Summary of results for GPCRA, where significant
property conservation at a position is indicated by a dot,
and the color of the dot indicates the type of property
predicted to be conserved. Entropy and location of
transmembrane helices are also indicated. Reference
positions are taken from the Rhodopsin PDB structure.
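The per-position entropy used as the measure of sequence conservation can be sketched as below. The poster does not define its entropy, so this assumes Shannon entropy in bits over the amino acid frequencies of a column with gaps ignored. Under that assumption the maximum for 20 amino acids is log2(20), about 4.32. The function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the amino acids in one alignment column.

    Assumption: gap characters ('-') are ignored.
    """
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(column_entropy("IIIIIIIIIIIII"))         # fully conserved column
print(column_entropy("FLIPGPLLIWHYW"))         # variable column
print(column_entropy("ACDEFGHIKLMNPQRSTVWY"))  # all 20 amino acids once
```

Low entropy corresponds to high sequence conservation, and high entropy to low sequence conservation.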
Validation
We compared our predictions for the GPCRA family to
current knowledge about the structure, function, and
dynamics of GPCRs. We analyzed only those positions
with significant property conservation and low sequence
conservation (entropy of 3.5 to 4). Our method identifies
many instances of known property conservation:
• Charge conserved at 134
  – part of the D/E R Y motif, which is important for binding and
    activation of the G-protein
• Size conserved at 54, 80, 87, 123, 132, 153, 299
  – helix faces one or two other helices
• Dynamic properties conserved
  – especially in the third cytoplasmic loop
    • in Rhodopsin this is the most flexible interhelical loop
  – in TM at 215 and 269
    • close proximity to the retinal ligand; the ligand-binding
      pocket is the most rigid part of the crystal structure
We applied our method to the
G-Protein-Coupled Receptor Family A (GPCRA).
• GPCRA is a family of transmembrane proteins
  containing 7 transmembrane helices
• Respond to a variety of ligands; bind to G-proteins
• Diverse in sequence, but believed to share similar
  structure (only known structure is for Rhodopsin)
• Used PFAM alignment of GPCRA proteins, but
  analyzed only the 253 positions aligned to
  ungapped positions in Rhodopsin
• Used significance threshold of α = 0.025/240 ≈ 0.0001
[Figure: one alignment column plotted against two property scales, hydrophobicity (left) and radius of side chain (right).]
We expect conserved properties to have lower variance (indicated
by red arrows in the figure), so at this site hydrophobicity (left) appears
more conserved than the radius of side chain (right).
However, low variance doesn’t
always indicate a conserved
property. In the graph on the
left, for example, the missing
cysteine makes it unlikely that
this is the sole property that is
important at this site.
Sequence vs. Property Conservation
This graph compares locations of significant property
conservation (yellow lines) with sequence
conservation (indicated by entropy, black lines) in the
GPCR family A. In GPCRs the TM helices (red bars)
are much more highly conserved than the loops (blue
bars), and the entropy and property conservation both
illustrate this trend. However, property conservation
can also be found in many positions with very low
sequence conservation (high entropy).
Conclusions
• Proposed a method for systematically identifying
  conservation of specific physical-chemical
  properties at each site in a multiple sequence
  alignment.
• Our test identifies which properties are significantly
  conserved in each position in a protein family.
• We show that our automated method correctly
  identifies many instances of known property
  conservation in Rhodopsin.
• We make additional predictions regarding
  previously unknown sites of property conservation,
  providing guidance for future site-directed
  mutagenesis studies by experimental biologists.
Future Work
• Simple extension to a multivariate Gaussian to identify
  conservation of more than one property.
• Incorporate the method into a full phylogenetic model to
  model the effect of codon distances as well as
  reduce bias from unequal availability of subfamilies
  and non-independence of sequences.
For Additional Information Please Contact roseh@cs.cmu.edu
References
[1] Valdar, W. 2002. Scoring residue conservation.
Proteins, 48:227-41.
[2] Koshi, J, Mindell, D, and Goldstein, R. 1997.
Beyond mutation matrices: Physical-chemistry
based evolutionary models. Genome Inform Ser
Workshop Genome Inform, 7:80-89.
[3] PDbase: www.scsb.utmb.edu/comp_biol.html/venkat/prop.html
3.3 Gaussian Goodness-of-Fit
• Fit a maximum-likelihood Gaussian to amino acid
  frequencies in property space
• From the (discretized) Gaussian, calculate expected amino acid
  frequencies
• Calculate the χ² goodness-of-fit to the learned Gaussian
  – Identifies unimodal distributions
  – Penalizes missing amino acids (“holes”)
• Use a Monte-Carlo method to calculate statistical
  significance (p-value)
  – Otherwise there will be a high false discovery rate when entropy
    is low
• Estimating statistical significance
  – What is the probability of a property being conserved
    purely by chance?
  – Generate a large set of “random” (shuffled) property
    scales
  – Calculate goodness-of-fit to a Gaussian
  – p-value estimated from the distribution of the statistic
    under the “random” null hypothesis
However, the Bonferroni correction is overly conservative, especially since many of the properties we are testing are significantly correlated.