Jesse Walsh
BCB 569
December 18, 2009
Final Project
Improving PPI Networks with Correlated Gene Expression Data
Background:
Protein-protein interaction (PPI) networks attempt to model the vast number of interactions
between proteins in a cell. These networks are crucial to our understanding of problems such as
functional annotation, protein complex identification, and signaling cascade determination. As such, the
quality of the data is very important. In practice, many datasets, such as the Database of Interacting
Proteins (DIP) [3], consist of some interactions deduced from small-scale experiments and a large
number of them from high-throughput methods such as yeast two-hybrid. The quality of the small-scale
experiments is generally considered to be high due to the variety of experimental methods and close
expert curation of the data. It is well documented, however, that data obtained from high-throughput
methods tend to have a large number of false positives and false negatives. Additionally, there is commonly
little overlap between the results reported from different experiments. For example, two major
published yeast PPI experiments shared only ≈150 interactions out of thousands [1]. For this
reason, methods to evaluate and improve the quality of PPI data are needed.
Hypothesis:
One less commonly used method of assessing the quality of a set of high-throughput PPI data is
the EPR index proposed by Deane et al. [2]. This method extracts gene expression data under several
conditions for members of a PPI network, and then compares the density distribution of the expression
distances to the density distribution of a known high-quality set. I would like to propose using the EPR
index to assess the quality of the E. coli PPI data within DIP [3] using expression data from the Many
Microbe Microarray Database (M3D) [4].
Merit:
In my original proposal, I suggested using expression data to assess the quality of PPI
networks. After discovering the Deane et al. paper on the EPR index [2], I felt that this method
was a solid implementation of my original hypothesis. I believe this project still has merit because
I will be applying the EPR method to the E. coli dataset, while Deane et al. [2] applied it
to the yeast dataset. Also, their paper was based on PPI data published in 2001. There have
been significant contributions to PPI data since the EPR analysis was done, and as such the current
dataset can be assumed to contain more complete protein interaction data.
Overview of the Data:
Dataset          Genome Size       Interactions   Proteins
E. coli (2008)   4.6 million bp    6991           1863
Yeast (2001)     --                8063           4150
Yeast (2008)     12.5 million bp   18440          4943
Figure 1: Statistics for DIP (Database of Interacting Proteins)
It can be seen from figure 1 that DIP [3] contains ~7000 E. coli protein interactions among 1863 unique
proteins. The yeast (2001) data are provided to give a sense of the completeness of the E. coli dataset
compared to the set used by the authors of the EPR index. The latest yeast (2008) statistics are provided
to give a sense of how the size of the data has grown. Both of these points are meant to lend
confidence to the usefulness of performing an EPR analysis of the current E. coli data.
The expression data used in this project were retrieved from the M3D database [4]. These data
consist of a uniformly normalized Affymetrix compendium of E. coli expression data. There are 907
chips from 466 experiments available. For use with the EPR index, it is important to have enough
variance in the conditions of the expression data to obtain a reasonable sampling of gene
expression relationships. However, many of the conditions in the M3D data consist of knockouts,
gene perturbations, and foreign gene expression. To avoid confounding effects
from these artificial gene linkages, a smaller set of 20 experimental conditions was chosen.
While there is no mathematical basis for this number, the choice was influenced by the fact that the
authors of the EPR index used a set of 12 conditions for their work. These 20 conditions were all part of a
single publication by Allen et al. [5].
Method:
Equation 1: Expression distance between two genes [2]
Equation 1 was used to calculate the expression distance between genes in the M3D dataset.
The value e_i refers to the Affymetrix expression value for gene A or B in condition i. The reference (ref)
expression value is meant to be the control of the expression data. Since there was no explicit control
condition, the condition that most resembled a control was chosen: wild type, MOPS media,
5.55 mM glucose, log phase, time point 0. Averaging over
the available conditions to create a reference set was also considered, but did
not seem to offer any considerable advantage over the condition chosen.
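The distance calculation can be sketched in Python. This is a minimal sketch, not the code used for this project. It assumes the log-ratio form of the distance from Deane et al. [2], d^2_AB = sum over conditions i of (ln(e_A,i / e_A,ref) - ln(e_B,i / e_B,ref))^2, and it assumes positive, non-log-transformed expression values. The function name and argument layout are hypothetical.

```python
import math


def expression_distance(expr_a, expr_b, ref_a, ref_b):
    """Squared expression distance d^2_AB between genes A and B.

    expr_a, expr_b: expression values of each gene over the same conditions.
    ref_a, ref_b: expression value of each gene in the reference condition.
    Assumes the log-ratio form stated in the lead-in (an assumption).
    """
    if len(expr_a) != len(expr_b):
        raise ValueError("both genes need values for the same conditions")
    total = 0.0
    for e_a, e_b in zip(expr_a, expr_b):
        # Difference of the two log-ratios relative to the reference condition
        diff = math.log(e_a / ref_a) - math.log(e_b / ref_b)
        total += diff * diff
    return total
```

With this form, two genes whose expression changes by the same ratios relative to the reference have a distance of 0, and a gene paired with itself always has a distance of 0.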
Three PPI datasets were considered in this analysis. Interactions that have been verified by
small scale experiments or were identified in several experiments were collected into the CORE set.
Interactions that were only inferred from high-throughput methods were collected into the NON set.
For the RAND set, 100,000 protein pairs were randomly selected from the genes available in the M3D data.
This set has a low overlap with the CORE and NON sets, and can thus be considered a random
sample of primarily non-interacting pairs of proteins.
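The construction of a randomized set of this kind can be sketched as follows. The details are assumptions, since the sampling procedure is not specified beyond random selection: each pair here consists of two distinct genes, the same pair may be drawn more than once, and the function name and seed parameter are hypothetical.

```python
import random


def sample_random_pairs(genes, n_pairs, seed=0):
    """Draw n_pairs random gene pairs from a list of gene identifiers.

    Each pair holds two distinct genes (an assumption); repeated pairs
    across draws are allowed.
    """
    if len(genes) < 2:
        raise ValueError("need at least two genes to form a pair")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    pairs = []
    for _ in range(n_pairs):
        gene_a, gene_b = rng.sample(genes, 2)
        pairs.append((gene_a, gene_b))
    return pairs
```

Because the number of possible pairs grows with the square of the number of genes, a random sample like this is expected to contain few known interactions.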
Dataset                 Identifier   Number of Interactions   Mapping Available
Core Interactions       CORE         991                      220
Non-core Interactions   NON          5999                     903
Randomized Set          RAND         100,000                  100,000
Figure 2: Datasets used in analysis
In order to extract expression distances for the interactions in the PPI datasets (CORE, NON, and
RAND), the UniProt protein identifiers in the interaction data had to be mapped to the gene identifiers in
the expression data. This was done by running BLAST with the protein sequences against the translated E. coli
genome. This resulted in 220 interactions from the CORE set and 903 interactions from the NON set
mapping to expression values. The RAND set was selected from the gene list, and as such did not need
to be mapped. The low success rate of this mapping is addressed in the Discussion section.
Results:
Figure 3: Density distribution p(d^2_AB) of the CORE, NON, and RAND sets. Bin size = 0.05.
Figure 4: Results adapted from Deane et al. [2]. Bin size = 1.25.
Figure 4 shows the results from the Deane et al. [2] paper as a point of comparison. Results
from this analysis are shown in figure 3, which plots the density distribution of the expression distances
d^2_AB against the distances themselves. In this figure, the CORE set is considered representative of the distribution of
a highly accurate set of interactions (accurate in the sense of few false positives or false negatives). The
RAND set is considered representative of the distribution of non-interacting pairs. The NON set is
a set of high-throughput interactions. The CORE set is equivalent to the INT set of figure 4, the NON set is
equivalent to the Experimental set of figure 4, and the RAND set is equivalent to the RND1 set of figure
4. The key observation is that the NON set appears to be intermediate between the CORE and RAND
sets (this is easier to see in figure 4, where the Experimental set is intermediate between the INT and
RND1 sets). This suggests that the NON set can be modeled as a mixture of interactions drawn in some
proportion from the CORE and RAND distributions. Deane et al. proposed Equation 2,
where a linear least-squares fit yields a value for α_epr, which represents the fraction of NON
interactions that are true interactions.
Equation 2: Solving for the linear least squares yields α_epr, the fraction of true interactions in
ρ_exp. ρ_exp is the distance distribution for the experimental dataset (NON), ρ_i is the distance
distribution for the interacting dataset (CORE), and ρ_n is the distance distribution for the non-interacting dataset (RAND).
Regrettably, this value was not calculated for the data used in this analysis. Technical
difficulties prevented me from solving for α_epr before this project was due.
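For reference, one way such a fit could be carried out is sketched below. It is an illustration only and produces no result for this dataset. It assumes the mixture form ρ_exp = α_epr * ρ_i + (1 - α_epr) * ρ_n suggested by the description of Equation 2, with all three distributions binned identically. Under that assumption the least-squares solution has a closed form. The function names and the choice to place overflow values in the last bin are hypothetical.

```python
def binned_density(distances, bin_size, n_bins):
    """Fraction of distances falling in each bin of width bin_size."""
    if not distances:
        raise ValueError("need at least one distance")
    counts = [0] * n_bins
    for d in distances:
        # Distances beyond the last bin edge are placed in the last bin
        index = min(int(d / bin_size), n_bins - 1)
        counts[index] += 1
    return [count / len(distances) for count in counts]


def fit_alpha(rho_exp, rho_i, rho_n):
    """Least-squares alpha for rho_exp = alpha*rho_i + (1 - alpha)*rho_n.

    Rearranging gives rho_exp - rho_n = alpha * (rho_i - rho_n), a
    one-parameter linear fit through the origin.
    """
    numerator = sum((e - n) * (i - n) for e, i, n in zip(rho_exp, rho_i, rho_n))
    denominator = sum((i - n) ** 2 for i, n in zip(rho_i, rho_n))
    if denominator == 0:
        raise ValueError("interacting and non-interacting distributions are identical")
    return numerator / denominator
```

The fit is unconstrained, so a value outside the range 0 to 1 would indicate that the experimental distribution is not well described as a mixture of the other two.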
Discussion:
An unfortunate complication in the protein-to-gene mapping step forced
the removal of roughly 85% of the original interaction data. The issue arose from the inclusion of outdated
protein identifiers in the DIP dataset. Of the 1863 identifiers, 780, when queried for a sequence on
the UniProt [6] website, returned a message stating that the identifier was no longer valid and that the
protein had been demerged into several new protein identifiers. As a case study, one protein identifier,
P22885, was demerged into P0A8P6 and P0A8P7. When the sequences for these were retrieved and
aligned, all three identifiers had identical sequences, and the two new identifiers had
identical annotations.
This problem undermined confidence in the core interactions involving demerged
proteins. While it would be possible to pair each new identifier of a demerged protein with each
new identifier of its interaction partner, there would be no way to show which of the resulting m*n
interactions were true. As such, these interactions had to be removed to preserve the integrity of the CORE
set. For the same reason, interactions involving demerged proteins were also removed from the NON set. The number
of mappable interactions is given in figure 2. There remains an unexplored possibility that the one-to-many
relationship of the old demerged proteins to their new identifiers could in fact map many-to-one at the
gene level, effectively undoing the demerge and allowing the interactions to be used. This represents a
technical challenge that should be pursued in the future.
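The many-to-one idea can be illustrated with a short sketch. This was not implemented in the project. The two lookup tables are hypothetical inputs, and the gene name used in the usage note is a placeholder, not a real annotation.

```python
def resolve_demerged(old_id, demerged, protein_to_gene):
    """Return the single gene an identifier resolves to, or None.

    demerged: maps an obsolete identifier to its list of new identifiers.
    protein_to_gene: maps a current protein identifier to a gene identifier.
    Returns None when the identifiers map to no gene or to several genes.
    """
    # An identifier that was never demerged simply stands for itself
    new_ids = demerged.get(old_id, [old_id])
    genes = {protein_to_gene[p] for p in new_ids if p in protein_to_gene}
    if len(genes) == 1:
        return next(iter(genes))
    return None
```

For example, with `demerged = {"P22885": ["P0A8P6", "P0A8P7"]}` and both new identifiers mapped to the same placeholder gene `"gene_1"`, the call `resolve_demerged("P22885", demerged, protein_to_gene)` returns `"gene_1"`, so an interaction involving P22885 could be retained. If the new identifiers mapped to different genes, the function returns None and the interaction would still have to be dropped.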
The second major point I would like to discuss arises from the density plot in figure 3.
When comparing figure 3 to figure 4, it is evident that the CORE and NON sets follow a different
distribution from the RAND set and from the sets in figure 4. This stems from the very high percentage of
interactions that are between a protein and another copy of the same protein. If a protein interacts with
another of its type, the expression distance calculation yields a value of 0. Of the CORE
interactions, 66.6% were of this type, as were 43% of the NON interactions. This presents a stark
contrast to the values seen in figure 4, where the lowest bin shown represents 5% or less of the dataset.
The authors did not comment on the number of distance values equal to zero or
on any special treatment of them. I would also like to point out that, while the mapping issue described
above inflated the percentage of the NON and CORE datasets represented by identity interactions,
15% of the original 6991 interactions were identity interactions. This is considerably more than
the proportion in the yeast data (5%). It would be an interesting line of inquiry to find out if there is any
biological significance to this large representation of identity interactions in E. coli when compared to
yeast.
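The share of identity interactions in a dataset is simple to compute. A minimal sketch follows, assuming each interaction is stored as a pair of identifiers; the function name is hypothetical.

```python
def identity_fraction(interactions):
    """Fraction of interactions in which both partners are the same protein."""
    if not interactions:
        raise ValueError("need at least one interaction")
    same = sum(1 for a, b in interactions if a == b)
    return same / len(interactions)
```

Since every such pair has an expression distance of 0, this fraction is also a lower bound on the share of a dataset that falls in the lowest bin of the density plot.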
Conclusion:
The conclusions that can be drawn from this experiment are limited. The original questions I
wanted to answer when I wrote my proposal included using expression data to predict novel
interactions, and using expression data to assign confidence values to individual interactions. This
analysis makes use of the distribution of expression distances between interacting proteins, and as such
can only be used to assess the overall quality of a dataset, rather than of individual interactions. While a
confidence value was not obtained, this project raised several new questions to be answered, and has
served as a good opportunity to learn about the challenges of handling large quantities of data of
different types.
References:
[1] Deeds EJ, Ashenberg O, Shakhnovich EI. “A simple physical model for scaling in protein-protein
interaction networks.” Proc. Natl Acad. Sci. USA (2006) 103:311–316
[2] Charlotte M. Deane et al. “Protein Interactions: Two Methods for Assessment of the Reliability of
High Throughput Observations.” Molecular & Cellular Proteomics 1.5 349-356
[3] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting
Proteins: 2004 update. NAR 32 Database issue:D449-51
[4] Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, and Gardner TS. Many
Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured
experimental metadata. Nucleic Acids Research
[5] Timothy E. Allen et al. “Genome-Scale Analysis of the Uses of the Escherichia coli Genome: Model-Driven Analysis of Heterogeneous Data Sets.” J Bacteriol. 2003 November, 185(21): 6392-6399
[6] The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res. 37:D169-D174 (2009).
[7] Jain E., Bairoch A., Duvaud S., Phan I., Redaschi N., Suzek B.E., Martin M.J., McGarvey P., Gasteiger E.
Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 2009, 10:136.