Supplementary Methods

Gunsalus et al. Supplementary Methods
Phenotypic signatures
The 45 phenotypic characters1 are considered together as a 45-dimensional vector
describing the phenotypic signature for each EE gene. These phenotypic signatures allow
for systematic computations of similarity in phenotype between genes.
Phenotype correlation
To measure the similarity between the knock-down phenotype of two EE genes we
computed the uncentered Pearson correlation coefficient (PCC) of their phenotypic
signatures. The uncentered PCC can be computed as the standard PCC, substituting 0 for
each mean2.
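As a minimal sketch of the uncentered PCC described above (not the authors' original Matlab code), the function below applies the standard Pearson formula with both means set to 0. The function name and the handling of all-zero signatures are assumptions made for illustration.

```python
import math

def uncentered_pcc(x, y):
    """Uncentered Pearson correlation: the standard PCC with each mean replaced by 0."""
    if len(x) != len(y):
        raise ValueError("signatures must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: an all-zero signature is treated as uncorrelated.
        return 0.0
    return dot / (norm_x * norm_y)
```

With the means fixed at 0, two signatures that differ only by a positive scale factor have a correlation of 1, and signatures with no scored characters in common have a correlation of 0.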
2D hierarchical clustering
Unsupervised agglomerative hierarchical clustering of EE genes based on their
phenotypic signatures was performed in Matlab 7 (MathWorks) using uncentered Pearson
correlation distance ((1-Ph_PCC)/2) and Ward linkage in both dimensions (Fig. 1b,
Supplementary Fig. 1). We chose a linkage threshold empirically that subdivided the
resulting dendrogram into 23 clusters, of which 13 were significantly enriched for
specific GO annotations (Fig. 1b, Supplementary Fig. 1a, and Supplementary Table S1).
This linkage threshold corresponds roughly to the Ph_PCC similarity threshold used for
data integration. To address the issue of character independence we analyzed correlations
between pairs of phenotypic characters. None of the character pairs show complete
correlation and most show no or only modest correlation (mean=0.042; median=0.013;
stddev=0.13; min=-0.29; max=0.74).
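The character-independence check above can be sketched as follows. The text does not state which correlation measure was used for character pairs, so the standard (centered) PCC is assumed here; the function names and the toy data layout (one signature per gene, one column per character) are hypothetical.

```python
import math
import statistics
from itertools import combinations

def pearson(x, y):
    """Standard (centered) Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant character is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def character_correlation_summary(signatures):
    """Summarize correlations over all pairs of characters (columns)."""
    columns = list(zip(*signatures))  # transpose: one tuple per character
    values = [pearson(a, b) for a, b in combinations(columns, 2)]
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stddev": statistics.pstdev(values),  # population std dev (assumption)
        "min": min(values),
        "max": max(values),
    }
```

For 45 characters this evaluates 45 × 44 / 2 = 990 character pairs.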
Analysis of shared functional attributes
Phenoclusters and EE networks were tested for functional enrichment using
FuncAssociate3 with vocabulary from the Gene Ontology (GO) Consortium (archived
Nov 1, 2002) and GO annotations from WormBase version WS100 (May 1, 2003). For
phenoclusters, P-values less than 0.05 were deemed significant (tests for each cluster
were corrected for multiple hypothesis testing); to evaluate the fraction of shared GO
terms as a function of phenotypic correlation (Ph_PCC), only attributes held by at least
two genes and no more than 800 genes were considered.
To assess the predictive value of EE network models on a global scale, EE
networks were analyzed individually and in combination for their ability to predict a
shared function between the two genes of a linked pair, considering only GO terms held by fewer
than 75 genes. We examined each single support network (phenome (Ph), transcriptome
(Tr), and interactome (Int)), the union of the single support networks (USN), as well as
multiple support (MSN) and triple support (TSN) networks (Supplementary Table S4),
calculating the overlap between links in the network and pairs of genes having a specific
GO term annotation in common. We then repeated the analysis, excluding genes with no
GO annotation, since genes of unknown function cannot share a function a priori
(Supplementary Table S5). The results showed that each single support network has a
characteristic accuracy (the fraction of links in the network which identify genes sharing
a specific function) and sensitivity (the fraction of shared-function pairs that are captured
in each network). This definition of accuracy is conservative because many gene
functions have yet to be discovered and annotated. Considering all interactions
(Supplementary Table S4), among the individual single support networks (Ph, Tr, Int),
the interactome network has the highest accuracy (53%), while the phenome network has
the highest sensitivity (24%). The multiple support network (MSN), which contains links
supported by at least two different evidence types, has a slightly higher
accuracy than any of the single support networks (57%), while its sensitivity is 4%. The
triple support network (TSN) has the highest accuracy (77%) but the lowest sensitivity
(0.2%). Taken together, these results show that combining the data improves accuracy
with a corresponding loss of sensitivity. Since gene pairs containing unannotated genes
had no opportunity a priori to have shared function, it is reasonable to exclude them from
consideration. When only those pairs for which both members have some GO annotation
are considered (Supplementary Table S5), the accuracy of all networks increases
dramatically, and the benefit obtained by integrating the datasets also increases (MSN
accuracy increases from 57% to 88%; TSN accuracy increases from 77% to 92%).
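The accuracy and sensitivity definitions used in this section can be illustrated with a short sketch. It assumes GO annotations are available as a mapping from gene to a set of terms; the function names, the data layout, and the gene and term identifiers in the usage note are hypothetical.

```python
from itertools import combinations

def shared_function_pairs(annotations, max_genes_per_term=75):
    """Gene pairs sharing at least one GO term held by fewer than max_genes_per_term genes."""
    genes_by_term = {}
    for gene, terms in annotations.items():
        for term in terms:
            genes_by_term.setdefault(term, set()).add(gene)
    pairs = set()
    for genes in genes_by_term.values():
        if 2 <= len(genes) < max_genes_per_term:
            pairs.update(frozenset(p) for p in combinations(sorted(genes), 2))
    return pairs

def accuracy_and_sensitivity(network_links, shared_pairs):
    """Accuracy: fraction of links joining genes with a shared function.
    Sensitivity: fraction of shared-function pairs captured as links."""
    links = {frozenset(link) for link in network_links}  # undirected, de-duplicated
    shared = {frozenset(pair) for pair in shared_pairs}
    hits = links & shared
    accuracy = len(hits) / len(links) if links else 0.0
    sensitivity = len(hits) / len(shared) if shared else 0.0
    return accuracy, sensitivity
```

For example, with annotations `{"a": {"GO:1"}, "b": {"GO:1"}, "c": {"GO:2"}, "d": {"GO:2"}}` and links `a-b` and `a-c`, one of the two links joins genes with a shared term and one of the two shared-function pairs is captured, so both values are 0.5.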
Expression correlation
Expression similarity between genes was computed using the standard PCC (Tr_PCC)
from data in the Kim compendium4 obtained from the Stanford Microarray Database. For
a given gene pair, only those experiments for which both genes were assayed were
considered in calculating the Tr_PCC. Only gene pairs for which there are data from at
least 10 common experiments were considered. All pairs with PCC ≥ 0.7 have at least 99
experiments in common.
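A minimal sketch of the pairwise Tr_PCC calculation follows, assuming each expression profile is a list aligned by experiment with `None` marking experiments in which the gene was not assayed. The function name and this encoding of missing values are assumptions.

```python
import math

def tr_pcc(profile_a, profile_b, min_common=10):
    """Standard PCC over experiments in which both genes were assayed.

    Returns None when fewer than min_common experiments are shared.
    """
    common = [(a, b) for a, b in zip(profile_a, profile_b)
              if a is not None and b is not None]
    if len(common) < min_common:
        return None
    n = len(common)
    mean_a = sum(a for a, _ in common) / n
    mean_b = sum(b for _, b in common) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in common)
    var_a = sum((a - mean_a) ** 2 for a, _ in common)
    var_b = sum((b - mean_b) ** 2 for _, b in common)
    if var_a == 0 or var_b == 0:
        return None  # assumption: correlation is undefined for a constant profile
    return cov / math.sqrt(var_a * var_b)
```

Restricting the calculation to shared experiments means that each gene pair may be scored over a different subset of the compendium, which is why a minimum number of common experiments is enforced.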
Network graphs
Networks are represented by graphs in which each gene/protein is represented as a node
and pairwise biological evidence linking two genes/proteins (physical interaction, phenosimilarity, and/or expression similarity above a fixed threshold) is represented by an
edge. Network graphs were visualized using Cytoscape 2.05 or the LEDA C++ graph
library (Algorithmic Solutions Software GmbH). Graphs were arranged using a layered
approach to enhance the visualization of different regions (Fig. 3a), using a spring (Fig.
3b,c) or circular layout (Fig. 3d), or manually after grouping genes of similar function
into metanodes (Fig. 3e).
Worm interactome version 7 (WI7)
We generated an updated version of the worm interactome map, which we refer to as
“WI7”, by combining WI5 (3,228 proteins linked by 5,685 potential interactions)6 with a
set of 887 “interologs”7 derived from “fly-to-worm” inferences (Supplementary Table
S1). WI7 contains a total of 3,848 proteins linked by 6,572 experimental and predicted
interactions.
Fly-to-worm interologs represent predicted interactions between C. elegans
orthologs of D. melanogaster proteins that interact in a yeast two-hybrid (Y2H) assay8.
To identify fly-to-worm interologs, reciprocal whole-proteome BLASTP searches were
performed using NCBI BLAST between coding sequence (CDS) annotations from D.
melanogaster (FlyBase Release 3.1) and C. elegans (WormPep100), and putative
orthologs were identified, defined as reciprocal best hit matches between proteomes.
Interologs were derived only from relatively high-confidence D. melanogaster Y2H
protein-protein interactions (i.e. a Y2H assay confidence score of 0.5 or greater)8. Each
pair of worm proteins whose fly orthologs are high-confidence Y2H interactors was
defined as an interolog of the corresponding pair of Y2H interacting fly proteins.
Supplementary Table S2 lists potential fly interologs, including joint BLASTP e-values
and joint percent identity along the length of the query and subject proteins. Joint values
are defined as the geometric average of independent values from reciprocal
whole-proteome BLASTP analysis of D. melanogaster and C. elegans. The EE integrated
network contains 29 fly-to-worm interologs; 12 are independently supported by data from
WI56; 4 are self interactors; and the remaining 13 all share at least 70% sequence identity
and/or have joint E-values < 10^-70. We conclude that including fly-to-worm interologs
adds high quality interactions to the integrated EE network, based on criteria defined
previously for the reliable cross-species transfer of inferences from interologs9.
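The interolog mapping described above can be sketched in three steps: reciprocal best hits, transfer of high-confidence fly interactions to worm ortholog pairs, and joint values as geometric averages. The best-hit tables are assumed to have been parsed from BLASTP output beforehand; all function names and input layouts are hypothetical.

```python
import math

def reciprocal_best_hits(fly_to_worm_best, worm_to_fly_best):
    """Putative orthologs: fly/worm pairs that are each other's best BLASTP hit."""
    return {fly: worm for fly, worm in fly_to_worm_best.items()
            if worm_to_fly_best.get(worm) == fly}

def fly_to_worm_interologs(fly_interactions, orthologs, min_confidence=0.5):
    """Map fly Y2H interactions (fly_a, fly_b, confidence) onto worm ortholog pairs."""
    interologs = set()
    for fly_a, fly_b, confidence in fly_interactions:
        if confidence >= min_confidence and fly_a in orthologs and fly_b in orthologs:
            interologs.add(tuple(sorted((orthologs[fly_a], orthologs[fly_b]))))
    return interologs

def joint_value(forward, reverse):
    """Geometric average of the values from the two reciprocal searches."""
    # sqrt(a) * sqrt(b) avoids floating-point underflow for very small E-values.
    return math.sqrt(forward) * math.sqrt(reverse)
```

Sorting each worm pair makes the result independent of bait/prey orientation, and a fly self-interaction maps to a worm self-interaction.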
EE interactome network
The EE interactome network used here contains all WI7 interactions for which both
interactors are EE proteins. Of the 661 EE gene products, 426 are contained in WI7 (i.e.,
426 have at least one physical interaction in WI7) and 277 of these interact with another
EE protein. The EE interactome network contains 277 nodes and 513 edges (excluding
homodimers and duplicate links).
Random gene networks
To compare the connectivity of the EE interactome network with random expectation,
426 proteins (the number of EE proteins in WI7) were chosen at random from WI7 as a
starting set of proteins. To control for potential inspection bias in the interactome data,
133 of these were chosen from genes in WI7 that had previously been used as baits in
high-throughput C. elegans two-hybrid screens6 (the same number as in the EE set), since
baits are expected to have more interaction partners on average than are preys. The
subgraph of WI7 induced by each random set of proteins was then generated
(homodimers excluded). This procedure was repeated to create 1000 within-set random
networks.
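The sampling procedure above can be sketched as follows. The text fixes the number of baits in each random set but does not say how the remaining proteins were drawn; this sketch assumes they come from the non-bait proteins. Function names and the edge-list input format are hypothetical.

```python
import random

def induced_subgraph(edges, nodes):
    """Links whose two distinct endpoints are both in `nodes` (homodimers dropped)."""
    return {frozenset((a, b)) for a, b in edges
            if a != b and a in nodes and b in nodes}

def random_within_set_networks(edges, baits, n_total, n_baits,
                               n_networks=1000, seed=0):
    """Generate random within-set networks with a fixed number of bait proteins."""
    rng = random.Random(seed)
    proteins = {p for edge in edges for p in edge}
    bait_pool = sorted(proteins & set(baits))
    other_pool = sorted(proteins - set(baits))  # assumption: remainder are non-baits
    networks = []
    for _ in range(n_networks):
        chosen = set(rng.sample(bait_pool, n_baits))
        chosen |= set(rng.sample(other_pool, n_total - n_baits))
        networks.append(induced_subgraph(edges, chosen))
    return networks
```

For the analysis described here the call would use `n_total=426` and `n_baits=133`, with `edges` holding the WI7 interactions.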
Neighbor networks
The EE neighbor network was created from WI7 by extracting all interactions involving
at least one EE protein. This network includes EE proteins as well as other proteins that
interact with an EE protein. Neighbor networks were also generated in this way for the
1000 random lists of WI7 proteins (as described for randomized networks).
Network statistics
For each WI7 subnetwork (EE or random, within-set or neighbor) we computed the
number of nodes (those proteins with at least one interaction in the network), the number
of starting nodes in the network (for neighbor networks), the number of links, the
clustering coefficient, the number of components and their sizes, and the diameter and
characteristic path length of the largest component (Supplementary Table S3). The
distance between two nodes, also called the shortest path length, is the smallest number
of links required to traverse from one node to the other in the graph. Distance was
computed using Dijkstra’s algorithm. We used the triangle definition of clustering
coefficient (3*T/P, where T is the number of triangles in the network, and P is the
number of paths of length 2 in the network)10. The diameter is the distance between the two
proteins that are furthest from each other, i.e., the longest shortest path between any two
nodes. The characteristic path length is the average distance between pairs of nodes. The
mean, standard deviation, minimum, and maximum were computed for each of these
statistics for the random gene networks.
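The statistics defined above can be sketched for an unweighted, undirected network as follows. Breadth-first search is used here in place of Dijkstra's algorithm; on unweighted graphs the two give the same distances. Function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets; self-links are ignored."""
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def clustering_coefficient(adj):
    """Triangle definition: 3*T / P, with P the number of paths of length 2."""
    closed = 0  # closed paths of length 2; each triangle contributes 3
    paths = 0
    for neighbors in adj.values():
        k = len(neighbors)
        paths += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return closed / paths if paths else 0.0

def bfs_distances(adj, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def largest_component_stats(adj):
    """Return (size, diameter, characteristic path length) of the largest component."""
    seen, largest = set(), set()
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            if len(component) > len(largest):
                largest = component
    # Ordered pairs count each distance twice, which leaves the mean unchanged.
    lengths = [d for s in largest for d in bfs_distances(adj, s).values() if d > 0]
    if not lengths:
        return 0, 0, 0.0
    return len(largest), max(lengths), sum(lengths) / len(lengths)
```

For a triangle a-b-c with one extra link c-d, T = 1 and P = 5, so the clustering coefficient is 0.6; the diameter is 2 and the characteristic path length is 8/6.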
The products of EE genes are more interconnected by direct protein interactions
than expected by chance (Supplementary Table S3). The EE interactome network
contains nearly twice as many proteins (277 vs. 147±15) and four times as many interactions (513
vs. 120±20) as comparable networks of random proteins from WI7. Relative to randomly
generated networks, the largest component is nearly four times the average size (196 vs.
55±25) and the clustering coefficient is an order of magnitude higher (0.53 vs.
0.05±0.05), further illustrating the high connectivity, or “cliquishness”, of EE proteins.
Similar results were obtained using an expanded “EE neighbor” network that includes all
interaction partners of EE proteins in WI7 (Supplementary Table S3).
Random phenocluster assignments
To compare the proportion of observed EE protein interactions within a phenocluster
with the random expectation (Fig. 2b), phenoclusters were randomly assigned to EE
proteins, preserving the number of proteins in each phenocluster. This procedure was
repeated 1000 times.
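A minimal sketch of this permutation test follows: shuffling the cluster labels across proteins preserves the size of every phenocluster. The function names, the input layout, and the exclusion of self-interactions are assumptions.

```python
import random

def within_cluster_fraction(interactions, cluster_of):
    """Fraction of interactions whose two proteins fall in the same phenocluster."""
    pairs = [(a, b) for a, b in interactions
             if a != b and a in cluster_of and b in cluster_of]
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if cluster_of[a] == cluster_of[b])
    return same / len(pairs)

def shuffled_fractions(interactions, cluster_of, n_iter=1000, seed=0):
    """Within-cluster fractions under random label assignment (cluster sizes kept)."""
    rng = random.Random(seed)
    proteins = sorted(cluster_of)
    labels = [cluster_of[p] for p in proteins]
    fractions = []
    for _ in range(n_iter):
        rng.shuffle(labels)  # permuting labels preserves each cluster's size
        fractions.append(within_cluster_fraction(interactions, dict(zip(proteins, labels))))
    return fractions
```

The observed fraction from `within_cluster_fraction` can then be compared with the distribution returned by `shuffled_fractions`.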
Cloning
For each gene selected, the sequence between the predicted initiation and termination
codons, including the intervening introns (in-ORF), was amplified using genomic DNA
as a template and primers containing appropriate restriction enzyme sites. These were
subsequently cloned into GFP vectors driven by two different promoters: a pie-1
promoter vector (pJunc (A. Schetter, unpublished), a derivative of pJH411 into which the
unc-119 gene has been added), which drives expression of N-terminal GFP fusion protein
in the germline (pie-1::GFP::in-ORF), and an npp-1 promoter vector (pNPnpp1 (A.
Schetter, unpublished)), which drives expression of a C-terminal GFP fusion protein in
the germline and the soma (npp-1::in-ORF::GFP). Primer sequences are available upon
request.
Generation of transgenic worms
Transgenic animals expressing extrachromosomal arrays were created by injection of
linearized constructs mixed with PvuII-digested genomic DNA and linearized pRF-4
(rol-6 dominant marker)12, 13. A typical injection recipe included 1 ng/µl Sca I cut pRF-4
(rol-6), 1 ng/µl GFP transgene linearized with an appropriate enzyme, and 40 ng/µl Pvu II
cut genomic DNA. To avoid the problems associated with gene silencing for germline
genes, we assayed all F2 transgenic animals. Each transgenic line subsequently became silenced.
Time-lapse microscopy
Microscopy was carried out on Leica DMLA or DMRA microscopes using 100X
(1.3 N.A.) objectives and GFP filters. Acquisition was carried out using a QIcam
(QImaging) camera (2×2 binning) driven by OpenLab (Improvision) software. Images
were captured at 10-second intervals over a period of 50 minutes, assembled into
time-lapse movies with both OpenLab and NIH Image, and analyzed for expression. Movies
were subsequently compressed using Quicktime V6.5 (Apple Computers) to prepare data
for Internet streaming.
References
1. Sönnichsen, B. et al. High content RNAi screen of virtually all predicted C.
elegans genes identifies 662 genes required for the first embryonic cell divisions.
Nature 434, 462-9 (2005).
2. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and
display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95,
14863-8 (1998).
3. Berriz, G. F., King, O. D., Bryant, B., Sander, C. & Roth, F. P. Characterizing
gene sets with FuncAssociate. Bioinformatics 19, 2502-4 (2003).
4. Kim, S. K. et al. A gene expression map for Caenorhabditis elegans. Science 293,
2087-92 (2001).
5. Shannon, P. et al. Cytoscape: a software environment for integrated models of
biomolecular interaction networks. Genome Res 13, 2498-504 (2003).
6. Li, S. et al. A map of the interactome network of the metazoan C. elegans.
Science 303, 540-3 (2004).
7. Matthews, L. R. et al. Identification of potential interaction networks using
sequence-based searches for conserved protein-protein interactions or
"interologs". Genome Res 11, 2120-6 (2001).
8. Giot, L. et al. A protein interaction map of Drosophila melanogaster. Science
302, 1727-36 (2003).
9. Yu, H. et al. Annotation transfer between genomes: protein-protein interologs and
protein-DNA regulogs. Genome Res 14, 1107-18 (2004).
10. Albert, R. & Barabasi, A. L. Statistical mechanics of complex networks. Reviews
Modern Physics 74, 47-97 (2002).
11. Reese, K. J., Dunn, M. A., Waddle, J. A. & Seydoux, G. Asymmetric segregation
of PIE-1 in C. elegans is mediated by two complementary mechanisms that act
through separate PIE-1 protein domains. Mol Cell 6, 445-55 (2000).
12. Mello, C. C., Kramer, J. M., Stinchcomb, D. & Ambros, V. Efficient gene transfer
in C. elegans: extrachromosomal maintenance and integration of transforming
sequences. EMBO J 10, 3959-70 (1991).
13. Kelly, W. G., Xu, S., Montgomery, M. K. & Fire, A. Distinct requirements for
somatic and germline expression of a generally expressed Caenorhabditis elegans
gene. Genetics 146, 227-38 (1997).