Gunsalus et al. Supplementary Methods

Phenotypic signatures
The 45 phenotypic characters [1] are considered together as a 45-dimensional vector describing the phenotypic signature for each EE gene. These phenotypic signatures allow for systematic computations of similarity in phenotype between genes.

Phenotype correlation
To measure the similarity between the knock-down phenotypes of two EE genes, we computed the uncentered Pearson correlation coefficient (PCC) of their phenotypic signatures (Ph_PCC). The uncentered PCC is computed as the standard PCC, substituting 0 for each mean [2].

2D hierarchical clustering
Unsupervised agglomerative hierarchical clustering of EE genes based on their phenotypic signatures was performed in Matlab 7 (MathWorks) using uncentered Pearson correlation distance ((1 - Ph_PCC)/2) and Ward linkage in both dimensions (Fig. 1b, Supplementary Fig. 1). We empirically chose a linkage threshold that subdivided the resulting dendrogram into 23 clusters, of which 13 were significantly enriched for specific GO annotations (Fig. 1b, Supplementary Fig. 1a, and Supplementary Table S1). This linkage threshold corresponds roughly to the Ph_PCC similarity threshold used for data integration.

To address the issue of character independence, we analyzed correlations between pairs of phenotypic characters. None of the character pairs shows complete correlation, and most show no or only modest correlation (mean = 0.042; median = 0.013; stddev = 0.13; min = -0.29; max = 0.74).

Analysis of shared functional attributes
Phenoclusters and EE networks were tested for functional enrichment using FuncAssociate [3] with vocabulary from the Gene Ontology (GO) Consortium (archived Nov 1, 2002) and GO annotations from WormBase version WS100 (May 1, 2003).
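As a minimal sketch of the similarity measure described under "Phenotype correlation" and the distance used under "2D hierarchical clustering", the following Python functions compute the uncentered PCC of two signatures and the derived distance (1 - Ph_PCC)/2. The function names and the choice to return NaN for an all-zero signature are illustrative assumptions; the original analysis was performed in Matlab.

```python
import math


def uncentered_pcc(x, y):
    """Uncentered Pearson correlation: the standard PCC with both means set to 0."""
    if len(x) != len(y):
        raise ValueError("signatures must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: the correlation is treated as undefined for an all-zero signature.
        return float("nan")
    return dot / (norm_x * norm_y)


def correlation_distance(x, y):
    """Distance (1 - Ph_PCC)/2, which maps correlations in [-1, 1] onto [0, 1]."""
    return (1.0 - uncentered_pcc(x, y)) / 2.0
```

Under this definition, identical signatures have distance 0, signatures with no characters in common have distance 0.5, and exactly opposite signatures have distance 1.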
For phenoclusters, P-values less than 0.05 were deemed significant (tests for each cluster were corrected for multiple hypothesis testing). To evaluate the fraction of shared GO terms as a function of phenotypic correlation (Ph_PCC), only attributes held by at least two genes and no more than 800 genes were considered.

To assess the predictive value of EE network models on a global scale, EE networks were analyzed individually and in combination for their ability to predict a shared function between the two genes of a linked pair, considering only GO terms held by fewer than 75 genes. We examined each single support network (phenome (Ph), transcriptome (Tr), and interactome (Int)), the union of the single support networks (USN), as well as multiple support (MSN) and triple support (TSN) networks (Supplementary Table S4), calculating the overlap between links in the network and pairs of genes having a specific GO term annotation in common. We then repeated the analysis, excluding genes with no GO annotation, since genes of unknown function cannot share a function a priori (Supplementary Table S5).

The results showed that each single support network has a characteristic accuracy (the fraction of links in the network that identify genes sharing a specific function) and sensitivity (the fraction of shared-function pairs that are captured in each network). This definition of accuracy is conservative because many gene functions have yet to be discovered and annotated. Considering all interactions (Supplementary Table S4), among the individual single support networks (Ph, Tr, Int), the interactome network has the highest accuracy (53%), while the phenome network has the highest sensitivity (24%). The multiple support network (MSN), representing links supported by at least two different evidence types, has a slightly higher accuracy (57%) than any of the single support networks, while its sensitivity is 4%.
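The accuracy and sensitivity definitions above can be illustrated with a short Python sketch. The data structures (a dict mapping each gene to a set of GO terms, links as gene pairs) and all function and parameter names are hypothetical; only the two ratios, the term-size cutoff of fewer than 75 genes, and the option to exclude unannotated genes come from the text.

```python
from itertools import combinations


def shared_function_pairs(annotations, max_term_size=75):
    """Unordered gene pairs sharing at least one GO term held by fewer than max_term_size genes."""
    term_to_genes = {}
    for gene, terms in annotations.items():
        for term in terms:
            term_to_genes.setdefault(term, set()).add(gene)
    pairs = set()
    for genes in term_to_genes.values():
        if len(genes) < max_term_size:
            for a, b in combinations(sorted(genes), 2):
                pairs.add(frozenset((a, b)))
    return pairs


def accuracy_and_sensitivity(links, annotations, annotated_only=False):
    """Return (accuracy, sensitivity) of a network against shared GO annotation."""
    # Treat links as unordered and ignore self-links and duplicates.
    link_set = {frozenset(link) for link in links if len(set(link)) == 2}
    if annotated_only:
        # Keep only links whose two genes both carry at least one GO annotation.
        link_set = {l for l in link_set if all(annotations.get(g) for g in l)}
    shared = shared_function_pairs(annotations)
    hits = link_set & shared
    accuracy = len(hits) / len(link_set) if link_set else float("nan")
    sensitivity = len(hits) / len(shared) if shared else float("nan")
    return accuracy, sensitivity
```

Setting `annotated_only=True` corresponds to the second analysis, in which links involving genes without any GO annotation are excluded before accuracy is computed.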
The triple support network (TSN) has the highest accuracy (77%) but the lowest sensitivity (0.2%). Taken together, these results show that combining the data improves accuracy with a corresponding loss of sensitivity.

Since gene pairs containing unannotated genes had no opportunity a priori to share a function, it is reasonable to exclude them from consideration. When only those pairs for which both members have some GO annotation are considered (Supplementary Table S5), the accuracy of all networks increases dramatically, and the benefit obtained by integrating the datasets also increases (MSN accuracy increases from 57% to 88%; TSN accuracy increases to 92%).

Expression correlation
Expression similarity between genes was computed using the standard PCC (Tr_PCC) from data in the Kim compendium [4] obtained from the Stanford Microarray Database. For a given gene pair, only those experiments for which both genes were assayed were considered in calculating the Tr_PCC. Only gene pairs for which there are data from at least 10 common experiments were considered. All pairs with PCC ≥ 0.7 have at least 99 experiments in common.

Network graphs
Networks are represented by graphs in which each gene/protein is represented as a node and pairwise biological evidence linking two genes/proteins (physical interaction, phenosimilarity, and/or expression similarity above a fixed threshold) is represented by an edge. Network graphs were visualized using Cytoscape 2.0 [5] or the LEDA C++ graph library (Algorithmic Solutions Software GmbH). Graphs were arranged using a layered approach to enhance the visualization of different regions (Fig. 3a), using a spring (Fig. 3b,c) or circular layout (Fig. 3d), or manually after grouping genes of similar function into metanodes (Fig. 3e).
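The Tr_PCC calculation described under "Expression correlation" can be sketched as follows. This is an illustration only: it assumes each expression profile is a list with one value per experiment and `None` where the gene was not assayed, and the function name is hypothetical. The minimum of 10 common experiments is taken from the text.

```python
import math


def tr_pcc(profile_a, profile_b, min_common=10):
    """Standard (centered) PCC over experiments in which both genes were assayed.

    Returns None when fewer than min_common experiments are shared, or when
    either profile is constant over the shared experiments (assumption).
    """
    pairs = [(a, b) for a, b in zip(profile_a, profile_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < min_common:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)
```

Restricting the calculation to shared experiments means that each gene pair may be scored on a different subset of the compendium, which is why a minimum number of common experiments is required.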
Worm interactome version 7 (WI7)
We generated an updated version of the worm interactome map, which we refer to as “WI7”, by combining WI5 (3,228 proteins linked by 5,685 potential interactions) [6] with a set of 887 “interologs” [7] derived from “fly-to-worm” inferences (Supplementary Table S1). WI7 contains a total of 3,848 proteins linked by 6,572 experimental and predicted interactions.

Fly-to-worm interologs represent predicted interactions between C. elegans orthologs of D. melanogaster proteins that interact in a yeast two-hybrid (Y2H) assay [8]. To identify fly-to-worm interologs, reciprocal whole-proteome BLASTP searches were performed using NCBI BLAST between coding sequence (CDS) annotations from D. melanogaster (FlyBase Release 3.1) and C. elegans (WormPep100). Putative orthologs were defined as reciprocal best-hit matches between proteomes. Interologs were derived only from relatively high-confidence D. melanogaster Y2H protein-protein interactions (i.e., a Y2H assay confidence score of 0.5 or greater) [8]. Each pair of worm proteins whose fly orthologs are high-confidence Y2H interactors was defined as an interolog of the corresponding pair of Y2H interacting fly proteins.

Supplementary Table S2 lists potential fly interologs, including joint BLASTP E-values and joint percent identity along the length of the query and subject proteins. Joint values are defined as the geometric average of independent values from reciprocal whole-proteome BLASTP analysis of D. melanogaster and C. elegans. The EE integrated network contains 29 fly-to-worm interologs; 12 are independently supported by data from WI5 [6]; 4 are self interactors; and the remaining 13 all share at least 70% sequence identity and/or have joint E-values < 10^-70. We conclude that including fly-to-worm interologs adds high-quality interactions to the integrated EE network, based on criteria defined previously for the reliable cross-species transfer of inferences from interologs [9].
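The interolog inference steps described above (reciprocal best hits, the Y2H confidence cutoff of 0.5, and joint values as geometric averages) can be sketched in Python as follows. The input formats, function names, and protein identifiers are hypothetical and chosen for illustration; no BLAST parsing is shown.

```python
import math


def reciprocal_best_hits(fly_to_worm, worm_to_fly):
    """Map each fly protein to its worm ortholog when the best hits are reciprocal.

    fly_to_worm / worm_to_fly: dicts giving the best BLASTP hit in each direction.
    """
    return {fly: worm for fly, worm in fly_to_worm.items()
            if worm_to_fly.get(worm) == fly}


def infer_interologs(fly_interactions, orthologs, min_confidence=0.5):
    """Transfer high-confidence fly Y2H interactions onto worm ortholog pairs.

    fly_interactions: iterable of (fly_a, fly_b, confidence_score).
    Returns a set of frozensets; a one-element frozenset is a self interactor.
    """
    interologs = set()
    for fly_a, fly_b, confidence in fly_interactions:
        if confidence < min_confidence:
            continue
        if fly_a in orthologs and fly_b in orthologs:
            interologs.add(frozenset((orthologs[fly_a], orthologs[fly_b])))
    return interologs


def joint_value(forward, reverse):
    """Geometric average of the two values from the reciprocal BLASTP searches."""
    return math.sqrt(forward * reverse)
```

For example, forward and reverse E-values of 1e-80 and 1e-60 give a joint E-value of about 1e-70, and percent identities of 80 and 45 give a joint identity of 60.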
EE interactome network
The EE interactome network used here contains all WI7 interactions for which both interactors are EE proteins. Of the 661 EE gene products, 426 are contained in WI7 (i.e., 426 have at least one physical interaction in WI7) and 277 of these interact with another EE protein. The EE interactome network contains 277 nodes and 513 edges (excluding homodimers and duplicate links).

Random gene networks
To compare the connectivity of the EE interactome network with random expectation, 426 proteins (the number of EE proteins in WI7) were chosen at random from WI7 as a starting set of proteins. To control for potential inspection bias in the interactome data, 133 of these were chosen from genes in WI7 that had previously been used as baits in high-throughput C. elegans two-hybrid screens [6] (the same number as in the EE set), since baits are expected to have more interaction partners on average than preys. The subgraph of WI7 induced by each random set of proteins was then generated (homodimers excluded). This procedure was repeated to create 1000 within-set random networks.

Neighbor networks
The EE neighbor network was created from WI7 by extracting all interactions involving at least one EE protein. This network includes EE proteins as well as other proteins that interact with an EE protein. Neighbor networks were also generated in this way for the 1000 random lists of WI7 proteins (as described under Random gene networks).

Network statistics
For each WI7 subnetwork (EE or random, within-set or neighbor) we computed the number of nodes (those proteins with at least one interaction in the network), the number of starting nodes in the network (for neighbor networks), the number of links, the clustering coefficient, the number of components and their sizes, and the diameter and characteristic path length of the largest component (Supplementary Table S3).
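The within-set and neighbor network constructions described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and two details are assumptions not stated in the text, namely that the non-bait portion of each random set is drawn from WI7 proteins that were not baits, and that self-interactions are dropped from neighbor networks as well as from within-set networks.

```python
import random


def induced_subgraph(edges, nodes):
    """Within-set network: links with both endpoints in `nodes`, homodimers excluded."""
    nodes = set(nodes)
    return {frozenset((a, b)) for a, b in edges
            if a != b and a in nodes and b in nodes}


def neighbor_network(edges, nodes):
    """Neighbor network: links with at least one endpoint in `nodes`."""
    nodes = set(nodes)
    # Assumption: self-interactions are dropped here too.
    return {frozenset((a, b)) for a, b in edges
            if a != b and (a in nodes or b in nodes)}


def random_protein_set(all_proteins, baits, n_total, n_baits, rng):
    """Draw n_total proteins, exactly n_baits of which were used as Y2H baits."""
    bait_pool = sorted(set(baits) & set(all_proteins))
    # Assumption: the remaining proteins are sampled from non-baits only.
    other_pool = sorted(set(all_proteins) - set(bait_pool))
    chosen = rng.sample(bait_pool, n_baits) + rng.sample(other_pool, n_total - n_baits)
    return set(chosen)
```

Calling `random_protein_set` with `n_total=426` and `n_baits=133` in a loop of 1000 iterations, and passing each result to `induced_subgraph` or `neighbor_network`, would mirror the sampling scheme described in the text.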
The distance between two nodes, also called the shortest path length, is the smallest number of links required to traverse from one node to the other in the graph. Distance was computed using Dijkstra’s algorithm. We used the triangle definition of clustering coefficient (3*T/P, where T is the number of triangles in the network, and P is the number of paths of length 2 in the network) [10]. The diameter is the distance between the two proteins that are furthest from each other, i.e., the longest shortest path between any two nodes. The characteristic path length is the average distance between pairs of nodes. The mean, standard deviation, minimum, and maximum were computed for each of these statistics for the random gene networks.

The products of EE genes are more interconnected by direct protein interactions than expected by chance (Supplementary Table S3). The EE interactome network contains nearly twice as many proteins (277 vs. 147±15) and four times as many interactions (513 vs. 120±20) as comparable networks of random proteins from WI7. Relative to randomly generated networks, the largest component is nearly four times the average size (196 vs. 55±25) and the clustering coefficient is an order of magnitude higher (0.53 vs. 0.05±0.05), further illustrating the high connectivity, or “cliquishness”, of EE proteins. Similar results were obtained using an expanded “EE neighbor” network that includes all interaction partners of EE proteins in WI7 (Supplementary Table S3).

Random phenocluster assignments
To compare the proportion of observed EE protein interactions within a phenocluster with the random expectation (Fig. 2b), phenoclusters were randomly assigned to EE proteins, preserving the number of proteins in each phenocluster. This procedure was repeated 1000 times.
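The network statistics defined under "Network statistics" (clustering coefficient 3*T/P, diameter, and characteristic path length of the largest component) can be sketched in plain Python. Note one substitution: the sketch uses breadth-first search rather than Dijkstra’s algorithm, which yields the same distances when every link has unit length. Function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets, ignoring self-interactions and duplicate links."""
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj):
    """Triangle definition 3*T/P, with P the number of paths of length 2."""
    paths = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    # Each triangle is counted once per corner, so this sum already equals 3*T.
    closed = sum(1 for nbrs in adj.values()
                 for u, v in combinations(nbrs, 2) if v in adj[u])
    return closed / paths if paths else 0.0


def bfs_distances(adj, source):
    """Shortest path length (number of links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def largest_component_stats(adj):
    """Return (size, diameter, characteristic path length) of the largest component."""
    seen, largest = set(), set()
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            if len(component) > len(largest):
                largest = component
    # Collect distances over ordered pairs; the mean equals that over unordered pairs.
    dists = [d for node in largest
             for other, d in bfs_distances(adj, node).items() if other != node]
    if not dists:
        return len(largest), 0, 0.0
    return len(largest), max(dists), sum(dists) / len(dists)
```

Applying these functions to the EE network and to each of the 1000 random networks would give the per-network values from which means and standard deviations are summarized.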
Cloning
For each gene selected, the sequence between the predicted initiation and termination codons, including the intervening introns (in-ORF), was amplified using genomic DNA as a template and primers containing appropriate restriction enzyme sites. These were subsequently cloned into GFP vectors driven by two different promoters: a pie-1 promoter vector (pJunc (A. Schetter, unpublished), a derivative of pJH411 into which the unc-119 gene has been added), which drives expression of an N-terminal GFP fusion protein in the germline (pie-1::GFP::in-ORF), and an npp-1 promoter vector (pNPnpp1 (A. Schetter, unpublished)), which drives expression of a C-terminal GFP fusion protein in the germline and the soma (npp-1::in-ORF::GFP). Primer sequences are available upon request.

Generation of transgenic worms
Transgenic animals expressing extrachromosomal arrays were created by injection of linearized constructs mixed with PvuII-digested genomic DNA and linearized pRF-4 (rol-6 dominant marker) [12, 13]. A typical injection recipe included 1 ng/µl ScaI-cut pRF-4 (rol-6), 1 ng/µl GFP transgene linearized with an appropriate enzyme, and 40 ng/µl PvuII-cut genomic DNA. To avoid the problems associated with gene silencing for germline genes, we assayed all F2 transgenic animals. Each transgenic line subsequently became silenced.

Time-lapse microscopy
Microscopy was carried out on Leica DMLA or DMRA microscopes using 100X (1.3 N.A.) objectives and GFP filters. Acquisition was carried out using a QIcam (QImaging) camera (2X2 binning) driven by OpenLab (Improvision) software. Images were captured at 10-second intervals over a period of 50 minutes, assembled into time-lapse movies with both OpenLab and NIH Image, and analyzed for expression. Movies were subsequently compressed using Quicktime V6.5 (Apple Computers) to prepare data for Internet streaming.

References
1. Sönnichsen, B. et al. High content RNAi screen of virtually all predicted C.
elegans genes identifies 662 genes required for the first embryonic cell divisions. Nature 434, 462-9 (2005).
2. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8 (1998).
3. Berriz, G. F., King, O. D., Bryant, B., Sander, C. & Roth, F. P. Characterizing gene sets with FuncAssociate. Bioinformatics 19, 2502-4 (2003).
4. Kim, S. K. et al. A gene expression map for Caenorhabditis elegans. Science 293, 2087-92 (2001).
5. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-504 (2003).
6. Li, S. et al. A map of the interactome network of the metazoan C. elegans. Science 303, 540-3 (2004).
7. Matthews, L. R. et al. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Res 11, 2120-6 (2001).
8. Giot, L. et al. A protein interaction map of Drosophila melanogaster. Science 302, 1727-36 (2003).
9. Yu, H. et al. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 14, 1107-18 (2004).
10. Albert, R. & Barabasi, A. L. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47-97 (2002).
11. Reese, K. J., Dunn, M. A., Waddle, J. A. & Seydoux, G. Asymmetric segregation of PIE-1 in C. elegans is mediated by two complementary mechanisms that act through separate PIE-1 protein domains. Mol Cell 6, 445-55 (2000).
12. Mello, C. C., Kramer, J. M., Stinchcomb, D. & Ambros, V. Efficient gene transfer in C. elegans: extrachromosomal maintenance and integration of transforming sequences. EMBO J 10, 3959-70 (1991).
13. Kelly, W. G., Xu, S., Montgomery, M. K. & Fire, A. Distinct requirements for somatic and germline expression of a generally expressed Caenorhabditis elegans gene. Genetics 146, 227-38 (1997).