SUPPLEMENTARY MATERIALS AND METHODS

Strains and transgenes used

N2 Bristol wildtype (BRENNER 1974), cog-1(sy607) (PALMER et al. 2002), cog-1(ot119), cog-1(ot201) (SARIN et al. 2007).

The following transgenes were used: otIs114=Is[lim-6prom::gfp; rol-6(d)] (CHANG et al. 2003), otIs151=Is[ceh-36prom::rfp; rol-6(d)] (JOHNSTON and HOBERT 2003), syIs54=Is[unc-119(+); ceh-2::gfp] (INOUE et al. 2005), otEx1492=Ex[cog-1prom1::gfp; rol-6(d)], otEx2942=Ex[cog-1prom1del2::gfp] (ETCHBERGER et al. 2007).

New transgenes:
otEx3761-63: three independent lines of Ex[cog-1promA::gfp; elt-2::gfp]
otEx3764-66: three independent lines of Ex[cog-1promA-ot119::gfp; elt-2::gfp]
otEx3767-69: three independent lines of Ex[cog-1promA-ot201::gfp; elt-2::gfp]
otEx3770-72: three independent lines of Ex[cog-1promB::gfp; elt-2::gfp]
otEx3782-84: three independent lines of Ex[cog-1promA-del2::gfp; elt-2::gfp]
otEx3527, 3536, 3538: three independent lines of Ex[gcy-1prom::gfp; elt-2::gfp]
otEx3550-52: three independent lines of Ex[gcy-1prom-del1::gfp; elt-2::gfp]
otEx3553-56: three independent lines of Ex[gcy-1prom-del2::gfp; elt-2::gfp]
otEx3779-81: three independent lines of Ex[gcy-1prom-del1,2::gfp; elt-2::gfp]
otEx3773-78: six independent rescuing lines of Ex[cog-1; elt-2::gfp]

Comparative genome hybridization

Details of the DNA preparation and aCGH procedure can be found in the accompanying manuscript and in Maydan et al. Briefly, Roche NimbleGen Inc. manufactured the microarrays and performed the DNA labeling, hybridization, image analysis and extraction of raw fluorescence intensities. The DNA samples (ot119 and ot201) were labeled with Cy3 and the reference DNA (wild-type control; all strains contain the otIs114 transgene) with Cy5. No background was subtracted before calculating the log2 ratio values, and normalization used a LOESS regression.
The microarray contains a single copy of all possible 50-mer oligonucleotides targeting each strand of chromosome II between coordinates 14743042 and 15068429, with the exception of the oligonucleotides affected by the repeats listed in WormBase data freeze WS180 and the oligonucleotides exceeding 150 cycles in NimbleGen’s synthesis process. No other constraints were applied in designing the oligonucleotides.

DNA constructs and generation of transgenic lines

The cog-1promA reporter construct was created by TOPO cloning a 6081 bp PCR fragment extending from -6081 to -1 bp relative to the transcriptional start site into vector pCR 2.1, using a PstI site and a BamHI site introduced at either end of the PCR product; this fragment was subsequently subcloned into the pPD95.75 expression vector. The cog-1promB reporter construct was created by cloning a 2104 bp PCR fragment extending from -6081 to -3977 bp relative to the transcriptional start site directly into pPD95.75, using HindIII and XbaI sites introduced at either end of the PCR product. The gcy-1prom::gfp reporter construct was created by cloning a 712 bp fragment extending from -712 to -1 bp into pPD95.75, using HindIII and BamHI restriction sites introduced at either end of the PCR product. The cog-1promA-ot119, cog-1promA-ot201, cog-1promA-del2, gcy-1prom-del1, gcy-1prom-del2, and gcy-1prom-del1,2 reporter constructs were created with the QuikChange II XL Site-Directed Mutagenesis Kit (Stratagene), using the cog-1promA or gcy-1prom reporter construct as the original template.
Primer sequences for these constructs are as follows:

cog-1promA PstI fwd: ATGCCTGCAGCCAAAGATGGTTTTTTTCCAC
cog-1promA BamHI rev: TAGAGGATCCTCTGGTTATGGTAGAGGGGAG
cog-1promB HindIII fwd: TTAAGCTTTCCAAAGATGGTTTTTTTCC
cog-1promB XbaI rev: TTTCTAGAGGGCTAGTTTTGGGAAAACG
cog-1promA-ot119 fwd: GCGAGACGAGAAGCTCTATCATCATTATAATTTTC
cog-1promA-ot119 rev: GAAAATTATAATGATGATAGAGCTTCTCGTCTCGC
cog-1promA-ot201 fwd: GAAAGCAGCGAGACGAAAAGCCCTATCATCA
cog-1promA-ot201 rev: TGATGATAGGGCTTTTCGTCTCGCTGCTTTC
cog-1promA-del2 fwd: CATATTTTCTATCTACATCATCATCTTC
cog-1promA-del2 rev: GAAGATGATGATGTAGATAGAAAATATG
gcy-1prom HindIII fwd: TTAAGCTTGTGTACTACAACAAGGGACTTTG
gcy-1prom BamHI rev: TTGGATCCCGAAAAGATAATTTCAAAACAAT
gcy-1prom-del1 fwd: TTTAACTAACTTGCAGACTTAATTTGGAAACATCATGATAAA
gcy-1prom-del1 rev: TGATGTTTCCAAATTAAGTCTGCAAGTTAGTTAAAATATTTT
gcy-1prom-del2 fwd: CACATTATCAAGAAAAACTAGTGTATTTTATTGGGAAATTTC
gcy-1prom-del2 rev: CCCAATAAAATACACTAGTTTTTCTTGATAATGTGTGTAGTT

cog-1prom constructs were injected into otIs151 worms at a concentration of 25 ng/µl with 45 ng/µl elt-2::gfp injection marker. gcy-1prom constructs were first PCR amplified with primers gcy-1prom HindIII fwd and BamHI rev, and the resulting linear PCR product was injected into otIs151 at 50 ng/µl with 45 ng/µl elt-2::gfp injection marker. Resulting lines were maintained at 20°C and scored under a Zeiss Imager.Z1 microscope. ot119 and ot201 mutants were rescued by injection of a PCR fragment containing the full cog-1 genomic region, from -3977 bp relative to the transcriptional start site through 975 bp downstream of the stop codon, at 25 ng/µl with 45 ng/µl elt-2::gfp injection marker. The primers used to amplify the cog-1 locus were as follows:

cog-1prom1 5’: TTGCATGCCTTGCTCAATGTACGTATATG
cog-1 3’UTR rvs: GTACGCTCCAGTTCAGAC

Bioinformatic analysis of the ASE motif

Position weight matrix calculation and scoring

We derived the ASE motif position weight matrix from a set of 37 12-mer sites known to be functional (ETCHBERGER et al. 2007).
The weights are calculated as a 4 x 12 matrix of the frequency of each possible nucleotide at each position. We did not add pseudo-counts to the counts derived from the functional sites. The ‘score’ referred to in the text for a given putatively functional ASE motif is defined as follows, for a locus in the genome with sequence GAAGCCATAATT. In the following, SIF means ‘site is functional’ and GAA… means ‘sequence is GAAGCCATAATT’:

\[
P(\mathrm{SIF} \mid \mathrm{GAA\ldots}) = \frac{P(\mathrm{GAA\ldots} \mid \mathrm{SIF}) \, P(\mathrm{SIF})}{P(\mathrm{GAA\ldots})}
\]

where

\[
P(\mathrm{GAA\ldots} \mid \mathrm{SIF}) = M_{1,G} \, M_{2,A} \, M_{3,A} \, M_{4,G} \cdots M_{12,T}
\]

where the M’s are the cells of the weight matrix, and

\[
P(\mathrm{GAA\ldots}) = P(\mathrm{GAA}) \, P(\mathrm{G} \mid \mathrm{GAA}) \, P(\mathrm{C} \mid \mathrm{AAG}) \cdots P(\mathrm{T} \mid \mathrm{AAT})
\]

which is a second-order Markov chain. The main effect of using a second-order Markov chain for the background probability is to downgrade the score of poly-A runs, whose background probability would be poorly estimated by a zeroth-order Markov chain. For all sites, P(SIF) is an unknown constant equal to the total number of functional sites in the genome divided by the size of the genome. Since it is constant, it does not affect comparisons between scores, so we set it equal to 1 in our score calculations.

Retrieving ASE motifs

In a series of statistical experiments, we retrieved all ASE motifs occurring anywhere upstream of all 20,183 C. elegans genes, using an in-house version of CisOrtho (to be published). The upstream region was defined as the interval between the translation start site (ATG of first exon) and the boundary of the nearest exon of a neighboring gene, or 5,000 bases, whichever came first. We used version WS187 of the C. elegans genome and gene annotations from ftp://ftp.wormbase.org/pub/wormbase/genomes/elegans. For genes with multiple isoforms, we defined unique gene bounds as the furthest 5’ and 3’ extent of any isoform.
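As an illustration of the motif score defined under ‘Position weight matrix calculation and scoring’, the following Python sketch computes the site likelihood from a frequency matrix without pseudo-counts and divides it by a Markov-chain background probability, with P(SIF) set to 1. It is not the in-house CisOrtho code. All function names and the toy sequences are hypothetical, the background is assumed to be estimated from k-mer counts in a supplied sequence, and the context length is left as a parameter, `order`, with 2 as an assumed default.

```python
from collections import Counter

BASES = "ACGT"

def build_pwm(sites):
    """Per-position nucleotide frequencies from aligned sites (no pseudo-counts)."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({base: counts[base] / len(sites) for base in BASES})
    return pwm

def site_likelihood(pwm, seq):
    """P(seq | SIF): product of the matrix cells selected by the sequence."""
    p = 1.0
    for column, base in zip(pwm, seq):
        p *= column[base]
    return p

def train_background(sequence, order=2):
    """Collect the k-mer counts needed for a chain with `order` bases of context."""
    n = len(sequence)
    starts = Counter(sequence[i:i + order] for i in range(n - order + 1))
    contexts = Counter(sequence[i:i + order] for i in range(n - order))
    transitions = Counter(sequence[i:i + order + 1] for i in range(n - order))
    return {"order": order, "starts": starts,
            "contexts": contexts, "transitions": transitions}

def background_probability(model, seq):
    """P(seq) under the background chain; 0.0 if a needed k-mer was never seen."""
    order = model["order"]
    p = model["starts"][seq[:order]] / sum(model["starts"].values())
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        seen = model["contexts"][context]
        if seen == 0:
            return 0.0
        p *= model["transitions"][context + seq[i]] / seen
    return p

def motif_score(pwm, model, seq):
    """P(seq | SIF) / P(seq), i.e. the posterior with P(SIF) set to 1."""
    background = background_probability(model, seq)
    if background == 0.0:
        raise ValueError("background probability is zero for " + seq)
    return site_likelihood(pwm, seq) / background

# Toy data for illustration only (not the 37 sites or the genome used in the study).
toy_pwm = build_pwm(["AAC", "AAC", "AAG", "ACC"])
toy_background = train_background("AACAACAAG", order=2)
print(motif_score(toy_pwm, toy_background, "AAC"))
```

Because no pseudo-counts are added, any site containing a nucleotide never observed at that position among the training sites receives a score of exactly zero, which matches the choice described above.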
Top-N motif score

The top-N motif score for a given gene and value of N is defined as the probability that all N top-scoring upstream ASE motifs are functional:

\[
P(\mathrm{SIF}_1, \mathrm{SIF}_2, \ldots, \mathrm{SIF}_N \mid \mathrm{seq}_1, \mathrm{seq}_2, \ldots, \mathrm{seq}_N) = \prod_{m=1}^{N} P(\mathrm{SIF}_m \mid \mathrm{seq}_m)
\]

If N upstream sites do not exist for a gene, the top-N motif score is defined to be zero. We use this score to test the hypothesis that N sites are required to be functional for a given gene. In our experiments, N varies from 1 to 10. As with individual motif scores, the numbers are merely proportional to probabilities through P(SIF). However, since we always compare top-N motif scores to others of the same N value, omitting this factor makes no difference to the overall ordering.

SAGE tag score

After considering a few different ways of sorting genes based on SAGE tag criteria, we found the most predictive quantity to be a modified ratio of ASE tag count to AFD tag count:

\[
\mathrm{ratio} =
\begin{cases}
\mathrm{ASE}/\mathrm{AFD} & \mathrm{ASE} > 0 \text{ and } \mathrm{AFD} > 0 \\
\mathrm{ASE} & \mathrm{ASE} > 0 \text{ and } \mathrm{AFD} = 0 \\
0 & \mathrm{ASE} = 0 \text{ and } \mathrm{AFD} \ge 0
\end{cases}
\]

Correlation analysis with SAGE data

In addition to the correlation experiments between ASE motifs and ASE expression discussed in the text, we performed the same ROC curve analysis using SAGE tag ratios as the ‘predictive quantity’, in the same way the top-N motif score was used, despite their experimental origin. We found the tag ratio to be a very good predictor of ASE expression (as defined by the 52 gene set). The top four genes in this ordering are all ASE-expressed. Among the top 1095 (5%) genes, 24 (50%) are ASE-expressed. The tag ratio turned out to be a better predictor of ASE expression than any of the top-N motif score based predictions. In light of this we also tested the relationship between SAGE tag ratios and the top-N motif score directly. Unlike the other analyses, however, this required first interpreting each gene’s tag ratio as a discrete binary classification of ASE expression.
To achieve this we tried several different cutoff tag ratios ranging from 1 to 10, calling genes above the cutoff ‘ASE-expressed’ and the rest ‘not ASE-expressed’. We did a separate ROC analysis of each classification against each of the 10 top-N motif scores. Unfortunately, none yielded a strong correlation. The best involved a SAGE tag ratio cutoff of 7, which defined 55 ‘ASE-expressed’ genes. We did, however, again observe the same trend, namely that as N increased, the top-N motif score became a better predictor of the SAGE tag ratio based classification of ASE expression. We speculate that the weak overall correlation between SAGE tag ratios and top-N motif score is due to the considerable error introduced by forcing the SAGE tag ratio, a continuous quantity, into a binary classification. Also, SAGE tag ratios and top-N motif scores are only indirectly related through actual ASE expression. It is therefore to be expected that their correlation is weaker than either direct correlation.

Combining top-N motif ranking with SAGE tag ranking

The ASE motif scores and the SAGE tag score each imply a ranking of all the genes in C. elegans. The ranking is the integer position of the gene in the list sorted by that criterion. We speculated that positive evidence (a high rank) is a more reliable indicator of ASE expression than negative evidence is of its absence, and further that this principle holds for both SAGE tag and top-N motif score based evidence. To exploit this we defined a ‘combined rank’ as the best rank attained by a gene between its SAGE tag ratio score and top-4 motif score. We performed ROC analysis using this combined rank and found it to give a modest improvement over the SAGE tag ratio score alone, making it the best overall predictor of ASE expression.
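The three per-gene quantities used in these analyses (top-N motif score, modified SAGE tag ratio, and combined rank) can be sketched in Python as follows. This is a minimal illustration, not the code used for the analyses. The function names and toy values are hypothetical, the tag ratio is assumed to equal ASE/AFD when the AFD count is positive and the raw ASE count otherwise, and ties in rank are assumed to be broken by input order, which the text does not specify.

```python
from math import prod

def top_n_motif_score(site_scores, n):
    """Product of the n highest site scores; zero if fewer than n sites exist."""
    if len(site_scores) < n:
        return 0.0
    return prod(sorted(site_scores, reverse=True)[:n])

def sage_tag_ratio(ase_count, afd_count):
    """Modified ASE/AFD tag ratio (assumed form: raw ASE count when AFD is zero)."""
    if afd_count > 0:
        return ase_count / afd_count
    return float(ase_count)

def ranks(scores):
    """1-based position of each gene when sorted by descending score.

    Ties are broken by input order (an assumption, since sorted() is stable).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = [0] * len(scores)
    for position, gene_index in enumerate(order, start=1):
        rank[gene_index] = position
    return rank

def combined_rank(sage_scores, motif_scores):
    """Best (smallest) rank each gene attains under either criterion."""
    return [min(a, b) for a, b in zip(ranks(sage_scores), ranks(motif_scores))]

# Toy example with three hypothetical genes.
toy_sage = [sage_tag_ratio(6, 3), sage_tag_ratio(4, 0), sage_tag_ratio(0, 0)]
toy_motif = [top_n_motif_score(s, 2) for s in ([2.0, 0.5, 3.0], [2.0], [1.0, 1.0])]
print(combined_rank(toy_sage, toy_motif))
```

Taking the minimum of the two ranks reflects the stated reasoning that a high rank under either criterion is treated as more informative than a low rank under the other.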
Raw Data Format

Supplementary Table 1 is the complete set of data used in all analyses presented in the main text, including the SAGE tag ratio correlation and combined rank experiments. It has one line per C. elegans gene with the following columns (the explanation is reproduced in the Excel file as well):

Gene: WormBase ID, e.g. ‘R03C1.3’
Common: common name of the gene, e.g. ‘cog-1’
ase_expressed: 1 if belonging to the 52 ASE-expressed gene set, 0 if not
ase_tag_count: SAGE ASE tag count
afd_tag_count: SAGE AFD tag count
site_info: list of all upstream ASE motifs up to 5000 bases, formatted as "offset:score:sequence", where offset is the number of bases upstream, score is the weight matrix score as defined above, and the sequence is given for the site on the strand that it is on
sage_tag_ratio: SAGE tag score as defined above
xvalues_sage_ratio: number of ‘ase_expressed = 0’ genes sorted above this gene when sorted by sage_tag_ratio
yvalues_sage_ratio: number of ‘ase_expressed = 1’ genes sorted above this gene when sorted by sage_tag_ratio
sage_motif_combined_rank: ordering based on the best ranking a gene attained from either score4 or sage_tag_ratio
xvalues_combined: number of ‘ase_expressed = 0’ genes sorted above this gene when sorted by sage_motif_combined_rank
yvalues_combined: number of ‘ase_expressed = 1’ genes sorted above this gene when sorted by sage_motif_combined_rank
scoreN: top-N motif score
xvaluesN: number of ‘ase_expressed = 0’ genes sorted above this gene when sorted by scoreN
yvaluesN: number of ‘ase_expressed = 1’ genes sorted above this gene when sorted by scoreN

Data Visualization

We used R (R-project.org) and the user-submitted packages ‘ROCR’ and ‘lattice’, available from the CRAN section of the website.

LITERATURE CITED

BRENNER, S., 1974 The genetics of Caenorhabditis elegans. Genetics 77: 71-94.

CHANG, S., R. J. JOHNSTON, JR. and O. HOBERT, 2003 A transcriptional regulatory cascade that controls left/right asymmetry in chemosensory neurons of C. elegans.
Genes Dev 17: 2123-2137.

ETCHBERGER, J. F., A. LORCH, M. C. SLEUMER, R. ZAPF, S. J. JONES et al., 2007 The molecular signature and cis-regulatory architecture of a C. elegans gustatory neuron. Genes Dev 21: 1653-1674.

INOUE, T., M. WANG, T. O. RIRIE, J. S. FERNANDES and P. W. STERNBERG, 2005 Transcriptional network underlying Caenorhabditis elegans vulval development. Proc Natl Acad Sci U S A.

JOHNSTON, R. J., and O. HOBERT, 2003 A microRNA controlling left/right neuronal asymmetry in Caenorhabditis elegans. Nature 426: 845-849.

PALMER, R. E., T. INOUE, D. R. SHERWOOD, L. I. JIANG and P. W. STERNBERG, 2002 Caenorhabditis elegans cog-1 Locus Encodes GTX/Nkx6.1 Homeodomain Proteins and Regulates Multiple Aspects of Reproductive System Development. Dev Biol 252: 202-213.

SARIN, S., M. M. O'MEARA, E. B. FLOWERS, C. ANTONIO, R. J. POOLE et al., 2007 Genetic Screens for Caenorhabditis elegans Mutants Defective in Left/Right Asymmetric Neuronal Fate Specification. Genetics 176: 2109-2130.