Supplemental Figure Legends

Figure S1. Pipeline for comparing the PS1010 genome to those of C. elegans and C. briggsae. Protein-coding genes predicted by AUGUSTUS on the RNAPATH supercontigs are analyzed for orthology and domains with OrthoMCL and PFAM; their gene activity is quantitated with ERANGE. The PS1010 assembly is aligned with the C. elegans and C. briggsae genomes with TBA/multiz; conserved regions of the alignment are found with phastCons.

Figure S2. Size distributions of highly conserved noncoding elements passing all filters (listed in Table S4). Four sets of elements are shown: all 2,672 noncoding elements; a subset of 1,193 elements which have at least one match to at least one motif in Table S6; a subset of 113 elements with at least one match to motif 1-1 (Tables 2 and S6), which is equivalent to the slr2/jmjc-1-responsive motif of Kirienko and Fay (2010); and a subset of 128 elements with at least 1 nt of overlap with the 3,672 RNAz-predicted ncRNAs of Missal et al. (2006). Size ranges of these sets of elements are in Table S4.

Supplemental Tables

Table S1. PFAM domains conserved in PS1010 but lost in the Elegans group.

PS1010 genes PFAM accession Domain name Description Also found in 9 6 5 PF08281.5 PF01842.18 PF08245.5 Sigma70_r4_2 ACT Mur_ligase_M Sigma-70, region 4 ACT domain Mur ligase middle domain 4 PF01580.11 FtsK_SpoIIIE FtsK/SpoIIIE family 3 PF12146.1 Hydrolase_4 2 PF00437.13 GSPII_E 2 PF01408.15 GFO_IDH_MocA 2 PF02548.8 Pantoate_transf Putative lysophospholipase Type II/IV secretion system protein Oxidoreductase family, NAD-binding Rossmann fold Ketopantoate hydroxymethyltransferase M. hapla P. pacificus B. malayi, P. pacificus B. malayi, M. incognita, P. pacificus P. pacificus B. malayi, M. incognita, P.
pacificus 2 PF05188.10 MutS_II 2 PF09849.2 DUF2076 1 PF00712.12 DNA_pol3_beta 1 PF00805.15 Pentapeptide 1 PF01132.13 EFP 1 PF01275.12 Myelin_PLP 1 PF01817.14 CM_2 1 PF01930.10 Cas_Cas4 1 PF01935.10 DUF87 1 PF02009.9 Rifin_STEVOR 1 PF02355.9 SecD_SecF 1 PF02562.9 PhoH PhoH-like protein M. hapla, M. incognita, P. pacificus 1 PF02569.8 Pantoate_ligase Pantoate-beta-alanine ligase M. hapla, M. incognita 1 PF03176.8 MMPL MMPL family 1 PF03298.6 Stanniocalcin Stanniocalcin family 1 PF03448.10 MgtE_N 1 PF03623.6 Focal_AT 1 PF04858.6 TH1 MutS domain II Uncharacterized protein conserved in bacteria DNA polymerase III beta subunit, N-terminal domain Pentapeptide repeats (8 copies) Elongation factor P (EF-P) OB domain Myelin proteolipid protein (PLP or lipophilin) Chorismate mutase type II Domain of unknown function (DUF83) Domain of unknown function Rifin/stevor family Protein export membrane protein MgtE intracellular N domain Focal adhesion targeting region TH1 protein M. hapla, M. incognita M. incognita B. malayi, M. hapla, P. pacificus M. hapla, M. incognita B. malayi B. malayi M. hapla B. malayi, M. hapla, P. pacificus M. incognita B. malayi, P. pacificus M. hapla, M. incognita M. incognita M. incognita B. malayi, M. incognita, P. pacificus M. incognita, P. pacificus P. pacificus B. malayi B. malayi, M. hapla, M. incognita, P. 
pacificus 1 PF05495.5 zf-CHY 1 PF05673.6 DUF815 1 PF05893.7 LuxC 1 PF06209.6 COBRA1 1 PF07224.4 Chlorophyllase 1 PF07464.4 ApoLp-III 1 PF07517.7 SecA_DEAD 1 PF07991.5 IlvN 1 PF08127.6 Propeptide_C1 1 PF08284.4 RVP_2 1 PF08320.5 PIG-X 1 PF08333.4 DUF1725 1 PF08615.4 RNase_H2_suC 1 PF10345.2 Cohesin_load 1 PF10534.2 CRIC_ras_sig 1 PF11259.1 DUF3060 1 PF11464.1 Rbsn 1 PF12130.1 DUF3585 1 PF12140.1 DUF3588 CHY zinc finger Domain of unknown function Acyl-CoA reductase (LuxC) Cofactor of BRCA1 (COBRA1) Chlorophyllase Apolipophorin-III precursor (apoLp-III) SecA DEAD-like domain Acetohydroxy acid isomeroreductase, catalytic domain Peptidase family C1 propeptide Retroviral aspartyl protease PIG-X / PBN1 Domain of unknown function Ribonuclease H2 noncatalytic subunit (Ylr154p-like) Cohesin loading factor Connector enhancer of kinase suppressor of ras Domain of unknown function FYVE-finger-containing Rab5 effector protein rabenosyn-5 Domain of unknown function Domain of unknown function B. malayi, P. pacificus P. pacificus M. incognita B. malayi, M. hapla, P. pacificus B. malayi M. hapla P. pacificus M. incognita B. malayi B. malayi, M. incognita P. pacificus B. malayi B. malayi, M. hapla, M. incognita, P. pacificus B. malayi P. pacificus B. malayi B. malayi B. malayi, P. pacificus B. malayi, P. pacificus

Two motifs in boldface, PF04858.6 and PF08615.4, were found in PS1010 and in all four non-Caenorhabditis nematode proteomes examined, but were missing from both C. elegans and C. briggsae. The E-value threshold for motif detection was 10⁻⁶.

Table S2. PFAM domains conserved in six non-PS1010 nematode genomes but not found in PS1010.

C. elegans genes  PFAM accession  Domain name      Description
5                 PF01909.16      NTP_transf_2     Nucleotidyltransferase domain
4                 PF00752.10      XPG_N            XPG N-terminal domain
4                 PF02793.15      HRM              Hormone receptor domain
3                 PF03619.9       DUF300           Domain of unknown function
3                 PF09762.2       KOG2701          Coiled-coil domain-containing protein (DUF2037)
2                 PF01974.10      tRNA_int_endo    tRNA intron endonuclease, catalytic C-terminal domain
2                 PF02837.11      Glyco_hydro_2_N  Glycosyl hydrolases family 2, sugar binding domain
2                 PF03016.8       Exostosin        Exostosin family
2                 PF03531.7       SSrecog          Structure-specific recognition protein (SSRP1)
2                 PF04676.7       CwfJ_C_2         Protein similar to CwfJ C-terminus 2
2                 PF06624.5       RAMP4            Ribosome associated membrane protein RAMP4
2                 PF07039.4       DUF1325          SGF29 tudor-like domain
2                 PF08321.5       PPP5             PPP5
1                 PF00794.11      PI3K_rbd         PI3-kinase family, ras-binding domain
1                 PF00832.13      Ribosomal_L39    Ribosomal L39 protein
1                 PF01198.12      Ribosomal_L31e   Ribosomal protein L31e
1                 PF01215.12      COX5B            Cytochrome c oxidase subunit Vb
1                 PF01331.12      mRNA_cap_enzyme  mRNA capping enzyme, catalytic domain
1                 PF01599.12      Ribosomal_S27    Ribosomal protein S27a
1                 PF01661.14      Macro            Macro domain
1                 PF01776.10      Ribosomal_L22e   Ribosomal L22e protein family
1                 PF02291.8       TFIID-31kDa      Transcription initiation factor IID, 31kD subunit
1                 PF02516.7       STT3             Oligosaccharyl transferase STT3 subunit
1                 PF02836.10      Glyco_hydro_2_C  Glycosyl hydrolases family 2, TIM barrel domain
1                 PF03179.8       V-ATPase_G       Vacuolar (H+)-ATPase G subunit
1                 PF03281.7       Mab-21           Mab-21 protein
1                 PF03484.8       B5               tRNA synthetase B5 domain
1                 PF03638.8       CXC              Tesmin/TSO1-like CXC domain
1                 PF03850.7       Tfb4             Transcription factor Tfb4
1                 PF03911.9       Sec61_beta       Sec61beta family
1                 PF03919.8       mRNA_cap_C       mRNA capping enzyme, C-terminal domain
1                 PF03986.6       Autophagy_N      Autophagocytosis associated protein (Atg3), N-terminal domain
1                 PF04112.6       Mak10            Mak10 subunit, NatC N(alpha)-terminal acetyltransferase
1                 PF04114.7       Gaa1             Gaa1-like, GPI transamidase component
1                 PF04129.5       Vps52            Vps52 / Sac2 family
1                 PF04430.7       DUF498           Domain of unknown function (DUF498/DUF598)
1                 PF04615.6       Utp14            Utp14 protein
1                 PF04934.7       Med6             MED6 mediator sub complex component
1                 PF05026.6       DCP2             Dcp2, box A domain
1                 PF05131.7       Pep3_Vps18       Pep3/Vps18/deep orange family
1                 PF05493.6       ATP_synt_H       ATP synthase subunit H
1                 PF05743.6       UEV              UEV domain
1                 PF06025.5       DUF913           Domain of unknown function
1                 PF06047.4       SynMuv_product   Ras-induced vulval development antagonist
1                 PF06093.6       Spt4             Spt4/RpoE2 zinc finger
1                 PF06094.5       AIG2             AIG2-like family
1                 PF06105.5       Aph-1            Aph-1 protein
1                 PF06432.4       GPI2             Phosphatidylinositol N-acetylglucosaminyltransferase
1                 PF06732.4       Pescadillo_N     Pescadillo N-terminus
1                 PF06859.5       Bin3             Bicoid-interacting protein 3 (Bin3)
1                 PF06978.4       POP1             Ribonucleases P/MRP protein subunit POP1
1                 PF07047.5       OPA3             Optic atrophy 3 protein (OPA3)
1                 PF07289.4       DUF1448          Protein of unknown function (DUF1448)
1                 PF07542.4       ATP12            ATP12 chaperone protein
1                 PF07575.6       Nucleopor_Nup85  Nup85 Nucleoporin
1                 PF07910.6       Peptidase_C78    Peptidase family C78
1                 PF07947.7       YhhN             YhhN-like protein
1                 PF08038.5       Tom7             TOM7 family
1                 PF08075.4       NOPS             NOPS (NUC059) domain
1                 PF08170.5       POPLD            POPLD (NUC188) domain
1                 PF08221.4       HTH_9            RNA polymerase III subunit RPC82 helix-turn-helix domain
1                 PF08295.5       HDAC_interact    Histone deacetylase (HDAC) interacting
1                 PF08324.4       PUL              PUL domain
1                 PF08366.6       LLGL             LLGL2
1                 PF08375.4       Rpn3_C           Proteasome regulatory subunit C-terminal
1                 PF08492.5       SRP72            SRP72 RNA-binding domain
1                 PF08518.4       GIT_SHD          Spa2 homology domain (SHD) of GIT
1                 PF08825.3       E2_bind          E2 binding domain
1                 PF08923.3       MAPKK1_Int       Mitogen-activated protein kinase kinase 1 interacting
1                 PF09270.3       Beta-trefoil     Beta-trefoil
1                 PF09271.4       LAG1-DNAbind     LAG1, DNA binding
1                 PF09785.2       Prp31_C          Prp31 C terminal domain
1                 PF10044.2       Ret_tiss         Retinal tissue protein
1                 PF10228.2       DUF2228          Uncharacterised conserved protein
1                 PF10240.2       DUF2464          Domain of unknown function
1                 PF10354.2       DUF2431          Domain of unknown function
1                 PF10377.2       ATG11            Autophagy-related protein 11
1                 PF10484.2       MRP-S23          Mitochondrial ribosomal protein S23
1                 PF10509.2       GalKase_gal_bdg  Galactokinase galactose-binding signature
1                 PF10597.2       U5_2-snRNA_bdg   U5-snRNA binding site 2 of PrP8
1                 PF10598.2       RRM_4            RNA recognition motif of the spliceosomal PrP8
1                 PF12353.1       eIF3g            Eukaryotic translation initiation factor 3 subunit G
1                 PF12457.1       TIP_N            Tuftelin interacting protein N terminal
1                 PF12513.1       SUV3_C           Mitochondrial degradasome RNA helicase subunit C terminal
1                 PF12572.1       DUF3752          Domain of unknown function

Table S3. The 50 most frequent PFAM-A domains in PS1010.

Genes  PFAM accession  Domain name     Description
370    PF00069.18      Pkinase         Protein kinase domain
340    PF07714.10      Pkinase_Tyr     Protein tyrosine kinase
250    PF10326.2       7TM_GPCR_Str    Serpentine type 7TM GPCR chemoreceptor Str
242    PF10318.2       7TM_GPCR_Srh    Serpentine type 7TM GPCR chemoreceptor Srh
199    PF10317.2       7TM_GPCR_Srd    Serpentine type 7TM GPCR chemoreceptor Srd
196    PF10319.2       7TM_GPCR_Srj    Serpentine type 7TM GPCR chemoreceptor Srj
175    PF07690.9       MFS_1           Major Facilitator Superfamily
160    PF01391.11      Collagen        Collagen triple helix repeat (20 copies)
150    PF00005.20      ABC_tran        ABC transporter
134    PF00001.14      7tm_1           7 transmembrane receptor (rhodopsin family)
120    PF00400.25      WD40            WD domain, G-beta repeat
119    PF02931.16      Neur_chan_LBD   Neurotransmitter-gated ion-channel ligand binding domain
106    PF01484.10      Col_cuticle_N   Nematode cuticle collagen N-terminal domain
103    PF00651.24      BTB             BTB/POZ domain
90     PF00059.14      Lectin_C        Lectin C-type domain
90     PF00102.20      Y_phosphatase   Protein-tyrosine phosphatase
86     PF00076.15      RRM_1           RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain)
86     PF00104.23      Hormone_recep   Ligand-binding domain of nuclear hormone receptor
83     PF00105.11      zf-C4           Zinc finger, C4 type (two domains)
82     PF02932.9       Neur_chan_memb  Neurotransmitter-gated ion-channel transmembrane region
80     PF10327.2       7TM_GPCR_Sri    Serpentine type 7TM GPCR chemoreceptor Sri
79     PF00271.24      Helicase_C      Helicase conserved C-terminal domain
76     PF01431.14      Peptidase_M13   Peptidase family M13
75     PF07679.9       I-set           Immunoglobulin I-set domain
72     PF00106.18      adh_short       short chain dehydrogenase
71     PF00071.15      Ras             Ras family
70     PF00083.17      Sugar_tr        Sugar (and other) transporter
69     PF00046.22      Homeobox        Homeobox domain
67     PF00023.23      Ank             Ankyrin repeat
65     PF00025.14      Arf             ADP-ribosylation factor family
63     PF08477.6       Miro            Miro-like protein
63     PF10328.2       7TM_GPCR_Srx    Serpentine type 7TM GPCR chemoreceptor Srx
62     PF07885.9       Ion_trans_2     Ion channel
61     PF00270.22      DEAD            DEAD/DEAH box helicase
61     PF01549.17      ShK             ShK domain-like
57     PF02463.12      SMC_N           RecF/RecN/SMC N terminal domain
56     PF00431.13      CUB             CUB domain
56     PF00004.22      AAA             ATPase family associated with various cellular activities (AAA)
55     PF00036.25      efhand          EF hand
55     PF01060.16      DUF290          Transthyretin-like family
54     PF10320.2       7TM_GPCR_Srsx   Serpentine type 7TM GPCR chemoreceptor Srsx
54     PF00635.19      Motile_Sperm    MSP (Major sperm protein) domain
54     PF00149.21      Metallophos     Calcineurin-like phosphoesterase
53     PF00201.11      UDPGT           UDP-glucoronosyl and UDP-glucosyl transferase
53     PF00096.19      zf-C2H2         Zinc finger, C2H2 type
52     PF01697.20      DUF23           Domain of unknown function
51     PF03125.11      Sre             C. elegans Sre G protein-coupled chemoreceptor
50     PF00047.18      ig              Immunoglobulin domain
49     PF00041.14      fn3             Fibronectin type III domain
49     PF00595.17      PDZ             PDZ domain (Also known as DHR or GLGF)

Table S4. Filtering of phastCons elements with PS1010-elegans-briggsae conservation.
DNA elements                                Number  Total nt   % genome  Min. nt  Avg. nt  Max. nt
Original, unfiltered PhastCons predictions  95,712  6,231,603  6.21%     3        65       2,514
+80% overlap w/ elegans/PS1010 alignment    95,711  6,231,579  6.21%     3        65       2,514
+80% overlap w/ elegans/briggsae alignment  85,819  5,740,830  5.72%     3        67       2,514
No RepeatMasker                             84,616  5,624,280  5.61%     3        66       2,514
No simple repeats                           84,053  5,572,614  5.56%     3        66       2,514
No WS210 protein-coding exons               4,964   188,690    0.19%     7        38       444
No WS210 ncRNA exons                        4,545   162,811    0.16%     7        36       444
No sig. BlastN hits to WS210 ncRNA          4,494   160,846    0.16%     7        36       444
No mGene protein-coding exons               3,313   103,969    0.10%     7        31       173
No other alternative protein-coding exons   2,761   81,264     0.08%     7        29       160
No AUGUSTUS (+ 355K/pub. Rseq) exons        2,707   78,977     0.08%     7        29       160
No AUGUSTUS (+ 355K ESTs) exons             2,700   78,743     0.08%     7        29       160
No AUGUSTUS (+ public RNA-seq) exons        2,695   78,570     0.08%     7        29       160
No AUGUSTUS (ab initio) exons               2,677   77,909     0.08%     7        29       160
No AUGUSTUS (+ 2x75 nt N2 cDNA) exons *     2,672   77,635     0.08%     7        29       160
No overlap with 2006 RNAz                   2,544   73,035     0.07%     7        29       160
Any overlap with 2006 RNAz                  128     4,600      0.00%     9        36       126
50% overlap with 2006 RNAz                  118     4,156      0.00%     9        35       126
80% overlap with 2006 RNAz                  101     3,347      0.00%     9        33       120
No overlap w/ filtered 2006 RNAz            2,577   73,967     0.07%     7        29       160
Any overlap w/ filtered 2006 RNAz           95      3,668      0.00%     10       39       126
50% overlap w/ filtered 2006 RNAz           87      3,316      0.00%     10       38       126
80% overlap w/ filtered 2006 RNAz           72      2,551      0.00%     10       35       120
At least one match to a predicted motif     1,193   39,587     0.04%     9        33       160
No matches to any predicted motifs          1,479   38,048     0.04%     7        26       107
At least one match to motif 1-1             113     3,255      0.00%     11       29       148

The row of 2,672 elements (boldface in the original table, marked * here) is our filtered set of non-coding elements. Rows below it are subsets of these 2,672 sequences.

Table S5. Enrichment of GO terms for genes near candidate conserved non-coding elements.
GO ID       Associated genes  p-value   GO term description
GO:0000003  248 of 1478       7.67e-20  reproduction
GO:0040007  236 of 1429       4.73e-18  growth
GO:0009792  396 of 2811       2.16e-17  embryonic development ending in birth or egg hatching
GO:0040010  270 of 1783       1.33e-15  positive regulation of growth rate
GO:0010171  106 of 579        2.24e-11  body morphogenesis
GO:0005622  127 of 750        5.48e-11  intracellular
GO:0045449  50 of 199         9.26e-11  regulation of transcription
GO:0040035  120 of 706        1.44e-10  hermaphrodite genitalia development
GO:0018991  64 of 298         3.11e-10  oviposition
GO:0003677  95 of 529         6.50e-10  DNA binding
GO:0006355  103 of 618        7.59e-09  regulation of transcription, DNA-dependent
GO:0006811  41 of 168         1.01e-08  ion transport
GO:0005216  27 of 89          2.67e-08  ion channel activity
GO:0005840  33 of 138         4.10e-07  ribosome
GO:0003700  91 of 573         4.41e-07  transcription factor activity
GO:0040026  15 of 38          7.74e-07  positive regulation of vulval development
GO:0018996  43 of 211         9.00e-07  molting cycle, collagen and cuticulin-based cuticle
GO:0016020  146 of 1062       1.50e-06  membrane
GO:0003735  33 of 146         1.52e-06  structural constituent of ribosome
GO:0002009  58 of 328         1.76e-06  morphogenesis of an epithelium
GO:0005524  131 of 935        1.90e-06  ATP binding
GO:0006412  40 of 197         2.26e-06  translation
GO:0043565  76 of 475         2.62e-06  sequence-specific DNA binding

Table S6. Characteristics of DNA motifs predicted from filtered conserved DNA elements.

[See the Excel spreadsheet ps1010_GR2010_Table_S6.xls, provided in the Supplemental Files, for these data.]

In addition to the similarities described in the main text, our predicted motifs showed further similarities to published binding-site or computational consensus motifs for the following: the C. elegans neuron-specific N1-box (Ruvinsky et al., 2007); Drosophila Mad (Rushlow et al.,
2001), brinker (Rushlow et al., 2001), shn (Dai et al., 2000), and Trithorax-like (Trl, i.e., Drosophila GAGA; Mahmoudi et al., 2003); GAGA (Adkins et al., 2006); human CTGYNNCTYTAA (PF0082.1 in JASPAR PHYLOFACTS; Xie et al., 2005); and core promoter GC-box (Bucher, 1990) and XCPE1 (POL011.1 in JASPAR-POLII; Tokusumi et al., 2007).

Table S7. Genomic and RNA library statistics.

Library: Genomic, Genomic, Genomic, Genomic, RNA-seq
Read type: 1x75, 2x75, 2x75, 2x75, 2x75
Fragment size (bp): 200, 450, 375, 200
Raw (pair) count (millions of reads): 53.6, 40.2, 23.5, 24.8, 53.2
Nominal sequence (Gb): 4.0, 6.0, 3.5, 3.7, 8.0

Supplemental Methods

Purifying worms and extracting genomic DNA

Stock solutions. Worm lysis buffer: 0.1 M Tris-Cl pH 8.5, 0.1 M NaCl, 50 mM EDTA pH 8.0, 1% SDS. CTAB/NaCl solution: 10% CTAB (Sigma M-7635) in 0.7 M NaCl. These solutions tended to precipitate at room temperature but could be heated and redissolved. Protease K: 20 mg/ml in TE pH 8.0. Aliquots were stored at -20 °C until use. Other solutions such as M9 were as described in Lewis and Fleming (1995) or Sambrook and Russell (2001).

Worm preparation. We seeded at least four 10-cm nutrient agar/HB101 plates with PS1010 worms, and grew them for one week to get many starved L1 and older larvae. Starved larvae were harvested by washing with M9 buffer, and run through a sucrose gradient to remove both bacterial waste and agar chunks (starved PS1010 worms burrow intensely in agar, fragmenting it). We pooled the cleaned worms with further M9 washes and tabletop centrifugation, then aliquoted 100 µl of worms into a 1.5 ml microcentrifuge tube, snap-froze them in liquid N2, and stored them at -80 °C. Worms could be frozen and stored before actual DNA preps, allowing supplies of worms to be built up. Before DNA or RNA extraction, we cracked their cuticles with 2-3 freeze-thaw cycles between liquid N2 and a 37 °C water bath.
DNA preparation, day 1: We added ~4.5 ml of worm lysis buffer to a frozen ~500 µl aliquot of worms (to get a final volume of 5.0 ml, adjusting as needed), transferred the worms to a 15 ml disposable tube, added 200 µl of 20 mg/ml Protease K, and mixed by inversion. We incubated the worms at 62 °C for 60 minutes, mixing by gentle inversion 4-5 times during the incubation, while prechilling isopropanol at -20 °C. The solution cleared as the worms disintegrated. We added 800 µl of 5 M NaCl, mixed it thoroughly by inversion, added 800 µl of CTAB, and incubated the extract for 10 minutes at 37 °C. We extracted with one volume of chloroform and phenol/chloroform (in that order), using gentle inversion and tabletop centrifugation and recovering the aqueous phase. We added one volume of cold (-20 °C) isopropanol and mixed by inversion; good DNA preparations gave immediate threading that could be hand-picked into 70% EtOH with a micropipettor, while weaker preparations required centrifuging trace DNA precipitate. After three washes with 70% EtOH and 5 minutes of centrifugation, the supernatant was removed and the pellet air-dried before resuspending overnight in 340 µl TE + 10 µl RNAse A at 4 °C. The DNA could be left safely at this step indefinitely.

DNA preparation, day 2: We incubated the DNA solutions for 2 hours at 37 °C to drive RNAse activity to its conclusion. We added 20 µl of 20% SDS, 10 µl of 0.5 M EDTA pH 8.0, and 20 µl of Protease K, mixed by gentle inversion, and incubated at 62 °C for 2 hours. We mixed in 80 µl of 5 M ammonium acetate by gentle inversion, then extracted twice with phenol/chloroform and once with chloroform before EtOH precipitation, 70% EtOH washing, air-drying the genomic DNA, and resuspending it in 100 µl TE overnight at 4 °C. High-quality genomic DNA preparations were viscous and took at least a day to dissolve.

Extracting mixed-stage bulk RNA

Worms were grown, starved of E. coli, harvested, cleaned with sucrose gradients, snap-frozen, and stored as above.
One difference was that it was useful to wash the cleaned worms in S Basal buffer rather than M9, because S Basal's pH favored RNA purification. After cracking the cuticles by two rounds of freeze-thawing, we partially thawed the worms and ground them with an RNAse-free plastic pestle in their microcentrifuge tube, refroze them quickly, and mixed them with 350 µl of RLT buffer (Qiagen) + β-mercaptoethanol. To completely homogenize the worm extract, we passed the mixture through a disposable 20G 1 1/2 syringe ≥10 times. We microcentrifuged the mixture for 3 minutes, transferred the supernatant to a fresh microcentrifuge tube, and purified bulk RNA into a final volume of 100 µl (in RNAse-free TE) with the RNeasy mini kit, following the manufacturer's instructions. Because PS1010 does not give many gravid or fecund females even on nutrient agar with HB101, it was practically impossible to get large quantities of well-synchronized starved animals for an RNA harvest. We did get a mixed larval population that was dominated by younger larvae (L1 and L2 stages), but the overall population ranged up to a few starved adults.

Manipulating and viewing sequence sets

Sequences were visualized and intersected by uploading them to a local mirror of the UCSC Genome and Table Browsers (Kuhn et al., 2009). This mirror used the WS190 version of the C. elegans genome sequence; data from WS210 and other releases were mapped to WS190 coordinates. To comprehensively purge highly conserved sequences in the C. elegans genome of sequences that were previously annotated or that were unlikely to be regulatory, we first selected against annotated repetitive DNA sequences in the C. elegans genome (from the WS190 release of WormBase, used in the UCSC genome site which we mirrored). We also required conserved sequences to show 80% overlap in all comparisons of PS1010 to C. elegans and C. briggsae.
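The 80% overlap requirement can be illustrated with a minimal Python sketch. This is not the procedure actually used (the filtering was done through the UCSC Table Browser mirror); the half-open (start, end) interval representation, the function names overlap_fraction and passes_filter, and the example coordinates are all assumptions made for illustration.

```python
def overlap_fraction(element, blocks):
    """Return the fraction of `element` covered by alignment `blocks`.

    `element` is a half-open (start, end) interval; `blocks` is a list of
    half-open (start, end) intervals on the same sequence. Overlapping
    blocks are not double-counted. (Hypothetical helper, for illustration.)
    """
    start, end = element
    length = end - start
    if length <= 0:
        return 0.0
    covered = 0
    cursor = start  # everything left of `cursor` has already been counted
    for b_start, b_end in sorted(blocks):
        lo = max(b_start, cursor)
        hi = min(b_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / length


def passes_filter(element, blocks, min_fraction=0.8):
    """True if at least `min_fraction` of the element lies in aligned blocks."""
    return overlap_fraction(element, blocks) >= min_fraction


# Made-up coordinates: a 100-nt element and two overlapping aligned blocks.
element = (100, 200)
print(overlap_fraction(element, [(90, 150), (140, 170)]))  # 0.7, fails at 80%
print(passes_filter(element, [(90, 150), (140, 185)]))     # True (85% covered)
```

In this sketch an element would be kept only if it passed the test against both the elegans/PS1010 and the elegans/briggsae alignments, matching the two overlap rows of Table S4.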
We then generated the following sequence sets, which we imported as custom tracks to our UCSC mirror and used as filters: all exons of officially annotated protein-coding or non-coding RNA genes in the WS210 frozen release of WormBase; all predicted exons from the recent alternative genefinding analysis using mGene by Schweikert et al. (2009), performed on the WS180 release of the C. elegans genome; and all unofficially predicted exons from five alternative protein-coding genefinders archived in the WS204 release of WormBase: GeneMarkHMM (Borodovsky et al., 2003), mSplicer and mSplicer-ORF (Rätsch et al., 2007), nGASP/JIGSAW (Coghlan et al., 2008), and TWINSCAN (Wei et al., 2005). In particular, the WS210 set of official ncRNA genes was far more extensive than that for WS190, containing thousands of recently annotated 21U RNAs (Batista et al., 2008). To import these data to the WS190 coordinates of the UCSC mirror, we used remap_gff_between_releases.pl and unmap_gff_between_releases.pl from the remap.tar.bz2 package at ftp://ftp.sanger.ac.uk/pub2/wormbase/gary (G. Williams, personal communication). A small number of known ncRNAs passed all these filters, perhaps because of mapping errors; these were detected with BlastN against ncRNAs in WS210 and removed.

The original set of 3,672 putative ncRNAs predicted in C. elegans with RNAz by Missal et al. (2006) was generated with a version of the C. elegans genome (WS120) dating to March 2004. To find which of these predictions remained completely unaccounted for in the WS210 genome release (dating to December 2009), we filtered them in exactly the same way that we filtered conserved noncoding DNA elements; this yielded a still-novel subset of 1,290 RNAz-predicted ncRNAs.

cDNA-mediated genomic assembly

The RNAPATH module of ERANGE 3.2 was used with standard parameters as defined in its internal documentation.

Protein motifs

Major sperm protein genes were defined as those encoding the PFAM-A motif PF00635.19.
Serpentine receptor genes were defined as those encoding any domain whose name contained the abbreviation "7TM_GPCR". Gene mappings and counts were done with Perl.

Predicting and comparing DNA motifs

MEME will not accept input sequences shorter than 8 nt; we thus removed the (tiny) fraction of our elements that were 7 nt long before predicting motifs. Many of our elements are small and might have only partial overlaps with sites that are nevertheless real and worth detecting. We therefore predicted two sets of motifs: set 1 (with 23 members, named 1-1 through 1-23 in order of descending statistical significance) was extracted from the sequences of the filtered elements alone, whereas set 2 (with 30 members, named 2-1 through 2-30 in descending significance) was extracted from these sequences along with 5 nt of flanking DNA on each side of each element. To find which of these two sets of predicted motifs were nonredundant, we cross-checked the two sets with TOMTOM. Many motifs were indeed discovered in both sets (Table S6), but set 2 contained more predicted motifs than set 1, including five new predictions, among them a match to E2F that had not been found in set 1. Matches to plant- or fungal-specific motifs such as ABI4 or STE12 were omitted. All motifs were checked with FIMO for sites in the original set of elements (without the flanking 5 nt), and were found to exist in them. The final set of nonredundant motifs is summarized in Table 2; the full set of motifs is detailed in Table S6.

Author Contributions

A.M. devised RNAPATH, assembled genomes and transcriptomes, mapped expression data, compared Illumina and Sanger genomic assemblies, predicted protein-coding genes, aligned multiple genomes with TBA/multiz, extracted conserved regions with phastCons, and linked them to GO terms with Cistematic; E.M.S.
grew worms, extracted genomic DNA and bulk RNA, identified repetitive DNA, predicted protein-coding genes, analyzed orthology and domains of protein-coding genes, devised most filters for highly conserved non-coding elements, and analyzed DNA motifs; B.W. constructed cDNA libraries for Illumina sequencing; L.S. built genomic libraries and carried out Illumina sequencing; I.A. optimized sequencing protocols; B.J.W. and P.W.S. provided overall management. A.M. and E.M.S. wrote the bulk of the manuscript; all authors read and commented on it.

Supplemental References

Adkins, N.L., Hagerman, T.A., Georgel, P. GAGA protein: a multi-faceted transcription factor. Biochem. Cell Biol. 84, 559-567 (2006).

Batista, P.J., Ruby, J.G., Claycomb, J.M., Chiang, R., Fahlgren, N., Kasschau, K.D., Chaves, D.A., Gu, W., Vasale, J.J., Duan, S., et al. PRG-1 and 21U-RNAs interact to form the piRNA complex required for fertility in C. elegans. Mol. Cell 31, 67-78 (2008).

Borodovsky, M., Lomsadze, A., Ivanov, N., Mills, R. Eukaryotic gene prediction using GeneMark.hmm. Curr. Protoc. Bioinformatics Chapter 4, Unit 4.6 (2003).

Bucher, P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578 (1990).

Coghlan, A., Fiedler, T.J., McKay, S.J., Flicek, P., Harris, T.W., Blasiar, D., nGASP Consortium, Stein, L.D. nGASP: the nematode genome annotation assessment project. BMC Bioinformatics 9, 549 (2008).

Dai, H., Hogan, C., Gopalakrishnan, B., Torres-Vazquez, J., Nguyen, M., Park, S., Raftery, L.A., Warrior, R., Arora, K. The zinc finger protein Schnurri acts as a Smad partner in mediating the transcriptional response to Decapentaplegic. Dev. Biol. 227, 373-387 (2000).

Kuhn, R.M., Karolchik, D., Zweig, A.S., Wang, T., Smith, K.E., Rosenbloom, K.R., Rhead, B., Raney, B.J., Pohl, A., Pheasant, M., et al. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 37, D755-D761 (2009).
Lewis, J.A., Fleming, J.T. Basic culture methods. Methods Cell Biol. 48, 3-29 (1995).

Mahmoudi, T., Zuijderduijn, L.M., Mohd-Sarip, A., Verrijzer, C.P. GAGA facilitates binding of Pleiohomeotic to a chromatinized Polycomb response element. Nucleic Acids Res. 31, 4147-4156 (2003).

Rätsch, G., Sonnenburg, S., Srinivasan, J., Witte, H., Müller, K.R., Sommer, R.J., Schölkopf, B. Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS Comput. Biol. 3, e20 (2007).

Rushlow, C., Colosimo, P.F., Lin, M.C., Xu, M., Kirov, N. Transcriptional regulation of the Drosophila gene zen by competing Smad and Brinker inputs. Genes Dev. 15, 340-351 (2001).

Ruvinsky, I., Ohler, U., Burge, C.B., Ruvkun, G. Detection of broadly expressed neuronal genes in C. elegans. Dev. Biol. 302, 617-626 (2007).

Sambrook, J., Russell, D.W. Molecular Cloning: a Laboratory Manual. 3rd ed. CSHL Press, Cold Spring Harbor, New York (2001).

Tokusumi, Y., Ma, Y., Song, X., Jacobson, R.H., Takada, S. The new core promoter element XCPE1 (X Core Promoter Element 1) directs activator-, mediator-, and TATA-binding protein-dependent but TFIID-independent RNA polymerase II transcription from TATA-less promoters. Mol. Cell. Biol. 27, 1844-1858 (2007).

Wei, C., Lamesch, P., Arumugam, M., Rosenberg, J., Hu, P., Vidal, M., Brent, M.R. Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res. 15, 577-582 (2005).

Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., Kellis, M. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434, 338-345 (2005).
