Supplemental Figure Legends
Figure S1. Pipeline for comparing the PS1010 genome to those of C. elegans and C. briggsae.
Protein-coding genes predicted by AUGUSTUS on the RNAPATH supercontigs are then
analyzed for orthology and domains with OrthoMCL and PFAM; their gene activity is
quantitated with ERANGE. The PS1010 assembly is aligned with C. elegans and C. briggsae
genomes with TBA/multiz; conserved regions of the alignment are found with phastCons.
Figure S2. Size distributions of highly conserved noncoding elements passing all filters (listed in
Table S4). Four sets of elements are shown: all 2,672 noncoding elements; a subset of 1,193
elements which have at least one match to at least one motif in Table S6; a subset of 113
elements with at least one match to motif 1-1 (Tables 2 and S6), which is equivalent to the
slr-2/jmjc-1-responsive motif of Kirienko and Fay (2010); and a subset of 128 elements with at least
1 nt of overlap with the 3,672 RNAz-predicted ncRNAs of Missal et al. (2006). Size ranges of these
sets of elements are in Table S4.
Supplemental Tables
Table S1. PFAM domains conserved in PS1010 but lost in the Elegans group.
PS1010
genes
PFAM
accession
Domain name
Description
Also found in
9
6
5
PF08281.5
PF01842.18
PF08245.5
Sigma70_r4_2
ACT
Mur_ligase_M
Sigma-70, region 4
ACT domain
Mur ligase middle domain
4
PF01580.11
FtsK_SpoIIIE
FtsK/SpoIIIE family
3
PF12146.1
Hydrolase_4
2
PF00437.13
GSPII_E
2
PF01408.15
GFO_IDH_MocA
2
PF02548.8
Pantoate_transf
Putative lysophospholipase
Type II/IV secretion
system protein
Oxidoreductase family,
NAD-binding Rossmann
fold
Ketopantoate
hydroxymethyltransferase
M. hapla
P. pacificus
B. malayi, P. pacificus
B. malayi, M. incognita,
P. pacificus
P. pacificus
B. malayi, M. incognita,
P. pacificus
2
PF05188.10
MutS_II
2
PF09849.2
DUF2076
1
PF00712.12
DNA_pol3_beta
1
PF00805.15
Pentapeptide
1
PF01132.13
EFP
1
PF01275.12
Myelin_PLP
1
PF01817.14
CM_2
1
PF01930.10
Cas_Cas4
1
PF01935.10
DUF87
1
PF02009.9
Rifin_STEVOR
1
PF02355.9
SecD_SecF
1
PF02562.9
PhoH
PhoH-like protein
M. hapla, M. incognita,
P. pacificus
1
PF02569.8
Pantoate_ligase
Pantoate-beta-alanine
ligase
M. hapla, M. incognita
1
PF03176.8
MMPL
MMPL family
1
PF03298.6
Stanniocalcin
Stanniocalcin family
1
PF03448.10
MgtE_N
1
PF03623.6
Focal_AT
1
PF04858.6
TH1
MutS domain II
Uncharacterized protein
conserved in bacteria
DNA polymerase III beta
subunit, N-terminal
domain
Pentapeptide repeats (8
copies)
Elongation factor P (EF-P)
OB domain
Myelin proteolipid protein
(PLP or lipophilin)
Chorismate mutase type II
Domain of unknown
function (DUF83)
Domain of unknown
function
Rifin/stevor family
Protein export membrane
protein
MgtE intracellular N
domain
Focal adhesion targeting
region
TH1 protein
M. hapla, M. incognita
M. incognita
B. malayi, M. hapla, P.
pacificus
M. hapla, M. incognita
B. malayi
B. malayi
M. hapla
B. malayi, M. hapla, P.
pacificus
M. incognita
B. malayi, P. pacificus
M. hapla, M. incognita
M. incognita
M. incognita
B. malayi, M. incognita,
P. pacificus
M. incognita, P.
pacificus
P. pacificus
B. malayi
B. malayi, M. hapla, M.
incognita, P. pacificus
1
PF05495.5
zf-CHY
1
PF05673.6
DUF815
1
PF05893.7
LuxC
1
PF06209.6
COBRA1
1
PF07224.4
Chlorophyllase
1
PF07464.4
ApoLp-III
1
PF07517.7
SecA_DEAD
1
PF07991.5
IlvN
1
PF08127.6
Propeptide_C1
1
PF08284.4
RVP_2
1
PF08320.5
PIG-X
1
PF08333.4
DUF1725
1
PF08615.4
RNase_H2_suC
1
PF10345.2
Cohesin_load
1
PF10534.2
CRIC_ras_sig
1
PF11259.1
DUF3060
1
PF11464.1
Rbsn
1
PF12130.1
DUF3585
1
PF12140.1
DUF3588
CHY zinc finger
Domain of unknown
function
Acyl-CoA reductase
(LuxC)
Cofactor of BRCA1
(COBRA1)
Chlorophyllase
Apolipophorin-III
precursor (apoLp-III)
SecA DEAD-like domain
Acetohydroxy acid
isomeroreductase, catalytic
domain
Peptidase family C1
propeptide
Retroviral aspartyl
protease
PIG-X / PBN1
Domain of unknown
function
Ribonuclease H2 noncatalytic subunit
(Ylr154p-like)
Cohesin loading factor
Connector enhancer of
kinase suppressor of ras
Domain of unknown
function
FYVE-finger-containing
Rab5 effector protein
rabenosyn-5
Domain of unknown
function
Domain of unknown
function
B. malayi, P. pacificus
P. pacificus
M. incognita
B. malayi, M. hapla, P.
pacificus
B. malayi
M. hapla
P. pacificus
M. incognita
B. malayi
B. malayi, M. incognita
P. pacificus
B. malayi
B. malayi, M. hapla, M.
incognita, P. pacificus
B. malayi
P. pacificus
B. malayi
B. malayi
B. malayi, P. pacificus
B. malayi, P. pacificus
Two motifs in boldface, PF04858.6 and PF08615.4, were found in PS1010 and in all four
non-Caenorhabditis nematode proteomes examined, but were missing from both C. elegans and
C. briggsae. The E-value threshold for motif detection was 10⁻⁶.
Table S2. PFAM domains conserved in six non-PS1010 nematode genomes but not found in
PS1010.
C. elegans genes | PFAM accession | Domain name | Description
5 | PF01909.16 | NTP_transf_2 | Nucleotidyltransferase domain
4 | PF00752.10 | XPG_N | XPG N-terminal domain
4 | PF02793.15 | HRM | Hormone receptor domain
3 | PF03619.9 | DUF300 | Domain of unknown function
3 | PF09762.2 | KOG2701 | Coiled-coil domain-containing protein (DUF2037)
2 | PF01974.10 | tRNA_int_endo | tRNA intron endonuclease, catalytic C-terminal domain
2 | PF02837.11 | Glyco_hydro_2_N | Glycosyl hydrolases family 2, sugar binding domain
2 | PF03016.8 | Exostosin | Exostosin family
2 | PF03531.7 | SSrecog | Structure-specific recognition protein (SSRP1)
2 | PF04676.7 | CwfJ_C_2 | Protein similar to CwfJ C-terminus 2
2 | PF06624.5 | RAMP4 | Ribosome associated membrane protein RAMP4
2 | PF07039.4 | DUF1325 | SGF29 tudor-like domain
2 | PF08321.5 | PPP5 | PPP5
1 | PF00794.11 | PI3K_rbd | PI3-kinase family, ras-binding domain
1 | PF00832.13 | Ribosomal_L39 | Ribosomal L39 protein
1 | PF01198.12 | Ribosomal_L31e | Ribosomal protein L31e
1 | PF01215.12 | COX5B | Cytochrome c oxidase subunit Vb
1 | PF01331.12 | mRNA_cap_enzyme | mRNA capping enzyme, catalytic domain
1 | PF01599.12 | Ribosomal_S27 | Ribosomal protein S27a
1 | PF01661.14 | Macro | Macro domain
1 | PF01776.10 | Ribosomal_L22e | Ribosomal L22e protein family
1 | PF02291.8 | TFIID-31kDa | Transcription initiation factor IID, 31kD subunit
1 | PF02516.7 | STT3 | Oligosaccharyl transferase STT3 subunit
1 | PF02836.10 | Glyco_hydro_2_C | Glycosyl hydrolases family 2, TIM barrel domain
1 | PF03179.8 | V-ATPase_G | Vacuolar (H+)-ATPase G subunit
1 | PF03281.7 | Mab-21 | Mab-21 protein
1 | PF03484.8 | B5 | tRNA synthetase B5 domain
1 | PF03638.8 | CXC | Tesmin/TSO1-like CXC domain
1 | PF03850.7 | Tfb4 | Transcription factor Tfb4
1 | PF03911.9 | Sec61_beta | Sec61beta family
1 | PF03919.8 | mRNA_cap_C | mRNA capping enzyme, C-terminal domain
1 | PF03986.6 | Autophagy_N | Autophagocytosis associated protein (Atg3), N-terminal domain
1 | PF04112.6 | Mak10 | Mak10 subunit, NatC N(alpha)-terminal acetyltransferase
1 | PF04114.7 | Gaa1 | Gaa1-like, GPI transamidase component
1 | PF04129.5 | Vps52 | Vps52 / Sac2 family
1 | PF04430.7 | DUF498 | Domain of unknown function (DUF498/DUF598)
1 | PF04615.6 | Utp14 | Utp14 protein
1 | PF04934.7 | Med6 | MED6 mediator sub complex component
1 | PF05026.6 | DCP2 | Dcp2, box A domain
1 | PF05131.7 | Pep3_Vps18 | Pep3/Vps18/deep orange family
1 | PF05493.6 | ATP_synt_H | ATP synthase subunit H
1 | PF05743.6 | UEV | UEV domain
1 | PF06025.5 | DUF913 | Domain of unknown function
1 | PF06047.4 | SynMuv_product | Ras-induced vulval development antagonist
1 | PF06093.6 | Spt4 | Spt4/RpoE2 zinc finger
1 | PF06094.5 | AIG2 | AIG2-like family
1 | PF06105.5 | Aph-1 | Aph-1 protein
1 | PF06432.4 | GPI2 | Phosphatidylinositol N-acetylglucosaminyltransferase
1 | PF06732.4 | Pescadillo_N | Pescadillo N-terminus
1 | PF06859.5 | Bin3 | Bicoid-interacting protein 3 (Bin3)
1 | PF06978.4 | POP1 | Ribonucleases P/MRP protein subunit POP1
1 | PF07047.5 | OPA3 | Optic atrophy 3 protein (OPA3)
1 | PF07289.4 | DUF1448 | Protein of unknown function (DUF1448)
1 | PF07542.4 | ATP12 | ATP12 chaperone protein
1 | PF07575.6 | Nucleopor_Nup85 | Nup85 Nucleoporin
1 | PF07910.6 | Peptidase_C78 | Peptidase family C78
1 | PF07947.7 | YhhN | YhhN-like protein
1 | PF08038.5 | Tom7 | TOM7 family
1 | PF08075.4 | NOPS | NOPS (NUC059) domain
1 | PF08170.5 | POPLD | POPLD (NUC188) domain
1 | PF08221.4 | HTH_9 | RNA polymerase III subunit RPC82 helix-turn-helix domain
1 | PF08295.5 | HDAC_interact | Histone deacetylase (HDAC) interacting
1 | PF08324.4 | PUL | PUL domain
1 | PF08366.6 | LLGL | LLGL2
1 | PF08375.4 | Rpn3_C | Proteasome regulatory subunit C-terminal
1 | PF08492.5 | SRP72 | SRP72 RNA-binding domain
1 | PF08518.4 | GIT_SHD | Spa2 homology domain (SHD) of GIT
1 | PF08825.3 | E2_bind | E2 binding domain
1 | PF08923.3 | MAPKK1_Int | Mitogen-activated protein kinase kinase 1 interacting
1 | PF09270.3 | Beta-trefoil | Beta-trefoil
1 | PF09271.4 | LAG1-DNAbind | LAG1, DNA binding
1 | PF09785.2 | Prp31_C | Prp31 C terminal domain
1 | PF10044.2 | Ret_tiss | Retinal tissue protein
1 | PF10228.2 | DUF2228 | Uncharacterised conserved protein
1 | PF10240.2 | DUF2464 | Domain of unknown function
1 | PF10354.2 | DUF2431 | Domain of unknown function
1 | PF10377.2 | ATG11 | Autophagy-related protein 11
1 | PF10484.2 | MRP-S23 | Mitochondrial ribosomal protein S23
1 | PF10509.2 | GalKase_gal_bdg | Galactokinase galactose-binding signature
1 | PF10597.2 | U5_2-snRNA_bdg | U5-snRNA binding site 2 of PrP8
1 | PF10598.2 | RRM_4 | RNA recognition motif of the spliceosomal PrP8
1 | PF12353.1 | eIF3g | Eukaryotic translation initiation factor 3 subunit G
1 | PF12457.1 | TIP_N | Tuftelin interacting protein N terminal
1 | PF12513.1 | SUV3_C | Mitochondrial degradasome RNA helicase subunit C terminal
1 | PF12572.1 | DUF3752 | Domain of unknown function
Table S3. The 50 most frequent PFAM-A domains in PS1010.
Genes | PFAM accession | Domain name | Description
370 | PF00069.18 | Pkinase | Protein kinase domain
340 | PF07714.10 | Pkinase_Tyr | Protein tyrosine kinase
250 | PF10326.2 | 7TM_GPCR_Str | Serpentine type 7TM GPCR chemoreceptor Str
242 | PF10318.2 | 7TM_GPCR_Srh | Serpentine type 7TM GPCR chemoreceptor Srh
199 | PF10317.2 | 7TM_GPCR_Srd | Serpentine type 7TM GPCR chemoreceptor Srd
196 | PF10319.2 | 7TM_GPCR_Srj | Serpentine type 7TM GPCR chemoreceptor Srj
175 | PF07690.9 | MFS_1 | Major Facilitator Superfamily
160 | PF01391.11 | Collagen | Collagen triple helix repeat (20 copies)
150 | PF00005.20 | ABC_tran | ABC transporter
134 | PF00001.14 | 7tm_1 | 7 transmembrane receptor (rhodopsin family)
120 | PF00400.25 | WD40 | WD domain, G-beta repeat
119 | PF02931.16 | Neur_chan_LBD | Neurotransmitter-gated ion-channel ligand binding domain
106 | PF01484.10 | Col_cuticle_N | Nematode cuticle collagen N-terminal domain
103 | PF00651.24 | BTB | BTB/POZ domain
90 | PF00059.14 | Lectin_C | Lectin C-type domain
90 | PF00102.20 | Y_phosphatase | Protein-tyrosine phosphatase
86 | PF00076.15 | RRM_1 | RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain)
86 | PF00104.23 | Hormone_recep | Ligand-binding domain of nuclear hormone receptor
83 | PF00105.11 | zf-C4 | Zinc finger, C4 type (two domains)
82 | PF02932.9 | Neur_chan_memb | Neurotransmitter-gated ion-channel transmembrane region
80 | PF10327.2 | 7TM_GPCR_Sri | Serpentine type 7TM GPCR chemoreceptor Sri
79 | PF00271.24 | Helicase_C | Helicase conserved C-terminal domain
76 | PF01431.14 | Peptidase_M13 | Peptidase family M13
75 | PF07679.9 | I-set | Immunoglobulin I-set domain
72 | PF00106.18 | adh_short | short chain dehydrogenase
71 | PF00071.15 | Ras | Ras family
70 | PF00083.17 | Sugar_tr | Sugar (and other) transporter
69 | PF00046.22 | Homeobox | Homeobox domain
67 | PF00023.23 | Ank | Ankyrin repeat
65 | PF00025.14 | Arf | ADP-ribosylation factor family
63 | PF08477.6 | Miro | Miro-like protein
63 | PF10328.2 | 7TM_GPCR_Srx | Serpentine type 7TM GPCR chemoreceptor Srx
62 | PF07885.9 | Ion_trans_2 | Ion channel
61 | PF00270.22 | DEAD | DEAD/DEAH box helicase
61 | PF01549.17 | ShK | ShK domain-like
57 | PF02463.12 | SMC_N | RecF/RecN/SMC N terminal domain
56 | PF00431.13 | CUB | CUB domain
56 | PF00004.22 | AAA | ATPase family associated with various cellular activities (AAA)
55 | PF00036.25 | efhand | EF hand
55 | PF01060.16 | DUF290 | Transthyretin-like family
54 | PF10320.2 | 7TM_GPCR_Srsx | Serpentine type 7TM GPCR chemoreceptor Srsx
54 | PF00635.19 | Motile_Sperm | MSP (Major sperm protein) domain
54 | PF00149.21 | Metallophos | Calcineurin-like phosphoesterase
53 | PF00201.11 | UDPGT | UDP-glucoronosyl and UDP-glucosyl transferase
53 | PF00096.19 | zf-C2H2 | Zinc finger, C2H2 type
52 | PF01697.20 | DUF23 | Domain of unknown function
51 | PF03125.11 | Sre | C. elegans Sre G protein-coupled chemoreceptor
50 | PF00047.18 | ig | Immunoglobulin domain
49 | PF00041.14 | fn3 | Fibronectin type III domain
49 | PF00595.17 | PDZ | PDZ domain (Also known as DHR or GLGF)
Table S4. Filtering of phastCons elements with PS1010-elegans-briggsae conservation.
DNA elements | Number | Total nt | % genome | Min. nt | Avg. nt | Max. nt
Original, unfiltered PhastCons predictions | 95,712 | 6,231,603 | 6.21% | 3 | 65 | 2,514
+80% overlap w/ elegans/PS1010 alignment | 95,711 | 6,231,579 | 6.21% | 3 | 65 | 2,514
+80% overlap w/ elegans/briggsae alignment | 85,819 | 5,740,830 | 5.72% | 3 | 67 | 2,514
No RepeatMasker | 84,616 | 5,624,280 | 5.61% | 3 | 66 | 2,514
No simple repeats | 84,053 | 5,572,614 | 5.56% | 3 | 66 | 2,514
No WS210 protein-coding exons | 4,964 | 188,690 | 0.19% | 7 | 38 | 444
No WS210 ncRNA exons | 4,545 | 162,811 | 0.16% | 7 | 36 | 444
No sig. BlastN hits to WS210 ncRNA | 4,494 | 160,846 | 0.16% | 7 | 36 | 444
No mGene protein-coding exons | 3,313 | 103,969 | 0.10% | 7 | 31 | 173
No other alternative protein-coding exons | 2,761 | 81,264 | 0.08% | 7 | 29 | 160
No AUGUSTUS (+ 355K/pub. Rseq) exons | 2,707 | 78,977 | 0.08% | 7 | 29 | 160
No AUGUSTUS (+ 355K ESTs) exons | 2,700 | 78,743 | 0.08% | 7 | 29 | 160
No AUGUSTUS (+ public RNA-seq) exons | 2,695 | 78,570 | 0.08% | 7 | 29 | 160
No AUGUSTUS (ab initio) exons | 2,677 | 77,909 | 0.08% | 7 | 29 | 160
* No AUGUSTUS (+ 2x75 nt N2 cDNA) exons | 2,672 | 77,635 | 0.08% | 7 | 29 | 160
No overlap with 2006 RNAz | 2,544 | 73,035 | 0.07% | 7 | 29 | 160
Any overlap with 2006 RNAz | 128 | 4,600 | 0.00% | 9 | 36 | 126
50% overlap with 2006 RNAz | 118 | 4,156 | 0.00% | 9 | 35 | 126
80% overlap with 2006 RNAz | 101 | 3,347 | 0.00% | 9 | 33 | 120
No overlap w/ filtered 2006 RNAz | 2,577 | 73,967 | 0.07% | 7 | 29 | 160
Any overlap w/ filtered 2006 RNAz | 95 | 3,668 | 0.00% | 10 | 39 | 126
50% overlap w/ filtered 2006 RNAz | 87 | 3,316 | 0.00% | 10 | 38 | 126
80% overlap w/ filtered 2006 RNAz | 72 | 2,551 | 0.00% | 10 | 35 | 120
At least one match to a predicted motif | 1,193 | 39,587 | 0.04% | 9 | 33 | 160
No matches to any predicted motifs | 1,479 | 38,048 | 0.04% | 7 | 26 | 107
At least one match to motif 1-1 | 113 | 3,255 | 0.00% | 11 | 29 | 148
The row marked with an asterisk (2,672 elements) is our filtered set of non-coding elements. Rows below it are
subsets of these 2,672 sequences.
Table S5. Enrichment of GO terms for genes near candidate conserved non-coding elements.
GO ID | Associated genes | p-value | GO term description
GO:0000003 | 248 of 1478 | 7.67e-20 | reproduction
GO:0040007 | 236 of 1429 | 4.73e-18 | growth
GO:0009792 | 396 of 2811 | 2.16e-17 | embryonic development ending in birth or egg hatching
GO:0040010 | 270 of 1783 | 1.33e-15 | positive regulation of growth rate
GO:0010171 | 106 of 579 | 2.24e-11 | body morphogenesis
GO:0005622 | 127 of 750 | 5.48e-11 | intracellular
GO:0045449 | 50 of 199 | 9.26e-11 | regulation of transcription
GO:0040035 | 120 of 706 | 1.44e-10 | hermaphrodite genitalia development
GO:0018991 | 64 of 298 | 3.11e-10 | oviposition
GO:0003677 | 95 of 529 | 6.50e-10 | DNA binding
GO:0006355 | 103 of 618 | 7.59e-09 | regulation of transcription, DNA-dependent
GO:0006811 | 41 of 168 | 1.01e-08 | ion transport
GO:0005216 | 27 of 89 | 2.67e-08 | ion channel activity
GO:0005840 | 33 of 138 | 4.10e-07 | ribosome
GO:0003700 | 91 of 573 | 4.41e-07 | transcription factor activity
GO:0040026 | 15 of 38 | 7.74e-07 | positive regulation of vulval development
GO:0018996 | 43 of 211 | 9.00e-07 | molting cycle, collagen and cuticulin-based cuticle
GO:0016020 | 146 of 1062 | 1.50e-06 | membrane
GO:0003735 | 33 of 146 | 1.52e-06 | structural constituent of ribosome
GO:0002009 | 58 of 328 | 1.76e-06 | morphogenesis of an epithelium
GO:0005524 | 131 of 935 | 1.90e-06 | ATP binding
GO:0006412 | 40 of 197 | 2.26e-06 | translation
GO:0043565 | 76 of 475 | 2.62e-06 | sequence-specific DNA binding
Table S6. Characteristics of DNA motifs predicted from filtered conserved DNA elements.
[See the Excel spreadsheet ps1010_GR2010_Table_S6.xls, provided in the Supplemental Files,
for these data.]
In addition to similarities described in the main text, further similarities of our predicted motifs
were detected to published binding site or computational consensus motifs for the following: the
C. elegans neuron-specific N1-box (Ruvinsky et al., 2007); Drosophila Mad (Rushlow et al.,
2001), brinker (Rushlow et al., 2001), shn (Dai et al., 2000), and Trithorax-like (Trl, i.e.,
Drosophila GAGA; Mahmoudi et al., 2003); GAGA (Adkins et al., 2006); human
CTGYNNCTYTAA (PF0082.1 in JASPAR PHYLOFACTS; Xie et al., 2005); and core
promoter GC-box (Bucher, 1990) and XCPE1 (POL011.1 in JASPAR-POLII; Tokusumi et al.,
2007).
Table S7. Genomic and RNA library statistics.
Library | Read type | Raw (pair) count (millions of reads) | Nominal sequence (Gb)
Genomic | 1x75 | 53.6 | 4.0
Genomic | 2x75 | 40.2 | 6.0
Genomic | 2x75 | 23.5 | 3.5
Genomic | 2x75 | 24.8 | 3.7
RNA-seq | 2x75 | 53.2 | 8.0
Fragment size (bp): 200, 450, 375, 200
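The nominal sequence values in Table S7 appear to follow from the raw counts and read lengths. The short Python sketch below reproduces them under one assumption that is ours, not stated in the table: for 2x75 libraries the raw count is a count of read pairs, so each unit contributes 150 nt. The function names are hypothetical.

```python
# Raw counts (millions) and read types as listed in Table S7.
LIBRARIES = [
    ("Genomic", "1x75", 53.6),
    ("Genomic", "2x75", 40.2),
    ("Genomic", "2x75", 23.5),
    ("Genomic", "2x75", 24.8),
    ("RNA-seq", "2x75", 53.2),
]


def parse_read_type(read_type):
    """Split a label such as '2x75' into (reads per fragment, read length in nt)."""
    reads_per_fragment, read_length = read_type.split("x")
    return int(reads_per_fragment), int(read_length)


def nominal_gb(count_millions, read_type):
    """Nominal yield in Gb, assuming the count is in pairs for paired-end libraries."""
    reads_per_fragment, read_length = parse_read_type(read_type)
    return count_millions * 1e6 * reads_per_fragment * read_length / 1e9


def table_yields():
    """Return the nominal yield of each library, rounded to one decimal place."""
    return [round(nominal_gb(count, read_type), 1) for _, read_type, count in LIBRARIES]


if __name__ == "__main__":
    for (name, read_type, count), gb in zip(LIBRARIES, table_yields()):
        print(f"{name}\t{read_type}\t{count}\t{gb}")
```

Under this assumption the computed yields agree with the tabulated values to one decimal place.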
Supplemental Methods
Purifying worms and extracting genomic DNA
Stock solutions. Worm lysis buffer: 0.1 M Tris-Cl pH 8.5, 0.1 M NaCl, 50 mM EDTA pH
8.0, 1% SDS. CTAB/NaCl solution: 10% CTAB (Sigma M-7635) in 0.7 M NaCl. These solutions
tended to precipitate at room temperature but could be heated and redissolved. Proteinase K: 20
mg/ml in TE pH 8.0. Aliquots were stored at -20° C. until use. Other solutions such as M9 were
as described in Lewis and Fleming (1995) or Sambrook and Russell (2001).
Worm preparation. We seeded at least four 10-cm nutrient agar/HB101 plates with
PS1010 worms, and grew them for one week to get many starved L1 and older larvae. Starved
larvae were harvested by washing with M9 buffer, and run through a sucrose gradient to clean up
both bacterial waste and agar chunks (starved PS1010 worms burrow intensely in agar,
fragmenting it). We pooled cleaned worms by more M9 washes and tabletop centrifugation
before aliquotting 100 µl of worms into a 1.5 ml microcentrifuge tube and snap-freezing them
with liquid N2 and storing at -80° C. Worms could be frozen and stored before actual DNA
preps, allowing supplies of worms to be built up. Before DNA or RNA extraction, we cracked
their cuticles by doing 2-3 freeze-thaws in liquid N2 and a 37° C. water bath.
DNA preparation, day 1: We added ~4.5 ml of worm lysis buffer to a frozen ~500 µl
aliquot of worms (to get a final volume of 5.0 ml, adjusting as needed), transferred the worms to
a 15 ml disposable tube, added 200 µl of 20 mg/ml Proteinase K, and mixed by
inversion. We incubated worms at 62° C. for 60 minutes while prechilling isopropanol at -20° C.,
mixing by gentle inversion 4-5 times during the incubation. The solution cleared as the worms
disintegrated. We added 800 µl of 5 M NaCl, mixed it thoroughly by inversion, added 800 µl of
CTAB, and incubated the extract for 10 minutes at 37° C. We extracted with one volume of chloroform
and phenol/chloroform (in that order), using gentle inversion and tabletop centrifugation and
recovering the aqueous phase. We added one volume of (-20° C.) isopropanol and mixed by
inversion; good DNA preparations gave immediate threading that could be hand-picked to 70%
EtOH with a micropipettor, while weaker preparations required centrifuging trace DNA
precipitate. After three washes with 70% EtOH and 5 minutes of centrifugation, the supernatant was removed
and the pellet air dried before resuspending overnight in 340 µl TE + 10 µl RNAse A, at 4° C.
The DNA could be left safely at this step indefinitely.
DNA preparation, day 2: We incubated DNA solutions 2 hours at 37° C. to drive RNAse
activity to its conclusion. We added 20 µl of 20% SDS, 10 µl of 0.5 M EDTA pH 8.0, and 20 µl
of Proteinase K, mixing by gentle inversion, then incubating at 62° C. for 2 hours. We mixed in 80 µl
of 5 M ammonium acetate by gentle inversion, then extracted twice with phenol/chloroform and
once with chloroform before EtOH precipitation, 70% EtOH washing, air-drying genomic DNA,
and resuspending it in 100 µl TE overnight at 4° C. High-quality genomic DNA preparations
were viscous and took at least a day to dissolve.
Extracting mixed-stage bulk RNA
Worms were grown, starved of E. coli, harvested, cleaned with sucrose gradients, snap-frozen,
and stored as above. One difference was that it was useful to wash the cleaned worms in
S Basal buffer rather than M9, because S Basal's pH favored RNA purification. After cracking
the cuticles by two rounds of freeze-thawing, we partially thawed the worms and ground them
with an RNAse-free plastic pestle in their microcentrifuge tube, refroze them quickly, and mixed
them with 350 µl of RLT buffer (Qiagen) + β-mercaptoethanol. To completely homogenize the
worm extract, we passed the mixture through a disposable 20G 1½ syringe ≥10 times. We
microcentrifuged the mixture for 3 minutes, transferred the supernatant to a fresh
microcentrifuge tube, and proceeded to purify bulk RNA into a final volume of 100 µl (in
RNAse-free TE) with the RNeasy mini kit following the manufacturer's instructions. Because PS1010 does
not give many gravid or fecund females even on nutrient agar with HB101, it was practically
impossible to get large quantities of well-synchronized starved animals for an RNA harvest. We
did get a mixed larval population that was dominated by younger larvae (L1 and L2 stages) but
the overall population ranged to a few starved adults.
Manipulating and viewing sequence sets
Sequences were visualized and intersected by uploading them to a local mirror of the
UCSC Genome and Table Browsers (Kuhn et al., 2009). This mirror used the WS190 version of
the C. elegans genome sequence; data from WS210 and other releases were mapped to WS190 coordinates.
To purge highly conserved sequences in the C. elegans genome comprehensively of
sequences that were previously annotated or that were unlikely to be regulatory, we first selected
against annotated repetitive DNA sequences in the C. elegans genome (from the WS190 release
of WormBase, used in the UCSC genome site which we mirrored). We also required conserved
sequences to show 80% overlap in all comparisons of PS1010 to C. elegans and C. briggsae. We
then generated the following sequence sets, which we imported as custom tracks to our UCSC
mirror and used as filters: all exons of officially annotated protein-coding or non-coding RNA
genes in the WS210 frozen release of WormBase; all predicted exons from the recent alternative
genefinding analysis using mGene by Schweikert et al. (2009), performed on the WS180 release
of the C. elegans genome; and all exons unofficially predicted by five alternative protein-coding
genefinders archived in the WS204 release of WormBase: GeneMarkHMM (Borodovsky et al., 2003),
mSplicer and mSplicer-ORF (Rätsch et al., 2007), nGASP/JIGSAW (Coghlan et al., 2008), and
TWINSCAN (Wei et al., 2005). In particular, the WS210 set of official ncRNA genes was far
more extensive than that for WS190, containing thousands of recently annotated 21U RNAs
(Batista et al., 2008). To import these data to the WS190 coordinates of the UCSC mirror, we
used remap_gff_between_releases.pl and unmap_gff_between_releases.pl from the remap.tar.bz2
package at ftp://ftp.sanger.ac.uk/pub2/wormbase/gary (G. Williams, personal communication).
A small number of known ncRNAs passed all these filters, perhaps because of mapping errors;
these were detected with BlastN against ncRNAs in WS210 and removed.
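The filtering logic described above can be illustrated with a minimal Python sketch. It is not the procedure actually used, which relied on the UCSC Table Browser and custom tracks. It assumes half-open (chromosome, start, end) intervals, non-overlapping alignment blocks within each pairwise alignment, and that any overlap with an annotated or predicted feature removes an element; the function names and the toy coordinates in the usage note are hypothetical.

```python
def overlap_nt(a_start, a_end, b_start, b_end):
    """Number of nt shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def covered_fraction(element, blocks):
    """Fraction of an element covered by alignment blocks (assumed non-overlapping)."""
    chrom, start, end = element
    covered = sum(
        overlap_nt(start, end, b_start, b_end)
        for b_chrom, b_start, b_end in blocks
        if b_chrom == chrom
    )
    return covered / (end - start)


def filter_elements(elements, pairwise_alignments, excluded_features, min_coverage=0.80):
    """Keep elements that are >= min_coverage covered in every pairwise alignment
    and that do not overlap any excluded feature (repeat, exon, known ncRNA)."""
    kept = []
    for element in elements:
        chrom, start, end = element
        if any(covered_fraction(element, blocks) < min_coverage
               for blocks in pairwise_alignments):
            continue
        if any(f_chrom == chrom and overlap_nt(start, end, f_start, f_end) > 0
               for f_chrom, f_start, f_end in excluded_features):
            continue
        kept.append(element)
    return kept
```

For example, with elements `[("I", 100, 200), ("I", 300, 400), ("I", 500, 600)]`, one alignment covering `("I", 90, 185)`, `("I", 300, 400)` and `("I", 500, 560)`, a second alignment covering `("I", 0, 1000)`, and one excluded exon at `("I", 350, 360)`, only the first element is kept: the second overlaps the exon and the third is only 60% covered.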
The original set of 3,672 putative ncRNAs predicted in C. elegans with RNAz by Missal
et al. (2006) was generated with a version of the C. elegans genome (WS120) dating to March
2004. To find which of these predictions remained completely unaccounted for in the WS210
genome release (dating to December 2009), we filtered them in exactly the same way that we
filtered conserved noncoding DNA elements; this yielded a still-novel subset of 1,290
RNAz-predicted ncRNAs.
cDNA-mediated genomic assembly
The RNAPATH module of ERANGE 3.2 was used with standard parameters as
defined in its internal documentation.
Protein motifs
Major sperm protein genes were defined as those encoding the PFAM-A motif
PF00635.19. Serpentine receptor genes were defined as those encoding any domain whose name
contained the abbreviation "7TM_GPCR". Gene mappings and counts were done with Perl.
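The two definitions above amount to a simple classification of genes by their PFAM hits. The original mappings and counts were done with Perl; the Python sketch below only illustrates the same rules. The input format (one tuple of gene ID, PFAM accession, and domain name per hit) and the gene identifiers in the usage note are hypothetical.

```python
from collections import defaultdict

MSP_ACCESSION = "PF00635.19"   # PFAM-A motif used to define major sperm protein genes
SERPENTINE_TAG = "7TM_GPCR"    # abbreviation used to define serpentine receptor genes


def count_gene_families(hits):
    """Count distinct genes per family from (gene_id, pfam_accession, domain_name) hits."""
    families = defaultdict(set)
    for gene_id, accession, domain_name in hits:
        if accession == MSP_ACCESSION:
            families["major_sperm_protein"].add(gene_id)
        if SERPENTINE_TAG in domain_name:
            families["serpentine_receptor"].add(gene_id)
    # A gene with several matching domains is counted once per family.
    return {family: len(genes) for family, genes in families.items()}
```

For instance, a gene with hits to both 7TM_GPCR_Str and 7TM_GPCR_Srh would be counted once as a serpentine receptor gene, and a gene whose only hit is Pkinase would not be counted in either family.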
Predicting and comparing DNA motifs
MEME will not accept input sequences shorter than 8 nt; we thus removed the
(tiny) fraction of our elements that were 7 nt long before predicting motifs. Many of our elements are small and
might have only partial overlaps with sites that are nevertheless real and worth detecting. We
therefore predicted two sets of motifs: set 1 (with 23 members, named 1-1 through 1-23 in order
of descending statistical significance) was extracted from the sequences of the filtered elements
alone, whereas set 2 (with 30 members, named 2-1 through 2-30 in descending significance) was
extracted from these sequences along with 5 nt of flanking DNA on each side of each element.
To find which of these two sets of predicted motifs were nonredundant, we cross-checked the
two sets with TOMTOM. Many motifs were indeed discovered in both sets (Table S6), but
set 2 contained more predicted motifs than set 1: it included five new predictions,
among them a match to E2F that had not been found in set 1. Matches to plant- or fungal-specific
motifs such as ABI4 or STE12 were omitted. All motifs were checked with FIMO for sites in the
original set of elements (without flanking 5 nt), and were found to exist in them. The final set of
nonredundant motifs is summarized in Table 2; the full set of motifs is detailed in Table S6.
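The preparation of the two MEME input sets described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code actually used: elements are half-open (chromosome, start, end) intervals, flanks are clipped at chromosome ends (a choice of ours, not stated in the text), and the function and parameter names are hypothetical.

```python
MIN_MEME_LENGTH = 8   # MEME rejects input sequences shorter than this
FLANK_NT = 5          # flanking DNA added on each side for set 2


def motif_input_intervals(elements, chrom_lengths,
                          min_length=MIN_MEME_LENGTH, flank=FLANK_NT):
    """Return (set1, set2) intervals: elements alone, and elements plus flanks."""
    set1, set2 = [], []
    for chrom, start, end in elements:
        if end - start < min_length:
            continue  # too short for MEME; dropped from both sets
        set1.append((chrom, start, end))
        set2.append((chrom,
                     max(0, start - flank),                   # clip at chromosome start
                     min(chrom_lengths[chrom], end + flank)))  # clip at chromosome end
    return set1, set2
```

Sequences extracted for the set 1 intervals correspond to the elements alone, and those for the set 2 intervals include up to 5 nt of flanking DNA on each side.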
Author Contributions
A.M. devised RNAPATH, assembled genomes and transcriptomes, mapped expression
data, compared Illumina and Sanger genomic assemblies, predicted protein-coding genes,
aligned multiple genomes with TBA/multiz, extracted conserved regions with phastCons, and
linked them to GO terms with Cistematic; E.M.S. grew worms, extracted genomic DNA and
bulk RNA, identified repetitive DNA, predicted protein-coding genes, analyzed orthology and
domains of protein-coding genes, devised most filters for highly conserved non-coding elements,
and analyzed DNA motifs; B.W. constructed cDNA libraries for Illumina sequencing; L.S. built
genomic libraries and carried out Illumina sequencing; I.A. optimized sequencing protocols;
B.J.W. and P.W.S. provided overall management. A.M. and E.M.S. wrote the bulk of the
manuscript; all authors read and commented on it.
Supplemental References
Adkins, N.L., Hagerman, T.A., Georgel, P. GAGA protein: a multi-faceted transcription factor.
Biochem. Cell Biol. 84, 559-567 (2006).
Batista, P.J., Ruby, J.G., Claycomb, J.M., Chiang, R., Fahlgren, N., Kasschau, K.D., Chaves,
D.A., Gu, W., Vasale, J.J., Duan, S., et al. PRG-1 and 21U-RNAs interact to form the piRNA
complex required for fertility in C. elegans. Mol Cell. 31, 67-78 (2008).
Borodovsky, M., Lomsadze, A., Ivanov, N., Mills, R. Eukaryotic gene prediction using
GeneMark.hmm. Curr. Protoc. Bioinformatics Chapter 4, Unit 4.6 (2003).
Bucher, P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements
derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578 (1990).
Coghlan, A., Fiedler, T.J., McKay, S.J., Flicek, P., Harris, T.W., Blasiar, D., nGASP
Consortium, Stein, L.D. nGASP -- the nematode genome annotation assessment project. BMC
Bioinformatics 9, 549 (2008).
Dai, H., Hogan, C., Gopalakrishnan, B., Torres-Vazquez, J., Nguyen, M., Park, S., Raftery, L.A.,
Warrior, R., Arora, K. The zinc finger protein Schnurri acts as a Smad partner in mediating the
transcriptional response to Decapentaplegic. Dev. Biol. 227, 373-387 (2000).
Kuhn, R.M., Karolchik, D., Zweig, A.S., Wang, T., Smith, K.E., Rosenbloom, K.R., Rhead, B.,
Raney, B.J., Pohl, A., Pheasant, M., et al. The UCSC Genome Browser Database: update 2009.
Nucleic Acids Res. 37, D755-D761 (2009).
Lewis, J.A., Fleming, J.T. Basic culture methods. Methods Cell Biol. 48, 3-29 (1995).
Mahmoudi, T., Zuijderduijn, L.M., Mohd-Sarip, A., Verrijzer, C.P. GAGA facilitates binding of
Pleiohomeotic to a chromatinized Polycomb response element. Nucleic Acids Res. 31,
4147-4156 (2003).
Rätsch, G., Sonnenburg, S., Srinivasan, J., Witte, H., Müller, K.R., Sommer, R.J., Schölkopf, B.
Improving the Caenorhabditis elegans genome annotation using machine learning. PLoS
Comput. Biol. 3, e20 (2007).
Rushlow, C., Colosimo, P.F., Lin, M.C., Xu, M., Kirov, N. Transcriptional regulation of the
Drosophila gene zen by competing Smad and Brinker inputs. Genes Dev. 15, 340-351 (2001).
Ruvinsky, I., Ohler, U., Burge, C.B., Ruvkun, G. Detection of broadly expressed neuronal genes
in C. elegans. Dev. Biol. 302, 617-626 (2007).
Sambrook, J., Russell, D.W. Molecular Cloning: a Laboratory Manual. 3rd. ed. CSHL Press,
Cold Spring Harbor, New York (2001).
Tokusumi, Y., Ma, Y., Song, X., Jacobson, R.H., Takada, S. The new core promoter element
XCPE1 (X Core Promoter Element 1) directs activator-, mediator-, and TATA-binding protein-dependent
but TFIID-independent RNA polymerase II transcription from TATA-less promoters.
Mol. Cell. Biol. 27, 1844-1858 (2007).
Wei, C., Lamesch, P., Arumugam, M., Rosenberg, J., Hu, P., Vidal, M., Brent, M.R. Closing in
on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res. 15, 577-582
(2005).
Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., Kellis,
M. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of
several mammals. Nature 434, 338-345 (2005).