The Chironomus tentans genome sequence and the organization of the Balbiani ring genes

ADDITIONAL FILE 1: SUPPLEMENTARY FIGURES

Figure S1. K-mer coverage frequency histogram of quality-filtered paired-end (PE) Illumina reads. The total number of k-mers (k = 27 bp) was 4,145,961,139, and the main peak of the histogram was estimated at 21X k-mer coverage.

Figure S2. A maximum-likelihood phylogenetic tree reconstructed from a concatenation of 531 core gene protein alignments, including 10 arthropods and 3 nematodes. The scale bar depicts amino acid substitutions per site. Branch labels are support values from 100 bootstrap replicates.

[Figure S3 plot. Axes: BLAST percent identity and log FPKM. Data points are labelled with gene names. Legend: 3' end processing; Basal transcription factors; Cap proteins; Chromatin factors; EJC components; Exosome; Nuclear pore; Ribosomal large subunit; Ribosomal small subunit; RNA catabolic processes; RNA export; RNA interference; RNA polymerases; SMN complex; Splicing factors; Translation factors.]

Figure S3. Gene expression values in Ch. tentans (log FPKM) versus percent protein identity between potential D. melanogaster and Ch. tentans orthologs for 16 expression machineries. Each color indicates a set of proteins constituting an expression machinery. FPKM: Fragments Per Kilobase of Exon per Million Fragments Mapped.

Figure S4.
A) Organization of the predicted BR gene locus. The solid blue line shows the assembled scaffold. Two interruptions have been introduced: one in the middle of the predicted gene, where sequence information is lacking, and one upstream because of space limitations. Dark blue arrows, labelled with lowercase letters, show the locations of predicted genes. BLAST hits are shown in pink. P3 and P4 indicate the locations of the probes used for in situ hybridization. B) In situ hybridization with probe P3. Probe P4 hybridized to the same chromosomal locus (data not shown). Top: fluorescence image. Bottom: combined fluorescence and phase-contrast image. The hybridization signal was confined to region 5B on chromosome IV. The BR1, BR2 and BR3 loci are indicated for reference.

SUPPLEMENTARY TABLES

Table S1. Statistics on genome sequencing libraries.

Type                   Insert size  Read length  Raw data  Filtered data  Coverage*  Mapped**
DNA-seq
  Illumina paired-end  500 bp       2 x 100 bp   5.8 Gbp   5.6 Gbp        28X        90 %
  Illumina mate-pair   5 Kbp        2 x 44 bp    5.3 Gbp   4.2 Gbp        21X        90 %
  454                  -            400 bp       0.9 Gbp   0.8 Gbp        4X         68 %
  Total                -            -            12.0 Gbp  10.6 Gbp       53X        89 %
RNA-seq
  Illumina paired-end  200 bp       2 x 101 bp   11.6 Gbp  -              -          97 %

* Calculated from the filtered data, based on an estimated genome size of 200 Mbp.
** BLAT (92% identity).

Table S2. Species distribution of sequences in the NCBI nucleotide database (nt) with homology to a 5% random subset of Ch. tentans sequencing reads. Best BLAST hits with identity of 98% and >50 bp aligned (454 reads) or with a maximum of 1 mismatch (Illumina reads) were recorded.

Phylum        Share of hits
Arthropoda    93 %
Unclassified  1 %
Cnidaria      0.7 %
Nematoda      0.6 %
Mollusca      0.6 %
Others        4.1 %

Table S3. Previously identified Chironomus repeat sequences added to the Ch. tentans ab initio repeat library.

GI accession number
14531653
156608
156609
156598
3336848
556631
3336845
3002944
2051997
556627
3861491
1448961
14531329
14531331

Table S4. The repeat content of the Ch. tentans genome.
Repetitive elements   Assembly  Assembly independent*
Complex repeats
  DNA elements        0.98      1.68
  SINEs               0.14      0.25
  LINEs               0.25      0.46
  LTR elements        0.09      0.10
  Satellites          0.20      0.26
  Unclassified        5.40      8.41
Simple repeats
  Minisatellites      2.74      3.45
  Low complexity      0.40      0.33

* Estimated from a set of 180,000 454 reads longer than 500 bp.

Table S5. Species included in the OrthoMCL-DB gene family analysis and the phylogenetic reconstruction.

Species                         Phylum
Chironomus tentans              Arthropoda
Acyrthosiphon pisum             Arthropoda
Aedes aegypti                   Arthropoda
Anopheles gambiae str. PEST     Arthropoda
Apis mellifera                  Arthropoda
Bombyx mori                     Arthropoda
Culex quinquefasciatus          Arthropoda
Drosophila melanogaster         Arthropoda
Pediculus humanus               Arthropoda
Ixodes scapularis               Arthropoda
Brugia malayi*                  Nematoda
Caenorhabditis briggsae F16*    Nematoda
Caenorhabditis elegans*         Nematoda

* Outgroup

Table S6. Expression machinery genes in D. melanogaster with no detected orthologous sequence in Ch. tentans. Identifiers according to FlyBase r5.55.

Machinery                    Name     Gene         Protein      Gene length (aa)
Basal transcription factors  Taf12L   FBgn0031623  FBpp0077111  139
Basal transcription factors  Trf2     FBgn0261793  FBpp0303138  1716
Basal transcription factors  nht      FBgn0041103  FBpp0289298  246
Chromatin remodelling        CG12316  FBgn0036483  FBpp0075405  1189
Chromatin remodelling        Atac1    FBgn0031876  FBpp0078990  357
Chromatin remodelling        D12      FBgn0027490  FBpp0079312  970
Chromatin remodelling        Atac2    FBgn0032691  FBpp0080629  775
Chromatin remodelling        Sgf11    FBgn0036804  FBpp0112092  197
Nuclear pore                 Ulp1     FBgn0027603  FBpp0074462  1514
RNA export                   Nxf3     FBgn0263232  FBpp0305286  560
Splicing factors             CG17764  FBgn0029751  FBpp0070724  273
Splicing factors             lost     FBgn0263594  FBpp0078561  546
Splicing factors             CG17098  FBgn0032276  FBpp0079692  653
Splicing factors             CG9684   FBgn0037583  FBpp0081342  643
Splicing factors             ymp      FBgn0261287  FBpp0310517  187

Table S7. For each of the U1, U2, U4, U5 and U6 snRNAs, the Ch. tentans scaffolds containing a significant sequence match to the D. melanogaster genes are listed. D. melanogaster contains multiple copies of each type of snRNA gene (listed in the right column together with the length of each gene). The length of the matched Ch. tentans regions and the percent identity within the matched regions are shown.

snRNA  Ch. tentans, scaffold (region)  Match length (% identity)  D. melanogaster, query (length)
U1     sc2105 (1160-1314)              1-156 (80)                 U1:21D (164 bp)
                                       1-156 (80)                 U1:82Eb (164 bp)
                                       1-156 (80)                 U1:95Ca (164 bp)
                                       1-156 (80)                 U1:95Cb (164 bp)
                                       1-156 (80)                 U1:95Cc (164 bp)
       sc1126 (31031-31185)            1-156 (80)                 U1:21D
                                       1-156 (80)                 U1:82Eb
                                       1-156 (80)                 U1:95Ca
                                       1-156 (80)                 U1:95Cb
                                       1-156 (80)                 U1:95Cc
U2     sc4193 (1246-1057)              1-191 (83)                 U2:14B (192 bp)
                                       1-191 (83)                 U2:34ABa (192 bp)
                                       1-191 (84)                 U2:34ABb (192 bp)
                                       1-191 (84)                 U2:34ABc (192 bp)
                                       1-191 (83)                 U2:38ABa (192 bp)
                                       1-190 (83)                 U2:38ABb (191 bp)
U4     sc237 (102346-102432)           1-87 (77)                  U4:25F (148 bp)
                                       5-86 (81)                  U4:38AB (142 bp)
                                       5-86 (80)                  U4:39B (143 bp)
       sc357 (82409-82327)             5-86 (79)                  U4:38AB
                                       5-86 (79)                  U4:39B
U5     sc717 (38264-38363)             1-99 (86)                  U5:14B (110 bp)
                                       17-96 (90)                 U5:23D (131 bp)
                                       17-96 (88)                 U5:34A (127 bp)
                                       16-86 (91)                 U5:35D (126 bp)
                                       17-96 (90)                 U5:38ABa (127 bp)
                                       16-95 (90)                 U5:38ABb (127 bp)
                                       16-94 (90)                 U5:63BC (123 bp)
       sc717 (32690-32794)             19-107 (88)                U5:14B
                                       1-96 (85)                  U5:23D
                                       1-96 (84)                  U5:34A
                                       16-84 (91)                 U5:35D
                                       1-96 (84)                  U5:38ABa
                                       16-103 (88)                U5:38ABb
                                       16-94 (90)                 U5:63BC
       sc7116 (1377-1475)              1-98 (82)                  U5:14B
                                       17-96 (87)                 U5:23D
                                       17-95 (88)                 U5:34A
                                       16-84 (91)                 U5:35D
                                       17-96 (90)                 U5:38ABa
                                       16-95 (87)                 U5:38ABb
                                       16-94 (90)                 U5:63BC
U6     sc7078 (2259-2365)              1-107 (100)                U6:96Aa (107 bp)
                                       1-107 (100)                U6:96Ab (107 bp)
                                       1-107 (100)                U6:96Ac (107 bp)
       sc295 (9373-9268)               1-107 (98)                 U6:96Aa
                                       1-107 (98)                 U6:96Ab
                                       1-107 (98)                 U6:96Ac
       sc96 (18372-18476)              3-107 (97)                 U6:96Aa
                                       3-107 (97)                 U6:96Ab
                                       3-107 (97)                 U6:96Ac

Table S8. Oligonucleotides used for in situ hybridization experiments. P1-P4 oligonucleotides were used as PCR primers to obtain probes for in situ hybridization.
The BR2.1 and BR2.2 oligonucleotides were labeled with CY3 at their 5' ends.

P1a P1b P3a P3b
5' GATTGTGGTGTTTAGTACTAGCC 3'
5' CATTTGCGAACCATGGTCTGC 3'
5' ACAAATGTCTTATGCAGTCTAG 3'
5' AAACCAACATTTCAGCTTG 3'
5' GTTTGACGTCAATTTGACAG 3'
5' CATAAATGACACGTCGATC 3'
P4a 5' ATCGACACCACTATATGAAG 3'
P4b 5' TGAGGAAATCATTCATGC 3'
BR2.2 BR2.1
5'-CY3-CTCTGGTTTAATTCCTGACCAACTTGGTCT 3'
5'-CY3-ACTTGGCTTGCTGTGTTTGCTTGGTTTGCT 3'
P2a P2b
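
The sequencing statistics quoted in Figure S1 and Table S1 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes the common approximation that genome size is roughly the total k-mer count divided by the k-mer coverage at the main histogram peak; the source does not say that the 200 Mbp estimate was obtained this way. It also assumes that the Coverage column of Table S1 is the filtered data divided by that 200 Mbp estimate, as the table footnote suggests. The function and constant names are hypothetical.

```python
"""Cross-check of the sequencing statistics quoted in Figure S1 and Table S1."""

# Values copied from this file.
TOTAL_KMERS = 4_145_961_139   # Figure S1, k = 27
KMER_PEAK_COVERAGE = 21       # Figure S1, main histogram peak
GENOME_SIZE_BP = 200e6        # Table S1 footnote (estimated genome size)
FILTERED_DATA_BP = {          # Table S1, "Filtered data" column
    "Illumina paired-end": 5.6e9,
    "Illumina mate-pair": 4.2e9,
    "454": 0.8e9,
}


def genome_size_from_kmers(total_kmers, peak_coverage):
    """Rough genome size: total k-mer count divided by peak k-mer coverage.

    Assumption: a simple approximation that ignores sequencing errors,
    repeats and heterozygosity; not necessarily the authors' method.
    """
    return total_kmers / peak_coverage


def fold_coverage(filtered_bp, genome_size_bp):
    """Fold coverage: filtered bases divided by the estimated genome size."""
    return filtered_bp / genome_size_bp


if __name__ == "__main__":
    size = genome_size_from_kmers(TOTAL_KMERS, KMER_PEAK_COVERAGE)
    print(f"k-mer based genome size: {size / 1e6:.1f} Mbp")
    for library, bases in FILTERED_DATA_BP.items():
        print(f"{library}: {fold_coverage(bases, GENOME_SIZE_BP):.0f}X")
    total = sum(FILTERED_DATA_BP.values())
    print(f"Total: {fold_coverage(total, GENOME_SIZE_BP):.0f}X")
```

Under these assumptions the k-mer estimate comes out just under 200 Mbp, and the per-library values reproduce the 28X, 21X, 4X and 53X entries of Table S1.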