Additional file 1

The Chironomus tentans genome sequence and the organization of the Balbiani ring genes
ADDITIONAL FILE 1:
SUPPLEMENTARY FIGURES
Figure S1. K-mer coverage frequency histogram of quality-filtered paired-end (PE) Illumina reads. The
total number of k-mers (k = 27 bp) was 4,145,961,139, and the main peak was estimated from the
histogram to lie at 21X k-mer coverage.
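The two numbers in this caption are enough for a rough genome size estimate. The sketch below assumes the commonly used approximation (genome size ≈ total k-mers / k-mer coverage at the main peak); the caption does not say that this exact calculation was used, and the function name is hypothetical.

```python
def estimate_genome_size(total_kmers: int, peak_coverage: float) -> float:
    """Approximate genome size in bp as total k-mers divided by peak k-mer coverage.

    Assumption: the main histogram peak corresponds to single-copy sequence.
    """
    if peak_coverage <= 0:
        raise ValueError("peak_coverage must be positive")
    return total_kmers / peak_coverage


# Values taken from the Figure S1 caption.
size_bp = estimate_genome_size(4_145_961_139, 21)
print(f"Estimated genome size: {size_bp / 1e6:.0f} Mbp")
```

With the caption values this gives about 197 Mbp, close to the estimated genome size of 200 Mbp quoted in the footnote to Table S1.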
Figure S2. A maximum-likelihood phylogenetic tree reconstructed from a concatenation of 531 core
gene protein alignments, including 10 arthropods and 3 nematodes. The scale bar depicts amino acid
substitutions per site. Branch labels are support values for 100 bootstrap replicates.
[Figure S3 plot. Axis labels: BLASTT_percid and Log FPKM. Legend (expression machineries): 3' end processing, Basal transcription factors, Cap proteins, Chromatin factors, EJC components, Exosome, Nuclear pore, Ribosomal large subunit, Ribosomal small subunit, RNA catabolic processes, RNA export, RNA interference, RNA polymerases, SMN complex, Splicing factors, Translation factors.]
Figure S3. Gene expression values in Ch. tentans (log FPKM) versus percent protein identity between
potential D. melanogaster and Ch. tentans orthologs for 16 expression machineries. Each color
indicates a set of proteins constituting an expression machinery. FPKM: Fragments Per Kilobase Of
Exon Per Million Fragments Mapped.
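As an illustration of the FPKM definition in the caption, the following sketch computes FPKM for one gene and takes its logarithm. The function names and the example counts are hypothetical, and base 10 is an assumption, since the caption does not state the base of the logarithm.

```python
import math


def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of exon per Million fragments mapped."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and totals must be positive")
    # Dividing by kilobases (length / 1e3) and by millions of fragments
    # (total / 1e6) is the same as multiplying the ratio by 1e9.
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def log_fpkm(value: float) -> float:
    """Base-10 logarithm of an FPKM value (the base is an assumption)."""
    return math.log10(value)


# Hypothetical gene: 500 fragments, 2,000 bp of exon, 10 million mapped fragments.
example = fpkm(500, 2_000, 10_000_000)
print(example, round(log_fpkm(example), 3))
```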
Figure S4. A) Organization of the predicted BR gene locus. The solid blue line shows the assembled
scaffold. Two interruptions have been introduced, one in the middle of the predicted gene where
sequence information is lacking and one upstream because of space limitations. Dark blue arrows,
labelled with lowercase letters, show the locations of predicted genes. Blast hits are shown in pink. P3
and P4 indicate the locations of the probes used for in situ hybridization. B) In situ hybridization with
probe P3. Probe P4 hybridized to the same chromosomal locus (data not shown). Top: fluorescence
image. Bottom: combined fluorescence and phase contrast image. The hybridization signal was
confined to region 5B in chromosome IV. The BR1, BR2 and BR3 loci are indicated for reference.
SUPPLEMENTARY TABLES
Table S1. Statistics on genome sequencing libraries.
Type                   Insert size  Read length  Raw data   Filtered data  Coverage*  Mapped**
DNA-seq
  Illumina paired-end  500 bp       2 x 100 bp   5.8 Gbp    5.6 Gbp        28X        90 %
  Illumina mate-pair   5 Kbp        2 x 44 bp    5.3 Gbp    4.2 Gbp        21X        90 %
  454                  -            400 bp       0.9 Gbp    0.8 Gbp        4X         68 %
  Total                -            -            12.0 Gbp   10.6 Gbp       53X        89 %
RNA-seq
  Illumina paired-end  200 bp       2 x 101 bp   11.6 Gbp   -              -          97 %
* Calculated from the filtered data, based on an estimated genome size of 200 Mbp.
** Mapped with BLAT (92% identity).
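The coverage column can be reproduced from the filtered data and the estimated genome size given in the footnote. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative only.

```python
GENOME_SIZE_BP = 200e6  # estimated genome size from the Table S1 footnote

# Filtered DNA-seq data per library, in bp (values from Table S1).
filtered_bp = {
    "Illumina paired-end": 5.6e9,
    "Illumina mate-pair": 4.2e9,
    "454": 0.8e9,
}


def coverage(filtered: float, genome_size: float) -> float:
    """Sequencing depth as filtered bases divided by genome size."""
    return filtered / genome_size


per_library = {name: coverage(bp, GENOME_SIZE_BP) for name, bp in filtered_bp.items()}
total = coverage(sum(filtered_bp.values()), GENOME_SIZE_BP)

for name, depth in per_library.items():
    print(f"{name}: {depth:.0f}X")
print(f"Total: {total:.0f}X")
```

The results (28X, 21X, 4X and 53X in total) agree with the values listed in the table.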
Table S2. Species distribution of sequences in the NCBI nucleotide database (nt) with homology
against a 5% random subset of Ch. tentans sequencing reads. Best Blast hits with identity of 98% and
>50 bp aligned (454 reads) or with maximum 1 mismatch (Illumina reads) were recorded.
Phylum  Arthropoda  Unclassified  Cnidaria  Nematoda  Mollusca  Others
        93 %        1 %           0.7 %     0.6 %     0.6 %     4.1 %
Table S3. Previously identified Chironomus repeat sequences added to the Ch. tentans ab initio repeat
library.
GI accession number
14531653
156608
156609
156598
3336848
556631
3336845
3002944
2051997
556627
3861491
1448961
14531329
14531331
Table S4. The repeat content of the Ch. tentans genome.
Repetitive elements  Assembly  Assembly independent*
Complex repeats
  DNA elements       0.98      1.68
  SINEs              0.14      0.25
  LINEs              0.25      0.46
  LTR elements       0.09      0.10
  Satellites         0.20      0.26
  Unclassified       5.40      8.41
Simple repeats
  Minisatellites     2.74      3.45
  Low complexity     0.40      0.33
* Estimated from a set of 180,000 454 reads longer than 500 bp.
Table S5. Species included in the OrthoMCL-DB gene family analysis and the phylogenetic
reconstruction.
Species                        Phylum
Chironomus tentans             Arthropoda
Acyrthosiphon pisum            Arthropoda
Aedes aegypti                  Arthropoda
Anopheles gambiae str. PEST    Arthropoda
Apis mellifera                 Arthropoda
Bombyx mori                    Arthropoda
Culex quinquefasciatus         Arthropoda
Drosophila melanogaster        Arthropoda
Pediculus humanus              Arthropoda
Ixodes scapularis              Arthropoda
Brugia malayi*                 Nematoda
Caenorhabditis briggsae F16*   Nematoda
Caenorhabditis elegans*        Nematoda
* Outgroup
Table S6. Expression machinery genes in D. melanogaster with no detected orthologous sequence in
Ch. tentans. Identifiers according to FlyBase r5.55.
Machinery                    Name     Gene         Protein      Gene length (aa)
Basal transcription factors  Taf12L   FBgn0031623  FBpp0077111  139
Basal transcription factors  Trf2     FBgn0261793  FBpp0303138  1716
Basal transcription factors  nht      FBgn0041103  FBpp0289298  246
Chromatin remodelling        CG12316  FBgn0036483  FBpp0075405  1189
Chromatin remodelling        Atac1    FBgn0031876  FBpp0078990  357
Chromatin remodelling        D12      FBgn0027490  FBpp0079312  970
Chromatin remodelling        Atac2    FBgn0032691  FBpp0080629  775
Chromatin remodelling        Sgf11    FBgn0036804  FBpp0112092  197
Nuclear pore                 Ulp1     FBgn0027603  FBpp0074462  1514
RNA export                   Nxf3     FBgn0263232  FBpp0305286  560
Splicing factors             CG17764  FBgn0029751  FBpp0070724  273
Splicing factors             lost     FBgn0263594  FBpp0078561  546
Splicing factors             CG17098  FBgn0032276  FBpp0079692  653
Splicing factors             CG9684   FBgn0037583  FBpp0081342  643
Splicing factors             ymp      FBgn0261287  FBpp0310517  187
Table S7. For each of the U1, U2, U4, U5 and U6 snRNAs, the Ch. tentans scaffolds containing a
significant sequence match to the D. melanogaster genes are listed. D. melanogaster contains
multiple copies of each type of snRNA gene (listed in the right column together with the length of
each gene). The length of the matched Ch. tentans regions and the percent identity within the matched
regions are shown.
snRNA  Ch. tentans, scaffold (region)  Match length (% identity)  D. melanogaster, query (length)
U1     sc2105 (1160-1314)              1-156 (80)                 U1:21D (164 bp)
                                       1-156 (80)                 U1:82Eb (164 bp)
                                       1-156 (80)                 U1:95Ca (164 bp)
                                       1-156 (80)                 U1:95Cb (164 bp)
                                       1-156 (80)                 U1:95Cc (164 bp)
       sc1126 (31031-31185)            1-156 (80)                 U1:21D
                                       1-156 (80)                 U1:82Eb
                                       1-156 (80)                 U1:95Ca
                                       1-156 (80)                 U1:95Cb
                                       1-156 (80)                 U1:95Cc
U2     sc4193 (1246-1057)              1-191 (83)                 U2:14B (192 bp)
                                       1-191 (83)                 U2:34ABa (192 bp)
                                       1-191 (84)                 U2:34ABb (192 bp)
                                       1-191 (84)                 U2:34ABc (192 bp)
                                       1-191 (83)                 U2:38ABa (192 bp)
                                       1-190 (83)                 U2:38ABb (191 bp)
U4     sc237 (102346-102432)           1-87 (77)                  U4:25F (148 bp)
                                       5-86 (81)                  U4:38AB (142 bp)
                                       5-86 (80)                  U4:39B (143 bp)
       sc357 (82409-82327)             5-86 (79)                  U4:38AB
                                       5-86 (79)                  U4:39B
U5     sc717 (38264-38363)             1-99 (86)                  U5:14B (110 bp)
                                       17-96 (90)                 U5:23D (131 bp)
                                       17-96 (88)                 U5:34A (127 bp)
                                       16-86 (91)                 U5:35D (126 bp)
                                       17-96 (90)                 U5:38ABa (127 bp)
                                       16-95 (90)                 U5:38ABb (127 bp)
                                       16-94 (90)                 U5:63BC (123 bp)
       sc717 (32690-32794)             19-107 (88)                U5:14B
                                       1-96 (85)                  U5:23D
                                       1-96 (84)                  U5:34A
                                       16-84 (91)                 U5:35D
                                       1-96 (84)                  U5:38ABa
                                       16-103 (88)                U5:38ABb
                                       16-94 (90)                 U5:63BC
       sc7116 (1377-1475)              1-98 (82)                  U5:14B
                                       17-96 (87)                 U5:23D
                                       17-95 (88)                 U5:34A
                                       16-84 (91)                 U5:35D
                                       17-96 (90)                 U5:38ABa
                                       16-95 (87)                 U5:38ABb
                                       16-94 (90)                 U5:63BC
U6     sc7078 (2259-2365)              1-107 (100)                U6:96Aa (107 bp)
                                       1-107 (100)                U6:96Ab (107 bp)
                                       1-107 (100)                U6:96Ac (107 bp)
       sc295 (9373-9268)               1-107 (98)                 U6:96Aa
                                       1-107 (98)                 U6:96Ab
                                       1-107 (98)                 U6:96Ac
       sc96 (18372-18476)              3-107 (97)                 U6:96Aa
                                       3-107 (97)                 U6:96Ab
                                       3-107 (97)                 U6:96Ac
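The scaffold regions in Table S7 follow a "scaffold (start-end)" pattern, and some are written with the larger coordinate first. The sketch below parses such a string and derives the region length. It assumes 1-based inclusive coordinates and assumes that a reversed coordinate pair indicates a match on the reverse strand; neither convention is stated in the table, and the function name is hypothetical.

```python
import re

# Pattern for entries such as "sc4193 (1246-1057)".
REGION = re.compile(r"^(sc\d+) \((\d+)-(\d+)\)$")


def parse_region(text: str) -> dict:
    """Split a Table S7 region string into scaffold, coordinates, length and strand."""
    match = REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region string: {text!r}")
    scaffold = match.group(1)
    first, second = int(match.group(2)), int(match.group(3))
    start, end = min(first, second), max(first, second)
    return {
        "scaffold": scaffold,
        "start": start,
        "end": end,
        "length": end - start + 1,  # assumes inclusive coordinates
        "strand": "+" if first <= second else "-",  # assumed convention
    }


for entry in ["sc2105 (1160-1314)", "sc4193 (1246-1057)"]:
    print(parse_region(entry))
```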
Table S8. Oligonucleotides used for in situ hybridization experiments. The P1-P4 oligonucleotides
were used as PCR primers to obtain probes for in situ hybridization. The BR2.1 and BR2.2
oligonucleotides were labeled with CY3 at their 5' ends.
P1a
P1b
P3a
P3b
5' GATTGTGGTGTTTAGTACTAGCC 3'
5' CATTTGCGAACCATGGTCTGC 3'
5' ACAAATGTCTTATGCAGTCTAG 3'
5' AAACCAACATTTCAGCTTG 3'
5' GTTTGACGTCAATTTGACAG 3'
5' CATAAATGACACGTCGATC 3'
P4a
5' ATCGACACCACTATATGAAG 3'
P4b
5' TGAGGAAATCATTCATGC 3'
BR2.2
BR2.1
5'-CY3-CTCTGGTTTAATTCCTGACCAACTTGGTCT 3'
5'-CY3-ACTTGGCTTGCTGTGTTTGCTTGGTTTGCT 3'
P2a
P2b