Data sets
o Sequences from Genoscope were composed of 4 sets of sequences, each with 14300 to 213000
reads (about 696000 reads in total). Each set corresponds to a quarter of a Titanium run, i.e.
one of the four regions (region 1 to 4) of a 454 plate.
o 16 different DNA/cDNA samples [DNA or cDNA from OF (Oslo fjord) or NB (Naples Bay),
amplified with different primer sets] were sequenced in this run. Region-1 = 5 samples; Region-2 = 3
samples; Region-3 = 5 samples; Region-4 = 3 samples.
No major contamination
o The first good news is that, if you take several sequences and run BLAST, you will find that
many of them match protists, indicating that the generated sequence sets contain our
targets.
Cleaning and classification of sequences
o The origin of a particular sequence (i.e. the sample from which the sequence came) was
determined by the combination of “Region” + “MID” + “Forward primer”.
o However, this combination was not sufficient to detect some of the cross-contamination between
different regions of the 454 plate. For the next run, it might be better to choose the combinations
of “MID” and “Forward primer” so that they alone distinguish all the samples included in
the 454 plate.
o Some of the samples were sequenced from adaptor-A, while others were sequenced from
adaptor-B. The sequences read from adaptor-B (1 sample from region-2, 4 samples from region-3
and 2 samples from region-4) are in the complementary orientation. These were reverse-complemented to
facilitate the subsequent analysis. For the next run, it would be worth ensuring that
sequencing starts from adaptor-A (which is next to the MID), to obtain higher sequencing quality
around the MID.
o We noticed that the rightmost base of the reverse primers was almost always missing. This is
probably due to the cleaning process at Genoscope (when they remove the distal adaptor B).
o 50% to 60% of the sequences in each set show the configuration “MID” + “F-primer”
+ “R-primer”.
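The cleaning steps above can be sketched in Python. This is only an illustration, not the pipeline that was actually used: the MID sequences, primer sequences, sample names and function names are made-up placeholders, primers are matched exactly (no mismatches or degenerate bases), and a read sequenced from adaptor-B is assumed to have been reverse-complemented before it is classified.

```python
# Hypothetical lookup table; the real MIDs, primers and sample names differ.
SAMPLES = {
    # (region, MID, forward primer) -> sample name
    (1, "ACGAGTGCGT", "CCAGCAGCCGCGGTAATTCC"): "sample_A",
    (2, "ACGCTCGACA", "CCAGCAGCCGCGGTAATTCC"): "sample_B",
}
# Placeholder reverse primer, written as it would appear at the end of a read.
REVERSE_PRIMER = "ACTTTCGTTCTTGAT"

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a read (used for reads sequenced from adaptor-B)."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_read(region, read):
    """Return (sample, insert).

    sample is None when no Region + MID + F-primer combination matches.
    insert is None unless the read has the full
    MID + F-primer + sequence + R-primer configuration.
    """
    for (reg, mid, fwd), sample in SAMPLES.items():
        prefix = mid + fwd
        if reg == region and read.startswith(prefix):
            rest = read[len(prefix):]
            # The last base of the reverse primer is often missing,
            # so accept the primer with or without that base.
            for primer in (REVERSE_PRIMER, REVERSE_PRIMER[:-1]):
                if rest.endswith(primer):
                    return sample, rest[:-len(primer)]
            return sample, None
    return None, None
```

Counting the reads for which `insert` is not None, per set, would give the share of reads with the complete configuration.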
A very preliminary taxonomic assignment
o All sequences with a “MID + F-primer + sequence + R-primer” configuration were searched
against Laure’s 18S rDNA DB. For hits with E-value < 10^-10, the taxonomic classification of the best
hit was assigned to the 454 sequence.
o The four cDNA sequence sets amplified with Foraminifera primers were mostly classified as “Rhizaria;
Foraminifera”. More than 97% of the sequences in three sets were classified as Foraminifera.
The remaining set (OF1 2E cDNA) showed a slightly smaller percentage (~88%) of
Foraminifera, with the remaining 12% unclassified.
o V4 vs V9 comparison
  - V4 and V9 amplification resulted in very similar taxonomic distributions. The few
    exceptions that we noted are as follows.
  - Heterolobosea and Euglenozoa were represented only in the V9 amplification.
  - Fungi and Alveolata tend to show a larger number of sequences for V9 than for V4.
  - These seem to be general trends across the different samples. We don’t know yet whether they
    are due to differences between the V4 and V9 primers, or to the composition of the
    reference 18S database.
  - “Archaeplastida; Chlorophyta”: V4 >> V9 in the NB1 SED1 DNA sample, but the reverse
    tendency was observed for NB1 SED1 RNA.
o DNA vs RNA (cDNA)
  - DNA and cDNA show similar taxonomic distributions. Notable exceptions are as
    follows:
  - Foraminifera, Cercozoa and Lobosea tend to be more highly represented in the RNA sets than
    in the DNA sets.
  - Metazoa tend to be more highly represented in the DNA sets than in the RNA sets.
o Other remark
  - Naples Bay (NB) appears to be slightly more diverse than Oslo fjord (OF). Radiolaria,
    Haptophyta and Rhodophyta, for instance, are more abundant in NB than in OF.
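The best-hit assignment used in this section, and the per-taxon proportions compared between sets, can be sketched as follows. This is a minimal illustration under several assumptions: the search results are taken to be in BLAST-style tabular form (12 tab-separated columns, E-value in column 11 and bit score in column 12), "best hit" is taken to mean the lowest E-value with ties broken by bit score, and the `taxonomy` dictionary is a hypothetical stand-in for the annotation of the 18S rDNA reference database. All function names are invented for the example.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # threshold quoted in the text


def best_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Map each query ID to its best subject ID among hits with E-value < cutoff."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= cutoff:
            continue
        # Lower E-value is better; a higher bit score breaks ties.
        key = (evalue, -bitscore)
        if query not in best or key < best[query][0]:
            best[query] = (key, subject)
    return {query: subject for query, (_, subject) in best.items()}


def taxon_fractions(hits, taxonomy, all_queries):
    """Fraction of reads per taxon; reads without an accepted hit are 'unclassified'."""
    counts = Counter(
        taxonomy.get(hits[q], "unclassified") if q in hits else "unclassified"
        for q in all_queries
    )
    total = len(all_queries)
    return {taxon: n / total for taxon, n in counts.items()}
```

Running `taxon_fractions` once per sequence set (for example one V4 set and one V9 set from the same sample) gives tables that can be compared taxon by taxon, as in the remarks above.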