Data sets
o Sequences from Genoscope comprised 4 sets of sequences, each with 14300 to 213000 reads (about 696000 reads in total). Each set corresponds to 1/4 of a Titanium run, i.e. it was generated from one quarter of a 454 plate (called regions 1 to 4).
o 16 different DNA/cDNA samples [DNA or cDNA from OF (Oslo fjord) or NB (Naples Bay), amplified with different sets of primers] were sequenced in this run. Region-1 = 5 samples; Region-2 = 3 samples; Region-3 = 5 samples; Region-4 = 3 samples.

No major contamination
o The first good news is that, if you take several sequences and BLAST them, you will find that many of them match protists, indicating that the generated sequence sets contain our targets.

Cleaning and classification of sequences
o The origin of a particular sequence (i.e. the sample it came from) was determined by the combination of “Region” + “MID” + “Forward primer”.
o However, this combination was not sufficient to recognize some of the cross-contamination between different regions of the 454 plate. For the next run, it might be better to choose combinations of “MID” and “Forward primer” that by themselves distinguish all the samples on the 454 plate.
o Some of the samples were sequenced from adaptor-A, while others were sequenced from adaptor-B. The sequences read from adaptor-B (1 sample from region-2, 4 samples from region-3 and 2 samples from region-4) are on the complementary strand. These were reverse-complemented to facilitate the downstream analysis. For the next run, it might be worth ensuring that sequencing starts from adaptor-A (which is next to the MID), to obtain higher sequence quality around the MID.
o We noticed that the rightmost base of the reverse primers was almost always missing. This is probably due to the cleaning process at Genoscope (when they remove the distal adaptor-B).
o 50% to 60% of the sequences from each set show the configuration “MID” + “F-primer” + “R-primer”.
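
The sample-assignment steps described under “Cleaning and classification of sequences” can be sketched roughly as below. This is only a minimal illustration, not the script actually used: the MID, primer sequences and sample names are made-up placeholders, the function names are hypothetical, and matching is exact (a real pipeline would probably tolerate sequencing errors).

```python
# Placeholder lookup table: (region, MID, forward primer) -> sample name.
# All sequences and names here are invented for illustration.
SAMPLES = {
    (1, "ACGTACGTAC", "GGATCCTTAAGG"): "sample_A",
    (2, "TGCATGCATG", "GGATCCTTAAGG"): "sample_B",
}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement, e.g. for reads sequenced from adaptor-B."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_sample(region, read, samples=SAMPLES):
    """Return the sample whose MID + forward primer starts the read, or None."""
    for (reg, mid, fwd), name in samples.items():
        if reg == region and read.startswith(mid + fwd):
            return name
    return None


def has_full_configuration(read, mid, fwd, rev_site, max_missing=1):
    """Check for MID + F-primer + insert + R-primer.

    rev_site is the reverse primer as it appears at the right end of the read.
    Up to max_missing bases may be absent from its right end, since the
    rightmost base was observed to be missing in most reads.
    """
    prefix = mid + fwd
    if not read.startswith(prefix):
        return False
    body = read[len(prefix):]
    for missing in range(max_missing + 1):
        tail = rev_site[:len(rev_site) - missing]
        # Require a non-empty insert between the two primers.
        if tail and body.endswith(tail) and len(body) > len(tail):
            return True
    return False
```

Because the region is part of the lookup key, a read carrying a valid MID and forward primer but coming from the wrong region is returned as unassigned here; detecting it as cross-contamination would require MID and forward primer combinations that are unique across the whole plate, as suggested above.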

A very preliminary taxonomic assignment
o All sequences with a “MID + F-primer + sequence + R-primer” configuration were searched against Laure’s 18S rDNA DB. For hits with E-value < 1e-10, the taxonomic classification of the best hit was assigned to the 454 sequence.
o Four cDNA sequence sets amplified with Foraminifera primers were mostly classified as “Rhizaria; Foraminifera”. More than 97% of the sequences in three of the sets were classified as Foraminifera. The remaining set (OF1 2E cDNA) showed a slightly smaller percentage (~88%) of Foraminifera, with the remaining 12% unclassified.
o V4 vs V9 comparison
  V4 and V9 amplification resulted in very similar taxonomic distributions. The few exceptions that we noted are as follows.
  Heterolobosea and Euglenozoa were detected only with V9 amplification.
  Fungi and Alveolata tend to show a larger number of sequences for V9 than for V4.
  These seem to be general trends across different samples. We don’t know yet whether they are due to differences between the V4 and V9 primers, or to the composition of the reference 18S database.
  “Archaeplastida; Chlorophyta”: V4 >> V9 in the NB1 SED1 DNA sample, but the reverse tendency was observed for NB1 SED1 RNA.
o DNA vs RNA (cDNA)
  DNA and cDNA show similar taxonomic distributions. Notable exceptions are as follows:
  Foraminifera, Cercozoa and Lobosea tend to be more highly represented in the RNA sets than in the DNA sets.
  Metazoa tend to be more highly represented in the DNA sets than in the RNA sets.
o Other remark
  Naples Bay (NB) appears to be a little more diverse than Oslo fjord (OF). Radiolaria, Haptophyta and Rhodophyta are more abundant in NB than in OF, for instance.
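
For reference, the best-hit assignment rule described at the start of this section (keep the best hit with E-value < 1e-10 and transfer its taxonomy to the read) could be sketched as below. This is a hedged illustration only. It assumes the search results are available as standard 12-column BLAST tabular output (E-value in column 11, bit score in column 12), which these notes do not state, and that "best hit" means highest bit score. The function names and the `taxonomy` mapping from database entry to classification are hypothetical.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # threshold quoted in the notes


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each query read to its best-scoring subject below the E-value cutoff.

    Assumes tab-separated lines with the query id in column 1, the subject id
    in column 2, the E-value in column 11 and the bit score in column 12.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= cutoff:
            continue  # hit too weak to be used for classification
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def classify(read_ids, hits, taxonomy):
    """Count reads per taxonomic label; reads without a usable hit are 'unclassified'."""
    return Counter(taxonomy.get(hits.get(read), "unclassified") for read in read_ids)
```

Running `classify` separately on each sample gives the per-sample counts from which percentages such as the Foraminifera fractions above can be derived. The share of "unclassified" reads depends directly on the chosen cutoff and on the coverage of the reference database.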