Metagenomic dataset preprocessing – data reduction
Konstantinos Mavrommatis
KMavrommatis@lbl.gov

Complexity
The total metagenome is the result of a cell community. Cells belong to different organisms ranging from strains to domains.
[Figure: species complexity axis (1, 10, 100, 1000, 10000) with example metagenomes: Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil]
- Who is there? (phylogenetic content)
- What does it do? (functional content)
- Why is it there? (comparative study)

Dataset processing
[Workflow diagram. Steps shown: Sample preparation; High throughput sequencing; Assemble reads; Analysis; Feature prediction; QC; Functional annotation and comparative analysis; Binning]

Dataset processing (v 3.0a)
Submitted files: assembled contigs, 454 reads, Illumina reads (FASTA/FASTQ)
- File QC. Check character set and contig name. Remove trailing Ns.
- Trimming, shown twice on the slide: Q=20 and Q=13.
- Low complexity. Size of 80 bp.
- Intermediate file (FASTA).
- Dereplication. Prefix = 5, identity 95%.
- Clustering. 100% identity.
Output: file for gene calling (FASTA)

Dataset processing: Feature prediction pipeline (v 3.0a)
Input: file for gene calling (FASTA); unassembled reads + assembled contigs
- CRISPR detection: crt / pilercr
- RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam)
- CDS detection. Isolates: prodigal. Metagenomes: varies.
- Conflict resolution
- Concatenation of all results. Creation of final output file.
Output: file for IMG -> IMG

Dataset processing: Quality trimming
- Remove low-quality sequence from the ends of the reads.
- lucy for 454 datasets.
- Illumina: longest high-quality string.
Courtesy Alex Copeland. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Dataset processing: Low complexity filter
Examples of low-complexity sequence:
  tatatatatatatatatat
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- Using dust (NCBI).
- Remove sequences with fewer than 80 informative bases.

Dataset processing: Dereplication

Dataset processing: Sequence dereplication
[Figure: example reads atcccat, atc-cat, atcccat, atcccat, atcccat, gctacat, gctncat, gctacat, gctacat; one case is labelled "Not dereplicated"]
Using uclust:
- 95% identity (global alignment).
- Identical prefix (5 nt).

Dataset processing: Evaluation of processing tools
Because of their small size, quality problems, and large number, unassembled sequences need to be processed with efficient pipelines.
Simulated datasets:
a. Using sequences extracted from finished genomes (perfect sequences).
b. Using reads that have been used to assemble finished genomes (real errors).
Evaluation and development of new tools/wrappers.

Dataset processing: Feature prediction
Available methods:
- Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal.
- Similarity based: Blastx, USEARCH.
[Figure: gene calls on an isolate compared with gene calls on a metagenome; categories CORRECT, MISSED, NEW, WRONG]

[Results figures, one per slide: Trimming; 454 Ti (no errors); 454 Ti (with errors); Illumina 115 bp; Illumina 74 bp; Contigs (labels: frameshift, wrong prediction)]

Why annotate unassembled reads?
Sample total size: 102,722,384 (2x150) reads
Assembled contigs: 1,375,950 contigs
Assembled reads, mapped (by bwa): 11,778,925 reads
Genes called on unassembled reads: 64,737,444 genes
Similar to genes on contigs: 8,373,641 (12%) genes
Genes with similarity to isolate genomes: 40,778,854 genes
- Assembled only: 5060 different pfams.
- Unassembled + assembled: 7481 different pfams; more accurate statistics.
- Unassembled + assembled + real metagenome: additional information about functions and phylogeny.

Processing time (metagenomes)
Total submissions: 336
Processing time: 2.45 days (annotation), 24 days (integration)
Data size (bp): 174,719,855 (average), 58,006,992,092 (total)

Processing time (isolates)
Total submissions: 3630
Processing time: 10 hours (annotation), 12 days (integration)
Data size (bp): 1,658,242 (average), 4,114,099,773 (total)

Thank you for your attention
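
Sketch: Illumina quality trimming (longest high-quality string)
The quality trimming slide describes keeping the longest high-quality string of each Illumina read, and the File QC step removes trailing Ns. The Python below is a minimal illustration of those two rules, not the pipeline's own code. The function names, the Phred+33 quality encoding, and the tie-breaking rule (the first longest run wins) are assumptions. The slides list both Q=20 and Q=13, so the cutoff is left as a required argument. lucy, used for 454 datasets, is not reproduced here.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding (an assumption)."""
    return [ord(ch) - 33 for ch in quality_string]


def longest_high_quality_stretch(seq, quals, min_q):
    """Return (subsequence, start, end) for the longest run where every
    base has quality >= min_q. Ties go to the first such run.
    Returns ("", 0, 0) when no base passes the cutoff."""
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # the run is broken by a low-quality base
    return seq[best_start:best_end], best_start, best_end


def strip_trailing_ns(seq):
    """Remove trailing N characters, as in the File QC step."""
    return seq.rstrip("Nn")
```

For example, longest_high_quality_stretch("ACGTACGTAC", [5, 30, 30, 12, 30, 30, 30, 30, 8, 30], 20) keeps bases 4 to 7 of the read, the longest stretch in which no base falls below the cutoff.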
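
Sketch: low complexity filter (fewer than 80 informative bases)
The low complexity slide says sequences with fewer than 80 informative bases are removed using dust (NCBI). The Python below is only a simplified DUST-like stand-in to show the idea; it is not NCBI's implementation. The 64-base window, the score threshold of 2.0, counting N as non-informative, and all function names are assumptions. Only the 80-base cutoff comes from the slides.

```python
from collections import Counter


def dust_like_score(window):
    """Triplet-repeat score of one window: the number of identical
    triplet pairs divided by (number of triplets - 1)."""
    triplets = [window[i:i + 3] for i in range(len(window) - 2)]
    if len(triplets) < 2:
        return 0.0
    counts = Counter(triplets)
    pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return pairs / (len(triplets) - 1)


def informative_bases(seq, window=64, threshold=2.0):
    """Count a/c/g/t bases that lie outside every window whose
    score exceeds the threshold (assumed parameters, see lead-in)."""
    seq = seq.lower()
    masked = [False] * len(seq)
    last_start = max(len(seq) - window, 0)
    for start in range(last_start + 1):
        end = min(start + window, len(seq))
        if dust_like_score(seq[start:end]) > threshold:
            for i in range(start, end):
                masked[i] = True  # the whole window is low complexity
    return sum(1 for base, m in zip(seq, masked) if not m and base in "acgt")


def passes_low_complexity_filter(seq, min_informative=80):
    """Keep a sequence only if it has at least min_informative informative bases."""
    return informative_bases(seq) >= min_informative
```

With these assumed settings, both example sequences from the slide (a run of "ta" repeats and a run of "a") are fully masked and would be removed.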
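
Sketch: dereplication by identical prefix and 95% identity
The dereplication slide gives two conditions for treating a read as a replicate: an identical 5 nt prefix and 95% identity over a global alignment, computed with uclust. The Python below is a minimal sketch of that rule and does not reproduce uclust. Identity is approximated here as 1 - edit distance / longer length, and reads are processed longest first with the first read of each group kept as representative; both choices, and all function names, are assumptions.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def identity(a, b):
    """Approximate global identity (an assumption, not uclust's definition)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def dereplicate(records, prefix_len=5, min_identity=0.95):
    """Return the (name, sequence) records kept as representatives.
    A read is dropped only if it shares its first prefix_len bases with
    a representative AND is at least min_identity identical to it."""
    reps_by_prefix = defaultdict(list)
    kept = []
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        bucket = reps_by_prefix[seq[:prefix_len]]
        if any(identity(seq, rep) >= min_identity for rep in bucket):
            continue  # replicate of an existing representative
        bucket.append(seq)
        kept.append((name, seq))
    return kept
```

Because reads are only compared within the same prefix bucket, a read that is nearly identical to a representative but differs in its first 5 bases is kept, which corresponds to the "Not dereplicated" case on the slide.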