(Meta)genomic dataset preprocessing
Konstantinos Mavrommatis
KMavrommatis@lbl.gov

The underlying question(s) …
• What …#$#%*$ / <3 <3… is happening on my dataset after I submit it to IMG/ER?
• Why don’t I see the results immediately on IMG?

Dataset processing
• Sample preparation
• High throughput sequencing
• Assemble reads
• Analysis
• Feature prediction
• QC
• Functional annotation and comparative analysis

Outline
• Data preprocessing
• Annotation
• Time considerations

Dataset pre-processing (v 3.0a)
• Submitted file: assembled contigs, 454 reads, or Illumina reads (Fasta/fastq)
• File QC: check character set and contig name; remove trailing Ns
• Trimming (454 reads; Illumina reads): Q=20; Q=13 (Fasta output)
• Low complexity: size of 80 bp
• Dereplication: prefix = 5, identity 95%
• Clustering: 100% identity
• Output: file for gene calling (fasta)

Dataset pre-processing: Quality trimming
• Remove sequences from the ends of the reads.
• lucy for 454 datasets.
• Illumina: longest high quality string.
Courtesy Alex Copeland
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Dataset pre-processing: Low complexity filter
• Examples: tatatatatatatatatat, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
• Using dust (NCBI)
• Remove sequences with fewer than 80 informative bases

Dataset pre-processing: Dereplication

Dataset pre-processing: Sequence dereplication
• Example reads: atcccat, atc-cat, atcccat, atcccat, atcccat
• Example reads: gctacat, gctncat, gctacat, gctacat
• Figure label: Not dereplicated
• Using uclust
  – 95% identity (global alignment)
  – Identical prefix (5 nt)

Pre-processing in a nutshell
• Quality trimming
• Low complexity trimming (tatatatatatatatatat, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa)
• Dereplication

Outline
• Data preprocessing
• Annotation
• Time considerations

Dataset processing: Feature prediction pipeline (v 3.0a)
• Input: file for gene calling (fasta); unassembled reads + assembled contigs
• CRISPR detection: crt / pilercr
• RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam)
• CDS detection:
  – Isolates: prodigal
  – Metagenomes: varies
• Conflict resolution
• Concatenation of all results; creation of final output file
• File for IMG
• IMG

Genomes
• Prodigal to predict CDS
• tRNAscan to predict tRNAs
• In-house models for rRNA
• Infernal for ncRNA
• CRISPR detection

Gene calling on metagenomes is harder and error prone
• Unassembled sequences: small size, quality problems, large number
• Assembled sequences frequently contain errors, low quality regions, fragmented genes
• Gene calling is not accurate.

Dataset processing: Feature prediction
Simulated datasets:
a. Using fake reads extracted from finished genomes (perfect sequences)
b. Using real reads that have been used to assemble finished genomes (real errors)
Figure labels: isolate, metagenome; CORRECT, MISSED, NEW, WRONG
Available methods:
• Ab initio: FragGeneScan, Metagene, MetaGeneMark, Prodigal
• Similarity based: Blastx, USEARCH

Trimming (figure)

454 Ti, no errors (figure)

454 Ti, with errors (figure)

Illumina 115 bp (figure)

Contigs (figure; labels: frameshift, wrong prediction)

Genomes
• Weighted gene callers to predict CDS
• tRNAscan & blastn to predict tRNAs
• In-house models for rRNA
• Infernal for ncRNA
• CRISPR detection

Functional annotation
• COG/KOG
• Pfam/TIGRfam
• Usearch vs reference
  – KO terms/EC #
  – Phylogenetic distribution

Outline
• Data preprocessing
• Annotation
• Time considerations

Processing time (metagenomes)
• Total submissions: 336
• Processing time: 2.45 days (annotation)
• Data size (bp): 174,719,855 (average)

Processing time (isolates)
• Total submissions: 3630
• Processing time: 10 hours (annotation), 12 days (integration)
• Data size (bp): 1,658,242 (average), 4,114,099,773 (total)

The underlying question … Time for your questions
• What …#$#%*$ / <3 <3… is happening on my dataset after I submit it to IMG/ER?
• Patience is a virtue: it takes a lot of computations… and there are many datasets to be processed.
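
Appendix: pre-processing sketch
The three steps from "Pre-processing in a nutshell" (quality trimming, low complexity filtering, dereplication) can be illustrated with a small Python sketch. This is not the IMG pipeline code. The thresholds (Q=20, 80 informative bases, 5 nt prefix, 95% identity) come from the slides; everything else is an assumption made for illustration. In particular, the regular expression is a crude stand-in for dust, the greedy edit-distance clustering is a stand-in for uclust, lucy is not modelled, and all function names are hypothetical.

```python
import re

# Assumption: a crude stand-in for dust. It masks only mono- and
# dinucleotide repeats that span at least 10 bases.
LOW_COMPLEXITY = re.compile(r"([acgt])\1{9,}|([acgt]{2})\2{4,}", re.IGNORECASE)


def trim_longest_hq(seq, quals, min_q=20):
    """Return the longest contiguous stretch of bases with quality >= min_q."""
    best_start, best_len, start = 0, 0, None
    # A sentinel quality below any threshold closes the final run.
    for i, q in enumerate(list(quals) + [float("-inf")]):
        if q >= min_q:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return seq[best_start:best_start + best_len]


def informative_bases(seq):
    """Count the bases that fall outside masked low-complexity runs."""
    masked = sum(m.end() - m.start() for m in LOW_COMPLEXITY.finditer(seq))
    return len(seq) - masked


def identity(a, b):
    """Approximate global identity as 1 - edit distance / longer length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b), 1)


def dereplicate(seqs, prefix=5, min_identity=0.95):
    """Greedy clustering: drop a read when it shares its first `prefix` bases
    and at least `min_identity` with a representative that was already kept."""
    kept = []
    # Longest reads are visited first, so they become the representatives.
    for s in sorted(seqs, key=len, reverse=True):
        duplicate = any(
            s[:prefix] == r[:prefix] and identity(s, r) >= min_identity
            for r in kept
        )
        if not duplicate:
            kept.append(s)
    return kept


def preprocess(reads, min_q=20, min_informative=80):
    """reads: iterable of (sequence, list of per-base quality scores) pairs."""
    trimmed = (trim_longest_hq(s, q, min_q) for s, q in reads)
    passed = [s for s in trimmed if informative_bases(s) >= min_informative]
    return dereplicate(passed)
```

Calling `preprocess(reads)` on a list of `(sequence, qualities)` pairs returns the sequences that survive all three steps. The pairwise edit-distance comparison is quadratic and only suitable for toy inputs; it is shown to make the prefix and identity criteria concrete, not as a substitute for a real clustering tool.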