12. Preprocessing Metagenomic Datasets

(Meta)genomic dataset
preprocessing
Konstantinos Mavrommatis
KMavrommatis@lbl.gov
1
The underlying question(s) ….
• What …#$#%*$ / <3 <3… is happening on my
dataset after I submit it to IMG/ER?
• Why don’t I see the results immediately on
IMG?
2
Dataset processing
Sample preparation
High-throughput sequencing
Assemble reads
Analysis
Feature prediction
QC
Functional annotation
and comparative analysis
3
Outline
• Data preprocessing
• Annotation
• Time considerations
4
Dataset pre-processing
(v 3.0a)
Submitted files (Fasta/fastq): assembled contigs, 454 reads, Illumina reads
• File QC: check character set and contig names; remove trailing Ns.
• Trimming (two steps): Q=20 and Q=13. Output: Fasta.
• Low complexity: size of 80 bp.
• Dereplication: prefix = 5, identity 95%.
• Clustering: 100% identity.
File for gene calling (fasta)
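The file QC step (character-set check, contig-name check, trailing-N removal) might look roughly like the sketch below. The slides do not say which alphabet or naming rules the pipeline enforces, so the IUPAC nucleotide alphabet, the "printable, no whitespace" name rule, and the function name `qc_record` are all assumptions made for illustration.

```python
import re

# Assumption: the allowed alphabet is the IUPAC nucleotide codes; the slide
# only says "check character set".
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
# Assumption: a valid contig name is printable ASCII without whitespace.
VALID_NAME = re.compile(r"[\x21-\x7E]+")


def qc_record(name, sequence):
    """Return (cleaned_sequence, problems) for one FASTA record."""
    problems = []
    if not VALID_NAME.fullmatch(name):
        problems.append("bad name")
    seq = sequence.upper()
    unexpected = sorted(set(seq) - IUPAC_NUCLEOTIDES)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    # Remove trailing Ns, as the QC step describes.
    seq = seq.rstrip("N")
    return seq, problems
```

For example, `qc_record("contig_1", "acgtNNN")` returns the cleaned sequence `"ACGT"` with an empty problem list.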
Dataset pre-processing
Quality trimming
Courtesy Alex Copeland
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Remove low-quality sequence from the ends of the reads.
• 454 datasets: lucy.
• Illumina: keep the longest high-quality string.
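The "longest high quality string" idea for Illumina reads can be illustrated with a short sketch: keep the longest contiguous run of bases whose quality is at or above a threshold. This is not the pipeline's actual code; the function names are hypothetical, Phred+33 encoding is assumed, and the threshold is left as a parameter because the slides do not say which Q value applies to which read type.

```python
def phred33_to_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in quality_string]


def longest_high_quality_stretch(sequence, qualities, min_q):
    """Return the longest contiguous substring whose bases all have Q >= min_q."""
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(qualities):
        if q >= min_q:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run
    return sequence[best_start:best_start + best_len]
```

Trimming this way removes low-quality bases from both ends at once, and also discards a shorter good region that is separated from the longest one by a bad base.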
6
Dataset pre-processing
Low complexity filter
tatatatatatatatatat
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
using dust (NCBI)
- Remove sequences with fewer than 80 informative bases
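A minimal sketch of the 80-informative-bases rule follows. It assumes the sequence has already been soft-masked by a dust-style tool, so that low-complexity bases appear in lowercase and everything else in uppercase; that input convention and the function names are assumptions, and the masking itself is not reimplemented here.

```python
MIN_INFORMATIVE_BASES = 80  # threshold stated on the slide


def informative_bases(masked_sequence):
    """Count bases that are neither masked (lowercase) nor ambiguous (N etc.)."""
    return sum(1 for base in masked_sequence if base in "ACGT")


def passes_low_complexity_filter(masked_sequence,
                                 min_informative=MIN_INFORMATIVE_BASES):
    """Keep a sequence only if enough informative bases remain after masking."""
    return informative_bases(masked_sequence) >= min_informative
```

A read such as `"ACGT" * 10 + "ta" * 40` has only 40 informative bases under this convention and would be removed.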
7
Dataset pre-processing
Dereplication
8
Dataset pre-processing
Sequence dereplication
atcccat
atc-cat
atcccat
atcccat
atcccat
gctacat
gctncat
gctacat
gctacat
Not dereplicated
using uclust
- 95% identity (global alignment).
- Identical prefix (5 nt)
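The two dereplication criteria (identical 5-nt prefix, at least 95% identity over a global alignment) can be sketched in plain Python. This is a stand-in for illustration, not uclust: identity is approximated here as 1 minus the edit distance divided by the longer sequence length, and the greedy longest-first clustering order is an assumption.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance (a global alignment with unit costs)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match or substitution
            ))
        previous = current
    return previous[-1]


def identity(a, b):
    """Approximate global identity between two sequences (assumed definition)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dereplicate(sequences, prefix_len=5, min_identity=0.95):
    """Greedily keep one representative per group of near-identical sequences."""
    representatives = defaultdict(list)  # prefix -> kept sequences
    for seq in sorted(sequences, key=len, reverse=True):
        kept = representatives[seq[:prefix_len]]
        # A sequence is a replicate only if it shares the prefix AND is
        # similar enough to an already kept sequence.
        if not any(identity(seq, rep) >= min_identity for rep in kept):
            kept.append(seq)
    return [seq for group in representatives.values() for seq in group]
```

Grouping by prefix first means two reads that differ within their first five bases are never merged, however similar the rest of the read is.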
9
Pre-processing in a nutshell
• Quality trimming
• Low complexity
trimming
tatatatatatatatatat
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
• Dereplication
10
Outline
• Data preprocessing
• Annotation
• Time considerations
11
Dataset processing
Feature prediction pipeline (v 3.0a)
File for gene calling (fasta): unassembled reads + assembled contigs
• CRISPR detection: crt / pilercr
• RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam)
• CDS detection: isolates: prodigal; metagenomes: varies
• Conflict resolution
• Concatenation of all results
• Creation of final output file
File for IMG
IMG
Genomes
• Prodigal to predict CDS
• tRNAscan to predict tRNAs
• In-house models for rRNA
• Infernal for ncRNA
• CRISPR detection
13
Gene calling on metagenomes is harder and error-prone
Unassembled sequences
• small size,
• quality problems,
• large number.
Assembled sequences frequently
• contain errors,
• contain low-quality regions,
• contain fragmented genes.
Gene calling is not accurate.
14
Dataset processing
Feature prediction
Simulated datasets:
a. Using fake reads extracted from finished genomes (perfect sequences).
b. Using real reads that have been used to assemble finished genomes (real errors).
[Figure: gene calls on the isolate vs. the metagenome, classified as CORRECT, MISSED, NEW, or WRONG]
Available methods:
Ab initio: FragGeneScan, Metagene, MetaGeneMark, Prodigal.
Similarity based: Blastx, USEARCH.
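One way to score gene calls on a simulated dataset against the genes of the finished genome is sketched below. The slides name the four categories but not their definitions, so the rules used here are assumptions: a prediction is CORRECT if it overlaps a reference gene on the same strand and in the same frame, WRONG if it overlaps one only on the wrong strand or frame, NEW if it overlaps none, and a reference gene is MISSED if no prediction overlaps it. Coordinates are assumed to be 1-based, inclusive, and aligned to codon boundaries.

```python
from collections import Counter, namedtuple

Gene = namedtuple("Gene", "start end strand")  # 1-based inclusive coordinates


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def same_frame(a, b):
    if a.strand != b.strand:
        return False
    # On the minus strand the codon phase is anchored at the right-hand end.
    if a.strand == "+":
        return (a.start - b.start) % 3 == 0
    return (a.end - b.end) % 3 == 0


def classify(reference, predicted):
    """Count CORRECT / WRONG / NEW predictions and MISSED reference genes."""
    counts = Counter()
    hit_reference = set()
    for pred in predicted:
        hits = [i for i, ref in enumerate(reference) if overlaps(pred, ref)]
        hit_reference.update(hits)
        if not hits:
            counts["NEW"] += 1
        elif any(same_frame(pred, reference[i]) for i in hits):
            counts["CORRECT"] += 1
        else:
            counts["WRONG"] += 1
    counts["MISSED"] = len(reference) - len(hit_reference)
    return counts
```

Comparing by frame instead of by exact coordinates lets a fragmentary gene on a short read still count as a correct call.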
15
Trimming
16
454 Ti (no errors)
17
454 Ti (with errors)
18
Illumina 115 bp
19
Contigs
frameshift
Wrong prediction
20
Genomes
• Weighted gene callers to predict CDS
• tRNAscan & blastn to predict tRNAs
• In-house models for rRNA
• Infernal for ncRNA
• CRISPR detection
21
Functional annotation
• COG/KOG
• Pfam/TIGRfam
• USEARCH vs reference
– KO terms/EC #
– Phylogenetic distribution
22
Outline
• Data preprocessing
• Annotation
• Time considerations
23
Processing time (metagenomes)
Total submissions: 336
Processing time: 2.45 days (annotation)
Data size (bp): 174,719,855 (average)
24
Processing time (isolates)
Total submissions: 3630
Processing time: 10 hours (annotation), 12 days (integration)
Data size (bp): 1,658,242 (average); 4,114,099,773 (total)
25
The underlying question ….
Time for your questions
• What …#$#%*$ / <3 <3… is happening on my
dataset after I submit it to IMG/ER?
• Patience is a virtue: it takes a lot of computation, and there are
many datasets to be processed.
26