
Algorithms for Multisample
Read Binning
Student: Gabriel Ilie
Major advisor: Ion Măndoiu
Associate Advisor: Sanguthevar Rajasekaran
Associate Advisor: Yufeng Wu
University of Connecticut
November 2013
Outline
• Motivation
• Previous approaches
• Algorithms
• Results
• Ongoing work
Human microbiome
• The microorganisms inhabiting our bodies comprise the microbiome.
• Microbes outnumber our own cells by 10 to 1 [Wooley et al, 2010].
• Diabetes, obesity, cancer and even attractiveness to mosquitoes seem to correlate with changes in our microbiome [Turnbaugh et al, 2010].
• To better understand our own condition, we also need to understand the composition of the microbial communities inhabiting our bodies and how they interact, not just between themselves but also with their habitat (e.g. the host organism).
Single organism studies
• they rely on clonal cultures
• most microorganisms cannot be cultivated
• they suffer from amplification bias
• microbes do not live in single-species communities
• members of these communities interact not just with each other but also with their habitats, which include the host organism
Data obtained from clonal cultures is highly biased and does not capture a true picture of microbial life.
Transcriptomics
• Genomes only provide information about the potential function of organisms.
• Having a gene does not mean that the gene is also expressed inside the host or under a particular condition.
• The transcriptome is the set of all RNA molecules produced in one cell or a population of cells.
• In order to understand the physiology of the microorganisms we need to know their transcriptome.
Metatranscriptomics
• The transcriptome of a community is the union of the transcriptomes of its members.
• In metatranscriptomic studies:
o bulk RNA is extracted directly from environmental samples
o the RNA is reverse-transcribed
o the resulting RNA-Seq libraries are sequenced
• Metatranscriptomic studies are essential in order to understand the physiology of the microbiome.
Challenges of working with
metatranscriptomic data
1. The volume of sequencing data is several orders of magnitude larger than for single organisms.
2. Reads can come from hundreds of different species, each with a different abundance level.
3. In addition to the range in the abundance levels of the microorganisms, genes are expressed at drastically different levels.
4. Genes usually have multiple isoforms.
Metatranscriptomics has to deal with all of the challenges of metagenomics (1 and 2) plus some extra challenges (3 and 4); therefore, algorithms devised for metagenomic data can also be applied to metatranscriptomic data.
(Meta)Transcriptome assembly types
Genome independent reconstruction (de novo):
• de Bruijn k-mer graph
❖ Velvet (2008)
❖ Trinity (2011)
Genome guided reconstruction:
• spliced read mapping
• exon identification
• splice graph
❖ Cufflinks (2010)
❖ TRIP (2012)
Outline
• Motivation
• Previous approaches
• Algorithms
• Results
• Ongoing work
Clustering reads into bins
• Analysis of environmental samples is difficult.
• To simplify the assembly process, many metagenomic tools have been developed to cluster the reads into bins (i.e. species).
• Algorithms developed for binning metagenomic reads can also be applied to (meta)transcriptomic reads (bins represent transcripts instead of species).
Types of read binning algorithms
Genome dependent
❖ CompostBin (2008)
❖ Metacluster (2012)
• DNA composition patterns.
• G+C content and dinucleotide frequencies vary among species.
• Drawbacks:
o achieve reasonable performance only for long reads (800~1000 bp, [Wu et al, 2011])
o NGS technologies produce short reads
Genome independent
❖ AbundanceBin (2011)
❖ MultiBin (2011)
• K-mer frequencies are usually linearly proportional to a genome's abundance.
• Sufficiently long k-mers are usually unique.
• Works with short sequencing reads.
• Drawbacks:
o group together reads from different species if they have close abundance levels
o do not perform well on species with low abundances
Metacluster (2012)
❖ Two-round unsupervised binning algorithm:
➢ the first round clusters high-abundance reads
■ reads whose 16-mers appear fewer than T times are filtered out
■ the reads are grouped based on shared 36-mers (a simplified sketch of this grouping step follows below)
■ the groups are clustered using 5-mer distributions
➢ the second round clusters the remaining (low-abundance) reads
■ reads with unique 16-mers are discarded
■ the reads are grouped based on shared 22-mers
■ the groups are clustered using 4-mer distributions
❖ Advantages:
➢ better binning of reads from low-abundance species
➢ uses both techniques: k-mer abundances and DNA composition patterns
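For intuition only, here is a minimal sketch of the "group reads that share a k-mer" step (with 36-mers), built on a union-find structure. The abundance filter, the 5-mer distribution clustering and all other Metacluster details are omitted, and the function name is ours.

```python
# Minimal sketch (not the Metacluster implementation): put two reads in the
# same group whenever they share at least one 36-mer, using union-find.
def group_by_shared_kmers(reads, k=36):
    parent = list(range(len(reads)))

    def find(x):                      # find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    first_seen = {}                   # k-mer -> first read that contained it
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in first_seen:
                union(first_seen[kmer], i)
            else:
                first_seen[kmer] = i

    groups = {}
    for i in range(len(reads)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())      # lists of read indices, one per group
```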
MultiBin (2011)
• Processes multiple samples (N > 1) of the same microbial community.
• Clusters the reads into b bins (b is the number of species).
• Binning algorithm (a simplified clustering sketch follows below):
o all reads are pooled together
o a graph G = (V, E) is generated: V is the set of reads, and edges connect reads with substantial overlap (50 bp)
o the vertices are greedily partitioned into a set of tags, such that each read is either a tag or affiliated with a tag that substantially overlaps it
o the set of tags is clustered using the vectors Vt = (ct1, ct2, ..., ctN), where cti is the number of reads from sample i which substantially overlap tag t
o each non-tag read is assigned to the same bin as its affiliated tag
❖ b needs to be known in advance or estimated somehow.
❖ Uses abundance differences in any of the samples to tell low-abundance species apart.
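As a rough illustration of the tag-clustering step only, the sketch below clusters the tag count vectors Vt with k-means from scikit-learn; tag selection and overlap counting are assumed to be done already, and k-means is used purely for illustration rather than as MultiBin's actual clustering procedure. All names are ours.

```python
# Illustrative sketch only (not the MultiBin code): cluster tag count
# vectors V_t = (c_t1, ..., c_tN) into b bins with k-means.
import numpy as np
from sklearn.cluster import KMeans

def bin_tags(tag_counts, b):
    """tag_counts: dict tag_id -> length-N list of per-sample read counts."""
    tags = list(tag_counts)
    X = np.array([tag_counts[t] for t in tags], dtype=float)
    # Normalize each vector so clustering reflects relative abundance profiles.
    X = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    labels = KMeans(n_clusters=b, n_init=10).fit_predict(X)
    return dict(zip(tags, labels))    # tag_id -> bin label
```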
Outline
• Motivation
• Previous approaches
• Algorithms
• Results
• Ongoing work
Our approach
We propose a novel unsupervised, abundance-based algorithm for binning reads from multiple samples.
Pseudo-exons
Pseudo-exons are defined as substrings whose k-mers and (k+1)-mers appear in the same transcripts with the same multiplicity.
[Figure: four transcripts T1-T4 composed of exons A-F]
Exon signatures:
A: T1x1 T2x2 T3x1
B: T1x1 T2x1 T3x1
C: T1x1 T2x1 T3x1
D: T1x1 T3x1
E: T3x1 T4x1
F: T3x1 T4x1
Pseudo-exons: A, BC, D, E, F
Notice that exons E and F form different pseudo-exons because in T4
they are in reverse order compared to T3.
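The toy sketch below illustrates the signature idea on exon labels rather than on k-mers: exons with identical signatures merge into one pseudo-exon only if they always occur consecutively in the same order. The transcript compositions used here are a guessed reconstruction chosen to be consistent with the signatures above, and the helper names are ours.

```python
# Toy illustration of the pseudo-exon idea (our own sketch, on exon labels
# instead of k-mers/(k+1)-mers).
from collections import Counter

def signature(exon, transcripts):
    """In which transcripts an exon occurs, and with what multiplicity."""
    return {name: Counter(exons)[exon]
            for name, exons in transcripts.items() if exon in exons}

def mergeable(a, b, transcripts):
    """True if a and b have the same signature and a is always immediately
    followed by b, i.e. they can form one pseudo-exon."""
    if signature(a, transcripts) != signature(b, transcripts):
        return False
    for exons in transcripts.values():
        for i, e in enumerate(exons):
            if e == a and (i + 1 >= len(exons) or exons[i + 1] != b):
                return False
    return True

# Guessed transcript compositions, consistent with the signatures above.
transcripts = {'T1': ['A', 'B', 'C', 'D'],
               'T2': ['A', 'A', 'B', 'C'],
               'T3': ['A', 'B', 'C', 'D', 'E', 'F'],
               'T4': ['F', 'E']}
print(mergeable('B', 'C', transcripts))   # True  -> pseudo-exon BC
print(mergeable('E', 'F', transcripts))   # False -> E and F stay separate
```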
K-mer counting
• we count k-mers and (k+1)-mers using Jellyfish (2011)
• Jellyfish was designed for shared-memory parallel computers with more than one core; it uses several lock-free data structures and multi-threading to count k-mers much faster than other tools
• formally, for a given value of k, we count the number of occurrences of all k-mers in each of the N samples
• our algorithms assume strand-nonspecific data, therefore the counts of complementary k-mers are summed together, as they are indistinguishable
• we store k-mers in canonical form (the lexicographically smaller of a k-mer and its reverse complement)
• we combine the counts over all the samples into a list of N-dimensional vectors (a small illustration follows below)
• the maximum value of k supported by Jellyfish is 31
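As a pure-Python illustration of this step (the real counts are produced with Jellyfish), the sketch below counts canonical k-mers per sample and combines them into one N-dimensional vector per k-mer; the function names are ours.

```python
# Minimal illustration of canonical k-mer counting across N samples
# (the actual pipeline uses Jellyfish for this).
from collections import Counter, defaultdict

COMP = str.maketrans('ACGT', 'TGCA')

def canonical(kmer):
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)                     # lexicographically smaller form

def count_sample(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

def combine(samples, k):
    """samples: list of N read lists -> dict canonical k-mer -> length-N vector."""
    vectors = defaultdict(lambda: [0] * len(samples))
    for s, reads in enumerate(samples):
        for kmer, c in count_sample(reads, k).items():
            vectors[kmer][s] = c
    return vectors
```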
De Bruijn graph
• We construct the de Bruijn graph: vertices are k-mers and edges are (k+1)-mers.
• Because we store vertices in canonical form, each vertex represents two k-mers: itself and its reverse complement.
• We add an edge between two k-mers if and only if there is a (k+1)-mer in the reads such that its prefix of length k in canonical form matches one of the vertices, while its suffix of length k in canonical form matches the other one.
• We will sometimes use "vertices" to refer to k-mers and "edges" to refer to (k+1)-mers.
De Bruijn graph
Example (k = 3):
k-mers and their reverse complements: ACG/CGT, CGC/GCG, CGA/TCG, ATC/GAT
Vertices (canonical form): ACG, CGC, CGA, ATC
Edges ((k+1)-mers): ACGC, CGCG, ACGA, ATCG
● The lists of k-mers and (k+1)-mers define an implicit representation of the de Bruijn graph, therefore we don't need to construct it explicitly.
● Relative to the canonical form of a vertex, for each edge we can say whether it is incoming or outgoing (see the sketch below):
○ if the k-mer matches the 5' end of either form of the edge, then it is an outgoing edge
○ otherwise it is an incoming edge
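A small sketch of this orientation rule, applied to the example edges above; the helper names are ours.

```python
# Sketch of the edge-orientation rule: relative to a vertex k-mer stored in
# canonical form, an edge ((k+1)-mer) is outgoing if the vertex matches the
# length-k prefix (5' end) of either form of the edge, incoming otherwise.
COMP = str.maketrans('ACGT', 'TGCA')

def revcomp(s):
    return s.translate(COMP)[::-1]

def edge_direction(vertex, edge):
    """vertex: k-mer in canonical form; edge: (k+1)-mer touching this vertex."""
    k = len(vertex)
    for form in (edge, revcomp(edge)):
        if form[:k] == vertex:          # vertex sits at the 5' end of this form
            return 'outgoing'
    return 'incoming'

print(edge_direction('ACG', 'ACGC'))    # outgoing: ACG is the prefix of ACGC
print(edge_direction('CGC', 'ACGC'))    # incoming: CGC is the suffix of ACGC
```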
Error removal
• A common approach is to remove k-mers whose counts are at most some threshold t >= 1.
• We found that even for t = 1 (i.e. removing only unique k-mers) we lose too much information, because removing unique k-mers compromises the results for ultra-low abundance transcripts.
• We found that "tip removal" and "bubble removal" give much better results [Zerbino et al, Velvet, 2008].
• These methods use the structure of the de Bruijn graph instead of coverage information to remove k-mers affected by sequencing errors.
Tip and bubble error removal
• When a read contains a sequencing error:
o the first few k-mers may be correct, until they start to overlap the position where the error occurred
o this creates a branch going out of the "correct" path
o this new branch will either end in a leaf (creating a "tip"), or, if the read is long enough, the k-mers will stop overlapping the error and the branch will merge back into the path (creating a "bubble")
• A "tip" is a chain of nodes that is disconnected on one end:
o we expect the majority of tips to have a maximum length of 2k
o removing tips is straightforward; we remove all tips whose length is up to some threshold (see the sketch below)
o removing a tip does not disrupt the connectivity of the graph
• Implementing "bubble" removal is still ongoing work.
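A simplified sketch of tip removal is shown below; it assumes the graph is given as an undirected adjacency map of k-mer nodes and is an illustration of the idea, not Velvet's implementation.

```python
# Simplified sketch of tip removal: a tip is a chain hanging off a branching
# node at one end only; it is deleted when its length is at most max_len
# (e.g. 2k). adj: dict node -> set of neighbouring nodes (kept symmetric).
def remove_tips(adj, max_len):
    changed = True
    while changed:
        changed = False
        for v in [u for u, nbrs in adj.items() if len(nbrs) == 1]:
            if v not in adj or len(adj[v]) != 1:
                continue                       # already altered in this pass
            chain, prev, cur = [v], v, next(iter(adj[v]))
            # Walk along the chain of degree-2 nodes.
            while len(adj[cur]) == 2 and len(chain) < max_len:
                chain.append(cur)
                prev, cur = cur, next(n for n in adj[cur] if n != prev)
            # Remove the chain only if it ends at a branching node.
            if len(adj[cur]) > 2 and len(chain) <= max_len:
                for node in chain:             # detach and delete the tip
                    for n in adj.pop(node):
                        if n in adj:
                            adj[n].discard(node)
                changed = True
    return adj
```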
Partitioning the de Bruijn graph
• From the de Bruijn graph we want to extract putative pseudo-exons.
• These putative pseudo-exons, if we ignore self-edges, correspond to paths or chordless cycles in the de Bruijn graph.
• We use the structure of the graph and the vectors with the counts to do the partitioning.
Partitioning the de Bruijn graph
We have the following cases (a sketch of the branch-pruning rule follows below):
• If a vertex has indegree and outdegree at most 1 (it is on a path), we do nothing.
• If a vertex has outdegree (indegree) greater than 1:
o we remove all outgoing (incoming) edges from that vertex
o however, we keep the most abundant edge if the ratio between its abundance and the sum of the abundances of the other edges is higher than a threshold 0 < e < 1
• The value of e should be close to 1 (e.g. 0.97).
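The sketch below applies the branch-pruning rule at a single branching vertex, implementing the ratio test literally as stated above; the function name and the example abundances are ours.

```python
# Sketch of the branch-pruning rule at one vertex: drop all outgoing (or
# incoming) edges unless the most abundant edge dominates the others, i.e.
# abundance(best) / sum(abundance of the others) > e.
def prune_branch(edge_abundances, e=0.97):
    """edge_abundances: dict edge -> total abundance. Returns the edges kept."""
    if len(edge_abundances) <= 1:
        return set(edge_abundances)            # not a branching vertex
    best = max(edge_abundances, key=edge_abundances.get)
    rest = sum(a for edge, a in edge_abundances.items() if edge != best)
    if rest == 0 or edge_abundances[best] / rest > e:
        return {best}                          # dominant edge survives
    return set()                               # remove all edges on this side

print(prune_branch({'ACGC': 980, 'ACGA': 7, 'ACGT': 3}))      # {'ACGC'}
print(prune_branch({'ACGC': 300, 'ACGA': 350, 'ACGT': 200}))  # set()
```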
Our approach
We propose a novel unsupervised, abundance-based algorithm for binning reads from multiple samples.
Outline
• Motivation
• Previous approaches
• Algorithms
• Results
• Ongoing work
Test data - error free
• GNF Atlas [Su et al, 2004] is a dataset which contains information about the expression levels of a set of genes in several human tissues.
• From this dataset we extracted the expression levels of 19,371 genes in 10 human tissues.
• We used only one isoform per gene.
• We simulated 30 million error-free RNA-Seq paired reads of length 50 from this dataset using a tool called Grinder (2012).
Test data - with sequencing errors
• Grinder was very useful for simulating the error-free data; however, when we wanted to introduce errors, its long running time became an issue.
• Instead, we simulated sequencing errors starting from the error-free data.
• We simulated only one type of error, substitutions, because these are the most common type found in Illumina datasets.
• We introduced substitutions into the error-free reads with a probability of 0.1%, 0.5% and 1% per base (a sketch is shown below).
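A minimal sketch of how such per-base substitutions can be injected into error-free reads; this is our own illustration, not necessarily the exact script used here.

```python
# Inject substitution errors into an error-free read with probability p per base.
import random

def add_substitutions(read, p, rng=random):
    bases = 'ACGT'
    out = []
    for b in read:
        if rng.random() < p:                   # substitute with probability p
            out.append(rng.choice([x for x in bases if x != b]))
        else:
            out.append(b)
    return ''.join(out)

noisy = add_substitutions('ACGTACGTACGT', p=0.01)   # 1% error rate per base
```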
K-mer counts in the simulated data
Number of k-mers in the transcripts: 49,572,543 30-mers and 49,611,691 31-mers.

error | k  | #correct k-mers | #incorrect k-mers | #missing k-mers | % of incorrect k-mers
0%    | 30 | 49,512,839      |           0       |  59,704         |  0%
0%    | 31 | 49,546,279      |           0       |  65,412         |  0%
0.1%  | 30 | 49,508,364      | 109,380,690       |  64,179         | 68.84%
0.1%  | 31 | 49,540,750      | 108,528,925       |  70,941         | 68.66%
0.5%  | 30 | 49,483,820      | 404,415,622       |  88,723         | 89.1%
0.5%  | 31 | 49,510,299      | 402,714,823       | 101,392         | 89.05%
1%    | 30 | 49,433,000      | 708,460,569       | 139,543         | 93.48%
1%    | 31 | 49,446,451      | 706,531,643       | 165,240         | 93.46%
Efficiency of different error removal techniques
Error removal method                                                  | error | #correct 31-mers | #incorrect 31-mers | #missing 31-mers
none                                                                  | 0%    | 49,546,279       |          0         |    65,412
non-unique in at least 1 sample                                       | 0%    | 46,520,486       |          0         | 3,091,205
non-unique in at least 1 sample                                       | 0.1%  | 46,257,210       |  6,906,069         | 3,354,481
remove tips <= 21 over the union of the samples                       | 0.1%  | 49,502,470       | 21,772,808         |   109,221
remove tips <= 60 over the union of the samples                       | 0.1%  | 49,236,255       | 12,120,069         |   375,436
remove tips <= 21 sample-by-sample and over the union of the samples  | 0.1%  | 49,127,456       | 10,998,956         |   484,235
remove tips <= 60 sample-by-sample and over the union of the samples  | 0.1%  | 48,707,567       |  5,304,983         |   904,124
Results for the graph partitioning
error | error removal technique                                              | edge removal threshold | #pseudo-exons >= 50bp | #pseudo-exons >= 50bp without wrong 31-mers | #30-mers   | % of transcriptome covered by col 5
0%    | none                                                                 | 1    |  60,285 | 60,285 | 49,299,665 | 99.5%
0%    | none                                                                 | 1    |  65,051 | 65,051 | 49,239,335 | 99.3%
0.1%  | remove tips <= 60 over the union of the samples                      | 0.97 | 416,853 | 60,335 | 37,815,256 | 76.3%
0.1%  | remove tips <= 60 sample-by-sample and over the union of the samples | 0.97 | 228,841 | 77,010 | 46,699,963 | 94.2%
The first row uses all k-mers; its counts are computed from the transcripts.
Outline
• Motivation
• Previous approaches
• Algorithms
• Results
• Ongoing work
Ongoing work
• We believe that bubble removal will help us get rid of most of the erroneous k-mers which are still present in the data after tip removal.
• We want to incorporate an error correction algorithm, SEECER (2013), to correct the erroneous reads before starting the k-mer counting.
• Currently our algorithms do not take strand-specific RNA-Seq data into account. Optimizing the algorithms to take advantage of this information, when available, represents another opportunity to improve the results of this approach.
Q&A