Graph Algorithms

advertisement
Graph Algorithms 8.6-8.10
CS 6030 – Bioinformatics
Summer II 2012
Jason Eric Johnson
Sequencing by Hybridization
• DNA Array gives all strings of length l
• How do we find the order?
• Spectrum(s,l)
– String s of length n
– Spectrum is multiset of n-l+1 l-mers in s
Sequencing by Hybridization
• s = TATGGTGC
• l=3
• Spectrum(s,l) = {TAT,ATG,TGG,GGT,GTG,TGC}
• Problem:
• Input: Set S of all l-mers from s
• Output: String s s.t. Spectrum(s,l) = S
Hybridization on DNA Array
Sequencing by Hybridization
• Special case of Shortest Superstring Problem
• SBH is linear-time
• SSP (NP-Complete) is more general
– In SSP, no guaranteed overlap
– In SBH, we know the length of the target
sequence
Sequencing by Hybridization
• There is a problem with DNA Arrays
• No good way to distinguish a match from a
highly stable mismatch
– Mismatch could give strong hybridization signal
– Need longer probes to deal with mutations
SBH: Hamiltonian Path Approach
• Two l-mers overlap if overlap(p,q) = l – 1
– Last l-1 letters of p are same as first l-1 of q
• Make each l-mer in Spectrum(s,l) a node
• Construct directed graph(s) that connect every
p and q with a directed edge
• 1 to 1 correspondence between paths that
visit each vertex exactly once (Hamiltonian
Paths) and DNA fragments with Spectrum(s,l)
SBH: Hamiltonian Path Approach
S = { ATG AGG TGC TCC GTC GGT GCA CAG }
H
ATG
AGG
TGC
TCC
GTC
ATG C A G G T C C
Path visited every VERTEX once
GGT
GCA
CAG
SBH: Hamiltonian Path Approach
A more complicated graph:
S = { ATG TGG
H
TGC
GTG
GGC
GCA
GCG
CGT }
SBH: Hamiltonian Path Approach
S = { ATG TGG TGC GTG GGC GCA GCG CGT }
Path 1:
H
ATGCGTGGCA
Path 2:
H
ATGGCGTGCA
SBH: Hamiltonian Path Approach
• Problem is that there is no efficient algorithm
• As overlap graph gets larger, this is not a
useful technique since the Hamiltonian Path
problem is NP-Complete
SBH: Eulerian Path Approach
• This leads to simple linear-time algorithm for
sequence reconstruction
• Construct graph whose edges correspond to lmers
• Find path(s) that visit each edge exactly once
SBH: Eulerian Path Approach
S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT }
Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT, CA, CG }
Edges correspond to l – mers from S
GT
AT
CG
TG
GC
GG
CA
Path visited every EDGE once
SBH: Eulerian Path Approach
S = { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths:
GT
AT
TG
CG
GC
GG
ATGGCGTGCA
GT
CA
AT
CG
TG
GC
GG
ATGCGTGGCA
CA
SBH: Eulerian Path Approach
• If for every vertex the number of incoming
edges is equal to the number of outgoing
edges, the graph is balanced
• Theorem: A connected graph is Eulerian if
and only if each of its vertices is balanced
• Theorem: A connected graph has an Eulerian
path if and only if it contains at most two
semi-balanced vertices and all other vertices
are balanced
Some Difficulties with SBH
• Fidelity of Hybridization: difficult to detect differences
between probes hybridized with perfect matches and 1
or 2 mismatches
• Array Size: Effect of low fidelity can be decreased with
longer l-mers, but array size increases exponentially in l.
Array size is limited with current technology.
• Practicality: SBH is still impractical. As DNA microarray
technology improves, SBH may become practical in the
future
• Practicality again: Although SBH is still impractical, it
spearheaded expression analysis and SNP analysis
techniques
Fragment Assembly
• Now that we have our reads sequenced, we
need to assemble them into the entire DNA
sequence
Fragment Assembly
• We have some problems:
– Errors in reads (1% to 3%)
– Which strand did the read come from?
• Did the read come from the target DNA sequence or its
Watson-Crick complement?
– Repeats in DNA (this is the major problem)
• See page 278 for puzzle example
Fragment Assembly
• Very difficult to put it all together if repeats
are longer than read length
• Could solve this by increasing read length, but
the technology isn’t there yet
Fragment Assembly
• One approach is to break the sequence into
about 30,000 Bacterial Artificial Chromosomes
– Sequence each BAC individually
– Put them all together
– Used and shown effective (if cumbersome) by the
Human Genome Project
Fragment Assembly
• Another option (used in mouse genome
assembly) is the Weber-Meyers approach
– Pairs reads that are separated by a fixed-size gap
– Gap size L is chosen to be longer than most
repeats
– Unlikely both reads lie in large repeat
– Read that is in unique portion of DNA tells us
which copy of a repeat the mate is in
Fragment Assembly
• Most algorithms consist of these steps:
• Overlap
– Find potentially overlapping reads
• Layout:
– Find order of reads along DNA
• Consensus:
– Derive DNA sequence from layout
Overlap
• Find the best match between the suffix of one
read and the prefix of another
• Due to sequencing errors, need to use
dynamic programming to find the optimal
overlap alignment
• Apply a filtration method to filter out pairs of
fragments that do not share a significantly
long common substring
Overlapping Reads
•
Sort all k-mers in reads
•
Find pairs of reads sharing a k-mer
•
(k ~ 24)
Extend to full alignment – throw away if not
>95% similar
TACA TAGATTACACAGATTAC T GA
|| ||||||||||||||||| | ||
TAGT TAGATTACACAGATTAC TAGA
Overlapping Reads and Repeats
• A k-mer that appears N times, initiates N2
comparisons
• For an Alu that appears 106 times  1012
comparisons – too much
• Solution:
Discard all k-mers that appear more than
t  Coverage, (t ~ 10)
Finding Overlapping Reads
Create local multiple alignments from the
overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Layout
• Repeats are a major challenge
• Do two aligned fragments really overlap, or
are they from two copies of a repeat?
• Solution: repeat masking – hide the repeats!!!
• Masking results in high rate of misassembly
(up to 20%)
• Misassembly means alot more work at the
finishing step
Consensus
• A consensus sequence is derived from a
profile of the assembled fragments
• A sufficient number of reads is required to
ensure a statistically significant consensus
• Reading errors are corrected
Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read
alignments
Derive each consensus base by weighted
voting
Protein Sequencing and Identification
• Protein can be digested into peptides by
proteases (such as trypsin)
• Can then sequence the fragments individually
and re-assemble
• Mass spectrometry allows us to find proteins
involved in cell death, for example
Protein Sequencing and Identification
• Tandem mass spectrometer breaks peptides
into smaller fragments
• These fragments have electrical charge
• Fragments are spun around in an magnetic
field until they hit a detector
• Larger masses are harder to spin than smaller
ones, so mass can be determined by the
amount of energy required to fling fragments
around
Protein Sequencing and Identification
• The problem we encounter is how to
reconstruct the amino acid sequence of the
peptide from the masses of the broken pieces
References
• Generated from:
• An Introduction to Bioinformatics Algorithms, Neil C. Jones, Pavel A.
Pevzner, A Bradford Book, The MIT Press, Cambridge, Mass., London,
England, 2004
• Slides 4, 8-10, 13, 14, 16, 23-29 from
http://bix.ucsd.edu/bioalgorithms/slides.php#Ch8
Download