I519 Introduction to Bioinformatics, 2012
Genome Comparison
Whole genome comparison/alignment




Build better phylogenies
Identify polymorphism
Detect gene-level events
Compare different assemblies of a single
genome
Whole genome comparison
 Aligning whole genomes is a fundamentally
different problem than aligning short
sequences.
 Need to consider the presence of large-scale
evolutionary events
– Gene duplication & loss
– Horizontal gene transfer
– Repetitive sequences (repeats)
– Gene rearrangement and inversion
 Pairwise and multiple genome comparison
– Multiple genome alignment provides a basis for research into
comparative genomics and the study of evolutionary dynamics.
Genome evolution
[Figure: events acting on Genome A: point substitution, translocation, inversion, inversion and translocation, insertion, repeat (duplication)]
Basic algorithms: use anchoring as a
heuristic to speed alignment
 Assumption: highly similar subsequences can be
found quickly and are likely to be part of the correct
global alignment.
 These local alignments are used to anchor a global
alignment (alignment anchor), reducing the number
of possible global alignments considered during a
subsequent O(n^2) dynamic programming step.
 Select a single collinear set of alignment anchors
 Many tools have been developed
Rearrangement free or not
 Free of rearrangement
– Assume the input sequences are free from significant
rearrangements of sequence elements, selecting a single
collinear set of alignment anchors
– Pairwise: MUMmer, GLASS, AVID, and WABA align pairs of
long sequences
– Multiple alignment: MAVID, MLAGAN, and MGA
 Consider rearrangement
– Shuffle-LAGAN (2003, first genome comparison method
described that explicitly deals with genome rearrangements)
– MultiPipMaker (2003)
– Mauve (2004, multiple)
– Enredo and Pecan (2008)
– GR-Aligner (2009, pairwise)
MUMmer method
 MUMmer combines suffix trees, the longest increasing
subsequence (LIS) and Smith-Waterman (SW) alignment
 Maximal Unique Match (MUM) Identification - Identify
the longest strings in Genome 1 that have one
identical match in Genome 2
– Naïve method: O(N^2)
– Using suffix tree: O(N)
 Ordered MUM Selection - Identify the longest set of
MUMs such that they occur in order in each of the
genomes (using a variation of the well-known
algorithm to find the LIS of a sequence of integers)
 Processing Non-matched Regions - Classify non-matched
regions as either insertions, SNPs or highly
polymorphic regions
Suffix tree
 A suffix tree is a data structure that allows one to
find, extremely efficiently, all distinct
subsequences in a given sequence.
 Linear-time algorithms for constructing
suffix trees were given by Weiner (1973) and
McCreight (1976)
 For the task of comparing two DNA sequences,
suffix trees allow one to quickly find all
subsequences shared by the two inputs.
 The genome alignment is then built upon this
information.
Suffix tree for finding MUMs
Suffix Tree for sequence “gaaccgacct”
An internal node is a repeated
sequence in the original string
Leaf is a unique suffix
Every unique matching sequence is
represented by an internal node with
exactly two child nodes, such that the
child nodes are leaf nodes from
different genomes
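The definition of a MUM can be checked with a brute-force sketch. This is not the suffix-tree method and is slower than the bounds quoted for MUMmer; it is only meant to make the definition concrete on short strings. The function names `count_occurrences` and `naive_mums` are illustrative, not part of any tool.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, match) for every maximal unique match.

    A match is kept only if it cannot be extended to the left or right
    and its string occurs exactly once in each input (0-based positions).
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, match))
    return mums


print(naive_mums("ATCGTA", "ATCGAT"))
```

For the two short strings ATCGTA and ATCGAT, the only maximal unique match is ATCG at the start of both strings; AT is shared as well but occurs twice in the second string.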
A toy example
Suffixes of ATCGTA# (start positions 1-7) and ATCGAT$ (start positions 8-14):

Position  Suffix
7         #
6         A#
5         TA#
4         GTA#
3         CGTA#
2         TCGTA#
1         ATCGTA#
14        $
13        T$
12        AT$
11        GAT$
10        CGAT$
9         TCGAT$
8         ATCGAT$

The same suffixes in sorted order:

Position  Suffix
7         #
14        $
6         A#
12        AT$
8         ATCGAT$
1         ATCGTA#
10        CGAT$
3         CGTA#
11        GAT$
4         GTA#
13        T$
5         TA#
9         TCGAT$
2         TCGTA#

[Figure: suffix tree built over both strings, with leaves labeled by the suffix start positions 1-14]
Suffix tree & suffix array for string
matching
 Preprocess text T, not pattern P
– O(m) preprocess time (m: the length of the text)
– O(n+k) search time (n: the length of the pattern)
• k is number of occurrences of P in T
 Match pattern P against tree starting at root until
– Case 1: P is completely matched
• Every leaf below this match point is the starting location
of P in T
– Case 2: No match is possible
• P does not occur in T
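The two cases above can be sketched with an uncompressed suffix trie. This is a simplification: a real suffix tree compresses single-child paths and is built in linear time, while this version takes quadratic time and space. The names `build_suffix_trie` and `find_all` are illustrative.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie as nested dicts.

    Child edges are keyed by single characters; the multi-character key
    "leaf" stores the 1-based start position of the suffix ending there.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_all(trie, pattern):
    """Return the sorted 1-based start positions of pattern in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []          # Case 2: no match is possible
        node = node[ch]
    # Case 1: pattern fully matched; every leaf below is an occurrence.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("xabxac")
print(find_all(trie, "xa"))  # occurrences of P1
print(find_all(trie, "xb"))  # P2 does not occur
```

With T = xabxac, the pattern xa is found at positions 1 and 4, and xb is not found.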
A toy example of string (pattern) matching
 T = xabxac
– suffixes ={xabxac, abxac, bxac, xac, ac, c}
 Pattern P1: xa
 Pattern P2: xb
[Figure: suffix tree for T = xabxac; the leaves are labeled 1-6 with the start positions of the suffixes]
Suffix array
Suffix array: a sorted list of the suffixes of a
given string; the start positions are sorted in
lexicographical (alphabetical) order
Straightforward implementation: O(m^2 log m),
reduced to O(m log m) (utilizing partial sorts)
m: the length of the text
Suffix array enables binary search for any
substring, e.g. CAD
O(n log m), reduced to O(n + log m) if LCP
(longest common prefix) information is used
n: the length of the pattern
Suffix array is more compact than a suffix
tree
Suffix array for ABRACADABRA# (0-based start positions; source: webglimpse.net/pubs/suffix.pdf)

Position  Suffix
11        #
10        A#
7         ABRA#
0         ABRACADABRA#
3         ACADABRA#
5         ADABRA#
8         BRA#
1         BRACADABRA#
4         CADABRA#
6         DABRA#
9         RA#
2         RACADABRA#
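A minimal sketch of the straightforward construction and the binary search, using the ABRACADABRA# example. It sorts whole suffixes directly, so it corresponds to the simple O(m^2 log m) approach rather than the faster variants, and it does not use LCP information. The function names are illustrative.

```python
def build_suffix_array(text):
    """Sort suffix start positions (0-based) by the suffixes they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the suffix array for all occurrences of pattern."""
    n = len(pattern)

    # Lower bound: first suffix whose n-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + n] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "ABRACADABRA#"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "CAD"))
print(find_occurrences(text, sa, "ABRA"))
```

Python compares strings by character code, so `#` sorts before the letters, which matches the ordering used in the example.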
Ordered MUM selection
G1: 1 2 3 4 ...
G2: A B C D ...
MUMs: <1,A>, <2,C>, <3,B>, <4,D>
Possible selections: <1,A>, <2,C>, <4,D> or <1,A>, <3,B>, <4,D>
Then process non-matched regions (by dynamic programming algorithm)
See more at www.cs.rice.edu/~nakhleh/COMP571/GenomeAlignment.ppt
LIS algorithm
The B positions are given by the sequence 1, 3, 2, 4, 6, 7, 5
One LIS (longest increasing subsequence) is: 1, 2, 4, 6, 7
The LIS problem can be solved by a dynamic programming algorithm
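A minimal sketch of the dynamic programming solution, assuming the simple O(n^2) recurrence (faster O(n log n) variants exist and are not shown). The function name is illustrative. When several subsequences share the maximum length, this version returns one of them, which may differ from the one listed above.

```python
def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq."""
    n = len(seq)
    if n == 0:
        return []
    length = [1] * n   # length[i]: LIS length ending at index i
    prev = [-1] * n    # prev[i]: previous index in that LIS, or -1
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i] and length[j] + 1 > length[i]:
                length[i] = length[j] + 1
                prev[i] = j
    # Trace back from the index where the longest subsequence ends.
    end = max(range(n), key=lambda i: length[i])
    result = []
    while end != -1:
        result.append(seq[end])
        end = prev[end]
    return result[::-1]


print(longest_increasing_subsequence([1, 3, 2, 4, 6, 7, 5]))
```

For the sequence 1, 3, 2, 4, 6, 7, 5 this returns 1, 3, 4, 6, 7, which has the same length (5) as 1, 2, 4, 6, 7. In ordered MUM selection, the input would be the genome 2 ranks of the MUMs listed in genome 1 order.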
Mauve
 Mauve is a system for efficiently constructing
multiple genome alignments in the presence of
large-scale evolutionary events
 Identifies conserved genomic regions,
rearrangements and inversions in conserved
regions, and the exact sequence breakpoints of
such rearrangements across multiple genomes.
 Also performs traditional multiple alignment of
conserved regions to identify nucleotide
substitutions and indels, using the progressive
dynamic programming approach of CLUSTALW
Mauve's anchor selection algorithm
 Relax anchor selection method: do not assume
that the genomes under study are collinear
 Identifies and aligns regions of local collinearity
called locally collinear blocks (LCBs)
– Each LCB is a homologous region of sequence
shared by two or more of the genomes under
study
– Does not contain any rearrangements of
homologous sequence (within LCB)
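To make the idea of local collinearity concrete, here is a simplified sketch for two genomes only. It assumes the anchors are listed in genome 1 order and each is labeled with its signed rank in genome 2 (negative for an inverted match). This encoding and the name `collinear_blocks` are assumptions for illustration; it is not Mauve's implementation, which handles multiple genomes and anchor weights.

```python
def collinear_blocks(signed_order):
    """Split anchors into maximal runs that are collinear in both genomes.

    Consecutive anchors stay in the same block when the signed rank
    increases by exactly 1 (e.g. 2 -> 3, or -8 -> -7 inside an inversion).
    """
    if not signed_order:
        return []
    blocks = []
    current = [signed_order[0]]
    for prev, nxt in zip(signed_order, signed_order[1:]):
        if nxt - prev == 1:
            current.append(nxt)
        else:
            blocks.append(current)
            current = [nxt]
    blocks.append(current)
    return blocks


print(collinear_blocks([1, 2, 3, -8, -7, -6, -5, -4, 9, 10]))
```

In this toy input, one inverted segment splits the anchors into three blocks, each free of internal rearrangement.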
Mauve algorithm
1. Find local alignments (multi-MUMs), using seed-and-extend
hashing method (time complexity O(G^2 n + G n log(G n)), where G is the
number of genomes and n the average genome length)
2. Use the multi-MUMs to calculate a phylogenetic guide tree.
3. Select a subset of the multi-MUMs to use as anchors—these
anchors are partitioned into collinear groups called LCBs,
using a greedy breakpoint elimination algorithm
4. Perform recursive anchoring to identify additional alignment
anchors within and outside each LCB.
5. Perform a progressive alignment of each LCB using the guide
tree.
Greedy breakpoint
elimination in three
genomes
Darling A C et al. Genome Res. 2004;14:1394-1403
©2004 by Cold Spring Harbor Laboratory Press
An example of LCB identified among nine
enterobacterial genomes
Darling A C et al. Genome Res. 2004;14:1394-1403
LCBs identified among concatenated
chromosomes of the mouse, rat, and human
genomes
Darling A C et al. Genome Res. 2004;14:1394-1403
Turnip vs cabbage: almost identical
mtDNA gene sequences
 In the 1980s Jeffrey Palmer studied
evolution of plant organelles by
comparing mitochondrial genomes
of the cabbage and turnip (using
physical mapping)
 99%-99.9% similarity between
genes
 These nearly identical gene
sequences differed in gene order
 This study helped pave the way to
analyzing genome rearrangements
in molecular evolution
Why we care about genome
rearrangement
 Evolutionary and functional analysis
 Examples:
– “Dynamics of Genome Rearrangement in Bacterial
Populations”, using comparison of eight Yersinia
(pathogenic bacteria) genomes. PLoS Genet 4(7):
e1000128, 2008
– Genome-wide DNA excision (Oxytricha trifallax destroys
95% of its germline genome during development, including
the elimination of all transposon DNA, through an
exaggerated process of genome rearrangement). Science,
Vol. 324. no. 5929, pp. 935 – 938, 2009
“Transforming” cabbage into turnip
Reversals and breakpoints
Before the reversal: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
After reversing the segment 4..8: 1, 2, 3, -8, -7, -6, -5, -4, 9, 10
The reversal introduced two breakpoints (disruptions in order).
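The reversal and the breakpoint count can be reproduced with a short sketch on signed permutations. It assumes the usual convention of framing the permutation with 0 and n+1 and counting every adjacent pair that does not increase by exactly 1. The function names are illustrative.

```python
def apply_reversal(perm, i, j):
    """Reverse perm[i..j] (0-based, inclusive) and flip each sign."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


def count_breakpoints(perm):
    """Count adjacent pairs (after framing with 0 and n+1) that break order."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


identity = list(range(1, 11))
after = apply_reversal(identity, 3, 7)   # reverse the segment 4..8
print(after)
print(count_breakpoints(identity), count_breakpoints(after))
```

Reversing the segment 4..8 of the identity permutation gives 1, 2, 3, -8, -7, -6, -5, -4, 9, 10, with breakpoints between 3 and -8 and between -4 and 9.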
Genome rearrangements
[Figure: Mouse (X chrom.) and Human (X chrom.), derived from an unknown ancestor ~75 million years ago]
 What are the similarity blocks and how to find them?
 What is the architecture of the ancestral genome?
 What is the evolutionary scenario for transforming one
genome into the other?
Comparative genomic architectures:
mouse vs human genome
 Humans and mice
have similar genomes,
but their genes are
ordered differently
 ~245 rearrangements
– Reversals
– Fusions
– Fissions
– Translocation
History of Chromosome X
Rat Consortium, Nature, 2004
GRIMM
 Real genome architectures are represented by
signed permutations
 Efficient algorithms to sort signed permutations
have been developed
 GRIMM web server computes the reversal
distances between signed permutations:
http://nbcr.sdsc.edu/GRIMM/mgr.cgi
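To illustrate what a reversal distance is, here is a brute-force sketch that searches breadth-first over all signed reversals. It is not the efficient algorithm used by GRIMM; it is only practical for very short permutations or for permutations close to the identity, because the number of reachable arrangements grows very quickly. The name `reversal_distance` is illustrative.

```python
from collections import deque


def reversal_distance(perm):
    """Minimum number of signed reversals turning perm into 1, 2, ..., n."""
    start = tuple(perm)
    target = tuple(range(1, len(perm) + 1))
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == target:
            return dist
        # Try every reversal of a contiguous segment state[i..j].
        for i in range(len(state)):
            for j in range(i, len(state)):
                flipped = tuple(-x for x in reversed(state[i:j + 1]))
                nxt = state[:i] + flipped + state[j + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None


print(reversal_distance([1, 2, 3, -8, -7, -6, -5, -4, 9, 10]))
```

For the permutation 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 the distance is 1, since a single reversal of the middle segment restores the identity.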