A Simplified View of DCJ-Indel Distance
Phillip Compeau
University of California-San Diego
Department of Mathematics
Abstract
• Braga et al., 2010: Solved problem of DCJ-indel sorting in
linear time.
• Goals:
1. “Hardwire” DCJ sorting into DCJ-indel sorting.
2. Characterize solution space for DCJ-indel sorting.
• DCJ solution space known (Braga and Stoye, 2010).
Section 1: Preliminaries
1. Preliminaries
2. Encoding Indels as DCJs
3. DCJ-Indel Sorting
4. The Solution Space of DCJ-Indel Sorting
5. Conclusion
The Discrete Genome
• Genome (Π): formed of two matchings
• genes g(Π): each numbered gene has a head and a tail.
• adjacencies (a(Π)): a blue matching on V(g(Π))
[Figure: example genomes Π and Γ.]
The Discrete Genome
• Chromosome: component of Π (alternating path or cycle)
• Linear or circular depending on path or cycle of Π
• Telomere: path endpoint of Π; has null adjacency {v, Ø}
[Figure: example genomes Π and Γ.]
The Double-Cut-and-Join Operation
• Breakpoint graph B(Π, Γ): the union of a(Π) and a(Γ), where adjacencies of Γ are colored red. B(Π, Γ) is also a disjoint union of paths and cycles, which alternate between red and blue edges. The length of a component of B(Π, Γ) is its number of edges; we consider an isolated vertex in B(Π, Γ) to be a path of length 0.
• Double-cut-and-join operation (DCJ; Yancopoulos et al., 2005): “cuts” genome in two places and rejoins adjacencies. A DCJ on Π uses one or two adjacencies of Π via one of the following four operations to produce a new genome Π’:
1. {v, w}, {x, y} → {v, x}, {w, y}
2. {v, w}, {x, ∅} → {v, x}, {w, ∅}
3. {v, ∅}, {w, ∅} → {v, w}
4. {v, w} → {v, ∅}, {w, ∅}
• The DCJ incorporates an array of genome rearrangements.
• DCJ Distance (dDCJ(Π, Γ)): minimum # of DCJs required to transform Π into Γ (having the same genes). A closed formula was derived in [10] and translated into breakpoint graph notation in [13]:
$$d_{DCJ}(\Pi, \Gamma) = N - c(\Pi, \Gamma) - \frac{p_{even}(\Pi, \Gamma)}{2}$$
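The four DCJ operations can be sketched as a single function, assuming a genome is stored as a set of adjacencies. The names `adj`, `dcj`, and `EMPTY`, and the extremity labels such as `"1h"` and `"2t"`, are illustrative choices and not part of the talk. The sketch treats operations 2 to 4 as operation 1 with some arguments set to ∅ (`None`):

```python
from typing import FrozenSet, Optional, Set

Adjacency = FrozenSet[Optional[str]]
EMPTY: Adjacency = frozenset({None})  # stands for {∅, ∅}; never stored in a genome


def adj(u: Optional[str], v: Optional[str] = None) -> Adjacency:
    """Build the adjacency {u, v}; adj(u) is the null adjacency {u, ∅}."""
    return frozenset((u, v))


def dcj(genome: Set[Adjacency], v, w, x, y) -> Set[Adjacency]:
    """Replace {v, w}, {x, y} by {v, x}, {w, y} and return the new genome.

    Passing None for some arguments yields operations 2-4:
      op 2: dcj(g, v, w, x, None)
      op 3: dcj(g, v, None, w, None)
      op 4: dcj(g, v, w, None, None)
    """
    result = set(genome)
    for old in (adj(v, w), adj(x, y)):
        if old != EMPTY:
            if old not in result:
                raise ValueError(f"{set(old)} is not an adjacency of the genome")
            result.remove(old)
    for new in (adj(v, x), adj(w, y)):
        if new != EMPTY:
            result.add(new)
    return result


# Linear chromosome (1, 2, 3); "t" and "h" mark the tail and head of a gene.
linear = {adj("1t"), adj("1h", "2t"), adj("2h", "3t"), adj("3h")}
reversal = dcj(linear, "1h", "2t", "2h", "3t")  # op 1: reverses gene 2
fission = dcj(linear, "1h", "2t", None, None)   # op 4: splits between genes 1 and 2
fusion = dcj(fission, "1h", None, "2t", None)   # op 3: undoes the fission
```

Folding the four cases into one function is only a convenience of this sketch; the slides list them as separate operations.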
The DCJ Incorporates Many Operations
[Figure, panels (a)-(c): rearrangements realized by a single DCJ, including reversal, (affix) reversal, translocation, fusion (#3), fission (#4), excision, integration, circularization (#3), and linearization (#4).]
The Breakpoint Graph
• B(Π, Γ) is formed from the adjacencies of Π and Γ.
• B(Π, Γ) also comprises (alternating) red-blue paths and cycles.
DCJ Distance Formula
• Bergeron et al., 2006: If Π and Γ share the same genes, then the DCJ distance is given by the following formula:
$$d_{DCJ}(\Pi, \Gamma) = N - c(\Pi, \Gamma) - \frac{p_{even}(\Pi, \Gamma)}{2}$$
• N = # of genes
• c(Π, Γ) = # of cycles in B(Π, Γ)
• peven(Π, Γ) = # of even paths in B(Π, Γ)
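As a minimal sketch of how this formula can be evaluated, the following code builds B(Π, Γ) from two adjacency lists and counts cycles and even paths. The function name `dcj_distance`, the extremity labels (gene name plus `"t"` or `"h"`), and the convention that telomeres are simply extremities appearing in no adjacency are assumptions made for illustration:

```python
from typing import Dict, List, Sequence, Tuple


def dcj_distance(genes: Sequence[str],
                 adj_pi: List[Tuple[str, str]],
                 adj_gamma: List[Tuple[str, str]]) -> int:
    """Evaluate N - c - p_even / 2 on the breakpoint graph B(Pi, Gamma)."""
    vertices = [g + end for g in genes for end in ("t", "h")]
    blue: Dict[str, str] = {}   # adjacencies of Pi
    for u, v in adj_pi:
        blue[u], blue[v] = v, u
    red: Dict[str, str] = {}    # adjacencies of Gamma
    for u, v in adj_gamma:
        red[u], red[v] = v, u

    seen = set()
    cycles = even_paths = 0

    # Paths: start at a vertex missing a blue or a red edge and alternate colors.
    for start in vertices:
        if start in seen or (start in blue and start in red):
            continue
        seen.add(start)
        cur, color, length = start, (blue if start in blue else red), 0
        while cur in color:
            cur = color[cur]
            seen.add(cur)
            length += 1
            color = red if color is blue else blue
        if length % 2 == 0:     # an isolated vertex is a path of length 0
            even_paths += 1

    # Cycles: every vertex not yet visited has both a blue and a red edge.
    for start in vertices:
        if start in seen:
            continue
        cycles += 1
        cur, color = start, blue
        while cur not in seen:
            seen.add(cur)
            cur = color[cur]
            color = red if color is blue else blue

    return len(genes) - cycles - even_paths // 2


same = dcj_distance(["1", "2"], [("1h", "2t")], [("1h", "2t")])
one_reversal = dcj_distance(["1", "2"], [("1h", "2t")], [("1h", "2h")])
linearize = dcj_distance(["1", "2"], [("1h", "2t"), ("2h", "1t")], [("1h", "2t")])
```

The three calls compare the linear chromosome (1, 2) with itself, with (1, −2), and with its circularized form.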
Indels and the DCJ-Indel Distance
• Indel: The insertion or deletion of a chromosome or
chromosomal interval (consecutive genes).
• Assumption: we can’t remove a gene common to Π and Γ
[Figure: examples of insertions and deletions of chromosomes and chromosomal intervals.]
• DCJ-Indel Distance (dindDCJ(Π, Γ)): Minimum # of DCJs and
indels required to transform Π into Γ.
• Braga et al., 2010: Solve DCJ-indel sorting in linear time.
• Lots of cases…can we simplify it?
Section 2: Encoding Indels as DCJs
1. Preliminaries
2. Encoding Indels as DCJs
3. DCJ-Indel Sorting
4. The Solution Space of DCJ-Indel Sorting
5. Conclusion
Deletion → DCJ Creating Circular Chromosome
• Ma et al., 2009: View deletion as formation and removal of a
circular chromosome.
[Figure: deletions realized by DCJs that create a circular chromosome, which is then removed.]
• Idea: Indel = DCJ creating circular chromosome
• Wait…what about the deletion of circular chromosomes?
Apparent Exceptions
• Apparent Exception #1: Two deleted circular chromosomes
are created from a single DCJ.
[Figure: a circular chromosome (a, b, c, d) split by a DCJ into circular chromosomes (a, d) and (b, c), labeled “3 Operations”, vs. the alternative labeled “1 Operation”.]
Apparent Exceptions
• Apparent Exception #2: A deleted circular chromosome is
never involved in a DCJ
• Circular singleton of Π: A circular chromosome of Π that
shares no genes with Γ.
• Question: Can we delete all circular singletons first? YES!
Handling Circular Singletons
• Proposition: When transforming Π into Γ via a minimum
collection of DCJs and indels, no gene belonging to a circular
singleton of Π can ever appear in the same chromosome as a
gene of Γ.
• Corollary 1: If Π* is formed from Π by removing a circular
singleton from Π, then dindDCJ(Π*, Γ) = dindDCJ(Π, Γ) – 1.
• Let sing(Π, Γ) = # of circular singletons of Π and Γ.
• Corollary 2: If Π0 and Γ0 are formed by removing all circular
singletons from Π and Γ, then
dindDCJ(Π, Γ) = dindDCJ(Π0 , Γ0) + sing(Π, Γ)
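Corollary 2 suggests a simple preprocessing step, sketched below under the assumption that a genome is a list of chromosomes, each stored as a tuple `(genes, is_circular)` with gene orientation ignored. The names `genes_of` and `split_circular_singletons` are hypothetical:

```python
from typing import List, Set, Tuple

Chromosome = Tuple[Tuple[str, ...], bool]  # (genes, is_circular)


def genes_of(genome: List[Chromosome]) -> Set[str]:
    """Collect the gene names appearing anywhere in a genome."""
    return {g for genes, _ in genome for g in genes}


def split_circular_singletons(genome: List[Chromosome],
                              other: List[Chromosome]):
    """Separate the circular chromosomes that share no gene with `other`."""
    other_genes = genes_of(other)
    kept, singletons = [], []
    for chrom in genome:
        genes, is_circular = chrom
        if is_circular and other_genes.isdisjoint(genes):
            singletons.append(chrom)
        else:
            kept.append(chrom)
    return kept, singletons


pi = [(("a", "b"), False), (("x", "y"), True)]
gamma = [(("a", "b"), False), (("z",), True)]
pi0, sing_pi = split_circular_singletons(pi, gamma)
gamma0, sing_gamma = split_circular_singletons(gamma, pi)
sing = len(sing_pi) + len(sing_gamma)  # the term sing(Π, Γ) in Corollary 2
```

In this toy example Π0 and Γ0 coincide, so Corollary 2 gives a DCJ-indel distance of 0 + 2 = 2: one deletion of (x, y) and one insertion of (z).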
A Novel View of DCJ-Indel Distance
• WLOG, we may henceforth assume sing(Π, Γ) = 0.
• A completion of Π is a genome Π’ such that:
• g(Π’) = g(Π) ∪ g(Γ)
• a(Π’) = a(Π) ∪ perfect matching on V(Π’) – V(Π)
• The adjacencies of a(Π’) – a(Π) are called new.
• New chromosomes of Π’ are circular: the indels of Π’
• A completion of a pair of genomes (Π, Γ) is a pair (Π’, Γ’) in which Π’ and Γ’ are completions of Π and Γ, respectively.
• Theorem: The DCJ-indel distance is given by the following equation:
$$d^{ind}_{DCJ}(\Pi, \Gamma) = \min_{(\Pi', \Gamma')} d_{DCJ}(\Pi', \Gamma') = N - \max_{(\Pi', \Gamma')} \left\{ c(\Pi', \Gamma') + \frac{p_{even}(\Pi', \Gamma')}{2} \right\}$$
where the minimum and maximum are taken over all completions (Π’, Γ’) of (Π, Γ).
• An optimal completion (Π*, Γ*) achieves the optimum above.
Section 3: DCJ-Indel Sorting
1. Preliminaries
2. Encoding Indels as DCJs
3. DCJ-Indel Sorting
4. The Solution Space of DCJ-Indel Sorting
5. Conclusion
Open Vertices
• π-open vertex: vertex not found in Π (must be matched in Π’)
• path endpoint in B(Π, Γ) must be π-open/γ-open or
telomere (or both)
• Classify the paths of B(Π, Γ) by their open endpoints: {π, π}-paths, {π, γ}-paths, {γ, γ}-paths, π-paths, and γ-paths.
• Idea: Construct B(Π*, Γ*) from B(Π, Γ) by matching vertices.
Necessary Conditions for B(Π*, Γ*)
• Lemma 1: If (Π*, Γ*) is an optimal completion of (Π, Γ), then
every {π, π}-path ({γ, γ}-path) of length 2k – 1 in B(Π, Γ)
embeds into a cycle of length 2k in B(Π*, Γ*).
• Picture: [Figure: B(Π’, Γ’) vs. B(Π’’, Γ’), where closing the {π, π}-path into a cycle gives dDCJ(Π’’, Γ’) < dDCJ(Π’, Γ’).]
• Remaining components of B(Π*, Γ*):
• bracelet: cycle linking {π, γ}-paths
• chain: path linking π-paths/γ-paths via intermediate {π, γ}-paths
[Figure: examples of a 3-chain, a 2-bracelet, and a 2-chain.]
Necessary Conditions for B(Π*, Γ*)
• Lemma 2: B(Π*, Γ*) can contain only 2-bracelets, 2-chains,
and 3-chains.
• Picture: [Figure: B(Π’, Γ’) vs. B(Π’’, Γ’), showing paths P1 and P2 with π-open and γ-open endpoints; relinking creates a cycle, so dDCJ(Π’’, Γ’) < dDCJ(Π’, Γ’).]
Necessary Conditions for B(Π*, Γ*)
• Lemma 3: B(Π*, Γ*) cannot have one 2-chain joining two odd
π-paths and another 2-chain joining two even π-paths. The
same holds for γ-paths.
• Picture: [Figure: B(Π’, Γ’), with odd π-paths P1, P2 linked and even π-paths P3, P4 linked, vs. B(Π’’, Γ’), where relinking produces even paths; dDCJ(Π’’, Γ’) < dDCJ(Π’, Γ’).]
Sorting Algorithm
1. Remove all circular singletons of Π and Γ.
2. Lemma 1 → Close every {π, π}-path ({γ, γ}-path) into a cycle by adding a single new adjacency to Π* (Γ*).
3. Form a maximum set of 2-bracelets (only chains remaining).
4. Form a maximum set of even 2-chains by linking pairs of π-paths (γ-paths) having opposite parity (Lemma 3).
5. If $p^{\pi,\gamma}$ is odd, then link the remaining {π, γ}-path with any remaining π-path and γ-path.
6. Arbitrarily link pairs of remaining π-paths, all of which have the same parity. Do the same for any remaining γ-paths.
DCJ-Indel Distance
• Theorem: The preceding algorithm solves DCJ-indel sorting in linear time, and it implies a DCJ-indel distance formula. For pairs {Π, Γ} having sing(Π, Γ) = 0:
$$d^{ind}_{DCJ}(\Pi, \Gamma) = N - \left[ c + p^{\pi,\pi} + p^{\gamma,\gamma} + \left\lfloor \frac{p^{\pi,\gamma}}{2} \right\rfloor + \frac{1}{2}\left( p^{0}_{even} + \min\{p^{\pi}_{odd}, p^{\pi}_{even}\} + \min\{p^{\gamma}_{odd}, p^{\gamma}_{even}\} + \delta \right) \right] \quad (9)$$
where δ = 1 only if $p^{\pi,\gamma}$ is odd and either:
1. $p^{\pi}_{odd} > p^{\pi}_{even}$, $p^{\gamma}_{odd} > p^{\gamma}_{even}$; or
2. $p^{\pi}_{odd} < p^{\pi}_{even}$, $p^{\gamma}_{odd} < p^{\gamma}_{even}$.
Otherwise, δ = 0.
• Proof idea: construct an optimal completion (Π*, Γ*) having
$$c(\Pi^*, \Gamma^*) = c + p^{\pi,\pi} + p^{\gamma,\gamma} + \left\lfloor \frac{p^{\pi,\gamma}}{2} \right\rfloor \quad (10)$$
$$p_{even}(\Pi^*, \Gamma^*) = p^{0}_{even} + \min\{p^{\pi}_{odd}, p^{\pi}_{even}\} + \min\{p^{\gamma}_{odd}, p^{\gamma}_{even}\} + \delta \quad (11)$$
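The distance formula of the theorem can be evaluated directly from the path statistics of B(Π, Γ). The sketch below is only arithmetic on counts that are assumed to be given; the function and parameter names are illustrative. Taking the number of 2-bracelets as the floor of $p^{\pi,\gamma}/2$ is an assumption read off step 3 of the sorting algorithm:

```python
def dcj_indel_distance(n_genes: int, c: int,
                       p_pipi: int, p_gg: int, p_pg: int,
                       p0_even: int,
                       p_pi_odd: int, p_pi_even: int,
                       p_g_odd: int, p_g_even: int) -> float:
    """Evaluate the DCJ-indel distance formula for sing(Pi, Gamma) = 0."""
    # delta = 1 only if p^{pi,gamma} is odd and the pi-paths and gamma-paths
    # are both odd-heavy or both even-heavy.
    delta = 0
    if p_pg % 2 == 1:
        both_odd_heavy = p_pi_odd > p_pi_even and p_g_odd > p_g_even
        both_even_heavy = p_pi_odd < p_pi_even and p_g_odd < p_g_even
        if both_odd_heavy or both_even_heavy:
            delta = 1

    # Cycles of the optimal completion: original cycles, closed {pi,pi}- and
    # {gamma,gamma}-paths, and 2-bracelets (assumed to number floor(p_pg / 2)).
    cycles = c + p_pipi + p_gg + p_pg // 2
    # Even paths of the optimal completion.
    even_paths = (p0_even + min(p_pi_odd, p_pi_even)
                  + min(p_g_odd, p_g_even) + delta)
    return n_genes - cycles - even_paths / 2


# Illustrative counts only; they are not derived from real genomes.
d1 = dcj_indel_distance(6, 0, 0, 0, 1, 1, 1, 0, 1, 0)  # delta = 1
d2 = dcj_indel_distance(4, 1, 1, 0, 0, 2, 0, 0, 0, 0)  # delta = 0
d3 = dcj_indel_distance(8, 1, 0, 0, 1, 0, 2, 0, 0, 2)  # opposite parities, delta = 0
```

The function returns a float because it divides the even-path count by 2; no validity check on the inputs is attempted here.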
Section 4: The Solution Space of DCJ-Indel Sorting
1. Preliminaries
2. Encoding Indels as DCJs
3. DCJ-Indel Sorting
4. The Solution Space of DCJ-Indel Sorting
5. Conclusion
Encompassing all Possible Cases
• The solution space is known for DCJ-sorting (Braga and Stoye, 2010).
• Recall that the DCJ-indel distance is the minimum DCJ distance over all completions (Π’, Γ’) of (Π, Γ):
$$d^{ind}_{DCJ}(\Pi, \Gamma) = \min_{(\Pi', \Gamma')} d_{DCJ}(\Pi', \Gamma')$$
• Thus, we only need to find all optimal completions, and the specific operations will fall out in the wash.
Handling Circular Singletons
• The circular singletons of Π must be removed in sing(Π) steps.
We have two options:
1. Delete all the circular singletons of Π.
2. Perform k “fusion” DCJs followed by sing(Π) – k
chromosome deletions.
• This poses a straightforward (yet tedious) counting problem.
Adding Necessary Conditions on B(Π*, Γ*)
• Proposition 1: Every π-path embedding into a 3-chain of an
optimal completion must have the same parity.
• Proposition 2: If $p^{\pi,\gamma}$ is even, then B(Π*, Γ*) must contain a maximum collection of even 2-chains.
• Proofs are slightly more involved…
Finishing the Job
• Four cases, depending on path statistics.
1. $p^{\pi,\gamma}$ is odd:
a) $p^{\pi}_{odd} > p^{\pi}_{even}$, $p^{\gamma}_{odd} > p^{\gamma}_{even}$ (or vice-versa); δ = 1
b) $p^{\pi}_{odd} > p^{\pi}_{even}$, $p^{\gamma}_{odd} < p^{\gamma}_{even}$ (or vice-versa); δ = 0
2. $p^{\pi,\gamma}$ is even:
a) $p^{\pi}_{odd} > p^{\pi}_{even}$, $p^{\gamma}_{odd} > p^{\gamma}_{even}$ (or vice-versa); δ = 0
b) $p^{\pi}_{odd} > p^{\pi}_{even}$, $p^{\gamma}_{odd} < p^{\gamma}_{even}$ (or vice-versa); δ = 0
• These cases are tedious but straightforward and can be handled similarly.
Section 5: Conclusion
1. Preliminaries
2. Encoding Indels as DCJs
3. DCJ-Indel Sorting
4. The Solution Space of DCJ-Indel Sorting
5. Conclusion
Future Work
• Correspondence with Braga et al., 2010?
• Varying the indel cost?
• Charge indel cost ≤ DCJ cost, take minimum total cost.
• Most of the simplifying sorting lemmas hold, but actually
computing the minimum cost appears difficult in this
model.
• The problem is solved! (under framework of Braga et al.,
2010)
Questions?
Shameless Plug
• www.rosalind.info
• A novel education website that teaches bioinformatics through
programming exercises.
• Has a “professor” environment for assigning programming exercises to your bioinformatics classes.