Analysis of Path-Based Approaches
to Genomic Physical Mapping
by
Alan Rimm-Kaufman
James Orlin
WP #3742-94
November 1994
revised
December 1994
Analysis of path-based approaches to
genomic physical mapping
Alan Rimm-Kaufman and James B. Orlin

A. Rimm-Kaufman, Operations Research Center, MIT, Cambridge, MA 02139, USA
J. B. Orlin, Sloan School of Management and the Whitehead Institute/MIT Center for Genome Research, MIT, Cambridge, MA 02139, USA
Abstract

An integrated genomic physical map combines multiple sources of data to position landmarks and clones along a genome. "Path-based" strategies for constructing integrated maps employ paths of overlapping clones to bridge intervals between genetically mapped markers. There is a difficulty with path-based strategies: a small rate of clone overlap error may create numerous spurious paths and result in many false linkages in the physical map. We propose an analytical method to evaluate path-based approaches in light of this difficulty, and demonstrate the method on an integrated map of the human genome.

Integrated genomic physical maps combine many varieties of structural and positional data to order clones and landmarks along the genome. Genetically-mapped sequence tagged sites (STSs) can provide a backbone for an integrated physical map, with overlapping yeast artificial chromosomes (YACs) used to provide coverage of the genetic intervals between adjacent markers. The overlaps between YACs may be detected through a variety of fingerprinting and hybridization data.

To discuss the results of applying our analysis to integrated physical maps, we first introduce some terminology. We consider a data set in which each STS is assigned to a genetic locus. A path of length $k$ between genetic locus $i$ and genetic locus $j$ is defined as two STSs, $s_i$ and $s_j$, and $k$ YACs, $y_1, y_2, \ldots, y_k$, such that (i) $s_i$ maps genetically to locus $i$, (ii) $s_j$ maps genetically to locus $j$, (iii) $s_i$ lies in $y_1$ and $s_j$ lies in $y_k$ by means of STS content mapping, and (iv) for each step $(y_m, y_{m+1})$ the data indicate that the YACs $y_m$ and $y_{m+1}$ overlap. Paths of length 1 include only one YAC and correspond to traditional STS content mapping, while longer paths depend on the detection of YAC overlaps.

An integrated map will assign a YAC to an interval connecting two genetically close loci if there is a short path joining the loci that contains the YAC. A serious concern with such path-based mapping strategies is that falsely detected YAC overlaps can create short false paths between genetically-mapped STSs, resulting in spurious assignment of YACs to genetic intervals, and to spurious coverage of the genome.
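
The path definition above can be illustrated with a small breadth-first search. This is a minimal sketch, not the procedure used by any mapping package: the function name, the YAC names, and the input layout (lists of YACs containing an STS at each locus, plus a list of declared overlaps) are all hypothetical. It returns the number of YACs on a shortest path between two loci, and the usage lines show how a single falsely declared overlap shortens a path.

```python
from collections import deque

def shortest_path_length(yacs_at_i, yacs_at_j, overlaps):
    """Fewest YACs on a path from locus i to locus j, or None if unconnected.

    yacs_at_i / yacs_at_j: YACs containing an STS mapped to each locus
    (STS content mapping). overlaps: iterable of declared (YAC, YAC) overlaps.
    All names here are illustrative assumptions.
    """
    # Undirected adjacency map built from the declared overlaps.
    adj = {}
    for a, b in overlaps:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    targets = set(yacs_at_j)
    # Breadth-first search; every YAC holding an STS at locus i starts
    # at path length 1 (a single YAC containing both STSs is length 1).
    seen = set(yacs_at_i)
    queue = deque((y, 1) for y in yacs_at_i)
    while queue:
        yac, k = queue.popleft()
        if yac in targets:
            return k
        for nxt in adj.get(yac, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, k + 1))
    return None

# A chain of five YACs, then the same chain with one extra (false) overlap.
chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
print(shortest_path_length(["A"], ["E"], chain))
print(shortest_path_length(["A"], ["E"], chain + [("B", "D")]))
```

Breadth-first search is used because every step of a path costs exactly one YAC, so the first time a target YAC is reached the path length is minimal.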

It is known in random graph theory (1) that, in certain random structures, paths of bounded length suffice to connect essentially all pairs of points. This phenomenon has recently gained popular attention through John Guare's award-winning play, "Six Degrees of Separation," in which it is asserted that any two people in the world can be connected through a path of at most six acquaintances. Within an integrated genomic physical map, this same phenomenon could occur even if the majority of the clone overlaps are valid. A small rate of falsely declared clone overlaps can lead to spurious coverage of large regions of the genome.

To address this concern, we developed a methodology to assess the probability that a shortest path joining two loci is correct. We then applied this methodology to the CEPH-Genethon 'first-generation' integrated physical map of the human genome (2). We describe our methodology in the context of this integrated physical map.

Briefly, the CEPH-Genethon physical mapping data involved screening the 33,000-clone CEPH mega-YAC library by two different methods, STS content mapping and Alu-PCR probe hybridization. In the first method, 2100 genetically-mapped STSs (3) were screened against the YAC library (with half the STSs screened completely and half screened partially to obtain 1-2 positives). In the second method, Alu-PCR products were prepared from 5300 individual YACs and were screened by hybridization against spotted Alu-PCR products from a subset of 25,000 YACs and monochromosomal hybrid cell lines. In addition, many YACs were subjected to hybridization-based 'fingerprinting' (4).

A path is defined as before. Two YACs are determined to overlap if either (a) at least one of the two YACs was used as an Alu-PCR probe and hybridized to the other YAC, or (b) the two YACs had similar restriction fragment fingerprints (4). A chromosomally allowable path of length $k$ between genetic locus $i$ and genetic locus $j$ is defined as a path where (i) loci $i$ and $j$ lie on the same chromosome, $c$, and (ii) each YAC on the path that was used as an Alu-PCR probe either gave no signal when hybridized to the monochromosomal panel or hybridized to a set of chromosomes that included chromosome $c$. (For technical reasons, chromosomal assignments were not always unique: 50% could be assigned to a single chromosome, 19% hybridized to multiple chromosomes, and 31% could not be assigned to any chromosome.)
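
The second condition for a chromosomally allowable path can be written as a short filter. The sketch below assumes a hypothetical data layout: a mapping from each YAC that was used as an Alu-PCR probe to the set of chromosomes it hit on the monochromosomal panel, with an empty set meaning no signal. YACs never used as probes are simply absent from the mapping and impose no constraint.

```python
def is_chromosomally_allowable(path_yacs, chrom, probe_hits):
    """Check condition (ii) for a path whose two loci lie on chromosome `chrom`.

    path_yacs: YAC names along the path.
    probe_hits: dict mapping each YAC used as an Alu-PCR probe to the set of
        chromosomes it hybridized to (empty set = no signal). This layout is
        an assumption made for illustration.
    """
    for yac in path_yacs:
        hits = probe_hits.get(yac)
        if hits is None:
            # YAC was never used as a probe: nothing to check.
            continue
        if hits and chrom not in hits:
            # Probe gave a signal, but not on the loci's chromosome.
            return False
    return True

probe_hits = {"A": {"7"}, "B": set(), "C": {"7", "12"}, "D": {"3"}}
print(is_chromosomally_allowable(["A", "B", "C"], "7", probe_hits))
print(is_chromosomally_allowable(["A", "D"], "7", probe_hits))
```

Condition (i), that both loci lie on the same chromosome, is assumed to have been checked by the caller.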

The CEPH-Genethon integrated physical map was defined (2) to be the set of all chromosomally allowable paths of a specified length connecting loci within 10 cM. Neither analytic nor experimental analysis was advanced to suggest that most paths of a given length were correct. It was noted that as longer paths are allowed, the coverage of the genome increases. For example, with paths of length 1, 3, 5 and 7 YACs, the strategy covered 11%, 30%, 70%, and 87%, respectively, of the total genetic length of the genome. (The percentage coverage is defined in (2) as the proportion of total centiMorgans lying between connected loci. This is a conservative measure of coverage when restricted to valid paths, since the coverage does not account for ends of contigs that extend beyond the genetic loci. YACs creating a path between two genetically indistinguishable STSs do not provide incremental coverage on the genome by this definition.)
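
One way to read the coverage definition is as the length of the union of the genetic intervals between connected loci, divided by the total genetic length. The sketch below takes that reading as an assumption; the source does not spell out how overlapping intervals are combined. It treats a single chromosome, with locus positions given in cM, and the function name and inputs are hypothetical.

```python
def covered_fraction(connected_pairs, total_cM):
    """Fraction of `total_cM` lying between connected loci.

    connected_pairs: (position_cM, position_cM) for each connected loci pair.
    Overlapping intervals are merged so no centiMorgan is counted twice
    (an assumed reading of the coverage definition).
    """
    intervals = sorted((min(a, b), max(a, b)) for a, b in connected_pairs)
    covered = 0.0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Close the previous merged interval, if any, and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Interval overlaps the current one: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / total_cM

# The pair (10, 10) joins two genetically indistinguishable STSs and adds nothing.
print(covered_fraction([(0, 2), (1, 4), (10, 10), (20, 25)], 100.0))
```

A pair of loci at the same genetic position contributes a zero-length interval, which matches the remark that such paths provide no incremental coverage under this definition.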

We set out to evaluate the CEPH-Genethon paths using data from the February, 1994 CEPH-Genethon data release (5) and using QUICKMAP (6), the publicly released computer package which accompanies the CEPH-Genethon data. We first constructed the minimum length chromosomally allowable path connecting every pair of loci on the same chromosome, regardless of the genetic distance between them (7). Figure 1 shows the proportion of loci that may be connected, as a function of the path length and the width of the genetic interval between the loci. We were interested in determining what fraction of these observed paths were valid and what fraction were spurious.

A simple way to estimate the proportion of false connections is to consider loci separated by > 50 cM and connected by short paths. Such paths must surely be spurious inasmuch as the average YAC length is 1 Mb, corresponding to only about 1 cM in the human genome. The proportion of such widely-separated loci pairs connected by chromosomally allowable paths of length 1, 3, 5, and 7 is 0.05%, 4%, 33%, and 63%, respectively. In particular, the curve rises dramatically for path lengths exceeding four, indicating that random connections dominate at these distances. The proportion connected for loci pairs at distances 5-10, 10-20, and 20-50 cM was not much higher than for loci pairs at distances > 50 cM. This suggests that many of these paths are also spurious, and that such spurious paths span close loci pairs at approximately the same rate as distant loci pairs.
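
The estimate of false connections from widely separated loci amounts to a simple tabulation. The following sketch assumes a hypothetical input: one record per loci pair, giving the genetic distance in cM and the length of the shortest chromosomally allowable path (or `None` if the pair is unconnected). It returns, for each path length `a`, the fraction of pairs more than 50 cM apart that are joined by a shortest path of at most `a` YACs.

```python
def spurious_rate(pairs, max_len, min_cM=50.0):
    """Fraction of distant loci pairs connected by a path of length <= a.

    pairs: iterable of (distance_cM, shortest_path_length or None).
    Returns a list whose entry a-1 is the fraction for path length a,
    for a = 1..max_len. Input layout and names are illustrative assumptions.
    """
    distant = [length for dist, length in pairs if dist > min_cM]
    if not distant:
        raise ValueError("no loci pairs separated by more than min_cM")
    rates = []
    for a in range(1, max_len + 1):
        connected = sum(1 for length in distant
                        if length is not None and length <= a)
        rates.append(connected / len(distant))
    return rates

# Four distant pairs (two unconnected) and one close pair that is ignored.
pairs = [(60, None), (80, 3), (120, 5), (70, None), (3, 1)]
print(spurious_rate(pairs, max_len=5))
```

Every connection counted here is treated as spurious, on the argument in the text that a short chain of roughly 1 Mb YACs cannot span 50 cM.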

We attempted to estimate the proportions of valid paths. We partitioned the pairs of loci into six groups on the basis of their inter-locus distance: the groups consisted of loci at distance 0-2, 2-5, 5-10, 10-20, 20-50, and > 50 cM, respectively. For a randomly chosen pair of loci at distance range $d$, let $V_d$ denote the length of the shortest valid path between them, let $I_d$ denote the length of the shortest invalid path between them, and let $A_d$ denote the length of the shortest path between them. Then, $A_d = \min(V_d, I_d)$. Figure 1 may be viewed as the cumulative probability distribution of $A_d$ for each of the six groups.

We next made two assumptions: (i) the length of the shortest valid path between two loci is independent of the length of the shortest spurious path joining them, and (ii) the frequency of spurious paths between two loci is independent of the distance between them. Under these assumptions, one can estimate an empirical joint probability density for $V_d$ and $I_d$.

With these definitions, let $P_d(a)$ = the probability that two points in distance range $d$ are connected by a valid path of length $\le a$. Let $S_d(a)$ = the probability that two points in distance range $d$ are connected by a spurious path of length $\le a$. Let $\pi_d(a)$ = the probability that two points in distance range $d$ are connected by any path of length $\le a$. It follows from the independence of $I_d$ and $V_d$ that

$$\pi_d(a) = P_d(a) + S_d(a) - P_d(a)\,S_d(a).$$

Thus,

$$P_d(a) = \frac{\pi_d(a) - S_d(a)}{1 - S_d(a)}.$$

The quantity $\pi_d(a)$ is directly observable. We assume that $S_d(a)$ is independent of $d$, and so can be calculated from points separated by at least 50 cM. Therefore, $P_d(a)$ can be calculated.

One can then estimate, for each distance group and each path length, the probability that there is at least one valid shortest path among the observed shortest paths (Figure 2). This is the probability

$$P(V_d = a \mid A_d = a) = \frac{P(V_d = a \;\&\; A_d = a)}{P(A_d = a)}.$$

The denominator is $\pi_d(a) - \pi_d(a-1)$. The numerator is the probability that there is a valid path of length $a$ and there is no path (valid or invalid) of length at most $a-1$. Under the independence of $V_d$ and $I_d$, the numerator is equal to $(P_d(a) - P_d(a-1))(1 - S_d(a-1))$. With this probability, we provide a corrected estimate of the observed proportion of loci pairs connected by at least one valid path (Figure 3) by subtracting out the expected fraction of loci pairs whose shortest paths are invalid.
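
The two formulas above, P_d(a) = (pi_d(a) - S_d(a)) / (1 - S_d(a)) and P(V_d = a | A_d = a) = (P_d(a) - P_d(a-1)) (1 - S_d(a-1)) / (pi_d(a) - pi_d(a-1)), can be evaluated directly once the observed and spurious cumulative proportions are tabulated. The sketch below is illustrative only: the function name is hypothetical, the input numbers are synthetic rather than taken from the CEPH-Genethon data, and clipping negative estimates to zero is an added assumption to absorb sampling noise. It also assumes S(a) < 1 for every tabulated length.

```python
def valid_path_estimates(pi, s):
    """Estimate valid-path probabilities for one distance group.

    pi[a-1]: observed probability of any path of length <= a (pi_d(a)).
    s[a-1]:  probability of a spurious path of length <= a (S(a)), e.g.
             estimated from loci pairs more than 50 cM apart.
    Returns (P, cond): P[a-1] estimates P_d(a); cond[a-1] estimates
    P(V_d = a | A_d = a), or None when no shortest path has length a.
    """
    n = len(pi)
    # P_d(a) = (pi_d(a) - S(a)) / (1 - S(a)); negatives clipped (assumption).
    P = [max(0.0, (pi[k] - s[k]) / (1.0 - s[k])) for k in range(n)]
    cond = []
    for k in range(n):
        # Values at length a-1; all cumulative quantities are 0 at length 0.
        P_prev = P[k - 1] if k else 0.0
        s_prev = s[k - 1] if k else 0.0
        pi_prev = pi[k - 1] if k else 0.0
        denom = pi[k] - pi_prev                      # P(A_d = a)
        numer = (P[k] - P_prev) * (1.0 - s_prev)     # P(V_d = a and A_d = a)
        cond.append(numer / denom if denom > 0 else None)
    return P, cond

# Synthetic inputs built from P = [0.2, 0.5, 0.6] and S = [0.0, 0.1, 0.4]
# via pi = P + S - P*S, so the function should recover P.
P, cond = valid_path_estimates([0.2, 0.55, 0.76], [0.0, 0.1, 0.4])
print(P)
print(cond)
```

In this synthetic case the conditional probability falls as the path length grows, because the spurious-path probability rises faster than the valid-path probability.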

The results indicate that for two loci within 5 cM of one another and spanned by shortest paths of ≤ 3 YACs, the shortest paths are likely to include at least one valid path. On the other hand, if the shortest paths are ≥ 4 YACs or if the two loci are more than 5 cM apart, then the shortest paths are likely to be invalid. Considering only shortest paths of at most 3 YACs between loci within 5 cM, and using the definition of coverage given in (2), the CEPH/Genethon data cover only about 27% of the human genome. This coverage of the genome is far less than indicated in (2).

The previous analysis was carried out on the complete data set, including both Alu-PCR connections and fingerprint data. We conducted two additional sets of analyses -- one with Alu-PCR but no fingerprints, and one with fingerprint data but no Alu-PCR -- to investigate whether the coverage by valid paths might increase. The amount of valid coverage may possibly increase with the omission of a data source if the omitted data has provided many spurious connections and few valid ones. The results of both of these analyses were qualitatively the same, and both additional cases resulted in a decreased coverage of the genome as compared to the first analysis.

In the analysis involving Alu-PCR connections but no fingerprints, for two loci within 5 cM of one another and spanned by shortest paths of ≤ 3 YACs, the shortest paths are likely to include at least one valid path. On the other hand, if the shortest paths are ≥ 4 YACs or if the two loci are more than 5 cM apart, then the shortest paths are likely to be invalid. (For example, for two loci that are within 2 cM of each other, and spanned by a shortest path of 4 YACs, the probability that the shortest path is invalid is 55%.) Considering only shortest paths of at most 3 YACs between loci within 5 cM, the Alu-PCR data (without fingerprints) cover 24% of the human genome.

In the analysis including fingerprint data but no Alu-PCR connections, the fingerprint data generate longer valid paths: loci within 5 cM of one another and spanned by shortest fingerprint-only paths of ≤ 5 YACs are likely to be covered by at least one valid path. While the fingerprint data support longer valid paths, the relative paucity of these paths leads to lower genomic coverage overall. Fingerprint-only paths of ≤ 5 YACs between loci within 5 cM cover only 21% of the human genome.

These specific analyses address neither the quality of the CEPH/Genethon data nor alternative path-based strategies (8). These calculations consider only the quality of shortest paths in the CEPH/Genethon data as generated from QUICKMAP. A larger fraction of reliable paths might be obtained using alternative analysis strategies, as well as additional data since added by CEPH and Genethon to their February, 1994 release. Identifying and removing noisy data, modifying the CEPH/Genethon definition of a path, considering longer length paths in addition to the minimum length paths, or employing a denser genetic map are possible alternatives that might yield a larger fraction of veracious paths covering more of the genome.

Further, while our analysis can indicate the probability that there is a valid path, it does not provide a means to distinguish valid paths from invalid paths. Only additional laboratory experiments can provide final confirmation or invalidation of a path.

Aside from an application to a specific data set, we have outlined a general method for estimating the percentage of valid paths in the presence of false connections leading to invalid paths. In order to estimate the probability that a shortest path is valid, one needs to subtract out the probability that it is invalid. This, in turn, is estimated by evaluating the percentage of distant loci pairs connected by shortest paths of each length. This methodology extends to other analyses as well, including analyses using alternative criteria for determining YAC overlaps or alternative definitions of paths.

References and Notes

(1) B. Bollobas, Random Graphs. (Academic Press, London, 1985).
(2) D. Cohen, I. Chumakov, and J. Weissenbach, Nature, 366, 698 (1993).
(3) J. Weissenbach et al., Nature, 359, 794 (1992).
(4) C. Bellanne-Chantelot et al., Cell, 70, 1059 (1992).
(5) These data included 6602 Alu-PCR probes and 3450 STSs. 2035 of the STSs were genetically mapped to a specific chromosomal location; the resulting genetic map contained 1266 loci. These data are the same data employed in (1), with an additional 1270 Alu-PCR probes added to the dataset by CEPH/Genethon beyond that described in (1).
(6) P. Rigault et al., QUICKMAP, available from rigault@ceph.cephb.fr.
(7) We employed CLONESPATH, one module of the QUICKMAP package, to generate chromosomally allowable paths between selected loci. CLONESPATH offers a variety of options and parameters; the results described in this paper correspond to the default settings, which in turn correspond to the path creation rules described in (1). For the "link-type" parameter, we conducted three analyses: once using ALU and fingerprint links, once using only ALU links, and once using only fingerprint links. The figures correspond to using ALU and fingerprint links; the results were qualitatively the same using ALU only.
(8) Varying the criteria used to declare YAC overlaps or modifying the criteria used to declare a genetic interval to be covered leads to alternative path-based construction strategies. A few examples involving overlap detection include (i) requiring two or more ALU probes to link two YACs, and (ii) requiring both an ALU probe and a fingerprint overlap to link two YACs. Examples involving the definition of covering a genetic interval include (iii) requiring two or more YAC-disjoint paths to cover the genetic interval, and (iv) restricting the use of each Alu-PCR probe to a specific region on the genome.
(9) We thank D. Cohen, I. Chumakov, and J. Weissenbach for sharing this valuable data with us and the scientific community at large. We thank E. Lander, D. Page, L. Kruglyak, L. Stein, D. Cohen, I. Chumakov, and P. Rigault for helpful discussions and comments on the manuscript. Supported in part by NIH grant HG00098.

Figure Legends

Figure 1. Observed cumulative proportion of connected loci pairs, by inter-loci distance and path length. Minimal paths were constructed between all intra-chromosomal loci pairs.

Figure 2. Estimated probability that there is at least one valid shortest path spanning an inter-locus interval among all the observed shortest paths spanning that interval, by inter-locus distance and path length.

Figure 3. Estimated true proportion of connected loci pairs, by inter-locus distance and path length. The probability an interval is covered correctly is the probability that at least one of the paths spanning the interval is valid.

[Figure 1: plot not reproduced. Horizontal axis: Path Length. Vertical axis ticks: 25%, 50%, 75%, 100%.]

[Figure 2: plot not reproduced. Horizontal axis: Path Length. Vertical axis ticks: 10% to 60%.]

[Figure 3: plot not reproduced. Horizontal axis: Path Length. Legible vertical axis ticks: 0.25, 0.75. One legible series label: 0-2 cM.]