From: ISMB-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.
CARTHAGENE: Constructing and Joining Maximum Likelihood Genetic Maps
Thomas Schiex & Christine Gaspin
Institut National de la Recherche Agronomique
Chemin de Borde Rouge BP 27
Castanet-Tolosan 31326 Cedex, France
{tschiex,gaspin}@toulouse.inra.fr
Abstract
Genetic mapping is an important step in the study of any organism. An accurate genetic map is extremely valuable for locating genes or, more generally, either qualitative or quantitative trait loci (QTL). This paper presents a new approach to two important problems in genetic mapping: automatically ordering markers to obtain a multipoint maximum likelihood map, and building a multipoint maximum likelihood map using pooled data from several crosses. The approach is embodied in a hybrid algorithm that mixes the statistical optimization algorithm EM with local search techniques which have been developed in the artificial intelligence and operations research communities. An efficient implementation of the EM algorithm provides maximum likelihood recombination fractions, while the local search techniques look for orders that maximize this maximum likelihood. The specificity of the approach lies in the neighborhood structure used in the local search algorithms, which was inspired by an analogy between the marker ordering problem and the famous traveling salesman problem. The approach has been used to build joined maps for the wasp Trichogramma brassicae and on random pooled data sets. In both cases, it compares quite favorably with existing software, as far as maximum likelihood is considered a significant criterion.
Introduction
The aim of genetic mapping is to locate genetic markers at loci on the chromosomes. The specific content of a locus in a given chromosome is termed the allele. Given a locus on a chromosome, individuals of diploid species have two alleles, one on each of the corresponding chromosomes contributed by the parents. An individual is said to be homozygous at a locus if the two alleles are identical at this locus; otherwise the individual is said to be heterozygous.
Each chromosome contributed by each parent is built, during meiosis, by using sections of either member of each pair of chromosomes of the parent, changes in the chromosome used being called crossovers. At a given locus, there is a 50% chance of having either one of the parental alleles. However, the "closer" two loci are on the chromosome, the higher the probability that both alleles on this chromosome will appear together on the contributed chromosome. Two loci (or genetic markers) are thus said to be linked if the parental allele combinations are preserved more often than would be expected by random choice. When a parental allele combination between two loci is not preserved, a recombination event is said to have occurred (this corresponds to an odd number of crossovers between the two loci).
The degree of linkage, or genetic distance, between two loci is a function of the frequency of recombinations. Genetic distance is expressed in Morgans (or, more usually, in centiMorgans, cM) and is defined as the expected number of crossovers between the two loci on the chromosome.
In the simplest case, the process of building a genetic map goes as follows: starting from usually incomplete observations on the alleles of the loci of interest on several related members of families, one first builds linkage groups, composed of loci which are significantly linked together. For a given linkage group, a map is defined by locating each locus on a linear chromosome, i.e., by a linear order of the loci and a genetic distance between each adjacent pair of loci.
Given N genetic markers in a linkage group, the marker ordering problem is therefore to find an order of the markers that best respects the available evidence. The task is difficult in two respects: it is not always obvious when an order best respects the available evidence, and the number of possible orders rapidly becomes tremendous, since $N!/2$ different orders exist.
Existing approaches to the marker ordering problem vary along two aspects: the criterion used to define what the best map is, and the algorithmic machinery used to actually find the order that maximizes the criterion. Several packages, such as GMendel (Echt, Knapp, & Liu 1992), Rapid chain delineation (Doerge 1996), Seriation (Buetow & Chakravarti 1987), etc., exploit two-point measures (measures taking into account only two markers simultaneously). The criteria defined can be computed very efficiently under a given order, which enables the use of sophisticated local search techniques (such as simulated annealing in GMendel). But these criteria may be meaningless for data sets which contain a large proportion of missing data (some two-point estimations may be undefined). Other packages consider that multipoint maximum likelihood (which exploits the data on all markers simultaneously) is the criterion that defines the best order. Examples of such packages are MAPMAKER (Lander et al. 1987) and LINKAGE (Lathrop et al. 1985). This criterion is theoretically firmly grounded, exploits all the available evidence and can therefore perform better than the previous criteria when many data are missing. However, its computation may be much more expensive, and order optimization techniques have been limited to the use of simple heuristics or enumerative search.
CARTHAGENE combines the multipoint maximum likelihood criterion with local search techniques. The choice of maximum likelihood as the criterion limits the approach to reasonably simple pedigrees (only backcross data and recombinant inbred lines are currently allowed; an extension to intercross is being worked out) but makes it possible to tackle data sets with many missing data in the best possible way. The use of adequate local search techniques makes it possible to enlarge the maximum number of markers that could previously be handled using exhaustive search while still offering most of the usual services, for example a collection of the maps whose likelihood lies in the vicinity of the best map's likelihood.
The paper is organized as follows. The first section introduces the notion of maximum likelihood orders. This section does not contain anything essentially original; its aim is simply to bring to light the strong theoretical connection that exists between the traveling salesman problem and the multipoint maximum likelihood marker ordering problem in the simple case of backcross with no missing data. Then, we briefly introduce the local search techniques and the structure of the neighborhood used in CARTHAGENE, which was inspired by the previous connection, and show how these techniques can be extended to tackle data sets with missing data and to build joined maps. We finally present some results that confirm in practice the interest of the approach.
Maximum Likelihood Orders
Given a probabilistic model of recombination for a given family structure (or pedigree), a genetic map of a linkage group and the set of available observations on the alleles of the loci of the linkage group, one can define the probability that the observations may have occurred given the map. This is termed the likelihood of the map. The likelihood in itself is not interesting; it is only meaningful when compared to the likelihood of other maps. In the sequel, we consider, as usual, that the interesting maps are the maps with a maximum likelihood, i.e., those which best explain the observations. To later justify our approach to the combinatorial aspect of the genetic mapping problem, we first introduce some basic definitions.
The simplest pedigree, which is routinely used for plants and animals, is the backcross structure defined by the crossing of one homozygous parent with a heterozygous one. Since one parent is homozygous, the alleles of one member of each chromosome pair of any descendant are known. The alleles of the other member of each pair will be used to track down recombination events in the heterozygous parent.
In this section, we assume that so-called phases are known (i.e., it is known which allele appears on which member of the chromosome pairs) and that no data is missing. The alleles carried on one member of the pairs of the heterozygous individual are denoted 0, the others are denoted 1. In this case, the observations on a descendant are simply defined by the alleles carried on the member of each chromosome pair which is contributed by the heterozygous parent.
Assume we have N loci and K descendants. For a given locus $l$ and a given descendant $i$, the allele contributed by the heterozygous parent, which can be either 0 or 1, will be denoted $X_l^i$. This defines the observations.
A genetic map is defined by a linear order on the loci, i.e., a one-to-one mapping $\phi$ from the set $\{1, \ldots, N\}$ to $\{1, \ldots, N\}$, and a probability of recombination between two adjacent loci $\phi(l)$ and $\phi(l+1)$, denoted $\theta_{l,l+1}$. A recombination (resp. non-recombination) event occurs when the allele contributed by the heterozygous parent on a given individual changes (resp. does not change) for two adjacent markers, i.e., when $|X^i_{\phi(l)} - X^i_{\phi(l+1)}|$ is equal to 1 (resp. 0). If we assume that there is no interference, i.e., that recombination events occur independently in each interval, the probability of the observations given the map is simply obtained by multiplying the probabilities of all the events observed, which yields the following formula:
$$\prod_{l=1}^{N-1} \prod_{i=1}^{K} \theta_{l,l+1}^{\,|X^i_{\phi(l)} - X^i_{\phi(l+1)}|} \, (1 - \theta_{l,l+1})^{\,1 - |X^i_{\phi(l)} - X^i_{\phi(l+1)}|}$$
Obviously, a maximum likelihood map is also a maximum log-likelihood map, and the previous formula can be simply transformed by taking the logarithm:
$$\sum_{l=1}^{N-1} \sum_{i=1}^{K} \Big[ |X^i_{\phi(l)} - X^i_{\phi(l+1)}| \log \theta_{l,l+1} + \big(1 - |X^i_{\phi(l)} - X^i_{\phi(l+1)}|\big) \log(1 - \theta_{l,l+1}) \Big]$$
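We are looking for maximum log-likelihood maps. For a given order of the loci, we can easily compute the probabilities $\hat\theta^*_{l,l+1}$ that maximize the log-likelihood. Looking at the first- and second-order derivatives of the previous formula, the $\hat\theta^*_{l,l+1}$ that maximize the log-likelihood can be easily obtained (see footnote 1):
$$\hat\theta^*_{l,l+1} = \frac{\sum_{i=1}^{K} |X^i_{\phi(l)} - X^i_{\phi(l+1)}|}{K}$$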
So, in this case, when all alleles are known, and as far as recombination fractions are concerned, multipoint likelihood computation comes down to simple two-point likelihood computation. The log-likelihood for optimal recombination fractions can therefore be rewritten:
$$\sum_{l=1}^{N-1} \underbrace{K \Big[ \hat\theta^*_{l,l+1} \log(\hat\theta^*_{l,l+1}) + (1 - \hat\theta^*_{l,l+1}) \log(1 - \hat\theta^*_{l,l+1}) \Big]}_{\text{elementary contribution to log-likelihood}}$$
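As an illustration of the rewritten formula, the following Python sketch computes the two-point estimates and the elementary contributions for complete backcross data, then sums them along an order. The function names, the toy data and the choice of natural logarithms are assumptions made for this example; they are not taken from CARTHAGENE.

```python
import math

def two_point_theta(obs, a, b):
    """ML recombination fraction between loci a and b (complete data only)."""
    recombinants = sum(abs(ind[a] - ind[b]) for ind in obs)
    return recombinants / len(obs)

def contribution(obs, a, b):
    """Elementary contribution K[t log t + (1-t) log(1-t)], with 0 log 0 = 0."""
    theta = two_point_theta(obs, a, b)
    total = 0.0
    for p in (theta, 1.0 - theta):
        if p > 0.0:
            total += p * math.log(p)
    return len(obs) * total

def order_loglik(obs, order):
    """Maximum log-likelihood of an order: sum over adjacent pairs of loci."""
    return sum(contribution(obs, order[l], order[l + 1])
               for l in range(len(order) - 1))

# Hypothetical data: 6 descendants (rows), 4 loci (columns), alleles 0/1.
obs = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]
print(order_loglik(obs, [0, 1, 2, 3]), order_loglik(obs, [0, 2, 1, 3]))
```

Because each contribution depends on one pair of loci only, all of them could be precomputed once in an N x N table before any search over orders.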
The maximum log-likelihood of an order is equal to a sum of elementary contributions, each of which depends on only two loci. One can therefore precompute all these numbers for all pairs of loci, and the problem of finding an order that maximizes this multipoint maximum likelihood is in essence identical to problems defined using criteria such as the sum of two-point estimations (e.g., SAR for sum of adjacent recombinations or SAL for sum of adjacent LODs).
All these problems are actually obvious instances of the symmetric wandering salesman problem (WSP, a variant of the famous symmetric traveling salesman problem, TSP): given n cities and the distances between each pair of cities, find a path that goes once through each city and that minimizes the overall distance. The choice of the first and last cities in the path is free. One can simply associate one imaginary city to each marker, and define as the distance between two cities the opposite of the elementary contribution to the log-likelihood defined by the corresponding pair of markers. We would like to acknowledge the fact that the similarity between marker ordering and the TSP for two-point approaches is mentioned in (Liu 1995).
This connection is interesting in several respects: the WSP is known to be an NP-hard problem, and this shows that the marker ordering problem may be difficult in some cases (complexity theory tells us that the tremendous number of existing orders is not sufficient to conclude this). More interestingly, all the techniques which have been developed for the TSP, and which can easily be adapted to the WSP, can also be applied here.
1. This is obtained by solving the simple equations stating that the first-order derivatives are equal to 0. At this point, one can further notice that the matrix of second-order derivatives is diagonal and negative on the domain of optimization, which shows that this point is a maximum.
Tackling the ordering problem
Now that the maximum likelihood ordering problem is known to be equivalent to the WSP, the considerable experience that has been accumulated on this problem can be exploited to provide efficient marker ordering algorithms.
Branch and Cut is probably the best known optimal procedure for the TSP. It has been used to solve instances with several thousand cities to optimality (Applegate et al. 1995). But Branch and Cut finely exploits the mathematical properties of the salesman problem and could probably not be extended to the more complex case of maximum likelihood with missing data. However, it could be used to tackle the problems defined by two-point measures such as SAR and SAL.
When Branch and Cut (or Branch and Bound) cannot be used, an alternative can be found in heuristics. By heuristics, we mean procedures that experimentally work in acceptable time and usually find good quality solutions. In no case do heuristics compute solutions with a guarantee of optimality. A number of heuristic algorithms have been tested on the TSP and could be directly used for the marker ordering problem when the criterion reduces to a sum of elementary contributions.
Among TSP-dedicated heuristics, some are known to work rapidly. For instance, the nearest neighbor algorithm begins with a partial tour consisting of a single city and adds the other cities one after the other by selecting the nearest city not already in the tour. The algorithm terminates when all cities are in the tour. Several existing programs combine the nearest neighbor algorithm with reshuffling procedures to tackle the marker ordering problem (Stam 1993; Doerge 1996). The double minimum spanning tree, Greedy, Christofides, etc. are other examples of such heuristics which could be used as well (Johnson 1990).
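The nearest neighbor heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any of the cited packages: the function name, the `dist` matrix layout and the tie-breaking rule (smallest index first) are assumptions. For marker ordering, `dist[a][b]` would hold the opposite of the elementary contribution of the pair (a, b).

```python
def nearest_neighbor_order(dist, start=0):
    """Greedy path construction: repeatedly append the closest unused city."""
    n = len(dist)
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        # Ties are broken by the smallest city index (an arbitrary choice).
        nxt = min(remaining, key=lambda c: (dist[last][c], c))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical example: four cities on a line at positions 0, 10, 3 and 7.
positions = [0, 10, 3, 7]
dist = [[abs(a - b) for b in positions] for a in positions]
print(nearest_neighbor_order(dist, start=0))
```

The result depends on the starting city, which is one reason such constructive heuristics are usually followed by a reshuffling or local search phase.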
Local search is a more general heuristic which is widely used in difficult optimization problems. Given one initial feasible solution, the algorithm generates a sequence of solutions by repeatedly searching the so-called neighborhood of the current solution for a configuration that will become the new current solution. The configuration chosen is usually the configuration which maximizes the criterion in the neighborhood. A local optimum is a point whose neighborhood contains only configurations which are worse than the current solution. Most local search procedures include a special mechanism that enables them to break out of local optima.
For marker ordering, a solution is an order of the markers, and the criterion is its maximum likelihood. CARTHAGENE strongly relies on the power of the 2-change neighborhood, a successful and well-known neighborhood structure introduced in (Lin & Kernighan 1973) to tackle the TSP. Adapted to the WSP, the 2-change neighborhood of a map is the set of all maps obtained by an inversion of a subsection of the map. Thus, for N markers, the neighborhood has a size of $\frac{N(N-1)}{2} - 1$. This is illustrated in Figure 1 in the case of 4 markers.
[Figure 1 diagram: initial map 1-2-3-4 and its neighbors 2-1-3-4, 1-3-2-4, 1-2-4-3, 3-2-1-4 and 1-4-3-2.]
Figure 1: The five maps in the neighborhood of an initial map with four markers labelled 1, 2, 3 and 4. Each map is obtained by flipping a subsection of the initial map.
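The 2-change neighborhood can be enumerated directly, as in the Python sketch below. The function names are hypothetical, and the sketch assumes that a map and its mirror image are the same map, which is why inverting the whole map does not produce a neighbor and the count is N(N-1)/2 - 1.

```python
def canonical(order):
    """Represent a map and its mirror image by a single tuple."""
    t = tuple(order)
    return min(t, t[::-1])

def two_change_neighborhood(order):
    """All maps obtained by inverting one subsection of at least two markers."""
    order = list(order)
    n = len(order)
    base = canonical(order)
    neighbors = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            flipped = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
            cand = canonical(flipped)
            if cand != base:  # inverting the whole map gives the same map
                neighbors.append(cand)
    return neighbors

for neighbor in two_change_neighborhood([1, 2, 3, 4]):
    print(neighbor)
```

For four markers this lists the five maps of Figure 1; a local search would evaluate the likelihood of each neighbor and move to the best one.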
In CARTHAGENE, this neighborhood is exploited by two different local search procedures:
Tabu search (TS (Glover 1989; 1990)) repeatedly scans the current neighborhood, selecting the best neighbor to be the new solution. To avoid being stuck in local optima, the content of the neighborhood of the current solution is influenced by a memory mechanism which may forbid some moves (which are said to be tabu) in the neighborhood. In CARTHAGENE, the tabu moves are the recent moves (i.e., subsections of the map which have been recently inverted). The precise definition of "being recent or not" varies stochastically during search, as advocated by (Taillard 1991). A tabu move may nevertheless be chosen if it leads to a map which improves the best likelihood known (this is called "aspiration" in tabu terminology).
Genetic algorithms (GA, (Goldberg 1989)) are based on an analogy with the genetic structure and behavior of chromosomes within a population of individuals. Individuals represent potential solutions to a given problem. A fitness score is associated with each individual and represents its adaptation ability. The algorithm makes the population of individuals evolve, both maintaining diversity and favoring the existence of the best individuals. Starting from an initial population, a new generation is created by randomly applying mutation and by crossing pairs of individuals (favoring crosses of good individuals). The hope is that the population will evolve towards one which contains optimal individuals w.r.t. the fitness.
In CARTHAGENE, each individual represents one genetic map (an ordering of markers). The crossover operator computes two offspring individuals, I1 and I2, from two parents P1 and P2. The parents are cut into three sections by randomly selecting two markers. The middle section of P1 is copied into the corresponding position of I1, the rest of I1 being filled in with values taken in order from the third, first and second sections of P2, skipping values that have already been copied from the first parent. I2 is computed in the same way by reversing the roles of the parents. The mutation operator selects two markers and exchanges them. The evaluation of the fitness score of an individual is more original and complex because it will usually modify the individual under evaluation. Indeed, the 2-change neighborhood of the individual is searched for the order that maximizes the maximum likelihood. If an order is found which is better than the preceding one, the search for a better order is carried on in the 2-change neighborhood of this order. This process is repeated until no better order is found in the current neighborhood: a local optimum has been reached. When the evaluation stops, the individual is replaced by this local optimum and its likelihood is used as the fitness. With this approach, one jumps from one local optimum to another through the crossover and mutation operators.
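The crossover and mutation operators described above can be sketched as follows. The text does not say in which order the free positions of the offspring are filled; this sketch assumes the usual convention of order crossover, filling the positions after the second cut first and then wrapping around to the start. Function names and cut points are hypothetical.

```python
def order_crossover(p1, p2, cut1, cut2):
    """Offspring keeping p1[cut1:cut2] in place, completed from p2."""
    n = len(p1)
    child = [None] * n
    child[cut1:cut2] = p1[cut1:cut2]
    used = set(p1[cut1:cut2])
    # Donor values: third, first, then second section of p2, skipping used ones.
    donors = [m for m in p2[cut2:] + p2[:cut1] + p2[cut1:cut2] if m not in used]
    # Assumed fill order: positions after the second cut, then before the first.
    free = list(range(cut2, n)) + list(range(0, cut1))
    for pos, marker in zip(free, donors):
        child[pos] = marker
    return child

def swap_mutation(order, i, j):
    """Return a copy of the order with the markers at positions i and j exchanged."""
    mutated = list(order)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

p1 = [1, 2, 3, 4, 5, 6, 7, 8]
p2 = [3, 7, 5, 1, 6, 8, 2, 4]
i1 = order_crossover(p1, p2, 3, 6)  # first offspring
i2 = order_crossover(p2, p1, 3, 6)  # second offspring, parents swapped
print(i1, i2)
```

In a run, the two cut points would be drawn at random, and each offspring would then be improved by descent in its 2-change neighborhood before its fitness is recorded.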
Finally, CARTHAGENE maintains a set of fixed size containing the S best different maps encountered during the search. A hash table (Cormen, Leiserson, & Rivest 1990) is used to efficiently test whether a map is already present in the set, and a heap structure is used to efficiently manage insertions and deletions (Cormen, Leiserson, & Rivest 1990). At the end of the search, the user can browse this set and check for the existence of other maps whose likelihood is close to the best map's likelihood, in order to get an idea of how strongly the best map is supported.
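A bounded set of best maps combining a hash table and a heap might look like the Python sketch below. It is only an illustration of the idea: the class name `BestMaps` and its methods are hypothetical, Python's built-in `set` plays the role of the hash table, `heapq` provides the heap, and a map and its mirror image are assumed to be the same map.

```python
import heapq

class BestMaps:
    """Keeps the `capacity` best distinct maps seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []     # min-heap of (loglik, order): worst kept map on top
        self._seen = set()  # hash table of the orders currently kept

    def add(self, order, loglik):
        """Try to insert a map; return True if it was stored."""
        key = min(tuple(order), tuple(order)[::-1])
        if key in self._seen:
            return False
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (loglik, key))
        elif loglik > self._heap[0][0]:
            # Replace the worst kept map and forget its order.
            _, dropped = heapq.heapreplace(self._heap, (loglik, key))
            self._seen.discard(dropped)
        else:
            return False
        self._seen.add(key)
        return True

    def within(self, delta):
        """Kept maps whose log-likelihood is within delta of the best one."""
        if not self._heap:
            return []
        best = max(ll for ll, _ in self._heap)
        return sorted(((ll, o) for ll, o in self._heap if ll >= best - delta),
                      reverse=True)

maps = BestMaps(capacity=2)
maps.add([1, 2, 3], -10.0)
maps.add([1, 3, 2], -12.0)
maps.add([2, 1, 3], -11.0)  # evicts the map with log-likelihood -12.0
print(maps.within(3.0))
```

The membership test is constant time on average and each insertion costs O(log S), so the bookkeeping stays cheap compared with the likelihood evaluations.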
Missing Data and Joined Maps
We have seen that two-point and multipoint approaches coincide in the specific case of backcross with no missing data. The underlying assumption of our approach to the more realistic case of missing data is that the problem is still closely related in structure to the WSP and that the previous techniques will probably work fairly well in the presence of missing data, i.e., when two-point and multipoint estimations may differ and the previous connection with the WSP is theoretically lost.
When not all the alleles $X_l^i$ are known, there is no short analytical form that can be exploited to compute the $\theta$ that will maximize the likelihood, and one usually relies on numerical algorithms. The statistical iterative optimization algorithm EM (Expectation/Maximization (Dempster, Laird, & Rubin 1977)) is well adapted to simple pedigrees. It is currently used in MAPMAKER (Lander & Green 1987; Lander et al. 1987). The input of EM is an initial map and the output is a map using the same ordering of loci, but with maximum likelihood recombination fractions (the likelihood of the map is also produced as a side effect). The algorithm works by iteratively going through two steps until the log-likelihood does not increase by more than a given tolerance:
1. an expectation step, where the expected number of recombination events between each pair of adjacent markers is computed, using the current values of the $\theta$. This is done using a dynamic programming algorithm that exploits the linear structure of the problem. This also provides the log-likelihood of the current map.
2. a maximization step, which simply updates the $\theta$ in each interval by dividing the expected number of recombination events in this interval (computed in the previous step) by the total number of individuals.
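The two steps above can be sketched in Python for backcross data with known phases. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: missing alleles are encoded as `None`, the first allele is 0 or 1 with probability 0.5, observations are error-free, and the dynamic programming of the expectation step is written as a standard forward-backward pass. All names are hypothetical.

```python
import math

def e_step(rows, theta):
    """Expected recombinations per interval and log-likelihood of the map."""
    n = len(theta) + 1
    expected = [0.0] * (n - 1)
    loglik = 0.0
    for x in rows:
        # emit[l][s] is 1 if hidden allele s is compatible with observation x[l].
        emit = [[1.0 if v is None or v == s else 0.0 for s in (0, 1)] for v in x]
        alpha = [[0.5 * emit[0][0], 0.5 * emit[0][1]]]  # forward pass
        for l in range(n - 1):
            a, t = alpha[-1], theta[l]
            alpha.append([emit[l + 1][s] * (a[s] * (1 - t) + a[1 - s] * t)
                          for s in (0, 1)])
        beta = [[1.0, 1.0] for _ in range(n)]           # backward pass
        for l in range(n - 2, -1, -1):
            t = theta[l]
            w = [emit[l + 1][s] * beta[l + 1][s] for s in (0, 1)]
            beta[l] = [w[s] * (1 - t) + w[1 - s] * t for s in (0, 1)]
        lik = alpha[-1][0] + alpha[-1][1]
        loglik += math.log(lik)
        for l in range(n - 1):
            rec = sum(alpha[l][s] * theta[l] * emit[l + 1][1 - s] * beta[l + 1][1 - s]
                      for s in (0, 1))
            expected[l] += rec / lik
    return expected, loglik

def em_backcross(obs, order, theta0=0.1, tol=1e-6, max_iter=1000):
    """ML recombination fractions and log-likelihood for a fixed order."""
    rows = [[ind[m] for m in order] for ind in obs]
    k = len(rows)
    theta = [theta0] * (len(order) - 1)
    expected, loglik = e_step(rows, theta)
    for _ in range(max_iter):
        # Maximization step: expected recombinations divided by K (clamped to [0, 1]).
        new_theta = [min(1.0, max(0.0, e / k)) for e in expected]
        new_expected, new_loglik = e_step(rows, new_theta)
        gain = new_loglik - loglik
        theta, expected, loglik = new_theta, new_expected, new_loglik
        if gain <= tol:
            break
    return theta, loglik

data = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0],
        [1, 1, 0, 0], [0, 1, 1, 1], [1, 1, None, 1]]
thetas, loglik = em_backcross(data, [0, 1, 2, 3])
print(thetas, loglik)
```

Each call to `e_step` visits every individual and every interval once, in line with a cost per iteration linear in both the number of markers and the number of individuals. Note that this log-likelihood includes the constant K log(0.5) for the first allele, which the closed-form sum of elementary contributions leaves out; the constant is the same for every order.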
When EM has converged, the log-likelihood obtained indicates the quality of the order used, and the extension of the previous discrete optimization algorithms is straightforward: instead of computing the likelihood of an order by adding elementary contributions, we directly use EM. For backcross data, each iteration of EM is simply in O(N.K). Rather than using the existing EM implementation of MAPMAKER, we decided to use our own implementation, because many calls to this routine are needed. It appears that this implementation, which is finely tuned to handle backcross data, is much faster than the MAPMAKER implementation (we have observed speed ratios of more than 100 on several data sets for the same quality of convergence, using the same machine, language and compiler; this ratio increases as the size of the data set increases). This efficiency is probably one of the nice properties of CARTHAGENE, but it is not obvious that such improvements will still be possible on pedigrees such as F2 intercross data.
The power of local search combined with the generality of multipoint likelihood makes it possible to tackle data sets with a large proportion of missing data and, more specifically, to build joined maps using data from several backcross populations. The problem of building joined maps is specifically addressed by JOINMAP (Stam 1993) and also, to some extent, by GMendel. CARTHAGENE is more limited in its scope than JOINMAP, which can work on several data sets of different natures, but it uses more powerful algorithmic machinery to provide maximum likelihood joined maps.
Merging two data sets is simply done by building a new data set that involves all the markers and individuals that appeared in the two original data sets. When allelic information for a given marker is not available in one of the original data sets, it is simply considered as missing in the new one. This yields data sets where missing data are not randomly distributed over all individuals and markers but where large blocks of missing data are introduced. When the crosses do not all involve exactly the same set of markers (which is the usual case), some two-point estimations are simply impossible and it becomes essential to rely on multipoint likelihoods.
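The merging step can be illustrated with a short Python sketch. The representation is an assumption made for the example: each data set is a pair (marker names, rows of alleles), and `None` stands for a missing observation.

```python
def merge_data_sets(set_a, set_b):
    """Union of markers and individuals; absent markers become missing data."""
    # Keep each marker once, in order of first appearance.
    markers = list(dict.fromkeys(list(set_a[0]) + list(set_b[0])))
    merged = []
    for names, rows in (set_a, set_b):
        column = {name: j for j, name in enumerate(names)}
        for row in rows:
            merged.append([row[column[m]] if m in column else None
                           for m in markers])
    return markers, merged

# Hypothetical populations sharing only marker "m2".
pop_a = (["m1", "m2"], [[0, 1], [1, 1]])
pop_b = (["m2", "m3"], [[0, 0]])
markers, rows = merge_data_sets(pop_a, pop_b)
print(markers)
for row in rows:
    print(row)
```

In the merged table, "m1" and "m3" are never observed together in the same individual, so no two-point estimate exists for that pair; only a multipoint likelihood can relate them, through "m2".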
Because of the large amount of missing data in such joined data sets, simple heuristic procedures usually fail to find an optimal map on such data, whatever criterion they rely on. The local search techniques used in CARTHAGENE usually find maximum likelihood maps even in these difficult conditions. Here, the ability to also produce a set of maps in the vicinity of the optimum is highly valuable, since it allows one to qualitatively assess how strongly the best order is supported by the available data.
Experiments
CARTHAGENE and JOINMAP (rel. 1.4) have been applied to both simulated and real backcross-like data involving multiple populations. CARTHAGENE has been used to build individual (intra-population) and joined maps and is compared with JOINMAP on this last problem.
Simulated data description
Several individual data sets to be joined must be generated for one common map. In our experiments, we considered the problem of joining two data sets. Let $N_i$ be the number of markers in each individual (also called intra-population) data set. Let $N_c$ be the number of markers common to the two individual data sets. For given values of $N_i$ and $N_c$, we first build a map with $N = 2 \times N_i - N_c$ markers. The map is built by evenly distributing the N markers on a 200 cM chromosome. This map is called the "original map" and the order of the N markers in this map is called the "original order". For each pair of adjacent markers in this original map, the recombination probability between the two markers is computed using the inverse of Haldane's mapping function (Ott 1991).
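For reference, the inverse of Haldane's mapping function converts a distance d, in Morgans, into a recombination probability 0.5 (1 - exp(-2d)). The Python sketch below applies it to an evenly spaced map. The function names are hypothetical, and the sketch assumes that the first and last markers sit at the two ends of the chromosome, so that adjacent markers are 200/(N-1) cM apart; the text does not state this.

```python
import math

def haldane_theta(d_cm):
    """Recombination probability for a distance in centiMorgans (inverse Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def haldane_distance(theta):
    """Haldane's mapping function: distance in centiMorgans for theta < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * theta)

def even_map_thetas(n_markers, length_cm=200.0):
    """Adjacent recombination probabilities for evenly spaced markers."""
    spacing = length_cm / (n_markers - 1)  # assumes markers at both ends
    return [haldane_theta(spacing)] * (n_markers - 1)

print(even_map_thetas(21)[0])  # 21 markers, 10 cM between neighbors
```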
The markers that are informative in each individual map are chosen as follows: first, a set of $N_c$ markers is selected randomly from the N markers. The set of $N - N_c$ remaining markers is randomly split into two sets with the same cardinality. These two sets, each merged with the set of the $N_c$ common markers, define the sets of markers which are informative in each individual data set.
Using the map built, a random individual data set for K individuals and for the subset of the $N_i$ informative markers in the data set can be generated as follows:
1. for each individual, we first randomly choose the allele for the first marker in the set {0, 1} with a uniform probability 0.5. Then, for each successive marker, the allele is flipped with a probability equal to the recombination probability. This process is repeated K times to obtain a data set that represents allelic information for K individuals;
2. to incorporate "missing data", each of the alleles generated in this way is, with probability p, deleted and replaced by a missing measure;
3. then, all the information on alleles of markers which are not in the subset of informative markers for the individual map is deleted;
4. finally, the order of the markers in each of these data sets is scrambled using a random permutation to avoid any bias.
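The four steps above can be sketched in Python as follows. This is a hedged illustration, not the generator used for the experiments: the function names are hypothetical, `None` encodes a missing measure, and step 3 is implemented by keeping only the columns of the informative markers.

```python
import random

def simulate_individual(thetas, rng):
    """Step 1: random first allele, then flip with the recombination probability."""
    alleles = [rng.randint(0, 1)]
    for t in thetas:
        flip = rng.random() < t
        alleles.append(1 - alleles[-1] if flip else alleles[-1])
    return alleles

def simulate_data_set(thetas, k, informative, p_missing, rng):
    """Return (marker ids in scrambled order, rows of alleles or None)."""
    kept = sorted(informative)
    rows = []
    for _ in range(k):
        alleles = simulate_individual(thetas, rng)
        # Steps 2 and 3: random deletion, restricted to informative markers.
        rows.append([None if rng.random() < p_missing else alleles[m]
                     for m in kept])
    # Step 4: scramble the marker order with a random permutation.
    perm = list(range(len(kept)))
    rng.shuffle(perm)
    return [kept[j] for j in perm], [[row[j] for j in perm] for row in rows]

rng = random.Random(0)  # fixed seed so that the example is reproducible
marker_ids, rows = simulate_data_set([0.1] * 4, 20, {0, 2, 3}, 0.2, rng)
print(marker_ids, rows[0])
```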
Each individual data set is built using this process. The number of markers in each individual map was either 10 or 15. The number of common markers $N_c$ was either 1, 2, 5 or 10. Overall, the number of markers N in the original map varies from 10 ($N_i = 10$, $N_c = 10$) to 29 ($N_i = 15$, $N_c = 1$). The number of individuals in each individual data set is either 25, 50, 100 or 250. The probability p for an allele to be missing varies from 0 to 20% in steps of 5%. One hundred joining problems are solved for each combination of these parameters. Remember that these data sets represent an ideal case with no error.
TS runs with the following stopping criterion: it stops when the number of iterations which have not improved the likelihood of the best solution is equal to twice the number of markers in the data set. This number of iterations is a choice made for efficiency, needed because of the number of tests. On a Pentium Pro 200, the TS-based algorithm runs in less than 0.08" for intra-population data sets with 10 markers and 25 individuals, to roughly 3' for joined data sets with 29 markers and 250 individuals.
GA runs with the following stopping criterion: it stops as soon as two generations of individuals produce the same best individual (the same order) or the maximum likelihood has not been improved (two different individuals may have the same maximum likelihood). At each generation, the number of individuals is ~-. On the same machine, the GA-based algorithm runs in less than 0.02" for intra-population data sets with 10 markers and 25 individuals, to roughly 15' for joined data sets with 29 markers and 250 individuals. Both approaches gave results of similar quality, and the measurements are given only for TS.
Results on intra-population data sets
Figure 2 is dedicated to intra-population data sets with 10 markers. The curves give the percentage of original orders found by CARTHAGENE (by original order we mean the same order as the one used to generate the data sets). Four numbers of individuals are reported: 25, 50, 100 and 250. It appears that 250 or even 100 individuals seem to be largely sufficient to find the correct order with a reasonable probability, even when missing data is present.
[Plot: fraction of original orders found (0 to 1) against %age informative (100 down to 80); curve labels include T250 and T50.]
Figure 2: Percentage of original orders found (intra-pop).
Since CARTHAGENE maintains a set of the S = 31 best maps found, we measured the number of maps in this set whose log-likelihood lies within -3.0 of the log-likelihood of the best map found (Figure 3). The results show that a high percentage of success is "correlated" with a small number of maps found within -3.0, and therefore the number of maps found in the vicinity of the optimum gives an idea of how strongly the order found is supported by the data.
[Plot: mean number of maps against %age informative (100 down to 80); curve labels include T50 and T100.]
Figure 3: Mean cardinality of the set of maps with a loglike ratio above -3.0 w.r.t. the best map (intra-pop).
We do not report a last set of curves: the percentage of maps found with a log-likelihood equal to or better than the log-likelihood of the original map. This gives an idea of the effectiveness of the optimization algorithm. It was always equal to 100%.
Results on joined data sets
Figures 4 and 5 are dedicated to joined data sets using two sets of 10 markers with 5 markers in common. The curves have the same meaning as in the previous case. These curves show that, even in the relatively favorable case of 5 markers in common, building a merged map is more difficult. This is quite natural given the large amount of missing data in merged data sets and the increased number of markers.
Comparison with JOINMAP on joined data sets
We now compare the results obtained with JOINMAP (dotted curves) with the results obtained with CARTHAGENE (solid curves). The joined data sets were obtained using sets of 10 markers and 250 individuals. The numbers of common markers considered are 1, 2, 5 and 10. Figure 6 represents the percentage of original orders found. CARTHAGENE gives better results than JOINMAP, which could be explained both by the criterion used and by the power of the local search algorithm of CARTHAGENE. In the case of 10 markers in common, the joined data set also has 10 markers and no extra missing data is introduced by the joining process: both systems work perfectly in this case (we have 500 individuals and only 10 markers).
This is again reflected in the mean number of maps within -3.0, which appears in practice to be an excellent indicator of how strongly the map found is supported by the data.
[Plot: fraction of original orders found against %age informative (100 down to 80); curves labelled C1, C2, C5 and C10 for Cartagene and JoinMap, with a single C10 curve for both.]
Figure 6: Percentage of original orders found.
Figure 4: Percentage of original orders found (joined maps).
We do not report the percentage of maps found with a log-likelihood better than the log-likelihood of the original map, which was again systematically equal to 100%.
[Plot: mean number of maps against %age informative (100 down to 80); curve labels include T100 and T250.]
Figure 5: Mean cardinality of the set of maps with a loglike ratio above -3.0 w.r.t. the best map (joined maps).
Figure 7: Percentage of maps with a loglike better than the original map loglike.
In Figure 7 the curves report the percentage of orders found whose loglikelihood was larger than or equal to the loglikelihood of the original order used for generating the data (the maps produced by JOINMAP were optimized using EM for the comparison). The comparison is naturally advantageous for CARTHAGENE, which precisely optimizes the loglikelihood. Note that here a map with a likelihood better than the original map is not systematically found when few markers are common markers (down to 97% in the worst case).
Real data
The real data consist of backcross-like data obtained from the segregation of RAPD markers in three F2 populations of haploid male progenies of several F1 females of Trichogramma brassicae. Full results are described in (Laurent et al. 1997). Each population involved around 90 individuals. Five groups were defined. Joined maps obtained for group II using CARTHAGENE and JOINMAP are given at the end of the paper in Figure 8. One should note that the distances obtained using JOINMAP in (1) are not maximum likelihood distances and were therefore optimized using EM (2). Finally, the order obtained using JOINMAP is more than 14 times less likely than the map obtained using CARTHAGENE. For the other groups, the difference in loglikelihood between the orders obtained using CARTHAGENE and JOINMAP varied from 0 to 16, always in favor of CARTHAGENE. GA and TS gave the same map.
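The "more than 14 times" figure can be checked from the loglikelihoods reported with Figure 8: -458.67 for the JOINMAP order after EM optimization and -457.5 for the CARTHAGENE map. The short computation below assumes these are base-10 logarithms, which the text does not state explicitly but which is consistent with the reported ratio.

```python
# Loglikelihoods reported with Figure 8 (assumed here to be base-10 logs).
loglike_joinmap_em = -458.67   # map (2): JOINMAP order, distances optimized with EM
loglike_carthagene = -457.5    # map (3): CARTHAGENE

# A difference of base-10 loglikelihoods is the log of the likelihood ratio.
delta = loglike_carthagene - loglike_joinmap_em
ratio = 10 ** delta
print(f"loglikelihood difference = {delta:.2f}, likelihood ratio = {ratio:.1f}")
```

Under that assumption, a difference of 1.17 corresponds to a likelihood ratio between 14 and 15.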
Discussion
CARTHAGENE efficiently builds multipoint maximum likelihood maps in the simple case of backcross data (this essentially dedicates CARTHAGENE to plant and animal studies). Its ability to tackle data sets with a large number of missing data allows the construction of multipoint maximum likelihood joined maps with an apparently better probability of success than the package JOINMAP (which can tackle more complex pedigrees).
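Joining populations that were typed for different marker sets leaves every individual untyped at the markers specific to the other population, which is where the large number of missing data comes from. The following is a minimal sketch, assuming each data set is a dict mapping an individual to its genotype per marker. This layout and the names `join_data_sets` and `missing_fraction` are illustrative assumptions, not CARTHAGENE's data format.

```python
def join_data_sets(set_a, set_b):
    """Pool two data sets; markers not typed in a population become None."""
    markers = sorted({m for ds in (set_a, set_b)
                      for geno in ds.values() for m in geno})
    joined = {}
    for label, ds in (("A", set_a), ("B", set_b)):
        for ind, geno in ds.items():
            # Prefix individuals so identifiers from both sets cannot clash.
            joined[f"{label}:{ind}"] = {m: geno.get(m) for m in markers}
    return markers, joined


def missing_fraction(joined):
    """Fraction of (individual, marker) cells that carry no genotype."""
    cells = [g for geno in joined.values() for g in geno.values()]
    return sum(g is None for g in cells) / len(cells)


# Hypothetical backcross-like data: 0/1 codes the allele inherited.
pop_a = {"i1": {"m1": 0, "m2": 0, "m3": 1},
         "i2": {"m1": 1, "m2": 1, "m3": 1}}
pop_b = {"i1": {"m3": 0, "m4": 0, "m5": 1},
         "i2": {"m3": 1, "m4": 0, "m5": 0}}

markers, joined = join_data_sets(pop_a, pop_b)
print(markers)                   # the five pooled markers
print(missing_fraction(joined))  # only m3 is common to both sets
```

With a single common marker out of five, each individual is missing two genotypes. When all markers are common, as in the 10-common-marker case above, joining adds no missing data at all.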
The efficiency of local search also offers services which were previously offered by the exhaustive search compare command of MAPMAKER. However, in CARTHAGENE these features can be used with a number of markers that exceeds the current capacity of compare. Again, one should remember that MAPMAKER can handle CEPH and intercross pedigrees, which is not yet possible in CARTHAGENE.
An interesting feature of CARTHAGENE lies in its ability to produce a set of maps in the vicinity of the best map found. Our experiments show that the number of maps found with a loglikelihood within a constant of the best map's loglikelihood is an indicator of how strongly the order is supported by the data. This is an essential feature for building joined maps.
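As an illustration of this indicator, the sketch below counts the explored maps whose loglikelihood lies within a fixed threshold of the best one, using the value 3.0 from the experiments. The function name `maps_within` and the (order, loglikelihood) pairs are invented for the example. In practice the pairs would come from the maps visited during the search.

```python
def maps_within(candidates, threshold=3.0):
    """Return the maps whose loglikelihood is within `threshold` of the best."""
    if not candidates:
        return []
    best = max(loglike for _, loglike in candidates)
    return [(order, ll) for order, ll in candidates if ll >= best - threshold]


# Hypothetical (order, loglikelihood) pairs collected during a search.
explored = [
    (("m1", "m2", "m3", "m4"), -120.4),
    (("m1", "m3", "m2", "m4"), -121.9),
    (("m2", "m1", "m3", "m4"), -123.2),
    (("m1", "m2", "m4", "m3"), -127.0),
]
close = maps_within(explored)
print(len(close))
```

A small count suggests that the data clearly favor the best order. A large count means that many alternative orders are almost as likely.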
CARTHAGENE shows the interest of exploiting the connection between the traveling salesman problem and the marker ordering problem. There is still a large number of results on the salesman problem which could be exploited to further enhance the efficiency of CARTHAGENE, and we intend to continue our work in this direction. CARTHAGENE also highlights the interest of using a theoretically grounded criterion such as multipoint maximum likelihood.
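To illustrate the connection with the salesman problem, the sketch below applies the classical 2-change (2-opt) move, which reverses a segment of the order, in a simple hill-climbing loop. It is not CARTHAGENE's algorithm. The score is a stand-in (the sum of two-point recombination fractions between adjacent markers, to be minimized) rather than the multipoint maximum likelihood computed with EM. The marker positions are invented, and the toy recombination fractions are derived from them with Haldane's map function.

```python
import math
from itertools import combinations


def order_cost(order, rec):
    """Stand-in criterion: sum of recombination fractions of adjacent markers."""
    return sum(rec[frozenset(pair)] for pair in zip(order, order[1:]))


def two_opt(order, rec):
    """Hill climbing over 2-change moves (segment reversals)."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order) + 1), 2):
            if j - i < 2:
                continue  # reversing fewer than two markers changes nothing
            candidate = order[:i] + order[i:j][::-1] + order[j:]
            if order_cost(candidate, rec) < order_cost(order, rec) - 1e-12:
                order, improved = candidate, True
                break  # restart the scan from the improved order
    return order


# Invented marker positions (in Morgans) used only to build toy data.
positions = {"m1": 0.0, "m2": 0.10, "m3": 0.25, "m4": 0.40, "m5": 0.60}
# Haldane's map function: r = (1 - exp(-2d)) / 2.
rec = {
    frozenset((a, b)): 0.5 * (1 - math.exp(-2 * abs(positions[a] - positions[b])))
    for a, b in combinations(positions, 2)
}

start = ["m3", "m1", "m4", "m2", "m5"]
result = two_opt(start, rec)
print(result, round(order_cost(result, rec), 4))
```

On this small example the loop reaches the order of the invented positions. On real data such a plain hill climber can stop in a local optimum, which is one reason to prefer richer local search schemes such as tabu search.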
Another direction that we shall consider is the extension of CARTHAGENE to more complex pedigrees. This is already well advanced for intercross data. We intend to make the program available to the public as soon as this is finished. People who have backcross-like data and who need CARTHAGENE rapidly can contact the authors by e-mail.
References
Applegate, D.; Bixby, R.; Chvátal, V.; and Cook, W. 1995. Finding cuts in the TSP (a preliminary report). Technical Report 95-05, DIMACS.
Buetow, K., and Chakravarti, A. 1987. Multipoint gene mapping using seriation. Am. J. Hum. Genet. 41:180-201.
Cormen, T. H.; Leiserson, C. E.; and Rivest, R. L. 1990. Introduction to algorithms. MIT Press. ISBN: 0-262-03141-8.
Dempster, A.; Laird, N.; and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. Ser. B 39:1-38.
Doerge, R. 1996. Constructing genetic maps by rapid chain delineation. Journal of Quantitative Trait Loci 2.
Echt, C.; Knapp, S.; and Liu, B.-H. 1992. Genome mapping with non-inbred crosses using GMendel 2.0. Maize Genet. Coop. Newslett. 66:27-29.
Glover, F. 1989. Tabu search - part I. ORSA Journal on Computing 1(3):190-206.
Glover, F. 1990. Tabu search - part II. ORSA Journal on Computing 2(1):4-31.
Goldberg, D. E. 1989. Genetic algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company.
Johnson, D. 1990. Local optimization and the traveling salesman problem. In Proc. 17th Int. Colloquium on Automata, Languages and Programming, volume 443, 446-461. Springer-Verlag Lecture Notes in Computer Science.
Lander, E. S., and Green, P. 1987. Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84:2363-2367.
Lander, E.; Green, P.; Abrahamson, J.; Barlow, A.; Daly, M. J.; Lincoln, S. E.; and Newburg, L. 1987. MAPMAKER: An interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1:174-181.
Lathrop, G.; Lalouel, J.; Julier, C.; and Ott, J. 1985. Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am. J. Hum. Genet. 37:482-488.
Laurent, V.; Vanlerberghe-Masutti, F.; Wajnberg, E.; Mangin, B.; Schiex, T.; and Gaspin, C. 1997. Construction of a composite genetic map of the parasitoid wasp Trichogramma brassicae using RAPD informations from three populations. Submitted.
Lin, S., and Kernighan, B. W. 1973. An effective heuristic algorithm for the traveling salesman problem. Operations Research 21:498-516.
Liu, B. H. 1995. The gene ordering problem, an analog of the traveling salesman problem. In Plant Genome 95.
Ott, J. 1991. Analysis of human genetic linkage. Baltimore, Maryland: Johns Hopkins University Press, 2nd edition.
Stam, P. 1993. Construction of integrated genetic linkage maps by means of a new computer package: JOINMAP. The Plant Journal 3(5):739-744.
Taillard, E. 1991. Robust taboo search for the quadratic assignment problem. Parallel Computing 17:443-455.

Figure 8: Maps for group II using (1) JoinMap, (2) JoinMap, distances optimized with EM, (3) CARTHAGENE. LogLike: (1) -470.58, (2) -458.67, (3) -457.5.