From: ISMB-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved. CARAGENE:Constructing and Joining Maximum Likelihood Genetic Maps* Thomas Schiex & Christine Gaspin Institut National de la RechercheAgronomique Chemin de Borde Rouge BP 27 Castanet-Tolosan 31326C&lex, France {tschiex,gaspin}@toulouse.inra.fr Abstract Geneticmapping is an importantstep in the studyof anyorganism.Anaccurate genetic mapis extremelyvaluablefor locatinggenesor moregenerallyeither qualitativeor quantitative trait loci (QTL).Thispaperpresentsa newapproach twoimportantproblemsin genetic mapping:automatically orderingmarkersto obtain a multipointmaximum likelihood mapand building a multipoint maximum likelihood mapusing pooleddata fromseveralcrosses. Theapproachis embodied in an hybridalgorithmthat mixes the statistical optimizationalgorithmEMwithlocal search techniqueswhichhavebeendeveloped in the artificial intelligence and operations research communities.Anefficient implementationof the EMalgorithm provides maximumlikelihood recombinationfractions, while the local search techniqueslook for orders that maximize this maximum likelihood.Thespecificity of the approachlies in the neighborhood structure usedin the local searchalgorithms whichhas beeninspired by an analogybetweenthe marker ordering problemand the famoustraveling salesmanproblem. Theapproachhas beenused to build joined mapsfor the waspTrichogramma brassicae and on randompooled data sets. In bothcases, it compares quite favorablywithexisting softwaresas far as maximum likelihood is consideredas a significantcriteria. Introduction The aim of genetic mappingis to locate genetic markers at loci on the chromosomes.The specific content of a locus in a given chromosomeis termed the allele. Given a locus on a chromosome,individuals of diploid species have two alleles, one on each of the corresponding chromosomes contributed by the parents. Anindividual is said to be homozygousat a locus if the two alleles are identical at this locus, else the individual is said to be heterozygous. Each chromosomecontributed by each parent is built, during the meiosis, by using sections of either memberof each pair of chromosomesof the parent, changes in the "Copyright(c) 1997,American Associationfor Artificial Intelligence(www.aaai.org). All rights reserved. 258 ISMB-97 chromosome used being called crossovers. At a given locus, there is a 50%chance of having either one of the parental allele. However,the "closer" two loci are on the chromosome, the higher the probability that both alleles on this chromosomewill appear together on the contributed chromosome. Twoloci (or genetic markers) are thus said be linked if the parental allele combinationsare preserved more often than would be expected by random choice. Whena parental allele combinationbetweentwo loci is not preserved, a recombinationevent is said to have occurred (this corresponds to an odd numberof crossovers between the twoloci). The degree of linkage, or genetic distance, betweentwo loci is a function of the frequencyof recombinations.The measurementof genetic distance is expressed in Morgan (or moreusually cMfor centiMorgan)and is defined as the expected numberof crossovers betweenthe two loci on the chromosome. In the simplest case, the process of building a genetic mapgoes as follows: starting from usually incomplete observationson the alleles of the loci of interest on several related members of families, one first builds linkage groups, composedof loci which are significantly linked together. For a given linkage group, a mapis defined by locating each loci on a linear chromosome i.e., by a linear order of the loci and a genetic distance betweeneach adjacent pair of loci. GivenN genetic markers in a linkage group, the marker orderingproblemis therefore to find an order of the markers that best respects the available evidence. Thetask is difficult in two aspects: it is not always obvious to say when an order best respects the available evidence and the number of possible orders becomesrapidly tremendous since N_2different ordersexist. 2 Existing approaches to the markers ordering problem varies along two aspects: the criteria used to qualify what the best mapis and the algorithmic machineryused to actually find the order that maximizesthe criteria. Several packages, such as GMendel(Echt, Knapp, & Liu 1992) or Rapidchain delineation (Doerge1996), Seriation (Buetow & Chakravarti 1987), etc. exploit two point measures (measures taking only into account two markers simultaneously). The criteria defined can be computedvery efficiently under a given order whichenables the use of sophisticated local search techniques(such as simulatedannealing in GMendel). But these criteria maybe meaningless for data sets whichcontain a large part of missing (sometwopoint estimations maybe indefinite). Other packages consider that multipoint maximum likelihood (that exploits the data on all markerssimultaneously)is the criteria that defines the best order. Examplesof such packages are MAPMAKER (Lander et aL 1987), LINKAGE (Lathrop et al. 1985)...The criteria is theoretically firmly grounded,exploits all the available evidence and can therefore perform better than the previous criteria whenseveral missing exist. However,its computationmaybe muchmoreexpensive and order optimization techniques have been limited to the use of simple heuristics or enumerativesearch. CAR~AGENE combines the multipoint maximumlikelihood criteria with local search techniques. The choice of maximum likelihood as the criteria limits the approach to reasonably simple pedigree (only backcross data and recombinantinbred lines are allowedactually, an extension to intercross is currently being workedout) but makesit possible to tackle data sets with several missingin the best possible way. The use of adequate local search techniques makes it possible to enlarge the maximum numberof markers that could be previously handled using exhaustive search while still offering mostof the usual services, for examplea collection of the mapswhoselikelihood lies in the vicinity of the best map’slikelihood. Thepaper goes as follows: the first section introduces the notion of maximum likelihood orders. This section does not contain anything essentially original ; its aim is simply to bring to light the strong theoretical connectionthat exists betweenthe traveling salesmanproblemand the multipoint maximum likelihood marker ordering problemin the simple case of backcross with no missing data. Then, we rapidly introduce the local search techniquesand the structure of the neighborhood used in CAR~AGENE, which was inspired by the previous connection and show howthese techniques can be extendedto tackle data sets with missing data and to build joined maps. Wefinally present someresuits that confirmpragmaticallythe interest of the approach. ¯ Maximum Likelihood Orders Given a probabilistic modelof recombination for a given family structure (or pedigree), a genetic mapof a linkage groupand the set of available observationson the alleles of the loci of the linkage group, one can define the probability that the observations mayhave occurred given the map. This is termed the likelihood of the map. The likelihood in itself is not interesting, it is only meaningfulwhencompared to the likelihood of other maps.In the sequel, we con- sider, as usual, that the interesting mapsare the mapswith a maximum likelihood i.e., whichbest explain the observations. To later justify our approachfor the combinatorial aspect of the genetic mappingproblem,we first introduce somebasic definitions. The simplest pedigree, whichis routinely used for plants and animalsis the backcrossstructure defined by the crossing of one homozygousparent with an heterozygous one. Since one parent is hcnnozygous,the alleles of one member of each chromosomepair of any descendant is known.The alleles of the other member of eachpair will be used to track downrecombinationevents in the heterozygousparent. In this section, we assume that so-called phases are known(i.e., it is knownwhich allele appears on which memberof the chromosome pairs) and that no data is missing. The alleles carried on one memberof the pairs of the heterozygousindividual are denoted 0, the others are denoted 1. In this case, the observation on a descendantare simply defined by the alleles carried on the member of each chromosome pair which is contributed by the heterozygous parent. Assumewe have N loci and K descendants. For a given loci t and a given descendant i, the allele contributed by the heterozygousparent, whichcan be either 0 or I will be denoted X~. This defines the observations. A genetic mapis defined by a linear order on the loci i.e., a one-to-one mapping~ from the set {1,..., N} to {1,..., N} and a probability of recombination between two adjacent loci ~(l) and 4,(t + 1), denoted 8t,t+x. A recombination (resp. non-recombination) event occurs when the allele contributed by the heterozygousparent on a given individual changes for two adjacent markers i.e., when ¯ i i ]X~(t) - X~(t+l) ] is equal to 1 (resp. 0). If we assume that there is no interference i.e., that recombinationevents occur independentlyin each interval, the probability of the observations given the mapis simply obtuined by multiplying the probabilities of all the events observedwhichyields the following formula: l=N-1 i=K l=l i----1 Obviously, a maximum likelihood map is also a maximumlog-likelihood map and the previous formula can be simply transformedby taking the logarithm. /=N--I i=K EE l=l i=I Schiex .... ............ .......................................................................................................................................................................................................................................................................... 259 Tackling Weare looking for maximum log-likelihood maps. For a given order of the loci, we can easily computethe probabilities 0* that maximizethe log-likelihood. Lookingat first and second-orderderivatives of the previous formula, the that maximizethe log-likelihood can be easily obtainedl: K So, in this case, whenall alleles are known,and as far as recombinationfractions are considered, multipoint likelihood computationcomesdownto simple two-points likelihood computation. The log-likelihood for optimal recombination fractions can therefore be rewritten: £=N- 1 ^* ^* - Or,t+1) Z K’[O’,t+ l l°g(/~;, t+x) + (1- Ot,t+l)log(1 ^* ] t=l " ..... " elementarycontribu~onto log-likelihood The maximum log-likelihood of an order is equal to a sum of elementary contributions which depend only on two loci. Onecan therefore precomputeall these numbers for all pairs of loci and the problemof finding an order that maximizesthis multipoint maximum likelihood is in essence identical to problemsdefined using criteria such as the sum of two-points estimations (eg. SARfor sumof adjacent recombinations or SALfor sum of adjacent LODS). All these problemsare actually obvious instances of the symmetric wandering salesman problem (a variant of the famous symmetric traveling salesman problem): given n cities and the distances betweeneach pair of cities, find a path that goes once through each city and that minimizes the overall distance. The choice of the first and last cities in the path is free. Onecan simply associate one imaginary city to each marker, and define as the distance between twocities the opposite of the elementarycontribution to the log-likelihood defined by the corresponding pair of markers. Wewouldlike to acknowledgethe fact that the similarity betweenmarker ordering and the TSPfor two-points approachesis mentionedin (Liu 1995). This connection is interesting in several aspects: the WSPis knownto be an NP-hard problem and this shows that the marker ordering problemmaybe difficult in some cases (algorithm theory tells us that the tremendousnumber of existing orders is not sufficient to concludethis). More interestingly, all the techniques whichhave been developed for the TSP, and which can easily be adapted to the WSP, can also be applied here. lThis is obtainedby solvingthe simpleequationsstating that first-orderderivativesare equalto 0. Atthispoint,onecanfurther noticethat the matrixof secondorderderivativesis diagonalnegative on the domainof optimization,whichshowsthat this point is a maximum. 260 ISMB-97 the ordering problem Nowthat the maximumlikelihood ordering problem is knownto be equivalent to the WSP,the considerable experience that has been accumulatedon this problemcan be exploited to provide markersordering efficient algorithms. Branchand Cut is probably the best knownoptimal procedure for the TSP. It has been used to solve instances with several thousandcities to optimality (Applegateet al. 1995). But Branchand Cut finely exploits the mathematical properties of the salesmanproblemand could probably not be extended to the more complexcase of maximum likelihoodwith missing data. However,it could be used to tackle the problems defined by two-points measures such as SAR and SAL. WhenBranch and Cut (or Branch and Bound) cannot used, an alternative can be foundin heuristics. By heuristics, we meanprocedures that experimentally work in acceptable time and usually find goodquality solutions. In no case heuristics allow to computesolutions with a guarantee of optimality. A numberof heuristics algorithms have been experimented on the TSPand could be directly used for the markersordering problemwhenthe criteria reduces to a sumof elementarycontributions. AmongTSP dedicated heuristics, some are knownto workrapidly. For instance, the nearest neighbor algorithm begins with a partial tour consisting of a single city, adds the other cities one after the other by selecting the nearest next city not already in the tour. Thealgorithm terminates whenall cities are in the tour. Severalexisting programsassociate the nearest neighboralgorithmwith reshuffling procedures to tackle the markersordering problem(Stam1993; Doerge1996). The double minimumspanning tree, Greedy, Christofides etc. are other examplesof such heuristics which could be used as well (Johnson 1990). Localsearch is a moregeneral heuristic whichis largely used in difficult optimization problems.Givenone initial feasible solution, the algorithmgeneratesa sequenceof solutions by repeatedly searching the so-called neighborhood of the current solution for a configuration that will become the newcurrent solution. The configuration chosenis usually the configuration whichmaximizesthe criteria in the neighborhood.A local optimumis a point which neighborhoodcontains only configurations whichare worsethan the current solution. Mostlocal search procedures include a special mechanism that enables themto break out of local optima. For markers ordering, a solution is an order of the markers, and the criteria is its maximum likelihood. CAR~AGENE strongly relies on the power of the 2change neighborhood, a successful well-knownneighborhood structure introduced in (Lin &Kernighan 1973) tackle the TSP. Adaptedto the WSP,the 2-change neighborhoodof a mapis the set of all mapsobtained by an in- 3-2-.1-4 .-" .-’"" ’"" .’ .’...’" .-" ..’" .-" ".’.’.’.’.’... 2.1.~ 1-~2 ".’.’.’.’".""""""""""’".".".’".. 1.~24 ".’.’.".’"’" "’"....-.......... "".. 1.2~.3 Figure I: The five mapsin the neighborhoodof an initial mapwith four markerslabelled I, 2, 3 and 4. Eachmapis obtained by flipping a subsectionof the initial map. version of a subsection of the map. Thus, for N markers, the neighborhoodhas a size of N.(N-1) 2 _ 1. This is illustrated in figure 1 in the case of 4 markers. In CAR~AGENE, this neighborhood is exploited by two different local search procedures: Tabu search (TS (Glover 1989; 1990)) repeatedly scans the current neighborhood,selecting the best neighbor to be the new solution. To avoid being stucked in local optima, the content of the neighborhoodof the current solution is influenced by a memorymechanismwhich may forbid some moves(which are said to be tabu) the neighborhood. In CAR~AGENE, the tabu moves are the recent moves(i.e., subsections of the mapwhichhave been recently inverted). The precise definition of "being recent or not" varies stochastically during search as advocated by (Taillard 1991). A tabu movemayeventually be chosen if it leads to a mapwhich improvesthe best likelihood known(this is called "aspiration" in tabu terminology). Genetic algorithms (GA, (Goldberg 1989)) are based an analogy with the genetic structure and behavior of chromosomeswithin a population of individuals. Individuals represent potential solutions to a given problem. A fitness score is associated to each individual and represents its adaptation ability. The algorithm makesthe population of individuals evolve maintainingboth diversity and favoringthe existence of best individuals. Starting from an initial population, a newgeneration is created by randomly applying mutation and by crossing pairs of individuals (favoring crosses of goodindividuals). The hope is that the populationwill evolve towards one whichcontains optimal individuals w.r.t, the fitness. In CAR~AGENE, each individual represents one genetic map(an ordering of markers). The crossover operator computes two offspring individuals,/i and/2 from two parents P1 and P2. The parents are cut into three sections by selecting randomly two markers. The middle section of P1 is copied into the correspondingposition of 11, the rest of/1 being filled in with values taken in order fromthe third, first and secondsection of P2, skipping values that have already been copied from the first parent. I2is computedin the same wayby reversing the parents. The mutation operator selects two markers and exchangesthem. The evaluation of the fitness score of an individual is moreoriginal and complexbecause it will usually modifythe individual under evaluation. Indeed, the 2-changeneighborhoodof the individual is searched for the order that maximizesthe maximum likelihood. If an order is found whichis better than the preceeding one, search for a better order is carried on in the 2-change neighborhood of this order. This processis repeated until no better order is foundin the current neiborhood:a local optimumhas been reached. Whenthe evaluation stops, the individual is replaced by this local optimumand its likelihood is used as the fitness. Withthis approach,one skips from a local optimumto another throughout the crossover and mutation operators. Finally, CAR~AGENE maintains a set of fixed size containing the S best different mapsencountered during the search. A hash table (Cormen, Leiserson, & Rivest 1990) is used to efficiently test if the mapis already present in the set, a heapstructure is usedto efficiently manage insertions/deletions (Cormen,Leiserson, &Rivest 1990). At the end of the search, the user can browsethis set and checkfor the existence of other mapswhoselikelihood is close to the best map’slikelihood in order to get an idea of howstrongly the best mapis supported. Missing Data and Joined Maps Wehave seen that two-points and multipoint approaches coincide in the specific case of backcross with no missing data. The underlying assumption of our approach to the morerealistic case of missing data is that the problemis still closely related in structure to the WSPand that the previous techniqueswill probably workfairly well in presence Schiex 261 of missing data i.e., whentwo-points and multipoint estimations maydiffer and the previous connection with WSP is theoreticallylost. Whenall the alleles Xj are not known,there is no short analytical form that can be exploited to computethe ~ that will maximizethe likelihood and one usually relies on numerical algorithms.Thestatistical iterative optimizationalgorithm EM(Expectation/Maximization (Dempster, Laird, & Rubin 1977)) is well adapted to simple pedigree. is currently used in MAPMAKER (Lander & Green 1987; Lander et al. 1987). The input of EMis an initial mapand the output will be a mapusing the same ordering of loci, but with maximum likelihood recombinationfractions (the likelihood of the mapis also producedas a side effect). The algorithm worksby iteratively going through two steps until the log-likelihood does not increase morethan a given tolerance. I. an expectation step where the expected numberof recombinationevents betweeneach pair of adjacent markers is computed,using the current values of the ~. This is done using a dynamicprogrammingalgorithm that exploits the linear structure of the problem.This also provides the log-likelihood of the current map. 2. a maximizationstep simply updates the 0 in each interval by dividing the expected numberof recombination events in this interval (computedin the previous step) the total numberof individuals. WhenEMis converged, the loglikelihood obtained indicates the quality of the order used and the extension of the previousdiscrete optimizationalgorithmsis straightforward. Instead of computingthe likelihood of an order by adding elementary contributions, we directly use EM.For backcross data, each iteration of EMis simply in O(N.K). Rather than using the existing EMimplementationof MAPMAKER, we decided to use our own implementation because several calls to this routine will be needed.It appears that this implementation, which is finely tuned to handle backcross data, is much faster than MAPMAKER implementation (we have observed ratio of more than 100 in speed on several data sets for a same quality of convergence, using the same machine, language and compiler. This ratio increases as the size of the data set increases). This efficiency is probably one of the nice properties of CAR~AGENE but it is not obvious that such improvements will still be possible on pedigreesuch as F2 intercross data. The powerof local search combinedwith the generality of multipoint likelihood makesit possible to tackle data sets with a large part of missing data and morespecifically to build joined mapsusing data from several backcross populations. The problemof building joined mapsis specifically addressed by JOINMAP (Stam 1993) and also, to some extent, by GMendel.CAR~AGENEis more limited in its 262 ISMB-97 scope than JOINMAP which can work on several data sets of different nature, but it uses a morepowerfulalgorithmic machinery to provide maximum likelihood joined maps. Mergingtwo data sets is simply done by building a new data set that involves all the markersand individuals that appearedin the two original data sets. Whenan allelic informationfor a given markeris not available in one of the original data set, it is simplyconsideredas missingin the new one. This yields data sets where missing are not randomlydistributed on all individuals and markersbut where large blocks of missing are introduced. Whenall crosses do not involve exactly the sameset of markers(whichis the usual case), sometwo-point estimations are simply impossible and it becomesessential to rely on multipoint likelihoods. Because of the large amountof missing in such joined data sets, simpleheuristic proceduresusually fail to find an optimal mapon such data, and this, whatever criteria they rely on. The local search techniques used in CAR~AGENE usually allow to find maximum likelihood maps even in these difficult conditions. Here, the ability to also produce a set of mapsin the vicinity of the optimum is highly valuable since it allowsone to qualitatively assess howstrongly the best order is supportedby the available data. Experiments CAR~AGENE and JOINMAP (rel. 1.4) have been applied both on simulated and real backcross-like data involving multiple populations. CAR~AGENE has been used to build individual (intra-population) and joined mapsand is compared with JOINMAP on this last problem. Simulated data description Several individual data sets to be joined must be generated for one commonmap. In our experiments, we considered the problem of joining two data sets. Let Ni the number of markersin each individual (also called intra-population) data set. Let Nc the number of markers commonto the two individual data sets. For given values of Ni c, and N we first build a map with N = 2 x Ni - IVc markers. The mapis build by evenly distributing the N markers on a 200 cMchromosome.This map is called the "original map"and the order of the N markers in this mapis called the "original order". For each pair of adjacent markers in this original map, the recombinationprobability between the two markersis computedusing the inverse of Haldane’s mappingfunction (Ott 1991). The markersthat are informative in each individual map are chosenas follows: first a set of Nc markersis selected randomly from the N markers. The set of N - N~ remaining markersis randomlysplit in twosets with the samecardinality. Thesetwo sets, each mergedwith the set of the N* commonmarkers define the set of the markers which are informativein each individual data set. Usingthe mapbuilt, a randomindividual data set for K individuals and for the subset of the N~ informative markers in the data set can be generatedas follows: 1. for each individual, we first randomlychoosethe allele for the first markerin the set {0, 1} with a uniformprobability 0.5. Then for each successive marker, the allele is flipped with a probability equal to the recombination probability. This process is repeated K times to obtain a data set that represents allelic information for Kindividuals; 2. to incorporate "missing data", each of the allele generated in this wayis, with probability p, deleted and replaced by a missing measure; 3. then, all the information on allele of markerswhichare not in the subset of informativemarkersfor the individual mapis deleted. 4. Finally, the order of the markers in each of these data sets is scrambled using a randompermutation to avoid any bias. Each individual data set is built using this process. The numberof markers in each individual mapwas either 10 or 15. The number of commonmarkers Nc was either 1, 2, 5 or 10. Globally, the global numberof markers N in the initial original map varies from 10 (Ni = 10, Nc = 10) to 29 (N~ = 15, Nc = 1). The number of individuals in each individual data set is either 25, 50, 100 or 250. The probability p for an allele to be missing varies from 0 to 20%by 5%step. One hundred joining problems are solved for each combination of these parameters. Rememberthat these data sets represent an ideal case with no error. TS runs with the followingstopping criteria: it will stop whenthe numberof iterations which have not improvedthe likelihood of the best solution is equal to twice the number of markersin the data set. This numberof iterations corresponds to a choice of efficiency, neededbecause of the numberof tests. Ona PentiumPro 200, the "IS based algorithmrun in less than 0.08" for intra-population data sets with 10 markersand 25 individuals, to roughly 3’ for joined data sets with 29 markersand 250 individuals. GAruns with the followingstopping criteria: it will stop as soon as two generations of individuals producethe same best individual (the same order) or the maximum likelihood has not been improved(two different individuals mayhave the same maximumlikelihood). At each generation, the numberof individuals is ~-. On the same machine, the AG based algorithm run in less than 0.02" for intra-population data sets with 10 markers and 25 individuals, to roughly 15’ for joined data sets with 29 markersand 250 individuals. Both approachesgaveresults of similar quality and the measuresare given only in the case of TS. Results on intra-population data sets Figure 2 is dedicated to intra-populations data-sets with 10 markers. The curves give the percentage of original orders found by CAR~AGENE (by original orders we mean same order as the order used to generatethe data-sets). Fournumber of individuals are reported: 25, 50, 100and 250. It appears that 250 or even 100 individuals seemsto be largely sufficient to find the goodorder with a reasonnableprobability, evenwhenmissingdata is present. 1: ~t O..................... , ..................... 0.9 ~.. ~ ....... ............. ¥ ............. "~"............. T250 :~I~7,’" 0.8 ~0.7............................... 0.6 4-................................4-..... ................ . ............................. TS0 ~0.5 0.3 e ~ 0.2 0"1100 90 infea’mative 85 80 Figure 2: Percentageof original orders found (intra-pop). Since CAR~AGENE maintains a set of the S = 31 best maps found, we measured the number of maps found in this set such that their loglikelihoodlies within -3.0 of the likelihood of the best mapfound (Figure 3). The results showthat a high percentageof success is "correlated" with a small numberof maps found within -3.0 and therefore the numberof mapsfound in the vicinity of the optimum gives an idea of howstrongly the order found is supported by the data. / f TSO " ......... 4-................................ ~...................... ""’4-...... ............ T100 ..................... ~ ...................... ~- ..... -.--:===...::::-.-.-~f:::--::::".’:::~.~1-0 100 95 90 85 80 %ageinformative Figure 3: Meancardinality of the set of mapswith a loglike ratio above-3.0 wx.t. to best map(intra-pop). Schiex 263 Wedo not report a last set of curves: the percentage of mapsfound with a loglikelihood equal to or better than the loglikelihood of the original map.This gives an idea of the effectiveness of the optimization algorithm. It was always equal to 100%. Results on joined data sets Figures 4 and 5 are dedicated to joined data-sets using two sets of 10 markers with 5 markers in common.The curves have the same meaning as in the previous case. These curves showthat, even in the relatively nice case of 5 markers in common,building a mergedmapis more difficult. This is quite natural given the large numberof missing in mergeddata sets and the increased numberof markers. 1 ................ ’I"250 ~ ........... :~ .............. ............. ~. 0.9 I0.8 0.7 ..... +’i~" .........~-....................~.... Comparison with JOINMAPon joined data sets Wenowcompare the results obtained with JOINMAP (dotted curves) with the results obtained with CAR~AGENE (filled curves). Thejoined data sets were obtained using sets of 10 markersand a numberof individuals of 250. The numberof common markers considered are 1, 2, 5 and 10. The figure 6 represents the percentage of original orders found. CAR~AGENEgives better results than JOINMAP, which could be explained both by the criteria used and by the powerof the local search algorithm of CAR~AGENE. In the case of 10 markersin common,the joined data set has also 10 markers and no extra missing introduced because of the joining process: both systemsworkperfectly in this case (we have 500 individuals and only 10 markers). .... "’’"’’’’"’"E. ........ CI0 Both ~ 0.6 ! "’"’""’.,,- 0.4 95 9O %ageinformative 85 80 r ~ C2can~eaz ..................... ..... 0.l This is again reflected in the meannumberof maps within -3.0 whichappears pragmaticallyas an excellent indicator of howstrongly the mapfound is supportedby the data. $ oR. 25 " ..... ....’ T 95 T 90 %agehffuiiim~dv¢ ~ * 1 85 "’’~ 80 CSC~=_eae&ClOBoth CI Cartagene T25 C2 Cartagene 0.9 r. ’150 "~ t Figure 6: Percentageof original orders found. 1 o C2Joi.nMap _ C1Ca.,’tagem 0 100 -- .~ .... 0.3 0.2 Figure 4: Percentage of original orders found (joined maps). 30 C5 JoinMap 0.5 $ 0 101) ~~ ~""~-~ ............................... 0.7 ~ -~ ........... 0.6 0.2 T25 ~[ 0.81..........................+............................. ~ 0.3 ~ C.SCartasem 0l~ t̄jf...........° 0.5 .......:1"50 ""-.......... ....... .~ 0.4 "-.¢-............................-4-.... 0.1 Wedo not report the percentage of mapfound with a loglikelihood better than the loglikelihood of the original mapwhich was again systematically equal to 100%. ..4--................"........ 0.8 "’"- 0.7 .......... "*........... C5 JoinMap ¢" .................... -o" ...... 0.6 °’""°""-15 ....’÷.............................."~...... D.5 .... .... "" 10 ....... Ti0( .............. [ -D..... 5 .................... o ..................... e ..................... T250- 9’ 5 t00 ’ 90 %ageinformative ’ g5 gO Figure 5: Meancardinality of the set of mapswith a loglike ratio above-3.0 w.r.t, to best map(joined maps). 264 ISMB-97 ........ a... --......ClJoln.Map ............ "~-,,, i’--.... ......... ~- ...... °°. "-,,, 0.4 C2Joi~tap ............. 0.3 0"200 -.L.., ..... "’,,,.¯ .... ....... 9’ 5 90’ %ageinfonmtive 85’ 80 Figure 7: Percentageof mapwith a loglike better than the original maploglike. In figure 7 the curves report the percentage of orders found whoseloglikelihood was larger or equal to the loglikelihood of the original order used for generating the data (the maps produced by JOINMAP were optimized using EM for the comparison). The comparisonis naturally advantageous for CAR~AGENE which precisely optimizes the loglikelihood. Note that here, a mapwith a likelihood better than the original mapis not systematically found whenfew markers are commonmarkers (down to 97%in the worst case). Real data The real data consists of backcross-like data obtained from segregation of RAPDmarkers in three F2 populations of haploid male progenies of several F1 female of the Trichogrammabrassicae genome.Full results are described in (Laurent et al. 1997). Each intra-population involved around 90 individuals. Five groups were defined. Joined maps obtained for the group II using CAR~AGENE and JOINMAP are given at the end of the paper in Figure 8. One should note that the distances obtained using JOINMAP in (1) are not maximum likelihood distances and were therefore optimized using EM(2). Finally, the order obtained using JOINMAP is more than 14 time less likely than the map obtained using CAR~AGENE. For other groups, the difference in term of logl~elihood betweenorders obtained using CAR~AGENE and JOINMAP varied from 0 to 16, always in favor of CAR~AGENE. GAand TS gave the same map. Discussion CAR~AGENE efficiently builds multipoint maximum likelihoodmapsin the simple case of backcrossdata (this essentially dedicates CAR~AGENE tO plant and animal studies). Its ability to tackle data sets witha large numberof missing allows the construction of multipoint maximum likelihood joined mapswith an apparently better probability of success than the package JOINMAP (which can tackle more complexpedigree). The efficiency of local search also offers services which were previously offered by the exhaustive search compare commandof MAPMAKER. However, these features can be used for a numberof markers which currently exceeds the capacity of compare. Again, one should remember that MAPMAKER can handle CEPHand intercross pedigree which is yet impossible in CAR~AGENE. Aninteresting feature of CAR~AGENE lies in its ability to produce a set of mapsin the vicinity of the best map found. Our exper’anents show that the number of maps found with a loglikelihood within a constant of the best map likelihood is an indicator of howstrongly the order is supported by the data. This is an essential feature for building joined maps. CAR~AGENE shows the interest of exploiting the connection betweenthe salesman problemand the markers ordering problem. There is still a large numberof results on the salesman problemwhich could be exploited to further enhance the efficiency of CAR~AGENEand we intend to continue our work in this direction. CAR~AGENE alSO highlights the interest of using a theoretically groundedcriteria such as multipoint maximum likelihood. Anotherdirection that we shall consider is the extension of CAR~AGENE to more complex pedigree. This is already well advancedfor intercross data. Weintend to make the programavailable to the public as soon as this is finished. People whohave backcross-like data and whoneed CAR~AGENE rapidly can contact the authors by e-mail. References Applegate, D.; Bixby, R.; Chv~tal, V.; and Cook,W.1995. Finding cuts in the TSP(a preliminary report). Technical Report 95-05, DIMACS. Buetow, K., and Chakravarti, A. 1987. Multipoint gene mappingusing sedation. Am. J. Hum.Genet. 41:180--201. Cormen,T. H.; Leiserson, C. E.; and Rivest, R. L. 1990. Introduction to algorithms. MITPress. ISBN: 0-26203141-8. Dempster, A.; Laird, N.; and Rubin, D. 1977. Maximum likelihood from incompletedata via the EMalgorithm. J. R. Statist. Soc. Set. 39:1-38. Doerge, R. 1996. Constructing genetic maps by rapid chain delineation. Journalof QuantitativeTrait Loci 2. Echt, C.; Knapp, S.; and Liu, B.-H. 1992. Genomemapping with non-inbred crosses using GMendel2.0. Maize Genet. Coop. Newslett. 66:27-29. Glover, F. 1989. Tabu search - part I. ORSAJournalon Computing1 (3): 190-206. Glover, F. 1990. Tabusearch - part II. ORSAJournal on Computing2(1):4-31. Goldberg, D. E. 1989. Genetic algorithms in Search, Optimization and MachineLearning. Addison-WesleyPubishing Company. Johnson, D. 1990. Local optimization and the traveling salesman problem. In Prec. 17th Int. CoUoquium on Automata, Languagesand Programming, volume 443, 446461. Springer-Verlag Lectures Notes in ComputerScience. Lander, E. S., and Green, P. 1987. Construction of multilocus genetic linkage mapsin humans.Prec. Natl. Acad. Sci. USA84:2363-2367. Lander, E.; Green, P.; Abrahamson, J.; Barlow,A.; Daly, M. J.; Lincoln, S. E.; and Newburg, L. 1987. MAPSchiex 265 0.0 ~ E3a E3a 0.0 9.0" ",.,, 20.1 ~ ABI4a 24.9 .~ AI2 28.4 .,,,i-- MI2 .........tl., 25.1 ,~:-~" ABI4a ......."" "’"........ ......’"" .,..,,,. AB 14a ........... E3a 35.3 ~ J18b 46.0 ~ AI2 53.3i -...J18b.i~i:i~:........ ~11.9... .... 60.7 ~ N3b 65.3 ~ ]lSa 72.1 ~ N3c 88.3 ~ A2a 91.5 ~ A5b 71.41 84.9 85.0 |~ M12 AI2 J18b N3b Jl8a I00.I",,I.A4 107.4 ~ N3c 124.7 ~ 131.2 ~ 139.9 ~ A2a A5b N3b Jl8a I15.9~ R8b 129.7~ T6 L3b A4 149.8 ~ N7a 165.1 ~ 187.8 .,~..D 11 T6 :~=:’i~’b ...... ..*e-..... MS.:J:::............... L16a ........ 142.63 149.00 A2a 157.76 A4 210.3 A5b 183.03 ¯ .,,,11--R8b 192.15 195.5 ..... i9~8 202.~7 LogLik¢:-470.58 (1) N3c R8b .° :I 182.3 124.98 L16a L3b M5 T6 N7a 221.74-.,,n-- NTa 246.3 ~ DI! LogLike:-458.67 (2) 257.52 ~ DII LogLike:-457.5 (3) Figure 8: Mapsfor group II using (1) JoinMap, (2) JoinMap, distances optimized with EM,CAR~AGENE. 266 ISMB-97 MAKER: An interactive computer package for constructhag primary genetic linkage mapsof experimentaland natural populations. Genomics1:174--181. Lathrop,G.; Lalouel, J.; JuUer,C.; and Ott, J. 1985. Multilocus linkage analysis in humans:detection of linkage and estimation of recombination. Am.J. Hum.Genet. 37:482488. Laurent, V.; Vanlerberghe-Masutti,E; Wajnberg,E.; Mangin, B.; Schiex, T.; and Gaspin, C. 1997. Construction of a composite genetic mapof the parasitoid wasp Trichogrammabrassicae using RAPDinformations from three populations. Submitted. Lin, S., and Kernighan,B. W.1973. Aneffective heuristic algorithm for the traveling salesman problem. Operation Research 21:498-516. Liu, B. H. 1995. The gene ordering problem, an analog of the traveling salesman problem. In Plant Genome95. Ott, J. 1991. Analysis of humangenetic linkage. Baltimore, Maryland:John HopkinsUniversity Press, 2nd edition. Stam, E 1993. Constructing of integrated genetic linkage maps by means of a new computer package:JOINMAP. The Plant Journal 3(5):739-744. Taillard, E. 1991. Robust taboo search for the quadratic assignment problem. Parallel computing 17:443-455. Schiex 267 .......... .................................................................................................................................... ................ ...............................................................................................................................................................................................................