Parallel Genehunter: Implementation of a linkage analysis package for distributed memory architectures Michael Moran CMSC 838T Presentation May 9, 2003 Introduction Goals Link Genes to specific loci in the genome Decrease time and memory requirements through parallelization Motivation Locate genes for specific phenotypes Test for inherited diseases and risk factors Gene therapy CMSC 838T – Presentation Talk Overview Introduction Talk Overview Genetic Linkage Problem Previous Work Parallel Genehunter Evaluation Observations CMSC 838T – Presentation Genetic Linkage Problem Sexual Reproduction Offspring created by two haploid gametes Gametes are produced from diploid/polyploid cells during meiosis www.blc.arizona.edu/courses/181gh/rick/genetics1/ CMSC 838T – Presentation Genetic Linkage Problem Recombination occurs in two ways 1. Random segregation of chromatids 2 x 23 human chromosomes => 223 possible haploid combinations Genes on different chromosomes recombine with probability .5 www.gen.umn.edu/faculty_staff/hatch/1131/ CMSC 838T – Presentation Genetic Linkage Problem Recombination occurs in two ways 1. Random segregation of chromatids 2. Crossover between homologous pairs of chromosomes Genes on the same chromosome recombine with probability depending on their distance and location on the chromosome CMSC 838T – Presentation Genetic Linkage Problem Given This model of recombination Data for a particular pedigree (family) Phenotype information for each individual Genetic markers for each individual Recombination frequencies for each pair of markers Can we apply probabilistic methods to Reconstruct the inheritance patterns Link phenotypes to the markers CMSC 838T – Presentation Previous Work Fisher, Haldane, Smith, Morton (1935 - 1955) Methods to infer genetic maps using maximum likelihood estimators Elston, Stewart (1971) Genetic Linkage Algorithm Linear in pedigree size Exponential in number of markers Lander, Green (1987) Genetic Linkage Algorithm Linear in number of markers Exponential in pedigree size CMSC 838T – Presentation Previous Work Genehunter (2001) Implementation of Lander & Green Analyzes a pedigree containing n non-founders The inheritance of a gene by one non-founder can be summarized by two bits The entire pedigree’s inheritance pattern can be summarized by a 2n bits CMSC 838T – Presentation Previous Work 3 steps of Genehunter: Step 1 : For each marker, calculate the probability of each of the possible inheritance pattern. 0: grandfather’s chromatid 1: grandmother’s chromatid Pr([0,0]) = .5 Pr([0,1]) = .5 Pr([1,0]) = 0 Pr([1,1]) = 0 Store probabilities in a vector of size 22n CMSC 838T – Presentation Previous Work 3 steps of Genehunter: Step 2 : For each marker, calculate the conditional probably of each inheritance pattern conditional on all of the markers to the left, and to the right • For two markers’ inheritance vectors, each disagreeing bit requires a crossover event • The probability of transitioning between inheritance vectors i, j differing in d bits is M i, j d (1 ) 2nd CMSC 838T – Presentation Previous Work 3 steps of Genehunter: Step 2 : For each marker, calculate the conditional probably of each inheritance pattern conditional on all of the markers to the left, and to the right • Mi,j = cost of transitioning between inheritance vectors i&j • P1 , P2 = probability vectors for every inheritance pattern given markers 1 and 2 respectively • P2|1 = P2 • (M P1) • Calculate the probabilities of each marker’s inheritance conditional on all others by Markov Chain or FFT convolution CMSC 838T – Presentation Previous Work 3 steps of Genehunter: Step 3 : For each marker, calculate the probability of unknown gene being located at specific locations • • • • Hypothesizes phenotype has a gene located at a particular location. By default tries 5 evenly-spaced locations between consecutive pairs of markers Calculates PD, the probabilities of each inheritance pattern for based on this phenotype (as in step 1) For a location between markers i&i+1, p= PD • Px|1...i • Px|i+1...m Space Requirement: O(22n) O(22n-f) exploiting symmetry of f founders Time Requirement: O(m22n) O(m22n-f) with f founders CMSC 838T – Presentation Parallel Genehunter Approach Parallelize the 3 Genehunter steps separately Divides each 22n-sized marker vector evenly among the P processors allows greater distribution of memory than assigning O(m/P) entire vectors to each processor CMSC 838T – Presentation Parallel Genehunter Parallelization of step 1 For each marker, calculate the probability of each of the possible inheritance pattern Each processor calculates the probabilities for a particular 22n / P inheritance patterns for ever marker CMSC 838T – Presentation Parallel Genehunter Parallelization of step 2 For each marker, calculate the conditional probably of each inheritance pattern conditional on all of the markers to the left, and to the right FFT convolution As in serial genehunter, 22n x 22n matrix-vector multiplication is replaced FFT-based convolution: 1. 2. 3. 2 forward 1D FFTs on 22n-length vectors element-by-element multiplication inverse FFT Each 1D FFT is equivalent to a 2D FFT on a P x 22n / P matrix There are well-known distributed algorithms for this FFT using all-to-all communication. Dot Product in P2|1 = P2 • (M P1) trivially parallelized: each processor has the same portion of each vector. CMSC 838T – Presentation Parallel Genehunter Parallelization of step 3 For each marker, calculate the probability of unknown gene being located at specific locations computing Px|1...i and Px|i+1...m FFTs parallelized as in step 2 Final dot product p = (PD • Px|1...i • Px|i+1...m) parallelized as in step 2 each processor holds all the same portion of each vector CMSC 838T – Presentation Evaluation Experimental Environment Input data sets 51 family member pedigree {19,21,24}-bit data sets (# bits = 2n-f ) Computing Facilities Cplant Cluster (Sandia National Laboratories) DEC Alpha EV6 processors Myrinet connection CMSC 838T – Presentation Evaluation Runtimes For 19,21 and 24 bit problems CMSC 838T – Presentation Evaluation Runtimes For 19,21 and 24 bit problems CMSC 838T – Presentation Observations Pro: Performs Genehunter computation exactly Pro: Effective for “multipoint linkage” of phenotypes Con: Old-fashioned compared to protein-based methods (?) Pro: Distributes memory requirements Pro: More computers allows larger feasible inputs Con: Experiments based on 1 pedigree Pro: Efficient parallelization up to 32 or 64 processors Con: Only allows pedigrees to grow by only 3 or 4 individuals in equal time CMSC 838T – Presentation References Genetic Recombination Dr. Craig Woodworth, Genetic Recombination in Eukaryotes, Lecture Notes, (www.clarkson.edu/class/by214/powerpoint) Genehunter K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis Through Reduction of Inheritance Space. American Journal of Human Genetics 68, 2001. Parallel Genehunter G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory Architectures, Proceedings of the First IEEE Workshop on High Performance Computational Biology, International Parallel and Distributed Computing Symposium, 2002. CMSC 838T – Presentation Questions? CMSC 838T – Presentation