Parallel Genehunter : Implementation of a linkage analysis package for distributed memory architectures

advertisement
Parallel Genehunter: Implementation of a
linkage analysis package for distributed
memory architectures
Michael Moran
CMSC 838T Presentation
May 9, 2003
Introduction


Goals

Link Genes to specific loci in the genome

Decrease time and memory requirements through
parallelization
Motivation

Locate genes for specific phenotypes

Test for inherited diseases and risk factors

Gene therapy
CMSC 838T – Presentation
Talk Overview

Introduction

Talk Overview

Genetic Linkage Problem

Previous Work

Parallel Genehunter

Evaluation

Observations
CMSC 838T – Presentation
Genetic Linkage Problem

Sexual Reproduction

Offspring created by two haploid gametes

Gametes are produced from diploid/polyploid cells during
meiosis
www.blc.arizona.edu/courses/181gh/rick/genetics1/
CMSC 838T – Presentation
Genetic Linkage Problem

Recombination occurs in two ways
1.
Random segregation of chromatids
2 x 23 human chromosomes
=>
223 possible haploid combinations
Genes on different chromosomes
recombine with probability
  .5

www.gen.umn.edu/faculty_staff/hatch/1131/
CMSC 838T – Presentation
Genetic Linkage Problem

Recombination occurs in two ways
1.
Random segregation of chromatids
2.
Crossover between homologous
pairs of chromosomes
Genes on the same chromosome
recombine with probability

depending on their distance and
location on the chromosome

CMSC 838T – Presentation
Genetic Linkage Problem
Given

This model of recombination

Data for a particular pedigree (family)

Phenotype information for each individual

Genetic markers for each individual
Recombination frequencies for each pair of markers

Can we apply probabilistic methods to

Reconstruct the inheritance patterns

Link phenotypes to the markers
CMSC 838T – Presentation
Previous Work

Fisher, Haldane, Smith, Morton (1935 - 1955)
Methods to infer genetic maps using maximum likelihood
estimators

Elston, Stewart (1971)
Genetic Linkage Algorithm



Linear in pedigree size
Exponential in number of markers
Lander, Green (1987)
Genetic Linkage Algorithm


Linear in number of markers
Exponential in pedigree size
CMSC 838T – Presentation
Previous Work

Genehunter (2001)

Implementation of Lander & Green

Analyzes a pedigree containing n non-founders

The inheritance of a gene by one
non-founder can be summarized
by two bits

The entire pedigree’s inheritance
pattern can be summarized by a
2n bits
CMSC 838T – Presentation
Previous Work

3 steps of Genehunter:
Step 1 : For each marker, calculate the probability of each
of the possible inheritance pattern.
0: grandfather’s chromatid
1: grandmother’s chromatid
Pr([0,0]) = .5
Pr([0,1]) = .5
Pr([1,0]) = 0
Pr([1,1]) = 0
Store probabilities in a vector of size 22n
CMSC 838T – Presentation
Previous Work

3 steps of Genehunter:
Step 2 : For each marker, calculate the conditional probably of
each inheritance pattern conditional on all of the markers to
the left, and to the right
•
For two markers’ inheritance vectors, each disagreeing
bit requires a crossover event
•
The probability of transitioning between inheritance
vectors i, j differing in d bits is
M i, j   d  (1  ) 2nd

CMSC 838T – Presentation
Previous Work

3 steps of Genehunter:
Step 2 : For each marker, calculate the conditional probably of
each inheritance pattern conditional on all of the markers to
the left, and to the right
•
Mi,j = cost of transitioning between inheritance vectors i&j
•
P1 , P2 = probability vectors for every inheritance pattern
given markers 1 and 2 respectively
•
P2|1 = P2 • (M P1)
•
Calculate the probabilities of each marker’s inheritance
conditional on all others by Markov Chain or FFT
convolution
CMSC 838T – Presentation
Previous Work

3 steps of Genehunter:
Step 3 : For each marker, calculate the probability of unknown
gene being located at specific locations
•
•
•
•

Hypothesizes phenotype has a gene located at a particular
location.
By default tries 5 evenly-spaced locations between consecutive
pairs of markers
Calculates PD, the probabilities of each inheritance pattern for
based on this phenotype (as in step 1)
For a location between markers i&i+1, p= PD • Px|1...i • Px|i+1...m
Space Requirement:
O(22n)

O(22n-f) exploiting symmetry of f founders
Time Requirement:
O(m22n)
O(m22n-f) with f founders
CMSC 838T – Presentation
Parallel Genehunter

Approach

Parallelize the 3 Genehunter steps separately

Divides each 22n-sized marker vector evenly among the P
processors

allows greater distribution of memory than assigning
O(m/P) entire vectors to each processor
CMSC 838T – Presentation
Parallel Genehunter

Parallelization of step 1
For each marker, calculate the probability of each of the
possible inheritance pattern
Each processor calculates the probabilities for a particular
22n / P inheritance patterns for ever marker
CMSC 838T – Presentation
Parallel Genehunter

Parallelization of step 2

For each marker, calculate the conditional probably of each
inheritance pattern conditional on all of the markers to the left, and to
the right
FFT convolution

As in serial genehunter, 22n x 22n matrix-vector multiplication
is replaced FFT-based convolution:
1.
2.
3.



2 forward 1D FFTs on 22n-length vectors
element-by-element multiplication
inverse FFT
Each 1D FFT is equivalent to a 2D FFT on a
P x 22n / P matrix
There are well-known distributed algorithms for this FFT using
all-to-all communication.
Dot Product in P2|1 = P2 • (M P1)

trivially parallelized: each processor has the same
portion of each vector.
CMSC 838T – Presentation
Parallel Genehunter

Parallelization of step 3
For each marker, calculate the probability of unknown
gene being located at specific locations

computing Px|1...i and Px|i+1...m


FFTs parallelized as in step 2
Final dot product p = (PD • Px|1...i • Px|i+1...m)

parallelized as in step 2

each processor holds all the same portion of each vector
CMSC 838T – Presentation
Evaluation

Experimental Environment

Input data sets



51 family member pedigree
{19,21,24}-bit data sets (# bits = 2n-f )
Computing Facilities

Cplant Cluster (Sandia National Laboratories)


DEC Alpha EV6 processors
Myrinet connection
CMSC 838T – Presentation
Evaluation

Runtimes For 19,21 and 24 bit problems
CMSC 838T – Presentation
Evaluation

Runtimes For 19,21 and 24 bit problems
CMSC 838T – Presentation
Observations
Pro: Performs Genehunter computation exactly
Pro: Effective for “multipoint linkage” of phenotypes
Con: Old-fashioned compared to protein-based methods (?)
Pro: Distributes memory requirements
Pro: More computers allows larger feasible inputs
Con: Experiments based on 1 pedigree
Pro: Efficient parallelization up to 32 or 64 processors
Con: Only allows pedigrees to grow by only 3 or 4 individuals
in equal time
CMSC 838T – Presentation
References

Genetic Recombination
Dr. Craig Woodworth, Genetic Recombination in Eukaryotes, Lecture Notes,
(www.clarkson.edu/class/by214/powerpoint)

Genehunter
K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis
Through Reduction of Inheritance Space. American Journal of Human Genetics
68, 2001.

Parallel Genehunter
G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel
Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory
Architectures, Proceedings of the First IEEE Workshop on High Performance
Computational Biology, International Parallel and Distributed Computing
Symposium, 2002.
CMSC 838T – Presentation
Questions?
CMSC 838T – Presentation
Download