ECDF and the development of algorithms and tools for mapping epistatic QTL

Wen-Hua Wei
The Roslin Institute
GridQTL project
• Epistatic QTL mapping in outbred crosses
• Combined linkage and linkage disequilibrium mapping (LDLA)
• Expression QTL (eQTL)
• Genome-wide SNP association scans
Distributed Computing
• Loosely coupled
• Heterogeneous
• Single Administration
Cluster
• Tightly coupled
• Homogeneous
• Cooperative working
Grid Computing
• Large scale
• Cross-organizational
• Geographical distribution
• Distributed Management
Utility Computing
• Computing “services”
• No knowledge of provider
• Enabled by grid technology
Source: Hiro Kishimoto GGF17 Keynote May 2006
Problem description
[Figure: scan profile; x-axis: Location (cM), 0 to 120; y-axis: 0 to 70]
One dimensional scan (1D)
y = μ + xβ + Q + e
N (≈3k) tests per scan
Random tests (10k): 30m tests in total
Two dimensional scan (2D)
y = μ + xβ + Q1 + Q2 + Q1Q2 + e
N*(N+1)/2 (≈4.5m) tests per scan
Random tests (10k): 45,000m tests in total
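The test counts quoted for the 1D and 2D scans can be checked with a few lines of shell arithmetic. This is only a sketch: the variable names are illustrative, N = 3000 and 10,000 random tests are the rounded figures from the slide, and it assumes every random test repeats the whole scan.

```shell
#!/bin/sh
# Assumed inputs: rounded figures from the slide, not exact study values.
n=3000        # tests in one 1D scan (N)
reps=10000    # number of random tests

one_d=$n
two_d=$(( n * (n + 1) / 2 ))

# Totals if every random test repeats the whole scan
# (the 2D total needs 64-bit shell arithmetic).
one_d_total=$(( one_d * reps ))
two_d_total=$(( two_d * reps ))

echo "1D: $one_d tests per scan, $one_d_total in total"
echo "2D: $two_d tests per scan, $two_d_total in total"
```

The 2D total is 1,500 times the 1D total, which is why the two dimensional scan dominates the computing cost.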
Major challenges
• Statistical
– Multiple testing issues
– Various statistical models and methods
• Algorithm
– ways of searching
– how to make automatic
• Computing
– a few hours per study?
– many simulation replicates to validate?
Models and multiple tests
A: y = μ + Xβ + e
B: y = μ + Xβ + Li + e
C: y = μ + Xβ + Li + Lj + e
D: y = μ + Xβ + Li + Lj + Li * Lj + e
1. Overall Test
>> D vs. A or D vs. B
2. Epistasis Test
>> D vs. C
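Each comparison above is between a full model and a reduced model nested within it. The slides do not say which test statistic was used; one common choice for nested linear models is an F-test on the residual sums of squares (RSS). The sketch below assumes that choice. The helper name f_stat and all numbers are hypothetical.

```shell
#!/bin/sh
# f_stat RSS_REDUCED RSS_FULL DF_EXTRA DF_RESIDUAL
# F = ((RSS_reduced - RSS_full) / DF_EXTRA) / (RSS_full / DF_RESIDUAL)
# DF_EXTRA is the number of extra parameters in the full model;
# DF_RESIDUAL is the residual degrees of freedom of the full model.
f_stat() {
    awk -v r0="$1" -v r1="$2" -v d="$3" -v dfe="$4" \
        'BEGIN { printf "%.3f\n", ((r0 - r1) / d) / (r1 / dfe) }'
}

# Epistasis test (D vs. C) with made-up values, for illustration only.
epistasis_f=$(f_stat 120 100 4 200)
echo "D vs. C: F = $epistasis_f"
```

The same helper would serve for the overall test by passing the RSS of model A or B as the reduced model and the correspondingly larger number of extra parameters.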
Scale
• 2 CPU hrs per job
• 550 jobs per scenario → 1.1k hrs
• 22 scenarios per algorithm → ≈24k hrs
• 4 algorithms per simulation → ≈97k hrs
• ⇒ more than 10 CPU yrs!
• Far more than that:
– Genome-wide association with epistasis
– Real application and real data analyses
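The CPU budget can be reproduced step by step. The sketch below multiplies out the per-job, per-scenario and per-algorithm counts from the slide; the variable names are illustrative, the slide's own figures are rounded, and a CPU year is taken as 24 × 365 hours.

```shell
#!/bin/sh
hrs_per_job=2
jobs_per_scenario=550
scenarios=22
algorithms=4

scenario_hrs=$(( hrs_per_job * jobs_per_scenario ))   # hours per scenario
algorithm_hrs=$(( scenario_hrs * scenarios ))         # hours per algorithm
total_hrs=$(( algorithm_hrs * algorithms ))           # hours for the simulation
cpu_years=$(( total_hrs / (24 * 365) ))               # whole years (integer division)

echo "per scenario:  $scenario_hrs CPU hrs"
echo "per algorithm: $algorithm_hrs CPU hrs"
echo "total:         $total_hrs CPU hrs (about $cpu_years CPU yrs)"
```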
Go ECDF/Eddie!
• Ample computing resources
• Perfect for distributed computing
• Support multiple languages and program integration
• Increasing facilities, e.g. NAG, parallel computing
Script to run one job
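#!/bin/sh
# Work in the scratch directory named by TMPDIR; stop if it is not set.
dir="${TMPDIR:?TMPDIR is not set}"
mkdir -p "$dir"
cd "$dir" || exit 1
# Stage the model parameters and the shared RNG seed file.
cp /exports/work/roslin/simulate/sec/modelInit.par modelInit.par
cp /exports/work/roslin/simulate/sec/rng_seed.dat rng_seed.dat
# Run f2sim.out with input106 on stdin, then copy the seed file back to the shared directory.
/exports/work/roslin/simulate/sec/f2sim.out < /exports/work/roslin/simulate/sec/input106
cp rng_seed.dat /exports/work/roslin/simulate/sec/rng_seed.dat
# Run the remaining programs and append the final output to the shared results file.
/exports/work/roslin/simulate/sec/coeff.out
/exports/work/roslin/simulate/sec/gateway.out null106
/exports/work/roslin/simulate/sec/epiRsim.out null106_01.job >> /exports/work/roslin/simulate/sec/output106.txt
# Leave the scratch directory before removing it.
cd /
rm -fr "$dir"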
Script to submit multi-jobs
#!/bin/sh
# In each pass, submit the four job scripts in turn, pausing 200 s after each submission.
pos=100
while [ "$pos" -lt 660 ]
do
    for n in 109 108 107 106
    do
        # 3-hour run-time limit; run in the current directory; stderr to a per-pass file.
        qsub -l h_rt=03:00:00 -cwd -e "stderror.$pos.txt" "subdir$n.sh"
        sleep 200
    done
    pos=$((pos + 1))
done
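The loop bounds and the 200-second pause fix how many jobs are submitted and how long the submission script itself runs. The sketch below works this out from the values in the script above. The variable names are illustrative, and the time spent inside qsub is ignored.

```shell
#!/bin/sh
first=100           # starting value of pos
last=660            # the loop stops when pos reaches this value
scripts_per_pass=4  # subdir109.sh down to subdir106.sh
pause=200           # seconds slept after each qsub

passes=$(( last - first ))
submissions=$(( passes * scripts_per_pass ))
sleep_seconds=$(( submissions * pause ))
sleep_hours=$(( sleep_seconds / 3600 ))   # whole hours (integer division)

echo "$passes passes, $submissions submissions"
echo "about $sleep_hours hours spent pausing between submissions"
```

Pacing submissions this way spreads the jobs over several days of wall time, so the submission script has to be left running for the whole period.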
Work completed
• Simulation study
– Identify ‘ideal’ algorithm
– Paper to be submitted
• Real application
– more scenarios tested and close to release
– used for real data analyses
• Epistasis in genome-wide association
– Algorithm applied in GWA using R
– Paper in press
Acknowledgement
• People
– Chris Haley, DJ de Koning, Sara Knott
– Jean-Alain Grunchec, John Allen, Dave Berry, Andy Law
– Alex Lam, Joseph Powell
– Kajsa Ljungberg, Orjan Carlborg
• Funding: BBSRC, SABRE
• Computing: ECDF, Roslin Institute