ECDF and the development of algorithms and tools for mapping epistatic QTL
Wen-Hua Wei
The Roslin Institute

GridQTL project
• Epistatic QTL mapping in outbred crosses
• Combined linkage and linkage disequilibrium mapping (LDLA)
• Expression QTL (eQTL)
• Genome-wide SNP association scans

[Figure: types of computing]
Distributed Computing
• Loosely coupled
• Heterogeneous
• Single Administration
Cluster
• Tightly coupled
• Homogeneous
• Cooperative working
Grid Computing
• Large scale
• Cross-organizational
• Geographical distribution
• Distributed Management
Utility Computing
• Computing “services”
• No knowledge of provider
• Enabled by grid technology
Source: Hiro Kishimoto GGF17 Keynote May 2006

Problem description
[Figure: scan profile; y-axis 0 to 70, x-axis Location (cM) 0 to 120]
One dimensional scan (1D)
y = μ + xβ + Q + e
N (3k) tests a scan
Random tests (10k): 30m
Two dimensional scan (2D)
y = μ + xβ + Q1 + Q2 + Q1Q2 + e
N*(N+1)/2 (4.5m) tests a scan
Random tests (10k): 45,000m

Major challenges
• Statistical
– Multiple testing issues
– Various statistical models and methods
• Algorithm
– Ways of searching
– How to make it automatic
• Computing
– A few hours per study?
– Many simulation replicates to validate?

Models and multiple tests
A: y = μ + Xβ + e
B: y = μ + Xβ + Li + e
C: y = μ + Xβ + Li + Lj + e
D: y = μ + Xβ + Li + Lj + Li * Lj + e
1. Overall Test >> D vs. A or D vs. B
2. Epistasis Test >> D vs. C

Scale
• 2 CPU hrs per job
• 550 jobs per scenario → 1.1k hrs
• 22 scenarios per algorithm → 22k hrs
• 4 algorithms per simulation → 88k hrs
⇒ more than 10 CPU yrs!
Far more than that:
– Genome-wide association with epistasis
– Real application and real data analyses

Go ECDF/Eddie!
• Ample computing resources
• Perfect for distributed computing
• Supports multiple languages and program integration
• Increasing facilities, e.g. NAG, parallel computing

Script to run one job

    #!/bin/sh
    # Run one job inside the scratch directory provided in $TMPDIR.
    sec=/exports/work/roslin/simulate/sec
    dir="${TMPDIR:?TMPDIR is not set}"
    mkdir -p "$dir"
    cd "$dir" || exit 1
    # Copy the model parameters and the random-number seed into the scratch directory.
    cp "$sec/modelInit.par" modelInit.par
    cp "$sec/rng_seed.dat" rng_seed.dat
    # Run the simulation program, then copy the seed file back to the shared directory.
    "$sec/f2sim.out" < "$sec/input106"
    cp rng_seed.dat "$sec/rng_seed.dat"
    # Run the remaining programs and append the results to the shared output file.
    "$sec/coeff.out"
    "$sec/gateway.out" null106
    "$sec/epiRsim.out" null106_01.job >> "$sec/output106.txt"
    # Remove the scratch directory.
    rm -fr "$dir"

Script to submit multiple jobs

    #!/bin/sh
    # Submit the four job scripts in each round, pausing 200 s after every submission.
    pos=100
    while [ "$pos" -lt 660 ]
    do
        for n in 109 108 107 106
        do
            # 3-hour run-time limit, run from the current directory, one stderr file per round.
            qsub -l h_rt=03:00:00 -cwd -e "stderror.$pos.txt" "subdir$n.sh"
            sleep 200
        done
        pos=$((pos + 1))
    done

Work completed
• Simulation study
– Identify ‘ideal’ algorithm
– Paper to be submitted
• Real application
– More scenarios tested and close to release
– Used for real data analyses
• Epistasis in genome-wide association
– Algorithm applied in GWA using R
– Paper in press

Acknowledgement
• People
– Chris Haley, DJ de Koning, Sara Knott
– Jean-Alain Grunchec, John Allen, Dave Berry, Andy Law
– Alex Lam, Joseph Powell
– Kajsa Ljungberg, Orjan Carlborg
• Funding: BBSRC, SABRE
• Computing: ECDF, Roslin Institute
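
Worked check of the test counts
The counts on the Problem description slide can be reproduced with a few lines of shell arithmetic. This is only an illustrative sketch: the function names scan_1d and scan_2d are hypothetical, N = 3000 and 10000 random tests are taken from the slide's rounded "3k" and "10k", and it assumes each random test repeats a full scan and that the shell's arithmetic is 64-bit.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original scripts.

# Tests in a one-dimensional scan: one test per position.
scan_1d() {
    echo "$1"
}

# Tests in a two-dimensional scan: N*(N+1)/2, the formula given on the slide.
scan_2d() {
    echo $(( $1 * ($1 + 1) / 2 ))
}

n=3000        # positions per scan ("3k" on the slide)
reps=10000    # random tests ("10k" on the slide)

one=$(scan_1d "$n")
two=$(scan_2d "$n")

# Total tests when every random test repeats the whole scan (assumption).
echo "1D: $one tests a scan, $(( one * reps )) in total"
echo "2D: $two tests a scan, $(( two * reps )) in total"
```

Under these assumptions the 1D total is 30,000,000 (the slide's 30m) and the 2D total is 45,015,000,000 (the slide's 45,000m), a 1,500-fold increase in the number of tests.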