SPRINT
A Simple Parallel R INTerface
Savvas Petrou
spetrou@epcc.ed.ac.uk
EPCC, The University of Edinburgh

Overview
• What is SPRINT
• How is SPRINT different from other parallel R packages
• Biological example: Post-genomic data analysis
• Code comparison

March 2010 SPRINT

SPRINT
Simple Parallel R INTerface (www.r-sprint.org)
“SPRINT: A new parallel framework for R”, J Hill et al, BMC Bioinformatics, Dec 2008.

Issues of existing parallel R packages
• Difficult to program
• Require the scientist to also be a parallel programmer!
• Require substantial changes to existing scripts
• Can’t be used to solve some problems
• No data dependencies allowed

Biological example
• Data: A matrix of expression measurements with genes in rows and samples in columns

Biological example
• Problem: Using all or many genes will either crash or be very slow (R memory allocation limits, number of computations)

Data limitations (correlations)
Input array dimensions    Input array size    Final array size in memory
11,000 x 320              26.85 MB            923.15 MB (0.9 GB)
22,000 x 320              53.7 MB             3,692.62 MB (3.6 GB)
35,000 x 320              85.44 MB            9,346 MB (9.12 GB)
45,000 x 320              109.86 MB           15,449.52 MB (15.08 GB)

Work load limitations (permutations)
Input array dimensions    Permutation count    Estimated total run time
36,612 x 76               500,000              20,750 seconds (6 hours)
36,612 x 76               1,000,000            41,500 seconds (12 hours)
73,224 x 76               500,000              35,000 seconds (10 hours)
73,224 x 76               1,000,000            70,000 seconds (20 hours)

Workarounds and solution
• Workaround:
  – Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and can remove relevant data.
  – Perform multiple executions and post-process the data. This can become a very painful procedure.
• Solution: Parallelisation of R code can be made accessible to bioinformaticians/statisticians. A library of solutions coded once by experts, then easy end-point use by all.
[Diagram labels: Big Post Genomic Data, HPC]
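
The memory figures in the data limitations table can be reproduced with a little arithmetic. The sketch below assumes that values are stored as 8-byte doubles (R's `numeric` type), that 1 MB is 1024^2 bytes, and that the final array is the full genes x genes correlation matrix. The helper `matrix_mb` and the data frame `estimate` are illustrative names, not part of SPRINT.

```r
# Size in MB of a dense matrix of doubles (assumption: 8 bytes per value,
# 1 MB = 1024^2 bytes).
matrix_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^2
}

genes   <- c(11000, 22000, 35000, 45000)  # gene counts from the table
samples <- 320                            # samples per gene

estimate <- data.frame(
  genes     = genes,
  input_mb  = matrix_mb(genes, samples),  # genes x samples input array
  output_mb = matrix_mb(genes, genes)     # genes x genes correlation matrix
)
estimate$output_gb <- estimate$output_mb / 1024

print(round(estimate, 2))
```

Under these assumptions the results agree with the table to within rounding. The output grows with the square of the gene count, which is why the final array, not the input, hits R's memory allocation limits.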
[Diagram labels: R, SPRINT, Biological Results]

Benchmarks (256 processes)

Data limitations (correlations)
Input array dimensions    Input array size    Final array size in memory    Total run time in serial (seconds)                 Total run time in parallel (seconds)
11,000 x 320              26.85 MB            923.15 MB (0.9 GB)            63.18                                              4.76
22,000 x 320              53.7 MB             3,692.62 MB (3.6 GB)          “Error: cannot allocate vector of size 3.6 Gb”     13.87
35,000 x 320              85.44 MB            9,346 MB (9.12 GB)            CRASHED                                            36.64
45,000 x 320              109.86 MB           15,449.52 MB (15.08 GB)       CRASHED                                            42.18

Work load limitations (permutations)
Input array dimensions    Permutation count    Estimated total run time in serial    Total run time in parallel (seconds)
36,612 x 76               500,000              20,750 seconds (6 hours)              73.18
36,612 x 76               1,000,000            41,500 seconds (12 hours)             146.64
73,224 x 76               500,000              35,000 seconds (10 hours)             148.46
73,224 x 76               1,000,000            70,000 seconds (20 hours)             294.61

Correlation code comparison

Serial R:
    edata <- read.table("largedata.dat")              # load the expression data
    pearsonpairwise <- cor(edata)                     # Pearson correlation (the default method)
    write.table(pearsonpairwise, "Correlations.txt")  # save the correlation matrix
    quit(save="no")

With SPRINT:
    library("sprint")
    edata <- read.table("largedata.dat")              # load the expression data
    ff_handle <- pcor(edata)                          # parallel counterpart of cor()
    pterminate()                                      # shut down SPRINT before quitting
    quit(save="no")

Permutation testing code comparison

Serial R:
    library("multtest")                               # provides mt.maxT and the golub data set
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")

With SPRINT:
    library("multtest")                               # provides the golub data set
    library("sprint")
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()                                      # shut down SPRINT before quitting
    quit(save="no")

SPRINT
• Website: http://www.r-sprint.org/
• Source code can be downloaded from the website
• Soon also in the CRAN repository
• Mailing list: sprint@lists.ed.ac.uk
• Contact email: sprint@ed.ac.uk

Acknowledgements
DPM Team and EPCC Team:
• Terry Sloan
• Savvas Petrou
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen
• Michal Piotrowski
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger

This work is supported by the Wellcome Trust and the NAG (Numerical Algorithms Group) dCSE Support service.
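
As a rough reading of the benchmark tables, the speedup on 256 processes can be computed as serial time divided by parallel time. The sketch below uses only the rows for which the slides give both figures. The serial permutation times are estimates in the slides, so the resulting speedups are estimates too. The names `bench`, `speedup` and `efficiency` are illustrative and not part of SPRINT.

```r
# Run times in seconds, copied from the benchmark tables (256 processes).
# Assumption: the serial permutation times are the slides' estimates.
bench <- data.frame(
  case = c("correlation, 11,000 x 320",
           "permutation, 36,612 x 76, 500,000",
           "permutation, 36,612 x 76, 1,000,000",
           "permutation, 73,224 x 76, 500,000",
           "permutation, 73,224 x 76, 1,000,000"),
  serial   = c(63.18, 20750, 41500, 35000, 70000),
  parallel = c(4.76, 73.18, 146.64, 148.46, 294.61)
)

processes <- 256
bench$speedup    <- bench$serial / bench$parallel  # how many times faster
bench$efficiency <- bench$speedup / processes      # speedup per process

print(bench[, c("case", "speedup", "efficiency")], digits = 3)
```

An efficiency above 1 for a permutation case would reflect the estimated serial times rather than a measured superlinear speedup.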