SPRINT
A Simple Parallel R INTerface
Savvas Petrou
spetrou@epcc.ed.ac.uk
EPCC, The University of Edinburgh
Overview
• What is SPRINT
• How is SPRINT different from other parallel R packages
• Biological example: Post-genomic data analysis
• Code comparison
March 2010
SPRINT
2
SPRINT
Simple Parallel R INTerface
(www.r-sprint.org)
“SPRINT: A new parallel framework for R”, J Hill et al, BMC Bioinformatics, Dec 2008.
Issues of existing parallel R packages
• Difficult to program
• Require the scientist to also be a parallel programmer!
• Require substantial changes to existing scripts
• Can’t be used to solve some problems
• No data dependencies allowed
Biological example
• Data: A matrix of expression measurements with genes in rows and samples in columns
Biological example
• Problem
Using all or many genes either crashes R or runs very slowly (R memory allocation limits, number of computations).

Data limitations (correlations)
Input array dimensions and size    Final array size in memory
11,000 x 320 (26.85 MB)            923.15 MB (0.9 GB)
22,000 x 320 (53.7 MB)             3,692.62 MB (3.6 GB)
35,000 x 320 (85.44 MB)            9,346 MB (9.12 GB)
45,000 x 320 (109.86 MB)           15,449.52 MB (15.08 GB)

Work load limitations (permutations)
Input array dimensions    Permutation count    Estimated total run time
36,612 x 76               500,000              20,750 seconds (6 hours)
36,612 x 76               1,000,000            41,500 seconds (12 hours)
73,224 x 76               500,000              35,000 seconds (10 hours)
73,224 x 76               1,000,000            70,000 seconds (20 hours)
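The sizes in the correlation table are consistent with storing every value as an 8-byte double and counting 1 MB as 1024 x 1024 bytes. Neither assumption is stated on the slide, so the sketch below is only an illustration of the arithmetic: the input grows linearly with the number of genes, while the pairwise correlation result grows with its square. The helper name size_mb is hypothetical.

```r
# Assumptions (not stated on the slide): 8-byte doubles, 1 MB = 1024^2 bytes.
size_mb <- function(rows, cols, bytes_per_value = 8) {
  rows * cols * bytes_per_value / 1024^2
}

genes   <- c(11000, 22000, 35000, 45000)
samples <- 320

sizes <- data.frame(
  genes     = genes,
  input_mb  = size_mb(genes, samples),  # genes x samples input matrix
  result_mb = size_mb(genes, genes)     # genes x genes correlation matrix
)
print(round(sizes, 2))
```

Under these assumptions, doubling the gene count from 11,000 to 22,000 quadruples the result from about 0.9 GB to about 3.6 GB, in line with the table.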
Workarounds and solution
• Workaround:
– Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and can remove relevant data.
– Perform multiple executions and post-process the data. This can become a very painful procedure.
• Solution:
– Parallelisation of R code can be made accessible to bioinformaticians/statisticians.
– A library of solutions coded once by experts, then easy end-point use by all.
[Figure: diagram with the labels Big Post Genomic Data, HPC, R, SPRINT and Biological Results]
Benchmarks (256 processes)

Data limitations (correlations)
Input array dimensions and size    Final array size in memory    Total run time in serial (seconds)                Total run time in parallel (seconds)
11,000 x 320 (26.85 MB)            923.15 MB (0.9 GB)            63.18                                             4.76
22,000 x 320 (53.7 MB)             3,692.62 MB (3.6 GB)          “Error: cannot allocate vector of size 3.6 Gb”    13.87
35,000 x 320 (85.44 MB)            9,346 MB (9.12 GB)            CRASHED                                           36.64
45,000 x 320 (109.86 MB)           15,449.52 MB (15.08 GB)       CRASHED                                           42.18

Work load limitations (permutations)
Input array dimensions    Permutation count    Estimated total run time in serial    Total run time in parallel (seconds)
36,612 x 76               500,000              20,750 seconds (6 hours)              73.18
36,612 x 76               1,000,000            41,500 seconds (12 hours)             146.64
73,224 x 76               500,000              35,000 seconds (10 hours)             148.46
73,224 x 76               1,000,000            70,000 seconds (20 hours)             294.61
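To put the benchmark figures in perspective, the ratio of serial to parallel run time can be computed directly from the tables. The snippet below is a small sketch using only the numbers shown on the slide; the variable names are illustrative. Because the serial permutation times are estimates, the ratios are indicative rather than measured speedups.

```r
# Run times in seconds, copied from the benchmark tables (256 processes).
perm <- data.frame(
  permutations = c(500000, 1000000, 500000, 1000000),
  serial_est   = c(20750, 41500, 35000, 70000),   # estimated, not measured
  parallel     = c(73.18, 146.64, 148.46, 294.61)
)
perm$speedup <- perm$serial_est / perm$parallel

# Only the smallest correlation case completed in serial.
cor_speedup <- 63.18 / 4.76

print(round(perm, 1))
print(round(cor_speedup, 1))
```

The permutation ratios fall roughly between 235 and 284, and the one correlation case that ran in serial is about 13 times faster in parallel. For the larger correlation inputs no ratio exists, since the serial run did not complete.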
Correlation code comparison

# Serial R
edata <- read.table("largedata.dat")
pearsonpairwise <- cor(edata)
write.table(pearsonpairwise, "Correlations.txt")
quit(save="no")

# Parallel R with SPRINT: pcor() takes the place of cor()
library("sprint")
edata <- read.table("largedata.dat")
ff_handle <- pcor(edata)
pterminate()
quit(save="no")
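One way to see why a pairwise correlation can be divided among processes is that each block of rows of the result depends only on the input, not on any other block. The sketch below illustrates this with base R only, on a small random matrix. It is not SPRINT's implementation: the chunking scheme and all variable names are assumptions made for illustration, and the chunks are processed one after another with lapply() rather than on separate processes.

```r
set.seed(42)
x <- matrix(rnorm(40), nrow = 5, ncol = 8)  # toy data: 8 variables in columns

# Reference: the full 8 x 8 correlation matrix in one call.
full <- cor(x)

# Split the variables into 4 chunks of 2. Each chunk yields a 2 x 8 block
# of the result and could, in principle, be handled by a separate worker.
chunks <- split(seq_len(ncol(x)), rep(1:4, each = 2))
parts  <- lapply(chunks, function(idx) cor(x[, idx, drop = FALSE], x))

# Stack the blocks back into the full matrix.
blockwise <- do.call(rbind, unname(parts))

print(isTRUE(all.equal(full, blockwise, check.attributes = FALSE)))
```

Each block needs the whole input but only a slice of the output, which is also why the output, not the input, is the memory bottleneck in the tables.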
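Permutation testing code comparison

# Serial R (mt.maxT() and the golub data set come from the multtest package)
library("multtest")
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
quit(save="no")

# Parallel R with SPRINT: pmaxT() takes the place of mt.maxT()
library("sprint")
library("multtest")  # provides the golub data set
data(golub)
smallgd <- golub[1:100,]
classlabel <- golub.cl
resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
pterminate()
quit(save="no")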
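The idea behind a max-T permutation test can be sketched in a few lines of base R. The version below is a simplified single-step variant on simulated data, written for illustration; it is not the algorithm as implemented in mt.maxT() or pmaxT(), and the helper row_t, the simulated matrix and the permutation count are all assumptions. It does show why the work scales with the permutation count and why permutations, being independent of one another, can be shared among processes.

```r
set.seed(123)
n_genes   <- 20
n_samples <- 12
x      <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
labels <- rep(c(0, 1), each = n_samples / 2)
x[1, labels == 1] <- x[1, labels == 1] + 3  # make gene 1 differ between classes

# Two-sample t-statistic (Welch form) for every row of m.
row_t <- function(m, cl) {
  a <- m[, cl == 0, drop = FALSE]
  b <- m[, cl == 1, drop = FALSE]
  (rowMeans(b) - rowMeans(a)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
}

observed <- abs(row_t(x, labels))

# Each permutation shuffles the class labels and records the largest
# absolute statistic over all genes. Permutations do not depend on
# each other, so they could be divided among workers.
B <- 500
max_null <- replicate(B, max(abs(row_t(x, sample(labels)))))

# Single-step adjusted p-value: share of permutations whose maximum
# reaches or exceeds the observed statistic of the gene.
adj_p <- sapply(observed, function(obs) mean(max_null >= obs))
print(round(head(adj_p), 3))
```

The cost is one full pass over the data per permutation, so 1,000,000 permutations cost twice as much as 500,000, as the run-time tables reflect.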
SPRINT
• Website: http://www.r-sprint.org/
• Source code can be downloaded from website
• Soon also in the CRAN repository
• Mailing list: sprint@lists.ed.ac.uk
• Contact email: sprint@ed.ac.uk
Acknowledgements
DPM Team and EPCC Team:
• Terry Sloan
• Savvas Petrou
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen
• Michal Piotrowski
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger
This work is supported by the Wellcome Trust and the NAG (Numerical Algorithms Group) dCSE Support service.