R User Meeting Agenda
NeSC – Wednesday 20th January 2010

14:00  Welcome & R User Survey Results
       Muriel Mewissen, Division of Pathway Medicine
14:30  MAGIC, a consensus clustering package for R
       Dr Ian Simpson, Centre for Integrative Physiology
14:55  R and Eddie for Breast Cancer Bioinformatics
       Dr Duncan Sproul, Edinburgh Cancer Research Centre
15:15  Using large datasets in R with ff
       Michal Piotrowski, EPCC
15:35  Coffee & Tea
16:05  SPRINT & MPI IO
       Savvas Petrou, EPCC
16:25  RMPI
       Ms Xu Guo, EPCC
16:50  Wrap up
       Muriel Mewissen, Division of Pathway Medicine

2nd R User Meeting, NeSC, 20 Jan 2010

2nd R User Meeting
R User Survey Results
Muriel Mewissen – DPM
R User Meeting, NeSC - Wednesday 20 January 2010

Talk Outline
• 1st R User Meeting & SPRINT prototype
• R user requirements survey
• SPRINT beta release
• Future release functionality

1st R User Meeting
• Surge in R use on the ECDF
• Provide R users with a forum to discuss issues and best practices when using R on High Performance Computing (HPC)
• August 2008

SPRINT
[Diagram: routes from post-genomic data to biological results. Labels: Small Post Genomic Data, R; Big Post Genomic Data, R; Big Post Genomic Data, HPC, R, SPRINT; Biological Results.]

SPRINT Prototype
• Simple Parallel R INTerface
  Easy access to HPC for all R users
• 3-month Edikt2 project, Nov 07 to Jan 08.
• SPRINT framework:
  – HPC harness
  – Library of parallel R functions (‘Hello’, pcor)
• Ran on Eddie
• Published in BMC Bioinformatics in Dec 08 and highly accessed.

Proof of Concept to Project
• An intelligent HPC harness: scalable, portable and flexible
• R parallel function library: popular functions, complex functions, open to contributions
• GUI: aimed at biologists and biostatisticians
• Wellcome Trust (Apr 09 to Apr 11)
• dCSE (Oct 09 to Mar 10): port to HECToR

User Requirements Survey
• Online survey
• 55 SPRINT R contacts
• 4 mailing lists:
  – ECDF R users
  – Scottish Bioinformatics Forum
  – Bioconductor
  – R-HPC
• 56 replies

Responses
[Chart: SPRINT Contacts 27%, Mailing Lists 73%]

User Requirements Survey
The survey had 25 questions in 7 sections:
• User Profile
• Experience with R
• R Limitations
• Computer Setup
• Access to HPC
• SPRINT User Wish List
• Further Communication

Results – User Profile
• Bioinformatician
• Academia
• Experienced
  – Statistical analysis
  – Data processing
  – R and general programming
• No experience in parallel programming

Results – Experience with R
• R console or run R at command line
• Transcription microarray, genotyping, sequencing
• Very happy with R/Bioconductor features
• Moderately happy with R performance

Results – R Limitations
• Analysis takes too long
• Data larger than RAM
• Problematic tasks:
  – Machine learning, permutation and bootstrapping
  – Loading, merging, apply(), normalisation, correlation, working with large datasets
• Workarounds:
  – Batch processing, change analysis, reduce the data
  – 50% parallel processing (SNOW, R/Parallel and RMPI)

Results – Computer Setup
• Linux, Windows and Mac OS.
• Windows desktop & Linux server
• Desktop:
  – dual core
  – > 2 GHz
  – 64-bit
  – 2 to 4 GB RAM

Results – Access to HPC
• Most have access to HPC
• Lack of know-how: can’t run R in parallel
  Parallel programming help

Results - SPRINT User Wish List
• Web download
• No GUI

[Chart: requested function categories with counts]
Standard R functions          15
Permutation, bootstrapping    10
Machine learning algorithms    9
Correlation functions          8
Normalisation                  8
Standard Statistics            7
Matrix operations              7
Other                         12

Results - Summary
Success!
• High level of reply
  – Interest, support & need.
• Echoes DPM experience
• Technical user
  – No GUI
• Full survey report can be downloaded at www.r-sprint.org

SPRINT beta 0.1.0
• Priority HPC harness improvements:
  – Large data set
  – Scalability
• Large objects
• ff (Michal Piotrowski)
• MPI IO (Savvas Petrou)
• Runs on Ness & HECToR
• Available at www.r-sprint.org
CRAN soon!

SPRINT beta 0.2.0 and Future Releases
• Next Release:
  – Permutation test: mt.maxT()
  – Unsupervised clustering algorithm: pam()
  – Further improvements to the HPC harness to allow a broad range of functions and support full analysis workflows.
• Future Releases:
  – More permutation tests: RP()
  – Supervised clustering algorithm: randomForest()
• Full SPRINT release in March 2011.

DPM Team:
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen

EPCC Team:
• Terry Sloan
• Michal Piotrowski
• Savvas Petrou
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger

This work was supported by the Wellcome Trust grant [086696/Z/08/Z].
http://www.r-sprint.org