R User Meeting Agenda

NeSC – Wednesday 20th January 2010
14:00  Welcome & R User Survey Results (Muriel Mewissen, Division of Pathway Medicine)
14:30  MAGIC, a consensus clustering package for R (Dr Ian Simpson, Centre for Integrative Physiology)
14:55  R and Eddie for Breast Cancer Bioinformatics (Dr Duncan Sproul, Edinburgh Cancer Research Centre)
15:15  Using large datasets in R with ff (Michal Piotrowski, EPCC)
15:35  Coffee & Tea
16:05  SPRINT & MPI IO (Savvas Petrou, EPCC)
16:25  RMPI (Ms Xu Guo, EPCC)
16:50  Wrap up (Muriel Mewissen, Division of Pathway Medicine)
2nd R User Meeting, NeSC,
20 Jan 2010
2nd R User Meeting
R User Survey Results
Muriel Mewissen – DPM
R User Meeting, NeSC - Wednesday 20 January 2010
Talk Outline
• 1st R User Meeting & SPRINT prototype
• R user requirements survey
• SPRINT beta release
• Future release functionality
1st R User Meeting
• Surge in R use on the ECDF
• Provide R users with a forum to discuss
issues and best practices when using R on
High Performance Computing (HPC)
• August 2008
SPRINT
[Diagram: three workflows. Small Post Genomic Data → R → Biological Results; Big Post Genomic Data → R; Big Post Genomic Data → R with SPRINT on HPC → Biological Results.]
SPRINT Prototype
• Simple Parallel R INTerface
Easy Access to HPC for all R users
• 3-month Edikt2 project, Nov 07 to Jan 08.
• SPRINT framework:
– HPC harness
– Library of parallel R functions (‘Hello’, pcor)
• Ran on Eddie
• Published in BMC Bioinformatics in Dec 08 and
highly accessed.
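The idea behind a parallel correlation function such as pcor can be sketched as below. This is a minimal illustration written in Python rather than R, and it is not SPRINT's implementation: the helper names (`pearson`, `correlate_rows`, `parallel_correlation`) are hypothetical, and a thread pool stands in for the MPI processes an HPC harness would use. Each worker receives a block of rows and correlates them against every row of the matrix.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlate_rows(matrix, row_indices):
    """Correlate each row in row_indices against all rows of matrix."""
    return [(i, [pearson(matrix[i], other) for other in matrix])
            for i in row_indices]


def parallel_correlation(matrix, workers=2):
    """Split rows into interleaved blocks and correlate the blocks concurrently."""
    blocks = [list(range(w, len(matrix), workers)) for w in range(workers)]
    result = [None] * len(matrix)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is independent, so the blocks can run side by side.
        for block in pool.map(lambda b: correlate_rows(matrix, b), blocks):
            for i, row in block:
                result[i] = row
    return result


# Hypothetical toy data: three "genes" measured in four samples.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with the first row
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with the first row
]
corr = parallel_correlation(genes, workers=2)
```

In standard CPython, threads do not speed up CPU-bound work like this; the point of the sketch is the decomposition into independent row blocks, which is what makes the task suitable for distribution across HPC nodes.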
Proof of Concept to Project
• An intelligent HPC harness:
Scalable, portable and flexible
• R parallel function library:
Popular functions, complex functions,
open to contributions
• GUI:
Aimed at biologists and biostatisticians
• Wellcome Trust (Apr 09 to Apr 11)
• dCSE (Oct 09 to Mar 10) port to HECToR
User Requirements Survey
• Online survey
• 55 SPRINT R contacts
• 4 mailing lists:
– ECDF R users
– Scottish Bioinformatics Forum
– Bioconductor
– R-HPC
• 56 replies
Responses
• SPRINT Contacts: 27%
• Mailing Lists: 73%
User Requirements Survey
The survey had 25 questions in 7 sections:
• User Profile
• Experience with R
• R Limitations
• Computer Setup
• Access to HPC
• SPRINT User Wish List
• Further Communication
Results – User Profile
• Bioinformatician
• Academia
• Experienced
– Statistical analysis
– Data processing
– R and general programming
• No experience in parallel programming
Results – Experience with R
• R console or run R at command line
• Transcription microarray, genotyping,
sequencing
• Very happy with R/Bioconductor features
• Moderately happy with R performance
Results – R Limitations
• Analysis takes too long
• Data larger than RAM
• Problematic tasks:
– Machine learning, permutation and bootstrapping
– Loading, merging, apply(), normalisation, correlation,
working with large datasets
• Workarounds:
– Batch processing, change analysis, reduce the data
– 50% parallel processing (SNOW, R/Parallel and
RMPI)
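The batch-processing workaround for data larger than RAM can be sketched as below. This is a minimal illustration in Python rather than R, under the assumption that the statistic of interest (here, column means) can be accumulated chunk by chunk; the function names and the tiny in-memory CSV are hypothetical stand-ins for a large file on disk.

```python
import csv
import io
from itertools import islice


def read_in_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from an iterator of rows."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_column_means(reader, chunk_size=2):
    """Compute per-column means while holding only one chunk in memory."""
    totals, count = None, 0
    for chunk in read_in_chunks(reader, chunk_size):
        for row in chunk:
            values = [float(v) for v in row]
            if totals is None:
                totals = [0.0] * len(values)
            # Only running sums are kept between chunks.
            totals = [t + v for t, v in zip(totals, values)]
            count += 1
    return [t / count for t in totals]


# Hypothetical data: a small CSV standing in for a file too big for RAM.
data = io.StringIO("1,10\n2,20\n3,30\n4,40\n5,50\n")
means = chunked_column_means(csv.reader(data), chunk_size=2)
```

The same pattern only works for analyses that decompose into per-chunk summaries; tasks that need the whole dataset at once are the ones that call for approaches such as large-object storage or parallel processing.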
Results – Computer Setup
• Linux, Windows and Mac OS.
• Windows desktop & Linux server
• Desktop:
– dual core
– > 2 GHz
– 64-bit
– 2 to 4 GB RAM
Results – Access to HPC
• Most have access to HPC
• Lack of knowhow
→ can’t run R in parallel
→ Parallel programming help
Results - SPRINT User Wish List
• Web download
• No GUI
• Functions requested (values from the slide's chart):
– Standard R functions: 15
– Permutation, bootstrapping: 10
– Machine learning algorithms: 9
– Correlation functions: 8
– Normalisation: 8
– Standard Statistics: 7
– Matrix operations: 7
– Other: 12
Results - Summary
Success!
• High level of reply
– Interest, support & need.
• Echoes DPM experience
• Technical user
– No GUI
• Full survey report can be downloaded at
www.r-sprint.org
SPRINT beta 0.1.0
• Priority HPC harness improvements:
– Large data set
– Scalability
• Large objects ff (Michal Piotrowski)
• MPI IO (Savvas Petrou)
• Runs on Ness & HECToR
• Available at www.r-sprint.org
• CRAN soon!
SPRINT beta 0.2.0 and
Future Releases
• Next Release:
– Permutation test: mt.maxT()
– Unsupervised clustering algorithm: pam()
– Further improvements to the HPC harness to allow a
broad range of functions and support a full analysis
workflow.
• Future Releases:
– More permutation tests: RP()
– Supervised clustering algorithm: randomForest()
• Full SPRINT release in March 2011.
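To show why permutation tests are attractive targets for parallelisation, here is a minimal sketch of one in Python rather than R. It is a plain two-sample test on a difference in means for a single variable, which is much simpler than the multiple-testing procedure behind the function named in the release plan; the function name `permutation_test`, its parameters, and the sample data are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Each iteration relabels the samples at random and is independent
        # of every other iteration.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two clearly separated groups.
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0]
treated = [6.9, 7.1, 7.0, 6.8, 7.2, 7.0]
p_value = permutation_test(control, treated)
```

Because the iterations do not depend on each other, the loop can be divided among workers and the counts of extreme results summed at the end, which is the property a parallel implementation would rely on.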
DPM Team:
• Peter Ghazal
• Thorsten Forster
• Muriel Mewissen
EPCC Team:
• Terry Sloan
• Michal Piotrowski
• Savvas Petrou
• Bartek Dobrzelecki
• Jon Hill
• Florian Scharinger
This work was supported by the Wellcome Trust grant [086696/Z/08/Z].
http://www.r-sprint.org