Science Cloud
Paul Watson
Newcastle University, UK
paul.watson@ncl.ac.uk

Research Challenge
Understanding the brain is the greatest informatics challenge
• Enormous implications for science:
  • Medicine
  • Biology
  • Computer Science

Collecting the Evidence
100,000 neuroscientists generate huge quantities of data:
– molecular (genomic/proteomic)
– neurophysiological (time-series activity)
– anatomical (spatial)
– behavioural

Neuroinformatics Problems
• Data is:
  • expensive to collect but rarely shared
  • in proprietary formats & locally described
• The result is:
  • a shortage of analysis techniques that can be applied across neuronal systems
  • limited interaction between research centres with complementary expertise

Data in Science
• Bowker’s “Standard Scientific Model”
  1. Collect data
  2. Publish papers
  3. Gradually lose the original data
  (The New Knowledge Economy & Science & Technology Policy, G.C. Bowker)
• Problems:
  – papers often draw conclusions from data that is not published
  – inability to replicate experiments
  – data cannot be re-used

Codes in Science
• Three stages for codes
  1. Write code and apply to data
  2. Publish papers
  3. Gradually lose the original codes
• Problems:
  – papers often draw conclusions from codes that are not published
  – inability to replicate experiments
  – codes cannot be re-used

Plan
• Neuroinformatics - a challenging e-science application
• CARMEN – addressing the challenges
• Cloud Computing for e-science
  – Lessons we’ve Learnt
• The Promise of Commercial Clouds

Focus on Neural Activity
raw voltage signal data typically collected using single or multi-electrode array recording
[Figure labels: neurone 1, neurone 2, neurone 3]
cracking the neural code

Epilepsy Exemplar
Data analysis guides surgeon during operation
Further analysis provides evidence

WARNING!
The next 2 slides show an exposed human brain

CARMEN enables sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated

CARMEN Project
UK EPSRC e-Science Pilot
$7M (2006-10)
20 Investigators
Stirling, St. Andrews, Newcastle, Manchester, York, Sheffield, Leicester, Warwick, Cambridge, Plymouth, Imperial
Industry & Associates

CARMEN e-Science Requirements
• Store
  – very large quantities of data (100TB+)
• Analyse
  – suite of neuroinformatics services
  – support data intensive analysis
• Automate
  – workflow
• Share
  – under user-control

Background: North East Regional e-Science Centre
• 25 Research Projects across many domains:
  • Bioinformatics, Ageing & Health, Neuroscience, Chemical Engineering, Transport, Geomatics, Video Archives, Artistic Performance Analysis, Computer Performance Analysis, ...
• Same key needs: Share, Automate, Analyse, Store

Result: e-Science Central
• Integrated Store-Analyse-Automate-Share infrastructure
• Web-based
• Generic
  – CARMEN neuroinformatics & chemistry as pilots

Science Cloud Architecture
Access over Internet (typically via browser)
Upload data & services
Run analyses
Data storage and analysis

Cloud Services Continuum (based on Robert Anderson)
http://et.cairene.net/2008/07/03/cloud-services-continuum/
• Software (SaaS): Google Apps, Salesforce.com
• Platform (PaaS): Google AppEngine, Microsoft Azure
• Infrastructure (IaaS): Amazon EC2 & S3

Science Cloud Options
Users
Service Developers
Science App 1 .... Science App n
Science App 1 .... Science App n
Science Platform
Cloud Infrastructure: Storage & Compute (both options)

CARMEN Cloud
[Diagram components: Filestore with Pattern Search; Security; Workflow; Browsers & Rich Clients; Database; Workflow Enactment; Metadata; Service Repository; Processing]

Editing and Running a Workflow on the Web
Workflow Result File

Viewing the output of Workflow Runs
Viewing results
Blogs and links

Communicating Results
Linking to results & workflows

What we learnt: Moving into a Cloud
• Moving existing technologies into a cloud can be difficult
  – some can’t run in a Cloud at all

Raw Data Exploration with Signal Data Explorer

What we learnt: Scalability
• Clouds offer the potential for scalability
  – grab compute power only when needed
• But developers have to write scalable code
  – for Infrastructure as a Service Clouds

Dynasoar: Dynamic Deployment
[Diagram labels: C; Web Service Provider (WSP); Service Repository (SR); Host Provider with node 1 (s2, s5), node 2, …, node n (s2); 1: req (a request to s4); 2: service fetch & deploy; 3: res]
The deployed service remains in place and can be re-used - unlike job scheduling

Dynasoar
[Diagram labels: Consumer (C); Web Service Provider (WSP); Host Provider with node 1 (s2, s5), node 2, …, node n (s2); req; res]
A request for s2 is routed to an existing deployment of the service

Adaptive Dynamic Deployment with Dynasoar
[Figure axes: arrival rate (messages per second); response time (seconds); processors in pool]
Commercial pay-as-you-go clouds would allow us to avoid this limit
Adding processors as you need them optimises resources and saves money in pay-as-you-go clouds

Hot Off the Press..
• Recent experiments with Microsoft Azure Cloud
  – running Chemical analyses
  – Silverlight UI
Thanks to:
- Paul Appleby & Team at the Microsoft Technology Centre, Reading
- & MS e-Science Group

Microsoft Azure Cloud for e-Science Demo

Why are Commercial Clouds Important: Before
Research
1. Have good idea
2. Write proposal
3. Wait 6 months
4. If successful, wait 3 months
5. Install Computers
6. Start Work
Science Start-ups
1. Have good idea
2. Write Business Plan
3. Ask VCs to fund
4. If successful..
5. Install Computers
6. Start Work

Why Use Commercial Clouds:
1. Have good idea
2. Grab nodes from Cloud provider
3. Start Work
4. Pay for what you used
• also scalability, cost, sustainability

Commercial Clouds to the Rescue?
• Focus currently on infrastructure as a service
• But, this is only part of the stack
• Can we have pay-as-you-go Science Cloud Platforms?

A Sustainable Science Cloud?
e-Science Central
www.inkspotscience.com

Problem: delivering the e-science platform
[Diagram labels: Commercial Clouds; Science App 1 .... Science App n; Science Platform as a Service?; Cloud Infrastructure: Storage & Compute]

Summary: e-Science Central & CARMEN
e-Science Central / CARMEN
Software as a Service
• Web based
• Works anywhere
Cloud Computing
• Dynamic Resource Allocation
• Pay-as-you-Go*
Social Networking
• Controlled Sharing
• Collaboration
• Communities

Summary
• e-Science Central
  – Store-Analyse-Automate-Share e-science platform
  – Adding content from a range of domains
• CARMEN is piloting this approach for neuroinformatics
• Cloud computing can revolutionise e-science
  – reduce time from idea to realisation
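
Appendix sketch: Dynasoar-style routing
The Dynasoar slides describe a request being routed to an existing deployment of a service where one exists, and otherwise the service being fetched from the Service Repository and deployed on a node, where it remains for re-use. The Python below is a minimal sketch of that idea only, not Dynasoar's actual code or API. The class names (Node, WebServiceProvider), the handle method, the deploy_count counter and the "node with the fewest services" placement rule are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A Host Provider node; records which services are deployed on it."""
    name: str
    deployed: set = field(default_factory=set)


class WebServiceProvider:
    """Hypothetical stand-in for the WSP: routes requests, deploying on demand."""

    def __init__(self, repository, nodes):
        self.repository = repository  # service name -> callable (the "code")
        self.nodes = nodes            # nodes offered by the Host Provider
        self.deploy_count = 0         # how many fetch-and-deploy steps occurred

    def _find_deployment(self, service):
        """Return the first node that already hosts the service, or None."""
        for node in self.nodes:
            if service in node.deployed:
                return node
        return None

    def handle(self, service, payload):
        """Route a request; fetch and deploy the service only if needed."""
        if service not in self.repository:
            raise KeyError(f"unknown service: {service}")
        node = self._find_deployment(service)
        if node is None:
            # Assumed placement rule: pick the node hosting the fewest services.
            node = min(self.nodes, key=lambda n: len(n.deployed))
            node.deployed.add(service)  # the deployment stays in place
            self.deploy_count += 1
        result = self.repository[service](payload)
        return node.name, result


if __name__ == "__main__":
    # Node contents mirror the slide: node 1 has s2 and s5, node n has s2.
    repo = {"s2": lambda x: x * 2, "s4": lambda x: x + 4, "s5": str}
    nodes = [Node("node 1", {"s2", "s5"}), Node("node 2"), Node("node n", {"s2"})]
    wsp = WebServiceProvider(repo, nodes)
    print(wsp.handle("s4", 1))  # first request for s4 triggers a deployment
    print(wsp.handle("s4", 2))  # second request re-uses that deployment
    print(wsp.handle("s2", 3))  # s2 is already deployed, so no new deployment
```

Because the deployment persists on the node, only the first request for a service pays the fetch-and-deploy cost, which is the contrast with job scheduling drawn on the slide.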