lecture10_eScience - Homepages | The University of Aberdeen

CS5038 The Electronic Society
Lecture 10: e-Science
Lecture Outline
Background: “Big Science”
Grid Computing
Standards for Grid Computing
e-Science – what is it
e-Science Examples:
Social Simulations – modelling land-use change
Particle Physics (LHC)
Astronomy (VirtualObservatory)
Environmental Sciences – Climate Change
Engineering - Aircraft Maintenance
Economics – Predicting Markets
Bio-informatics – Simulated Biology
Healthcare - Cancer Diagnosis
e-Science - Background
“Big Science”
 During the early part of the 20th century, science became crucial in warfare
 World War II : Scientists developed new weapons and tools
 proximity fuse, radar, atomic bomb, cryptography
 Led to a new form of research facility: the government-sponsored laboratory
 thousands of technicians and scientists, managed by universities
 Enabled hitherto impossible scientific projects
 heavy investment by government and industrial interests:
blurred line between public and private research
 Undermines basic principles of scientific method: Results difficult to verify.
 Access to facilities limited to those who are accomplished -> elitism.
 Increased government funding often implies military agenda
 Subverts the Enlightenment-era ideal of science as quest for knowledge.
 Increased administrative overhead – e.g. filling out grant requests
 Connections between academic, governmental, and industrial interests
 Concern about Scientists’ objectivity (e.g. pharmaceutical industry)
The World Wide Web was born from "Big Science"
 August 1991 CERN (Switzerland) : new World Wide Web project
Grid Computing
Grid computing evolved from the computational needs of “Big Science”
“Grid computing uses the resources of many separate computers connected by
a network (usually the internet) to solve large-scale computation problems.”
A conceptual framework rather than a physical resource:
 flexible computational provisioning beyond the local administrative domain.
 Involves sharing computing power:
 heterogeneous resources (based on different platforms, hardware/software
architectures, and computer languages),
 located in different places
 belonging to different administrative domains
 using open standards.
 Requires security : to allow remote users to control computing resources.
Special Purpose Grid – Example: SETI@home project
General Purpose Grid - Example: Parabon Computation (Commercial)
In terms of function, there are three types of grid:
 Computational Grids : computationally-intensive operations.
 Data grids: sharing and management of large amounts of distributed data.
 Equipment Grids: control equipment remotely and analyse data produced.
e.g. controlling a telescope
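The core idea of a computational grid, splitting one large problem into independent work units that separate machines process before the results are combined, can be sketched as follows. This is only an illustrative sketch, not how SETI@home or any real grid middleware works: local threads stand in for remote computers, and the names `count_primes`, `split_range`, and `run_on_grid` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_primes(bounds):
    """One work unit: count the primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        # n is prime if no d in 2..sqrt(n) divides it
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(lo, hi, units):
    """Split a non-empty range [lo, hi) into at most `units` contiguous chunks."""
    step = -(-(hi - lo) // units)  # ceiling division
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def run_on_grid(lo, hi, units=8, workers=4):
    """Hand each work unit to a worker (a stand-in for a remote machine)
    and sum the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_primes, split_range(lo, hi, units)))


print(run_on_grid(0, 100))  # prints 25, the number of primes below 100
```

The work units share no state, which is what makes this kind of problem easy to spread over machines in different administrative domains.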
Grid Standards - Globus
Globus Alliance is an association – mainly Universities
(e.g. Chicago, Edinburgh, Southern California)
 Developing fundamental technologies needed to build grid computing
 Most grids in Europe and North America use the Globus Toolkit as their
core middleware.
 Globus software provides (e.g.):
 Resource management: Grid Resource Allocation & Management Protocol (GRAM)
 Information Services: Monitoring and Discovery Service (MDS)
 Security Services: Grid Security Infrastructure (GSI)
 Data Movement and Management: Global Access to Secondary Storage
(GASS) and GridFTP
 XML-based web services allow access to services/applications
 grid computing and web services converge: Grid Service
 Open Grid Services Architecture (OGSA): vision is to describe and build a
well-defined suite of standard interfaces and behaviours that serve as a
common framework for all Grid-enabled systems and applications.
What is e-Science? - science enabled by electronic infrastructure
 Computationally intensive
 Uses highly distributed network environments
 Requires access to immense data sets
 May require Grid Computing
 High performance visualisation back to the individual user scientists
 Social Simulations – modelling land-use change
 Particle Physics (LHC), Astronomy (VirtualObservatory)
 Environmental Sciences – Climate Change
 Engineering - Aircraft Maintenance
 Economics – Predicting Markets
 Bio-informatics – Simulated Biology
 Healthcare - Cancer Diagnosis
 Middleware: Data communication, data integration
 Requires large and complex infrastructure
 Research Labs, Large Universities, Governments (e.g. UK)
e-Science Examples: Particle Physics
 Large Hadron Collider (LHC) at CERN
 Currently the most developed e-Science infrastructure
 LHC due to start generating data between 2007 and 2009 (exact date uncertain)
 Massive amount of data generated
Estimated at 10 petabytes each year (peta = 10^15)
 Thousands of researchers across the world will be
involved in the LHC experiments and in analysing
 GridPP
 UK’s contribution to analysing this data deluge.
 Six-year, £33m project
 Collaboration of around 100 researchers in 19 UK
University particle physics groups, CCLRC and CERN.
 More than 100,000 PCs, spread across one hundred
institutions around the world.
 Three main areas of work:
• Applications to allow physicists to submit data to
Grid for analysis
• Middleware to manage the distribution of
computing jobs around the grid and deal with
• Deploying computing infrastructure at sites
across the UK, to build a prototype Grid.
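To get a feel for the scale of these figures, the short calculation below converts the quoted 10 petabytes per year into an average data rate and a per-PC share. It assumes, purely for illustration, that data is produced evenly over a 365-day year and divided equally among the 100,000 PCs with no replication; neither assumption comes from the LHC or GridPP.

```python
PETABYTE = 10 ** 15                 # bytes; "peta" is the SI prefix for 10^15
SECONDS_PER_YEAR = 365 * 24 * 3600

annual_bytes = 10 * PETABYTE        # estimate quoted above
pcs = 100_000                       # PC count quoted above

# Assumption: data arrives at a constant rate all year round.
avg_rate_mb_s = annual_bytes / SECONDS_PER_YEAR / 10 ** 6
# Assumption: every PC holds an equal share, with no copies kept elsewhere.
per_pc_gb = annual_bytes / pcs / 10 ** 9

print(f"Average data rate: {avg_rate_mb_s:.0f} MB/s")     # 317 MB/s
print(f"Share per PC per year: {per_pc_gb:.0f} GB")       # 100 GB
```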
e-Science Examples: Astronomy
 £10M project to build a data-grid for UK astronomy
 Forms the UK’s contribution to a global
 Three main strands to VirtualObservatory
1. International standards for astronomical data,
metadata, and software interoperability
2. New software infrastructure using emerging
technology: web services and the Grid.
3. Science user tools that exploit the new infrastructure
and bring the VO to the astronomer’s desktop.
Goals of AstroGrid (mainly strand 2):
 Datagrid for key UK databases
 Datamining facilities for interrogating those databases e.g. search for ‘cloaked’ objects
 A uniform archive query and data-mining interface
 A facility for users to upload code to run their own
algorithms on the datamining machines
 An exploration of techniques for open-ended resource
e-Science Examples: Climate Change
 To address the enormous variation in current climate
 Existing climate models have to include the effects of small-scale physical processes (such as clouds) through
simplifications (parameterisations)
 Results can be out by an order of magnitude
 Experimental Objective: Ensemble Forecasting
 Run thousands of climate models with slightly different
physics in order to represent the whole range of
uncertainties in all the parameterisations.
(parameters are varied within their current range of
 The project has already recruited 37,000 users
Project Goal:
 to make the first fully probability-based
fifty-year forecast of human-induced
climate change using a full-scale 3-D
atmosphere-ocean climate simulation
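The ensemble approach described above, running the same model many times with slightly different parameter values and looking at the spread of outcomes, can be sketched as follows. The one-line `toy_model` is a hypothetical stand-in for a full climate simulation, and the parameter range and base value are arbitrary illustrative numbers, not values from the project.

```python
import random
import statistics


def toy_model(feedback, base_warming=1.2):
    """Hypothetical stand-in for a climate model: a simple feedback formula.
    Small changes in `feedback` give large changes in the result."""
    return base_warming / (1.0 - feedback)


def run_ensemble(n_runs, low, high, seed=0):
    """Run the toy model n_runs times, drawing the uncertain parameter
    uniformly from [low, high] for each run."""
    rng = random.Random(seed)
    return [toy_model(rng.uniform(low, high)) for _ in range(n_runs)]


def summarise(results):
    """Return the 5th percentile, median, and 95th percentile of the ensemble."""
    cuts = statistics.quantiles(results, n=20)  # 19 cut points at 5% steps
    return cuts[0], statistics.median(results), cuts[-1]


results = run_ensemble(1000, low=0.3, high=0.7)
p5, median, p95 = summarise(results)
print(f"5%: {p5:.2f}  median: {median:.2f}  95%: {p95:.2f}")
```

The output is a distribution rather than a single number, which is what allows a forecast to be stated in terms of probabilities.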
e-Science Example: Aircraft Maintenance
DAME project
 £3.2 Million, 3 years, commenced Jan 2002.
4 Universities:
 York, Sheffield, Oxford, Leeds
Industrial Partners:
 Rolls-Royce, Data Systems, Cybula Ltd
Aim: aerospace diagnostics
 Remote, secure access to flight data and other operational data and
 Rapid data mining and analysis of fault data
 Distributed search on massive data collections using scalable,
neural-network-type methods for comparing data with archived fleet engine data.
 Each flight could produce up to 1GB of vibration data
The DAME workbench (portal)
 Analysis tools for the engine diagnosis process
 Central control point for automated workflows
 Manages distributed diagnosis team and virtual organisations
 Manages issues of security and user roles.
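The distributed search idea can be illustrated with a small sketch: each data centre searches only its own archive and returns its best matches, and a coordinator merges those short lists. Note that this sketch swaps in plain Euclidean distance for the neural-network-type matching mentioned above, and the site names, record identifiers, and feature vectors are all made up for the example.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_site(archive, query, k):
    """Runs at one data centre: return its k closest records as (distance, id)."""
    scored = ((distance(vec, query), rec_id) for rec_id, vec in archive.items())
    return heapq.nsmallest(k, scored)


def distributed_search(sites, query, k=3):
    """Coordinator: merge the per-site candidates into a global top k.
    Only k small tuples per site cross the network, never the raw archive."""
    candidates = []
    for archive in sites.values():
        candidates.extend(search_site(archive, query, k))
    return heapq.nsmallest(k, candidates)


# Hypothetical archives of vibration "signatures" held at two sites.
sites = {
    "european": {"EU-1": [0.1, 0.9], "EU-2": [0.8, 0.2]},
    "american": {"US-1": [0.2, 0.8], "US-2": [0.9, 0.1]},
}
best = distributed_search(sites, query=[0.1, 0.85], k=2)
print([rec_id for _, rec_id in best])  # ['EU-1', 'US-1']
```

Because every record in the global top k must also be in the top k of its own site, merging the per-site lists gives the same answer as searching one combined archive.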
e-Science Example: Aircraft Maintenance
[Figure labels: Engine flight data, London Airport, New York Airport, Diagnostics Centre, Maintenance Centre, American data center, European data center]
e-Science Example: Predicting Markets
 The INWA Grid project (Innovation Node: Western Australia) :
 Investigating suitability of existing Grid technologies for secure, commercial
data mining.
 The three-continent Grid:
 Edinburgh Parallel Computing Centre (EPCC)
 Curtin University in Western Australia (WA)
 Chinese Academy of Sciences in Beijing.
 Data mining to predict customer trends, develop new products and better meet
customer needs.
 Samples drawn from a region + publicly available data
-> build a clearer picture of regional behaviour within the economy
 But: need a distributed-aggregated approach to preserve anonymity
 UK mortgage data + UK property data
 Australian telco data + Australian property data
 Compute power at EPCC + Curtin
 A bank wants to predict if home owners are likely to move house within 5 years of
taking out a mortgage to buy the house
 Bank wants to use its own data and publicly available data to help improve the
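One way to picture a distributed-aggregated approach is sketched below: each organisation summarises its own records locally, and only the summaries are combined. This is an illustrative assumption about how such a scheme could look, not a description of the INWA project's method. The record layout (region, moved within 5 years), the function names, and the small-count suppression threshold are all hypothetical.

```python
from collections import Counter


def local_summary(records, min_count=5):
    """Runs inside one organisation: count movers per region, then drop any
    region with fewer than min_count records so individuals stay hidden."""
    totals, movers = Counter(), Counter()
    for region, moved in records:
        totals[region] += 1
        movers[region] += int(moved)
    return {r: (movers[r], totals[r]) for r in totals if totals[r] >= min_count}


def combine(summaries):
    """Runs at the coordinator: merge per-site counts into regional move rates.
    Only aggregated counts are seen here, never individual records."""
    movers, totals = Counter(), Counter()
    for summary in summaries:
        for region, (m, t) in summary.items():
            movers[region] += m
            totals[region] += t
    return {r: movers[r] / totals[r] for r in totals}


# Hypothetical data: (region, moved house within 5 years?)
site_a = [("north", True)] * 3 + [("north", False)] * 3 + [("south", True)] * 2
site_b = [("north", True)] * 1 + [("north", False)] * 5

rates = combine([local_summary(site_a), local_summary(site_b)])
print(rates)  # only 'north' appears; 'south' had too few records to share
```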
e-Science Example: Simulated Biology
BioSimGrid project
 Aim: to make the results of large-scale
computer simulations of biomolecules more
accessible to the biological community.
 Simulations of the motions of proteins are a
key component in understanding how the
structure of a protein is related to its dynamic
 Data distributed between University of California, San Diego
and Oxford.
 Simulations were run using different programs and protocols
 Data in very different formats.
Software tools for interrogation and data-mining
Generic analysis tools (Python), visualisation (VMD)
Annotation of simulation data
Readily modifiable simple example scripts
Underlying data storage structure hidden
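The idea of generic analysis tools working over simulation data stored in very different formats, with the storage structure hidden, can be sketched with a common interface and one adapter per format. This is a minimal illustration only: the class names and both "formats" are invented for the example and are not BioSimGrid's actual API or file formats.

```python
from abc import ABC, abstractmethod


class Trajectory(ABC):
    """Common interface: analysis code sees frames, never the storage format."""

    @abstractmethod
    def frames(self):
        """Yield each frame as a list of (x, y, z) atom coordinates."""


class TextTrajectory(Trajectory):
    """Hypothetical format A: one frame per line, atoms as 'x y z;x y z;...'."""

    def __init__(self, text):
        self._text = text

    def frames(self):
        for line in self._text.strip().splitlines():
            yield [tuple(float(v) for v in atom.split()) for atom in line.split(";")]


class NestedTrajectory(Trajectory):
    """Hypothetical format B: frames already held as nested Python lists."""

    def __init__(self, data):
        self._data = data

    def frames(self):
        for frame in self._data:
            yield [tuple(float(v) for v in atom) for atom in frame]


def mean_x_per_frame(trajectory):
    """Generic analysis tool: works on any Trajectory implementation."""
    return [sum(atom[0] for atom in frame) / len(frame) for frame in trajectory.frames()]


a = TextTrajectory("0 0 0;2 0 0\n1 1 1;3 1 1")
b = NestedTrajectory([[[0, 0, 0], [2, 0, 0]], [[1, 1, 1], [3, 1, 1]]])
print(mean_x_per_frame(a), mean_x_per_frame(b))  # the same result from both formats
```

Adding support for another simulation program then means writing one new adapter, while the analysis scripts stay unchanged.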
e-Science Examples: Cancer Diagnosis
Telemedicine on the Grid
 Multi-site videoconferencing
 Real-time delivery of microscope imagery
 Communication and archiving of radiological
 Supports multi-disciplinary meetings for the
review of cancer diagnoses and treatment.
 Remote access to computational medical
simulations of tumours and other cancer-related
 Data-mining of patient record databases
 Improved clinical decision making.
 Currently clinicians travel large distances
 Grid technology can provide access to
appropriate clinical information and images
across the network.