- 10th International Bielefeld Conference

advertisement
Life Sciences
Earth
Sciences
e-Science and its
Implications for the Library
Community
Social Sciences
Tony Hey
Corporate Vice President
Technical Computing
Microsoft Corporation
Multidisciplinary
Research
Computer and
Information
Sciences
New Materials,
Technologies
and Processes
Licklider’s Vision
“Lick had this concept – all of the stuff
linked together throughout the world, that
you can use a remote computer, get data
from a remote computer, or use lots of
computers in your job”
Larry Roberts – Principal Architect of the
ARPANET
Physics and the Web





Tim Berners-Lee developed the Web at
CERN as a tool for exchanging information
between the partners in physics
collaborations
The first Web Site in the USA was a link to
the SLAC library catalogue
It was the international particle physics
community who first embraced the Web
‘Killer’ application for the Internet
Transformed modern world – academia,
business and leisure
Beyond the Web?

Scientists developing collaboration
technologies that go far beyond the capabilities
of the Web




To use remote computing resources
To integrate, federate and analyse information from
many disparate, distributed, data resources
To access and control remote experimental
equipment
Capability to access, move, manipulate and
mine data is the central requirement of these
new collaborative science applications



Data held in file or database repositories
Data generated by accelerator or telescopes
Data gathered from mobile sensor networks
What is e-Science?
‘e-Science is about global collaboration
in key areas of science, and the next
generation of infrastructure that will
enable it’
John Taylor
Director General of Research Councils
UK, Office of Science and Technology
The e-Science Vision

e-Science is about multidisciplinary science
and the technologies to support such
distributed, collaborative scientific research



Many areas of science are in danger of being
overwhelmed by a ‘data deluge’ from new highthroughput devices, sensor networks, satellite
surveys …
Areas such as bioinformatics, genomics, drug
design, engineering, healthcare … require
collaboration between different domain experts
‘e-Science’ is a shorthand for a set of
technologies to support collaborative
networked science
e-Science – Vision and Reality
Vision

Oceanographic sensors - Project Neptune

Joint US-Canadian proposal
Reality

Chemistry – The Comb-e-Chem Project

Annotation, Remote Facilities and e-Publishing
http://www.neptune.washington.edu/
Undersea
Sensor
Network
Connected &
Controllable
Over the
Internet
Data
Provenance
Persistent
Distributed
Storage
Visual
Programming
Distributed
Computation
Interoperability
& Legacy
Support via
Web Services
Searching &
Visualization
Live
Documents
Reputation
& Influence
Reproducible
Research
Collaboration
Handwriting
Interactive
Data
Dynamic
Documents
The Comb-e-Chem Project
Diffractometer
Video Data
Stream
Automatic
Annotation
HPC Simulation
Data Mining
and Analysis
Structures
Database
Combinatorial
Chemistry
Wet Lab
National X-Ray
Service
Middleware
National Crystallographic Service
Send sample
material to
NCS service
Collaborate in e-Lab
experiment and
obtain structure
X-Ray e-Laboratory
Search materials database
and predict properties using
Grid computations
Structures
Database
Download full
data on materials
of interest
Computation
Service
A digital lab book
replacement that chemists
were able to use, and liked
Monitoring laboratory
experiments using a
broker delivered over
GPRS on a PDA
Crystallographic e-Prints
Direct Access to Raw Data
from scientific papers
Raw data sets can be very
large - stored at UK National
Datastore using SRB software
eBank Project
Virtual Learning
Environment
Undergraduate
Students
Digital
Library
E-Scientists
E-Scientists
Reprints
PeerReviewed
Journal &
Conference
Papers
Grid
Technical
Reports
Preprints &
Metadata
E-Experimentation
Publisher
Holdings
Graduate
Students
Institutional
Archive
Local
Web
Certified
Experimental
Results &
Analyses
Data,
Metadata &
Ontologies
5
Entire E-Science Cycle
Encompassing
experimentation,
analysis, publication,
research, learning
Support for e-Science

Cyberinfrastructure and e-Infrastructure



In the US, Europe and Asia there is a common
vision for the ‘cyberinfrastructure’ required to
support the e-Science revolution
Set of Middleware Services supported on top of
high bandwidth academic research networks
Similar to vision of the Grid as a set of
services that allows scientists – and industry –
to routinely set up ‘Virtual Organizations’ for
their research – or business


Many companies emphasize computing cycle
aspect of Grids
The ‘Microsoft Grid’ vision is more about data
management than about compute clusters
Six Key Elements for a Global
Cyberinfrastructure for e-Science
1.
2.
3.
4.
5.
6.
High bandwidth Research Networks
Internationally agreed AAA Infrastructure
Development Centers for Open Standard
Grid Middleware
Technologies and standards for Data
Provenance, Curation and Preservation
Open access to Data and Publications
via Interoperable Repositories
Discovery Services and Collaborative
Tools
The Web Services ‘Magic Bullet’
Company A
(J2EE)
Web Services
Company C
(.Net)
Open Source
(OMII)
Computational
Modeling
Persistent
Distributed
Data
Workflow,
Data Mining
& Algorithms
Interpretation
& Insight
Real-world
Data
Technical Computing in Microsoft

Radical Computing


Advanced Computing for Science and
Engineering


Research in potential breakthrough
technologies
Application of new algorithms, tools and
technologies to scientific and engineering
problems
High Performance Computing

Application of high performance clusters
and database technologies to industrial
applications
New Science Paradigms


Thousand years ago:
Experimental Science
- description of natural phenomena
Last few hundred years:
Theoretical Science
- Newton’s Laws, Maxwell’s Equations …

Last few decades:
Computational Science
- simulation of complex phenomena

Today:
e-Science or Data-centric Science
- unify theory, experiment, and simulation
- using data exploration and data mining




Data captured by instruments
Data generated by simulations
Processed by software
Scientist analyzes databases/files
(With thanks to Jim Gray)
2
 . 
4G
c2
a
 a   3   a 2
 
Advanced Computing for
Science and Engineering
...
TOOLS
Workflow, Collaboration, Visualization, Data Mining
DATA
Acquisition, Storage, Annotation, Provenance, Curation, Preservation
CONTENT
Scholarly Communication, Institutional Repositories
Top 500 Supercomputer Trends
Industry
usage
rising
Clusters
over 50%
GigE is
gaining
x86 is
winning
Key Issues for e-Science

Workflows


The Data Chain


The LEAD Project
From Acquisition to Preservation
Scholarly Communication

Open Access to Data and Publications
The LEAD Project
Better predictions for Mesoscale weather
The LEAD Vision
DYNAMIC OBSERVATIONS
Analysis/Assimilation
Prediction/Detection
Quality Control
Retrieval of Unobserved
Quantities
Creation of Gridded Fields
PCs to Teraflop Systems
Product Generation,
Display,
Dissemination
Models and Algorithms Driving Sensors
The CS challenge: Build a virtual “eScience” laboratory to
support experimentation and education leading to this vision.
End Users
NWS
Private Companies
Students
Composing LEAD Services
Need to construct workflows that are:
 Data Driven


Persistent and Agile


The weather input stream defines the nature of the
computation
An agent mines a data stream and notices an
“interesting” feature. This event may trigger a
workflow scenario that has been waiting for months
Adaptive



The weather changes
Workflow may have to change on-the-fly
Resources
Example LEAD Workflow
The e-Science Data Chain








Data Acquisition
Data Ingest
Metadata
Annotation
Provenance
Data Storage
Curation
Preservation
The Data Deluge


In the next 5 years e-Science projects will
produce more scientific data than has been
collected in the whole of human history
Some normalizations:





The Bible = 5 Megabytes
Annual refereed papers = 1 Terabyte
Library of Congress = 20 Terabytes
Internet Archive (1996 – 2002) = 100 Terabytes
In many fields new high throughput devices,
sensors and surveys will be producing
Petabytes of scientific data
The Problem for the e-Scientist
Experiments &
Instruments
Other Archives
Literature
questions
facts
facts
?
answers
Simulations







Data ingest
Managing a petabyte

Common schema

How to organize it?
How to reorganize it?
How to coexist & cooperate with
others?
Data Query and Visualization
tools
Support/training
Performance


Execute queries in a minute
Batch (big) query scheduling
Digital Curation?

In 20 years can guarantee that the operating
system and spreadsheet program and the
hardware used to store data will not exist




Need research ‘curation’ technologies such as
workflow, provenance and preservation
Need to liaise closely with individual research
communities, data archives and libraries
The UK has set up the ‘Digital Curation
Centre’ in Edinburgh with Glasgow, UKOLN
and CCLRC
Attempt to bring together skills of scientists,
computer scientists and librarians
Digital Curation Centre

Actions needed to maintain and utilise digital
data and research results over entire life-cycle


Digital Preservation


Long-run technological/legal accessibility and
usability
Data curation in science


For current and future generations of users
Maintenance of body of trusted data to represent
current state of knowledge
Research in tools and technologies

Integration, annotation, provenance, metadata,
security…..
Berlin Declaration 2003


‘To promote the Internet as a functional
instrument for a global scientific
knowledge base and for human
reflection’
Defines open access contributions as
including:
 ‘original scientific research results,
raw data and metadata, source
materials, digital representations of
pictorial and graphical materials and
scholarly multimedia material’
NSF ‘Atkins’ Report on
Cyberinfrastructure


‘the primary access to the latest findings
in a growing number of fields is through
the Web, then through classic preprints
and conferences, and lastly through
refereed archival papers’
‘archives containing hundreds or
thousands of terabytes of data will be
affordable and necessary for archiving
scientific and engineering information’
MIT DSpace Vision
‘Much of the material produced by
faculty, such as datasets, experimental
results and rich media data as well as
more conventional document-based
material (e.g. articles and reports) is
housed on an individual’s hard drive or
department Web server. Such material
is often lost forever as faculty and
departments change over time.’
Publishing Data & Analysis
Is Changing
Roles
Traditional
Emerging
Authors
Scientists
Collaborations
Publishers
Journals
Project web site
Curators
Libraries
Data+Doc Archives
Archives
Archives
Digital Archives
Consumers Scientists
Scientists
Data Publishing: The Background
In some areas – notably biology – databases are
replacing (paper) publications as a medium of
communication





These databases are built and maintained with a
great deal of human effort
They often do not contain source experimental data sometimes just annotation/metadata
They borrow extensively from, and refer to, other
databases
You are now judged by your databases as well as
your (paper) publications
Upwards of 1000 (public databases) in genetics
Data Publishing: The issues

Data integration


Annotation



‘Where did this data come from?’
Exporting/publishing in agreed formats


Adding comments/observations to existing data
Becoming a new form of communication
Provenance


Tying together data from various sources
To other programs as well as people
Security

Specifying/enforcing read/write access to parts of
your data
Interoperable Repositories?

Paul Ginsparg’s arXiv at Cornell has demonstrated
new model of scientific publishing


David Lipman of the NIH National Library of Medicine
has developed PubMedCentral as repository for NIH
funded research papers


Electronic version of ‘preprints’ hosted on the Web
Microsoft funded development of ‘portable PMC’ now being
deployed in UK and other countries
Stevan Harnad’s ‘self-archiving’ EPrints project in
Southampton provides a basis for OAI-compliant
‘Institutional Repositories’

Many national initiatives around the world moving towards
mandating deposition of ‘full text’ of publicly funded research
papers in repositories
Microsoft Strategy for e-Science
Microsoft intends to work with the
scientific and library communities:
to define open standard and/or interoperable
high-level services, work flows and tools

to assist the community in developing open
scholarly communication and interoperable
repositories

Acknowledgements
With special thanks to Kelvin
Droegemeier, Geoffrey Fox, Jeremy
Frey, Dennis Gannon, Jim Gray, Yike
Guo, Liz Lyon and Beth Plale
Download