A Grand Challenge for the
Information Age
Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair,
UC San Diego
The Fundamental Driver of the
Information Age is Digital Data
Education
Entertainment
Shopping
Health
Information
Business
Digital Data Critical for Research and Education

[Figure: Data from multiple sources in the Geosciences (geologic, geochemical, geophysical, and geochronologic data and foliation maps) feed disciplinary databases, data integration, and user access and use, answering questions such as "Where should we drill for oil?", "What is the impact of global warming?", and "How are the continents shifting?" Data at multiple scales in the Biosciences (anatomy/organisms, physiology/organs, cell biology/cells, proteomics/organelles, genomics, and medicinal chemistry/biopolymers/atoms) require complex "multiple-worlds" mediation for data integration, answering questions such as "What genes are associated with cancer?" and "What parts of the brain are responsible for Alzheimer's?"]
Today’s Presentation
• Data Cyberinfrastructure Today – Designing
and developing infrastructure to enable
today’s data-oriented applications
• Challenges in Building and Delivering
Capable Data Infrastructure
• Sustainable Digital Preservation – Grand
Challenge for the Information Age
Data Cyberinfrastructure Today –
Designing and Developing Infrastructure for
Today’s Data-Oriented Applications
Today's Data-oriented Applications Span the Spectrum

Data-oriented applications range across data and high performance computing, data and Grids, and data and cyberinfrastructure services.

[Figure: Designing infrastructure for data. Applications are plotted against three axes: DATA (more bytes), COMPUTE (more FLOPS), and NETWORK (more bandwidth). Home, lab, campus, and desktop applications sit near the origin; data-intensive applications rank high on the data axis; compute-intensive HPC applications rank high on the compute axis; grid applications extend along the network axis; applications that are both data-intensive and compute-intensive occupy the high-data, high-compute corner.]
Data and High Performance Computing
• For many applications, "balanced systems" are needed to support codes that are both data-intensive and compute-intensive. These are codes for which:
  • I/O rates exceed WAN capabilities
  • Continuous and frequent I/O is latency-intolerant
  • Data must be local to computation
  • Scalability is key
  • High-bandwidth, large-capacity local parallel file systems and archival storage are needed
• Grid platforms are not a strong option for such codes
Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands

Estimated figures for a simulated 240-second period with a 100-hour run time:

                                TeraShake domain           PetaShake domain
                                (600x300x80 km^3)          (800x400x100 km^3)
  Fault system interaction      NO                         YES
  Inner scale                   200 m                      25 m
  Resolution of terrain grid    1.8 billion mesh points    2.0 trillion mesh points
  Magnitude of earthquake       7.7                        8.1
  Time steps                    20,000 (.012 sec/step)     160,000 (.0015 sec/step)
  Surface data                  1.1 TB                     1.2 PB
  Volume data                   43 TB                      4.9 PB

Information courtesy of the Southern California Earthquake Center
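The mesh counts in the table follow directly from the domain dimensions and the inner scale. A minimal sanity check of that arithmetic (an illustrative Python sketch written for this writeup, not SCEC code):

    # Verify the mesh-point counts implied by domain size and inner scale.
    def mesh_points(extent_km, spacing_km):
        nx, ny, nz = (e / spacing_km for e in extent_km)
        return nx * ny * nz

    terashake = mesh_points((600, 300, 80), 0.200)   # ~1.8e9:  1.8 billion points
    petashake = mesh_points((800, 400, 100), 0.025)  # ~2.0e12: 2.0 trillion points
    print(f"TeraShake: {terashake:.2e}, PetaShake: {petashake:.2e}")

The roughly thousand-fold jump in mesh points, combined with eight times as many time steps, is what pushes the outputs from terabytes to petabytes.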
Data and HPC: What you see is what you've measured
• FLOPS alone are not enough.
• Three systems using the same processor and number of processors (64 AMD Opteron processors at 2.2 GHz); the difference is in the way the processors are interconnected:
  • Cray XD1 – custom interconnect
  • Dalco Linux Cluster – Quadrics interconnect
  • Sun Fire Cluster – Gigabit Ethernet interconnect
• The HPC Challenge benchmarks measure different machine characteristics:
  • Linpack and matrix multiply are computationally intensive
  • PTRANS (matrix transpose), RandomAccess, bandwidth/latency tests, and other tests begin to reflect stress on the memory system
• Appropriate benchmarks are needed to rank, and bring visibility to, the more balanced machines critical for today's applications.

Information courtesy of Jack Dongarra
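To make the point concrete, here is a deliberately simplified timing model (all numbers are invented for illustration; this is not HPC Challenge code) showing how a communication-heavy benchmark phase is dominated by interconnect bandwidth and latency rather than peak FLOPS:

    # Toy model: run time = compute term + bandwidth term + latency term.
    def run_time(flops, bytes_moved, messages, peak_flops, bw_bytes_s, latency_s):
        return flops / peak_flops + bytes_moved / bw_bytes_s + messages * latency_s

    # A PTRANS-like phase: little compute, heavy data motion.
    phase = dict(flops=1e9, bytes_moved=1e10, messages=1e5)
    for name, bw, lat in [("custom interconnect", 8e9, 2e-6),
                          ("Quadrics", 4e9, 5e-6),
                          ("Gigabit Ethernet", 1.2e8, 5e-5)]:
        t = run_time(peak_flops=3.5e11, bw_bytes_s=bw, latency_s=lat, **phase)
        print(f"{name}: {t:.1f} s")

Identical processors and identical peak FLOPS, yet order-of-magnitude differences in run time once the memory and network terms dominate.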
Data and Grids
• Data applications were some of the first applications that
  • required Grid environments
  • could naturally tolerate longer latencies
• The Grid model supports key data application profiles:
  • Compute at site A with data from site B
  • Store a data collection at site A with copies at sites B and C
  • Operate an instrument at site A; move data to site B for storage, post-processing, etc.

CERN data is providing a key driver for grid technologies.
Data Services Key for TeraGrid Science Gateways
• Science Gateways provide a common application interface for science communities on TeraGrid
• Data services are key for Gateway communities:
  • Analysis
  • Visualization
  • Management
  • Remote access, etc.
• Example Gateways: NVO, LEAD, GridChem

Information and images courtesy of Nancy Wilkins-Diehr
Unifying Data over the Grid – the TeraGrid GPFS-WAN Effort
• User wish list:
  • Unlimited data capacity (everyone's aggregate storage almost looks like this)
  • Transparent, high-speed access anywhere on the Grid
  • Automatic archiving and retrieval
  • No latency
• The TeraGrid GPFS-WAN effort focuses on providing "infinite" storage over the grid:
  • Looks like local disk to grid sites
  • Uses automatic migration with a large cache to keep files always "online" and accessible
  • Data automatically archived without user intervention

Information courtesy of Phil Andrews
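The "looks like local disk, backed by automatic migration and archiving" behavior can be sketched as a bounded cache in front of an archive. This is our illustration of the idea only, with invented names; it is not the GPFS-WAN implementation:

    from collections import OrderedDict

    class CachedArchive:
        """Bounded local cache backed by an (effectively infinite) archive."""
        def __init__(self, capacity, archive):
            self.capacity, self.archive = capacity, archive
            self.cache = OrderedDict()              # path -> bytes, in LRU order

        def read(self, path):
            if path not in self.cache:              # miss: stage in from archive
                self.cache[path] = self.archive[path]
            self.cache.move_to_end(path)            # mark most recently used
            while len(self.cache) > self.capacity:  # evict; archive copy remains
                self.cache.popitem(last=False)
            return self.cache[path]

        def write(self, path, data):
            self.archive[path] = data               # archived without user action
            self.cache[path] = data
            while len(self.cache) > self.capacity:
                self.cache.popitem(last=False)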
Data Services – Beyond Storage to Use

What services do users want?
• How do I make sure that my data will be there when I want it?
• How should I display my data?
• How can I combine my data with my colleague's data?
• What are the trends and what is the noise in my data?
• How should I organize my data?
• My data is confidential; how do I make sure that it is seen/used only by the right people?
• How can I make my data accessible to my collaborators?
Services: Integrated Environment Key to Usability

[Figure: An integrated stack. Many data sources (instruments, sensornets, computers) feed data storage; data management (file systems, database systems, collection management, data integration, etc.); data manipulation; and data access, in support of analysis, modeling, simulation, and visualization.]

Data services include:
• Database selection and schema design
• Portal creation and collection publication
• Data analysis
• Data mining
• Data hosting
• Preservation services
• Domain-specific tools

Integrated infrastructure examples:
• Biology Workbench
• Montage (astronomy mosaicking)
• Kepler (workflow management)
• Data visualization
• Data anonymization, etc.
Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data
• Broad program to support research and community data collections and databases
• DataCentral services include:
  • Public data collection and database hosting
  • Long-term storage and preservation (tape and disk)
  • Remote data management and access (SRB, portals)
  • Data analysis, visualization, and data mining
  • Professional, qualified 24/7 support
  • Web-based portal access
• DataCentral resources include:
  • 1 PB of on-line disk
  • 25 PB of StorageTek tape library capacity
  • 540 TB of storage-area network (SAN) capacity
  • DB2, Oracle, and MySQL databases
  • Storage Resource Broker
  • GPFS-WAN with 700 TB

[Image: web-based portal access to hosted collections such as the 28 TB Protein Data Bank (PDB).]
DataCentral Allocated Collections include:
• Seismology: 3D Ground Motion Collection for the LA Basin
• Atmospheric Sciences: 50-year Downscaling of Global Analysis over California Region
• Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
• Life Sciences: Protein Data Bank
• Neurobiology: Salk data
• Geosciences: GEON
• Seismology: SCEC TeraShake
• Geosciences: GEON-LIDAR
• Seismology: SCEC CyberShake
• Geochemistry: Kd
• Oceanography: SIO Explorer
• Biology: Gene Ontology
• Networking: Skitter
• Astronomy: Sloan Digital Sky Survey
• Geochemistry: GERM
• Networking: HPWREN
• Geology: Sensitive Species Map Server
• Ecology: HyperLter
• Geology: SD and Tijuana Watershed data
• Elementary Particle Physics: AMANDA data
• Biology: AfCS Molecule Pages
• Biomedical Neuroscience: BIRN
• Networking: IMDC
• Oceanography: Seamount Catalogue
• Networking: Backbone Header Traces
• Biology: Interpro Mirror
• Oceanography: Seamounts Online
• Networking: Backscatter Data
• Biology: JCSG Data
• Biodiversity: WhyWhere
• Biology: Bee Behavior
• Government: Library of Congress Data
• Geophysics: Magnetics Information Consortium data
• Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
• Structural Engineering: TeraBridge
• Biology: Biocyc (SRI)
• Art: C5 landscape Database
• Geology: Chronos
• Biology: CKAAPS
• Education: UC Merced Japanese Art Collections
• Various: TeraGrid data collections
• Biology: DigEmbryo
• Geochemistry: NAVDAT
• Biology: Transporter Classification Database
• Earth Science Education: ERESE
• Earthquake Engineering: NEESIT data
• Biology: TreeBase
• Earth Sciences: UCI ESMF
• Art: Tsunami Data
• Education: NSDL
• Education: ArtStor
• Astronomy: NVO
• Biology: Yeast regulatory network
• Earth Sciences: EarthRef.org
• Earth Sciences: ERDA
• Earth Sciences: ERR
• Government: NARA
• Biology: Apoptosis Database
• Biology: Encyclopedia of Life
• Anthropology: GAPP
• Cosmology: LUSciD
Data Visualization is key
• Visualization of cancer tumors
• SCEC earthquake simulations
• Prokudin-Gorskii historical images

Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center, David Minor, U.S. Library of Congress
Building and Delivering Capable Data
Cyberinfrastructure
Infrastructure Should be Non-memorable
• Good infrastructure should be
  • Predictable
  • Pervasive
  • Cost-effective
  • Easy-to-use
  • Reliable
  • Unsurprising
• What's required to build and provide useful, usable, and capable data cyberinfrastructure?
Building Capable Data Cyberinfrastructure: Incorporating the "ilities"
• Scalability
• Interoperability
• Reliability
• Capability
• Sustainability
• Predictability
• Accessibility
• Responsibility
• Accountability
• …
Reliability
• How can we maximize data reliability?
  • Replication, UPS systems, heterogeneity, etc.
• How can we measure data reliability?
  • Network availability = 99.999% uptime ("5 nines")
  • What is the equivalent number of "0's" for data reliability?

Reliability: what can go wrong

  Entity at risk   What can go wrong                                     Frequency
  File             Corrupted media, disk failure                         1 year
  Tape             + Simultaneous failure of 2 copies                    5 years
  System           + Systemic errors in vendor SW, or malicious user,
                     or operator error that deletes multiple copies      15 years
  Archive          + Natural disaster, obsolescence of standards         50-100 years

Information courtesy of Reagan Moore
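As a worked example of the measurement question: availability "nines" translate mechanically into downtime, and one plausible analogue for data (our assumption, not an established standard) is the expected number of objects lost per year at a given per-object annual survival probability:

    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_minutes_per_year(nines):
        # 99.999% uptime leaves a 1e-5 fraction of the year as downtime.
        return MINUTES_PER_YEAR * 10 ** (-nines)

    print(downtime_minutes_per_year(5))          # "5 nines" ~= 5.3 minutes/year

    def expected_objects_lost(n_objects, nines):
        # Treat "nines" as per-object annual survival probability.
        return n_objects * 10 ** (-nines)

    print(expected_objects_lost(10_000_000, 9))  # 0.01 objects/year at 9 nines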
Responsibility and Accountability
• What are reasonable
expectations between users
and repositories?
• What are reasonable
expectations between
federated partner repositories?
• Who owns the data?
• Who takes care of the data?
• Who pays for the data?
• Who can access the data?
• What are appropriate models
for evaluating repositories?
• What incentives promote good
stewardship? What should
happen if/when the system
fails?
Good Data Infrastructure Incurs Real Costs

Capability costs:
• Reliability increased by up-to-date and robust hardware and software for
  • Replication (disk, tape, geographic)
  • Backups, updates, syncing
  • Audit trails
  • Verification through checksums, physical media, network transfers, copies, etc.
• Most valuable data must be replicated
• Data professionals needed to facilitate
  • Infrastructure maintenance
  • Long-term planning
  • Restoration and recovery
  • Access, analysis, preservation, and other services
  • Reporting, documentation, etc.

Capacity costs:
• SDSC research collections have been doubling every 15 months.
• SDSC storage is 25 PB and counting.
• Data is from supercomputer simulations, digital library collections, etc.

[Figure: Archival storage at SDSC (TB stored, log scale from 10 to 100,000 TB) from June 1997 through planned capacity in June 2009; Model A assumes doubling every 15.2 months over 8 years.]

Information courtesy of Richard Moore
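The capacity curve is just compound growth. A minimal sketch of the planning arithmetic, using the 15-month doubling time and the 25 PB base from this slide (the projection horizon is our choice for illustration):

    def projected_capacity_pb(base_pb, months, doubling_months=15):
        # Compound growth: capacity doubles every `doubling_months`.
        return base_pb * 2 ** (months / doubling_months)

    for years in (1, 2, 4, 8):
        print(f"{years} yr: {projected_capacity_pb(25, 12 * years):.0f} PB")
    # Doubling every 15 months is ~1.74x per year, so 8 years is roughly 85x.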
Economic Sustainability
• Making infinite funding finite:
  • It is difficult to support infrastructure for data preservation as an infinite, increasing mortgage
• Creative partnerships help create sustainable economic models:
  • Relay funding
  • User fees, recharges
  • Consortium support
  • Endowments
  • Hybrid solutions

[Image: Geisel Library at UCSD]
Preserving Digital Information
Over the Long Term
How much Digital Data is there?
• 161 exabytes of digital information were produced in 2006
  • 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.)
  • 75% is replicated (emails forwarded, backed-up transaction records, movies in DVD format)
• 5 exabytes of digital information were produced in 2003
• 1 zettabyte of aggregate digital information is projected for 2010

[Figure: The scale of bytes: Kilo (10^3), Mega (10^6), Giga (10^9), Tera (10^12), Peta (10^15), Exa (10^18), Zetta (10^21). Reference points: 1 novel = 1 megabyte; an iPod (up to 20K songs) = 80 GB; the U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is "born digital"; the SDSC HPSS tape archive = 25+ petabytes.]

Source: "The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010," IDC Whitepaper, March 2007
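For a sense of the growth rates these IDC figures imply, the annualized rates between the quoted data points work out as follows (a worked check, treating 1 zettabyte as 1,000 exabytes):

    # Implied compound annual growth rates between the quoted data points.
    cagr_2003_2006 = (161 / 5) ** (1 / 3) - 1      # ~218% per year
    cagr_2006_2010 = (1000 / 161) ** (1 / 4) - 1   # ~58% per year
    print(f"2003-2006: {cagr_2003_2006:.0%}/yr; 2006-2010: {cagr_2006_2010:.0%}/yr")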
How much Storage is there?
• 2007 is the "crossover year" in which the amount of digital information produced exceeds the amount of available storage
• Given projected rates of growth, we will never again have enough space for all digital information

Source: "The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010," IDC Whitepaper, March 2007
Focus for Preservation: the "most valuable" data
• What is "valuable"?
  • Community reference data collections (e.g. UniProt, PDB)
  • Irreplaceable collections
  • Official collections (e.g. census data, electronic federal records)
  • Collections which are very expensive to replicate (e.g. CERN data)
  • Longitudinal and historical data
  • and others …

[Figure: The value of a collection and the cost of maintaining it, plotted over time.]
A Framework for Digital Stewardship
• Preservation efforts should focus on collections deemed "most valuable"
• Key issues:
  • What do we preserve?
  • How do we guard against data loss?
  • Who is responsible?
  • Who pays? Etc.

[Figure: The Data Pyramid of digital data collections and repositories/facilities. At the local scale: personal data collections and private repositories. At the "regional" scale: key research and community data collections, held by "regional"-scale libraries and targeted data centers. At the national/international scale: reference, nationally important, and irreplaceable data collections, held by national- and international-scale data repositories, archives, and libraries. Value, trust, stability, infrastructure, and risk/responsibility all increase toward the top of the pyramid.]
Digital Collections of Community Value
• Key techniques for preservation: replication, heterogeneous support

[Figure: The Data Pyramid again, spanning local, "regional", and national/international scales.]
Chronopolis: A Conceptual Model for Preservation Data Grids

The Chronopolis model:
• A geographically distributed preservation data grid that supports long-term management and stewardship of, and access to, digital collections
• Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
• Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation

[Figure: Digital information of long-term value sits at the center of three supporting functions: a distributed production preservation environment; technology forecasting and migration; and administration, policy, and outreach.]
Chronopolis Focus Areas and Demonstration Project Partners
• Chronopolis R&D, policy, and infrastructure focus areas:
  • Assessment of the needs of potential user communities and development of appropriate service models
  • Development of formal roles and responsibilities of providers, partners, and users
  • Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
  • Development of appropriate cost and risk models for long-term preservation
  • Development of appropriate success metrics to evaluate the usefulness, reliability, and usability of infrastructure
• Two prototypes: the National Demonstration Project and the Library of Congress Pilot Project
• Partners: SDSC/UCSD, UCSD Libraries, U Maryland, NCAR, NARA, Library of Congress, NSF, ICPSR, Internet Archive, NVO

Demonstration Project information courtesy of Robert McDonald
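One of the "best practices for bit preservation" named above is fixity checking: recomputing each file's checksum and comparing it against the value recorded at ingest. A minimal sketch of the idea (the manifest format and paths are hypothetical; this is not Chronopolis tooling):

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        # Stream the file in 1 MB chunks so large archival files fit in memory.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk_size):
                digest.update(block)
        return digest.hexdigest()

    def audit(manifest):
        """manifest: {path: digest recorded at ingest} -> paths failing fixity."""
        return [p for p, expected in manifest.items() if sha256_of(p) != expected]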
National Demonstration Project – Large-scale Replication and Distribution
• Focus on supporting multiple, geographically distributed copies of preservation collections:
  • "Bright copy" – a Chronopolis site supports ingestion, collection management, and user access
  • "Dim copy" – a Chronopolis site supports a remote replica of the bright copy and supports user access
  • "Dark copy" – a Chronopolis site supports a reference copy that may be used for disaster recovery but offers no user access
  • Each site may play different roles for different collections

[Figure: Chronopolis Federation architecture. In the demonstration, SDSC holds the bright copy of collection C1 and a dim copy of C2; NCAR holds a dim copy of C1 and the bright copy of C2; U Maryland holds dark copies of both C1 and C2.]

• Demonstration collections included:
  • National Virtual Observatory (NVO) [1 TB Digital Palomar Observatory Sky Survey]
  • A copy of Interuniversity Consortium for Political and Social Research (ICPSR) data [1 TB of web-accessible data]
  • NCAR observational data [3 TB of observational and re-analysis data]
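The bright/dim/dark assignments in the figure amount to a small policy table: each (site, collection) pair has a role, and the role decides whether user access is allowed. A minimal sketch (the data structure is our illustration, not Chronopolis software):

    # Role of each site for each collection, following the figure above.
    ROLES = {
        ("SDSC", "C1"): "bright", ("NCAR", "C1"): "dim",    ("UMd", "C1"): "dark",
        ("SDSC", "C2"): "dim",    ("NCAR", "C2"): "bright", ("UMd", "C2"): "dark",
    }

    def user_access_allowed(site, collection):
        # Bright copies support ingestion, management, and access; dim copies
        # support access; dark copies are reserved for disaster recovery.
        return ROLES.get((site, collection)) in ("bright", "dim")

    print(user_access_allowed("UMd", "C1"))  # False: dark copies allow no user access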
SDSC/UCSD Libraries Pilot Project with U.S. Library of Congress

Goal: to "… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress' requirements."

• A historically important 600 GB Library of Congress image collection: the Prokudin-Gorskii photographs (Library of Congress Prints and Photographs Division), http://www.loc.gov/exhibits/empire/ (also a collection of web crawls from the Internet Archive)
  • Images over 100 years old with red, blue, and green components (kept as separate digital files)
• SDSC stores 5 copies, with a dark archival copy at NCAR
• The infrastructure must support an idiosyncratic file structure; special logging and monitoring software was developed so that both SDSC and the Library of Congress could access information

Library of Congress Pilot Project information courtesy of David Minor
Pilot Projects provided invaluable experience with key Issues
• Infrastructure issues
  • What kinds of resources (servers, storage, networks) are required?
  • How should they operate?
• Technical issues
  • How to address integrity, verification, provenance, authentication, etc.?
• Legal/policy issues
  • Who is responsible?
  • Who is liable?
• Evaluation issues
  • What is reliable?
  • What is successful?
• Social issues
  • What formats/standards are acceptable to the community?
  • How do we formalize trust?
• Cost issues
  • What is cost-effective?
  • How can support be sustained over time?
It's Hard to be Successful in the Information Age without Reliable, Persistent Information
• Inadequate/unrealistic general solution: "Let X do it", where X is:
  • The government
  • The libraries
  • The archivists
  • Google
  • The private sector
  • Data owners
  • Data generators, etc.
• Creative partnerships are needed to provide preservation solutions with:
  • Trusted stewards
  • Feasible costs for users
  • Sustainable costs for infrastructure
  • Very low risk of data loss, etc.
Blue Ribbon Task Force to Focus on Economic Sustainability
• The international Blue Ribbon Task Force (BRTF-SDPA) is to begin in 2008 to study issues of economic sustainability of digital preservation and access
• Support from:
  • National Science Foundation
  • Library of Congress
  • Mellon Foundation
  • Joint Information Systems Committee
  • National Archives and Records Administration
  • Council on Library and Information Resources

[Figure: Stakeholders surrounding the user: university, college, state, federal, local, international, non-profit, and commercial. Image courtesy of Chris Greer, Office of Cyberinfrastructure.]
BRTF-SDPA

Charge to the Task Force:
1. To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation (first-year report);
2. To identify and evaluate best practice regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises;
3. To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information (second-year report);
4. To provide a research agenda to organize and motivate future work.

How you can be involved:
• Contribute your ideas (oral and written "testimony")
• Suggest readings (the website will serve as a community bibliography)
• Write an article on the issues for a new community (an important component will be to educate decision makers and the public about digital preservation)

The website is to be launched this fall and will be linked from www.sdsc.edu.
Many Thanks

Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, the authors of the IDC Report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, the Southern California Earthquake Center, David Minor, Amit Chourasia, the U.S. Library of Congress, Moores Cancer Center, the National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others …

www.sdsc.edu
berman@sdsc.edu