Welcome and Cyberinfrastructure Overview
MSI Cyberinfrastructure Institute
June 26-30, 2006
Anke Kamrath
Division Director, San Diego Supercomputer Center
kamratha@sdsc.edu
SAN DIEGO SUPERCOMPUTER CENTER
UCSD
The Digital World
Entertainment
Shopping
Information
Science is a Team Sport
Data Management and Mining
Astronomy
Geosciences
Life Sciences
GAMESS
Modeling and Simulation
QCD
Physics
Cyberinfrastructure – A Unifying Concept
Cyberinfrastructure = resources (computers, data storage, networks, scientific instruments, experts, etc.) + “glue” (integrating software, systems, and organizations).
NSF’s “Atkins Report” provided a compelling vision for integrated Cyberinfrastructure
A Deluge of Data
• Today data comes from everywhere
  • “Volunteer” data
  • Scientific instruments
  • Experiments
  • Sensors and sensornets
  • Computer simulations
  • New devices (personal digital devices, computer-enabled clothing, cars, …)
• And is used by everyone
  • Researchers, educators
  • Consumers
  • Practitioners
  • General public
Turning the deluge of data into usable information for the research and education community requires an unprecedented level of integration, globalization, scale, and access
[Figure labels: Data from sensors; Data from instruments; Data from simulations; Volunteer data; Data from analysis]
Using Data as a Driver: SDSC Cyberinfrastructure
Community Databases and Data Collections, Data management, mining and preservation
Data-oriented HPC, Resources, High-end storage, Large-scale data analysis, simulation, modeling
Data-oriented Tools, SW Applications, and Community Codes
Collaboration, Service and Community Leadership for Data-oriented Projects
Data- and Computational Science Education and Training
[Figure labels: SDSC Data Cyberinfrastructure; Biology Workbench; Summer Institute; SRB]
Impact on Technology: Data and Storage are Integral to Today’s Information Infrastructure
• Today’s “computer” is a coordinated set of hardware, software, and services providing an “end-to-end” resource.
• Cyberinfrastructure captures how the research and education community has redefined “computer”
[Figure labels: wireless sensors; field computer; field instrument; network; computer; data; data storage; viz]
Data and storage are an integral part of today’s “computer”
Building a National Data Cyberinfrastructure Center
Goal: SDSC’s Data Cyberinfrastructure should “extend the reach” of the local research and education environment.
Access to community and reference data collections
More capable and/or higher capacity computational resources
Community codes, middleware, software tools and toolkits
Multi-disciplinary expertise
Long-term Scientific Data Preservation
Impact on Applications: Data-oriented Research Driving the Next Generation of Technology Challenges
[Figure: application classes plotted by Data (more BYTES) against Compute (more FLOPS): Data-oriented Research Applications; Home, Lab, Campus, Desktop Applications; Traditional HPC Applications]
Today’s Research Applications Span the Spectrum
[Figure: research applications plotted by Data (more BYTES) against Compute (more FLOPS)]
Regions: Data Mgt. Envt.; Extreme I/O Environment; Data-oriented Environment; Home, Lab, Campus, Desktop; Traditional HPC environment
Applications shown: SCEC (visualization, simulation); ENZO (simulation, visualization); EOL; Climate; NVO; GridSAT; CiPres; Seti@Home; Turbulence field; CFD; MCell; Protein Folding/MD; CPMD; QCD; GAMESS; EverQuest; Turbulence Reattachment length
Annotations: Could be targeted efficiently on Grid; Difficult to target efficiently on Grid; Lends itself to Grid
Working with Compute and Data – Simulation, Analysis, Modeling
Simulation of a Southern California magnitude 7.7 earthquake on the lower San Andreas Fault
• Physics-based dynamic source model – simulation of mesh of 1.8 billion cubes with spatial resolution of 200 m
• Builds on 10 years of data and models from the Southern California Earthquake Center
• Simulated first 3 minutes of a magnitude 7.7 earthquake, 22,728 time steps of 0.011 second each
• Simulation generates 45+ TB data
Resources Required
Computers and Systems
• 80,000 hours on DataStar
• 256 GB memory p690 used for testing, p655s used for production run, TG used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours post-processing for high resolution rendering
People
• 20+ people for IT support
• 20+ people in domain research
Storage
• SAM-QFS archival storage
• HPSS backup
• SRB Collection with 1,000,000 files
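The data volume and transfer rate quoted on this slide can be related with simple arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a constant transfer rate with no overhead, and "45+ TB" taken as exactly 45 TB; the function name is illustrative and not part of any SDSC tool.

```python
def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate (no overhead assumed)."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

output_bytes = 45 * TB   # "45+ TB" from the slide, taken as 45 TB
rate = 100 * MB          # "100 MB/s" GPFS to SAM-QFS transfer rate
steps = 22_728           # time steps quoted on the slide

seconds = transfer_seconds(output_bytes, rate)
days = seconds / 86_400
bytes_per_step = output_bytes / steps  # average output per time step

print(f"Moving 45 TB at 100 MB/s takes about {days:.1f} days")
print(f"Average output per time step: {bytes_per_step / 10**9:.2f} GB")
```

Under these assumptions, streaming the output at the quoted rate takes on the order of days, which is why the transfer is done at run time rather than after the simulation finishes.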
Big Data & Big Compute: Simulating an earthquake 1:
1. Divide up Southern California into “blocks”
2. For each block, get all the data on ground surface composition, geological structures, fault information, etc.
[Figure: The Southern San Andreas Fault]
Big Data & Big Compute: Simulating an earthquake 2:
3. Map the blocks on to processors (brains) of the computer
SDSC’s DataStar – one of the 25 fastest computers in the world
Big Data & Big Compute: Simulating an earthquake 3:
4. Run the simulation using current information on fault activity and the physics of earthquakes
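The "divide into blocks, map blocks on to processors" idea can be sketched as a simple 3D domain decomposition. This is a minimal illustration only: the grid size, the processor layout, and the function names are assumptions chosen for the example, not the decomposition actually used for TeraShake on DataStar.

```python
from itertools import product

def split_axis(n_cells: int, n_parts: int) -> list:
    """Split n_cells into n_parts contiguous [start, stop) ranges of near-equal size."""
    base, extra = divmod(n_cells, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def decompose(grid: tuple, procs: tuple) -> dict:
    """Map each processor rank to the block of grid cells it owns."""
    axes = [split_axis(n, p) for n, p in zip(grid, procs)]
    return {rank: block for rank, block in enumerate(product(*axes))}

def block_cells(block: tuple) -> int:
    """Number of grid cells inside one block."""
    count = 1
    for start, stop in block:
        count *= stop - start
    return count

# Hypothetical small grid (30 x 15 x 4 cells) on a 4 x 2 x 2 processor layout.
blocks = decompose(grid=(30, 15, 4), procs=(4, 2, 2))
total_cells = sum(block_cells(b) for b in blocks.values())
print(len(blocks), "blocks covering", total_cells, "cells")
```

Each processor then advances the simulation for its own block and exchanges boundary values with its neighbours at every time step; that exchange is omitted here.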
Simulating an earthquake 4:
5. The simulation outputs data on seismic wave velocity, earthquake magnitude, and other characteristics
Managing the data
• Where to store the data?
  • In HPSS, a tape storage library that can hold 10 PetaBytes (10,000 Terabytes) -- 500 times the printed materials in the Library of Congress
• How much data was output?
  • 47 TeraBytes which is
    • 2+ times the printed materials in the Library of Congress! or
    • The amount of music in 2000+ iPods! or
    • 47 million copies of a typical DVD movie!
How long will TeraShake take on your desktop computer?
Computing Platform | Number of Processors | Floating Point (arithmetic) Operations per second | Can run TeraShake in
Desktop | 1 | 5.3 billion | 72 centuries!
DataStar at SDSC | 1024 (240 used for TeraShake) | 10.4 trillion (approximate) | 5 days
Better Neurosurgery Through Cyberinfrastructure
Radiologists and neurosurgeons at Brigham and Women’s Hospital, Harvard Medical School exploring transmission of 30/40 MB brain images (generated during surgery) to SDSC for analysis and alignment
• PROBLEM: Neurosurgeons seek to remove as much tumor tissue as possible while minimizing removal of healthy brain tissue
• Brain deforms during surgery
• Surgeons must align preoperative brain image with intra-operative images to provide surgeons the best opportunity for intra-surgical navigation
Transmission repeated every hour during 6-8 hour surgery. Transmission and output must take on the order of minutes
Finite element simulation on biomechanical model for volumetric deformation performed at SDSC; output results are sent to BWH where updated images are shown to surgeons
Community Data Repository: SDSC DataCentral
• Provides “data allocations” on SDSC resources to national science and engineering community
  • Data collection and database hosting
  • Batch oriented access
  • Collection management services
• First broad program of its kind to support research and community data collections and databases
• Comprehensive resources
  • Disk: 400 TB accessible via HPC systems, Web, SRB, GridFTP
  • Databases: DB2, Oracle, MySQL
  • SRB: Collection management
  • Tape: 6 PB, accessible via file system, HPSS, Web, SRB, GridFTP
• 24/7 operations, collection specialists
Example Allocated Data Collections include
• Bee Behavior (Behavioral Science)
• C5 Landscape DB (Art)
• Molecular Recognition Database (Pharmaceutical Sciences)
• LIDAR (Geoscience)
• AMANDA (Physics)
• SIO_Explorer (Oceanography)
• Tsunami and Landsat Data (Earthquake Engineering)
• Terabridge (Structural Engineering)
DataCentral infrastructure includes: Web-based portal, security, networking, UPS systems, web services and software tools
Public Data Collections Hosted in SDSC’s DataCentral
Discipline | Collection
Seismology | 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences | 50 year Downscaling of Global Analysis over California Region
Earth Sciences | NEXRAD Data in Hydrometeorology and Hydrology
Life Sciences | Protein Data Bank
Neurobiology | Salk data
Geosciences | GEON
Seismology | SCEC TeraShake
Geosciences | GEON-LIDAR
Seismology | SCEC CyberShake
Geochemistry | Kd
Oceanography | SIO Explorer
Biology | Gene Ontology
Networking | Skitter
Astronomy | Sloan Digital Sky Survey
Geochemistry | GERM
Networking | HPWREN
Geology | Sensitive Species Map Server
Ecology | HyperLter
Geology | SD and Tijuana Watershed data
Elementary Particle Physics | AMANDA data
Biology | AfCS Molecule Pages
Biomedical Neuroscience | BIRN
Networking | IMDC
Oceanography | Seamount Catalogue
Networking | Backbone Header Traces
Biology | Interpro Mirror
Oceanography | Seamounts Online
Networking | Backscatter Data
Biology | JCSG Data
Biodiversity | WhyWhere
Biology | Bee Behavior
Government | Library of Congress Data
Geophysics | Magnetics Information Consortium data
Ocean Sciences | Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering | TeraBridge
Biology | Biocyc (SRI)
Art | C5 landscape Database
Geology | Chronos
Biology | CKAAPS
Biology | DigEmbryo
Education | UC Merced Japanese Art Collections
Various | TeraGrid data collections
Geochemistry | NAVDAT
Biology | Transporter Classification Database
Earth Science Education | ERESE
Earthquake Engineering | NEESIT data
Biology | TreeBase
Earth Sciences | UCI ESMF
Art | Tsunami Data
Education | NSDL
Education | ArtStor
Astronomy | NVO
Biology | Yeast regulatory network
Earth Sciences | EarthRef.org
Earth Sciences | ERDA
Earth Sciences | ERR
Government | NARA
Biology | Apoptosis Database
Biology | Encyclopedia of Life
Anthropology | GAPP
Cosmology | LUSciD
Data Cyberinfrastructure Requires a Coordinated Approach
[Figure: layered stack, with interoperability and integration spanning the layers]
Layers:
Applications: Medical informatics, Biosciences, Ecoinformatics,…
Visualization
Data Mining, Simulation Modeling, Analysis, Data Fusion
Knowledge-Based Integration, Advanced Query Processing
Grid Storage, Filesystems, Database Systems
High speed networking, Networked Storage (SAN), Storage hardware, HPC, sensornets, instruments
Questions posed alongside the layers:
How do we represent data, information and knowledge to the user?
How do we detect trends and relationships in data?
How do we obtain usable information from data?
How do we collect, access and organize data?
How do we configure computer architectures to optimally support data-oriented computing?
How do we combine data, knowledge and information management with simulation and modeling?
Working with Data: Data Integration for New Discovery
Data Integration in the Biosciences
[Figure: Users; Software to access data; Software to federate data; Disciplinary Databases]
Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry
Scales: Organisms, Organs, Cells, Organelles, Biopolymers, Atoms
Data Integration in the Geosciences
Where can we most safely build a nuclear waste dump?
Where should we drill for oil?
What is the distribution and U/Pb zircon ages of A-type plutons in VA?
How does it relate to host rock structures?
[Figure: Data Integration through complex “multiple-worlds” mediation across GeoChemical, Geologic Map, GeoPhysical, GeoChronologic, and Foliation Map data]
Preserving Data over the Long-Term
Data Preservation
• Many Science, Cultural, and Official Collections must be sustained for the foreseeable future
• Critical collections must be preserved:
  • community reference data collections (e.g. Protein Data Bank)
  • irreplaceable collections (e.g. field data – tsunami recon)
  • longitudinal data (e.g. PSID – Panel Study of Income Dynamics)
No plan for preservation often means that data is lost or damaged
“….the progress of science and useful arts … depends on the reliable preservation of knowledge and information for generations to come.”
“Preserving Our Digital Heritage”, Library of Congress
How much Digital Data*?
Scale: Kilo = 10^3; Mega = 10^6; Giga = 10^9; Tera = 10^12; Peta = 10^15; Exa = 10^18
1 Low Resolution Photo = 100 KiloBytes
1 novel = 1 MegaByte
iPod Shuffle (up to 120 songs) = 512 MegaBytes
Printed materials in the Library of Congress = 10 TeraBytes
1 human brain at the micron level = 1 PetaByte
SDSC HPSS tape archive = 6 PetaBytes
All worldwide information in one year = 2 ExaBytes
* Rough/average estimates
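The prefixes on this slide make it easy to compare quantities once everything is expressed in bytes. The following is a minimal sketch using the decimal prefixes and rough estimates quoted above; the dictionary and function names are illustrative.

```python
# Decimal prefixes as listed on the slide.
PREFIXES = {
    "Kilo": 10**3,
    "Mega": 10**6,
    "Giga": 10**9,
    "Tera": 10**12,
    "Peta": 10**15,
    "Exa": 10**18,
}

def to_bytes(value: float, prefix: str) -> int:
    """Convert a quantity such as (6, "Peta") into bytes."""
    return int(value * PREFIXES[prefix])

# Rough estimates taken from the slide.
library_of_congress = to_bytes(10, "Tera")
hpss_archive = to_bytes(6, "Peta")
low_res_photo = to_bytes(100, "Kilo")

archive_vs_library = hpss_archive / library_of_congress
photos_per_library = library_of_congress // low_res_photo

print(f"HPSS archive is about {archive_vs_library:.0f}x the Library of Congress estimate")
print(f"Library of Congress estimate equals about {photos_per_library:,} low resolution photos")
```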
Key Challenges for Digital Preservation
• What should we preserve?
  • What materials must be “rescued”?
  • How to plan for preservation of materials by design?
• How should we preserve it?
  • Formats
  • Storage media
  • Stewardship – who is responsible?
• Who should pay for preservation?
  • The content generators?
  • The government?
  • The users?
• Who should have access?
Print media provides easy access for long periods of time but is hard to data-mine
Digital media is easier to data-mine but requires management of evolution of media and resource planning over time
What can go wrong
Entity at risk | Problem | Frequency
File | Corrupted media, disk failure | 1 year
Tape | + Simultaneous failure of 2 copies | 5 years
System | + Systemic errors in vendor SW, or Malicious user, or Operator error that deletes multiple copies | 15 years
Archive | + Natural disaster, obsolescence of standards | 50 - 100 years
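One common way to detect the first rows of this table (corrupted media, or a failed copy) is to record a checksum for each file and periodically compare every replica against it. The following is a minimal sketch, assuming SHA-256 checksums and replicas reachable as local files; the function names are hypothetical and this is not a description of how SDSC's archive performs its checks.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def damaged_replicas(expected: str, replicas: list) -> list:
    """Return the replicas that are missing or whose checksum differs from expected."""
    bad = []
    for replica in replicas:
        if not replica.is_file() or sha256_of(replica) != expected:
            bad.append(replica)
    return bad
```

Any replica reported by `damaged_replicas` can be restored from a copy that still matches the recorded checksum, which only works while at least one good copy survives.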
SDSC Cyberinfrastructure Community Resources
COMPUTE SYSTEMS
• DataStar
  • 2396 Power4+ processors, IBM p655 and p690 nodes
  • 10 TB total memory
  • Up to 2 GBps I/O to disk
• TeraGrid Cluster
  • 512 Itanium2 IA-64 processors
  • 1 TB total memory
• Intimidata
  • Only academic IBM Blue Gene system
  • 2,048 PowerPC processors
  • 128 I/O nodes
• http://www.sdsc.edu/user_services/
DATA ENVIRONMENT
• 1 PB Storage-area Network (SAN)
• 10 PB StorageTek tape library
• DB2, Oracle, MySQL
• Storage Resource Broker
• HPSS
• 72-CPU Sun Fire 15K
• 96-CPU IBM p690s
• http://datacentral.sdsc.edu/
Support for 60+ community data collections and databases
Data management, mining, analysis, and preservation
SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
• User Services
• Application/Community Collaborations
• Education and Training
• SDSC Synthesis Center
• Community SW, toolkits, portals, codes
• http://www.sdsc.edu/
Thank You
kamratha@sdsc.edu
www.sdsc.edu