Trends and Gaps in the Emerging Middleware and the Impact on Handling Large Data
Philip Papadopoulos
BIRN-CC, OptIPuter
Program Director, Grids and Clusters, SDSC
Biomedical Informatics Research Network
16 September 2003
Agenda
• Hardware Technology Trends – the performance gap in storage systems
• BIRN as an example distributed data grid
• OptIPuter – investigating how networking changes the fundamentals of software/hardware decomposition
• Basic Grid Services as the emerging building block for distributed systems
• Summary of Gaps – from the IT perspective
We’re at a “Crossing of Technology Exponentials”
or a “triple-point” of phase change
Technology Doubling Laws
• Moore's Law – individual computers double in processing power every 18 months
• Storage Law – disk storage capacity doubles every 12 months
• Gilder's Law – network bandwidth doubles every 9 months
• This exponential growth profoundly changes the landscape of information technology
• (High-speed) access to networked information becomes the dominant feature of future computing
• For large-scale images: secure remote access eventually becomes routine
[Chart: relative growth over 4 years – Gilder's Law (32X), Storage Law (16X), Moore's Law (5X); CPU speed is growing the slowest. Source: "The Triumph of Light," Scientific American, Gary Stix, January 2001]
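As a rough cross-check on the chart's 4-year multipliers, here is a minimal sketch assuming pure exponential growth from the doubling periods quoted above (the chart's own annotations are rounded differently):

# Rough check of the 4-year growth factors implied by the doubling laws.
# Assumes pure exponential growth: factor = 2 ** (months / doubling_period).
doubling_period_months = {
    "Moore's Law (CPU)": 18,
    "Storage Law (disk capacity)": 12,
    "Gilder's Law (network bandwidth)": 9,
}

horizon_months = 4 * 12  # the 4-year window used in the chart

for law, period in doubling_period_months.items():
    factor = 2 ** (horizon_months / period)
    print(f"{law}: ~{factor:.0f}x in 4 years")

# Prints roughly 6x, 16x, and 40x -- the same ordering as the chart's
# 5X / 16X / 32X annotations.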
Gap in Raw Storage Transfer Rates
• 4 years ago
  • Capacity: 9 GB (SCSI), 20 GB (IDE)
  • 1 Terabyte ~ 100 disks
  • Transfer rate: 10–15 MB/sec
  • Fill time: ~10 minutes to fill a disk at peak transfer rate
• Today
  • Capacity: 146 GB (SCSI), 250 GB (IDE), SATA emerging
    • ~16X in 4 years, on track
  • 1 Terabyte ~ 6 disks
  • Transfer rate: 40–60 MB/sec
  • Fill time: ~1 hour to fill a disk at peak transfer rate
• Extrapolate 4 years
  • [2 TB disk, 160 MB/sec, 3.5-hour fill time] (see the fill-time sketch below)
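A minimal sketch of the fill-time arithmetic behind these bullets, using the approximate slide values and decimal GB/TB:

# Disk fill time = capacity / peak sequential transfer rate.
# Capacities in GB (decimal), rates in MB/s -- approximate values from the slide.
def fill_time_minutes(capacity_gb: float, rate_mb_per_s: float) -> float:
    return (capacity_gb * 1000.0) / rate_mb_per_s / 60.0

generations = [
    ("4 years ago (9 GB SCSI)", 9, 15),
    ("today (250 GB IDE)", 250, 60),
    ("extrapolated +4 yrs (2 TB)", 2000, 160),
]

for label, capacity_gb, rate in generations:
    minutes = fill_time_minutes(capacity_gb, rate)
    print(f"{label}: ~{minutes:.0f} minutes ({minutes / 60:.1f} hours)")

# Capacity grows ~16X per 4 years while the transfer rate grows only ~4X,
# so the time to read or write a whole disk keeps getting longer.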
Expectations
• Growth in storage capacity is rapidly changing common expectations
  • Keep every email I ever wrote/received (including viruses)
  • Carry my entire music collection (as MP3s)
  • Store all my digital photographs online
  • (Have all my medical records, including images, online and available to my doctor and hospital worldwide)
• Our storage use expectations are growing rapidly, but relative disk fill times are getting longer
• Managing large medical images illuminates this fundamental gap now
Biology applications push on all three axes
• Simulation, image comparison, data mining (computing)
• Data size and variety (storage)
  • 3D image data (large data)
    • Potentially petabytes at EM scale
  • Medical privacy implies encryption, which significantly adds to CPU requirements
  • Sensors of all kinds (lots of variety)
  • Data banks (e.g., PDB, GenBank, …)
• Access to remote resources
  • Federation of (very large) data repositories (networking)
  • Seamless integration of sensor nets
The Biomedical Informatics Research Network
A Multi-Scale Brain Imaging Federated Repository
BIRN test-beds: Multiscale Mouse Models of Disease, Human Brain Morphometrics, and FIRST BIRN (a 10-site project for fMRI studies of schizophrenia)
Some Critical Infrastructure Challenges
• BIRN scientists want to share data, databases, and resources
  • Resources are geographically distributed
  • Data throughput requirements are aggressive
  • Encryption/security is essential
  • The underlying infrastructure should hide complexity from the user whenever possible
• Concentrate on several key areas
  • Scalable, interoperable infrastructure
    • Resource size (terabytes today), resource count (10+ sites)
  • Consistency of software deployment across domains
  • High-performance, secure data movement
  • Software/hardware specification and deployment (simplify replicating complex infrastructure)
Replication and Symmetry for Scalability
[Diagram: multiple BIRN sites and BIRN users connect over a high-speed IP network through symmetric Grid Interfaces; the Grid Interface provides security, identity mapping, and resource access/abstraction, while each site retains its own local site policies.]
Standard Commodity Hardware for BIRN Sites
• Gigabit/10/100 network switch – Cisco 4006
• Network statistics system
• Gigabit Ethernet network probe
• Network-attached storage – Gigabit Ethernet, 1.0 to 4.0 TB
• Grid POP (Compaq/HP DL380 G3)
  • SRB, Globus
  • Dual-processor Linux with 1 GB of memory
[Rack diagram: Cisco 4006 switch, GigE net probe, network stats node, general compute node, Grid POP, 1–4 TB network-attached storage, optional storage, APC UPS]
Leverage high-volume components for cost/scalability
Some BIRN Statistics
• 15 racks across 12 institutions
• We're in the "getting started" phase of distributed data
• ~300,000 data objects and associated metadata managed as a collection
• ~1 TB of raw data (8% of current BIRN capacity)
• Function BIRN is now processing human phantoms for calibration (increasing the number and size of data objects)
• Best-case achievable bandwidth coast-to-coast is ~10 MB/sec (27 hours/terabyte, ~2 min/GB – a quick arithmetic check follows below)
• Expectations are outstripping the software's ability to keep up
  • But the raw network capacity is growing, so where's the problem?
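The coast-to-coast figures follow directly from the bandwidth; a minimal sketch (decimal units assumed):

# Wide-area transfer time at the observed best-case bandwidth of ~10 MB/s.
# Uses decimal units (1 TB = 10**6 MB), which gives the slide's ~27-hour figure.
bandwidth_mb_per_s = 10.0

hours_per_terabyte = (1_000_000 / bandwidth_mb_per_s) / 3600   # ~27.8 hours
minutes_per_gigabyte = (1_000 / bandwidth_mb_per_s) / 60       # ~1.7 minutes

print(f"~{hours_per_terabyte:.0f} hours per terabyte")
print(f"~{minutes_per_gigabyte:.1f} minutes per gigabyte")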
Data Intensive Scientific Applications
Requiring Experimental Optical Networks
• Large data challenges in the neuro and earth sciences
  • Each data object is 3D and gigabytes in size
  • Data are generated and stored in distributed archives
  • Research is carried out on the federated repository
• Requirements
  • Computing → PC clusters
  • Communications → "dedicated" lambdas over fiber
  • Data → large peer-to-peer lambda-attached storage
  • Visualization → collaborative volume algorithms
• Response
  • The OptIPuter research project
What is OptIPuter?
• It is a large NSF ITR project funded at $13.5M from 2002–2007
• Fundamentally, it asks the question:
  • What happens to the structure of machines and programs when the network becomes essentially "infinite"?
  • Enabled by improvements in photonic networking: 10 GigE + Dense Wavelength Division Multiplexing
• Coupled tightly with key applications (e.g., BIRN)
  • Keeps the IT research grounded and focused
• We are building (in phases) two high-capacity networks with associated modest-sized endpoints
The UCSD OptIPuter Deployment
UCSD is building out a high-speed packet-switched network
[Campus diagram: a Chiaro router at a central collocation point links SDSC, the SDSC Annex, JSOE (Engineering), CRCA (Arts), SOM (Medicine), Chemistry, Physical Sciences/Keck, Preuss High School, Sixth College, the undergraduate colleges, the Node M collocation point, and SIO (Earth Sciences), with a production router uplink to CENIC; Phase I Fall 2002, Phase II 2003; sites span roughly half a mile.]
• Per-site links: 1 Gb/s and 4 Gb/s today
  • 4 Gb/s and 10 Gb/s (2004)
  • 10 Gb/s and 40 Gb/s (2005)
  • 40 Gb/s and 160 Gb/s (2007)
Source: Phil Papadopoulos, SDSC; Greg Hidley, Cal-(IT)2
OptIPuter LambdaGrid
Enabled by Chiaro Networking Router
www.calit2.net/news/2002/11-18-chiaro.html
[Diagram: the Chiaro Enstara router interconnects site switches at Medical Imaging and Microscopy; Chemistry, Engineering, and the Arts; the San Diego Supercomputer Center; and the Scripps Institution of Oceanography, carrying cluster–disk, disk–disk, viz–disk, DB–cluster, and cluster–cluster traffic.]
Image source: Phil Papadopoulos, SDSC
The Center of the UCSD OptIPuter Network
http://132.239.26.190/view/view.shtml
• Unique optical routing core
• 10 Gigabit wire-speed routing
• Expandable to 5 Tb/s today
• We have the "baby" Chiaro at 640 Gbit/sec
OptIPuter Endpoints are “Modest”
• Currently "tiny-sized" clustered endpoints
• We are at the midpoint of procuring and installing the following on campus:
  • Four 32-node PC clusters for computation
  • One ten-node visualization cluster
  • Two 9-Mpixel "Big Bertha" displays
  • One 48-node storage cluster (~20 TB, 300 disk spindles)
• Better balance of endpoint and network
BIRN + OptIPuter
• BIRN is pushing on distributed data (especially image data)
• The OptIPuter is pushing on network/storage/CPU interactions
• There is still the problem of how to build "compositional, high-performance, authenticated, and secure software systems"
• High-speed access to remote data still implies parallel endpoints
• Coordinating high-speed transfers is still a black art (a sketch of the striping idea follows below)
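One common workaround for single-stream limits is to stripe a transfer across several parallel streams, which is roughly what GridFTP-style tools do. The sketch below is purely illustrative and uses plain HTTP range requests; the URL, stream count, and chunking policy are hypothetical and not part of BIRN's actual tooling (which was SRB/Globus based):

# Illustrative sketch: stripe one large download across parallel HTTP range
# requests, analogous to the parallel TCP streams used by GridFTP-style tools.
# The URL below is a placeholder, not a real BIRN endpoint.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URL = "http://example.org/volumes/brain-volume-0001.raw"  # hypothetical object
STREAMS = 8                                               # parallel streams

def content_length(url: str) -> int:
    # Ask the server how big the object is (HEAD request).
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def fetch_range(url: str, start: int, end: int) -> bytes:
    # Fetch bytes [start, end] of the object in its own stream.
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def striped_download(url: str, streams: int = STREAMS) -> bytes:
    size = content_length(url)
    chunk = (size + streams - 1) // streams
    ranges = [(i * chunk, min((i + 1) * chunk, size) - 1) for i in range(streams)]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        parts = pool.map(lambda r: fetch_range(url, *r), ranges)
    return b"".join(parts)  # reassemble the pieces in order

if __name__ == "__main__":
    data = striped_download(URL)
    print(f"received {len(data)} bytes over {STREAMS} streams")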
Grids – Wide(r)-Area Computing and Storage
• Grids allow users to access data, computing, visualization, and instruments
• Grid Security (GSI) built in from the beginning
• Location transparency is key – especially when resources must be geographically distributed
• Software is rapidly moving from research grade to production grade
• The Grid service abstraction (OGSA) is the key software change
Workflows in a Grid Service-Oriented environment
[Diagram: after a single user sign-on (GSI proxy), a workflow spans services across domains – the CCDB interface at ucsd.edu, a backproject service, PACI resources, an art/"blobs" visualization service, and resources at Osaka U.]
A common security, discovery, and instantiation framework of Grid services enables the construction of complex workflows that cross domains.
Special attention must be paid to data movement issues.
Simplified Grid Services
[Diagram: a client (requestor) sends a GSI-secured, formatted request to the service provider at http://ncmir.ucsd.edu; the provider starts a backproject service instance for the client and returns the processed response over the network.]
Grid services leverage the existing web services infrastructure:
1. Client formats the request (parameters + security)
2. Provider starts an instance of the service for the client
3. Results are returned over the net
Pseudocode from the slide:
read rawdata;
call.setTargetObjectURI("urn:gtomo-svc");
call.setMethodName("backproject");
call.setParams("unprocesseddata", rawdata);
response = invoke(call, "http://ncmir.ucsd.edu/gtomo");
result = response.getReturnValue();
Simple summary: Remote Procedure Call (RPC) for inter-domain execution
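For comparison, here is the same request/response pattern as a small, self-contained RPC client in Python. The standard xmlrpc library is used purely as a stand-in: the real NCMIR/BIRN services were GSI-secured SOAP/Grid services (security omitted here), and the endpoint and method name are copied from the slide rather than pointing at a live service.

# Stand-in for the slide's SOAP-style call, using Python's built-in XML-RPC
# client. Endpoint and method mirror the slide's pseudocode; both are
# illustrative -- the real service was a GSI-secured SOAP/Grid service.
import xmlrpc.client

ENDPOINT = "http://ncmir.ucsd.edu/gtomo"   # service provider URL from the slide

def backproject(raw_path: str):
    # 1. Client formats the request (read the unprocessed data to send).
    with open(raw_path, "rb") as f:
        rawdata = xmlrpc.client.Binary(f.read())

    # 2. Provider starts an instance of the service and runs the method.
    proxy = xmlrpc.client.ServerProxy(ENDPOINT)
    response = proxy.backproject(rawdata)

    # 3. Results are returned over the net.
    return response.data if isinstance(response, xmlrpc.client.Binary) else response

# Example usage (hypothetical input file):
# result = backproject("unprocessed_volume.raw")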
Redux
• Tech trends
  • Expectations of large storage, but be aware of the expanding fill times of rotating storage
  • Growth in wide-area networking will rapidly catch up to storage system speed
• Large images pose challenges to the infrastructure
  • Raw storage performance
  • Interfacing storage to the network (cross-site issues)
  • Adding encryption of medical data adds complexity
  • Transfers will imply parallelism at various levels
• Software systems are moving to open Grid services
  • Starting with a standard cross-domain remote procedure call (RPC)
The Critical issue
• With such rapid changes, how do we build systems that meet the needs of application communities, are not standalone "one-offs," and meet the challenges of:
  • Integrity
  • Security
  • Performance
  • Scalability
  • Reliability