The Computational Grid: Aggregating
Performance and Enhanced Capability from
Federated Resources
Rich Wolski
University of California, Santa Barbara
The Goal
• To provide a seamless, ubiquitous, and high-performance
computing environment using a heterogeneous collection of
networked computers.
• But there won’t be one, big, uniform system
– Resources must be able to come and go dynamically
– The base system software supported by each
resource must remain inviolate
– Multiple languages and programming paradigms must
be supported
– The environment must be secure
– Programs must run fast
• For distributed computing…The Holy Grail++
For Example: Rich’s Computational World
[Map of sites in the author's computational world: umich.edu, wisc.edu, ameslab.gov, lbl.gov, titech.jp, ksu.edu, osc.edu, anl.gov, ncsa.edu, uiuc.edu, harvard.edu, wellesley.edu, indiana.edu, virginia.edu, ncni.net, utk.edu, ucsb.edu, isi.edu, csun.edu, caltech.edu, utexas.edu, ucsd.edu, npaci.edu, rice.edu, vu.nl]
Zoom In
[Diagram: zooming in on SDSC and UCSB; resources include an IBM SP, a Cray T-3E, Sun servers, HPSS archival storage, and desktop machines]
The Landscape
• Heterogeneous
– Processors: X86, SPARC, RS6000, Alpha, MIPS,
PowerPC, Cray
– Networks: GigE, Myrinet, 100baseT, ATM
– OS: Linux, Solaris, AIX, Unicos, OSX, NT, Windows
• Dynamically changing
– Completely dedicated access is impossible =>
contention
– Failures, upgrades, reconfigurations, etc.
• Federated
– Local administrative policies take precedence
• Performance?
The Computational Grid
• Vision: Application programs “plug” into the system to
draw computational “power” from a dynamically changing
pool of resources.
– Electrical Power Grid analogy
 Power generation facilities == computers,
networks, storage devices, palm tops, databases,
libraries, etc.
 Household appliances == application programs
• Scale to national and international levels
• Grid users (both power producers and application
consumers) can join and leave the Grid at will.
The Shape of Things to Come?
• Grid Research Adventures
– Infrastructure
– Grid Programming
• State of the Grid Art
– What do Grids look like today?
• Interesting developments, trends, and prognostications
of the Grid future
Fundamental Questions
• How do we build it?
– software infrastructures
– policies
– maintenance, support, accounting, etc.
• How do we program it?
– concurrency, synchronization
– heterogeneity
– dynamism
• How do we use it for performance?
– metrics
– models
General Approach
• Combine results from distributed operating systems,
parallel computing, and internet computing research
domains
– Remote procedure call/ remote invocation
– Public/private key encryption
– Domain decomposition
– Location independent naming
• Engineering strategy: Implement Grid software
infrastructure as middleware
– Allows resource owners to maintain ultimate control
locally over the resources they commit to the Grid
– Permits new resources to be incorporated easily
– Aids in developing a user community
Middleware Research Efforts
• Globus (I. Foster and C. Kesselman)
– Collection of independent remote execution and
naming services
• Legion (A. Grimshaw)
– Distributed object-oriented programming
• NetSolve (J. Dongarra)
– Multi-language brokered RPC (the idea is sketched below)
• Condor (M. Livny)
– Idle cycle harvesting
• NINF (S. Matsuoka)
– Java-based brokered RPC
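The brokered-RPC idea shared by NetSolve and NINF can be shown in a few lines of C. This is a purely illustrative sketch, not the real NetSolve or Ninf API: broker_lookup() and rpc_call() are hypothetical stand-ins for the agent query and the remote invocation, and the "remote" computation is simulated locally so the example runs on its own.

```c
/* Illustrative sketch of brokered RPC; not the NetSolve or Ninf API. */
#include <stdio.h>
#include <string.h>

/* Stand-in agent: maps a routine name to the "best" server it knows of.
 * A real agent would rank servers using load and network information.  */
static const char *broker_lookup(const char *routine)
{
    if (strcmp(routine, "dgemm") == 0)
        return "compute1.example.edu";
    return NULL;
}

/* Stand-in remote call: marshal arguments, ship to the chosen server,
 * wait for the result.  Here the "remote" work is simulated locally.   */
static int rpc_call(const char *server, const char *routine,
                    const double *in, size_t n, double *out)
{
    printf("calling %s on %s with %zu inputs\n", routine, server, n);
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0;   /* placeholder for the real computation */
    return 0;
}

int main(void)
{
    double in[4] = {1, 2, 3, 4}, out[4];
    const char *server = broker_lookup("dgemm");
    if (!server || rpc_call(server, "dgemm", in, 4, out) != 0) {
        fprintf(stderr, "brokered call failed\n");
        return 1;
    }
    printf("first result: %g\n", out[0]);
    return 0;
}
```

The point is the division of labor: the client names a routine, the agent picks a server, and the runtime handles marshaling and transport.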
Commonalities
• Runtime systems
– All current infrastructures are implemented as a set
of run time services
• Resource is an abstract notion
– Anything with an API is a resource: operating
systems, libraries, databases, hardware devices
• Support for multiple programming languages
– legacy codes
– performance
Infrastructure Concerns
• Leverage emerging distributed technologies
– Buy it rather than build it
– Network infrastructure
– Web services
– Complexity
– Performance
• Installation, configuration, fault-diagnosis
– Mean time to reconfiguration is probably measured
in minutes
– Bringing the Grid “down” is not an option
– Who operates it?
NPACI
• National Partnership for Advanced Computational
Infrastructure
– high-performance computing for the scientific
research community
• Goal: Build a production-quality Grid
– Leverage emerging standards
– Harden and deploy mature Grid technologies
– Packaging, configuration, deployment, diagnostics,
accounting
• Deliver the Grid to scientists
PACI-sized Questions
• If the national infrastructure is managed as a Grid...
– What resources are attached to it?
X86 is certainly plentiful
Earth Simulator is certainly expensive
Multithreading is certainly attractive
– What is the right blend?
– How are they managed?
How long will you wait for your job to get through
the queue?
• Accounting
– What are the units of Grid allocation?
Grid Programming
• Two models
– Manual: Application is explicitly coded to be a Grid
application
– Automatic: Grid software “Gridifies” a parallel or
sequential program
• Start with the simpler approach: build programs that
can adapt to changing Grid conditions
• What are the current Grid conditions?
– Need a way to assess the available performance
• For example:
– What is the speed of your ethernet?
Ethernet Doesn’t Have a Speed -- it Has Many
[Plot: end-to-end TCP/IP throughput (mb/s, 0 to 70) over time between adjacent hosts on 100mb Ethernet]
More Importantly
• It is not what the speed was, but what the speed will
be that matters
• Performance prediction
– Analytical models remain elusive
– Statistical models are difficult
– Whatever models are used, the prediction itself
needs to be fast
The Network Weather Service
• On-line Grid system that
– monitors the performance that is available from
distributed resources
– forecasts future performance levels using fast
statistical techniques
– delivers forecasts "on the fly," dynamically
• Uses adaptive, non-parametric time series analysis
models to make short-term predictions (a simplified
sketch follows this slide)
– Records and reports forecasting error with each
prediction stream
• Runs as any user (no privileged access required)
• Scalable and end-to-end
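A minimal sketch of the forecasting approach described above, assuming (as the NWS does) that several cheap predictors run side by side, each is scored against the measurements as they arrive, and the prediction of the current winner is reported together with its error. The two predictors here (last value and running mean) are illustrative stand-ins for the larger battery of adaptive, non-parametric models the NWS actually uses.

```c
/* Sketch of prediction-by-selection: run simple predictors in parallel,
 * track each one's cumulative error, report the current winner.        */
#include <stdio.h>
#include <math.h>

struct forecaster {
    double last;     /* most recent measurement                 */
    double sum;      /* running sum for the mean predictor      */
    int    n;        /* number of measurements seen             */
    double err[2];   /* cumulative absolute error per predictor */
};

/* Feed one measurement; return the forecast for the next one and the
 * mean absolute error of the predictor that produced it.              */
static double forecast(struct forecaster *f, double measurement, double *mae)
{
    double pred[2];
    pred[0] = f->last;                                   /* last value   */
    pred[1] = (f->n > 0) ? f->sum / f->n : measurement;  /* running mean */

    if (f->n > 0)                    /* score the previous forecasts */
        for (int i = 0; i < 2; i++)
            f->err[i] += fabs(pred[i] - measurement);

    f->last = measurement;
    f->sum += measurement;
    f->n++;

    int best = (f->err[0] <= f->err[1]) ? 0 : 1;
    *mae = (f->n > 1) ? f->err[best] / (f->n - 1) : 0.0;
    /* Recompute the winning predictor with the new measurement included. */
    return (best == 0) ? f->last : f->sum / f->n;
}

int main(void)
{
    struct forecaster f = {0};
    double series[] = {52, 48, 55, 30, 51, 49};   /* mb/s measurements */
    for (int i = 0; i < 6; i++) {
        double mae, p = forecast(&f, series[i], &mae);
        printf("t=%d  measured=%.0f  next forecast=%.1f  (MAE so far %.1f)\n",
               i, series[i], p, mae);
    }
    return 0;
}
```

Each predictor costs O(1) work per measurement, which is what keeps the forecast itself fast enough to deliver on-line.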
NWS Predictions and Errors
[Plot: NWS predictions (red) overlaid on measured data (black) for end-to-end TCP/IP throughput (mb/s, 0 to 70) over time between adjacent hosts on 100mb Ethernet; MSE = 73.3, FED = 8.5 mb/s, MAE = 5.8 mb/s]
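For reference, MSE and MAE are the usual per-sample averages of squared and absolute forecast error over the paired prediction/measurement series. A generic computation (not NWS code; the numbers are made up):

```c
/* Mean squared error and mean absolute error of a forecast series. */
#include <stdio.h>
#include <math.h>

static void forecast_errors(const double *pred, const double *meas, int n,
                            double *mse, double *mae)
{
    double se = 0.0, ae = 0.0;
    for (int i = 0; i < n; i++) {
        double e = pred[i] - meas[i];
        se += e * e;
        ae += fabs(e);
    }
    *mse = se / n;
    *mae = ae / n;
}

int main(void)
{
    double pred[] = {50, 52, 47, 49};   /* forecasts, mb/s    */
    double meas[] = {48, 55, 45, 50};   /* measurements, mb/s */
    double mse, mae;
    forecast_errors(pred, meas, 4, &mse, &mae);
    printf("MSE = %.1f, MAE = %.1f mb/s\n", mse, mae);
    return 0;
}
```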
Clusters Too
[Plot: end-to-end TCP/IP throughput (mb/s, 0 to 250) over time within a cluster on a GigE interconnect; MSE = 4089, FED = 63 mb/s, MAE = 56 mb/s]
Many Challenges, No Waiting
• On-line predictions
– Need it better, faster, cheaper, and more accurate
• Adaptive programming
– Even if predictions are “there” they will have errors
– Performance fluctuates at machine speeds, not
human speeds
– Which resource to use? When? (one possible
selection rule is sketched below)
• Can programmers really manage a fluctuating abstract
machine?
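One possible selection rule for "which resource to use," assuming each forecast arrives with its reported error as in the NWS output: rank candidates by a conservative lower bound (forecast minus error) rather than by the raw forecast. The host names and figures below are invented for illustration.

```c
/* Pick a resource by a conservative lower bound on delivered bandwidth. */
#include <stdio.h>

struct host {
    const char *name;
    double forecast_mbps;   /* predicted throughput        */
    double mae_mbps;        /* reported forecasting error  */
};

static const struct host *pick_host(const struct host *h, int n)
{
    const struct host *best = NULL;
    double best_bound = -1.0;
    for (int i = 0; i < n; i++) {
        double bound = h[i].forecast_mbps - h[i].mae_mbps;
        if (bound > best_bound) {
            best_bound = bound;
            best = &h[i];
        }
    }
    return best;
}

int main(void)
{
    struct host hosts[] = {
        {"hostA.example.edu", 55.0, 15.0},   /* fast but noisy      */
        {"hostB.example.edu", 48.0,  1.5},   /* slower but steadier */
    };
    const struct host *h = pick_host(hosts, 2);
    printf("schedule the transfer through %s\n", h->name);
    return 0;
}
```

Under this rule a steadier host can win over a nominally faster but noisier one, exactly the kind of decision a human cannot keep making at machine speeds.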
GrADS
• Grid Application Development Software (GrADS)
Project (K. Kennedy, PI)
– Investigates Grid programmability
– Soup-to-nuts integrated approach
Compilers, debuggers, libraries, etc.
 Automatic Resource Control strategies
– Selection and Scheduling
– Resource economies (stability)
Performance Prediction and Monitoring
– Applications and resources (one such check is
sketched after this slide)
– Effective Grid simulation
– Builds upon middleware successes
– Tested with “real” applications
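A minimal sketch of the kind of run-time check a prediction-and-monitoring component might perform, assuming the scheduler supplies a predicted work rate for the chosen resources: compare it with the measured rate and treat a large shortfall as grounds for rescheduling. The 25% tolerance and the reschedule() hook are hypothetical and not part of GrADS.

```c
/* Compare measured progress with the scheduler's prediction and flag
 * a large shortfall as a reason to reschedule.                        */
#include <stdio.h>

static void reschedule(void)
{
    printf("progress below tolerance: ask the scheduler for new resources\n");
}

/* Called periodically with the predicted and measured work rates. */
static void check_progress(double predicted_rate, double measured_rate)
{
    const double tolerance = 0.25;   /* allow a 25% shortfall */
    if (measured_rate < (1.0 - tolerance) * predicted_rate)
        reschedule();
    else
        printf("on track: %.1f of %.1f units/s\n",
               measured_rate, predicted_rate);
}

int main(void)
{
    check_progress(100.0, 92.0);   /* fine                */
    check_progress(100.0, 60.0);   /* triggers the hook   */
    return 0;
}
```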
Four Observations
• The performance of the Grid middleware and services
matters
– Grid fabric must scale even if the individual
applications do not
• Adaptivity is critical
– So far, only short-term performance predictions are
possible
– Both application and system must adapt on same
time scale
• Extracting performance is really really hard
– Things happen at machine speeds
– Complexity is a killer
• We need more compilation technology
Grid Compilers
• Adaptive compilation
– Compiler and program preparation environment
needs to manage complexity
– The “machine” for which the compiler is optimizing
is changing dynamically
• Challenges
– Performance of the compiler is important
– Legacy codes
– Security?
• GrADS has broken ground, but there is much more to
do
Grid Research Challenges
• Four foci characterize Grid “problems”
– Heterogeneity
– Dynamism
– Federalism
– Performance
• Just building the infrastructure makes research
questions out of previously solved problems
– Installation
– Configuration
– Accounting
• Grid programming is extremely complex
– New programming technologies
Okay, so where are we now?
Rational Exuberance
DISCOM
SinRG
APGrid
IPG …
For Example -- TeraGrid
• Joint effort between
– San Diego Supercomputer Center (SDSC)
– National Center for Supercomputing Applications (NCSA)
– Argonne National Laboratory (ANL)
– Center for Advanced Computational Research
(CACR)
• Stats
– 13.6 Teraflops (peak)
– 600 Terabytes on-line storage
– 40 gb/s full connectivity, cross country, between
sites
• Software Infrastructure is primarily Globus based
• Funded by NSF last year
Non-trivial Endeavor
[Diagram of TeraGrid sites and resources. Site roles: Caltech: data collection and analysis applications; ANL: visualization; SDSC: data-oriented computing; NCSA: compute-intensive. Resources shown include a 574p IA-32 Chiba City cluster, a 256p HP X-Class, a 128p Origin, a 128p HP V2500, HR display and VR facilities, a 92p IA-32 cluster, HPSS and UniTree archives, a 1024p IA-32 cluster, a 320p IA-64 cluster, the 1176p IBM SP Blue Horizon, a Sun E10K, a 1500p Origin, and Myrinet interconnects.]
It’s Big, but there is Room to Grow
• Baseline infrastructure
– IA64 processors running Linux
– Gigabit ethernet
– Myrinet
– The Phone Company
• Designed to be heterogeneous and extensible
– Sites have “plugged” their resources in
IBM Blue Horizon
SGI Origin
Sun Enterprise
Convex X and V Class
CAVEs, ImmersaDesks, etc.
Middleware Status
• Several research and commercial infrastructures have
reached maturity
– Research: Globus, Legion, NetSolve, Condor, NINF,
PUNCH
– Commercial: Globus, Avaki, Grid Engine
• By far, the most prevalent Grid infrastructure
deployed today is Globus
Globus on One Slide
• Grid protocols for resource access, sharing, and
discovery
– Grid Security Infrastructure (GSI)
– Grid Resource Allocation Manager (GRAM)
– MetaDirectory Service (MDS)
• Reference implementation of protocols in toolkit form
[Diagram of the Globus architecture: the user holds a GSI (Grid Security Infrastructure) proxy and issues reliable, GSI-authenticated remote invocations; a GRAM (Grid Resource Allocation & Management) gatekeeper acts as a factory that creates user processes, each with its own proxy; processes register with an MDS-2 (Meta Directory Service) reporter (registry + discovery), and a GIIS (Grid Information Index Server) provides discovery across reporters; other GSI-authenticated services (e.g. GridFTP) follow the same pattern.]
Increasing Research Leverage
• Grid research software artifacts turn out to be
valuable
– Much of the extant work is empirical and
engineering focused
– Robustness concerns mean that the prototype
systems need to “work”
– Heterogeneity implies the need for portability
• Open source impetus
• Need to go from research prototypes to nationally
available software infrastructure
– Download, install, run
Packaging Efforts
• NSF Middleware Initiative (NMI)
– USC/ISI, SDSC, U. Wisc., ANL, NCSA, I2
– Identifies maturing Grid services and tools
– Provides support for configuration tools, testing,
packaging
– Implements a release schedule and coordination
– R1 out 8/02
Globus, Condor-G, NWS, KX509/KCA
– Release every 3 months
– Many more packages slated
• The NPACkage
– Use NMI technology for PACI infrastructure
State of the Art
• Dozens of Grid deployments underway
• Linux cluster technology is the primary COTS
computing platform
• Heterogeneity is built in from the start
– Networks
– Extant systems
– Special-purpose devices
• Globus is the leading Middleware
• Grid services and software tools reaching maturity and
mechanisms are in place to maximize leverage
What’s next?
Grid Standards
• Interoperability is an issue
– Technology drift is starting to become a problem
– Protocol zoo is open for business
• The Global Grid Forum (GGF)
– Modeled after the IETF (e.g. working groups)
– Organized at a much earlier stage of development
(relatively speaking)
– Meetings every 4 months
– Truly an international organization
Webification
• Open Grid Service Architecture (OGSA)
– “The Physiology of the Grid,” I. Foster, C.
Kesselman, J. Nick, S. Tuecke
– Based on W3C standards (XML, WSDL, WSIL,
UDDI, etc.)
– Incorporates web service support for interface
publication, multiple protocol bindings, and
local/remote transparency
– Directly interoperable with Internet-targeted
“hosting environments”
J2EE, .NET
– The Vendors are excited
Grid@Home
• Entropia (www.entropia.com)
– Commercial enterprise
– Peer-2-Peer approach
Napster for compute cycles (without the lawsuits)
– Microsoft PC-based instead of Linux/Unix based
More compute leverage -- a lot more
Way more configuration support, deployment
support, fault-management built into the system
– Proprietary technology
– Deployed at NPACI on 250+ hosts
Thanks and Credit
• organizations
– NPACI, SDSC, NCSA, The Globus Project
(ISI/USC), The Legion Project (UVa), UTK, LBL
• support
NSF, NASA, DARPA, USPTO, DOE
More Information
http://www.cs.ucsb.edu/~rich
• Entropia
– http://www.entropia.com
• Globus
– http://www.globus.org
• GrADS
– http://hipersoft.cs.rice.edu/grads
• NMI
– http://www.nsf-middleware.org
• NPACI
– http://www.npaci.edu
• NWS
– http://nws.cs.ucsb.edu
• TeraGrid
– http://www.teragrid.org