Simulation for Grid Computing
Henri Casanova
Univ. of California, San Diego
casanova@cs.ucsd.edu
Grid Research
 Grid researchers often ask questions in the broad
area of distributed computing
 Which scheduling algorithm is best for this application
on a given Grid?
 Which design is best for implementing a distributed Grid
resource information service?
 Which caching strategy is best for enabling a community
of users to do distributed data analysis?
 What are the properties of several resource
management policies in terms of fairness and
throughput?
 What is the scalability of my publish-subscribe system
under various levels of failures?
 ...
Grid Research
 Analytical or Experimental?
 Analytical Grid Computing research
 Some have developed purely analytical / mathematical
models for Grid computing
 makes it possible to prove interesting theorems
 often too simplistic to convince practitioners
 but sometimes useful for understanding principles in spite of
dubious applicability
 One often uncovers NP-complete problems anyway
 e.g., routing, partitioning, scheduling problems
 And one must run experiments
 Grid computing research: based on experiments
 As in most published works
Grid Computing as Science?
 You can tell you are a scientific discipline if
 You can read a paper, easily reproduce (at least a
subset of) its results, and improve
 You can tell a grad student “Here are the standard
tools, go learn how to use them and come back in one
month”
 You can give a 1-hour seminar on widely accepted tools
that are the basis for doing research in the area
 We are not there today
 But perhaps I can give a 1-hour seminar on emerging
tools that could be the basis for doing research in the
area, provided a few open questions are addressed
 Need for standard ways to run “Grid experiments”
Grid Experiments
 Real-world experiments are good
 Eminently “believable”
 Demonstrate that the proposed approach can be
implemented in practice

 But...
 Can be time-intensive
 Execution of “applications” for hours, days, months, ...
 Can be labor-intensive
 Entire application needs to be built and functional
 including all design / algorithm alternatives
 including all hooks for deployment
 Is it bad engineering practice to build many full-fledged
solutions to find out which ones work best?
Grid Experiments (2)
 What experimental testbed?
 My own little testbed
 well-behaved, controlled, stable
 often not representative of real Grids
 Real grid platforms
 (Still) challenging for many grid researchers to obtain
 Not built as a computer scientist’s playpen
 other users may disrupt experiments
 other users may find experiments disruptive
 Platform will experience failures that may disrupt the
experiments
 Platform configuration may change drastically while
experiments are being conducted
 Experiments are uncontrolled and unrepeatable
 even if disruption from other users is part of the experiments, it
prevents back-to-back runs of competitor designs / algorithms
Grid Experiments (3)
 Difficult to obtain statistically significant results on an
appropriate testbed

 And to make things worse...
 Experiments are limited to the testbed
 What part of the results are due to idiosyncrasies of the testbed?
 Extrapolations are possible, but rarely convincing
 Must use a collection of testbeds...
Still limited explorations of “what if” scenarios
 what if the network were different?
 what if we were in 2015?
 Difficult for others to reproduce results
 This is the basis for scientific advances!
Simulation
 Simulation can solve many (all) of these difficulties
 No need to build a real system
 Conduct controlled / repeatable experiments
 In principle, no limits to experimental scenarios
 Possible for anybody to reproduce results
 ...
 Simulation
 “representation of the operation of one system (A)
through the use of another (B)”
 Computer simulation: B = a computer program
 Key question: Validation
 correspondence between simulation and real-world
Simulation in Computer Science
 Microprocessor Design
 A few standard “cycle-accurate” simulators are used extensively
 http://www.cs.wisc.edu/~arch/www/tools.html
 Possible to reproduce simulation results
 Networking
 A few standard “packet-level” simulators
 NS-2, DaSSF, OMNeT++
 Well-known datasets for network topologies
 Well-known generators of synthetic topologies
 SSF standard: http://www.ssfnet.org/
 Possible to reproduce simulation results
 Grid Computing?
 None of the above up until a few years ago
 Most people built their own “ad-hoc” solutions
 Promising recent developments
Simulation of Distributed
Computing Platforms?
 Simulation of parallel platforms used for decades
 Most simulations have made drastic assumptions
 Simplistic platform model
 Processors perform work at some fixed rate (Flops)
 Network links send data at some fixed rate
 Topology is fully connected (no communication interference) or
a bus (simple communication interference)
 Communication and computation are perfectly overlappable
 Simplistic application model
 All computation is CPU intensive
 Clear-cut communication and computation phases
 Application is deterministic
 Straightforward simulation in most cases
 Just fill in a Gantt chart with a computer rather than by hand
 No need for a “simulation standard”
Grid Simulations?
 Simple models:
 perhaps justifiable for a switched dedicated cluster
running a matrix multiply application
 Hardly justifiable for grid platforms
 Complex and wide-reaching network topologies
 multi-hop networks, heterogeneous bandwidths and
latencies
 non-negligible latencies
 complex bandwidth sharing behaviors
 contention with other traffic
 Overhead of middleware
 Complex resource access/management policies
 Interference of communication and computation
Grid Simulations
 Recognized as a critical area
 Grid eXplorer (GdX) project (INRIA)
 Build an actual scientific instrument
 Databases of “experimental conditions”
 1000-node cluster for simulation/emulation
 Visualization tools
 Still in planning
 What simulation technology???
 Two main questions
1. What does a “representative” Grid look like?
2. How does one do simulation on a synthetic
representative Grid?
Presentation Outline
 Introduction
 Simulation for Grid Computing?
 Generating Synthetic Grids
 Simulating Applications on Synthetic Grids
 Current Work and Future Directions
Why Synthetic Platforms?
 Two goals of simulations:
 Simulate platforms beyond the ones at hand
 Perform sensitivity analyses
 Need: Synthetic platforms
 Examine real platforms
 Discover principles
 Implement “platform generators”
 What simulation results to include in my paper?
 Results for a few “real” platforms
 Results for many synthetic platforms
Generation of Synthetic Grids
 Three main elements
 Network Topology
 Graph
 Bandwidth and Latencies
 Compute Resources
 And other resources
 “Background” Conditions
 Load and unavailability
 Failures
 What is Representative and Tractable?
Synthetic Network Topologies
 The network community has wondered
about the properties of the Internet topology
for years
 The Internet grows in a decentralized fashion
with seemingly complex rules and incentives
 Could it have a mathematical structure?
 Could we then have generative models?
 Three “generations of graph generators”
 Random
 Structural
 Degree-based
Random Topology Generators
 Simplistic
 Place N nodes randomly in a square
 Vertices u,v connected with some probability P(u,v)
 Waxman [JSAC’88]
 Place N nodes randomly in a CxC square
 P(u,v) = α · e^(-d / (β · C√2)), with 0 < α, β ≤ 1
 d = Euclidean distance between u and v
 First model to be widely used for network simulations
 (a toy generator sketch follows below)
 Others
 Exponential, Locality-based, ...
 Shortcoming: Real networks have a non-random
structure with some “hierarchy”
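A minimal, illustrative sketch of a Waxman-style generator in Python; the parameter values (alpha, beta, square side) are arbitrary placeholders rather than values prescribed by the original paper:

```python
import math
import random

def waxman_graph(n, alpha=0.4, beta=0.1, side=100.0, seed=0):
    """Toy Waxman-style generator: place n nodes in a side x side square and
    connect u,v with probability alpha * exp(-d / (beta * side * sqrt(2)))."""
    rng = random.Random(seed)
    coords = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    max_dist = side * math.sqrt(2)  # largest possible distance in the square
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            d = math.dist(coords[u], coords[v])  # Euclidean distance
            if rng.random() < alpha * math.exp(-d / (beta * max_dist)):
                edges.append((u, v))
    return coords, edges

coords, edges = waxman_graph(50)
print(f"{len(edges)} links among 50 nodes")
```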
Structural Generators
 “... the primary structural characteristic affecting the paths between nodes in the Internet is the distribution between stub and transit domains... In other words, there is a hierarchy imposed on nodes...” [Zegura et al., 1997]
 Both at the “AS level” (peering relationships) and at the “router level” (Layer 2)
 Quickly became accepted wisdom:
 Transit-Stub [Calvert et al., 1997]
 Tiers [Doar, 1996]
 GT-ITM [Zegura et al., 1997]
Power-Laws!
 In 1999, Faloutsos et al. [SIGCOMM’99] rocked
topology generation with power laws
 Results obtained both for AS-level and router-level
information from real networks
 Outdegree (number of edges leaving a node)
 For each node v, compute its outdegree dv
 Node rank, rv: the index of v in the order of decreasing degree
 Nodes can have the same rank
 Law: dv is proportional to rv^R for some constant R, the rank exponent
 (a sketch for checking this law on a topology follows below)
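To check the rank law on a generated topology, sort node outdegrees in decreasing order and fit a line in log-log space; the slope estimates R. A minimal sketch with a made-up degree list (plain least squares, no statistics library):

```python
import math

def rank_exponent(degrees):
    """Fit log(degree) = R * log(rank) + c by least squares; returns the slope R.
    A near-linear log-log fit with negative slope indicates the rank power law."""
    ordered = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(d) for d in ordered]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# made-up outdegrees of a small topology (e.g., from the Waxman sketch above)
print(rank_exponent([7, 5, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1]))
```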
Power-Laws!
 Random Generators do not
agree with Power-laws
 Structural Generators do not agree with Power-laws
Degree-based Generators
 New common wisdom: A topology that does not agree with power-laws cannot be representative
 Flurry of development of “power-law generators” after the Faloutsos paper
 CMU power law graph generator [Palmer et al., 2000]
 Inet [Jin et al., 2000]
 BRITE [Medina et al., 2001]
 PLRG [Aiello et al., 2000]
Structure vs. Power-law
 We know networks have structure AND power laws
 Combine both?
 GridG project [Dinda et al., SC’03]
 Use a Tiers-generated topology
 Add random links to satisfy the power-laws
 How about just using power-laws?
 Comparison paper [Tangmunarunkit et al., SIGCOMM’02]
 Degree-based generators capture the large-scale structure of real
networks very well, better than structural generators!
 structural generators impose “too strict” a hierarchy
 Hierarchy arises naturally from the degree-based generators
 e.g., backbone links
 Works both for AS-level and router-level
Short story
 What generator?
 To model large networks (e.g., > 500 routers)
 use degree-based generators
 To model small networks (e.g., < 100 routers)
 use structural generators
 degree-based will not create any hierarchy
Bandwidths, latencies, traffic, etc.
 Topology generators only produce a graph
 We need link characteristics as well
 Option #1: Model physical characteristics
 Some “models” in topology generators
 Need to simulate background traffic
 No accepted model for generating background traffic
 Simulation can be very costly
 Option #2: Model end-to-end performance
 Models ([Lee, HCW’01]) or Measurements (NWS, ...)
 Go from path modeling to link modeling?
 Turns out to be a difficult question
 DARPA workshop on network modeling
Bandwidths, latencies, traffic, etc.
 Maybe none of this matters?
 Fiber inside the network mostly unused
 Communication bottleneck is the local link
 Appropriate tuning of TCP or better
protocols should saturate the local link
 Don’t care about topology at all!
 Or maybe none of this matters for my
application
 No network contention
Compute Resources
 What resources do we put at the endpoints?
 Option #1: “ad-hoc” generalization
 Look at the TeraGrid
 Generate new sites based on existing sites
 Option #2: Statistical modeling
 Examine many production resources
 Identify key statistical characteristics
 Come up with a generative/predictive model
Synthetic Clusters?
 Many Grid resources are clusters
 What is the “typical” distribution of clusters?
 “Commodity Cluster synthesizer” [Kee et al.,
SC’04]
 Examined 114 production clusters (10K+ procs)
 Came up with statistical models
 Validated model against a set of 191 clusters
(10K+ procs)
 Models allow “extrapolation” for future
configurations
 Models implemented in a resource generator
Architecture/Clock Models
[Figures: fraction of processors per family (Pentium2, Celeron, Pentium3, Pentium4, Itanium, AthlonMP, AthlonXP, Athlon, Opteron); clock rate (GHz) vs. release year per family; number of processors released per year]
 Current distribution of processor families
 Linear fit between clock rate and release year within a processor family
 Quadratic fit for the fraction of processors released in a given year
 Model future distributions and speeds (a toy extrapolation sketch follows below)
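As an illustration of the linear-fit-and-extrapolate idea above, here is a minimal sketch for a single processor family; the (year, GHz) samples are hypothetical placeholders, not data from [Kee et al., SC’04]:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# hypothetical (release year, clock rate in GHz) samples for one processor family
samples = [(2000, 1.4), (2001, 1.8), (2002, 2.4), (2003, 2.8), (2004, 3.2)]
a, b = linear_fit([y for y, _ in samples], [g for _, g in samples])
print(f"projected 2006 clock rate: {a * 2006 + b:.2f} GHz")
```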
Other models?
 Other models
 Cache size: grows logarithmically
 Number of processors per node: log2-normal
 Memory size: log2-normal
 Number of nodes per cluster: log2-normal
 Models were validated against a snapshot of ROCKS
clusters
 These clusters have been added to the “training” set
 More clusters are added every month
 GridG
 Provides a generic framework in which such laws and other
correlations can be encoded
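The log2-normal models listed earlier on this slide are straightforward to sample: draw a normal variate and use it as an exponent of 2. A minimal sketch; the mu/sigma values are placeholders, not the parameters fitted in [Kee et al., SC’04]:

```python
import random

def sample_log2_normal(mu, sigma, rng=random):
    """Draw X such that log2(X) is Normal(mu, sigma)."""
    return 2.0 ** rng.gauss(mu, sigma)

# placeholder parameters, not the fitted ones
nodes_per_cluster = round(sample_log2_normal(5.0, 1.5))   # median 2^5 = 32 nodes
memory_gb_per_node = sample_log2_normal(1.0, 0.8)         # median 2^1 = 2 GB
print(nodes_per_cluster, round(memory_gb_per_node, 1))
```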
Resource Availability / Workload
 Probabilistic models
 Naive: exponentially distributed availability and unavailability
intervals
 Sophisticated: Weibull distributions [Wolski et al.]
 Traces
 NWS, etc.
 Desktop Grid resources [Kondo, SC’04]
 Workload models
 e.g., Batch schedulers
 Traces
 Models [Feitelson, JPDC’03]
 job inter-arrival times: Gamma
 amount of work requested: Hyper-Gamma
 number of processors requested: Compounded (2^, 1, ...)
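Both the availability models and the workload distributions above can be drawn with Python’s standard library. A minimal sketch with placeholder parameters, using a single Gamma in place of the Hyper-Gamma for requested work:

```python
import random

rng = random.Random(42)

# availability intervals: naive exponential vs. Weibull [Wolski et al.]
uptime_naive = rng.expovariate(1.0 / 3600.0)      # mean one hour, in seconds
uptime_weibull = rng.weibullvariate(3600.0, 0.7)  # scale and shape are placeholders

# batch workload in the spirit of [Feitelson, JPDC'03], placeholder parameters
interarrival = rng.gammavariate(2.0, 300.0)   # seconds between job arrivals
work = rng.gammavariate(1.5, 2000.0)          # requested work per job (single Gamma
                                              # standing in for a Hyper-Gamma mixture)
print(uptime_naive, uptime_weibull, interarrival, work)
```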
A Sample Synthetic Grid
 Generate 5,000 routers with BRITE
 Annotate latencies according to BRITE’s Euclidean distance method (scaling to obtain the desired network diameter)
 Annotate bandwidths based on a set of end-to-end NWS measurements
 Pick 30% of the end-points
 Generate a cluster at each end-point according to Kee’s synthesizer for Year 2006
 Model cluster load with Feitelson’s model with a range of parameters for the random distributions
 Model resource failures based on Inca measurements on TeraGrid
Synthetic Grid Generation
 Still far from widely accepted standards
 Many ongoing, promising efforts
 Researchers have recognized this as an issue
 Tools from networking can be reused
 A few “Grid” tools are available
 What is really needed
 Repository of “Grid Measurements”
 Repository of “Synthetic/Generated Grids”
Presentation Outline
 Introduction
 Simulation for Grid Computing?
 Generating Synthetic Grids
 Simulating Applications on Synthetic Grids
 Current Work and Future Directions
Simulation Levels
 Simulating applications on a synthetic platform?
 Spectrum of simulation levels, from more abstract to less abstract:
 Mathematical simulation: based solely on equations
 Discrete-event simulation: abstraction of the system as a set of dependent actions and events (fine- or coarse-grain)
 Emulation: trapping and virtualization of low-level application/system actions
 Boundaries above are blurred (d.e. simulation ~ emulation)
 A simulation can combine all paradigms at different levels
Simulation Options
 Network (from more abstract to less abstract)
 Macroscopic: Flows in “pipes”
 coarse-grain d.e. simulation + mathematical simulation
 Microscopic: Packet-level simulation
 fine-grain d.e. simulation
 Actual flows go through “some” network
 emulation
 CPU (from more abstract to less abstract)
 Macroscopic: Flows in a “pipe”
 coarse-grain d.e. simulation + mathematical simulation
 Microscopic: Cycle-accurate simulation
 fine-grain d.e. simulation
 Virtualization via another CPU / Virtual Machine
 emulation
Simulation Options
 Application (from more abstract to less abstract)
 Macroscopic: Application = analytical “flow”
 Less macroscopic: sets of abstract “tasks” with resource needs and dependencies
 coarse-grain d.e. simulation
 Application specification or pseudo-code API
 Virtualization
 emulation of actual code with trapping of application-generated events
 Two projects
 MicroGrid (UCSD)
 SimGrid (UCSD + IMAG + Univ. Nancy)
MicroGrid
 Set of simulation tools for evaluating middleware, applications, and network services for Grid systems
 Applications are supported by emulation and virtualization
 Actual application code is executed on “virtualized” resources
 MicroGrid accounts for
 CPU
 network
[Figure: Application / Virtualization / Virtual Resources / MicroGrid / Physical Resources stack]
MicroGrid Virtualization
 Resource virtualization
 “resource names” are virtualized
 gethostname, sockets, Globus’ GIS, MDS, NWS, etc.
 Time virtualization
 Simulating the TeraGrid on a 4 node cluster
 Simulating a 4 node cluster on the TeraGrid
 CPU virtualization
 Direct execution on (a fraction of) a physical resource
 No application modification
 Main challenge
 Synchronization between real time and virtual time
MicroGrid Network
 Packet-level simulation
 Network calls are intercepted
 Are sent to a network simulator that has been configured with the
virtual network topology
 Implemented with MaSSF
 An MPI version of DaSSF [Liu et al., 2001]
 Configured via SSF standard (MDL, etc.)
 Protocol stacks implemented on top of MaSSF
 TCP, UDP, BGP, etc.
 Socket calls are trapped without application modification
 real data is sent
 delays are scaled to match the virtual time
 Main challenges
 scaling (20,000 routers on a 128-node cluster)
 synchronization with computation
MicroGrid in a NutShell
 CPU: virtualization via another CPU (emulation)
 Can be really slow
 But hopefully accurate
 Application: virtualization via trapping of application events (emulation)
 Can have high overhead
 But captures the overhead!
 Network: microscopic packet-level simulation (fine-grain discrete-event simulation)
 Can be really slow for long transfers
 But hopefully accurate
SimGrid
 Originally developed for scheduling research
 Must be fast to allow for thousands of simulations
 Application
 No real application code is executed
 Consists of tasks that have
 dependencies
 resource consumptions
 Resources
 No virtualization
 A resource is defined by
 a rate at which it does “work”
 a fixed overhead that must be paid by each task
 traces of the above if needed + failures
 A task can use multiple resources
 data transfer over multiple links, computation that uses a disk and a CPU
 (a toy sketch of this task/resource model follows below)
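A minimal sketch of that abstraction, not the actual SimGrid API: a resource is a (rate, overhead) pair, a task is an amount of work, and a task’s duration on an otherwise idle resource follows directly. The numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    rate: float      # work units processed per second (e.g., MFlop/s or MB/s)
    overhead: float  # fixed cost in seconds paid by every task using the resource

@dataclass
class Task:
    amount: float    # work units to process (computation, or bytes to transfer)

def completion_time(task: Task, resource: Resource) -> float:
    """Duration of a task alone on one resource: fixed overhead plus amount / rate."""
    return resource.overhead + task.amount / resource.rate

cpu = Resource(rate=1000.0, overhead=0.01)  # hypothetical 1 GFlop/s host
link = Resource(rate=10.0, overhead=0.05)   # hypothetical 10 MB/s link with 50 ms latency
print(completion_time(Task(500.0), cpu), completion_time(Task(100.0), link))
```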
SimGrid
 Uses a combination of mathematical simulation
and coarse-grain discrete event simulation
 Simple API to “specify” an application rather than having
it already implemented
 Fast simulation
 Key issue: Resource sharing
 In MicroGrid: resource sharing “emerges” out of the low-level emulation and simulation
 Packets of different connections interleaved by routers
 CPU cycles of different processes get slices of the CPU
 Drawback: slow simulation
 How can one do something faster that is still
reasonable?
 Come up with macroscopic models of resource sharing
Resource Sharing in SimGrid
 Macroscopic resource sharing can be easy
 A CPU: CPU-bound processes/threads get a fair share
of the CPU “in steady state”
 Why go through the trouble of emulating CPU-bound
processes?
 Just say how many cycles they need, and “compute”
how many cycles they get per second
 Macroscopic resource sharing can be not so easy
 The Network
 Many end-points, routers, and links
 Many end-to-end TCP flows?
 How much bandwidth does each flow receive?
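The network question is taken up below. For the “easy” CPU case above, the macroscopic model amounts to simple arithmetic: with k CPU-bound tasks sharing a CPU in steady state, each progresses at 1/k of its speed. A minimal, hypothetical sketch (not SimGrid code) that computes completion times this way:

```python
def fair_share_completions(cycles_needed, speed):
    """Completion times of CPU-bound tasks fair-sharing one CPU of `speed` cycles/s.
    At any instant, each of the k remaining tasks progresses at speed / k."""
    remaining = sorted(cycles_needed)
    times, now, done = [], 0.0, 0.0
    for i, need in enumerate(remaining):
        k = len(remaining) - i                 # tasks still running
        now += (need - done) * k / speed       # advance until the next task finishes
        done = need                            # work each surviving task has received
        times.append(now)
    return times

# four tasks needing 1, 2, 2, and 4 billion cycles on a 10^9 cycles/s CPU
print(fair_share_completions([1e9, 2e9, 2e9, 4e9], 1e9))  # [4.0, 7.0, 7.0, 9.0]
```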
Bandwidth Sharing
 Macroscopic TCP modeling is a field
 “Fluid in Pipe” analogy
 Rule of Thumb: the share a flow gets on its bottleneck link
is inversely proportional to its Round-Trip Time
[Figure: example per-flow bandwidth shares (10, 8, and 2 Mb/s) on a small multi-link topology]
 Turns out TCP in steady-state implements a type
of resource sharing called: Max-Min Fairness
Max-Min Fairness
 Principle
 Consider the set of all network links, L
 cl is the capacity of link l
 Consider the set of all flows, R
 a flow r is a subset of L
 xr is the bandwidth allocated to flow r
 Bandwidth capacities are respected
 for every link l ∈ L:  ∑ {r ∈ R : l ∈ r} xr ≤ cl
 TCP in steady-state is such that min {r ∈ R} xr is maximized
 The above can be solved efficiently (with appropriate data structures); a progressive-filling sketch follows below
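One simple way to compute such an allocation is “progressive filling”: grow every flow’s rate at the same pace and freeze the flows crossing a link as soon as it saturates. A minimal sketch of the naive version (not the efficient data-structure-based one alluded to above), with a made-up two-link example:

```python
def max_min_shares(capacities, flows):
    """Max-min fair bandwidth shares by progressive filling.
    `capacities`: dict link -> capacity; `flows`: dict flow -> set of links it crosses."""
    share = {f: 0.0 for f in flows}
    active = set(flows)
    cap = dict(capacities)
    while active:
        # largest equal increase every active flow can get before some link saturates
        grow = min(cap[l] / sum(1 for f in active if l in flows[f])
                   for l in cap if any(l in flows[f] for f in active))
        for f in active:
            share[f] += grow
        # consume capacity, then freeze flows crossing newly saturated links
        saturated = set()
        for l in cap:
            cap[l] -= grow * sum(1 for f in active if l in flows[f])
            if cap[l] <= 1e-12 and any(l in flows[f] for f in active):
                saturated.add(l)
        active = {f for f in active if not (flows[f] & saturated)}
    return share

caps = {"L1": 10.0, "L2": 8.0}
flows = {"A": {"L1"}, "B": {"L1", "L2"}, "C": {"L2"}}
print(max_min_shares(caps, flows))  # {'A': 6.0, 'B': 4.0, 'C': 4.0}
```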
SimGrid
 Uses the Max-Min fairness principle for all
resource sharing
 fast
 validated in the real-world for CPUs
 validated with NS-2 for networks
 Limitation
 Max-Min fairness is for steady-state
 e.g., no TCP slow-start
 e.g., no process priority boosts
 Unclear when it starts breaking down
 Is justified for “long enough” transfers and computations
 reasonable for scientific applications
 not so much for applications such as a Grid information service
SimGrid in a NutShell
 CPU: macroscopic flows in a “pipe” (mathematical simulation + coarse-grain d.e. simulation)
 Very fast
 Abstract application model
 Application: macroscopic abstract “tasks” with resource needs and dependencies (coarse-grain d.e. simulation)
 Very fast
 Abstract application model
 Network: macroscopic flows in a “pipe” (mathematical simulation + coarse-grain d.e. simulation)
 Very fast
 Not accurate for short transfers
Other Projects
 ModelNet
 Network emulation
 unmodified application
 packets routed through a “core cluster”
 GigaBit-switched nodes running a modified kernel
 Emulates router queues
 More “emulation” than MicroGrid
 Only for networking, but plans to add support for computation
 Still many “in-house” simulators that may aspire to become widely-used tools
 EmuLab/DummyNet
 ChicagoSim
 OptorSim
 EDGSim
 GridSim
 ...
So what should I use?
 It really depends on your goal / resources
 SimGrid’s network model has clear limitations, e.g. for short
transfers
 SimGrid simulations are easy to set up
 MicroGrid simulations take a lot of time (although they can be
parallelized)
 ModelNet requires some hardware setup
 SimGrid does not require a full application to be written
 MicroGrid models overhead of system calls implicitly
 ...
 Key trade-off: accuracy vs. speed
 The more abstract the simulation, the faster it is
 The less abstract the simulation, the more accurate it is
 Does this trade-off really hold?
Simulation Validation
 The crux of most simulation work in most domains of computer science
 Validation is difficult and almost never done convincingly
 Provide justification that the model is plausible
 Convince people that the simulator implements the model
(“verification”)
 Provide a few graphs that show that it’s reasonable
 validation in a few special cases, at best
 validation against another “validated” simulator
 Argue that although absolute values are off, the trends are
respected
 Conclude that the simulator is useful to compare
algorithms/designs
 Obtain scientific results?????
FLASH vs. FLASH
 FLASH vs. (Simulated) FLASH: Closing the Simulation Loop [Gibson et
al., ASPLOS’00]
 FLASH project at Stanford
 building large-scale shared-memory multiprocessors
 Went from conception, to design, to actual hardware (32-node)
 Used simulation heavily over 6 years
 The authors went back and compared simulation to the real world!
 Simulation error is unavoidable
 30% error in their case was not rare
 Negating the impact of “we got 1.5% improvement”
 One should focus on simulating the important things
 A more complex simulator does not ensure better simulation
 simple simulators worked better than sophisticated ones, which were unstable
 simple simulators predicted trends as well as slower, sophisticated ones
 It is key to use the real-world to tune/calibrate the simulator
 Conclusion: for FLASH, the simple simulator was all that was needed
Presentation Outline
 Introduction
 Simulation for Grid Computing?
 Generating Synthetic Grids
 Simulating Applications on Synthetic Grids
 Current Work and Future Directions
Grid Simulation:
Accuracy vs. Speed
 Comparing simulators and validating them is a gigantic
amount of work
 It will never be clear-cut
 identify simulation “regimes”
 It doesn’t lead to many papers
 It’s eminently politically incorrect
 Results depend on what the simulation is used for
 Therefore nobody does it
 The story one would like to tell is, e.g.:
 Start with SimGrid simulations at first to identify promising
approaches
 Move to MicroGrid emulations to precisely quantify the trade-offs
 How can we substantiate this story?
Current Work in SimGrid
 SimGrid uses a simulation engine called SURF
 currently SURF performs a blend of mathematical simulation and
discrete event simulation
 Current work
 Adding a “MaSSF” back-end (i.e., MicroGrid)
 Adding a “ModelNet” back-end
[Diagram: SimGrid running on top of interchangeable back-ends: SURF, MaSSF, ModelNet]
 Goal: evaluate the speed-accuracy trade-off for simulation of the network
 Expected result:
 SimGrid sufficient and fast for large-enough messages
 e.g., Good for scientific applications
 MaSSF/ModelNet required for small/frequent messages
 e.g., Good for middleware applications (e.g., NWS), p2p applications
 But beware FLASH vs. FLASH!
What about Validation?
 The GRAS project [Quinson et al.]
 Idea: provide a way to compile code into real-world code
and into simulation code
 Write the application using the GRAS API
 Compile it into a SimGrid simulation
 Compile it into a real-world code
 currently provides its own back-end and deployment
 could use Globus as a back-end
 Run and compare
 Goals
 smooth transition from design to prototyping to production
 easy validation and simulation calibration
 Plan
 Use GRAS to easily compare SimGrid, MaSSF, ModelNet, and real
world networks
 Move on to full applications
Conclusion
 Simulation is difficult
 Eternal question: What really matters?
 Grid researchers are actively working on it
 Usable tools exist
 Grid simulation today should not “re-invent the wheel”
 Two crucial next steps
 Repository of synthetic Grids, Grid measurement datasets, and
Grid simulation software
 Scientifically sound validation experiments
 Validate simulators
 Understand what matters and what does not
 Only then will we have a scientific discipline with a
standard way to conduct experiments and a way for
researchers to reproduce each other’s results.
Questions?
A Simple Experiment
- Sent out files of 500MB up to 1.5GB with TTCP
- Using from 1 up to 16 simultaneous connections
- Recorded the data transfer rate per connection
Experimental Results
[Figure: normalized data rate per connection vs. number of concurrent TCP connections]
Max-Min Fairness
 Captures other resource sharing beyond networks!
[Figures: interference of communication and computation [Kreaseck et al., IJHPCA’05]; CPU sharing]