Methods

Simulating a $2M Commercial Server on a $2K PC
Alaa Alameldeen, Milo Martin, Carl Mauer,
Kevin Moore, Min Xu, Daniel Sorin,
Mark D. Hill, & David A. Wood
Multifacet Project (www.cs.wisc.edu/multifacet)
Computer Sciences Department
University of Wisconsin—Madison
February 2003
(C) 2003 Multifacet Project
University of Wisconsin-Madison
Summary
• Context
– Commercial server design is important
– Multifacet project seeks improved designs
– Must evaluate alternatives
• Commercial Servers
– Processors, memory, disks: ~$2M
– Run large multithreaded transaction-oriented workloads
– Use commercial applications on commercial OS
• To Simulate on $2K PC
– Scale & tune workloads (keep L2 miss rates, etc.)
– Manage simulation complexity (separate timing & function)
– Cope with workload variability (use randomness & statistics)
Wisconsin Multifacet Project
Outline
• Context
– Commercial Servers
– Multifacet Project
• Workload & Simulation Methods
• Separate Timing & Functional Simulation
• Cope with Workload Variability
• Summary
Why Commercial Servers?
• Many (Academic) Architects
– Desktop computing
– Wireless appliances
• We focus on servers
– Important Market
– Performance Challenges
– Robustness Challenges
– Methodological Challenges
3-Tier Internet Service
[Diagram: PCs w/ “soft” state ↔ LAN/SAN ↔ servers running applications for “business” rules ↔ LAN/SAN ↔ servers running databases for “hard” state; Multifacet focuses on the server tiers]
Multifacet: Commercial Server Design
• Wisconsin Multifacet Project
– Directed by Mark D. Hill & David A. Wood
– Sponsors: NSF, WI, Compaq, IBM, Intel, & Sun
– Current Contributors: Alaa Alameldeen, Brad Beckman,
Nikhil Gupta, Pacia Harper, Jarrod Lewis, Milo Martin, Carl Mauer,
Kevin Moore, Daniel Sorin, & Min Xu
– Past Contributors: Anastassia Ailamaki, Ender Bilir,
Ross Dickson, Ying Hu, Manoj Plakal, & Anne Condon
• Analysis
– Want 4-64 processors
– Many cache-to-cache misses
– Neither snooping nor directories ideal
• Multifacet Designs
– Snooping w/ multicast [ISCA99] or unordered network [ASPLOS00]
– Bandwidth-adaptive [HPCA02] & token coherence [ISCA03]
Outline
• Context
• Workload & Simulation Methods
– Select, scale, & tune workloads
– Transition workload to simulator
– Specify & test the proposed design
– Evaluate design with simple/detailed processor models
• Separate Timing & Functional Simulation
• Cope with Workload Variability
• Summary
Multifacet Simulation Overview
[Diagram: workload development runs Full Workloads on a Commercial Server (Sun E6000) to produce Scaled Workloads; protocol development feeds the Memory Protocol Generator (SLICC) into a Pseudo-Random Protocol Checker; the timing simulator couples the Full System Functional Simulator (Simics) with the Memory Timing Simulator (Ruby) and Processor Timing Simulator (Opal)]
• Virtutech Simics (www.virtutech.com)
• Rest is Multifacet software
Select Important Workloads
• Online Transaction Processing: DB2 w/ TPC-C-like
• Java Server Workload: SPECjbb
• Static web content serving: Apache
• Dynamic web content serving: Slashcode
• Java-based Middleware: (soon)
Setup & Tune Workloads (on real hardware)
Full Workloads → Commercial Server (Sun E6000)
• Tune workload, OS parameters
• Measure transaction rate, speed-up, miss rates, I/O
• Compare to published results
Scale & Re-tune Workloads
Commercial Server (Sun E6000) → Scaled Workloads
• Scale down for PC memory limits
• Retain similar behavior (e.g., L2 cache miss rate)
• Re-tune to achieve higher transaction rates
(OLTP: raw disk, multiple disks, more users, etc.)
Transition Workloads to Simulation
Scaled Workloads → Full System Functional Simulator (Simics)
• Create disk dumps of tuned workloads
• In simulator: Boot OS, start, & warm application
• Create Simics checkpoint (snapshot)
Specify Proposed Computer Design
Memory Protocol Generator (SLICC) → Memory Timing Simulator (Ruby)
• Coherence Protocol (control tables: states × events)
• Cache Hierarchy (parameters & queues)
• Interconnect (switches & queues)
• Processor (later)
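The states × events control-table idea can be sketched as a lookup table. Below is a hedged illustration with a toy MSI-style protocol; the states, events, and actions are invented for illustration and are not SLICC syntax or an actual Multifacet protocol.

```python
# Illustrative states-x-events control table for a toy MSI-style cache
# controller (NOT SLICC syntax or a real Multifacet protocol).
# Each entry maps (state, event) -> (action, next state).
TRANSITIONS = {
    ("I", "Load"):       ("issue GETS", "IS"),   # read miss: request shared copy
    ("I", "Store"):      ("issue GETX", "IM"),   # write miss: request exclusive copy
    ("IS", "Data"):      ("fill cache", "S"),
    ("IM", "Data"):      ("fill cache", "M"),
    ("S", "Load"):       ("hit", "S"),
    ("S", "Store"):      ("issue GETX", "SM"),
    ("SM", "Data"):      ("fill cache", "M"),
    ("M", "Load"):       ("hit", "M"),
    ("M", "Store"):      ("hit", "M"),
    ("M", "Other_GETS"): ("send data", "S"),     # remote read downgrades us
    ("M", "Other_GETX"): ("send data", "I"),     # remote write invalidates us
    ("S", "Other_GETX"): ("invalidate", "I"),
}

def step(state, event):
    """Apply one event; an unspecified (state, event) pair is a protocol bug."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unspecified transition: {state} x {event}")

action, state = step("I", "Load")    # -> ("issue GETS", "IS")
action, state = step(state, "Data")  # -> ("fill cache", "S")
```

A generator in the spirit of SLICC expands such tables into controller code; the transient states (IS, IM, SM here) are exactly where random testing tends to find bugs.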
Test Proposed Computer Design
Pseudo-Random Protocol Checker → Memory Timing Simulator (Ruby)
• Randomly select write action & later read check
• Massive false-sharing for interaction
• Perverse network stresses design
• Transient error & deadlock detection
• Sound but not complete
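The first two bullets can be sketched as a random action/check loop. This is a hedged, sequential stand-in: `MemorySystem` and its methods are hypothetical names, and the real checker drives the Ruby timing model concurrently under a perverse network rather than a Python dictionary.

```python
# Sketch of pseudo-random protocol checking: randomly write a tagged
# value ("action"), later read it back and verify ("check"). A tiny
# address pool forces heavy sharing. `MemorySystem` is a hypothetical
# stand-in so the sketch runs; it is not the Multifacet checker.
import random

class MemorySystem:
    """Trivially correct reference memory model."""
    def __init__(self, n_procs):
        self.mem = {}
        self.n_procs = n_procs
    def write(self, proc, addr, value):
        self.mem[addr] = value
    def read(self, proc, addr):
        return self.mem.get(addr, 0)

def random_test(system, n_steps, n_addrs, seed=0):
    rng = random.Random(seed)
    expected = {}                           # last value written per address
    for step in range(1, n_steps + 1):
        proc = rng.randrange(system.n_procs)
        addr = rng.randrange(n_addrs)       # few addresses -> massive sharing
        if addr not in expected or rng.random() < 0.5:
            system.write(proc, addr, step)  # step number uniquely tags the write
            expected[addr] = step
        else:                               # the later read check
            got = system.read(proc, addr)
            assert got == expected[addr], \
                f"P{proc} read {got} at addr {addr}, expected {expected[addr]}"

random_test(MemorySystem(n_procs=16), n_steps=10_000, n_addrs=8)
```

The check is sound (any failure is a real bug) but not complete (a passing run proves nothing), matching the last bullet above.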
Simulate with Simple Blocking Processor
Scaled Workloads → Full System Functional Simulator (Simics) + Memory Timing Simulator (Ruby)
• Warm up caches (cache warm-up alone is sometimes sufficient, e.g., SafetyNet)
• Run for a fixed number of transactions
– Some transactions partially done at start
– Other transactions partially done at end
• Cope with workload variability (later)
Simulate with Detailed Processor
Scaled Workloads → Full System Functional Simulator (Simics) + Memory Timing Simulator (Ruby) + Processor Timing Simulator (Opal)
• Accurate (future) timing & (current) function
• Simulation complexity decoupled (discussed soon)
• Same transaction methodology & workload variability issues
Simulation Infrastructure & Workload Process
[Diagram: the complete toolchain: Full Workloads, Commercial Server (Sun E6000), Scaled Workloads, Memory Protocol Generator (SLICC), Pseudo-Random Protocol Checker, Full System Functional Simulator (Simics), Memory Timing Simulator (Ruby), Processor Timing Simulator (Opal)]
• Select important workloads: run, tune, scale, & re-tune
• Specify system & pseudo-randomly test
• Create warm workload checkpoint
• Simulate with simple or detailed processor
• Fixed #transactions; manage simulation complexity (next) & cope with workload variability (after that)
Outline
• Context
• Simulation Infrastructure & Workload Process
• Separate Timing & Functional Simulation
– Simulation Challenges
– Managing Simulation Complexity
– Timing-First Simulation
– Evaluation
• Cope with Workload Variability
• Summary
Challenges to Timing Simulation
• Execution-driven simulation is getting harder
• Micro-architecture complexity
– Multiple “in-flight” instructions
– Speculative execution
– Out-of-order execution
• Thread-level parallelism
– Hardware Multi-threading
– Traditional Multi-processing
Challenges to Functional Simulation
• Commercial workloads have high functional fidelity
demands
• Application complexity: from kernels and SPEC benchmarks up to a full web server, database, and operating system
[Diagram: the simulated target system must model not just the processor and RAM but also the MMU, status registers, real-time clock, serial port, I/O MMU controller, DMA controller, IRQ controller, terminal, PCI bus, graphics card, Ethernet controller, CD-ROM, SCSI controllers and disks, and Fibre Channel controller]
Managing Simulator Complexity
• Integrated (SimOS): one combined timing-and-functional simulator
  (+ timing feedback; - complex)
• Functional-First (trace-driven): functional simulator feeds timing simulator
  (- no timing feedback)
• Timing-Directed: timing simulator (complete timing, no function) drives functional simulator (no timing, complete function)
  (+ timing feedback; - tight coupling; - performance?)
• Timing-First (Multifacet): timing simulator (complete timing, partial function) checked by functional simulator (no timing, complete function)
  (+ timing feedback; + uses existing simulators; + software development advantages)
Timing-First Simulation
• Timing Simulator
– does functional execution of user and privileged operations
– does speculative, out-of-order multiprocessor timing simulation
– does NOT implement functionality of full instruction set or any devices
• Functional Simulator
– does full-system multiprocessor simulation
– does NOT model detailed micro-architectural timing
[Diagram: the timing simulator (CPU, cache, network) executes and commits each instruction; at commit it is verified against the functional simulator (CPU, system, RAM)]
Timing-First Operation
• As instruction retires, step CPU in functional simulator
• Verify instruction’s execution
• Reload state if timing simulator deviates from functional
– Loads in multiprocessors
– Instructions with unidentified side-effects
– NOT loads/stores to I/O devices
[Diagram: as an instruction commits in the timing simulator, the functional simulator verifies it; on a deviation, the timing simulator reloads state from the functional simulator]
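The retire/verify/reload loop can be sketched as follows. All class and method names here are hypothetical stand-ins, not TFsim's API; an injected bug demonstrates the reload path.

```python
# Hedged sketch of timing-first operation (names are illustrative, not
# TFsim's API). The timing simulator commits an instruction, the
# functional ("golden") simulator steps one instruction, the two
# architectural states are compared, and on a deviation the timing
# simulator reloads state from the functional simulator.

class ToyCPU:
    """Minimal architectural state: a program counter and one register."""
    def __init__(self):
        self.pc, self.r0 = 0, 0
    def arch_state(self):
        return (self.pc, self.r0)
    def load_state(self, state):
        self.pc, self.r0 = state

class FunctionalCPU(ToyCPU):
    def step(self):                      # golden model: r0 += 1 per instruction
        self.pc += 4
        self.r0 += 1

class TimingCPU(ToyCPU):
    def __init__(self, buggy_at=None):
        super().__init__()
        self.buggy_at = buggy_at         # inject one deviation, for demonstration
        self.retired = 0
    def commit_next(self):
        self.pc += 4
        self.r0 += 2 if self.retired == self.buggy_at else 1
        self.retired += 1
        return self.arch_state()

def run(n_instructions, timing, functional):
    deviations = 0
    for _ in range(n_instructions):
        committed = timing.commit_next()  # instruction retires in timing sim
        functional.step()                 # step CPU in functional simulator
        golden = functional.arch_state()
        if committed != golden:           # verify instruction's execution
            timing.load_state(golden)     # reload state on deviation
            deviations += 1
    return deviations

# One injected bug causes exactly one deviation; after the reload the
# two simulators stay in sync.
assert run(10, TimingCPU(buggy_at=3), FunctionalCPU()) == 1
```

The low measured deviation rate (under 0.02%) is what makes this check-and-reload scheme cheap in practice.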
Benefits of Timing-First
• Supports speculative multi-processor timing models
• Leverages existing simulators
• Software development advantages
– Increases flexibility and reduces code complexity
– Immediate, precise check on timing simulator
• However:
– How much performance error is introduced in this approach?
– Are there simulation performance penalties?
Evaluation
• Our implementation, TFsim, uses:
– Functional Simulator: Virtutech Simics
– Timing Simulator: implemented in less than one person-year
• Evaluated using OS intensive commercial workloads
– OS Boot: > 1 billion instructions of Solaris 8 startup
– OLTP: TPC-C-like benchmark using a 1 GB database
– Dynamic Web: Apache serving message board, using code
and data similar to slashdot.org
– Static Web: Apache web server serving static web pages
– Barnes-Hut: Scientific SPLASH-2 benchmark
Measured Deviations
• Less than 20 deviations per 100,000 instructions (0.02%)
If the Timing Simulator Modeled Fewer Events
Analysis of Results
• Runs full-system workloads!
• Timing performance impact of deviations
– Worst case: less than 3% performance error
• ‘Overhead’ of redundant execution
– 18% on average for uniprocessors
– 18% (2 processors) up to 36% (16 processors)
[Chart: total execution time split between the functional simulator and the timing simulator]
Performance Comparison
RSIM vs. TFsim:
– Target application: SPLASH-2 kernels vs. SPLASH-2 kernels (match)
– Target system: out-of-order MP SPARC V9 vs. out-of-order MP full-system SPARC V9 (close)
– Host computer: 400 MHz SPARC running Solaris vs. 1.2 GHz Pentium running Linux (different)
• Absolute simulation performance comparison
– In kilo-instructions committed per second (KIPS)
– RSIM Scaled: 107 KIPS
– Uniprocessor TFsim: 119 KIPS
Timing-First Conclusions
• Execution-driven simulators are increasingly complex
• How to manage complexity?
• Our answer:
[Diagram: Timing-First Simulation: timing simulator (complete timing, partial function) backed by functional simulator (no timing, complete function)]
– Introduces relatively little performance error (worst case: 3%)
– Has low overhead (18% uniprocessor average)
– Rapid development time
Outline
• Context
• Workload Process & Infrastructure
• Separate Timing & Functional Simulation
• Cope with Workload Variability
– Variability in Multithreaded Workloads
– Coping in Simulation
– Examples & Statistics
• Summary
What is Happening Here?
OLTP
What is Happening Here?
• How can slower memory lead to faster workload?
• Answer: Multithreaded workload takes different path
– Different lock race outcomes
– Different scheduling decisions
• (1) Does this happen for real hardware?
• (2) If so, what should we do about it?
One Second Intervals (on real hardware)
OLTP
60 Second Intervals (on real hardware)
[Chart: OLTP over 60-second intervals; a run this long corresponds to a 16-day simulation]
Coping with Workload Variability
• Running (simulating) long enough is not appealing
• Need to separate coincidental & real effects
• Standard statistics on real hardware
– Variation within base system runs
vs. variation between base & enhanced system runs
– But deterministic simulation has no “within” variation
• Solution with deterministic simulation
– Add pseudo-random delay on L2 misses
– Simulate base (enhanced) system many times
– Use simple or complex statistics
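A hedged sketch of that recipe, with a toy stand-in for the simulator (the per-miss costs, run counts, and perturbation range below are invented for illustration):

```python
# Sketch of coping with deterministic-simulation variability: add a
# small pseudo-random delay to each L2 miss, repeat each configuration
# with many seeds, then compare means via confidence intervals. The
# "simulator" and its costs are toy stand-ins, not Ruby.
import random
import statistics

def simulate(seed, l2_miss_cost):
    """Toy run: cycles for a fixed amount of work, with a pseudo-random
    0-4 cycle perturbation added to each of 1000 L2 misses."""
    rng = random.Random(seed)
    return sum(l2_miss_cost + rng.randrange(5) for _ in range(1000))

def mean_ci95(samples):
    """95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(samples)
    half = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half, mean + half

# Simulate the base and enhanced systems many times, varying only seeds.
base     = [simulate(seed, l2_miss_cost=100) for seed in range(20)]
enhanced = [simulate(seed + 100, l2_miss_cost=95) for seed in range(20)]

lo_base, hi_base = mean_ci95(base)
lo_enh, hi_enh = mean_ci95(enhanced)

# Non-overlapping intervals -> a statistically significant speedup.
assert hi_enh < lo_base
```

With more than two configurations, ANOVA or hypothesis testing is preferable to eyeballing interval overlap, as the deck itself recommends.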
Coincidental (Space) Variability
Wrong Conclusion Ratio
• WCR (16,32) = 18%
• WCR (16,64) = 7.5%
• WCR (32,64) = 26%
More Generally: Use Standard Statistics
• As one would for a measurement of a “live” system
• Confidence Intervals
– 95% confidence intervals contain true value 95% of the time
– Non-overlapping confidence intervals give statistically
significant conclusions
• Use ANOVA or Hypothesis Testing – even better!
Confidence Interval Example
[Chart: ROB experiment confidence intervals]
• Estimate #runs to get non-overlapping confidence intervals
Also Time Variability (on real hardware)
OLTP
• Therefore, select checkpoint(s) carefully
Workload Variability Summary
• Variability is a real phenomenon for multi-threaded
workloads
– Runs from same initial conditions are different
• Variability is a challenge for simulations
– Simulations are short
– Wrong conclusions may be drawn
• Our solution accounts for variability
– Multiple runs, confidence intervals
– Reduces wrong conclusion probability
Talk Summary
• Simulations of $2M Commercial Servers must
– Complete in reasonable time (on $2K PCs)
– Handle OS, devices, & multithreaded hardware
– Cope with variability of multithreaded software
• Multifacet
– Scale & tune transactional workloads
– Separate timing & functional simulation
– Cope w/ workload variability via randomness & statistics
• References (www.cs.wisc.edu/multifacet/papers)
– Simulating a $2M Commercial Server on a $2K PC [Computer03]
– Full-System Timing-First Simulation [Sigmetrics02]
– Variability in Architectural Simulations … [HPCA03]
Other Multifacet Methods Work
• Specifying & Verifying Coherence Protocols
– [SPAA98], [HPCA99], [SPAA99], & [TPDS02]
• Workload Analysis & Improvement
– Database systems [VLDB99] & [VLDB01]
– Pointer-based [PLDI99] & [Computer00]
– Middleware [HPCA03]
• Modeling & Simulation
– Commercial workloads [Computer02] & [HPCA03]
– Decoupling timing/functional simulation [Sigmetrics02]
– Simulation generation [PLDI01]
– Analytic modeling [Sigmetrics00] & [TPDS TBA]
– Micro-architectural slack [ISCA02]
Backup Slides
One Ongoing/Future Methods Direction
• Middleware Applications
– Memory system behavior of Java Middleware [HPCA 03]
– Machine measurements
– Full-system simulation
• Future Work: Multi-Machine Simulation
– Isolate middle-tier from client emulators and database
• Understand fundamental workload behaviors
– Drives future system design
ECperf vs. SPECjbb
[Chart: cache-to-cache transfers (%) vs. touched cache lines (KB) for ECperf and SPECjbb]
• Different cache-to-cache transfer ratios!
Online Transaction Processing (OLTP)
• DB2 with a TPC-C-like workload. The TPC-C benchmark is widely used to
evaluate system performance for the on-line transaction processing market.
The benchmark itself is a specification that describes the schema, scaling rules,
transaction types and transaction mix, but not the exact implementation of the
database. TPC-C transactions are of five transaction types, all related to an
order-processing environment. Performance is measured by the number of
“New Order” transactions performed per minute (tpmC).
• Our OLTP workload is based on the TPC-C v3.0 benchmark. We use IBM’s
DB2 V7.2 EEE database management system and an IBM benchmark kit to
build the database and emulate users. We build an 800 MB 4000-warehouse
database on five raw disks and an additional dedicated database log disk. We
scaled down the sizes of each warehouse by maintaining the reduced ratios of
3 sales districts per warehouse, 30 customers per district, and 100 items per
warehouse (compared to 10, 30,000 and 100,000 required by the TPC-C
specification). Each user randomly executes transactions according to the
TPC-C transaction mix specifications, and we set the think and keying times
for users to zero. A different database thread is started for each user. We
measure all completed transactions, even those that do not satisfy timing
constraints of the TPC-C benchmark specification.
Java Server Workload (SPECjbb)
• Java-based middleware applications are increasingly used in modern e-business settings. SPECjbb is a Java benchmark emulating a 3-tier
system with emphasis on the middle tier server business logic.
SPECjbb runs in a single Java Virtual Machine (JVM) in which threads
represent terminals in a warehouse. Each thread independently
generates random input (tier 1 emulation) before calling transaction-specific business logic. The business logic operates on the data held in
binary trees of Java objects (tier 3 emulation). The specification states
that the benchmark does no disk or network I/O.
• We used Sun’s HotSpot 1.4.0 Server JVM and Solaris’s native thread
implementation. The benchmark includes driver threads to generate
transactions. We set the system heap size to 1.8 GB and the new object
heap size to 256 MB to reduce the frequency of garbage collection.
Our experiments used 24 warehouses, with a data size of
approximately 500 MB.
Static Web Content Serving: Apache
• Web servers such as Apache represent an important enterprise server
application. Apache is a popular open-source web server used in many
internet/intranet settings. In this benchmark, we focus on static web
content serving.
• We use Apache 2.0.39 for SPARC/Solaris 8 configured to use pthread
locks and minimal logging at the web server. We use the Scalable URL
Request Generator (SURGE) as the client. SURGE generates a
sequence of static URL requests which exhibit representative
distributions for document popularity, document sizes, request sizes,
temporal and spatial locality, and embedded document count. We use a
repository of 20,000 files (totaling ~500 MB), and use clients with
zero think time. We compiled both Apache and SURGE using Sun’s
WorkShop C 6.1 with aggressive optimization.
Dynamic Web Content Serving: Slashcode
• Dynamic web content serving has become increasingly important for
web sites that serve large amounts of information. Dynamic content is
used by online stores, instant news, and community message board
systems. Slashcode is an open-source dynamic web message posting
system used by the popular slashdot.org message board system.
• We used Slashcode 2.0, Apache 1.3.20, and Apache’s mod_perl
module 1.25 (with perl 5.6) on the server side. We used MySQL
3.23.39 as the database engine. The server content is a snapshot from
the slashcode.com site, containing approximately 3000 messages with
a total size of 5 MB. Most of the run time is spent on dynamic web
page generation. We use a multi-threaded user emulation program to
emulate user browsing and posting behavior. Each user independently
and randomly generates browsing and posting requests to the server
according to a transaction mix specification. We compiled both server
and client programs using Sun’s WorkShop C 6.1 with aggressive
optimization.