Workload Selection and Characterization
Andy Wang
CIS 5930
Computer Systems Performance Analysis
Workloads
• Types of workloads
• Workload selection
2
Types of Workloads
• What is a Workload?
• Instruction Workloads
• Synthetic Workloads
• Real-World Benchmarks
• Application Benchmarks
• “Standard” Benchmarks
• Exercisers and Drivers
3
What is a Workload?
• Workload: anything a computer is asked
to do
• Test workload: any workload used to
analyze performance
• Real workload: any observed during
normal operations
• Synthetic workload: created for
controlled testing
4
Real Workloads
• Advantage: represent reality
• Disadvantage: uncontrolled
– Can’t be repeated
– Can’t be described simply
– Difficult to analyze
• Nevertheless, often useful for “final
analysis” papers
– E.g., “We ran system foo and it works well”
5
Synthetic Workloads
• Advantages:
– Controllable
– Repeatable
– Portable to other systems
– Easily modified
• Disadvantage: can never be sure real
world will be the same
6
Instruction Workloads
• Useful only for CPU performance
– But teach useful lessons for other
situations
• Development over decades
– “Typical” instruction (ADD)
– Instruction mix (by frequency of use)
• Sensitive to compiler, application, architecture
• Still used today (GFLOPS)
– Processor clock rate
• Only valid within processor family
7
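To make the instruction-mix idea concrete, here is a minimal Python sketch of how a mix-weighted average instruction time (and a MIPS figure) could be computed. The instruction classes, frequencies, cycle counts, and clock rate are invented illustrative values, not measurements from any real machine.

# Estimate average instruction time from a hypothetical instruction mix.
# Frequencies and cycles-per-instruction are illustrative, not measured.
mix = {                           # class: (fraction of executed instrs, cycles)
    "load/store":     (0.30, 4),
    "integer ALU":    (0.40, 1),
    "branch":         (0.20, 2),
    "floating point": (0.10, 6),
}
clock_hz = 3.0e9                  # assumed 3 GHz clock

avg_cpi = sum(freq * cyc for freq, cyc in mix.values())
avg_instr_time_ns = avg_cpi / clock_hz * 1e9
mips = clock_hz / avg_cpi / 1e6

print(f"average CPI = {avg_cpi:.2f}")
print(f"average instruction time = {avg_instr_time_ns:.2f} ns")
print(f"native MIPS = {mips:.0f}")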
Instruction Workloads
(cont’d)
• Modern complexity makes mixes invalid
– Pipelining
– Data/instruction caching
– Prefetching
• Kernel is inner loop that does useful
work:
– Sieve, matrix inversion, sort, etc.
– Ignores setup, I/O, so can be timed by
analysis if desired (at least in theory)
8
Synthetic Workloads
• Complete programs
– Designed specifically for measurement
– May do real or “fake” work
– May be adjustable (parameterized)
• Two major classes:
– Benchmarks
– Exercisers
9
Real-World Benchmarks
• Pick a representative application
• Pick sample data
• Run it on system to be tested
• Modified Andrew Benchmark (MAB) is a real-world benchmark
• Easy to do, accurate for that sample data
• Fails to consider other applications, data
10
Application Benchmarks
• Variation on real-world benchmarks
• Choose most important subset of
functions
• Write benchmark to test those functions
• Tests what computer will be used for
• Need to be sure important
characteristics aren’t missed
• Mix of functions must reflect reality
11
“Standard” Benchmarks
• Often need to compare general-purpose
computer systems for general-purpose
use
– E.g., should I buy a Sony or a Dell PC?
– Tougher: Mac or PC?
• Desire for an easy, comprehensive
answer
• People writing articles often need to
compare tens of machines
12
“Standard” Benchmarks
(cont’d)
• Often need to make comparisons over
time
– Is this year’s CPU faster than last year’s?
• Probably yes, but by how much?
• Don’t want to spend time writing own code
– Could be buggy or not representative
– Need to compare against other people’s
results
• “Standard” benchmarks offer solution
13
Popular “Standard”
Benchmarks
• Sieve, 8 queens, etc.
• Whetstone
• Linpack
• Dhrystone
• Debit/credit
• TPC
• SPEC
• MAB
• Winstone, webstone, etc.
• ...
14
Sieve, etc.
• Prime number sieve (Eratosthenes)
– Nested for loops
– Often such small array that it’s silly
• 8 queens
– Recursive
• Many others
• Generally not representative of real
problems
15
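For reference, a minimal Python sketch of the sieve kernel mentioned above. The array size N is arbitrary; classic versions often used arrays small enough to fit entirely in cache, which is part of the criticism on the slide.

# Sieve of Eratosthenes kernel: nested loops over a boolean array.
N = 8192                                  # arbitrary illustrative size
is_prime = [True] * (N + 1)
is_prime[0] = is_prime[1] = False
for i in range(2, int(N ** 0.5) + 1):
    if is_prime[i]:
        for j in range(i * i, N + 1, i):  # cross off multiples of i
            is_prime[j] = False
print(sum(is_prime), "primes up to", N)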
Whetstone
• Dates way back (can compare against results from the 1970s)
• Based on real observed frequencies
• Entirely synthetic (no useful result)
– Modern optimizers may delete code
• Mixed data types, but best for floating point
• Be careful of incomparable variants!
16
LINPACK
• Based on real programs and data
• Developed by supercomputer users
• Great if you’re doing serious numerical
computation
17
Dhrystone
• Bad pun on “Whetstone”
• Motivated by Whetstone’s perceived
excessive emphasis on floating point
• Dates to when p’s were integer-only
• Very popular in PC world
• Again, watch out for version
mismatches
18
Debit/Credit Benchmark
• Developed for transaction processing
environments
– CPU processing is usually trivial
– Remarkably demanding I/O, scheduling
requirements
• Models real TPS workloads
synthetically
• Modern version is TPC benchmark
19
SPEC Suite
• Result of multi-manufacturer consortium
• Addresses flaws in existing benchmarks
• Uses 10 real applications, trying to
characterize specific real environments
• Considers multiple CPUs
• Geometric mean gives SPECmark for
system
• Becoming standard comparison method
20
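To illustrate how a single composite figure can be formed from per-benchmark results, here is a minimal Python sketch of a geometric mean over normalized ratios. The ratio values are invented for illustration and are not SPEC results; the normalization rules of any particular SPEC suite are not modeled here.

import math

# Hypothetical speedup ratios (reference time / measured time) for a
# suite of benchmarks; the values are illustrative only.
ratios = [12.1, 9.8, 15.3, 11.0, 8.7, 14.2, 10.5, 13.6, 9.9, 12.8]

# Geometric mean: nth root of the product, computed via logs.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"composite mark = {geo_mean:.1f}")

Using a geometric mean keeps the composite independent of which machine is chosen as the reference, which an arithmetic mean of ratios would not.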
Modified Andrew
Benchmark
• Used in research to compare file
system, operating system designs
• Based on software engineering
workload
• Exercises copying, compiling, linking
• Probably ill-designed, but common use
makes it important
• Needs scaling up for modern systems
21
Winstone, Webstone, etc.
• “Stone” has become suffix meaning
“benchmark”
• Many specialized suites to test
specialized applications
– Too many to review here
• Important to understand strengths &
drawbacks
– Bias toward certain workloads
– Assumptions about system under test
22
Exercisers and Drivers
• For I/O, network, non-CPU
measurements
• Generate a workload, feed to internal or
external measured system
– I/O on local OS
– Network
• Sometimes uses dedicated system,
interface hardware
23
Advantages of Exercisers
• Easy to develop, port
• Can incorporate measurement
• Easy to parameterize, adjust
24
Disadvantages
of Exercisers
• High cost if external
• Often too small compared to real
workloads
– Thus not representative
– E.g., may use caches “incorrectly”
• Internal exercisers often don’t have real
CPU activity
– Affects overlap of CPU and I/O
• Synchronization effects caused by loops
25
Workload Selection
• Services exercised
• Completeness
– Sample service characterization
• Level of detail
• Representativeness
• Timeliness
• Other considerations
26
Services Exercised
• What services does system actually
use?
– Network performance useless for matrix
work
• What metrics measure these services?
– MIPS/GIPS for CPU speed
– Bandwidth/latency for network, I/O
– TPS for transaction processing
27
Completeness
• Computer systems are complex
– Effect of interactions hard to predict
• Dynamic voltage scaling can speed up heavy
loads (e.g., accessing encrypted files)
– So must be sure to test entire system
• Important to understand balance
between components
– I.e., don’t use 90% CPU mix to evaluate
I/O-bound application
28
Component Testing
• Sometimes only individual components
are compared
– Would a new CPU speed up our system?
– How does IPv6 affect Web server
performance?
• But component may not be directly
related to performance
– So be careful, do ANOVA (analysis of
variance), don’t extrapolate too much
29
Service Testing
• May be possible to isolate interfaces to
just one component
– E.g., instruction mix for CPU
• Consider services provided and used by
that component
• System often has layers of services
– Can cut at any point and insert workload
30
Characterizing a Service
• Identify service provided by major
subsystem
• List factors affecting performance
• List metrics that quantify demands and
performance
• Identify workload provided to that
service
31
Example: Web Server
[Layered diagram: Web Page Visits → Web Client → TCP/IP Connections → Network → HTTP Requests → Web Server → Web Page Accesses → File System → Disk Transfers → Disk Drive]
32
Web Client Analysis
• Services: visit page, follow hyperlink,
display page information
• Factors: page size, number of links,
fonts required, embedded graphics,
sound
• Metrics: response time (both definitions)
• Workload: a list of pages to be visited
and links to be followed
33
Network Analysis
• Services: connect to server, transmit
request, transfer data
• Factors: packet size, protocol used
• Metrics: connection setup time,
response latency, achieved bandwidth
• Workload: a series of connections to
one or more servers, with data transfer
34
Web Server Analysis
• Services: accept and validate
connection, fetch & send HTTP data
• Factors: Network performance, CPU
speed, system load, disk subsystem
performance
• Metrics: response time, connections
served
• Workload: a stream of incoming HTTP
connections and requests
35
File System Analysis
• Services: open file, read file (writing
often doesn’t matter for Web server)
• Factors: disk drive characteristics, file
system software, cache size, partition
size
• Metrics: response time, transfer rate
• Workload: a series of file-transfer
requests
36
Disk Drive Analysis
• Services: read sector, write sector
• Factors: seek time, transfer rate
• Metrics: response time
• Workload: a statistically-generated stream of read/write requests
37
Level of Detail
• Detail trades off accuracy vs. cost
• Highest detail is complete trace
• Lowest is one request, usually most
common
• Intermediate approach: weight by
frequency
• We will return to this when we discuss
workload characterization
38
Representativeness
• Obviously, workload should represent
desired application
– Arrival rate of requests
– Resource demands of each request
– Resource usage profile of workload over
time
• Again, accuracy and cost trade off
• Need to understand whether detail
matters
39
Timeliness
• Usage patterns change over time
– File size grows to match disk size
– Web pages grow to match network
bandwidth
• If using “old” workloads, must be sure
user behavior hasn’t changed
• Even worse, behavior may change after
test, as result of installing new system
– “Latent demand” phenomenon
40
Other Considerations
• Loading levels
– Full capacity
– Beyond capacity
– Actual usage
• External components not considered as
parameters
• Repeatability of workload
41
Workload
Characterization
• Terminology
• Averaging
• Specifying dispersion
• Single-parameter histograms
• Multi-parameter histograms
• Principal-component analysis
• Markov models
• Clustering
42
Workload Characterization
Terminology
• User (maybe nonhuman) requests
service
– Also called workload component or
workload unit
• Workload parameters or workload
features model or characterize the
workload
43
Selecting
Workload Components
• Most important: components should be
external: at interface of SUT (system
under test)
• Components should be homogeneous
• Should characterize activities of interest
to the study
44
Choosing
Workload Parameters
• Select parameters that depend only on
workload (not on SUT)
• Prefer controllable parameters
• Omit parameters that have no effect on
system, even if important in real world
45
Averaging
• Basic character of a parameter is its
average value
• Not just arithmetic mean
• Good for uniform distributions or gross
studies
46
Specifying Dispersion
• Most parameters are non-uniform
• Specifying variance or standard
deviation brings major improvement
over average
• Average and s.d. (or C.O.V., the coefficient of variation) together allow workloads to be grouped into classes
– Still ignores exact distribution
47
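A minimal Python sketch of the average-plus-dispersion summary described above; the sample values are invented for illustration.

import statistics

# Illustrative sample of one workload parameter (e.g., request sizes in KB).
sizes = [4, 8, 8, 16, 4, 32, 8, 4, 64, 8]

mean = statistics.mean(sizes)
sd = statistics.stdev(sizes)       # sample standard deviation
cov = sd / mean                    # coefficient of variation (C.O.V.)
print(f"mean={mean:.1f}  s.d.={sd:.1f}  C.O.V.={cov:.2f}")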
Single-Parameter
Histograms
• Make histogram or kernel density
estimate
• Fit probability distribution to shape of
histogram
• Chapter 27 (not covered in course) lists
many useful shapes
• Ignores multiple-parameter correlations
48
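A minimal Python sketch of the histogram-then-fit idea, assuming NumPy is available. The data are synthetic, and the fit is a simple method-of-moments fit of an exponential, just one of the many candidate shapes.

import numpy as np

# Synthetic interarrival times (seconds) standing in for measured data.
rng = np.random.default_rng(0)
samples = rng.exponential(scale=0.5, size=1000)

# Histogram the parameter to see its shape...
counts, edges = np.histogram(samples, bins=20)

# ...then fit a candidate distribution.  Method-of-moments fit of an
# exponential: rate = 1 / sample mean.  Compare the fitted density
# against the histogram bins to judge the fit.
rate = 1.0 / samples.mean()
print(f"fitted exponential rate = {rate:.2f} per second")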
Multi-Parameter
Histograms
• Use 3-D plotting package to show 2
parameters
– Or plot each datum as 2-D point and look
for “black spots”
• Shows correlations
– Allows identification of important
parameters
• Not practical for 3 or more parameters
49
Principal-Component
Analysis (PCA)
• How to analyze more than 2
parameters?
• Could plot endless pairs
– Still might not show complex relationships
• Principal-component analysis solves
problem mathematically
– Rotates parameter set to align with axes
– Sorts axes by importance
50
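A minimal NumPy sketch of principal-component analysis on synthetic workload parameters; standardizing the columns first is what makes the result insensitive to the original scales. The data, and the correlation injected between two parameters, are invented for illustration.

import numpy as np

# Synthetic observations: rows = workload components, columns = parameters
# (e.g., CPU time, I/O count, memory).  Values are illustrative only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]   # make two parameters correlated

# Standardize (zero mean, unit variance), then eigendecompose the
# resulting correlation matrix; eigenvectors are the principal
# components, sorted by the variance (eigenvalue) they explain.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
explained = eigvals[order] / eigvals.sum()
print("fraction of variance explained:", np.round(explained, 2))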
Advantages of PCA
• Handles more than two parameters
• Insensitive to scale of original data
• Detects dispersion
• Combines correlated parameters into single variable
• Identifies variables by importance
51
Disadvantages of PCA
• Tedious computation (if no software)
• Still requires hand analysis of final
plotted results
• Often difficult to relate results back to
original parameters
52
Markov Models
• Sometimes, distribution isn’t enough
• Requests come in sequences
• Sequencing affects performance
• Example: disk bottleneck
– Suppose jobs need 1 disk access per CPU
slice
– CPU slice is much faster than disk
– Strict alternation uses CPU better
– Long disk access strings slow system
53
Introduction to
Markov Models
• Represent model as state diagram
• Probabilistic transitions between states
• Requests generated on transitions
[State diagram: CPU, Network, and Disk states connected by probabilistic transitions; the edge labels are the transition probabilities shown in the matrix on the next slide]
54
Creating a Markov Model
• Observe long string of activity
• Use matrix to count pairs of states
• Normalize rows to sum to 1.0
Resulting transition probabilities (each row sums to 1.0):
  from CPU:     0.6, 0.4
  from Network: 0.3, 0.4, 0.3
  from Disk:    0.8, 0.2
55
Example Markov Model
• Reference string of opens, reads,
closes:
ORORRCOORCRRRRCC
• Pairwise frequency matrix:
       Open  Read  Close  Sum
Open     1     3     0     4
Read     1     4     3     8
Close    1     1     1     3
56
Markov Model
for I/O String
• Divide each row by its sum to get
transition matrix:
       Open  Read  Close
Open   0.25  0.75  0.00
Read   0.13  0.50  0.37
Close  0.33  0.33  0.34
• Model: [State diagram with Open, Read, and Close states; edges labeled with the transition probabilities above]
57
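A minimal Python sketch that rebuilds both matrices from the slide’s reference string, following the recipe on the earlier slide: count adjacent pairs of states, then normalize each row to sum to 1.0.

from collections import Counter

trace = "ORORRCOORCRRRRCC"     # reference string from the slide
states = ["O", "R", "C"]

# Count adjacent pairs of states (the pairwise frequency matrix)...
pairs = Counter(zip(trace, trace[1:]))

# ...then normalize each row to 1.0, giving transition probabilities.
for s in states:
    row_total = sum(pairs[(s, t)] for t in states)
    probs = [pairs[(s, t)] / row_total for t in states]
    print(s, [round(p, 2) for p in probs])
# Matches the slide up to rounding (e.g., the Read row is exactly
# 1/8, 4/8, 3/8, shown on the slide as 0.13, 0.50, 0.37).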
Clustering
• Often useful to break workload into
categories
• “Canonical example” of each category
can be used to represent all samples
• If many samples, generating categories
is difficult
• Solution: clustering algorithms
58
Steps in Clustering
• Select sample
• Choose and transform parameters
• Drop outliers
• Scale observations
• Choose distance measure
• Do clustering
• Use results to adjust parameters, repeat
• Choose representative components
59
Selecting A Sample
• Clustering algorithms are often slow
– Must use subset of all observations
• Can test sample after clustering: does
every observation fit into some cluster?
• Sampling options
– Random
– Heaviest users of component under study
60
Choosing and Transforming
Parameters
• Goal is to limit complexity of problem
• Concentrate on parameters with high
impact, high variance
– Use principal-component analysis
– Drop a parameter, re-cluster, see if
different
• Consider transformations such as those in Sec. 15.4 (logarithms, etc.)
61
Dropping Outliers
• Must get rid of observations that would
skew results
– Need great judgment here
– No firm guidelines
• Drop things that you know are “unusual”
• Keep things that consume major
resources
– E.g., daily backups
62
Scale Observations
• Cluster analysis is often sensitive to
parameter ranges, so scaling affects
results
• Options:
– Scale to zero mean and unit variance
– Weight based on importance or variance
– Normalize range to [0, 1]
– Normalize 95% of data to [0, 1]
63
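A minimal NumPy sketch of two of the scaling options listed above, z-scores and range normalization; the observations are invented for illustration.

import numpy as np

# One column of illustrative observations for a single parameter.
x = np.array([3.0, 15.0, 7.0, 42.0, 11.0, 5.0])

# Option 1: zero mean, unit variance (z-scores).
z = (x - x.mean()) / x.std()

# Option 2: normalize the observed range to [0, 1].
r = (x - x.min()) / (x.max() - x.min())

print(np.round(z, 2))
print(np.round(r, 2))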
Choosing
a Distance Measure
• Endless possibilities available
• Represent observations as vectors in k-space
• Popular measures include:
– Euclidean distance, weighted or
unweighted
– Chi-squared distance
– Rectangular distance
64
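A minimal NumPy sketch of unweighted Euclidean, weighted Euclidean, and rectangular (Manhattan) distances between two observations. The vectors and weights are invented for illustration; chi-squared distance is omitted here.

import numpy as np

# Two observations represented as vectors in k-space (illustrative values).
a = np.array([1.0, 4.0, 0.5])
b = np.array([2.0, 1.0, 0.7])
w = np.array([1.0, 0.1, 10.0])          # assumed per-parameter weights

euclidean   = np.sqrt(np.sum((a - b) ** 2))
weighted    = np.sqrt(np.sum(w * (a - b) ** 2))
rectangular = np.sum(np.abs(a - b))     # a.k.a. Manhattan distance

print(euclidean, weighted, rectangular)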
Clustering Methods
• Many algorithms available
• Computationally expensive (NP-hard)
• Can be simple or hierarchical
• Many require you to specify number of desired clusters
• Minimum Spanning Tree is not only
option!
65
Minimum Spanning Tree
Clustering
• Start with each point in a cluster
• Repeat until single cluster:
– Compute centroid of each cluster
– Compute intercluster distances
– Find smallest distance
– Merge clusters with smallest distance
• Method produces stable results
– But not necessarily optimum
66
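A minimal, unoptimized Python sketch of the merge loop described above: every point starts in its own cluster, and the two clusters with the closest centroids are merged until one remains. The helper name merge_clusters and the sample points are illustrative only.

import numpy as np

def merge_clusters(points):
    """Repeatedly merge the two clusters whose centroids are closest;
    returns the sequence of merges."""
    clusters = [[i] for i in range(len(points))]   # one point per cluster
    merges = []
    while len(clusters) > 1:
        centroids = [points[c].mean(axis=0) for c in clusters]
        # Find the pair of clusters with the smallest centroid distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]    # merge j into i
        del clusters[j]
    return merges

# Illustrative 2-D observations.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]])
for a, b, d in merge_clusters(pts):
    print(f"merge {a} + {b}  (distance {d:.2f})")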
K-Means Clustering
• One of most popular methods
• Number of clusters is input parameter
• First randomly assign points to clusters
• Repeat until no change:
– Calculate center of each cluster: (x̄, ȳ)
– Assign each point to cluster with nearest center
67
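A minimal NumPy sketch of k-means as described above: random initial assignment, then alternate computing cluster centers and reassigning points until nothing changes. The function name k_means, the empty-cluster fallback, and the sample points are illustrative choices, not part of the slide.

import numpy as np

def k_means(points, k, seed=0):
    """Random initial assignment, then repeat (compute centers,
    reassign each point to the nearest center) until stable."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(points))
    while True:
        # Center of each cluster (random point if a cluster is empty).
        centers = np.array([
            points[assign == c].mean(axis=0) if np.any(assign == c)
            else points[rng.integers(len(points))]
            for c in range(k)
        ])
        # Reassign each point to the cluster with the nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            return assign, centers
        assign = new_assign

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [0.1, 0.3]])
labels, centers = k_means(pts, k=2)
print(labels)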
Interpreting Clusters
• Art, not science
• Drop small clusters (if little impact on
performance)
• Try to find meaningful characterizations
• Choose representative components
– Number proportional to cluster size or to
total resource demands
68
Drawbacks of Clustering
• Clustering is basically AI problem
• Humans will often see patterns where
computer sees none
• Result is extremely sensitive to:
– Choice of algorithm
– Parameters of algorithm
– Minor variations in points clustered
• Results may not have functional meaning
69