Applying Benchmark Data To A Relative Server Capacity Model

CMG 2013
Joseph Temple, LC-NS Consulting
John J Thomas, IBM
Relative Server Capacity


“How do I compare machine capacity?”
“What platform is best fit to deliver a given workload?”

Simple enough questions, but difficult to answer!

Establishing server capacity is complex:
• Different platform design points
• Different machine architectures
• Continuously evolving platform generations

“Standard” benchmarks (SPECint, TPC-C, etc.) and composite metrics (RPE2, QPI, etc.) help, but may not be sufficient:
• Some platforms do not support these metrics
• They may not be sufficient to decide best fit for a given workload

We need a model to address Relative Server Capacity.
• See “Alternative Metrics for Server RFPs” [J. Temple]
[Figure: decision factors for platform selection (Workload Fit, Non-Functional Requirements, Local Factors/Constraints, Cost Models, Strategic Direction, Technology Adoption, Reference Architectures) weighed across System z, Power, and System x.]
Fit for Purpose Workload Types

Mixed Workload – Type 1
• Scales up
• Updates to shared data and work queues
• Complex virtualization
• Business intelligence with heavy data sharing and ad hoc queries

Highly Threaded – Type 2
• Scales well on large SMP
• Web application servers
• Single instance of an ERP system
• Some partitioned databases

Parallel Data Structures – Type 3
• Scales well on clusters
• XML parsing
• Business intelligence with structured queries
• HPC applications

Small Discrete – Type 4
• Limited scaling needs
• HTTP servers
• File and print
• FTP servers
• Small end user apps

The original chart annotates each type along seven dimensions (application, function, data structure, usage pattern, SLA, integration, scale); black labels are design factors, blue labels are local factors.
Fitness Parameters in Machine Design

[Figure: bubble chart of fitness parameters, plotting Thread Speed (y-axis, 0 to 1.2) against Cache/Thread (x-axis, 0 to 1.2), with thread count as bubble size, for Sandy Bridge, SB ST, HE P7+ SMT4, HE P7+ SMT2, HE P7+ ST, and zEC12.]

Can be customized to machines of interest; need to know the specific comparisons desired.
These parameters were chosen to represent the ability to handle parallel, serial, and bulk data traffic. This is based on Greg Pfister’s work on workload characterization in In Search of Clusters.
Key Aspects Of The Theoretical Model

Throughput (TP)
• Common concept: units of work done / units of time elapsed
• Theoretical model defines TP as a function of thread speed:
  TP = Thread Speed x Threads
  − Thread speed is calculated as clock rate x threading multiplier / threads per core. The threading multiplier is the increase in throughput due to multiple threads per core.
  − To sustain this rate, the load has to keep all threads of the machine busy.

Thread Capacity (TC)
• Throughput (TP) gives us an idea of the instantaneous peak throughput rate.
  − In the world of dedicated systems, TP is the parameter of interest because it tells us the peak load the machine can handle without causing queues to form.
• However, in the world of virtualized/consolidated workloads, we are stacking multiple workloads on the threads of the machine.
  − Thread capacity is an estimator of how deep these stacks can be.
• Theoretical model defines TC as:
  TC = Thread Speed x Cache per Thread
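As a concrete illustration, here is a minimal Python sketch of these two formulas. The machine figures used below are hypothetical placeholders, not measured or vendor-published values.

```python
# Minimal sketch of the TP/TC formulas above.
# All machine figures below are hypothetical placeholders.

def thread_speed(clock_ghz, threading_multiplier, threads_per_core):
    """Thread Speed = clock rate x threading multiplier / threads per core."""
    return clock_ghz * threading_multiplier / threads_per_core

def throughput(speed, threads):
    """TP = Thread Speed x Threads."""
    return speed * threads

def thread_capacity(speed, cache_per_thread_mb):
    """TC = Thread Speed x Cache per Thread."""
    return speed * cache_per_thread_mb

# Hypothetical 16-core machine: 3.6 GHz, SMT-4, threading multiplier ~2.
cores, smt = 16, 4
speed = thread_speed(3.6, 2.0, smt)   # effective speed per thread
tp = throughput(speed, cores * smt)   # instantaneous peak throughput proxy
tc = thread_capacity(speed, 1.5)      # assuming 1.5 MB of cache per thread
print(f"thread speed={speed:.2f}, TP={tp:.1f}, TC={tc:.2f}")
```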
Throughput, Saturation, Capacity

[Figure: throughput/saturation/capacity curve. TP is determined by pure parallel CPU; measured ITR falls below TP due to other resources and serialization; ETR is determined by load and response time.]
Single Dimension Metrics Do Not Reflect True Capacity

Common metrics:
• ITR tracks TP, and ETR tracks ITR.
• Under this view, Power is advantaged and z is not price competitive.

Consolidation:
• ETR << ITR unless loads are consolidated.
• Consolidation accumulates working sets, so Power and z are advantaged.
• Cache can also mitigate “saturation.”

The “standard metrics” do not leverage cache. This leads to the pure ITR view of relative capacity shown in the original chart.
Bridging Two Worlds - I

There appears to be a disconnect between “common benchmark metrics” and “theoretical model metrics” like TP.

Does this mean metrics like TP are invalid? No.
• We see the effect of TP/TC in real-world deployments.
  − A machine performs either better or worse than what a common benchmark metric would have suggested.

Does this mean benchmark metrics are useless? No.
• They provide valuable data points.

A better approach would be to try to bridge these two worlds in a meaningful way.
Bridging Two Worlds - II

Theoretical model calculates TP and TC using estimated values for thread speed:
• Based on machine specifications

Example: TP calculation for POWER7
• A key factor in the TP calculation is thread speed, which in turn depends on the value of the thread multiplier.
  − We estimated the thread multiplier for POWER7 in SMT-4 mode was 2, but this factor is only an estimate.
• Using an estimate for thread speed assumes common path length and linear scaling.
• An inherent problem here: these estimates are not measured or specified using any common metric across platforms.
  − As an example, should the thread multiplier be the same for POWER7 in SMT-2 mode as for Intel running with Hyper-Threading?

Recommendation: Refine factors in the theoretical model with benchmark results.
• Instead of using theoretical values for thread speed, path length, etc., plug in benchmark observations.
Two Common Categories Of Benchmarks

Stress tests
• Measure raw throughput.
  − Measure the maximum throughput that can be driven through a system, focusing all system resources on this particular task.

VM density tests
• Measure the consolidation ratios (VM density) that can be achieved on a platform.
• Usually do not try to maximize throughput of a system.
  − They usually look at how multiple workloads can be stacked efficiently to share the resources on a system, while delivering steady throughput.

Adjusting thread speed affects both TP and TC.
Example of a Stress Test, A Misleading One If Used In Isolation!

TradeLite workload: an online trading WAS ND workload driven as a stress test.
• 2ch/16co Intel 2.7GHz blade: peak ITR 3467 tps
• Linux on System z, 16 IFLs: peak ITR 3984 tps

This benchmark result is quite misleading: it suggests a z core yields only 15% better ITR (3984/3467 ≈ 1.15). But we know that z has much higher “capacity.”
• What is wrong here?
• System z’s design point is to run multiple workloads together, not a single atomic application under stress.
• This particular application doesn’t seem to leverage many of z’s capabilities (cache, I/O, etc.).
• Can this benchmark result be used to compare capacity?
Use Benchmark Data To Refine Relative Capacity Model

Calculate effective thread speed from measured values:
• What is the benchmarked thread speed?
• Normalizing thread speed and clock to a platform allows us to calculate path length for a given platform.
• This in turn allows us to calculate effective thread speed.

Doing this affects both TP and TC.

Plug effective thread speed values into the relative capacity calculation model.
Use Benchmark Data To Refine Relative Capacity Model: Results

The calculation chains together:
• Benchmarked thread speed = ITR / threads
• Path length, from the clock ratio / thread-speed ratio (normalized to a reference platform)
• Relative capacity = effective thread speed x total threads x cache/thread

In this case, System z ends up with a 13.5x relative capacity factor relative to Intel.
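A hedged Python sketch of that chain follows. The ITR and clock numbers reuse the stress-test example; the thread counts and the cache-per-thread ratio are assumptions for illustration only, so the printed capacity factor will not match the deck’s 13.5x. Only the sequence of steps is taken from the deck.

```python
# Sketch of the refinement chain. ITR and clock figures reuse the
# stress-test example; the thread counts and the cache-per-thread
# ratio are assumptions for illustration only.

REF = {"itr": 3467.0, "threads": 32, "clock": 2.7}  # Intel blade, HT assumed on
SYS = {"itr": 3984.0, "threads": 16, "clock": 5.5}  # zEC12, single-threaded IFLs

# Step 1: benchmarked thread speed = ITR / threads.
ref_ts = REF["itr"] / REF["threads"]
sys_ts = SYS["itr"] / SYS["threads"]

# Step 2: normalize clock and thread speed to the reference platform;
# the residual between the two ratios is the relative path length.
clock_ratio = SYS["clock"] / REF["clock"]
speed_ratio = sys_ts / ref_ts
path_length = clock_ratio / speed_ratio  # >1 means a longer path than reference

# Step 3: effective thread speed, normalized to the reference; with
# these inputs it reduces to the measured thread-speed ratio.
eff_speed = clock_ratio / path_length

# Step 4: relative capacity = effective thread speed x threads x cache/thread.
cache_ratio = 8.0  # hypothetical cache-per-thread advantage vs. reference
rel_capacity = eff_speed * (SYS["threads"] / REF["threads"]) * cache_ratio
print(f"path length={path_length:.2f}x, relative capacity={rel_capacity:.1f}x")
```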
Example of a VM Density Test: Consolidating Standalone VMs With Light CPU Requirements

Light workloads: online banking WAS ND workloads, each driving 22 transactions per second with light I/O.
• 48 VMs per IPAS Intel blade (common x86 hypervisor, 2ch/16co Intel 2.7GHz blade)
• 68 VMs per IPAS POWER7+ blade (PowerVM, 2ch/16co POWER7+ 3.6GHz blade)
• 100 VMs per 16-way z/VM (z/VM on zEC12, 16 IFLs)

Consolidation ratios derived from IBM internal studies. Results will vary based on workload profiles/characteristics.
Use Benchmark Data To Refine Relative Capacity Model: Results

Follow a similar exercise to calculate effective thread speed:
• Each VM is driving a certain fixed throughput.
  − This test used a constant injection rate.
  − If throughput varies (for example, holding a constant think time), you need to adjust for that.
• Calculate benchmarked thread speed.
• Normalize to a platform to get path length.
• Calculate effective thread speed.
• Plug into the relative server capacity calculation.

In this case, System z ends up with a 22.2x relative capacity factor relative to Intel.
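Since each VM drives a fixed 22 tps, the benchmarked thread speed here is just the aggregate sustained rate divided by the threads in the pool. A small sketch, using the VM counts from the previous slide; the threads-per-core values (2-way Hyper-Threading on Intel, SMT-4 on POWER7+, single-threaded IFLs on zEC12) are assumptions, not figures published in the deck:

```python
# Per-thread sustained throughput under a constant injection rate.
# VM counts and the 22 tps/VM rate come from the previous slide;
# threads-per-core values are assumptions, not published figures.

TPS_PER_VM = 22.0

platforms = {
    # name: (VMs consolidated, cores, assumed threads per core)
    "Intel 2.7GHz blade":   (48, 16, 2),
    "POWER7+ 3.6GHz blade": (68, 16, 4),
    "zEC12, 16 IFLs":       (100, 16, 1),
}

for name, (vms, cores, smt) in platforms.items():
    threads = cores * smt
    aggregate_tps = vms * TPS_PER_VM      # fixed by the injection rate
    per_thread = aggregate_tps / threads  # benchmarked thread speed
    print(f"{name:22s}: {aggregate_tps:6.0f} tps over {threads:2d} threads"
          f" -> {per_thread:5.1f} tps/thread")
```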
Math Behind Consolidation

Roger’s equation:
  U_avg = 1 / (1 + HR_avg)
where
  HR_avg = k x c / sqrt(N)

For consolidation, N is the number of loads (VMs), k is a design parameter (service level), and c is the variability of the initial load.
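A quick numeric check of this equation, using c = 2.5 from the slides that follow and k = 2 (an assumption consistent with the stated 97.7% service level, roughly two standard deviations), reproduces the peak-to-average ratios on the next four slides:

```python
# Roger's equation: U_avg = 1/(1 + HR_avg), with HR_avg = k*c/sqrt(N).
# c = 2.5 comes from the deck; k = 2 is an assumption consistent with
# the stated 97.7% SLA (about two standard deviations of headroom).
from math import sqrt

K, C = 2.0, 2.5

for n in (1, 4, 16, 144):
    hr = K * C / sqrt(n)  # headroom ratio shrinks as 1/sqrt(N)
    u = 1.0 / (1.0 + hr)  # achievable average utilization
    print(f"N={n:3d}: utilization={u:6.1%}, peak-to-average={1/u:.2f}x")

# Output matches the following slides:
# ~17%/6x, ~28%/3.5x, ~44%/2.25x, ~70%/1.42x.
```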
Larger Servers With More Resources Make More Effective Consolidation Platforms

• Most workloads experience variance in demand.
• When you consolidate workloads with variance on a virtualized server, the variance of the sum is less (statistical multiplexing); a small simulation sketch follows this list.
• The more workloads you can consolidate, the smaller the variance of the sum.
• Consequently, bigger servers with the capacity to run more workloads can be driven to higher average utilization levels without violating service level agreements, thereby reducing the cost per workload.
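The simulation below illustrates the statistical-multiplexing effect; the gamma demand model and its parameters are arbitrary assumptions, chosen only to match the deck’s coefficient of variation of 2.5.

```python
# Statistical multiplexing: the coefficient of variation (CoV) of the
# sum of n independent workloads falls as 1/sqrt(n). The gamma demand
# model is an arbitrary assumption for this demo.
import numpy as np

rng = np.random.default_rng(42)
SAMPLES = 100_000
MEAN, COV = 10.0, 2.5    # per-workload mean demand and CoV

shape = 1.0 / COV**2     # gamma shape giving the desired CoV
scale = MEAN / shape     # gamma scale giving the desired mean

for n in (1, 4, 16, 144):
    demand = rng.gamma(shape, scale, size=(SAMPLES, n)).sum(axis=1)
    cov_of_sum = demand.std() / demand.mean()
    print(f"n={n:3d}: CoV of aggregate demand = {cov_of_sum:.2f}"
          f" (theory: {COV / np.sqrt(n):.2f})")
```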
A Single Workload Requires a Machine Capacity Of 6x the Average Demand

• Average demand: m = 10/sec
• Server capacity required: 60/sec (6x peak-to-average)
• Server utilization = 17%

Assumes coefficient of variation = 2.5, required to meet a 97.7% SLA.
Consolidation Of 4 Workloads Requires Server Capacity Of 3.5x Average Demand

• Average demand: 4*m = 40/sec
• Server capacity required: 140/sec (3.5x peak-to-average)
• Server utilization = 28%
Consolidation Of 16 Workloads Requires Server Capacity Of 2.25x Average Demand

• Average demand: 16*m = 160/sec
• Server capacity required: 360/sec (2.25x peak-to-average)
• Server utilization = 44%

Assumes coefficient of variation = 2.5, required to meet a 97.7% SLA.
Consolidation Of 144 Workloads Requires Server Capacity Of 1.42x Average Demand

• Average demand: 144*m = 1440/sec
• Server capacity required: 2045/sec (1.42x peak-to-average)
• Server utilization = 70%

Assumes coefficient of variation = 2.5, required to meet a 97.7% SLA.
Let’s Look At Actual Customer Data

• Large US insurance company
• 13 production POWER7 frames
  − Some large servers, some small servers
• Detailed CPU utilization data
  − 30-minute intervals, one whole week
  − For each LPAR on the frame
  − For each frame in the data center
• Measure peak, average, variance
Detailed Data Example: One Frame

[Figure: one week of utilization data, 12/9 0:00 through 12/16 0:00, at 30-minute samples. Top chart: CPU % (0 to 100) for frame PAF5PDC. Bottom chart: cores in use (0 to 12) for LPAR MSP159, with series for All and for Guidewire.]
Customer Data Confirms Theory

[Figure: scatter of workloads vs. peak-to-average ratio, with the final theoretical model overlaid; peak-to-average ratio (y-axis, 0 to 8) against LPAR count (x-axis, 0 to 60).]

Servers with more workloads have less variance in their utilization and lower headroom requirements.
Consolidation Observations

• There is a benefit to large-scale servers.
  − The headroom required to accommodate variability goes up only by sqrt(n) when n workloads are pooled.
  − The larger the shared processor pool, the more statistical benefit you get.
  − Large-scale virtualization platforms are able to consolidate large numbers of virtual machines because of this.
  − Servers with capacity to run more workloads can be driven to higher average utilization levels without violating service level agreements.
Summary

• We need a theoretical model for relative server capacity comparisons.
  − Purely theoretical models need to be grounded in reality.
  − Atomic benchmarks can sometimes be quite misleading in terms of overall system capability.
  − Refine theoretical models with benchmark measurements.
  − Real-world (customer) data trumps everything! It validates or negates models; customer data validates the sqrt(n) model for consolidation.