State of the Benchmarks
Daniel Bowers
daniel.bowers@gartner.com
@Daniel_Bowers
Agenda
• Server Benchmarks
• Benchmark Types
• Public Result Pools
• Why aren’t there more results?
• TPC and SPEC
• Vendor Benchmarks
• Virtualization Benchmarks
• Public and Open-Source Benchmarks
• “Good” and “Bad” Benchmarks
Server Benchmarks: A Definition
The level of performance for a given server configuration that vendors promise you can’t achieve
What People Do With Benchmark Results
• Purchase or deployment decisions
• Right-Sizing & Configuration Definition
• Capacity Planning / Consolidation
• Normalize server value for chargeback (see the sketch after this list)
• Performance baselining and monitoring
• Troubleshooting
• Setting hardware and software pricing
• Configuration advice
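For the chargeback item above, a minimal sketch of one common approach: treat a published benchmark score as each server's capacity proxy and allocate a shared cost in proportion to it. The server names, scores, and cost figure below are hypothetical, not published results.
```c
/* Minimal sketch: use a published benchmark score as a capacity proxy
 * to normalize server value for chargeback. Server names, scores, and
 * the cost figure are hypothetical, not published results. */
#include <stdio.h>

struct server {
    const char *name;
    double score;            /* e.g. a SPECint_rate-style throughput score */
};

int main(void) {
    struct server fleet[] = {
        { "db01",  480.0 },
        { "app01", 240.0 },
        { "web01", 120.0 },
    };
    const int n = sizeof fleet / sizeof fleet[0];
    const double monthly_cost = 10000.0;      /* shared cost to allocate */

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += fleet[i].score;

    /* Charge each server in proportion to its share of benchmarked
     * capacity, rather than per core or per MHz. */
    for (int i = 0; i < n; i++)
        printf("%-6s  share %5.1f%%   charge $%8.2f\n",
               fleet[i].name,
               100.0 * fleet[i].score / total,
               monthly_cost * fleet[i].score / total);
    return 0;
}
```
The same proportional allocation works with any per-server throughput metric; the point is to charge by measured capacity rather than by core count or clock speed.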
Result Pools
Using Benchmarks: Good to Great. Basically: Better than nothing.
If you size systems this way…                            Your results will be…
Nothing                                                   Catastrophic
Use published SPEC CPU2006 for everything                 Poor
Use someone else’s results on a standard benchmark        Marginal
Run standard benchmarks on your gear                      OK
Have someone else run your applications in their lab      OK
Run profiles of your workload on your gear                Good
Run your actual workloads on lab systems                  Really Good
Run your actual workloads on actual production systems    Great
(Chart spectrum: from “Bad” to “Smart, Hard-Working, Rich”.)
Why Use Published Results?
• They’re free
• They’re quick
• No skill required
• You don’t need to have the servers
• More accurate than MHz or MIPS
• They’re often audited (or at least scrutinized)
Examples of Benchmarks with Public Result Pools
• Consortia: SPEC, TPC
• Vendor application benchmarks: SAP, Oracle, Fujitsu OLTP2, VMware
• Vendor relative metrics: IBM rPerf
• Government: APP, ENERGY STAR
• Embedded: EEMBC CoreMark
• Desktop: SiSoftware Sandra, Geekbench
• Academic & HPC: LINPACK / HPL, Stream
• “Open”: DVDStore, BogoMIPS
• Purchasing metrics: Amazon ECU, Google GCE
Example Results Pools (number of published results per benchmark):
• Oracle R12 (Payroll Large 2-Tier): 8
• SPECjEnterprise2010: 28
• VMmark 2.x: 38
• TPC-E: 54
• TPC-H: 162
• TPC-C: 264
• SPECpower_ssj2008: 357
• SPECjbb2005: 669
• SAP R/3 SD 2-Tier: 673
• SPECint2000 (retired): 1100
• SPECint2006_rate: 6675
Benchmarks
“Dilbert”: December 25, 2004. Source: Dilbert.com.
The Gap
Only 19% of server configurations have published SPEC CPU2006 results
Why aren’t there more results?
• Vendors won’t publish losers
• Publishing is rarely required, and can be prohibited
• Can take lots of manpower & money
• Little incentive for end-users to publish results
• Benchmarks get broken or become stale
Why Aren’t There More Results?
“Benchmarks are chosen to show off a platform, not to allow comparisons”
- From an IBM presentation to CMG, 2008
Why Aren’t There More Results?
“A TPC-E measurement can take up to 3 months from preparation through the official acceptance by the TPC committee.”
- From a Fujitsu document on server performance
TPC
Non-profit industry consortium
• Creates benchmarks to compare database systems
• Membership is primarily server hardware, OS, database vendors
• 1997 - 53 members (including associates)
• Today: 21 members
• Disclosure: Gartner is an associate member
All benchmark results are audited by 3rd party & subject to challenge
• Full disclosure required (actually two disclosure documents); must include pricing
• Estimates and non-audited results not usually allowed
Produces specifications, not programs
Benchmark setups are large, usually dominated by storage
TPC benchmark timeline, 1990–2012:
• OLTP
  - TPC-A: Lightweight OLTP (retired)
  - TPC-B: Batch/Database Stress (retired)
  - TPC-C: OLTP, Product Supplier (active)
  - TPC-E: OLTP, Brokerage House (active)
• Decision Support (DSS)
  - TPC-D: Decision Support (retired)
  - TPC-R: Business Reporting (retired)
  - TPC-H: Ad Hoc Decision Support (active)
  - TPC-DS: Decision Support (new, 2012)
• Other
  - TPC-W: Web commerce transactional (retired)
  - TPC-APP: Application Server (retired)
  - TPC-Energy
  - TPC-VMS (new, 2012)
TPC-C
• Long history, tremendous mindshare
• Results, Estimates, and Predictions for “tmpC” are plentiful
• Allows comparisons across many generations
• OLTP Workload that’s old and weak
• Disparity between processor & I/O performance growth
• Storage costs dominate
• Server -> Storage IO path is bottleneck
• Quiz: Why don’t SSDs yield higher results?
• TPC has tried to replace it
Cost breakdown of example TPC-C result (DL385 G1 using Microsoft SQL2005):
• Storage: 68%
• Software: 13%
• Server hardware: 11%
• Client hardware: 5%
• Other / 3rd party: 2%
• Client software: 1%
Full report: http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=106032001
TPC-C: Example
IBM TPC-C Full Disclosure Report http://tpc.org/tpcc/results/tpcc_result_detail.asp?id=112041101
Microsoft & TPC-C
Microsoft: “TPC-E is far superior to TPC-C.”
- Charles Levine, Principal Program Manager, SQL Server Performance Engineering
Microsoft won’t approve TPC-C publications using SQL Server 2008 or later
Chart Source: Microsoft (http://blogs.technet.com/b/dataplatforminsider/archive/2012/08/28/tpc-e-raising-the-bar-in-oltp-performance.aspx )
Benchmarks
“Pepper…and Salt”: January 29, 2013. Source: Wall Street Journal.
TPC-E
• OLTP, like TPC-C
• More tables (33)
• More transaction types (~100), including more complex transactions
• Only results to date are on x86 with Microsoft SQL
• Trivia: Dataset based on NYSE company list and uses some US census data
Helpful Hint: Fujitsu OLTP2
• Results for all recent Xeon processor models
• Search for PDF files entitled “Fujitsu PRIMERGY Servers Performance Report”
TPC-H
Benchmark results are for specific database sizes (scales)
• TPC: Don’t compare different sizes (but my research says that’s OK)
Parts of the data set scale linearly with performance
• Some have become unrealistic: e.g. 50 billion customer records
Smaller database sizes are “broken” by in-memory, columnar databases
• Actian VectorWise results are about double the expected results
Benchmark appears to be fading away, but may see a surge of activity as Oracle & Microsoft add columnar support to their databases
Source: HP Whitepaper
TPC-DS
Decision Support database benchmark meant to replace TPC-H.
Released in mid-2012. No results to date. (No auditors either)
• Includes many more query types than TPC-H.
• Periodic database update process that more closely matches that of today’s databases.
• “Modern” : modified star schema with fact tables and dimension tables
Other TPC Benchmarks
TPC-Energy
• Optional add-on to other TPC benchmarks
TPC-VMS
• Just released
• Runs 3 other TPC benchmarks simultaneously on a single system under test
In Development:
• TPC-V
• TPC-ETL
• TPC-Express
Benchmarks
Why Pay Attention to BogoMIPS?
“To see whether your system is faster
than mine. Of course this is completely
wrong, unreliable, ill-founded, and utterly
useless, but all benchmarks suffer from
this same problem. So why not use it?”
- From the Linux BogoMIPS HOWTO
Standard Performance Evaluation Corporation (SPEC)
“The goal of SPEC is to ensure that the marketplace has a fair and useful set of metrics to differentiate candidate systems.”
• Sells source code including common ports
• Searchable results pool
115 members across four independent groups:
1) Open Systems Group (Mostly vendors)
2) Workstation group
3) Graphics group
4) Research group (Mostly academics)
Disclosure: Gartner is a member
Results generally require a standard-format report
- Lists intermediary results, optimizations used
- Price not included
- Estimates are allowed for most benchmarks
SPEC CPU2006
Measures CPU Integer and Floating Point capacity
• Often correlates with overall system performance because server designs typically balance memory, IO, and CPU
Actually 8 different metrics:
• Integer and Floating Point tests
• Speed and Rate tests
• Base and Peak results
~25 different individual workloads, from games to quantum computing (see the sketch after this list)
Changes versions every 6-8 years
• CPU92, CPU2000, CPU2006
• CPUv6 currently under development
• Results not comparable between versions
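As background for the metrics above: a SPEC CPU composite score is the geometric mean of per-workload ratios measured against a reference machine. A minimal sketch of that arithmetic, using hypothetical ratios rather than any published result:
```c
/* Minimal sketch of how a SPEC CPU-style composite score is formed:
 * the geometric mean of per-workload ratios (reference time divided by
 * measured time). The ratios below are hypothetical, not published. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double ratios[] = { 31.2, 18.7, 44.0, 25.5, 29.9, 21.3 };   /* one per workload */
    const int n = sizeof ratios / sizeof ratios[0];

    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratios[i]);            /* sum the logs ...              */

    double composite = exp(log_sum / n);      /* ... to get the geometric mean */
    printf("composite score: %.1f\n", composite);
    return 0;
}
```
Because the composite is a geometric mean, one unusually strong workload cannot dominate the score the way it would in an arithmetic mean.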
SPEC CPU2006
• Depends almost entirely on CPU model, core count, clock
speed
• Some impact from Compiler (e.g. +15%)
• Small impact from OS, cache
• Floating Point impacted by memory speed
• “Turbo Mode” frequency correlation
• Benchmarked configurations must be “offered”
• Published results are peer reviewed (by ‘competitors’)
• Reviewers are PICKY!
SPEC CPU2006
“Benchmark results are usually for:
• Sales and marketing
• Customer awareness
• Customer confidence”
- Fujitsu presentation
SPEC jbb2005
Server-side Java benchmark
Heavily dependent on JVM
• Also highly dependent on processor speed, core count, Java garbage collection
• Performance plateaus beyond a certain amount of cache and memory
• Disk and network I/O play no part
Emulates a 3-tier system on a single host (see the sketch after this list)
- Database tier is emulated, memory-resident
Useful Tidbits:
- Cheap & easy to run, so lots of results
- Measures transactions per second, with transactions similar to TPC-C
- Full report includes performance beyond peak
- Being replaced (SPECjbb2013)
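To illustrate the “emulated, memory-resident database tier” idea noted above, here is a toy sketch, not the actual SPECjbb2005 code: it runs trivial transactions against an in-memory table and reports transactions per second. The record count and run length are arbitrary.
```c
/* Toy sketch (not the actual SPECjbb2005 workload) of an emulated,
 * memory-resident "database tier": simple business transactions run
 * against in-memory records, and throughput is reported as
 * transactions per second. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RECORDS 100000

struct record { long id; double balance; };

int main(void) {
    struct record *db = malloc(RECORDS * sizeof *db);
    if (!db) return 1;
    for (long i = 0; i < RECORDS; i++) {
        db[i].id = i;
        db[i].balance = 100.0;
    }

    long transactions = 0;
    clock_t start = clock();
    double elapsed = 0.0;

    /* Run simple "order"-style updates for roughly 2 seconds of CPU time. */
    while (elapsed < 2.0) {
        long cust = rand() % RECORDS;        /* pick a customer record      */
        db[cust].balance += 1.0;             /* update it in memory         */
        transactions++;
        if ((transactions & 0xFFFF) == 0)    /* check the clock occasionally */
            elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    printf("%.0f transactions/second\n", transactions / elapsed);
    free(db);
    return 0;
}
```
Because everything stays in memory, disk and network I/O play no part, which is why results track JVM and CPU behavior so closely.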
SPEC jbb2013
Released last month!
Scales more realistically than SPECjbb2005
• Includes inter-JVM communication
Includes a response-time requirement & reporting, in addition to “operations per second”
Like jbb2005, a key design goal was making it easy to run.
SPEC jEnterprise2010
Java server benchmark designed to test whole-system Java EE performance
• Includes database and storage
• System Under Test can be more than 1 server
Harder to set up and run vs. SPECjbb2005, so fewer results
Top 15 results (as of 1 Feb 2013):
#   Product                   Processor     Chips  Application Server    Database    OS       EjOPS
1   Oracle - SPARC T4-4       SPARC T4      16     Oracle WebLogic 10.X  Oracle 11g  Solaris  40,104.86
2   Oracle - Sun Server X2-8  Xeon E7-8870  8      WebLogic 12.x         Oracle 11g  Linux    27,150.05
3   Oracle - Sun Server X2-8  Xeon E7-8870  8      WebLogic 12.x         Oracle 11g  Linux    27,150.05
4   Cisco - UCS B440 M2       Xeon E7-4870  8      Oracle WebLogic 10.X  Oracle 11g  Linux    26,118.67
5   Cisco - UCS B440 M1       Xeon X7560    8      Oracle WebLogic 10.X  Oracle 11g  Linux    17,301.86
6   IBM - Power 780 (MHB)     POWER7        8      WebSphere V7          DB2 9.7     AIX      16,646.34
7   IBM - Power 780 (MHB)     POWER7        8      WebSphere V7          DB2 9.7     AIX      15,885.09
8   IBM - BladeCenter HS22    Xeon X5670    12     WebSphere V7          DB2 9.7     Linux    15,829.86
9   Dell - PowerEdge R910     Xeon E7-4870  4      Oracle WebLogic 10.X  Oracle 11g  Linux    11,946.60
10  IBM - Power 780 (MHD)     POWER7+       4      WebSphere V8          DB2 10.1    AIX      10,902.30
11  IBM - System x3650 M4     Xeon E5-2690  2      WebSphere V8          DB2 10.1    Linux    9,696.43
12  Oracle - SPARC T3-4       SPARC T3      4      Oracle WebLogic 10.X  Oracle 11g  Solaris  9,456.28
13  Oracle - Sun Server X3-2  Xeon E5-2690  2      Oracle WebLogic 10.X  Oracle 11g  Linux    8,310.19
14  Oracle - Sun Server X3-2  Xeon E5-2690  2      Oracle WebLogic 10.X  Oracle 11g  Linux    8,310.19
15  IBM - BladeCenter HS22    Xeon X5570    16     WebSphere V7          DB2 9.7     Linux    7,903.16
Other SPEC benchmarks
Power: SPECpower_ssj2008 (see the sketch after this list)
HPC Benchmarks: SPEC MPI, SPEC OMP
File System: SPECsfs2008
Messaging: SPECsip
SPECweb2009, SPECweb2005
SPECmail2009
SPECCloud
Handheld working group
SPEC also has a research group that creates benchmarks for research & development purposes
Vendor-Sponsored Application Benchmarks
SAP
• Various, but SD 2-Tier is most popular
• Results published on x86 due to support requirements
• Correlates with clock, cores, OS, database.
• Plateaus on relatively low memory
• Pre-2009 results not comparable to current results
• Used for SAP “QuickSizer” system-sizing tool
Oracle
• Official : EBS benchmarks, Siebel & Peoplesoft benchmarks, etc.
• Good: Workload-specific.
• Bad: Seeing fewer results than in the past.
Microsoft
• Fast-Track system benchmarks: MCR/BCR
Oracle Benchmarks
Benchmark  | Workload                                                                                               | # Results | Architectures
Oracle EBS | Oracle Applications Release 12 (12.1.3), Single Instance R12 Batch, Order-to-Cash Large, 2-Tier        | 1  | 1
Oracle EBS | Oracle Applications Release 12 (12.1.3), Single Instance R12 Batch, Payroll Large/Extra-Large, 2-Tier  | 2  | 1
Oracle EBS | Oracle Applications Release 12 (12.1.2), Single Instance R12 OLTP, OLTP X-Large, 3-Tier                | 3  | 2
Oracle EBS | Oracle Applications Release 12 (12.1.2), Single Instance R12 Batch, Order-to-Cash Large, 2-Tier        | 3  | 2
Oracle EBS | Oracle Applications Release 12 (12.1.2), Single Instance R12 Batch, Payroll Large/Extra-Large, 2-Tier  | 5  | 2
Oracle EBS | Oracle Applications Release 12 (RUP 4), Single Instance R12 Batch, Order-to-Cash Medium, 3-Tier        | 1  | 0
Oracle EBS | Oracle Applications Release 12 (RUP 4), Single Instance R12 Batch, Order-to-Cash Medium, 2-Tier        | 6  | 1
Oracle EBS | Oracle Applications Release 12 (RUP 4), Single Instance R12 Batch, Payroll Large/Extra-Large, 2-Tier   | 4  | 0
Oracle EBS | Oracle Applications Release 12 (RUP 4), Single Instance R12 Batch, Payroll Medium, 3-Tier              | 2  | 0
Oracle EBS | Oracle Applications Release 12 (RUP 4), Single Instance R12 Batch, Payroll Medium, 2-Tier              | 7  | 1
Oracle EBS | Oracle Applications Release 11i (11.5.10), Medium Configuration - RAC                                  | 6  | 0
Oracle EBS | Oracle Applications Release 11i (11.5.10), Medium Configuration - Single DB (Database Tier)            | 33 | 0
Oracle EBS | Oracle Applications Release 11i (11.5.10), Small Configuration - Single DB (Database Tier)             | 42 | 0
Oracle EBS | Oracle Applications Release 11i (11.5.10), Small Configuration - RAC (Database Tier)                   | 3  | 0
Oracle EBS | Real Application Clusters (11.5.9)                                                                      | 2  | 0
Peoplesoft | Human Capital Management, Payroll for North America 9.x, Batch                                         | 12 | 2
Peoplesoft | Financial Management, Financials 9.x day-in-the-life, Online                                            | 3  | 2
Siebel     | CRM Release 8.0                                                                                         | 15 | 0
Siebel     | CRM Release 8.1.x                                                                                       | 3  | 2
Benchmarks
“Dilbert”: March 02, 2009. Source: Dilbert.com.
Virtualization Benchmarks
VMmark
• Includes both application and “infrastructure” workload
• DVDStore, OLIO, Exchange 2007
• Idle machine
• vMotion, storage vMotion
• Based on concept of “tiles”; each tile = 8 workloads
• VMware (and therefore x86) only
• CPU is the bottleneck, not memory
• With the same CPU, results from different vendors almost identical
• vSphere license contains DeWitt Clause
SPECvirt
• Uses 3 other SPEC benchmarks as its workloads:
• SPECweb2005
• SPECjAppServer2004
• SPECmail2008
• Uses similar “tiles” concept to VMmark
• Just vSphere, KVM, Xen results
Other Consortia Benchmarks
Government “Benchmarks”
ENERGY STAR
• Sponsored by US EPA
• Rewards servers that achieve “best in class” power efficiency targets
• Version 1 and upcoming Version 2 disqualify some server categories
APP
• Calculated number used by the US government for export control purposes
• Similar to MIPS
Some commercial benchmark software
Server-oriented
Quest (Dell) Benchmark Factory
Desktop-oriented
SiSoftware Sandra*
Primate Labs Geekbench*
SysMark
Phoronix Test Suite*
Maxon Cinebench
* Include public repositories of user-submitted results
Repositories
CloudHarmony (Cloud instances)
Tools with metrics
BMC Capacity Optimization
Computer Associates Hyperformix
VMware Capacity Planner
Popular Open Source or Public Domain Benchmarks
STREAM
• Simple memory bandwidth test (see the sketch at the end of this list)
• Gets close to server theoretical maximums
LINPACK / HPL
• Floating-point tests used to compare HPC & supercomputer
performance
• “Results should not be taken too seriously”
Other examples
• PRIME95
• Terasort
• DVDStore
• ApacheBench
• OLIO
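As a sketch of what a STREAM-style memory bandwidth test does (this is not the official STREAM source; the array size, repeat count, and single-threaded timing are arbitrary illustration choices):
```c
/* Minimal sketch of a STREAM-style "triad" memory bandwidth test.
 * Not the official STREAM code; sizes and repeat count are arbitrary,
 * and a real run must defeat compiler dead-code elimination. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 25)      /* ~33M doubles per array, ~268 MB per array */
#define REPS 10

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];        /* triad: a = b + scalar*c */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Three arrays of N doubles move per pass: two reads plus one write. */
    double gbytes = (double)REPS * 3.0 * N * sizeof(double) / 1e9;
    printf("Triad bandwidth: %.1f GB/s (single-threaded)\n", gbytes / secs);

    printf("check: %f\n", a[N / 2]);         /* touch the result */
    free(a); free(b); free(c);
    return 0;
}
```
A single thread rarely saturates a server's memory controllers, which is why the real STREAM benchmark is normally run with one thread per core.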
Vendor attitudes towards benchmarks
Source: http://online.asbis.sk/documents/ORACLE/dokumenty/Kubacek_Dalibor_OS_Positioning.pdf
Benchmarks We Lack
• Converged Systems
• Public Cloud versus On-Premises
• “Microserver” Power
• 3rd Party Mainframe
My advice for using other people’s benchmark results
• Only use when you’re lazy and poor
- Full Disclosure: I am lazy and poor
• Ask vendors for non-published results
• Ignore differences < 10%
• For big servers, don’t dividing results by the
number of cores
• If you’re going to just use SPEC CPU2006…
- Use SPEC CPU Integer Rate base results
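A minimal sketch of the “ignore differences < 10%” rule above, using hypothetical scores; the 10% threshold is the only tunable:
```c
/* Minimal sketch of the "ignore small differences" rule: treat two
 * published scores as equivalent unless they differ by more than a
 * noise threshold. The scores below are hypothetical. */
#include <stdio.h>

/* Returns 0 if the difference is within the noise threshold,
 * +1 if a is meaningfully higher, -1 if meaningfully lower. */
static int compare_scores(double a, double b, double noise) {
    double larger  = a > b ? a : b;
    double smaller = a < b ? a : b;
    if ((larger - smaller) / smaller <= noise)
        return 0;                       /* within noise: call it a tie */
    return a > b ? 1 : -1;
}

int main(void) {
    double server_x = 652.0, server_y = 688.0;   /* hypothetical rate-style scores */
    int verdict = compare_scores(server_x, server_y, 0.10);

    if (verdict == 0)
        printf("%.0f vs %.0f: within 10%%, treat as equivalent\n", server_x, server_y);
    else
        printf("%.0f vs %.0f: %s is meaningfully faster\n",
               server_x, server_y, verdict > 0 ? "server_x" : "server_y");
    return 0;
}
```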
Questions?