Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
Network servers
Throughput matters
Hardware intensive
Simultaneous Multithreading (SMT)
Processor support for high throughput
Simulated since mid-90s
Now: Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5 available
SIGMETRICS’05 http://www.cs.princeton.edu/~yruan 2
Simultaneous execution of multiple jobs
Higher utilization of functional unit cycles
[Figure: pipeline diagrams showing Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 interleaved on an SMT processor; colored blocks are functional units currently in use, x-axis is cycles (direction of data flow)]
Appear as multiprocessors to the OS and applications
Duplicated resources: architectural state (registers) #1 and #2
Shared resources: pipeline execution units, cache hierarchy, system bus, main memory
Detailed analysis of multiple real hardware platforms and server packages
Includes previously ignored OS overheads
Micro-architectural performance analysis
Demonstrates dominance of memory hierarchy
Comparison with simulation studies
Explain why SMT provides relatively small benefits on real hardware
Overly-aggressive memory simulation yielded higher expected benefits
Background
Measurement methodology
Throughput & improvement
Micro-architectural performance
Discussion
Metrics
Server throughput
Throughput improvements (relative speedups)
Architectural features (CPI, miss ratio, etc.)
Multiple configurations
Hardware platforms (clock speed, cache, etc.)
Server software (Apache, Flash, TUX, etc.)
Kernel configuration (uniprocessor and multiprocessor)
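These metrics can be sketched as a small calculation; the functions and the throughput/cycle counts below are hypothetical illustrations, not measured numbers from the study:

```python
# Sketch of the two headline metrics: relative speedup between two
# configurations, and cycles-per-instruction (CPI) from raw counts.
# All inputs here are made-up illustration values.

def relative_speedup(throughput_new, throughput_base):
    """Relative throughput improvement in percent (e.g. 2T vs. 1P-MP)."""
    return (throughput_new / throughput_base - 1.0) * 100.0

def cpi(cycles, instructions):
    """Cycles per instruction; higher means more stalls per unit of work."""
    return cycles / instructions

print(round(relative_speedup(1150.0, 1000.0), 1))  # 15.0
print(round(cpi(3.0e9, 1.2e9), 2))                 # 2.5
```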
Three models of Xeon processors
Clock rate: 2.0 GHz / 3.06 GHz / 3.06 GHz with 1 MB L3
Memory latency: 220 ~ 350 cycles
L1/L2 cache sizes, main memory, buses, and # threads/processor are the same
5 Web server packages
Apache-MP: multi-process
Apache-MT: multi-thread
Flash: event-driven
TUX: in-kernel
Haboob: Java server, staged multi-thread model
Benchmark
SPECweb96 and SPECweb99
5 configuration labels, combining # CPUs, SMT on/off, and kernel type (T = # threads, P = # processors):
1P-UP: 1 CPU, SMT off, uniprocessor kernel
1P-MP: 1 CPU, SMT off, multiprocessor kernel
2T: 1 CPU, SMT on, multiprocessor kernel
2P: 2 CPUs, SMT off, multiprocessor kernel
4T: 2 CPUs, SMT on, multiprocessor kernel
Background
Measurement methodology
Throughput & improvement
Single processor
Dual-processor
Micro-architectural performance
Discussion
[Chart: throughput (0-1200) for Apache-MP at 3.06 GHz across single-processor (1P-UP, 1P-MP, 2T w/ SMT) and dual-processor (2P, 4T w/ SMT) configurations; comparisons: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P]
Improvement on Single Processor
2T: 2 threads, multiprocessor kernel
1P-MP: 1 thread, multiprocessor kernel
[Chart: 2T vs. 1P-MP throughput improvement (-10% to 40%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3]
Improvement on Single Processor
2T: 2 threads, multiprocessor kernel
1P-UP: 1 thread, uniprocessor kernel
[Chart: 2T vs. 1P-UP throughput improvement (-10% to 40%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3; the gap relative to the 2T vs. 1P-MP comparison reflects kernel overhead]
4T: 4 threads (2 processors, 2 threads/processor)
2P: 2 physical processors (SMT disabled)
2.0 GHz & 3.06 GHz with L3 are better; memory is still the bottleneck
[Chart: 4T vs. 2P throughput improvement (-20% to 40%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3]
Use OProfile
In-house patch to measure extra events
About 25 performance events
Cache miss/hit
TLB miss/hit
Branches
Pipeline stall, clear, etc.
Bus utilization
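As a sketch of the post-processing step, hypothetical event counts (not real OProfile output) can be turned into the ratios used in the following slides:

```python
# Sketch: derive miss ratios and CPI from raw hardware event counts,
# as one would after collecting them with OProfile. Counts are invented.

events = {
    "l2_references": 50_000_000,
    "l2_misses": 2_000_000,
    "instructions": 800_000_000,
    "cycles": 2_400_000_000,
}

l2_miss_ratio = events["l2_misses"] / events["l2_references"]
cpi = events["cycles"] / events["instructions"]

print(f"L2 miss ratio: {l2_miss_ratio:.1%}")  # L2 miss ratio: 4.0%
print(f"CPI: {cpi:.2f}")                      # CPI: 3.00
```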
[Chart: rates (0%-20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T(SMT), 2P, and 4T(SMT)]
Instruction & data unified
Lower rate in SMT due to higher L1 misses
[Chart: miss rates (0%-10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T(SMT), 2P, and 4T(SMT)]
[Chart: CPI breakdown (0-16 cycles) for Apache-MP under 1P-UP, 1P-MP, 2T, 2P, and 4T; stacked components: work, L1 miss, L2 miss, DTLB, ITLB, branch, clear, buffer, and others]
L1/L2 miss penalty dominates
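One way to see the dominance is an additive stall model, CPI ≈ base + Σ (events per instruction × penalty); the event rates and most penalties below are hypothetical, with only the ~350-cycle memory latency taken from the hardware description earlier in the talk:

```python
# Sketch: additive CPI breakdown (assumes stall cycles do not overlap).
# Event rates and most penalties are illustrative; the 350-cycle memory
# penalty for an L2 miss matches the measured Xeon latency cited here.

components = {
    "work (base)": 1.0,
    "L1 miss": 0.05 * 18,            # rate/instr * L2 hit latency
    "L2 miss": 0.01 * 350,           # rate/instr * memory latency
    "DTLB miss": 0.002 * 30,
    "branch mispredict": 0.01 * 20,
}

total_cpi = sum(components.values())
for name, cyc in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {cyc:5.2f} cycles ({cyc / total_cpi:.0%})")
```

Even with these made-up rates, the L2 (memory) term swamps every other component, which is the shape of the measured breakdown.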
Event: FSB_DATA_ACTIVITY
CPU cycles when the bus is busy
Normalized to CPU speed
Comparable across all CPU clock rates
2.0 GHz & 3.06 GHz with L3 have fewer data-transfer cycles
Lower memory latency in 2.0 GHz & 3.06 GHz with L3
Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95
[Chart: normalized bus data-transfer cycles (0-20) for Apache-MP at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3]
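The normalization and correlation steps can be sketched as follows; the per-platform cycle counts and speedups are invented placeholders:

```python
# Sketch: normalize bus-busy cycles by total CPU cycles so utilization
# is comparable across clock rates, then correlate with SMT speedups.
# All data points below are hypothetical placeholders.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

bus_busy_cycles = [3.2e9, 4.5e9, 6.5e9]      # hypothetical, per platform
total_cycles = [3.06e10, 2.0e10, 3.06e10]    # cycles in the same interval
utilization = [b / t for b, t in zip(bus_busy_cycles, total_cycles)]

speedup_pct = [4.0, 9.0, 13.0]               # hypothetical SMT speedups
r = pearson(utilization, speedup_pct)
print(round(r, 2))
```

Dividing by total cycles is what makes the metric clock-rate independent: a 2.0 GHz and a 3.06 GHz machine can then be compared on the same axis.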
Background
Measurement parameters
Throughput speedup
Micro-architectural performance
Discussion
Compare to simulation
Other Web workloads
SMT Performance on Web Servers
[Chart: speedup (-10% to 100%) comparing simulation results against measured results with the multiprocessor kernel, the uniprocessor kernel, and the dual processor]
Simulation vs. measurement:
              Simulation          Measurement
              Size     Miss rate  Size      Miss rate
  L1-I        128 KB   2.0%       12 KB     17%
  L1-D        128 KB   3.6%       8 KB      5.7%
  L2          16 MB    1.4%       512 KB    3.9%
  Mem latency 90 cycles           220 ~ 350 cycles
Simulated models vs. actual processors over time:
1996: simulated 62-cycle mem, 32 KB L1, 256 KB L2; actual 74-cycle mem, 16 KB L1, 256 KB L2
2000: simulated 90-cycle mem, 128 KB L1, 16384 KB L2; actual 94-cycle mem, 16 KB L1, 512 KB L2
2003: simulated 90-cycle mem, 64 KB L1, 16384 KB L2; actual 350-cycle mem, 8-12 KB L1, 512 KB L2
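The latency gap alone explains much of the optimism: under a fixed L2 miss rate, memory stalls per instruction scale directly with latency. The 1% miss rate below is a hypothetical placeholder; the 90- and 350-cycle figures are the ones contrasted in the talk:

```python
# Sketch: memory-stall CPI at simulated vs. measured latencies, holding
# a hypothetical 1% L2-misses-per-instruction rate fixed.

misses_per_instr = 0.01   # hypothetical
simulated_latency = 90    # cycles, typical of the simulation studies
measured_latency = 350    # cycles, measured on the 3.06 GHz Xeon

stall_sim = misses_per_instr * simulated_latency    # ~0.9 CPI of stall
stall_real = misses_per_instr * measured_latency    # ~3.5 CPI of stall
print(round(stall_real / stall_sim, 1))             # 3.9: ~4x more stall
```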
SPECweb99 results in paper
Dynamic + static content
Multiple programs: CGI requests, user profile logging, etc.
Speedup very close to static-only workloads
No more negative speedups in Flash
May be due to better sharing of resources among different programs
More realistic speedup evaluation of SMT
3 processors, 5 servers, 2 kernels
Exposed factors not previously examined
5~15% speedup in our best cases
Detailed analysis of memory hierarchy impact on SMT performance
All other architecture overheads secondary
Reasons why simulation results were overly optimistic
Ways of improving Simultaneous Multithreading performance
Server performance on POWER5
Using execution-driven simulation for deeper understanding
Study Chip Multiprocessor (CMP)
Intel, AMD, and IBM
Conditions when the whole pipeline needs to be flushed
[Chart: pipeline clears (0.00-0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1T-UP, 1T-MP, 2T, 2P, and 4T]