Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware


Yaoping Ruan, Princeton University

Vivek Pai, Princeton University

Erich Nahum, IBM T.J. Watson

John Tracey, IBM T.J. Watson

Motivation

 Network servers

 Throughput matters

 Hardware intensive

 Simultaneous Multithreading (SMT)

 Processor support for high throughput

 Simulated since mid-90s

 Now available: Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5

SIGMETRICS’05 http://www.cs.princeton.edu/~yruan 2

How Does SMT Work?

 Simultaneous execution of multiple jobs

 Higher utilization of functional units

[Figure: functional-unit occupancy over cycles (direction of data flow) for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 and 2 together on an SMT processor. Colored blocks are functional units currently in use.]

SMT Architecture

 Appears as multiple processors to the OS and applications

 Duplicated resources: architectural state and registers, one set per thread (#1, #2)

 Shared resources: pipeline execution units, cache hierarchy, system bus, main memory

Contributions

 Detailed analysis of multiple real hardware platforms and server packages

 Includes previously ignored OS overheads

 Micro-architectural performance analysis

 Demonstrates dominance of memory hierarchy

 Comparison with simulation studies

 Explains why SMT provides relatively small benefits on real hardware

 Overly-aggressive memory simulation yielded higher expected benefits


Outline

 Background

 Measurement methodology

 Throughput & improvement

 Micro-architectural performance

 Discussion


Measurements Overview

 Metrics

 Server throughput

 Throughput improvements (relative speedups)

 Architectural features (CPI, miss ratio, etc.)

 Multiple configurations

 Hardware platforms (clock speed, cache, etc.)

 Server software (Apache, Flash, TUX, etc.)

 Kernel configuration (uniprocessor and multiprocessor)
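The metrics listed above reduce to simple ratios. A minimal sketch, assuming hypothetical function names and caller-supplied counter values (no measured numbers from the talk are used):

```python
def relative_speedup(baseline_throughput: float, smt_throughput: float) -> float:
    """Throughput improvement of the SMT run over its baseline, in percent."""
    if baseline_throughput <= 0:
        raise ValueError("baseline throughput must be positive")
    return (smt_throughput - baseline_throughput) / baseline_throughput * 100.0


def cpi(cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction over a measurement interval."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return cycles / instructions_retired


def miss_ratio(misses: int, accesses: int) -> float:
    """Fraction of accesses that missed (cache or TLB)."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return misses / accesses
```

A negative return value from `relative_speedup` corresponds to a slowdown when SMT is enabled.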


Hardware Platforms

 Three models of Xeon processors

 Clock rate: 2.0GHz, 3.06GHz, 3.06GHz L3

 L3: 1MB (3.06GHz L3 model)

 Mem latency (cycles): 220, 350

 L1/L2 cache sizes, main memory, buses and # threads/processor are the same

Web Servers

 5 Web server packages

 Apache-MP: multi-process

 Apache-MT: multi-thread

 Flash: event-driven

 TUX: in-kernel

 Haboob: Java server, staged multi-thread model

 Benchmark

 SPECweb96 and SPECweb99


System Configuration

 5 configuration labels

 # CPUs, SMT on/off, kernel type

(T: # threads, P: # processors)

Label    # CPUs   SMT   Kernel
1P-UP    1        off   Uniprocessor
1P-MP    1        off   Multiprocessor
2T       1        on    Multiprocessor
2P       2        off   Multiprocessor
4T       2        on    Multiprocessor
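The five labels can be captured in a small lookup table, which also makes the three comparisons used in the talk explicit. This is an illustrative sketch: the `Config` class and its field names are assumptions, and the attribute values follow the label definitions given on the improvement slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    label: str
    cpus: int      # physical processors
    smt: bool      # SMT (Hyper-Threading) enabled
    kernel: str    # "UP" (uniprocessor) or "MP" (multiprocessor)

    @property
    def hardware_threads(self) -> int:
        # Two hardware threads per processor when SMT is on
        return self.cpus * (2 if self.smt else 1)


CONFIGS = {c.label: c for c in (
    Config("1P-UP", 1, False, "UP"),
    Config("1P-MP", 1, False, "MP"),
    Config("2T", 1, True, "MP"),
    Config("2P", 2, False, "MP"),
    Config("4T", 2, True, "MP"),
)}

# (SMT configuration, baseline) pairs compared in the talk
COMPARISONS = [("2T", "1P-UP"), ("2T", "1P-MP"), ("4T", "2P")]
```

Comparing 2T against both 1P-UP and 1P-MP separates the cost of running a multiprocessor kernel from the effect of SMT itself.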


Outline

 Background

 Measurement methodology

 Throughput & improvement

 Single processor

 Dual-processor

 Micro-architectural performance

 Discussion


Throughput Evaluation

[Figure: throughput of Apache-MP on the 3.06GHz processor (y-axis 0 to 1200) for the single-processor configurations (1P-UP, 1P-MP, 2T w/ SMT) and the dual-processor configurations (2P, 4T w/ SMT). Comparisons marked: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P.]

Improvement on Single Processor

2T: 2 threads, multiprocessor kernel

1P-MP: 1 thread, multiprocessor kernel

[Figure: improvement of 2T vs. 1P-MP (y-axis -10 to 40) for Apache-MP, Apache-MT, Flash, TUX and Haboob on the 2.0GHz, 3.06GHz and 3.06GHz L3 processors.]

Improvement on Single Processor

2T: 2 threads, multiprocessor kernel

1P-UP: 1 thread, uniprocessor kernel

[Figure: improvement of 2T vs. 1P-UP (y-axis -10 to 40) for Apache-MP, Apache-MT, Flash, TUX and Haboob on the 2.0GHz, 3.06GHz and 3.06GHz L3 processors. Annotation: kernel overhead.]

Improvement on Dual-processor

4T: 4 threads (2 processors, 2T/processor)

2P: 2 physical processors (SMT disabled)

 2.0GHz and 3.06GHz with L3 are better

 Memory is still the bottleneck

[Figure: improvement of 4T vs. 2P (y-axis -20 to 40) for Apache-MP, Apache-MT, Flash, TUX and Haboob on the 2.0GHz, 3.06GHz and 3.06GHz L3 processors.]

Micro-architectural Analysis

 Use Oprofile

 In-house patch to measure extra events

 About 25 performance events

 Cache miss/hit

 TLB miss/hit

 Branches

 Pipeline stall, clear, etc.

 Bus utilization


L1 Instruction Cache Miss Rate

[Figure: L1 instruction cache miss rate (y-axis 0% to 20%) for Apache-MP, Apache-MT, Flash, TUX and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P and 4T (SMT).]

L2 Cache Miss Rate

 Instruction & data unified

 Lower rate in SMT due to higher L1 misses

[Figure: L2 cache miss rate (y-axis 0% to 10%) for Apache-MP, Apache-MT, Flash, TUX and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P and 4T (SMT).]

Putting Events Together

[Figure: stacked breakdown for Apache-MP under 1P-UP, 1P-MP, 2T, 2P and 4T (y-axis 0 to 16). Components: work, L1 Miss, L2 Miss, DTLB, ITLB, Branch, Clear, Buffer, others. The L2 Miss, L1 Miss and work segments are labeled on the chart.]

Non-overlapped CPI

 L1/L2 miss penalty dominates
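One plausible reading of "non-overlapped" is that every event is charged its full penalty, as if no latencies overlapped, and the total is divided by instructions retired. The sketch below assumes that definition; the function names, event names and penalty values are placeholders for illustration, not figures from the talk.

```python
def non_overlapped_cpi(event_counts, penalty_cycles, instructions_retired):
    """Sum of (event count x full penalty) per instruction, assuming no overlap."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    stall_cycles = sum(
        count * penalty_cycles[name] for name, count in event_counts.items()
    )
    return stall_cycles / instructions_retired


def cpi_breakdown(event_counts, penalty_cycles, instructions_retired):
    """Per-event contribution to the non-overlapped CPI."""
    return {
        name: count * penalty_cycles[name] / instructions_retired
        for name, count in event_counts.items()
    }


# Placeholder inputs (hypothetical, for illustration only)
counts = {"l1_miss": 10, "l2_miss": 2, "branch_mispredict": 5}
penalties = {"l1_miss": 18, "l2_miss": 300, "branch_mispredict": 20}
parts = cpi_breakdown(counts, penalties, instructions_retired=1000)
largest = max(parts, key=parts.get)
```

Because real pipelines overlap some of these latencies, a figure computed this way is an upper bound on the stall contribution rather than a measured CPI.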


Measuring Bus Utilization

 Event: FSB_DATA_ACTIVITY

 CPU cycles when the bus is busy

 Normalized to CPU speed

 Comparable across all CPU clock rates
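The normalization can be sketched as a ratio of bus-busy time to elapsed time, both expressed in CPU cycles. The function below is hypothetical; in particular, the `cpu_cycles_per_count` parameter is an assumption to cover the case where the raw counter ticks in bus clocks rather than CPU cycles.

```python
def bus_utilization(busy_count: float, elapsed_cpu_cycles: float,
                    cpu_cycles_per_count: float = 1.0) -> float:
    """Fraction of elapsed CPU cycles during which the bus was busy.

    busy_count: raw value of the bus-activity counter.
    cpu_cycles_per_count: conversion factor (assumed); leave at 1.0 if the
    counter already reports CPU cycles.
    """
    if elapsed_cpu_cycles <= 0:
        raise ValueError("elapsed_cpu_cycles must be positive")
    busy_cpu_cycles = busy_count * cpu_cycles_per_count
    return busy_cpu_cycles / elapsed_cpu_cycles
```

Expressing both quantities in CPU cycles is what lets the 2.0GHz and 3.06GHz systems be compared on the same scale.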



Bus Utilization Results

 2.0GHz and 3.06GHz L3 have fewer data transfer cycles

 Lower memory latency in 2.0GHz and 3.06GHz with L3

 Coefficient of correlation between bus utilization and speedups: 0.62 to 0.95

[Figure: bus utilization for Apache-MP on the 2.0GHz, 3.06GHz and 3.06GHz L3 processors.]
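The slide does not say which correlation coefficient was used. Assuming the common Pearson coefficient, a minimal sketch of the calculation over paired (bus utilization, speedup) samples looks like this; the function name is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear relationship between the two series; a value near 0 indicates none.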

Outline

 Background

 Measurement parameters

 Throughput speedup

 Micro-architectural performance

 Discussion

 Compare to simulation

 Other Web workloads


SMT Performance on Web Servers

[Figure: SMT performance on Web servers (y-axis -10% to 100%): simulation compared with measurements under the multiprocessor kernel, the uniprocessor kernel and the dual-processor configuration.]

Compare to Simulation

              Simulation             Measurement
              Size      Miss rate    Size      Miss rate
L1-I          128 KB    2.0%         12 KB     17%
L1-D          128 KB    3.6%         8 KB      5.7%
L2            16 MB     1.4%         512 KB    3.9%
Mem latency   90 cycles              220 ~ 350 cycles
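A back-of-envelope calculation shows how these differences compound. The sketch below estimates the main-memory stall per L1-I access as L1 miss rate times L2 miss rate times memory latency. It assumes the L2 miss rate is local (per L2 access) and ignores L2 hit time, so it is a rough illustration and not the analysis from the paper.

```python
def memory_stall_per_l1_access(l1_miss_rate: float, l2_miss_rate: float,
                               mem_latency_cycles: float) -> float:
    """Expected cycles spent waiting on main memory per L1 access.

    Assumes a local L2 miss rate and ignores the L2 hit latency.
    """
    return l1_miss_rate * l2_miss_rate * mem_latency_cycles


# L1-I and L2 figures as listed in the comparison above
simulated = memory_stall_per_l1_access(0.020, 0.014, 90)
measured_low = memory_stall_per_l1_access(0.17, 0.039, 220)
measured_high = memory_stall_per_l1_access(0.17, 0.039, 350)

print(f"simulated:          {simulated:.4f} cycles per L1-I access")
print(f"measured (220 cyc): {measured_low:.4f} cycles per L1-I access")
print(f"measured (350 cyc): {measured_high:.4f} cycles per L1-I access")
```

Under these assumptions the measured systems spend far more cycles per access waiting on memory than the simulated one, which is consistent with the talk's point that aggressive memory models inflated the expected SMT benefit.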


Processor Development Trend

Year   Simulated models                        Actual processors
1996   62-cycle mem, 32 KB L1, 256 KB L2       74-cycle mem, 16 KB L1, 256 KB L2
2000   90-cycle mem, 128 KB L1, 16384 KB L2    94-cycle mem, 16 KB L1, 512 KB L2
2003   90-cycle mem, 64 KB L1, 16384 KB L2     350-cycle mem, 8-12 KB L1, 512 KB L2

SMT on SPECweb99

 SPECweb99 results in paper

 Dynamic + static

 Multiple programs

• CGI requests, user profile logging, etc.

 Speedup very close to static-only workloads

 No more negative speedups in Flash

 May be due to better sharing of resources among different programs


Summary

 More realistic speedup evaluation of SMT

 3 processors, 5 servers, 2 kernels

 Exposed factors not previously examined

 5~15% speedup in our best cases

 Detailed analysis of memory hierarchy impact on SMT performance

 All other architecture overheads secondary

 Reasons why simulation results were overly optimistic


Thank you

http://www.cs.princeton.edu/~yruan

Future Work

 Ways of improving Simultaneous Multithreading performance

 Server performance on POWER5

 Using execution driven simulation for deeper understanding

 Study Chip Multiprocessor (CMP)

 Intel, AMD, and IBM


Pipeline Clears (per Byte)

 Conditions when the whole pipeline needs to be flushed

[Figure: pipeline clears per byte (y-axis 0.00 to 0.30) for Apache-MP, Apache-MT, Flash, TUX and Haboob under 1T-UP, 1T-MP, 2T, 2P and 4T.]
