Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
Network servers
Throughput matters
Hardware intensive
Simultaneous Multithreading (SMT)
Processor support for high throughput
Simulated since mid-90s
Now: Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5 available
SIGMETRICS’05 http://www.cs.princeton.edu/~yruan 2
Simultaneous execution of multiple jobs
Higher utilization of functional unit cycles
[Figure: pipeline diagrams showing Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 interleaved on an SMT processor; colored blocks are functional units currently in use, x-axis is cycles (direction of data flow)]
Appear as multiprocessors to the OS and applications
Duplicated resources: architectural state (registers) #1 and #2
Shared resources: pipeline execution units, cache hierarchy, system bus, main memory
Detailed analysis of multiple real hardware platforms and server packages
Includes previously ignored OS overheads
Micro-architectural performance analysis
Demonstrates dominance of memory hierarchy
Comparison with simulation studies
Explain why SMT provides relatively small benefits on real hardware
Overly-aggressive memory simulation yielded higher expected benefits
Background
Measurement methodology
Throughput & improvement
Micro-architectural performance
Discussion
Metrics
Server throughput
Throughput improvements (relative speedups)
Architectural features (CPI, miss ratio, etc.)
Multiple configurations
Hardware platforms (clock speed, cache, etc.)
Server software (Apache, Flash, TUX, etc.)
Kernel configuration (uniprocessor and multiprocessor)
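These metrics can be sketched as a small calculation; the functions and the throughput/cycle counts below are hypothetical illustrations, not measured numbers from the study:

```python
# Sketch of the two headline metrics: relative speedup between two
# configurations, and cycles-per-instruction (CPI) from raw counts.
# All inputs here are made-up illustration values.

def relative_speedup(throughput_new, throughput_base):
    """Relative throughput improvement in percent (e.g. 2T vs. 1P-MP)."""
    return (throughput_new / throughput_base - 1.0) * 100.0

def cpi(cycles, instructions):
    """Cycles per instruction; higher means more stalls per unit of work."""
    return cycles / instructions

print(round(relative_speedup(1150.0, 1000.0), 1))  # 15.0
print(round(cpi(3.0e9, 1.2e9), 2))                 # 2.5
```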
Three models of Xeon processors
Clock rate: 2.0 GHz / 3.06 GHz / 3.06 GHz with 1 MB L3
Memory latency: 220 ~ 350 cycles
L1/L2 cache sizes, main memory, buses, and # threads/processor are the same
5 Web server packages
Apache-MP: multi-process
Apache-MT: multi-thread
Flash: event-driven
TUX: in-kernel
Haboob: Java server, staged multi-thread model
Benchmark
SPECweb96 and SPECweb99
5 configuration labels, combining # CPUs, SMT on/off, and kernel type (T = # threads, P = # processors):
1P-UP: 1 CPU, SMT off, uniprocessor kernel
1P-MP: 1 CPU, SMT off, multiprocessor kernel
2T: 1 CPU, SMT on, multiprocessor kernel
2P: 2 CPUs, SMT off, multiprocessor kernel
4T: 2 CPUs, SMT on, multiprocessor kernel
Background
Measurement methodology
Throughput & improvement
Single processor
Dual-processor
Micro-architectural performance
Discussion
[Chart: throughput (0-1200) for Apache-MP at 3.06 GHz across single-processor (1P-UP, 1P-MP, 2T w/ SMT) and dual-processor (2P, 4T w/ SMT) configurations; comparisons: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P]
Improvement on Single Processor
2T: 2 threads, multiprocessor kernel
1P-MP: 1 thread, multiprocessor kernel
[Chart: 2T vs. 1P-MP throughput improvement (-10% to 40%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3]
Improvement on Single Processor
2T: 2 threads, multiprocessor kernel
1P-UP: 1 thread, uniprocessor kernel
[Chart: 2T vs. 1P-UP throughput improvement (-10% to 40%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3; the gap relative to the 2T vs. 1P-MP comparison reflects kernel overhead]
4T: 4 threads (2 processors, 2 threads/processor)
2P: 2 physical processors (SMT disabled)
2.0 GHz & 3.06 GHz with L3 are better; memory is still the bottleneck
[Chart: 4T vs. 2P throughput improvement (-20% to 40%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3]
Use OProfile
In-house patch to measure extra events
About 25 performance events
Cache miss/hit
TLB miss/hit
Branches
Pipeline stall, clear, etc.
Bus utilization
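As a sketch of the post-processing step, hypothetical event counts (not real OProfile output) can be turned into the ratios used in the following slides:

```python
# Sketch: derive miss ratios and CPI from raw hardware event counts,
# as one would after collecting them with OProfile. Counts are invented.

events = {
    "l2_references": 50_000_000,
    "l2_misses": 2_000_000,
    "instructions": 800_000_000,
    "cycles": 2_400_000_000,
}

l2_miss_ratio = events["l2_misses"] / events["l2_references"]
cpi = events["cycles"] / events["instructions"]

print(f"L2 miss ratio: {l2_miss_ratio:.1%}")  # L2 miss ratio: 4.0%
print(f"CPI: {cpi:.2f}")                      # CPI: 3.00
```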
[Chart: rates (0%-20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T(SMT), 2P, and 4T(SMT)]
Instruction & data unified
Lower rate in SMT due to higher L1 misses
[Chart: miss rates (0%-10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T(SMT), 2P, and 4T(SMT)]
[Chart: CPI breakdown (0-16 cycles) for Apache-MP under 1P-UP, 1P-MP, 2T, 2P, and 4T; stacked components: work, L1 miss, L2 miss, DTLB, ITLB, branch, clear, buffer, and others]
L1/L2 miss penalty dominates
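One way to see the dominance is an additive stall model, CPI ≈ base + Σ (events per instruction × penalty); the event rates and most penalties below are hypothetical, with only the ~350-cycle memory latency taken from the hardware description earlier in the talk:

```python
# Sketch: additive CPI breakdown (assumes stall cycles do not overlap).
# Event rates and most penalties are illustrative; the 350-cycle memory
# penalty for an L2 miss matches the measured Xeon latency cited here.

components = {
    "work (base)": 1.0,
    "L1 miss": 0.05 * 18,            # rate/instr * L2 hit latency
    "L2 miss": 0.01 * 350,           # rate/instr * memory latency
    "DTLB miss": 0.002 * 30,
    "branch mispredict": 0.01 * 20,
}

total_cpi = sum(components.values())
for name, cyc in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {cyc:5.2f} cycles ({cyc / total_cpi:.0%})")
```

Even with these made-up rates, the L2 (memory) term swamps every other component, which is the shape of the measured breakdown.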
Event: FSB_DATA_ACTIVITY
CPU cycles when the bus is busy
Normalized to CPU speed
Comparable across all CPU clock rates
2.0 GHz & 3.06 GHz with L3 have fewer data-transfer cycles
Lower memory latency in 2.0 GHz & 3.06 GHz with L3
Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95
[Chart: normalized bus data-transfer cycles (0-20) for Apache-MP at 2.0 GHz, 3.06 GHz, and 3.06 GHz w/ L3]
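The normalization and correlation steps can be sketched as follows; the per-platform cycle counts and speedups are invented placeholders:

```python
# Sketch: normalize bus-busy cycles by total CPU cycles so utilization
# is comparable across clock rates, then correlate with SMT speedups.
# All data points below are hypothetical placeholders.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

bus_busy_cycles = [3.2e9, 4.5e9, 6.5e9]      # hypothetical, per platform
total_cycles = [3.06e10, 2.0e10, 3.06e10]    # cycles in the same interval
utilization = [b / t for b, t in zip(bus_busy_cycles, total_cycles)]

speedup_pct = [4.0, 9.0, 13.0]               # hypothetical SMT speedups
r = pearson(utilization, speedup_pct)
print(round(r, 2))
```

Dividing by total cycles is what makes the metric clock-rate independent: a 2.0 GHz and a 3.06 GHz machine can then be compared on the same axis.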
Background
Measurement parameters
Throughput speedup
Micro-architectural performance
Discussion
Compare to simulation
Other Web workloads
SMT Performance on Web Servers
[Chart: speedup (-10% to 100%) comparing simulation results against measured results with the multiprocessor kernel, the uniprocessor kernel, and the dual processor]
Simulation vs. measurement:
              Simulation          Measurement
              Size     Miss rate  Size      Miss rate
  L1-I        128 KB   2.0%       12 KB     17%
  L1-D        128 KB   3.6%       8 KB      5.7%
  L2          16 MB    1.4%       512 KB    3.9%
  Mem latency 90 cycles           220 ~ 350 cycles
Simulated models vs. actual processors over time:
1996: simulated 62-cycle mem, 32 KB L1, 256 KB L2; actual 74-cycle mem, 16 KB L1, 256 KB L2
2000: simulated 90-cycle mem, 128 KB L1, 16384 KB L2; actual 94-cycle mem, 16 KB L1, 512 KB L2
2003: simulated 90-cycle mem, 64 KB L1, 16384 KB L2; actual 350-cycle mem, 8-12 KB L1, 512 KB L2
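The latency gap alone explains much of the optimism: under a fixed L2 miss rate, memory stalls per instruction scale directly with latency. The 1% miss rate below is a hypothetical placeholder; the 90- and 350-cycle figures are the ones contrasted in the talk:

```python
# Sketch: memory-stall CPI at simulated vs. measured latencies, holding
# a hypothetical 1% L2-misses-per-instruction rate fixed.

misses_per_instr = 0.01   # hypothetical
simulated_latency = 90    # cycles, typical of the simulation studies
measured_latency = 350    # cycles, measured on the 3.06 GHz Xeon

stall_sim = misses_per_instr * simulated_latency    # ~0.9 CPI of stall
stall_real = misses_per_instr * measured_latency    # ~3.5 CPI of stall
print(round(stall_real / stall_sim, 1))             # 3.9: ~4x more stall
```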
SPECweb99 results in paper
Dynamic + static content
Multiple programs: CGI requests, user profile logging, etc.
Speedup very close to static-only workloads
No more negative speedups in Flash
May be due to better sharing of resources among different programs
More realistic speedup evaluation of SMT
3 processors, 5 servers, 2 kernels
Exposed factors not previously examined
5~15% speedup in our best cases
Detailed analysis of memory hierarchy impact on SMT performance
All other architecture overheads secondary
Reasons why simulation results were overly optimistic
Ways of improving Simultaneous Multithreading performance
Server performance on POWER5
Using execution-driven simulation for deeper understanding
Study Chip Multiprocessor (CMP)
Intel, AMD, and IBM
Conditions when the whole pipeline needs to be flushed
[Chart: pipeline clears (0.00-0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1T-UP, 1T-MP, 2T, 2P, and 4T]