
Multithreaded Architectures

Uri Weiser and Yoav Etsion


Overview

Multithreaded Software

Multithreaded Architecture

Multithreaded Micro-Architecture

Conclusions


Multithreading Basics

- Process
  - Each process has its own unique address space
  - Can consist of several threads
- Thread – each thread has its own execution context
  - A thread has its own PC (sequencer) + registers + stack
  - All threads within a process share the same address space
  - A private (per-thread) heap is optional
- Multithreaded apps: a process is broken into threads (see the sketch below)
  - #1 example: transactions (databases, web servers)
  - Increased concurrency
  - Partial blocking (your transaction blocks, mine doesn’t have to)
  - Centralized (smarter) resource management (by process)
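To make the shared-address-space point concrete, here is a minimal POSIX-threads sketch (my illustration, not from the slides; `worker` and `shared_counter` are made-up names): the global counter is visible to both threads, while each thread’s loop counter lives on its own private stack.

```c
/* Minimal sketch: one process, two threads. The global counter lives in the
 * shared address space; each thread's loop counter lives on its own stack. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static int shared_counter = 0;                    /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int local = 0;                                /* private: on this thread's stack */
    for (int i = 0; i < 100000; i++) {
        local++;
        pthread_mutex_lock(&lock);
        shared_counter++;                         /* visible to both threads */
        pthread_mutex_unlock(&lock);
    }
    printf("thread %ld: local = %d\n", (long)(intptr_t)arg, local);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    pthread_create(&t2, NULL, worker, (void *)(intptr_t)2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);  /* 200000 */
    return 0;
}
```

Compile with `-pthread`; both threads print `local = 100000`, and the shared counter accumulates both threads’ increments.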


Superscalar Execution

Example – a 5-stage, 4-wide machine

[Pipeline diagram: each cycle, four instructions enter the pipeline together and flow through F, D1, D2, EX/RET, WB]


Superscalar: Execution Stage

[Diagram: execution-stage slot occupancy over time]

Generally increases throughput, but decreases utilization


CMP – Chip Multi-Processor

Multiple threads/tasks run on multiple processors/cores

[Diagram: per-core execution slots over time]

Better utilization / lower per-thread throughput / higher overall throughput


Multiple Hardware Threads

- A thread can be viewed as a stream of instructions
  - State is represented by the PC, stack pointer, and general-purpose registers
- Equip multithreaded processors with multiple hardware contexts (threads)
  - Multiple register files, PCs, SPs
  - The OS views the CPU as multiple logical processors
- Execution of instructions from different threads is interleaved in the pipeline
  - The interleaving policy is critical…


Blocked Multithreading

Multiple threads/tasks run on a single core

[Diagram: execution slots over time; the core switches threads on long stalls]

How is this different from OS scheduling?

May increase utilization and throughput, but the core must switch whenever the current thread enters a low-utilization/low-throughput stretch (e.g., an L2 cache miss)


Fine Grained Multithreading

Multiple threads/tasks run on a single core

[Diagram: execution slots over time; the core alternates threads cycle by cycle]

Increases utilization/throughput by reducing the impact of dependences


Simultaneous Multithreading (SMT)

Multiple threads/tasks run on a single core

[Diagram: execution slots over time; instructions from different threads issue in the same cycle]

Increases utilization/throughput


Blocked Multithreading

(SoE-MT – Switch-on-Event MT, a.k.a. “Poor Man’s MT”)

- Critical decision: when to switch threads
  - When the current thread’s utilization/throughput is about to drop (e.g., an L2 cache miss)
- Requirements for throughput (see the worked numbers below):
  - (Thread-switch time) + (pipe-fill time) << blocking latency
  - Would like to get some work done before the other thread comes back
  - Fast thread switch: multiple register banks
  - Fast pipe fill: short pipe
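As a quick sanity check of this inequality, with purely hypothetical numbers (not from the slides):

```latex
% Assumed: thread switch = 5 cycles, pipe fill = 10 cycles, L2-miss latency = 100 cycles
(5 + 10)\ \text{cycles} \ll 100\ \text{cycles}
\;\Rightarrow\; \text{each switch buys} \approx 85\ \text{cycles of useful work from another thread}
```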

- Advantage: small changes to existing hardware
- Drawback: high single-thread performance requires a long thread switch
- Examples:
  - Macro-dataflow machine
  - MIT Alewife
  - IBM Northstar


Interleaved (Fine-Grained) Multithreading

- Critical decision: none?
- Requirements for throughput:
  - Enough threads to hide data-access latencies and dependences
  - Increasing the number of threads reduces single-thread performance
- Implementation options: fixed vs. flexible interleave

Superscalar pipeline              Multithreaded pipeline
  I1:      F D E M W                I1 of T1: F D E M W
  I2:      F D E M W                I1 of T2: F D E M W
                                    I2 of T1: F D E M W


Interleaved (Fine-Grained) Multithreading (cont.)

- Advantages:
  - (With flexible interleave:) reasonable single-thread performance
  - High processor utilization (especially with many threads)
- Drawbacks:
  - Complicated hardware
  - Requires highly parallel software for efficient execution
  - Massive cache pressure
  - (With fixed interleave:) limits single-thread performance
- Examples:
  - Denelcor HEP: 8 threads (latencies were shorter then)
  - TERA: 128 threads
  - MicroUnity: 5 threads on a 1 GHz core, so each thread sees 200 MHz-like latencies
  - GPUs employ a variant of fine-grained interleaving


Fine-Grained Multithreading in GPGPUs

[Diagram: warps interleaved over time on one GPGPU core]

- 32 threads per group (warp); the GPGPU core alternates between 48 warps
- Total number of threads running on a core: 32 * 48 = 1536 (see the arithmetic below)
- Huge register file (around 128KB)
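Dividing the stated register-file size by the stated thread count gives the implied per-thread register budget (simple arithmetic on the numbers above):

```latex
\frac{128\,\text{KB}}{1536\ \text{threads}} \approx 85\ \text{bytes/thread} \approx 21\ \text{32-bit registers per thread}
```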


Simultaneous Multi-Threading (SMT)

- Critical decision: the fetch-interleaving policy
- Requirements for throughput:
  - Enough threads to utilize resources
  - Fewer threads than fine-grained MT needs, since dependences need not be stretched
  - Limit: too many threads reduce cache effectiveness
- Examples:
  - Compaq Alpha EV8 (cancelled)
  - Intel Hyper-Threading Technology


Caching and Data Sharing

- Decoupled caches = private caches
  ✓ Guarantees each thread its own cache space
  ✗ Requires coherence protocols across the caches
  ✗ Cache allocation is rigid
- Shared caches
  ✓ Faster communication among threads
  ✓ No coherence overhead
  ✓ Flexibility in allocating resources
  ✗ Threads can step on each other’s toes…

[Diagram: an 8-way set-associative shared cache; each way holds a tag + data pair (Way 0 … Way 7) – a structural sketch follows]
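A rough C sketch of that structure (the set count and line size are assumptions for illustration, not the slides’ parameters):

```c
/* 8-way set-associative cache: each set holds 8 (tag, data) ways, matching
 * the Way 0 ... Way 7 layout in the diagram. Sizes here are assumptions. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_WAYS   8
#define NUM_SETS   64
#define LINE_BYTES 64

struct way { bool valid; uint64_t tag; uint8_t data[LINE_BYTES]; };
static struct way cache[NUM_SETS][NUM_WAYS];

/* Probe all 8 ways of the set selected by the address; return the line on a hit. */
static uint8_t *cache_lookup(uint64_t addr) {
    uint64_t block = addr / LINE_BYTES;
    uint64_t set   = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;
    for (int w = 0; w < NUM_WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return cache[set][w].data;            /* hit */
    return NULL;                                  /* miss */
}
```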


SMT Case Study: EV8

8-issue OOO processor (a wide machine)

SMT support:

- Multiple sequencers (PCs): 4
- Alternate fetch from multiple threads
- Separate register renaming for each thread
  - More (4x) physical registers
- Thread tags on all shared resources
  - ROB, LSQ, BTB, etc.
  - Allow per-thread flush/trap/retirement
- Process tags on all address-space resources: caches, TLBs, etc.

Notice: none of these things are in the “core” –
the instruction queues, scheduler, and wakeup logic are SMT-blind


Basic EV8

[Pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with one PC, one Icache, one register map, one register file, and one Dcache – the entire pipeline is thread-blind]


SMT EV8

[Pipeline diagram: the same stages, but fetch now selects among multiple PCs and decode/map keeps a register map per thread; the queue, reg-read, execute, and Dcache stages are labeled thread-blind]

Thread-blind? Not really.


Performance Scalability

[Charts: relative throughput with 1T–4T threads.
Panel 1 – multiprogrammed workload: SpecInt, SpecFP, mixed Int/FP (y-axis 0–250%).
Panel 2 – decomposed SPEC95 applications: Turb3d, Swm256, Tomcatv (y-axis 0–250%).
Panel 3 – multithreaded applications: Barnes, Chess, Sort, TP, with 1T/2T/4T (y-axis 100–300%).]


Performance Gain – When?

- When threads do not “step” on each other’s “predicted state” → caches, branch prediction, data prefetch…
- If ample resources are available
  - A slow resource can hamper performance


Scheduling in SMT

- What if one thread gets “stuck”?
  - With round-robin, it will eventually fill up the machine (resource contention)
- ICOUNT: the thread with the fewest instructions in the pipe has priority (a sketch follows)
  - A stuck thread doesn’t get to fetch until it gets “unstuck”
- Other policies have been devised to get the best balance, fairness, and overall performance
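A minimal sketch of the ICOUNT selection step (my illustration; in a real pipeline the `icount[]` values would be updated on every fetch and retire):

```c
/* ICOUNT policy sketch: each cycle, fetch from the runnable hardware thread
 * with the fewest instructions in the pre-execute pipeline stages. */
#define NUM_THREADS 4

static int icount[NUM_THREADS];    /* instructions in decode/rename/queues */
static int can_fetch[NUM_THREADS]; /* 0 while a thread is stalled ("stuck") */

static int icount_pick(void) {
    int best = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (!can_fetch[t])
            continue;                        /* stuck threads don't fetch */
        if (best < 0 || icount[t] < icount[best])
            best = t;                        /* fewest in-flight wins */
    }
    return best;                             /* -1: no thread may fetch */
}
```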


Scheduling in SMT (cont.)

- Variation: what if one thread is spinning?
  - It is not really stuck, so it gets to keep fetching
  - It must be stopped/slowed artificially (QUIESCE)
    - Sleep until a memory location changes
    - On the Pentium 4: use the PAUSE instruction (sketch below)
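On x86, a cooperative spin loop typically looks like the following sketch; `_mm_pause()` is the standard compiler intrinsic that emits the PAUSE instruction:

```c
/* Spin-wait that plays nicely with Hyper-Threading: PAUSE hints to the core
 * that this logical processor is spinning, freeing resources for its sibling. */
#include <immintrin.h>
#include <stdatomic.h>

void spin_until_set(atomic_int *flag) {
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        _mm_pause();                  /* x86 PAUSE instruction */
}
```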


Intermediate summary

- Multithreaded software
- Multithreaded architecture → advantageous cost/throughput…
  - Blocked MT – long-latency tolerance
    - Good single-thread performance, good throughput
    - Needs a fast thread switch and a short pipe
  - Interleaved MT – ILP increase
    - Bad single-thread performance, good throughput
    - Needs many threads; several loaded contexts…
  - Simultaneous MT – ILP increase
    - OK MT performance
    - Good single-thread performance
    - Good utilization…
    - Needs fewer threads


SMT vs. CMP

[Diagram: three equal-area chips – Reference: Core A + 1MB L2; SMT: one SMT core + 2MB L2; CMP: Core A + Core B + 2MB L2]

- How to use 2x transistors? A better core + SMT, or CMP?
- SMT
  - Better single-thread performance
  - Better resource utilization
- CMP
  - Much easier to implement
  - Avoids wire-delay problems
  - Overall better throughput per area/power
  - More deterministic performance
  - BUT! the program must be multithreaded
- Rough numbers when doubling the number of transistors:
  - SMT: ST performance 1.3x / MT performance 1.5–1.7x
  - CMP: ST performance 1x / MT performance 2x


Example: Pentium® 4 Hyper-Threading

- Executes two threads simultaneously
  - Two different applications, or
  - Two threads of the same application
- The CPU maintains architecture state for two processors
  - Two logical processors per physical processor

Implementation on the Intel® Xeon™ processor:

- Two logical processors for < 5% additional die area
- The processor presents itself as 2 cores in an MP shared-memory system


uArchitecture impact

- Replicate some resources
  - All per-CPU architectural state
  - Instruction pointers, renaming logic
  - Some smaller resources, e.g., the return stack predictor, ITLB, etc.
- Partition some resources (share by splitting in half per thread)
  - Re-order buffer
  - Load/store buffers
  - Queues, etc.
- Share most resources
  - Out-of-order execution engine
  - Caches


Hyper-Threading Performance

- Performance varies as expected with:
  - The number of parallel threads
  - Resource utilization
- Less aggressive MT than EV8 → less impressive scalability
  - The machine is optimized for single-thread execution

[Chart: relative performance, IXP-MP with Hyper-Threading disabled = 1.00 baseline; with Hyper-Threading enabled, speedups of 1.18x–1.30x on Microsoft* Active Directory, Microsoft* SQL Server, Microsoft* Exchange, Microsoft* IIS, and a parallel GCC compile]

Intel Xeon Processor MP platforms are prototype systems in 2-way configurations
Applications were not tuned or optimized for Hyper-Threading Technology
*All trademarks and brands are the property of their respective owners


Analysis of a Multi-Thread Architecture:

A simple switch-on-event (SoE) processor

[Timeline: execution phases alternate with memory accesses; each memory access stalls the thread for P cycles, and one execution phase plus one memory access makes up one period]


Multi-Thread Architecture (SoE)

[Diagram: core C holds the architectural states of several threads and connects to memory; this figure recurs on the following slides]

Memory latency is hidden by executing multiple threads


Multi-Thread Architecture (SoE)

Instructions per memory access = 1/MI

Definitions used throughout this analysis:
- CPI_i = ideal CPI of core C
- MI = memory accesses per instruction
- P = memory-access penalty (cycles)
- n = number of threads


Multi-Thread Architecture (SoE)

Period = (1/MI) * CPI_i + P

CPI_1 = [(1/MI) * CPI_i + P] / (1/MI) = CPI_i + MI * P   [cycles/instruction]
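Plugging in hypothetical numbers (illustrative only, not the slides’):

```latex
% Assume CPI_i = 1, MI = 0.2 accesses/instruction, P = 100 cycles:
CPI_1 = CPI_i + MI \cdot P = 1 + 0.2 \times 100 = 21\ \text{cycles/instruction}
```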


Multi-Thread Architecture (SoE)

Period = (1/MI) * CPI_i + P

CPI_MT = [(1/MI) * CPI_i + P] / (n * (1/MI)) = (CPI_i + MI * P) / n   [cycles/instruction]


Multi-Thread Architecture (SoE)

IPC_MT = n * IPC_1 = n * (1/CPI_1) = n / (CPI_i + MI * P)


Multi-Thread Architecture (SoE)

Memory latency is hidden by executing multiple threads

Max performance?

[Plot: IPC_MT vs. number of threads n]


Multi-Thread Architecture (SoE)

Memory latency is hidden by executing multiple threads

Max performance = 1/CPI_i

[Plot: IPC_MT rises with n from 1 thread and saturates at 1/CPI_i once n reaches N]


So how many threads do we need for maximum performance?

Set the MT throughput equal to the core’s peak:

1/CPI_i = n / (CPI_i + MI * P)

n = (CPI_i + MI * P) / CPI_i

N = 1 + (MI * P) / CPI_i
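With the same hypothetical numbers as before (CPI_i = 1, MI = 0.2, P = 100):

```latex
N = 1 + \frac{MI \cdot P}{CPI_i} = 1 + \frac{0.2 \times 100}{1} = 21\ \text{threads}
```

At that point IPC_MT = 21 / (1 + 20) = 1 = 1/CPI_i, i.e., the core is saturated.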


SoE: Now with a Cache! (single thread)

Period = (1/MI) * CPI_i + MR * P

CPI_1 = [(1/MI) * CPI_i + MR * P] / (1/MI) = CPI_i + MI * MR * P   [cycles/instruction]

MR = cache miss rate
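Continuing the same hypothetical numbers, a cache with miss rate MR = 0.1 shrinks the memory term tenfold, and (by the same substitution of MR·P for P) the thread count needed to saturate the core drops accordingly:

```latex
CPI_1 = CPI_i + MI \cdot MR \cdot P = 1 + 0.2 \times 0.1 \times 100 = 3
\qquad
N = 1 + \frac{MI \cdot MR \cdot P}{CPI_i} = 3\ \text{threads}
```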


The unified machine – the valley

When we have a cache, the memory penalty depends on the miss rate:

Perf = n / (CPI_i + MI * MR(n) * P)

- Three regions: the cache region, the valley, and the MT region

[Plot: performance vs. number of threads, showing the cache region, the valley, and the MT region – a toy numerical model follows]
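A toy numerical model of the valley (entirely a sketch: the miss-rate curve `MR(n)` below is an assumed shape, not measured data):

```c
/* Toy model: per-thread cache share shrinks as n grows, so MR(n) rises.
 * Performance first climbs in the cache region, dips in the valley, and
 * recovers in the MT region once enough threads hide the misses. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double cpi_i = 1.0, mi = 0.2, p = 100.0;  /* same numbers as above */
    for (int n = 1; n <= 64; n *= 2) {
        /* Assumed miss-rate shape: tiny while ~4 threads fit in the cache,
         * then rising quickly toward 1. Purely illustrative. */
        double mr = (n <= 4) ? 0.01 : 1.0 - exp(-(n - 4) / 2.0);
        double perf = n / (cpi_i + mi * mr * p);
        printf("n=%2d  MR=%.2f  Perf=%.2f\n", n, mr, perf);
    }
    return 0;
}
```

With these numbers (compile with `-lm`), performance peaks at n = 4 in the cache region, collapses once the working sets overflow the cache, and only recovers deep in the MT region.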


Multi-Thread Architecture (SoE) conclusions

- Applicable when many threads are available
- Requires saving per-thread state
- Can reach “high” performance (~1/CPI_i)
- Bandwidth to memory is high → high power


BACKUP SLIDES


The unified machine – Parameter: Cache Size

- Increasing the cache size → the cache hides more of the memory latency → favors the MC (cache) region

[Chart: “Performance for Different Cache Sizes” – performance vs. number of threads for no cache, 16M, 32M, 64M, 128M, and a perfect $; at this point, unlimited BW to memory]

Cache Size Impact

- An increase in cache size → the cache suffices for more in-flight threads
  - Extends the $ region… AND is also valuable in the MT region
- Caches reduce off-chip bandwidth → delay the BW-saturation point

[Chart: “Performance for Different Cache Sizes (Limited BW)” – performance vs. number of threads for no $, 16M, 32M, 64M, 128M, and a perfect $]


Memory Latency Impact

- An increase in memory latency → hinders the MT region
- Emphasizes the importance of caches

[Chart: “Performance for Different Memory Latencies” – performance vs. number of threads for zero latency, 50, 100, 300, and 1000 cycles; unlimited BW to memory]

