Uri Weiser and Yoav Etsion
Computer Architecture 2015 - Multithreading
Basics
Process
Each process has its unique address space
Can consist of several threads
Thread – each thread has its unique execution context
Thread has its own PC (sequencer) + registers + stack
All threads (within a process) share same address space
Private heap is optional
Multithreaded apps: process broken into threads
#1 example: transactions (databases, web servers)
Increased concurrency
Partial blocking (your transaction blocks, mine doesn’t have to)
Centralized (smarter) resource management (by process)
Example – 5 stages, 4 wide machine
[Pipeline diagram: four parallel instances of a 5-stage pipeline — F, D1, D2, EX/RET, WB]
[Timeline diagram]
Generally increases throughput, but decreases utilization
Multiple threads/tasks run on a multi-processor/multi-core
[Timeline diagram]
Better utilization, lower per-thread throughput, higher overall throughput
A thread can be viewed as a stream of instructions
State is represented by PC, Stack pointer, GP Registers
Equip multithreaded processors with multiple hardware contexts (threads)
Multiple register files, PCs, SPs
OS views CPU as multiple logical processors
Execution of instructions from different threads is interleaved in the pipeline
Interleaving policy is critical…
Multi-threading/multi-tasking on a single core
[Timeline diagram]
How is this different from OS scheduling?
May increase utilization and throughput, but must switch when the current thread enters a low-utilization/throughput section (e.g. an L2 cache miss)
Multi-threading/multi-tasking on a single core
[Timeline diagram]
Increases utilization/throughput by reducing the impact of dependences
Multi-threading/multi-tasking on a single core
[Timeline diagram]
Increases utilization/throughput
Blocked Multithreading (SoE-MT – Switch-on-Event MT, aka “Poor Man’s MT”)
Critical decision: when to switch threads
When current thread’s utilization/throughput is about to drop
(e.g. L2 cache miss)
Requirements for throughput:
(Thread switch) + (pipe fill time) << blocking latency
Would like to get some work done before other thread comes back
Fast thread-switch: multiple register banks
Fast pipe-fill: short pipe
Advantage: small changes to existing hardware
Drawback: high single thread performance requires long thread switch
Examples
Macro-dataflow machine
MIT Alewife
IBM Northstar
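The throughput requirement above — (thread switch) + (pipe fill) << blocking latency — can be checked with a tiny model. All numbers below are illustrative assumptions, not from the slides:

```python
# Illustrative sketch: does blocked (switch-on-event) MT pay off for given
# switch cost, pipe depth, and stall length? All parameters are assumptions.
def soe_mt_utilization(switch_cycles, pipe_fill_cycles, blocking_latency):
    """Fraction of a blocking period in which the other thread does useful work."""
    overhead = switch_cycles + pipe_fill_cycles
    if overhead >= blocking_latency:
        return 0.0  # switching costs more than the stall it hides
    return (blocking_latency - overhead) / blocking_latency

# Fast switch (multiple register banks) + short pipe vs. a 100-cycle L2 miss:
u = soe_mt_utilization(switch_cycles=2, pipe_fill_cycles=7, blocking_latency=100)
print(f"useful-work fraction: {u:.2f}")  # 0.91 -> the << condition holds
```

With a slow switch or a deep pipe the fraction collapses toward zero, which is why the slide calls out register banks and short pipes as requirements.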
Interleaved (Fine grained)
Multithreading
Critical decision: none?
Requirements for throughput:
Enough threads to hide data-access latencies and dependences
Increasing number of threads reduces single-thread performance
Implementation options: fixed vs. flexible
[Diagram: superscalar pipeline vs. multithreaded pipeline. Superscalar: instructions I1, I2 of one thread enter F D E M W in consecutive cycles. Multithreaded: instruction 1 of thread 1, instruction 1 of thread 2, and instruction 2 of thread 1 are interleaved cycle by cycle.]
Interleaved (Fine grained)
Multithreading (cont.)
Advantages:
(w/ flexible interleave:) Reasonable single thread performance
High processor utilization (esp. with many threads)
Drawback:
Complicated hardware
Requires highly parallel software for efficient execution
Massive cache pressure
(w/ fixed interleave:) limits single thread performance
Examples:
HEP Denelcor: 8 threads (latencies were shorter then)
TERA: 128 threads
MicroUnity: 5 threads on a 1 GHz pipeline → each thread sees 200 MHz-like latency
GPUs employ a variant of fine-grain interleaving
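The fixed-interleave drawback above can be made concrete with a small sketch (assumed model: single-issue pipeline, strict round-robin over T hardware threads — this is why fixed interleave limits single-thread performance):

```python
# Minimal sketch (assumed model): fixed round-robin interleaving over T
# hardware threads on a single-issue pipeline. Each cycle issues from thread
# (cycle mod T), so a lone thread effectively runs T times slower.
def cycles_to_finish(instructions, thread_id, num_threads):
    """Cycles until `thread_id` retires `instructions` under fixed interleave."""
    # Thread t issues on cycles t, t+T, t+2T, ...; its k-th instruction
    # (1-based) issues on cycle t + (k-1)*T, finishing one cycle later.
    return thread_id + (instructions - 1) * num_threads + 1

# One thread alone on a 4-thread fixed-interleave pipe:
print(cycles_to_finish(100, thread_id=0, num_threads=4))  # 397: ~4x slowdown
print(cycles_to_finish(100, thread_id=0, num_threads=1))  # 100: no slowdown
```

A flexible interleave would give idle slots back to the running thread, recovering most of the single-thread performance.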
GPU example:
[Timeline diagram: fine-grained warp interleaving]
• 32 threads per group (warp); a GPGPU core alternates between 48 warps
• Total number of threads running on a core: 32 * 48 = 1536
• Huge register file (around 128KB)
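A quick arithmetic check of these numbers; the per-thread register count is derived from the stated totals (assuming 32-bit registers), not given in the slides:

```python
# Back-of-envelope check of the GPU figures above.
threads_per_warp = 32
warps_per_core = 48
threads_per_core = threads_per_warp * warps_per_core
print(threads_per_core)  # 1536

regfile_bytes = 128 * 1024          # "around 128KB"
bytes_per_reg = 4                   # assumption: 32-bit registers
regs_per_thread = regfile_bytes // (threads_per_core * bytes_per_reg)
print(regs_per_thread)  # 21 -> only ~21 registers available per thread
```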
Simultaneous
Multi-threading (SMT)
Critical decision: fetch-interleaving policy
Requirements for throughput:
Enough threads to utilize resources
Fewer threads than fine-grained MT are needed to stretch dependences
Limit: too many threads reduce cache effectiveness
Examples:
Compaq Alpha EV8 (cancelled)
Intel Hyper-Threading Technology
Decoupled caches = private caches
+ Guarantees each thread’s cache space
– Requires coherence protocols across caches
– Cache allocation is rigid
Shared caches
+ Faster communication among threads
+ No coherence overhead
+ Flexibility in allocating resources
– Threads can step on each other’s toes…
[Diagram: one set of an 8-way set-associative shared cache — Tag/Data pairs for Way 0, Way 1, Way 2, …, Way 7]
SMT Case Study: EV8
8-issue OOO processor (wide machine)
SMT Support
Multiple sequencers (PC): 4
Alternate fetch from multiple threads
Separate register renaming for each thread
More (4x) physical registers
Thread tags on all shared resources
ROB, LSQ, BTB, etc.
Allow per thread flush/trap/retirement
Process tags on all address-space resources: caches, TLBs, etc.
Notice: none of these things are in the “core”
Instruction queues, scheduler, wakeup are SMT-blind
[EV8 pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire. Per-thread: PCs (fetch) and register maps (rename); thread-blind: the queues, registers, Icache, Dcache, and execution stages.]
[Same EV8 pipeline diagram, revisited: the stages marked “thread-blind”]
Thread-blind? Not really.
[Bar charts: EV8 simulated performance vs. number of threads, normalized (%):
– Multiprogrammed workload (1T–4T): SpecInt, SpecFP, mixed Int/FP (y-axis up to 250%)
– Decomposed SPEC95 applications (1T–4T): Turb3d, Swm256, Tomcatv (y-axis up to 250%)
– Multithreaded applications (1T, 2T, 4T): Barnes, Chess, Sort, TP (y-axis up to 300%)]
SMT works well when threads do not “step” on each other
“Predicted state”: $, branch prediction, data prefetch…
If ample resources are available
A slow, contended resource can hamper performance
Scheduling in SMT
What if one thread gets “stuck” (e.g. waiting on a cache miss)?
Round-robin: the stuck thread eventually fills up the machine (resource contention)
ICOUNT: the thread with the fewest instructions in the pipe has priority
A stuck thread doesn’t get to fetch until it gets “unstuck”
Other policies have been devised to get the best balance, fairness, and overall performance
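The ICOUNT idea fits in a few lines; this is an illustrative sketch of the policy, not the EV8 implementation:

```python
# Sketch of the ICOUNT fetch policy: each cycle, fetch for the thread with
# the fewest instructions in the pre-execute pipeline stages. A stalled
# thread accumulates instructions in the queues, loses fetch priority, and
# stops hogging the machine until it drains.
def icount_pick(in_flight):
    """in_flight: dict of thread_id -> instructions in the front end/queues."""
    return min(in_flight, key=in_flight.get)

in_flight = {0: 12, 1: 3, 2: 7}   # thread 0 is stuck behind a cache miss
print(icount_pick(in_flight))     # 1 -> the least-represented thread fetches
```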
Scheduling in SMT
Variation: what if one thread is spinning?
Not really stuck, so it keeps fetching
Have to stop/slow it artificially
QUIESCE (EV8): sleep until a memory location changes
On Pentium 4: use the PAUSE instruction
Multithreaded Software
Multithreaded Architecture
Advantageous cost/throughput …
Blocked MT – long-latency tolerance
Good single-thread performance, good throughput
Needs fast thread switch and a short pipe
Interleaved MT – ILP increase
Poor single-thread performance, good throughput
Needs many threads; several loaded contexts…
Simultaneous MT – ILP increase
OK MT performance
Good single-thread performance
Good utilization…
Needs fewer threads
How to use 2x transistors? Better core+SMT or CMP?
[Diagram: baseline Core A + 1MB L2; SMT option: one wider SMT core + 2MB L2; CMP option: Core A + Core B + 2MB L2]
SMT
Better single-thread performance
Better resource utilization
CMP
Much easier to implement
Avoids wire-delay problems
Overall better throughput per area/power
More deterministic performance
BUT! the program must be multithreaded
Rough numbers when doubling the number of transistors:
SMT: ST performance 1.3X / MT performance 1.5–1.7X
CMP: ST performance 1X / MT performance 2X
Pentium® 4 Hyper-Threading
Executes two threads simultaneously
Two different applications or
Two threads of same application
CPU maintains architecture state for two processors
Two logical processors per physical processor
Implementation on Intel® Xeon™ Processor
Two logical processors for < 5% additional die area
The processor appears to software as 2 cores in an MP shared-memory system
uArchitecture impact
Replicate some resources
All per-CPU architectural state
Instruction pointers, renaming logic
Some smaller resources e.g. return stack predictor, ITLB, etc.
Partition some resources (share by splitting in half per thread)
Re-Order Buffer
Load/Store Buffers
Queues; etc.
Share most resources
Out-of-Order execution engine
Caches
Hyper-Threading Performance
Performance varies as expected with:
Number of parallel threads
Resource utilization
Less aggressive MT than EV8 → less impressive scalability
Machine optimized for single-thread.
[Bar chart: speedup on IXP-MP systems with Hyper-Threading enabled, relative to Hyper-Threading disabled (1.00). Workloads: Microsoft Active Directory, Microsoft SQL Server, Microsoft Exchange, Microsoft IIS, parallel GCC compile — speedups between 1.18 and 1.30.]
Intel Xeon Processor MP platforms are prototype systems in 2-way configurations
Applications not tuned or optimized for Hyper-Threading Technology
*All trademarks and brands are the property of their respective owners
[Timeline: execution phases alternate with memory accesses of penalty P; one Period = execution time + memory-access time]
[Diagram: core C with multiple per-thread architectural states, connected to memory]
Memory latency is shielded by executing multiple threads
Instructions per memory access = 1/MI
where CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory access penalty (cycles); n = number of threads
Period = (1/MI)·CPI_i + P
CPI_1 = [(1/MI)·CPI_i + P] / (1/MI) = CPI_i + MI·P   (cycles/instruction)
With n threads sharing the core:
Period = (1/MI)·CPI_i + P
CPI_MT = [(1/MI)·CPI_i + P] / (n·(1/MI)) = (CPI_i + MI·P) / n   (cycles/instruction)
IPC_MT = n·IPC_1 = n·(1/CPI_1) = n / (CPI_i + MI·P)
Memory latency is shielded by executing multiple threads
[Graph: performance vs. number of threads (n) — what is the max performance?]
Max performance = 1/CPI_i
[Graph: performance rises linearly with n and saturates at 1/CPI_i, reached at n = N threads]
At saturation, the linear region meets the peak:
1/CPI_i = n / (CPI_i + MI·P)
n = (CPI_i + MI·P) / CPI_i
N = 1 + (MI·P) / CPI_i
where CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory access penalty (cycles); n = number of threads
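This model codes up directly, using the slides' symbols (CPI_i, MI, P, n); the parameter values below are illustrative assumptions:

```python
# The slides' analytical model of blocked multithreading, coded directly.
def ipc_mt(n, cpi_i, mi, p):
    """IPC with n threads; saturates at 1/CPI_i once latency is fully hidden."""
    return min(n / (cpi_i + mi * p), 1.0 / cpi_i)

def saturation_threads(cpi_i, mi, p):
    """N = 1 + (MI * P) / CPI_i: threads needed to reach peak IPC."""
    return 1 + (mi * p) / cpi_i

# Assumed parameters: CPI_i = 1, one memory access per 5 instructions
# (MI = 0.2), 100-cycle memory penalty.
cpi_i, mi, p = 1.0, 0.2, 100.0
print(saturation_threads(cpi_i, mi, p))   # 21.0 threads hide the latency
print(ipc_mt(1, cpi_i, mi, p))            # ~0.048 IPC with a single thread
print(ipc_mt(32, cpi_i, mi, p))           # 1.0 = 1/CPI_i, fully hidden
```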
With a cache, only misses pay the memory penalty:
Period = (1/MI)·CPI_i + MR·P
CPI_1 = [(1/MI)·CPI_i + MR·P] / (1/MI) = CPI_i + MI·MR·P   (cycles/instruction)
where MR = cache miss rate
When we have a cache, the memory penalty depends on the miss rate:
Perf = n / [CPI_i + MI·MR(n)·P]
Three regions: the cache region, the valley, the MT region
[Graph: performance vs. number of threads, showing the cache region, the valley, and the MT region]
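The three regions fall out of Perf(n) = n / (CPI_i + MI·MR(n)·P) once MR grows with the thread count. The miss-rate curve below is a hypothetical stand-in, chosen only to show the shape (all parameters are assumptions):

```python
# Sketch of the "valley". Assumed MR(n): ~0 while the combined working sets
# fit in cache, then rising to 1 as threads crowd each other out.
def miss_rate(n, cache_threads=8):
    """Hypothetical MR(n): 0 up to `cache_threads` threads, then rising to 1."""
    return min(1.0, max(0.0, (n - cache_threads) / cache_threads))

def perf(n, cpi_i=1.0, mi=0.2, p=100.0):
    return n / (cpi_i + mi * miss_rate(n) * p)

# Cache region: perf grows linearly while MR(n) ~ 0.
# The valley: the rising miss rate eats the gains from extra threads.
# MT region: MR has saturated, so adding threads helps again.
for n in (4, 8, 12, 16, 64):
    print(n, round(perf(n), 2))
```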
Multi-Thread Architecture (SoE) conclusions
Applicable when many threads are available
Requires saving thread state
Can reach “high” performance (~1/CPI_i)
Bandwidth to memory is high → high power
Parameter: cache size
Increased cache size → better ability to hide memory latency → favors MC
[Graph: performance for different cache sizes vs. number of threads — no cache, 16M, 32M, 64M, 128M, perfect $; unlimited BW to memory]
Increase in cache size → cache suffices for more in-flight threads
Extends the $ region… AND is also valuable in the MT region
Caches reduce off-chip bandwidth → delays the BW saturation point
[Graph: performance for different cache sizes vs. number of threads, limited BW — no $, 16M, 32M, 64M, 128M, perfect $]
Increase in memory latency → hinders the MT region
Emphasizes the importance of caches
[Graph: performance for different memory latencies vs. number of threads — zero latency, 50, 100, 300, 1000 cycles; unlimited BW to memory]