Uri Weiser and Yoav Etsion
Computer Architecture 2015 - Multithreading
Basics
Process
Each process has its unique address space
Can consist of several threads
Thread – each thread has its unique execution context
Thread has its own PC (sequencer) + registers + stack
All threads (within a process) share same address space
Private heap is optional
Multithreaded apps: process broken into threads
#1 example: transactions (databases, web servers)
Increased concurrency
Partial blocking (your transaction blocks, mine doesn’t have to)
Centralized (smarter) resource management (by process)
Example – 5 stages, 4 wide machine
[Pipeline diagram: four parallel instances of a 5-stage pipeline — F, D1, D2, EX/RET, WB]
[Timeline diagram]
Generally increases throughput, but decreases utilization
Multiple threads/tasks run on a multi-processor/multi-core
[Timeline diagram]
Better utilization, lower per-thread throughput, higher overall throughput
A thread can be viewed as a stream of instructions
State is represented by PC, Stack pointer, GP Registers
Equip multithreaded processors with multiple hardware contexts (threads)
Multiple register files, PCs, SPs
OS views CPU as multiple logical processors
Execution of instructions from different threads is interleaved in the pipeline
Interleaving policy is critical…
Multi-threading/multi-tasking on a single core
[Timeline diagram]
How is this different from OS scheduling?
May increase utilization and throughput, but must switch when the current thread enters a low-utilization/throughput section (e.g. an L2 cache miss)
Multi-threading/multi-tasking on a single core
[Timeline diagram]
Increases utilization/throughput by reducing the impact of dependences
Multi-threading/multi-tasking on a single core
[Timeline diagram]
Increases utilization/throughput
Blocked Multithreading (SoE-MT – Switch-on-Event MT, aka “Poor Man’s MT”)
Critical decision: when to switch threads
When current thread’s utilization/throughput is about to drop
(e.g. L2 cache miss)
Requirements for throughput:
(Thread switch) + (pipe fill time) << blocking latency
Would like to get some work done before other thread comes back
Fast thread-switch: multiple register banks
Fast pipe-fill: short pipe
Advantage: small changes to existing hardware
Drawback: high single thread performance requires long thread switch
Examples
Macro-dataflow machine
MIT Alewife
IBM Northstar
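The throughput requirement above — (thread switch) + (pipe fill) << blocking latency — can be checked with a tiny model. All numbers below are illustrative assumptions, not from the slides:

```python
# Illustrative sketch: does blocked (switch-on-event) MT pay off for given
# switch cost, pipe depth, and stall length? All parameters are assumptions.
def soe_mt_utilization(switch_cycles, pipe_fill_cycles, blocking_latency):
    """Fraction of a blocking period in which the other thread does useful work."""
    overhead = switch_cycles + pipe_fill_cycles
    if overhead >= blocking_latency:
        return 0.0  # switching costs more than the stall it hides
    return (blocking_latency - overhead) / blocking_latency

# Fast switch (multiple register banks) + short pipe vs. a 100-cycle L2 miss:
u = soe_mt_utilization(switch_cycles=2, pipe_fill_cycles=7, blocking_latency=100)
print(f"useful-work fraction: {u:.2f}")  # 0.91 -> the << condition holds
```

With a slow switch or a deep pipe the fraction collapses toward zero, which is why the slide calls out register banks and short pipes as requirements.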
Interleaved (Fine grained)
Multithreading
Critical decision: none?
Requirements for throughput:
Enough threads to hide data-access latencies and dependences
Increasing number of threads reduces single-thread performance
Implementation options: fixed vs. flexible
[Diagram: superscalar pipeline vs. multithreaded pipeline. Superscalar: instructions I1, I2 of one thread enter F D E M W in consecutive cycles. Multithreaded: instruction 1 of thread 1, instruction 1 of thread 2, and instruction 2 of thread 1 are interleaved cycle by cycle.]
Interleaved (Fine grained)
Multithreading (cont.)
Advantages:
(w/ flexible interleave:) Reasonable single thread performance
High processor utilization (esp. with many threads)
Drawback:
Complicated hardware
Requires highly parallel software for efficient execution
Massive cache pressure
(w/ fixed interleave:) limits single thread performance
Examples:
HEP Denelcor: 8 threads (latencies were shorter then)
TERA: 128 threads
MicroUnity: 5 threads on a 1 GHz pipeline → each thread sees 200 MHz-like latency
GPUs employ a variant of fine-grain interleaving
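The fixed-interleave drawback above can be made concrete with a small sketch (assumed model: single-issue pipeline, strict round-robin over T hardware threads — this is why fixed interleave limits single-thread performance):

```python
# Minimal sketch (assumed model): fixed round-robin interleaving over T
# hardware threads on a single-issue pipeline. Each cycle issues from thread
# (cycle mod T), so a lone thread effectively runs T times slower.
def cycles_to_finish(instructions, thread_id, num_threads):
    """Cycles until `thread_id` retires `instructions` under fixed interleave."""
    # Thread t issues on cycles t, t+T, t+2T, ...; its k-th instruction
    # (1-based) issues on cycle t + (k-1)*T, finishing one cycle later.
    return thread_id + (instructions - 1) * num_threads + 1

# One thread alone on a 4-thread fixed-interleave pipe:
print(cycles_to_finish(100, thread_id=0, num_threads=4))  # 397: ~4x slowdown
print(cycles_to_finish(100, thread_id=0, num_threads=1))  # 100: no slowdown
```

A flexible interleave would give idle slots back to the running thread, recovering most of the single-thread performance.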
GPU example:
[Timeline diagram: fine-grained warp interleaving]
• 32 threads per group (warp); a GPGPU core alternates between 48 warps
• Total number of threads running on a core: 32 * 48 = 1536
• Huge register file (around 128KB)
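A quick arithmetic check of these numbers; the per-thread register count is derived from the stated totals (assuming 32-bit registers), not given in the slides:

```python
# Back-of-envelope check of the GPU figures above.
threads_per_warp = 32
warps_per_core = 48
threads_per_core = threads_per_warp * warps_per_core
print(threads_per_core)  # 1536

regfile_bytes = 128 * 1024          # "around 128KB"
bytes_per_reg = 4                   # assumption: 32-bit registers
regs_per_thread = regfile_bytes // (threads_per_core * bytes_per_reg)
print(regs_per_thread)  # 21 -> only ~21 registers available per thread
```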
Simultaneous
Multi-threading (SMT)
Critical decision: fetch-interleaving policy
Requirements for throughput:
Enough threads to utilize resources
Fewer threads than fine-grained MT are needed to stretch dependences
Limit: too many threads reduce cache effectiveness
Examples:
Compaq Alpha EV8 (cancelled)
Intel Hyper-Threading Technology
Decoupled caches = private caches
+ Guarantees each thread’s cache space
– Requires coherence protocols across caches
– Cache allocation is rigid
Shared caches
+ Faster communication among threads
+ No coherence overhead
+ Flexibility in allocating resources
– Threads can step on each other’s toes…
[Diagram: one set of an 8-way set-associative shared cache — Tag/Data pairs for Way 0, Way 1, Way 2, …, Way 7]
SMT Case Study: EV8
8-issue OOO processor (wide machine)
SMT Support
Multiple sequencers (PC): 4
Alternate fetch from multiple threads
Separate register renaming for each thread
More (4x) physical registers
Thread tags on all shared resources
ROB, LSQ, BTB, etc.
Allow per thread flush/trap/retirement
Process tags on all address-space resources: caches, TLBs, etc.
Notice: none of these things are in the “core”
Instruction queues, scheduler, wakeup are SMT-blind
[EV8 pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire. Per-thread: PCs (fetch) and register maps (rename); thread-blind: the queues, registers, Icache, Dcache, and execution stages.]
[Same EV8 pipeline diagram, revisited: the stages marked “thread-blind”]
Thread-blind? Not really.
[Bar charts: EV8 simulated performance vs. number of threads, normalized (%):
– Multiprogrammed workload (1T–4T): SpecInt, SpecFP, mixed Int/FP (y-axis up to 250%)
– Decomposed SPEC95 applications (1T–4T): Turb3d, Swm256, Tomcatv (y-axis up to 250%)
– Multithreaded applications (1T, 2T, 4T): Barnes, Chess, Sort, TP (y-axis up to 300%)]
SMT works well when threads do not “step” on each other
“Predicted state”: $, branch prediction, data prefetch…
If ample resources are available
A slow, contended resource can hamper performance
Scheduling in SMT
What if one thread gets “stuck” (e.g. waiting on a cache miss)?
Round-robin: the stuck thread eventually fills up the machine (resource contention)
ICOUNT: the thread with the fewest instructions in the pipe has priority
A stuck thread doesn’t get to fetch until it gets “unstuck”
Other policies have been devised to get the best balance, fairness, and overall performance
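The ICOUNT idea fits in a few lines; this is an illustrative sketch of the policy, not the EV8 implementation:

```python
# Sketch of the ICOUNT fetch policy: each cycle, fetch for the thread with
# the fewest instructions in the pre-execute pipeline stages. A stalled
# thread accumulates instructions in the queues, loses fetch priority, and
# stops hogging the machine until it drains.
def icount_pick(in_flight):
    """in_flight: dict of thread_id -> instructions in the front end/queues."""
    return min(in_flight, key=in_flight.get)

in_flight = {0: 12, 1: 3, 2: 7}   # thread 0 is stuck behind a cache miss
print(icount_pick(in_flight))     # 1 -> the least-represented thread fetches
```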
Scheduling in SMT
Variation: what if one thread is spinning?
Not really stuck, so it keeps fetching
Have to stop/slow it artificially
QUIESCE (EV8): sleep until a memory location changes
On Pentium 4: use the PAUSE instruction
Multithreaded Software
Multithreaded Architecture
Advantageous cost/throughput …
Blocked MT – long-latency tolerance
Good single-thread performance, good throughput
Needs fast thread switch and a short pipe
Interleaved MT – ILP increase
Poor single-thread performance, good throughput
Needs many threads; several loaded contexts…
Simultaneous MT – ILP increase
OK MT performance
Good single-thread performance
Good utilization…
Needs fewer threads
How to use 2x transistors? Better core+SMT or CMP?
[Diagram: baseline Core A + 1MB L2; SMT option: one wider SMT core + 2MB L2; CMP option: Core A + Core B + 2MB L2]
SMT
Better single-thread performance
Better resource utilization
CMP
Much easier to implement
Avoids wire-delay problems
Overall better throughput per area/power
More deterministic performance
BUT! the program must be multithreaded
Rough numbers when doubling the number of transistors:
SMT: ST performance 1.3X / MT performance 1.5–1.7X
CMP: ST performance 1X / MT performance 2X
Pentium® 4 Hyper-Threading
Executes two threads simultaneously
Two different applications or
Two threads of same application
CPU maintains architecture state for two processors
Two logical processors per physical processor
Implementation on Intel® Xeon™ Processor
Two logical processors for < 5% additional die area
The processor appears to software as 2 cores in an MP shared-memory system
uArchitecture impact
Replicate some resources
All per-CPU architectural state
Instruction pointers, renaming logic
Some smaller resources e.g. return stack predictor, ITLB, etc.
Partition some resources (share by splitting in half per thread)
Re-Order Buffer
Load/Store Buffers
Queues; etc.
Share most resources
Out-of-Order execution engine
Caches
Hyper-Threading Performance
Performance varies as expected with:
Number of parallel threads
Resource utilization
Less aggressive MT than EV8 → less impressive scalability
Machine optimized for single-thread.
[Bar chart: speedup on IXP-MP systems with Hyper-Threading enabled, relative to Hyper-Threading disabled (1.00). Workloads: Microsoft Active Directory, Microsoft SQL Server, Microsoft Exchange, Microsoft IIS, parallel GCC compile — speedups between 1.18 and 1.30.]
Intel Xeon Processor MP platforms are prototype systems in 2-way configurations
Applications not tuned or optimized for Hyper-Threading Technology
*All trademarks and brands are the property of their respective owners
[Timeline: execution phases alternate with memory accesses of penalty P; one Period = execution time + memory-access time]
[Diagram: core C with multiple per-thread architectural states, connected to memory]
Memory latency is shielded by executing multiple threads
Instructions per memory access = 1/MI
where CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory access penalty (cycles); n = number of threads
Period = (1/MI)·CPI_i + P
CPI_1 = [(1/MI)·CPI_i + P] / (1/MI) = CPI_i + MI·P   (cycles/instruction)
With n threads sharing the core:
Period = (1/MI)·CPI_i + P
CPI_MT = [(1/MI)·CPI_i + P] / (n·(1/MI)) = (CPI_i + MI·P) / n   (cycles/instruction)
IPC_MT = n·IPC_1 = n·(1/CPI_1) = n / (CPI_i + MI·P)
Memory latency is shielded by executing multiple threads
[Graph: performance vs. number of threads (n) — what is the max performance?]
Max performance = 1/CPI_i
[Graph: performance rises linearly with n and saturates at 1/CPI_i, reached at n = N threads]
At saturation, the linear region meets the peak:
1/CPI_i = n / (CPI_i + MI·P)
n = (CPI_i + MI·P) / CPI_i
N = 1 + (MI·P) / CPI_i
where CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory access penalty (cycles); n = number of threads
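This model codes up directly, using the slides' symbols (CPI_i, MI, P, n); the parameter values below are illustrative assumptions:

```python
# The slides' analytical model of blocked multithreading, coded directly.
def ipc_mt(n, cpi_i, mi, p):
    """IPC with n threads; saturates at 1/CPI_i once latency is fully hidden."""
    return min(n / (cpi_i + mi * p), 1.0 / cpi_i)

def saturation_threads(cpi_i, mi, p):
    """N = 1 + (MI * P) / CPI_i: threads needed to reach peak IPC."""
    return 1 + (mi * p) / cpi_i

# Assumed parameters: CPI_i = 1, one memory access per 5 instructions
# (MI = 0.2), 100-cycle memory penalty.
cpi_i, mi, p = 1.0, 0.2, 100.0
print(saturation_threads(cpi_i, mi, p))   # 21.0 threads hide the latency
print(ipc_mt(1, cpi_i, mi, p))            # ~0.048 IPC with a single thread
print(ipc_mt(32, cpi_i, mi, p))           # 1.0 = 1/CPI_i, fully hidden
```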
With a cache, only misses pay the memory penalty:
Period = (1/MI)·CPI_i + MR·P
CPI_1 = [(1/MI)·CPI_i + MR·P] / (1/MI) = CPI_i + MI·MR·P   (cycles/instruction)
where MR = cache miss rate
When we have a cache, the memory penalty depends on the miss rate:
Perf = n / [CPI_i + MI·MR(n)·P]
Three regions: the cache region, the valley, the MT region
[Graph: performance vs. number of threads, showing the cache region, the valley, and the MT region]
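The three regions fall out of Perf(n) = n / (CPI_i + MI·MR(n)·P) once MR grows with the thread count. The miss-rate curve below is a hypothetical stand-in, chosen only to show the shape (all parameters are assumptions):

```python
# Sketch of the "valley". Assumed MR(n): ~0 while the combined working sets
# fit in cache, then rising to 1 as threads crowd each other out.
def miss_rate(n, cache_threads=8):
    """Hypothetical MR(n): 0 up to `cache_threads` threads, then rising to 1."""
    return min(1.0, max(0.0, (n - cache_threads) / cache_threads))

def perf(n, cpi_i=1.0, mi=0.2, p=100.0):
    return n / (cpi_i + mi * miss_rate(n) * p)

# Cache region: perf grows linearly while MR(n) ~ 0.
# The valley: the rising miss rate eats the gains from extra threads.
# MT region: MR has saturated, so adding threads helps again.
for n in (4, 8, 12, 16, 64):
    print(n, round(perf(n), 2))
```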
Multi-Thread Architecture (SoE) conclusions
Applicable when many threads are available
Requires saving thread state
Can reach “high” performance (~1/CPI_i)
Bandwidth to memory is high → high power
Parameter: cache size
Increased cache size → better ability to hide memory latency → favors MC
[Graph: performance for different cache sizes vs. number of threads — no cache, 16M, 32M, 64M, 128M, perfect $; unlimited BW to memory]
Increase in cache size → cache suffices for more in-flight threads
Extends the $ region… AND is also valuable in the MT region
Caches reduce off-chip bandwidth → delays the BW saturation point
[Graph: performance for different cache sizes vs. number of threads, limited BW — no $, 16M, 32M, 64M, 128M, perfect $]
Increase in memory latency → hinders the MT region
Emphasizes the importance of caches
[Graph: performance for different memory latencies vs. number of threads — zero latency, 50, 100, 300, 1000 cycles; unlimited BW to memory]