Chip Multiprocessing and SMT

advertisement
Computer Architecture
Fall 2006
Lecture 30. CMPs & SMTs
Adapted from Mary Jane Irwin
( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
Lecture 30
Fall
Review: Multiprocessor Basics

Q1 – How do they share data?

Q2 – How do they coordinate?

Q3 – How scalable is the architecture? How many
processors?
# of Proc
Communication Message passing 8 to 2048
model
Shared NUMA 8 to 256
address UMA
2 to 64
Physical
connection
Lecture 30
Network
8 to 256
Bus
2 to 36
Fall
CMP: Multiprocessors On One Chip

By placing multiple processors, their memories and the
IN all on one chip, the latencies of chip-to-chip
communication are drastically reduced

ARM multi-chip core
Configurable #
of hardware intr
Private IRQ
Interrupt Distributor
Per-CPU
aliased
peripherals
Configurable
between 1 & 4
symmetric
CPUs
Private
peripheral
bus
CPU
CPU
CPU
CPU
Interface
Interface
Interface
Interface
CPU
L1$s
CPU
L1$s
CPU
L1$s
CPU
L1$s
Snoop Control Unit
Primary AXI R/W 64-b bus
Lecture 30
I & D CCB
64-b bus
Optional AXI R/W 64-b bus
Fall
Multithreading on A Chip

Find a way to “hide” true data dependency stalls, cache
miss stalls, and branch stalls by finding instructions (from
other process threads) that are independent of those
stalling instructions

Multithreading – increase the utilization of resources on a
chip by allowing multiple processes (threads) to share the
functional units of a single processor
Lecture 30

Processor must duplicate the state hardware for each thread – a
separate register file, PC, instruction buffer, and store buffer for
each thread

The caches, TLBs, BHT, BTB can be shared (although the miss
rates may increase if they are not sized accordingly)

The memory can be shared through virtual memory mechanisms

Hardware must support efficient thread context switching
Fall
Types of Multithreading on a Chip


Fine-grain – switch threads on every instruction issue

Round-robin thread interleaving (skipping stalled threads)

Processor must be able to switch threads on every clock cycle

Advantage – can hide throughput losses that come from both
short and long stalls

Disadvantage – slows down the execution of an individual thread
since a thread that is ready to execute without stalls is delayed
by instructions from other threads
Coarse-grain – switches threads only on costly stalls
(e.g., L2 cache misses)

Advantages – thread switching doesn’t have to be essentially
free and much less likely to slow down the execution of an
individual thread

Disadvantage – limited, due to pipeline start-up costs, in its
ability to overcome throughput loss
- Pipeline must be flushed and refilled on thread switches
Lecture 30
Fall
Multithreaded Example: Sun’s Niagara (UltraSparc T1)
1.2 GHz
1.0 GHz
Cache
(I/D/L2)
32K/64K/
(8M external)
16K/8K/3M
Issue rate
4 issue
1 issue
Pipe stages
14 stages
6 stages
BHT entries
16K x 2-b
None
TLB entries
128I/512D
64I/64D
Memory BW
2.4 GB/s
~20GB/s
Transistors
29 million
200 million
Power (max) 53 W
Lecture 30
<60 W
4-way MT SPARC pipe
Clock rate
4-way MT SPARC pipe
64-b
4-way MT SPARC pipe
64-b
4-way MT SPARC pipe
Data width
4-way MT SPARC pipe
Niagara
4-way MT SPARC pipe
Ultra III
4-way MT SPARC pipe
Eight fine grain multithreaded single-issue, in-order cores
(no speculation, no dynamic branch prediction)
4-way MT SPARC pipe

Crossbar
I/O
shared
funct’s
4-way banked L2$
Memory controllers
Fall
Niagara Integer Pipeline

Cores are simple (single-issue, 6 stage, no branch
prediction), small, and power-efficient
Fetch
Thrd Sel
Decode
RegFile
x4
I$
Inst
bufx4
ITLB
Thrd
Sel
Mux
Thrd
Sel
Mux
Decode
Thread
Select
Logic
Execute
ALU
Mul
Shft
Div
Memory
D$
DTLB
Stbufx4
WB
Crossbar
Interface
Instr type
Cache misses
Traps & interrupts
Resource conflicts
PC
logicx4
From MPR, Vol. 18, #9, Sept. 2004
Lecture 30
Fall
Simultaneous Multithreading (SMT)

A variation on multithreading that uses the resources of a
multiple-issue, dynamically scheduled processor
(superscalar) to exploit both program ILP and threadlevel parallelism (TLP)

Most SS processors have more machine level parallelism than
most programs can effectively use (i.e., than have ILP)

With register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without
regard to dependencies among them
- Need separate rename tables (ROBs) for each thread
- Need the capability to commit from multiple threads (i.e., from
multiple ROBs) in one cycle

Intel’s Pentium 4 SMT called hyperthreading

Lecture 30
Supports just two threads (doubles the architecture state)
Fall
Threading on a 4-way SS Processor Example
Coarse MT
Fine MT
SMT
Issue slots →
Thread A
Thread B
Time →
Thread C Thread D
Lecture 30
Fall
Multicore Xbox360 – “Xenon” processor


Lecture 30
To provide game developers with a balanced and
powerful platform

Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache

165M transistors total

3.2 Ghz Near-POWER ISA

2-issue, 21 stage pipeline, with 128 128-bit registers

Weak branch prediction – supported by software hinting

In order instructions

Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything
else
An ATI-designed 500MZ GPU w/ 512MB of DDR3DRAM

337M transistors, 10MB framebuffer

48 pixel shader cores, each with 4 ALUs
Fall
Xenon Diagram
Core 1
Core 2
L1D L1I
L1D L1I
L1D L1I
XMA Dec
Core 0
1MB UL2
MC1
MC0
512MB
DRAM
BIU/IO Intf
Lecture 30
SMC
GPU
DVD
HDD Port
Front USBs (2)
Wireless
MU ports (2 USBs)
Rear USB (1)
Ethernet
IR
Audio Out
Flash
Systems Control
3D Core
10MB
EDRAM
Video
Out
Analog
Chip
Video Out
Fall
The PS3 “Cell” Processor Architecture

Composed of a Non-SMP Architecture

234M transistors @ 4Ghz

1 Power Processing Element, 8 “Synergistic” (SIMD) PE’s

512KB L2 $ - Massively high bandwidth (200GB/s) bus connects
it to everything else

The PPE is strangely similar to one of the Xenon cores
- Almost identical, really. Slight ISA differences, and fine-grained MT
instead of real SMT

The real differences lie in the SPEs (21M transistors each)
- An attempt to ‘fix’ the memory latency problem by giving each
processor complete control over it’s own 256KB “scratchpad” – 14M
transistors
– Direct mapped for low latency
- 4 vector units per SPE, 1 of everything else – 7M trans.
Lecture 30
Fall
How to make use of the SPEs
Lecture 30
Fall
What about the Software?



Lecture 30
Makes use of special IBM “Hypervisor”

Like an OS for OS’s

Runs both a real time OS (for sound) and non-real time (for
things like AI)
Software must be specially coded to run well

The single PPE will be quickly bogged down

Must make use of SPEs wherever possible

This isn’t easy, by any standard
What about Microsoft?

Development suite identifies which 6 threads you’re expected to
run

Four of them are DirectX based, and handled by the OS

Only need to write two threads, functionally
Fall
Next Lecture and Reminders

Next lecture


Review for Final
Reminders

Lecture 30
Final is Tuesday, December 12 from 8-9:50 AM in ITT 322
Fall
Download