Thread scheduling in multithreaded cores

Multithreaded Processors
Dezső Sima
Spring 2007
(Ver. 2.1)
© Dezső Sima, 2007
Overview
• 1. Introduction
• 2. Overview of multithreaded cores
• 3. Thread scheduling
• 4. Case examples
  • 4.1. Coarse grained multithreaded cores
  • 4.2. Fine grained multithreaded cores
  • 4.3. SMT cores
1. Introduction
1. Introduction (1)
Aim of multithreading
To raise performance (beyond superscalar or EPIC execution) by introducing and utilizing finer grained parallelism than multitasking at execution.
Thread
A flow of control (in superscalars: the dynamic sequence of instructions to be executed that is managed as an entity during instruction scheduling for dispatch or issue).
1. Introduction (2)
[Diagram: sequential programming (a single process P1); multitasked programming (processes P1–P3 created with fork()/exec()/CreateProcess()); multithreaded programming (threads T1–T6 created with CreateThread()/fork() and merged with join())]
Figure 1.1: Principle of sequential-, multitasked- and multithreaded programming
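The two creation paths of Figure 1.1 can be illustrated with a short, hypothetical POSIX example (not part of the original slides): fork()/exec() spawns a new process with a private address space, while pthread_create()/pthread_join() realizes the CreateThread()/join() pattern within one process.

#include <stdio.h>
#include <unistd.h>      /* fork(), execlp(), getpid() */
#include <sys/types.h>
#include <sys/wait.h>    /* waitpid() */
#include <pthread.h>     /* pthread_create(), pthread_join() */

/* A thread: runs in the same address space as its creator. */
static void *worker(void *arg) {
    printf("thread running inside process %d\n", (int)getpid());
    return NULL;
}

int main(void) {
    /* Multitasked path: child process with a private address space. */
    pid_t pid = fork();
    if (pid == 0) {
        execlp("echo", "echo", "child process running", (char *)NULL);
        _exit(1);                    /* reached only if exec() fails */
    }
    waitpid(pid, NULL, 0);

    /* Multithreaded path: a thread sharing the creator's address space. */
    pthread_t t1;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_join(t1, NULL);          /* the join() of Figure 1.1 */
    return 0;
}

(Compile e.g. with cc example.c -lpthread.)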
1. Introduction (3)
Main features of multithreading
Threads
• belong to the same process,
• usually share a common address space (else multiple address translation paths (virtual to real) need to be maintained concurrently),
• are executed concurrently (simultaneously (i.e. overlapped by time sharing) or in parallel), depending on the implementation of multithreading.
Main tasks of thread management
• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.
Basic thread states
• thread program state (state of the ISA), including: PC, FX/FP architectural registers, state registers,
• thread microstate (supplementary state of the microarchitecture), including: rename register mappings, branch history, ROB etc.
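In C-like terms, the thread states listed above amount to one record per hardware thread; the sketch below is purely illustrative (field names and sizes are hypothetical, not those of any real core):

#include <stdint.h>

#define NREGS 32

/* Per-thread state a multithreaded core must maintain (sketch). */
typedef struct {
    /* Thread program state (architectural, visible to the ISA) */
    uint64_t pc;                 /* program counter            */
    uint64_t gpr[NREGS];         /* FX architectural registers */
    double   fpr[NREGS];         /* FP architectural registers */
    uint64_t state_regs;         /* state registers            */

    /* Thread microstate (supplementary microarchitectural state) */
    uint8_t  rename_map[NREGS];  /* rename register mappings   */
    uint16_t branch_history;     /* branch predictor history   */
    /* ROB/RAS entries are replicated or tagged per thread     */
} hw_thread_state_t;

/* A 2-way multithreaded core keeps two such contexts on chip, so
   switching threads needs no state traffic to or from memory. */
hw_thread_state_t thread_ctx[2];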
1. Introduction (4)
Implementation of multithreading
(while executing multithreaded apps/OSs)
Software multithreading
• Execution of multithreaded apps/OSs on a single threaded processor simultaneously (i.e. by time sharing).
• Multiple threads are maintained simultaneously by the OS.
• Multithreaded OSs.
• Fast context switching between threads is required.
Hardware multithreading
• Execution of multithreaded apps/OSs on a multithreaded processor concurrently.
• Multiple threads are maintained concurrently by the processor.
• Multithreaded processors.
1. Introduction (5)
Multithreaded processors
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing)
• Multithreaded cores
[Diagram: a multicore chip carries two or more cores, each with L2/L3, in front of L3/memory; a multithreaded-core chip carries a single MT core with its L2/L3 in front of L3/memory]
1. Introduction (6)
Requirement of software multithreading
Maintaining multiple thread program states concurrently by the OS, including: PC, FX/FP architectural registers, state registers.
Core enhancements needed in multithreaded cores
• Maintaining multiple thread program states concurrently by the processor, including: PC, FX/FP architectural registers, state registers.
• Maintaining multiple thread microstates, pertaining to: rename register mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as: the instruction buffer, the store queue and, in case of merged architectural and rename registers, appropriately large register file sizes (FX/FP), etc.
Options to provide multiple states
• Implementing individual per-thread structures, like 2 or 4 sets of FX registers,
• Implementing tagged structures, like a tagged ROB, a tagged instruction buffer, etc.
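The two options can be sketched as follows (hypothetical C, illustrative field names only):

#include <stdint.h>

#define NTHREADS 4

/* Option 1: replicated per-thread structures,
   e.g. 4 separate sets of 32 FX registers. */
uint64_t fx_regs[NTHREADS][32];

/* Option 2: a shared, tagged structure, e.g. a tagged ROB:
   entries of all threads are interleaved and distinguished
   by a thread-ID tag. */
typedef struct {
    uint8_t  thread_id;    /* the tag */
    uint8_t  completed;
    uint32_t dest_reg;
    uint64_t result;
} rob_entry_t;

rob_entry_t rob[128];      /* one ROB shared by all threads */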
1. Introduction (7)
                                  Multicore processors    Multithreaded cores
Additional complexity             ~ (60 – 80) %           ~ (2 – 10) %
Additional gain
(in gen. purp. apps)              ~ (60 – 80) %           ~ (0 – 30) %
1. Introduction (8)
Multithreaded OSs
• Windows NT
• OS/2
• Unix w/Posix
• most OSs developed from the '90s on
Introduction (9)
Sequential programs
• Description: a single process on a single processor.
• Key features: sequential bottleneck.
• Key issues: no issues with parallel programs.

Multitasked programs
• Description: multiple processes on a single processor using time sharing.
• Key features: multiple programs with quasi-parallel execution; private address spaces.
• Key issues: solutions for fast context switching.

Multithreaded programs – software multithreading
• Description: multithreaded software on a single threaded processor using time sharing.
• Key features: multiple programs with quasi-parallel execution; threads share the process address space; thread context switches needed.
• Key issues: thread state management and context switching.

Multithreaded programs – hardware multithreading on a multithreaded core
• Description: multithreaded software on a multithreaded core.
• Key features: simultaneous execution of threads; threads share the address space; no thread context switches needed (except coarse grained MT).
• Key issues: thread scheduling.

Multithreaded programs – hardware multithreading on a multicore processor
• Description: multithreaded software on a multicore processor.
• Key features: true parallel execution of threads; threads share the address space; no thread context switches needed.
• Key issues: intra-core communication.

Figure 1.2: Contrasting sequential-, multitasked- and multithreaded execution (1)
Introduction (10)
Sequential programs
• OS support: legacy OS support.
• Software development: no API level support.
• Performance level: low.

Multitasked programs
• OS support: traditional Unix.
• Software development: process life cycle management API.
• Performance level: low-medium.

Multithreaded programs – software multithreading
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API, explicit threading API, OpenMP.
• Performance level: high.

Multithreaded programs – hardware multithreading on a multithreaded core
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API, explicit threading API, OpenMP.
• Performance level: higher.

Multithreaded programs – hardware multithreading on a multicore processor
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API, explicit threading API, OpenMP.
• Performance level: highest.

Figure 1.3: Contrasting sequential-, multitasked- and multithreaded execution (2)
2. Overview of multithreaded cores
2. Overview of multithreaded cores (1)
SCMT:
• 11/02: Pentium 4 (Northwood B), 130 nm/146 mm², 55 mtrs./82 W, 2-way MT
• 02/04: Pentium 4 (Prescott), 90 nm/112 mm², 125 mtrs./103 W, 2-way MT
DCMT:
• 5/05: Pentium EE 840 (Smithfield), 90 nm/2*103 mm², 230 mtrs./130 W, 2-way MT/core
• 1/06: Pentium EE 955/965 (Presler), 65 nm/2*81 mm², 2*188 mtrs./130 W, 2-way MT/core

Figure 2.1: Intel's multithreaded desktop families
2. Overview of multithreaded cores (2)
SCMT:
• 2/02: Pentium 4 (Prestonia-A), 130 nm/146 mm², 55 mtrs./55 W, 2-way MT
• 11/03: Pentium 4 (Irwindale-A), 130 nm/135 mm², 169 mtrs./110 W, 2-way MT
• 6/04: Pentium 4 (Nocona), 90 nm/112 mm², 125 mtrs./103 W, 2-way MT
DCMT:
• 10/05: Xeon DP 2.8 (Paxville DP), 90 nm/2*135 mm², 2*169 mtrs./135 W, 2-way MT/core
• 6/06: Xeon 5000 (Dempsey), 65 nm/2*81 mm², 2*188 mtrs./95/130 W, 2-way MT/core

Figure 2.2: Intel's multithreaded Xeon DP families
2. Overview of multithreaded cores (3)
SCMT:
• 3/02: Pentium 4 (Foster-MP), 180 nm/n/a, 108 mtrs./64 W, 2-way MT
• 3/04: Pentium 4 (Gallatin), 130 nm/310 mm², 178/286 mtrs./77 W, 2-way MT
• 3/05: Pentium 4 (Potomac), 90 nm/339 mm², 675 mtrs./95/129 W, 2-way MT
DCMT:
• 11/05: Xeon 7000 (Paxville MP), 90 nm/2*135 mm², 2*169 mtrs./95/150 W, 2-way MT/core
• 8/06: Xeon 7100 (Tulsa), 65 nm/435 mm², 1328 mtrs./95/150 W, 2-way MT/core

Figure 2.3: Intel's multithreaded Xeon MP families
2. Overview of multithreaded cores (4)
DCMT:
• 7/06: Itanium 9x00 (Montecito), 90 nm/596 mm², 1720 mtrs./104 W, 2-way MT/core

Figure 2.4: Intel's multithreaded EPIC based server family
2. Overview of multithreaded cores (5)
SCMT:
• 2000: RS64 IV (SStar), 180 nm/n/a, 44 mtrs./n/a, 2-way MT
• 2006: Cell BE PPE, 90 nm/221* mm², 234* mtrs./95* W, 2-way MT (*: entire processor)
DCMT:
• 5/04: POWER5, 130 nm/389 mm², 276 mtrs./80 W (est.), 2-way MT/core
• 10/05: POWER5+, 90 nm/230 mm², 276 mtrs./70 W, 2-way MT/core
• 2007: POWER6, 65 nm/341 mm², 750 mtrs./~100 W, 2-way MT/core

Figure 2.5: IBM's multithreaded server families
2. Overview of multithreaded cores (6)
8CMT:
• 11/2005: UltraSPARC T1 (Niagara), 90 nm/379 mm², 279 mtrs./63 W, 4-way MT/core
• 2007: UltraSPARC T2 (Niagara II), 65 nm/342 mm², 72 W (est.), 8-way MT/core
QCMT:
• 2008: APL SPARC64 VII (Jupiter), 65 nm/464 mm², ~120 W, 2-way MT/core
DCMT:
• 2007: APL SPARC64 VI (Olympus), 90 nm/421 mm², 540 mtrs./120 W, 2-way MT/core

Figure 2.6: Sun's and Fujitsu's multithreaded server families
2. Overview of multithreaded cores (7)
8CMT:
• 5/05: XLR 5xx, 90 nm/~220 mm², 333 mtrs./10-50 W, 4-way MT/core

Figure 2.7: RMI's multithreaded XLR family (scalar RISC)
2. Overview of multithreaded cores (8)
SCMT:
• scheduled for 2003: Alpha 21464 (V8), 130 nm/n/a, 250 mtrs./10-50 W, 4-way MT, cancelled 6/2001

Figure 2.8: DEC's/Compaq's multithreaded processor
2. Overview of multithreaded cores (9)
Underlying core(s)

Scalar core(s):
• SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
• RMI XLR 5xx (2005), 8 cores/4T
Superscalar core(s):
• IBM RS64 IV (2000) (SStar), single-core/2T
• Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-)
• DEC 21464 (2003), single-core/4T
• IBM POWER5 (2005), dual-core/2T
• PPE of Cell BE (2006), single-core/2T
• Fujitsu SPARC64 VI/VII, dual-core/quad-core/2T
VLIW core(s):
• SUN MAJC 5200 (2000), quad-core/4T (dedicated use)
• Intel Montecito (2006), dual-core/2T
3. Thread scheduling
3. Thread scheduling (1)
Thread scheduling in software multithreading on a traditional superscalar processor

[Diagram: dispatch slots over clock cycles; Thread1 executes, then a context switch, then Thread2]

The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed next).
Figure 3.1: Thread scheduling assuming software multithreading
on a 4-way superscalar processor
3. Thread scheduling (2)
Thread scheduling in multicore processors (CMPs)

[Diagram: dispatch slots over clock cycles; Thread1 and Thread2 run on separate cores]

Both superscalar cores execute different threads independently.
Figure 3.2: Thread scheduling in a dual core processor
3. Thread scheduling (3)
Thread scheduling in multithreaded cores
Coarse grained MT

3. Thread scheduling (4)

[Diagram: dispatch/issue slots over clock cycles; Thread1 executes, a rapid context switch, then Thread2]

Threads are switched by means of rapid, HW-supported context switches.
Figure 3.3: Thread scheduling in a 4-way coarse grained multithreaded processor
3. Thread scheduling (5)
Coarse grained MT

• Scalar based: –
• Superscalar based: IBM RS64 IV (2000) (SStar), single-core/2T
• VLIW based: SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006), dual-core/2T
3. Thread scheduling (6)
Thread scheduling in multithreaded cores
Coarse grained MT
Fine grained MT

3. Thread scheduling (7)

[Diagram: dispatch/issue slots over clock cycles; in each cycle all slots are filled from one of Thread1–Thread4]

The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in that cycle.
Figure 3.4: Thread scheduling in a 4-way fine grained multithreaded processor
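A cycle-level sketch of this selection step (illustrative C with a round robin policy; not any vendor's actual logic):

#define NTHREADS 4

static int ready[NTHREADS];     /* thread has issueable instructions  */
static int last = NTHREADS - 1; /* thread selected in the prev. cycle */

/* Called once per clock cycle; returns the selected thread or -1. */
int select_thread(void) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last + i) % NTHREADS;
        if (ready[t]) {            /* skip unavailable threads        */
            last = t;
            return t;              /* all slots this cycle go to t    */
        }
    }
    return -1;                     /* no thread ready: pipeline bubble */
}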
3. Thread scheduling (8)
Fine grained MT

• Round robin selection policy
  – Scalar based: SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
  – Superscalar based: –
  – VLIW based: –
• Priority based selection policy
  – Scalar based: –
  – Superscalar based: PPE of Cell BE (2006), single-core/2T
  – VLIW based: –
3. Thread scheduling (9)
Thread scheduling in multithreaded cores
Coarse grained MT
Fine grained MT
Simultaneous MT (SMT)

3. Thread scheduling (10)

[Diagram: dispatch/issue slots over clock cycles; in each cycle the slots are filled with instructions from several of Thread1–Thread4]

Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).

Figure 3.5: Thread scheduling in a 4-way simultaneous multithreaded processor
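In contrast to fine grained MT, an SMT core may fill the slots of a single cycle from several threads. A sketch (illustrative C, assuming a simple priority based selection policy):

#define NTHREADS 4
#define SLOTS    4

static int prio[NTHREADS];         /* higher value = higher priority */
static int ready_instr[NTHREADS];  /* issueable instrs. per thread   */

/* Fill the dispatch/issue slots of ONE cycle; unlike fine grained MT,
   the slots may be filled from several threads. */
int fill_slots(int slot_thread[SLOTS]) {
    int filled = 0;
    while (filled < SLOTS) {
        int best = -1;                       /* highest priority thread  */
        for (int t = 0; t < NTHREADS; t++)   /* with instructions left   */
            if (ready_instr[t] > 0 && (best < 0 || prio[t] > prio[best]))
                best = t;
        if (best < 0) break;                 /* no available instructions */
        slot_thread[filled++] = best;
        ready_instr[best]--;
    }
    return filled;
}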
3. Thread scheduling (11)
SMT cores

• Scalar based: –
• Superscalar based: Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-); DEC 21464 (2003), single-core/4T (canceled in 2001); IBM POWER5 (2005), dual-core/2T
• VLIW based: –
4. Case examples
4.1. Coarse grained multithreading
4.2. Fine grained multithreading
4.3. SMT multithreading
4.1 Coarse grained multithreaded processors
4.1.1. IBM RS64 IV
4.1.2. SUN MAJC 5200
4.1.3. Intel Montecito
4.1. Coarse grained multithreaded processors
Thread scheduling in multithreaded cores
Coarse grained MT
Fine grained MT
Simultaneous MT (SMT)
4.1.1. IBM RS 64 IV (1)
Microarchitecture
4-way superscalar, dual-threaded.
Used in IBM’s iSeries and pSeries commercial servers.
Optimized for commercial server workloads, such as
on-line transaction processing, Web-serving, ERP (Enterprise Resource Planning).
Characteristics of server workloads:
• large working sets,
• poor locality of references and
• frequently occurring task switches,
resulting in
• high cache miss rates and
• performance strongly limited by memory bandwidth and latency;
hence the
• need for wide instruction and data fetch bandwidth,
• need for large L1 caches,
• use of multithreading to hide memory latency.
4.1.1. IBM RS 64 IV (2)
Main microarchitectural features of the RS64 IV to support commercial workloads:
• 128 KB L1 D$ and 128 KB L1 I$,
• instruction fetch width: 8 instr./cycle,
• dual-threaded core.
4.1.1. IBM RS 64 IV (3)
(Figure labels: IERAT – effective-to-real address translation cache, 2x64 entries; 6XX bus – system bus)
Figure 4.1.1: Microarchitecture of IBM’s RS 64 IV
Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”,
IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
4.1.1. IBM RS 64 IV (4)
Multithreading policy (strongly simplified)
Coarse grained MT with two Ts: a foreground T and a background T.
The foreground T executes until a long latency event, such as a cache miss or an IERAT miss, occurs. Subsequently, a T switch is performed and the background T begins to execute. After the long latency event is serviced, a T switch back to the foreground T occurs.
Both single threaded and multithreaded modes of execution are supported.
Threads can be allocated different priorities by explicit instructions.
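The policy can be condensed into a sketch (illustrative C; the event flags are hypothetical inputs, not the RS64 IV's actual control signals):

enum { FOREGROUND, BACKGROUND };

static int running = FOREGROUND;

/* Called by the core on pipeline events (sketch of the policy). */
void on_event(int long_latency_event, int event_serviced) {
    if (running == FOREGROUND && long_latency_event) {
        /* e.g. cache miss or IERAT miss: rapid HW-supported switch,
           helped by the Thread Switch Buffer */
        running = BACKGROUND;
    } else if (running == BACKGROUND && event_serviced) {
        running = FOREGROUND;   /* event serviced: switch back */
    }
}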
Implementation of multithreading
Dual architectural states are maintained for:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special purpose privileged mode reg.s, such as the MSR (machine state reg.),
• status and control reg.s, such as the T priority.
Each T executes in its own effective address space (an unusual feature of multithreaded cores); units used for address translation therefore need to be duplicated, such as the SRs (segment address reg.s).
A Thread Switch Buffer holds up to 8 instructions from the background T, to shorten context switching by eliminating the latency of the I$.
Additional die area needed for multithreading: ~5 %.
4.1.1. IBM RS 64 IV (5)
Figure 4.1.2: Thread switch on data cache miss in IBM’s RS 64 IV
Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”,
IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
4.1.2. SUN MAJC 5200 (1)
Aim:
Dedicated use, high-end graphics, networking
with wire-speed computational demands.
Microarchitecture:
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs. (e.g. 64 regs.),
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.
Each processor is a 4-wide VLIW and can be 4-way multithreaded.
4.1.2. SUN MAJC 5200 (2)
Figure 4.1.3: General view of SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (3)
Figure 4.1.4: The principle of private, unified register files associated with each FU
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (4)
Threading
Each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun).
Implementation of 4-way multithreading: each T is executed by one of the 4 FUs ("vertical multithreading").
Thread switch
Following a cache miss, the processor saves the T state and begins to process the next T.
Example
Comparison of program execution without and with multithreading on a 4-wide VLIW (worked out in the sketch below).
Considered program:
• it consists of 100 instructions,
• 2.5 instrs./cycle are executed on average,
• a cache miss occurs after every 20 instructions,
• the latency of serving a cache miss is 75 cycles.
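Working out the single threaded case with the figures above (a sketch in C; in the 4-way multithreaded case the 75-cycle miss latency of one thread is overlapped with the execution of the other three threads):

#include <stdio.h>

int main(void) {
    double instrs = 100.0, ipc = 2.5;
    double miss_every = 20.0, miss_latency = 75.0;

    double exec_cycles  = instrs / ipc;           /*  40 cycles */
    double misses       = instrs / miss_every;    /*   5 misses */
    double stall_cycles = misses * miss_latency;  /* 375 cycles */

    /* Single threaded: the VLIW idles during every cache miss. */
    printf("useful: %.0f of %.0f cycles (%.1f %% utilization)\n",
           exec_cycles, exec_cycles + stall_cycles,
           100.0 * exec_cycles / (exec_cycles + stall_cycles));
    return 0;
}

This prints "useful: 40 of 415 cycles (9.6 % utilization)", which is the situation Figure 4.1.5 depicts.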
4.1.2. SUN MAJC 5200 (5)
Figure 4.1.5: Execution for subsequent cache misses in a single threaded processor
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (6)
Figure 4.1.6: Execution for subsequent cache misses in SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.3. Intel Montecito (1)
Aim:
High end servers
Main differences between Itanium 2 and Montecito:
• split L2 caches,
• larger unified L3 cache,
• duplicated architectural states maintained for the FX/FP registers, the branch and predicate registers and the next address register,
• (Foxton technology for power management/frequency boost: planned but not implemented).
Additional support for dual-threading (duplicated microarchitectural states):
• the branch prediction structures provide T tagging,
• per-thread return address stacks,
• per-thread ALATs (Advanced Load Address Tables).
Additional core area needed for multithreading: ~2 %.
4.1.3. Intel Montecito (2)
Figure 4.1.7: Microarchitecture of Intel’s Itanium 2
Source: McNairy, C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55
4.1.3. Intel Montecito (3)
Figure 4.1.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table)
Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4.1.3. Intel Montecito (4)
Thread switches
5 event types cause thread switches, such as L3 cache misses and programmed switch hints.
Total switch penalty: 15 cycles.
Example of thread switching
If the control logic detects that a thread doesn't make progress, a thread switch is initiated.
4.1.3. Intel Montecito (5)
Figure 4.1.9: Thread switch in Intel’s Montecito vs single thread execution
Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4.2 Fine grained multithreaded processors
4.2.1. SUN UltraSPARC T1
4.2.2. PPE of Cell BE
4.2. Fine grained multithreaded processors
Thread scheduling in multithreaded cores
Coarse grained MT
Fine grained MT
Simultaneous MT
(SMT)
4.2.1. SUN UltraSPARC T1 (1)
Aim
Commercial server applications, such as
• web serving,
• transaction processing,
• ERP (Enterprise Resource Planning),
• DSS (Decision Support Systems).
Characteristics of commercial server applications:
• large working sets,
• poor locality of memory references,
• high cache miss rates,
• low prediction accuracy for data dependent branches.
Memory latency strongly limits performance; multithreading is used to hide it.
4.2.1. SUN UltraSPARC T1 (2)
Structure
• 8 scalar cores, 4-way multithreaded each.
• All 32 threads share an L2 cache of 3 MB, built up of 4 banks.
4.2.1. SUN UltraSPARC T1 (3)
Figure 4.2.1: Block diagram of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (4)
Structure (ctd.)
• 4 memory channels with on-chip DDR2 memory controllers.
• The processor runs under Solaris.
4.2.1. SUN UltraSPARC T1 (5)
Figure 4.2.2: SUN’s UltraSPARC T1 chip
Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf
4.2.1. SUN UltraSPARC T1 (6)
Processor Elements (Sparc pipes):
• Scalar FX-units, 6-stage pipeline
• all Processor Elements share a single FP-unit
4.2.1. SUN UltraSPARC T1 (7)
Figure 4.2.3: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (8)
Each thread of a Processor Element has its private:
• PC-logic,
• register file,
• instruction buffer,
• store buffer.
4.2.1. SUN UltraSPARC T1 (9)
Figure 4.2.4: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (10)
No thread switch penalty!
4.2.1. SUN UltraSPARC T1 (11)
Thread switch:
Threads are switched on a per cycle basis.
Selection of threads:
In the thread select pipeline stage
• the thread select multiplexer selects a thread from the set of available threads in each clock cycle and
issues the subsequent instr. of this thread from the instruction buffer into the pipeline for execution, and
• fetches the following instr. of the same thread into the instruction buffer.
4.2.1. SUN UltraSPARC T1 (12)
Figure 4.2.5: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (13)
Thread selection policy: least recently used (LRU).
Threads become unavailable due to:
• long-latency instructions, such as loads, branches, multiplies and divides,
• pipeline stalls caused by cache misses, traps and resource conflicts.
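A sketch of this select stage behaviour (illustrative C; not Sun's actual logic):

#define NTHREADS 4

static int  available[NTHREADS];  /* not blocked by a miss/trap/conflict  */
static long last_used[NTHREADS];  /* cycle in which the thread was chosen */

/* Thread select stage: pick the least recently used available thread. */
int thread_select(long now) {
    int lru = -1;
    for (int t = 0; t < NTHREADS; t++)
        if (available[t] && (lru < 0 || last_used[t] < last_used[lru]))
            lru = t;
    if (lru >= 0)
        last_used[lru] = now;     /* issue + fetch for this thread */
    return lru;                   /* -1: all threads unavailable   */
}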
1. Example:
• all 4 threads are available.
4.2.1. SUN UltraSPARC T1 (14)
Figure 4.2.6: Thread switch in SUN's UltraSPARC T1 when all threads are available
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (15)
2. Example:
• There are only 2 threads available,
• speculative execution of instructions following a load.
(Data referenced by a load instruction arrive in the 3rd cycle after decoding, assuming a cache hit. So, after issuing a load, the thread becomes unavailable for the two subsequent cycles.)
4.2.1. SUN UltraSPARC T1 (16)
Figure 4.2.7: Thread switch in SUN's UltraSPARC T1 when only two threads are available
(Thread t0 issues a ld instruction and becomes unavailable for two cycles.
The add instruction from thread t0 is speculatively switched into the pipeline assuming a cache hit.)
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.2. Cell BE
• Overview of the Cell BE
• Processor components
• Multithreading the PPE
• Programming models
• Implementation of the Cell BE
Overview of the Cell BE (1)
Cell BE
Objective: speeding up game/multimedia apps.
Used in the PlayStation 3 (PS3) and in the QS20 Blade Server.
Goal: 100 times the PS2 performance.

History
Cell BE: collaborative effort of Sony, IBM and Toshiba.
• Summer 2000: high level architectural discussions
• End 2000: architectural concept
• March 2001: Design Center opened in Austin, TX
• Spring 2004: single Cell BE operational
• Summer 2004: 2-way SMP operational
• Febr. 2005: first technical disclosures
• Oct. 2005: Mercury announces Cell Blade
• Nov. 2005: Open Source SDK & Simulator published
• Febr. 2006: IBM announced Cell Blade QS20

Cell BE at NIK
• May 2007: QS20 arrives at NIK within IBM's loan program
Overview of the Cell BE (2)
Main functional units of the Cell BE
• 9 cores:
  – the PPE (Power Processing Element), a dual threaded, dual issue 64-bit PowerPC compliant processor, and
  – 8 SPEs (Synergistic Processing Elements), single threaded, dual issue 128-bit SIMD processors,
• the EIB (Element Interconnect Bus), an on-chip interconnection network,
• the MIC (Memory Interface Controller), a memory controller supporting dual Rambus XDR channels, and
• the BIC (Bus Interface Controller), which interfaces the Rambus FlexIO bus.
Overview of the Cell BE (3)
SPE: Synergistic Processing Element
SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store of 256 KB
SMF: Synergistic Memory Flow Unit
EIB: Element Interconnect Bus
PPE: Power Processing Element
PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
XDR: Rambus DRAM
Figure 4.2.8: Block diagram of the Cell BE [4.2.2.1]
Overview of the Cell BE (4)
Unique features of the Cell BE
a) Heterogeneous MCP rather than a symmetrical MCP (the usual implementation)
The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.
The SPEs
• are optimized to run compute intensive SIMD apps.,
• usually operate under the control of the PPE,
• run their individual apps. (threads),
• have full access to a coherent shared memory, including the memory mapped I/O space,
• can be programmed in C/C++.
Contrasting the PPE and the SPEs:
• the PPE is more adept at control-intensive tasks and quicker at task switching,
• the SPEs are more adept at compute-intensive tasks and slower at task switching.
Overview of the Cell BE (5)
b) The SPEs have an unusual storage architecture:
• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
  – they fetch instructions from their private LS and
  – their load/store instructions access their LS rather than the main store,
• the LS has no associated cache,
• SPEs access main memory (the effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS, while DMA commands can be batched (up to 16 commands).
Overview of the Cell BE (6)
Although the PPE and the SPEs have coherent access to main memory, the Cell BE is not a traditional shared-memory multiprocessor, as the SPEs operate in connection with their LS rather than with the main memory.
Processor components of the Cell BE (1)
PPE (Power Processing Element) [4.2.2.2]
• Fully compliant 64-bit Power processor (Architecture Specification 2.02).
• fc = 3.2 GHz (11 FO4 design, 23 pipeline stages).
• Dual-issue, in-order, two-way (fine grained) multithreaded core.
• Conventional cache architecture: 32 KB I$, 32 KB D$, 512 KB unified L2.
Processor components of the Cell BE (2)
[Figure shows: Instruction Unit, FX Execution Unit, Vector Scalar Unit (Vector/Media Execution)]
Figure 4.2.9: Main functional units of the PPE [4.2.2.3]
Processor components of the Cell BE (3)
Main components of the PPE
• IU (Instruction Unit)
  – predecodes instructions while loading them from the L2 cache into the L1 cache,
  – fetches 4 instructions per cycle, alternating between the two threads, from the L1 I$ into two instruction buffers (one for each thread),
  – dispatches instructions from the two instruction buffers to the shared decode, dependency checking and issue pipeline according to the thread scheduling rules.
• Microcode Engine
  – Instructions that are either difficult to implement in hardware or rarely used (such as load string or several Condition Register (CR) instructions) are split into a few simple PowerPC instructions that are stored in a ROM.
  – Most microcoded instructions are split into two or three microcoded instructions.
  – The Microcode Engine inserts microcoded instructions from one thread into the instruction flow with a delay of 11 clock cycles.
  – The Microcode Engine stalls dispatching from the instruction buffers until the last microcode of the microcoded instruction is dispatched.
  – The next dispatch cycle belongs to the thread that did not invoke the Microcode Engine.
• Shared decode, dependency checking and issue pipeline
  – receives dispatched instructions (up to two in each cycle, from the same thread),
  – decodes them, checks for dependencies and issues instructions for execution according to the issue rules.
Processor components of the Cell BE (4)
• XU (FX Execution Unit)
  – 32x64-bit register file/thread,
  – FXU (FX Unit),
  – LSU (L/S Unit),
  – BRU (Branch Unit) with per-thread branch prediction (6-bit global history, 4 K x 2-bit history table).
• VSU (Vector Scalar Unit)
  – VMX/FPU issue queue (two entries), also called the VSU (Vector-Scalar Unit) issue queue,
  – VMX (Vector-Media Execution Unit), also called the VXU (Vector Execution Unit):
    o 32x128-bit vector register file/thread,
    o simple, complex, permute and single-precision FP subunits,
    o 128-bit SIMD instructions with varying data width (2x64-bit, 4x32-bit, 8x16-bit, 16x8-bit, 128x1-bit),
  – FPU (FP Unit):
    o 32x64-bit register file/thread,
    o 10-stage double precision pipeline.
Processor components of the Cell BE (5)
Basic operation of the PPE
Instr. fetch
• Instruction fetch operates autonomously in order to keep each thread's instruction buffer full with useful instructions that are likely to be needed.
• 4 instr./cycle are fetched, strictly alternating between the two threads, from the L1 I$ into the private instruction buffers of the threads.
• The fetch address is taken from the Instruction Fetch Address Registers associated with each thread (IFAR0, IFAR1). The IFARs are distinct from the Program Counters (PC) associated with both threads; the PCs track the actual program flow while the IFARs track the predicted instruction execution flow.
• Accessing the taken path after a predicted-taken branch requires 8 cycles.
Processor components of the Cell BE (6)
Instruction dispatch
• Moves up to two instructions, either from one of the instruction buffers or from the Microcode Engine (complex instructions), to the shared decode, dependency check and issue pipeline.
• Instruction dispatch is governed by the dispatch rules (thread scheduling rules).
• The dispatch rules take into account thread priority and stall conditions (see Section 5.34?).
• Each pipeline stage beyond the dispatch point contains instructions from one thread only.
Instruction decode and dependency checking
• Decoding of up to two instructions from the same thread in each cycle and checking for dependencies.
Processor components of the Cell BE (7)
Pipeline stages
[Figure key] IFAR: instruction fetch address register; IC: instruction cache; IB/ibuf: instruction buffer; ID: instruction decode; IS: instruction issue; IU: Instruction Unit; VSU: Vector Scalar Unit; VXU: Vector Execution Unit; FPU: FP Execution Unit; BRU: Branch Unit; XU: FX Execution Unit; FXU: FX Execution Unit; LSU: L/S Execution Unit
Figure 4.2.10: Instruction flow in the PPE [4.2.2.4]
Processor components of the Cell BE (8)
Instruction issue at the pipeline stage IS2
• Up to two PowerPC or vector/SIMD multimedia extension instructions per cycle are forwarded from the IS2 pipeline stage for execution to
  – the VSU (VMX/FPU) issue queue (up to two instr./cycle) or
  – the BRU, LSU and FXU execution units (up to one instr./cycle per execution unit).
• Any issue combinations are allowed, except two instructions to the same unit, with a few restrictions (see Figure 4.2.11 for the valid issue combinations). Note that the valid resp. invalid issue combinations result from the underlying microarchitecture, as shown in Figure 4.2.13.
• Instructions are issued in each cycle from the same thread.
• Instruction issue can be stalled at the IS2 pipeline stage for various reasons, like invalid issue combinations or a full VSU issue queue.
Instruction issue from the VSU (VMX/FPU) issue queue
• Up to two VMX or FPU instructions are forwarded to the respective execution units. Note that instructions kept in the issue queue are already prearranged for execution, i.e. they obey the issue restrictions summarized in Figure 4.2.11.
Processor components of the Cell BE (9)
Figure 4.2.11: Valid issue combinations of an older and a younger instruction (designated as pink squares) [4.2.2.4]
Type 1 instructions: VXU simple, VXU complex, VXU FP and FPU arithmetic instructions,
Type 2 instructions: VXU load, VXU store, VXU permute, FPU load and FPU store instructions.
Processor components of the Cell BE (10)
Figure 4.2.12: Pipeline stages of the PPE [4.2.2.3]
Processor components of the Cell BE (11)
EIB data ring for internal communication [4.2.2.2]
• Four 16-byte data rings, supporting multiple transfers,
• 96 B/cycle peak bandwidth,
• over 100 outstanding requests,
• 300+ GByte/s @ 3.2 GHz.
Processor components of the Cell BE (12)
SPE [4.2.2.2]
• SPEs are optimized for data-rich operation,
• are allocated by the PPE,
• are not intended to run an OS.
Main components:
a) SPU (Synergistic Processing Unit)
b) MFC (Memory Flow Controller)
c) LS (Local Store)
d) AUC (Atomic Unit and Cache)
Processor components of the Cell BE (13)
a) SPU
Overview
• Dual-issue superscalar RISC core supporting basically a 128-bit SIMD ISA.
• The SIMD ISA provides FX, FP and logical operations on 2x64-bit, 4x32-bit, 8x16-bit, 16x8-bit and 128x1-bit data.
• In connection with the MFC, the SPU also supports a set of commands for
  – performing DMA transfers,
  – interprocessor messaging and
  – synchronization.
• The SPU executes instructions from the LS (256 KB).
• Instructions reference data from the 128x128-bit unified register file.
• The register file fetches/delivers data from/to the LS by L/S instructions.
• The SPU moves instructions and data between the main memory and the local store by requesting DMA transfers from its MFC (up to 16 outstanding DMA requests are allowed).
Processor components of the Cell BE (14)
[Figure shows: the SPU with its even and odd pipes, the LS and the MFC]
Figure 4.2.13: Block diagram of the SPU [4.2.2.3]
Processor components of the Cell BE (15)
Main components of the SPU
• Instruction issue unit – instruction line buffer
  – Fetches 32 instructions per LS request from the LS into the instruction line buffer.
  – Instruction fetching is supported by hardware prefetching. Prefetching requires 15 cycles to fill the instruction line buffer.
  – Fetched instructions are decoded and issued (up to two instructions per cycle) according to the issue rules.
• Register file
  – Unified register file of 128 registers, each 128 bits wide.
• Result forwarding and staging
  – Instructions are staged in an operand staging network for up to 6 additional cycles so that all execution units write their results into the register file in the same pipeline stage (see Figure 4.2.19).
Processor components of the Cell BE (16)
• Execution units
  Execution units are organised into two pipelines.
  – The even pipeline includes
    o the Fixed-point unit and
    o the Floating-point unit.
  – The odd pipeline includes
    o the Channel unit,
    o the Branch unit,
    o the Load/Store unit and
    o the Permute unit.
Processor components of the Cell BE (17)
Basic operation of the SPU
Instruction issue
• The SPU issues up to two instructions per cycle from a two-instruction-wide issue window, called the fetch group.
• Fetch groups are aligned to doubleword boundaries, i.e. the first instruction is at an even and the second one at an odd word address (words are 4 bytes long).
• An instruction becomes issueable when no register dependencies or resource conflicts (e.g. busy execution units) exist.
• Instructions are issued in program order, that is:
  – if the first instruction of a fetch group can be issued to the even pipeline and the second instruction to the odd pipeline, both instructions are issued in the same cycle,
  – in all other cases instruction issue needs two cycles, such that instructions are issued in program order to the pertaining pipeline (see Figure 4.2.14).
• Register or resource conflicts stall instruction issue.
• A new fetch group is loaded after both instructions of the current fetch group have been issued.
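The issue rule can be stated compactly (illustrative C; the pipe assignment of real SPU instructions is fixed by the ISA):

enum pipe { EVEN, ODD };  /* even: FX/FP; odd: channel, branch, L/S, permute */

/* Issue the current fetch group {i0, i1}: i0 sits at an even word
   address, i1 at the following odd word address. Returns the number
   of cycles the group occupies the issue stage. */
int issue_fetch_group(enum pipe i0, enum pipe i1) {
    if (i0 == EVEN && i1 == ODD)
        return 1;   /* dual issue: both instructions in the same cycle */
    return 2;       /* otherwise i0 issues first, i1 in the next cycle,
                       preserving program order */
}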
Processor components of the Cell BE (18)
Figure 4.2.14: Instruction issue example [4.2.2.4]
(Assuming that instruction issue is not constrained by register or resource conflicts)
Processor components of the Cell BE (19)
SPU channels
• An SPU communicates with its associated MFC as well as (via its MFC) with the PPE,
other SPEs and devices (such as a decrementer) through its channels.
MMIO: Memory-Mapped I/O Registers
SLC:
SPU Load and Store Unit
SSC:
SPU Channel and DMA Unit
Figure 4.2.15: The channel interface between the SPU and the MFC [4.2.2.4]
Processor components of the Cell BE (20)
• SPU channels are unidirectional interfaces for
  – sending commands (such as DMA commands) to the MFC owned by the SPU, or
  – sending/receiving up to 32-bit long messages between the SPU and the PPE or other SPEs.
• SPU channels are implemented in and managed by the MFC.
• Each channel has
  – a corresponding capacity (maximum message entries) and
  – a count (remaining available message entries).
  The channel count
  – decrements whenever a channel instruction (rdch or wrch) is issued, and
  – increments whenever an operation associated with the channel completes.
  A channel count of "0" means
  – empty for read-only channels and
  – full for write-only channels.
Processor components of the Cell BE (21)
• The SPU can read or write its channels by three instructions:
  – read channel (rdch),
  – write channel (wrch) and
  – read channel count (rchcnt).
Figure 4.2.16: Assembler instruction mnemonics and their corresponding C-language intrinsics
of the channel instructions available for the SPU [4.2.2.4]
(Intrinsics represent in-line assembly code segments in the form of C-language function calls).
Processor components of the Cell BE (22)
• The channel instructions, or DMA commands evoked by channel instructions, are enqueued for execution in the MFC for purposes like
  – initiating DMA transfers between the SPE's LS and the main storage,
  – querying DMA and SPU status,
  – sending or receiving up to 32-bit long mailbox messages, primarily between the SPU and the PPE, or
  – sending or receiving up to 32-bit long signal-notification messages between the SPU and the PPE or other SPEs.
• The PPE and other devices in the system, including other SPEs, can also access the channels through the MFC's memory mapped I/O (MMIO) registers and queues, which are visible to software in the main storage space.
Processor components of the Cell BE (23)
Figure 4.2.17: SPE channels
and associated MMIO registers (1) [4.2.2.4]
Processor components of the Cell BE (24)
Figure 4.2.18: SPE channels
and associated MMIO registers (2)
[4.2.2.4]
Processor components of the Cell BE (25)
Figure 4.2.19: Pipeline stages of the SPUs [4.2.2.1]
Processor components of the Cell BE (26)
b) Memory Flow Controller (MFC)
[4.2.2.2]
The MFC
• acts as a specialized co-processor for its associated SPU by autonomously executing its own command set, and
• serves as the SPU's interface, via the EIB, to main storage and other processor elements, such as other SPEs or system devices.
Processor components of the Cell BE (27)
MMIO: Memory-Mapped I/O Registers
SLC:
SPU Load and Store Unit
SSC:
SPU Channel and DMA Unit
Figure 4.2.20: Block diagram of the MFC [4.2.2.4]
Processor components of the Cell BE (28)
The MFC as a specialized co-processor
It executes three types of commands:
• DMA commands,
• DMA list commands and
• synchronization commands.

DMA commands (put, get)
• can be initiated by both the PPE and the SPU,
• move up to 16 KByte of data between the LS and the main storage,
• support transfer sizes of 1, 2, 4, 8, 16 bytes and multiples of 16 bytes,
• access the main store by using main storage effective addresses,
• can be tagged with a 5-bit tag (tag group ID) to allow special handling within the tag group, such as enforcing the ordering of DMA commands.

DMA list commands (put, get commands with the command modifier l)
• can be initiated only by the SPU,
• consist of up to 2 K 8-byte long list elements,
• each list element specifies a DMA transfer,
• are used to move data between a contiguous area in the LS and a possibly noncontiguous area in the effective address space (implementing scatter-gather functions between main storage and the LS).
Processor components of the Cell BE (29)
Synchronization commands
• are used basically to control the order of storage accesses,
• include atomic commands (a form of semaphores), send-signal commands and barrier commands.

Operation of the MFC
• The MFC maintains two separate command queues:
  – the 16-entry SPU command queue for commands from the SPU associated with the MFC, and
  – the 8-entry proxy command queue for commands from the PPE, other SPEs and devices.
• The MFC supports out-of-order execution of DMA commands.
Processor components of the Cell BE (30)
The MFC as the interface between the SPU and the main storage, the PPE and other devices
• supports storage protection on the main storage side while performing DMA transfers,
• maintains synchronization between main storage and the LS,
• performs intercore communication functions, such as mailbox and signal-notification messaging with the PPE, other SPEs and devices.
Processor components of the Cell BE (31)
Intercore communication tools of the MFC
• three mailboxes, primarily intended for holding up to 32-bit long messages from/to the SPE:
  – one four-deep mailbox for receiving mailbox messages and
  – two one-deep mailboxes for sending mailbox messages,
• two signal notification channels for receiving signals, sent basically by the PPE.
Processor components of the Cell BE (32)
Figure 4.2.21: Contrasting mailboxes and signals [4.2.2.4]
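A minimal usage sketch of the mailboxes (assuming the SDK intrinsic spu_write_out_mbox() on the SPU side and the libspe 1.x calls spe_stat_out_mbox()/spe_read_out_mbox() on the PPE side, matching the code example at the end of this section; the two fragments belong to separate SPU and PPE programs):

/* SPU side: send a 32-bit "job finished" message to the PPE. */
#include <spu_mfcio.h>

void notify_ppe(unsigned int msg) {
    spu_write_out_mbox(msg);   /* blocks while the one-deep outbound
                                  mailbox is still full */
}

/* PPE side: poll the SPE's outbound mailbox and fetch the message. */
#include <libspe.h>

unsigned int wait_for_spe(speid_t spe_id) {
    while (spe_stat_out_mbox(spe_id) == 0)
        ;                      /* no message queued yet */
    return spe_read_out_mbox(spe_id);
}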
Processor components of the Cell BE (33)
c) Local Store [4.2.2.2]
• Single-port SRAM cell.
• Executes DMA reads/writes and instruction prefetches via 128-byte wide read/write ports.
• Executes instruction fetches and loads/stores via 128-bit read/write ports.
• Asynchronous, coherent DMA commands are used to move instructions and data between the local store and system memory.
• DMA transfers between the LS and the main storage are executed by the SMF's DMA unit.
• A 128-byte long DMA read or write requires 16 processor cycles to forward the data on the EIB.
Processor components of the Cell BE (34)
d) The Atomic Update and Cache unit [4.2.2.2]
The Atomic Unit
• executes atomic operations (a form of mutual-exclusion (mutex) operations) invoked by the MFC,
• supports Page Table lookups and
• maintains cache coherency by supporting snoop operations.
The Atomic Cache
• holds six 128-byte cache lines of data to support atomic operations and Page Table accesses.
Processor components of the Cell BE (35)
Broadband Interface Controller (BIC) [4.2.2.2]
• Provides a wide connection to external devices.
• Two configurable interfaces (50+ GB/s @ 5 Gbps):
  – configurable number of bytes,
  – coherent (BIF) and/or I/O (IOIFx) protocols.
• Supports two virtual channels per interface.
• Supports multiple system configurations.

Memory Interface Controller (MIC)
• Dual XDR controller (25.6 GB/s @ 3.2 Gbps).
• ECC support.
• Suspend-to-DRAM support.
Multithreading the PPE (1)
Scheduling of PPE threads
Thread scheduling depends on
• the thread states,
• the thread priorities and
• the single threaded or dual threaded mode of execution.
Multithreading the PPE (2)
1. Thread states
• Privilege states
• Suspended/enabled state
• Blocked/not blocked state

a) Privilege states
• Hypervisor state
  – most privileged,
  – allows running a meta OS that manages logical partitions in which multiple OS instances can run,
  – some system operations require the initiating thread to be in hypervisor state.
• Supervisor state
  – the state in which an OS instance is intended to run.
• Problem state (user state)
  – the state in which an application is intended to run.
Multithreading the PPE (3)
(HV: Hypervisor, PR: Problem)
Figure 4.2.22: Bits of the Machine State Register (MSR) defining the privilege state of a thread
[4.2.2.4]
Multithreading the PPE (4)
b) Suspended/enabled state
• A thread in the hypervisor state can change its state from enabled to suspended.
• Two bits of the Control Register (CTRL[TE0], CTRL[TE1]) define whether a thread is in the suspended or enabled state.

c) Blocked/stalled state
• Blocking
  – occurs at the instruction dispatch stage if the thread selection rule favours the other thread, or due to a special "nop" instruction,
  – stops only one of the two threads.
• Stalling
  – occurs at the instruction issue stage due to dependencies,
  – stops both threads,
  – for very long latency conditions, such as L1 cache misses or divide instructions, stalling both threads is avoided by
    -- flushing instructions younger than the stalled instruction,
    -- refetching the instructions starting with the stalled instruction, and
    -- stalling the thread at the dispatch stage until the stall condition is removed; the other thread can then continue to dispatch.
Multithreading the PPE (5)
2. Thread priorities
• Thread priority determines dispatch priority.
• Four priority levels:
  – thread disabled,
  – low priority,
  – medium priority,
  – high priority.
• Priority levels are specified by a 2-bit field (TP field) of the TSRL register (Thread Status Register Local).
• Software, in particular OS software, sets thread priorities (according to the throughput requirements of the programs running in the threads). E.g. a foreground/background thread priority scheme can be set to favor one thread over the other when allocating instruction dispatch slots.
• A thread must be in the hypervisor or supervisor state to set its priority to high.
Multithreading the PPE (6)
Usual thread priority combinations
The combination high priority thread/low priority thread is not expected to be used, as in this case the PPE would never dispatch instructions from the low priority thread unless the high priority thread was unable to dispatch.
Figure 4.2.23: Usual thread priority combinations [4.2.2.4]
Multithreading the PPE (7)
Example (1): Scheduling in the case of the medium priority/medium priority setting
Basic scheduling rules
• The PPE attempts to utilize all available dispatch slots.
• Thread scheduling is fair (round robin scheduling).
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Note: the same scheduling applies when both threads are set to high priority.
Figure 4.2.24: Thread scheduling when both priorities are set to medium [4.2.2.4]
Multithreading the PPE (8)
Example (2): Scheduling in the case of the low priority/medium priority setting
Basic scheduling rules
• The PPE attempts to utilize most available dispatch slots for the medium priority thread (this setting is appropriate for running a low-priority program in the background).
• Assuming a duty cycle of 5 (TSRL[DISP_COUNT] = 5), instructions from thread 1 are dispatched on four out of five cycles, while instructions from thread 0 are dispatched only on one out of five cycles.
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Figure 4.2.25: Thread scheduling when one thread runs at medium priority and the other at low priority [4.2.2.4]
Multithreading the PPE (9)
Example (3): Scheduling in the case of the low priority/low priority setting
Basic scheduling rules
• The PPE attempts to dispatch only once every duty cycle (TSCR[DISP_COUNT]) cycles. (With high values of DISP_COUNT the PPE will mostly idle, which reduces power consumption and heat production while keeping both threads alive.)
• Thread scheduling is fair (round robin scheduling).
• Assuming a duty cycle of 5, both threads are scheduled only once every 5 cycles.
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Figure 4.2.26: Thread scheduling when both priorities are set to low [4.2.2.4]
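The three settings above can be condensed into one sketch (illustrative C; the duty-cycle handling is simplified and the "unable to dispatch" fallback is omitted):

#include <stdio.h>

enum prio { LOW, MEDIUM, HIGH };

/* Which thread gets the dispatch slot in this cycle? (-1: slot idle)
   duty corresponds to the DISP_COUNT duty cycle of the text. */
int dispatch_slot(enum prio p0, enum prio p1, long cycle, int duty) {
    if (p0 == p1) {
        if (p0 == LOW)                        /* low/low: dispatch only  */
            return (cycle % duty) ? -1        /* once per duty cycle,    */
                                  : (int)((cycle / duty) % 2); /* fairly */
        return (int)(cycle % 2);              /* med/med (or high/high): */
    }                                         /* fair round robin        */
    /* unequal priorities, e.g. low/medium: the lower priority thread
       gets one slot per duty cycle, the other thread the remaining ones */
    int low_thread = (p0 < p1) ? 0 : 1;
    return (cycle % duty == 0) ? low_thread : 1 - low_thread;
}

int main(void) {                              /* reproduce Figure 4.2.25 */
    for (long c = 0; c < 10; c++)
        printf("cycle %ld -> thread %d\n", c, dispatch_slot(LOW, MEDIUM, c, 5));
    return 0;
}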
Multithreading the PPE (10)
3. Single threaded/dual threaded mode of execution
• In single threaded mode all resources are allocated to a single thread; this reduces the turnaround time of that thread.
• Software can change the operating mode of the PPE between single threaded and dual threaded mode only in the hypervisor state.
Multithreading the PPE (11)
Software controlled thread behaviour
Software can use various schemes to control thread behaviour, including
• enabling and suspending a thread,
• setting thread priorities to control the instruction dispatch policy,
• executing a special nop instruction to cause temporary dispatch blocking,
• switching the state of the PPE between single threaded and multithreaded mode.
Multithreading the PPE (12)
Core enhancements for multithreading
• Duplicated architectural states for
32 GPRs
32 FPRs
32 Vector Registers (VRs)
Condition Register (CR)
Count Register (CTR)
Link Register (LR)
FX Exception Register (XER)
FP Status and Control Register (FPSCR)
Vector Status and Control Register (VSCR)
Decrementer (DEC)
Multithreading the PPE (13)
• Duplicated microarchitectural states for
Branch History Table (BHT) with global branch history
(to allow independent and simultaneous branch prediction for both threads)
Internal registers associated with exception and interrupt handling, such as
• Machine State Register (MSR),
• Machine Status Save/Restore Registers (SRR0, SRR1),
• Hypervisor Machine Status Save/Restore Registers (HSRR0, HSRR1),
• FP Status and Control Register (FPSCR), etc.
(to allow concurrent exception and interrupt handling)
Multithreading the PPE (14)
• Duplicated queues and arrays
  – Segment lookaside buffer (SLB),
  – Instruction buffer queue (IBuf) (to allow each thread to dispatch regardless of any dispatch stall in the other thread),
  – Link stack queue.
• Shared resources
  – Hardware execution units,
  – the instruction fetch control (because the I$ has only one read port, so fetching must alternate between threads every cycle),
  – virtual memory mapping (as both threads always execute in the same logical-partitioning context),
  – most large arrays and queues, such as caches, that consume a significant amount of chip area.
Multithreading the PPE (15)
The programming model assumes the choice of an appropriate SPU configuration.
Basic SPU configurations:
• application specific SPU accelerators,
• the multi-stage pipeline SPU configuration or
• the parallel-stages SPU configuration.
Multithreading the PPE (16)
Application specific SPU accelerators [4.2.2.5]
Multithreading the PPE (17)
Multi-stage SPU pipeline configuration [4.2.2.5]
Programming models (1)
Parallel-stages SPU configuration [4.2.2.5]
Programming models (2)
Basic approach for creating an application
• The programmer chooses the appropriate SPU configuration according to the features of the application, such as
  – graphics processing,
  – audio processing,
  – MPEG encoding/decoding,
  – encryption/decryption.
• The programmer writes/uses SPU "libraries" for either
  – application specific SPU accelerators,
  – the multi-stage pipeline SPU configuration or
  – the parallel-stages SPU configuration.
• The main application runs in the PPE and invokes SPU-bound services by
  – creating SPU threads,
  – RPC-like function calls,
  – I/O device like interfaces (FIFO/command queue).
• One or more SPUs cooperate in the presumed SPU configuration to execute the tasks required.
Programming models (3)
• Acceleration is provided by OS or application libraries.
• Application portability is maintained with platform specific libraries.
Programming models (4)
Example
Aim
• Showing the cooperation between the PPE and an SPE.
Program
• Actual goal: to calculate the distance travelled in a car.
• It asks for:
  – the elapsed time,
  – the speed.
Program structure
• There are two program codes, one for the PPE and one for the SPE.
• The PPE does the user input, then it calls the SPE executable, which calculates the distance and returns the result.
• The result is then given to the user by the PPE.
Programming models (5)
Example
[Diagram] The PPE loads the program and data into the main store and notifies an SPE of the work to be done (spe_create_thread); the SPE copies the data from the main store (MS) to its Local Store (mfc_get) and accesses it there.
Programming models (6)
[Diagram] Execution of the SPE thread, operating on its Local Store.
Programming models (7)
[Diagram] The SPE updates the results in its Local Store, copies the data from the LS back to the MS (mfc_put) and notifies the PPE that the job is finished by sending a message; the PPE then loads the results from the main store.
Programming models (8)
#include <stdio.h>
#include <libspe.h>

extern spe_program_handle_t calculate_distance_handle;  /* external SPE program (next slide) */

/* Define the data structure passed to the SPE task */
typedef struct {
    float speed;      //input parameter
    float num_hours;  //input parameter
    float distance;   //output parameter
    float padding;    //pad the struct to a multiple of 16 bytes
} program_data;

int main() {
    program_data pd __attribute__((aligned(16)));  //aligned for transfer

    /* Data input */
    printf("Enter the speed in miles/hr: ");
    scanf("%f", &pd.speed);
    printf("Enter the number of hours you have been driving: ");
    scanf("%f", &pd.num_hours);

    /* Create the SPE thread and wait for it to finish */
    speid_t spe_id = spe_create_thread(0, &calculate_distance_handle, &pd, NULL, -1, 0);
    spe_wait(spe_id, NULL, 0);

    /* Data output */
    printf("The distance travelled is %f miles.\n", pd.distance);
    return 0;
}
Programming models (9)
#include <spu_mfcio.h>

/* Define the data structure used to communicate with the PPE */
typedef struct {
    float speed;      //input parameter
    float num_hours;  //input parameter
    float distance;   //output parameter
    float padding;    //pad the struct to a multiple of 16 bytes
} program_data;

int main(unsigned long long spe_id, unsigned long long program_data_ea, unsigned long long env) {
    program_data pd __attribute__((aligned(16)));
    int tag_id = 0;

    /* Copy the data from the MS to the LS */
    mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
    mfc_write_tag_mask(1<<tag_id);
    mfc_read_tag_status_any();   /* wait for completion */

    /* Calculate the result */
    pd.distance = pd.speed * pd.num_hours;

    /* Copy the data from the LS back to the MS */
    mfc_put(&pd, program_data_ea, sizeof(program_data), tag_id, 0, 0);
    mfc_write_tag_mask(1<<tag_id);
    mfc_read_tag_status_any();   /* wait for completion */

    return 0;
}
Implementation of the Cell BE (1)
Implementation alternatives
Figure 4.2.27: Cell system configuration options [4.2.2.3]
Implementation of the Cell BE (2)
Figure: Cell BE Blade Roadmap
Source: Brochard L., „A Cell History," Cell Workshop, April 2006
http://www.irisa.fr/orap/Constructeurs/Cell/Cell%20Short%20Intro%20Luigi.pdf
Implementation of the Cell BE (3)
Motherboard of the Cell Blade (QS20)
Figure 4.2.28: Motherboard of the Cell Blade (QS20) [4.2.2.5]
References
Cell BE
[4.2.2.1] Gschwind M., „Chip Multiprocessing and the Cell BE," ACM Computing Frontiers, 2006,
http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[4.2.2.2] Hofstee P., „Tutorial: Hardware and Software Architectures
for the CELL BROADBAND ENGINE processor”, IBM Corp., September 2005
http://www.crest.gatech.edu/conferences/cases2005/pdf/Cell-tutorial.pdf
[4.2.2.3] Kahle J.A., „Introduction to the Cell multiprocessor", IBM J. Res. & Dev., Vol. 49, 2005, pp. 584-604
http://www.research.ibm.com/journal/rd/494/kahle.pdf
[4.2.2.4]: Cell Broadband Engine Programming Handbook Vers. 1.1, Apr. 2007, IBM Corp.
[4.2.2.5] Cell BE Overview, Course code: L1T1H1-02, May 2006, IBM Corp.
4.3 SMT multithreaded processors
4.3.1. Intel Pentium 4
4.3.2. Alpha 21464 (V8)
4.3.3. IBM Power5
4.3. Simultaneously multithreaded processors
Thread scheduling in multithreaded cores
Coarse grained MT cores
Fine grained MT cores
SMT cores
4.3.1. Intel Pentium 4 (1)
Intel designates SMT as Hyperthreading (HT).
Introduced in the Northwood based DP and MP server cores in 2/2002 and 3/2002 resp. (called the Prestonia and Foster MP cores), followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including:
  – the instruction pointer,
  – the general purpose regs.,
  – the control regs.,
  – the APIC (Advanced Programmable Interrupt Controller) regs.,
  – some machine state regs.
4.3.1. Intel Pentium 4 (2)
Figure 4.3.1. Intel Pentium 4 and the visible processor resources duplicated to support
hyperthreading technology. Hyperthreading requires duplication of additional miscellaneous
pointers and control logic, but these are too small to
point out.
Source: Koufaty D. and Marr D.T. „Hyperthreading Technology in the Netburst Microarchitecture,
IEEE. Micro, Vol. 23, No.2, March-April 2003, pp. 56-65.
4.3.1. Intel Pentium 4 (3)
Additions for implementing MT (continued):
• Further enhancements to support MT (thread microstate); the sketch below contrasts
  the two mechanisms used (duplication vs. tagging):
  • TC (Trace Cache) entries are tagged,
  • the BHB (Branch History Buffer) is duplicated,
  • the Global History Table is tagged,
  • the RAS (Return Address Stack) is duplicated,
  • the rename tables are duplicated,
  • the ROB is tagged.
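The list above names the two generic ways of making a core structure thread-aware: duplicate it per thread, or share one structure whose entries are tagged with a thread ID. The C sketch below illustrates the difference for a RAS-like and a ROB-like structure; it is an illustration of the idea only, not Intel's implementation, and all sizes are hypothetical.

#include <stdint.h>

#define N_THREADS 2
#define RAS_DEPTH 16
#define ROB_SIZE  128

/* Duplicated structure: one Return Address Stack per thread,
   selected by thread ID. */
struct ras {
    uint32_t entry[RAS_DEPTH];
    int      top;
} ras[N_THREADS];

/* Tagged structure: one shared ROB whose entries carry their owner. */
struct rob_entry {
    uint8_t  thread_id;            /* the tag: which thread owns this entry */
    uint8_t  completed;
    uint64_t uop;                  /* decoded micro-op (placeholder) */
} rob[ROB_SIZE];

/* Retirement logic must only consider entries of the retiring thread. */
static int retirable(const struct rob_entry *e, uint8_t tid) {
    return e->completed && e->thread_id == tid;
}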
4.3.1. Intel Pentium 4 (4)
Figure 4.3.2: SMT pipeline in Intel’s Pentium 4/HT
Source: Marr D.T. et al., „Hyper-Threading Technology Architecture and Microarchitecture,"
Intel Technology Journal, Vol. 6, Issue 1, Feb. 14, 2002, pp. 4-16.
4.3.1. Intel Pentium 4 (5)
Additional die area required for MT: less than 5 %.
Single-thread/dual-thread modes:
To prevent single-thread performance degradation, the partitioned resources are
recombined in single-thread mode (see the sketch below).
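A minimal sketch of the recombination idea (not Intel's actual logic; the queue and its size are hypothetical): in dual-thread mode a statically partitioned resource gives each thread half the entries, while in single-thread mode the partitions are recombined so a lone thread sees the full structure.

enum mode { ST_MODE, MT_MODE };

#define UOP_QUEUE_SIZE 64          /* hypothetical total queue size */

/* Entries visible to one thread: half the queue when two threads run,
   the whole (recombined) queue in single-thread mode. */
static int entries_per_thread(enum mode m) {
    return (m == ST_MODE) ? UOP_QUEUE_SIZE : UOP_QUEUE_SIZE / 2;
}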
4.3.2. Alpha 21464 (V8) (1)
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
In 2001 all Alpha intellectual property rights were sold to Intel.
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
  PC, architectural registers (by increasing the sizes of the merged GPR and FPR
  architectural and rename reg. files; see the sketch below):

         Alpha 21264   Alpha 21464
  GPRs   80            512
  FPRs   80            512

Source: Preston R.P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with
Simultaneous Multithreading," Proc. ISSCC, 2002, pp. 334-335.
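The point of a merged architectural + rename register file is that 4-way multithreading only needs per-thread rename maps; all threads draw physical registers from one shared pool. The sketch below illustrates this organization (Alpha defines 32 architectural integer registers); it is illustrative only, not the 21464's actual design.

#include <stdint.h>

#define N_THREADS 4
#define ARCH_REGS 32               /* Alpha's architectural integer regs */
#define PHYS_REGS 512              /* merged architectural + rename register file */

static uint16_t rename_map[N_THREADS][ARCH_REGS];  /* arch reg -> phys reg, per thread */
static uint64_t phys_reg[PHYS_REGS];               /* one pool shared by all threads */

/* A read of architectural register r by thread t goes through t's own map;
   only the maps are replicated 4 x, the register file itself is shared. */
static uint64_t read_reg(int t, int r) {
    return phys_reg[rename_map[t][r]];
}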
4.3.2. Alpha 21464 (V8) (2)
SMT pipeline stages: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire,
with per-thread PCs and register maps in front of shared queues, register files, Icache and Dcache.
Figure 4.3.3: SMT pipeline in the Alpha 21464 (V8)
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors," http://www.compaq.com
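The pipeline above implies a per-cycle decision in the Fetch stage: which thread's PC drives the Icache this cycle. The sketch below shows one simple possibility, a round-robin choice among threads that are ready to fetch; the 21464's actual fetch policy is not given in the slides.

#define N_THREADS 4

static unsigned last_fetched = 0;

/* ready[t] is nonzero when thread t can fetch this cycle (e.g. it is not
   stalled on an Icache miss).  Returns the selected thread, or -1. */
int select_fetch_thread(const int ready[N_THREADS]) {
    for (unsigned i = 1; i <= N_THREADS; i++) {
        unsigned t = (last_fetched + i) % N_THREADS;
        if (ready[t]) {
            last_fetched = t;
            return (int)t;
        }
    }
    return -1;                     /* no thread can fetch this cycle */
}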
4.3.2. Alpha 21464 (V8) (3)
Core enhancements for 4-way multithreading (continued):
• Providing replicated (4 x) thread microstates for:
  the register maps.
Source: Preston R.P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with
Simultaneous Multithreading," Proc. ISSCC, 2002, pp. 334-335.
4.3.2. Alpha 21464 (V8) (5)
Additional core area needed for SMT: ~ 6 %.
Source: Preston R.P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with
Simultaneous Multithreading," Proc. ISSCC, 2002, pp. 334-335.
4.3.3. IBM POWER5 (1)
POWER5 enhancements vs. the POWER4:
• on-chip memory control,
• separate L3/memory attachment,
• dual-threaded cores.
4.3.3. IBM POWER5 (2)
Figure 4.3.5: POWER4 and POWER5 system structures
Source: Kalla R., Sinharoy B., Tendler J.M., „IBM POWER5 chip: a dual-core multithreaded processor,"
IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.
4.3.3. IBM POWER5 (6)
Figure 4.3.7: Microarchitecture of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (7)
Figure 4.3.8: IBM POWER5 Chip
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (8)
Core enhancements for multithreading:
• Providing duplicated thread states for:
  PC, architectural registers (by increasing the sizes of the merged GPR and FPR
  architectural and rename reg. files):

         POWER4   POWER5
  GPRs   80       120
  FPRs   72       120
4.3.3. IBM POWER5 (9)
Figure 4.3.9: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (10)
Core enhancements for multithreading (continued):
• Providing duplicated thread microstates for:
  the Return Address Stack and the Group Completion unit (ROB).
4.3.3. IBM POWER5 (12)
Core enhancements for multithreading (continued):
• Providing increased (in fact duplicated) sizes for scarce or sensitive resources, such as:
  the Instruction Buffer and the Store Queue.
4.3.3. IBM POWER5 (14)
Additional core area needed for SMT: ~ 10 %.
4.3.3. IBM POWER5 (15)
Unbalanced execution of threads
(an enhancement of the single-thread/dual-thread execution model):
• threads have 8 priority levels (0...7), controlled by HW/SW,
• the decode rate of each thread is controlled according to the associated priority
  (a toy model follows the figure below).
Figure 4.3.12: Unbalanced execution of threads in IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
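The slides give the eight priority levels but not the exact decode-rate formula, so the following is only a toy model of the idea: decode cycles are shared between the two threads in proportion to their priorities, so a higher-priority thread decodes more often.

/* Toy model of priority-controlled decode (the real POWER5 formula is
   not given in the slides): within a window of prio0 + prio1 cycles,
   thread 0 owns the first prio0 cycles and thread 1 the rest, so decode
   bandwidth is split in proportion to the priorities. */
int decode_owner(int cycle, int prio0, int prio1) {
    int window = prio0 + prio1;
    if (window == 0)
        return -1;                 /* both threads at priority 0: neither decodes */
    return (cycle % window < prio0) ? 0 : 1;
}

With priorities 6 and 2, for example, thread 0 would get 6 of every 8 decode cycles under this model.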
4.3.3. IBM POWER5 (16)
Development effort:
• Concept phase: ~ 10 persons / 4 months
• High-level design phase: ~ 50 persons / 6 months
• Implementation phase: ~ 200 persons / 12-18 months
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003