Multi Threaded Architectures
Sima, Fountain and Kacsuk
Chapter 16
CSE462
Memory and Synchronization Latency

• Scalability of a system is limited by its ability to handle
  memory latency and algorithmic synchronization delays
• The overall solution is well known
  – Do something else whilst waiting
• Remote memory accesses
  – Much slower than local
  – Varying delay depending on
    • Network traffic
    • Memory traffic
© David Abramson, 2004
Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
Processor Utilization

• Utilization
  – U = P/T
    • P: time spent processing
    • T: total time
  – U = P/(P + I + S)
    • I: time spent waiting on other tasks
    • S: time spent switching tasks
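The formula above can be sketched directly in code (the function name and sample numbers are illustrative, not from the slides):

```python
def utilization(p: float, i: float, s: float) -> float:
    """Processor utilization U = P / (P + I + S).

    p: time spent processing
    i: time spent waiting on other tasks
    s: time spent switching tasks
    """
    return p / (p + i + s)

# A processor that computes for 80 cycles, waits for 15
# and spends 5 switching is 80% utilized.
print(utilization(80, 15, 5))  # → 0.8
```

Note that multithreading attacks the I term (overlap waiting with other work) at the price of a larger S term.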
Basic ideas - Multithreading

• Fine grain – task switch every cycle
  – Cycles where one thread would be blocked are filled by other threads
• Coarse grain – task switch every n cycles
  – Each switch incurs a task switch overhead
Design Space

• Computational model
  – Von Neumann (sequential control flow)
  – Hybrid von Neumann/dataflow
  – Parallel control flow based on parallel control operators
  – Parallel control flow based on control tokens
• Granularity
  – Fine grain
  – Coarse grain
• Number of threads per processor
  – Small (4 – 10)
  – Middle (10 – 100)
  – Large (over 100)
• Memory organization
  – Physical shared memory
  – Distributed shared memory
  – Cache-coherent distributed shared memory
Classification of multi-threaded architectures

• Von Neumann based architectures
  – HEP
  – Tera
  – MIT Alewife & Sparcle
• Hybrid von Neumann/dataflow architectures
  – RISC-like: P-RISC, *T
  – Macro dataflow: MIT Hybrid Machine, McGill MGDA & SAM, EM-4
  – Decoupled: USC
Computational Models
Sequential control flow (von Neumann)

• Flow of control and data are separated
• Executed sequentially (or at least with sequential
  semantics – see chapter 7)
• Control flow is changed with
  JUMP/GOTO/CALL instructions
• Data stored in rewritable memory
  – Flow of data does not affect execution order
Sequential Control Flow Model

R = (A - B) * (B + 1)

L1: -  A   B   m1      (m1 = A - B)
L2: +  B   1   m2      (m2 = B + 1)
L3: *  m1  m2  R       (R = m1 * m2)

Control flows from L1 to L2 to L3.
Dataflow

• Control is tied to data
• An instruction “fires” when its data is available
  – Otherwise it is suspended
• Order of instructions in the program has no effect on execution order
  – Cf. von Neumann
• No shared rewritable memory
  – Write-once semantics
• Code is stored as a dataflow graph
• Data is transported as tokens
• Parallelism occurs if multiple instructions can fire at the same time
  – Needs a parallel processor
• Nodes are self-scheduling
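The firing rule can be sketched with a toy interpreter for R = (A - B) * (B + 1) (node and slot names are illustrative): each node fires as soon as all of its operand slots hold tokens, regardless of the order the tokens arrive in.

```python
import operator

# node -> (function, arity, list of destination (node, slot) pairs)
graph = {
    "sub": (operator.sub, 2, [("mul", 0)]),
    "add": (operator.add, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}
slots = {name: [None] * arity for name, (_, arity, _) in graph.items()}
results = {}

def send(node, slot, value):
    """Deliver a token; fire the node once all its slots are filled."""
    slots[node][slot] = value
    fn, arity, dests = graph[node]
    if all(v is not None for v in slots[node]):
        out = fn(*slots[node])
        results[node] = out
        for dnode, dslot in dests:   # results travel as new tokens
            send(dnode, dslot, out)

A, B = 7, 3
# Input tokens may arrive in any order:
send("add", 0, B); send("add", 1, 1)
send("sub", 0, A); send("sub", 1, B)
print(results["mul"])  # R = (7 - 3) * (3 + 1) = 16
```

Reordering the four `send` calls does not change the result, which is the point: execution order is driven by data availability, not program order.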
Dataflow – arbitrary execution order

R = (A - B) * (B + 1)

[Figure: dataflow graph with inputs A, B and constant 1 feeding “-” and “+” nodes, whose results feed “*” to produce R; the “-” and “+” nodes may fire in either order]
Dataflow – Parallel Execution

R = (A - B) * (B + 1)

[Figure: the same dataflow graph with the “-” and “+” nodes firing at the same time on a parallel processor]
Implementation

• The dataflow model requires a very different execution engine
• Data must be stored in a special matching store
• Instructions must be triggered when both operands are available
• Parallel operations must be scheduled to processors dynamically
  – We don’t know a priori when they will be available
• Instruction operands are pointers
  – To the destination instruction
  – And the operand number
Dataflow model of execution

L1: Compute B        → L2/2, L3/1
L2: -  A  _          → L4/1
L3: +  _  1          → L4/2
L4: *  _  _          → L6/1

Each result token is sent to a destination instruction and operand slot; L2/2 means operand 2 of instruction L2.
Parallel Control flow

• Sometimes called macro dataflow
  – Data flows between blocks of sequential code
  – Has the advantages of both dataflow and von Neumann
    • Context switch overhead reduced
    • Compiler can schedule instructions statically
    • Doesn’t need a fast matching store
• Requires additional control instructions
  – Fork/Join
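The fork/join control instructions can be sketched with Python threads, following the running R = (A - B) * (B + 1) example (this models the semantics, not a real macro dataflow engine):

```python
import threading

A, B = 7, 3
m = {}                               # shared "memory cells" m1, m2

def block_L4():                      # forked block: m2 = B + 1
    m["m2"] = B + 1

t = threading.Thread(target=block_L4)
t.start()                            # L1: FORK L4
m["m1"] = A - B                      # L2: the fall-through block
t.join()                             # L5: JOIN 2 – both blocks done
R = m["m1"] * m["m2"]                # L6
print(R)  # → 16
```

The two blocks are themselves sequential code; only the fork and join points need synchronization.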
Macro Dataflow (Hybrid Control/Dataflow)

R = (A - B) * (B + 1)

L1: FORK L4          (spawn a second thread at L4)
L2: -  A   B   m1    (control flows on to L3)
L3: GOTO L5
L4: +  B   1   m2    (control flows on to L5)
L5: JOIN 2           (wait until both threads arrive)
L6: *  m1  m2  R
Issues for Hybrid dataflow

• Blocks of sequential instructions need to be large
  enough to absorb the overheads of context switching
• Data memory is the same as in an MIMD machine
  – Can be partitioned or shared
  – Synchronization instructions required
    • Semaphores, test-and-set
• Control tokens are required to synchronize threads
Some examples
Denelcor HEP

• Designed to tolerate latency in memory
• Fine grain interleaving of threads
• Processor pipeline contains 8 stages
• Each time step a new thread enters the pipeline
• Threads are taken from the Process Status Word (PSW) queue
• After a thread is taken from the PSW queue, its instruction and
  operands are fetched
• When an instruction is executed, another one is placed on
  the PSW queue
• Threads are interleaved at the instruction level
Denelcor HEP

• Memory latency is tolerated using the
  Scheduler Function Unit (SFU)
• Memory words are tagged as full or empty
• Attempting to read an empty word suspends the
  current thread
  – The current PSW entry is then moved to the SFU
• When the data is written, the entry is taken from the SFU
  and placed back on the PSW queue
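The full/empty discipline can be sketched with a condition variable standing in for the SFU (a behavioural model, not the hardware; class and method names are illustrative):

```python
import threading

class TaggedWord:
    """A memory word with a full/empty bit, as on the HEP.

    Reading an empty word suspends the reading thread until a
    writer fills it; writing tags the word full and wakes readers.
    """
    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def read(self):
        with self._cv:
            while not self._full:      # thread parked (in the SFU)
                self._cv.wait()
            return self._value

    def write(self, value):
        with self._cv:
            self._value = value
            self._full = True          # word tagged "full"
            self._cv.notify_all()      # suspended readers resume

word = TaggedWord()
out = []
reader = threading.Thread(target=lambda: out.append(word.read()))
reader.start()                         # suspends: the word is empty
word.write(42)                         # producer fills the word
reader.join()
print(out[0])  # → 42
```

On the real machine the suspended thread simply leaves the PSW queue, so other threads keep the pipeline busy while it waits.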
Synchronization on the HEP

• All registers have a Full/Empty/Reserved bit
• Reading an empty register causes the thread to
  be placed back on the PSW queue without
  updating its program counter
• Thread synchronization is busy-wait
  – But other threads can run
HEP Architecture

[Figure: HEP processor pipeline – the PSW queue and matching unit fetch from program memory; increment control and operand fetch (operand hands 1 and 2) read the registers; function units 1..N execute; the SFU handles traffic to/from data memory]
HEP configuration

• Up to 16 processors
• Up to 128 data memories
• Connected by a high speed switch
• Limitations
  – Threads can have only 1 outstanding memory request
  – Thread synchronization puts bubbles in the pipeline
  – Maximum of 64 threads, causing problems for software
    • Need to throttle loops
  – If parallelism is lower than 8, full utilisation is not
    possible
MIT Alewife Processor

• 512 processors in a 2-dim mesh
• Sparcle processor
• Physically distributed memory
• Logically shared memory
• Hardware supported cache coherence
• Hardware supported user level message passing
• Multi-threading
Threading in Alewife

• Coarse-grained multithreading
• The pipeline works on a single thread as long as no
  remote memory access or synchronization is required
• Can exploit register optimization in the pipeline
• Integrates multi-threading with hardware supported
  cache coherence
The Sparcle Processor

• Extension of the SUN SPARC architecture
• Tolerant of memory latency
• Fine grained synchronisation
• Efficient user level message passing
Fast context switching

• The SPARC has 8 overlapping register windows
• Sparcle uses them in pairs to represent 4 independent, non-overlapping contexts
  – 3 for user threads
  – 1 for traps and message handlers
• Each context contains 32 general purpose registers and
  – PSR (Processor State Register)
  – PC (Program Counter)
  – nPC (next Program Counter)
• Thread states
  – Active
  – Loaded
    • State stored in registers – can become active
  – Ready
    • Not suspended and not loaded
  – Suspended
• Thread switching
  – Is fast if one thread is active and the other is loaded
  – Needs to flush the pipeline (cf. HEP)
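A toy model of the four hardware contexts (field names are illustrative): switching between loaded threads only moves the context pointer, so no register state has to be saved or restored.

```python
class Sparcle:
    """Toy model of Sparcle's four register contexts."""
    def __init__(self):
        # Each context: 32 GPRs plus its own PSR, PC and nPC.
        self.contexts = [
            {"regs": [0] * 32, "PSR": 0, "PC": 0, "nPC": 4}
            for _ in range(4)          # 3 user contexts + 1 for traps
        ]
        self.CP = 0                    # context pointer: active thread

    def switch(self, ctx: int):
        """Fast context switch: just repoint CP at a loaded context."""
        assert 0 <= ctx < 4
        self.CP = ctx

cpu = Sparcle()
cpu.contexts[1]["PC"] = 0x1000         # a second, loaded thread
cpu.switch(1)                          # fast switch: no state copied
print(hex(cpu.contexts[cpu.CP]["PC"]))  # → 0x1000
```

Activating a thread that is merely ready (not loaded) is the slow path: its state must first be copied from memory into one of the four contexts.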
Sparcle Architecture

[Figure: four register contexts (0:R0–0:R31 through 3:R0–3:R31), each with its own PSR, PC and nPC; the CP register points at the context of the active thread]
MIT Alewife and Sparcle

[Figure: an Alewife node – Sparcle processor with a 64 kbyte cache and FPU; the CMMU connects cache, main memory and the network router over a 4-byte-wide path]

NR = Network router
CMMU = Communication & memory management unit
FPU = Floating point unit
From here figures are drawn by Tim
Figure 16.10 Thread states in Sparcle

[Figure: global register frames G0–G7 and four PC/PSR frames alongside register contexts 0:R0–3:R31; the CP points at the loaded context of the active thread, while ready and suspended queues in memory hold the state of unloaded threads]
Figure 16.11 Structure of a typical static dataflow PE

[Figure: a fetch unit reads enabled instructions from the activity store into an instruction queue feeding function units 1..N; an update unit writes results back to the activity store and exchanges tokens to/from other PEs]
Figure 16.12 Structure of a typical tagged-token dataflow PE

[Figure: tokens from the token queue enter a matching unit backed by the matching store; matched pairs go to a fetch unit backed by instruction/data memory, then to function units 1..N; an update unit sends result tokens back to the queue or to other PEs]
Figure 16.13 Organization of the I-structure storage

[Figure: data storage words k to k+4 with presence bits (A = Absent, P = Present, W = Waiting); a Present word holds a datum, while a Waiting word holds a deferred-read list of tags (e.g. tag X, tag Z, tag Y) terminated by nil]
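The behaviour of one I-structure word can be sketched as follows (class and method names are illustrative); the slot states mirror the A/P/W presence bits:

```python
class ISlot:
    """One I-structure word with presence bits:
    A = absent, P = present, W = waiting (reads deferred)."""
    def __init__(self):
        self.state = "A"
        self.value = None
        self.waiting = []              # deferred read continuations

    def read(self, deliver):
        if self.state == "P":
            deliver(self.value)        # value already present
        else:
            self.waiting.append(deliver)
            self.state = "W"           # park the reader on the slot

    def write(self, value):
        assert self.state != "P", "write-once semantics"
        self.value, self.state = value, "P"
        for deliver in self.waiting:   # satisfy all deferred reads
            deliver(value)
        self.waiting.clear()

slot = ISlot()
got = []
slot.read(got.append)                  # arrives before the write: deferred
assert slot.state == "W"
slot.write(99)                         # producer stores the datum
slot.read(got.append)                  # arrives after: satisfied at once
print(got)  # → [99, 99]
```

Reads that race ahead of the write are queued rather than failed, which is how I-structures keep dataflow's write-once semantics without busy-waiting.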
Figure 16.14 Coding in explicit token-store architectures (a) and (b)

[Figure: tokens <35, <FP, IP>> and <12, <FP, IP>> match and fire the “-” node; it produces 23, sent on as tokens <23, <FP, IP+1>> to the “+” node and <23, <FP, IP+2>> to the “*” node]
Figure 16.14 Coding in explicit token-store architectures (c)

[Figure: instruction memory at IP holds SUB (frame offset 2, destinations +1, +2), ADD (offset 3, destination +2) and MUL (offset 4, destination +7); frame memory slots FP+2 to FP+4 each carry a presence bit and a waiting operand (35 and 23 in the example), which firing consumes]
Figure 16.15 Structure of a typical explicit token-store dataflow PE

[Figure: tokens from other PEs enter a fetch unit that computes an effective address into frame memory; the presence bits select either a frame store operation (first operand waits) or firing (both operands ready); fired instructions go to function units 1..N, and a form token unit sends result tokens to/from other PEs]
Figure 16.16 Scale of von Neumann/dataflow architectures

Dataflow → Macro dataflow → Decoupled hybrid dataflow → RISC-like hybrid → von Neumann
Figure 16.17 Structure of a typical macro dataflow PE

[Figure: a matching unit and token queue feed a fetch unit backed by instruction/frame memory; an internal control pipeline (program counter-based sequential execution) runs each block on the function unit, and a form token unit sends results to/from other PEs]
Figure 16.18 Organization of a PE in the MIT Hybrid Machine

[Figure: the PC and FBR registers select an entry from the enabled continuation queue (token queue); instructions are fetched from instruction memory and decoded; operands are fetched from frame memory into registers; the execution unit connects to/from global memory]
Figure 16.19 Comparison of (a) SQ and (b) SCB macro nodes

[Figure: the same graph of instructions l1–l6 with inputs a, b and c, partitioned (a) into macro nodes SQ1 and SQ2 and (b) into macro nodes SCB1 and SCB2]
Figure 16.20 Structure of the USC Decoupled Architecture

[Figure: each cluster pairs DFGE and GC units, attached through RQ and AQ queues to CE and CC units; the cluster graph memories connect to a network in graph virtual space, while the CE/CC side connects to a network in computation virtual space]
Figure 16.21 Structure of a node in the SAM

[Figure: a SAM node built from the APU, SEU, ASU and LEU around a main memory; “fire” and “done” signals coordinate the units, and the node connects to/from the network]
Figure 16.22 Structure of the P-RISC processing element

[Figure: a start unit and token queue feed an internal control pipeline (a conventional RISC processor: instruction fetch, operand fetch, function unit and load/store, operand store) backed by local instruction and frame memory; load/store messages travel to/from other PEs’ memory]
Figure 16.23 Transformation of dataflow graphs into control flow graphs: (a) dataflow graph, (b) control flow graph

[Figure: the dataflow nodes +, - and * become sequential instructions linked by fork L1, label L1: and joins; each join counts the control tokens that must arrive before execution continues]
Figure 16.24 Structure of a *T node

[Figure: a network interface with message queues and a message formatter links the node to the network; a remote memory request coprocessor and a synchronization coprocessor (registers sIP, sFP, sV1, sV2) serve incoming messages, while the data processor (registers dIP, dFP, dV1, dV2) runs threads from a continuation queue of <IP, FP> pairs; all share the local memory]