POD: A Parallel-On-Die Architecture

advertisement
POD: A Parallel-On-Die Architecture
Dong Hyuk Woo
Joshua B. Fryman
Allan D. Knies
Marsha Eng
Hsien-Hsin S. Lee
Georgia Tech
Intel Corp.
Intel Berkeley Research
Intel Corp.
Georgia Tech
What So New with ‘Many-Core’?
• On Chip!
– New definition of scalability!
Performance Scalability
Performance Scalability
- Minimize communication
- Minimize communication
- Hide communication latency
- Hide communication latency
Area Scalability
- How many cores on a die?
Power Scalability
- How much power per core?
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
2
Scalable Power Consumption?
Technology (nm)
65
45
32
22
16
# of cores
4
8
16
32
64
Frequency (GHz)
3.0
3.0
3.0
3.0
3.0
Vdd (V)
1.3
1.1
0.9
0.8
0.7
Power / core (W)
37.5
26.8
18.0
14.2
10.9
Total power (W)
150
215
288
454
696
* The above pictures do not represent any current or future Intel products.
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
3
Thousand Cores, Power Consumption?
?
?
?
? ?
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
4
Some Known Facts on Interconnection
• One main energy hog!
Interconnection Network related
– Alpha 21364:
• 20% (25W / 125W)
– MIT RAW:
• 40% of individual tile power
– UT TRIPS:
• 25%
• Mesh-based MIMD many-core?
– Unsustainable energy in its router logic
– Unpredictable communication pattern
• Unfriendly to low-power techniques
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
5
POD: A Parallel-On-Die Architecture
Design Goals
• Broad-purpose acceleration
– Scalable vector performance
• Energy and area efficiency
• Low latency inter-PE communication
• Best-in-class scalar performance
• Backward compatibility
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
7
Host Processor
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
8
Processing Elements (PEs)
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
9
Instruction Bus (IBus)
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
10
ORTree: Status Monitoring
Flag status
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
11
Data Return Buffer (DRB)
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
12
2D Torus P2P Links
0
7
1
6
2
5
3
4
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
13
2D Torus P2P Links
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
14
RRQ & MBus: DMA
Row
Response
Queue
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Memory
Bus
15
System memory
On-chip MC / Ring
Memory Controller
RRQ
DRB
RRQ
DRB
RRQ
DRB
MC
RRQ
DRB
MC
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
Arbitrator
ARB
LLC
xTLB
Host Processor
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
16
Wire Delay Aware Design
RRQ
D
RRQ
D
D
D
RRQ
D
D
D
RRQ
D
D
D
RRQ
D
D
D
RRQ
D
D
D
RRQ
D
D
D
RRQ
D
D
EFLAGS
! MemFence
IBUS
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
17
Simplified PE Pipelining
NSEW
NSEW
96-bit VLIW instruction
G
DEC / RF
X
Local
SRAM
M
MBUS
80
78
IBUS
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
18
Physical Design Evaluation
SIMD
DL1
iBTB
BTB
hist
L
G s/f
Fetch
B
AGU
FP
DTLB tags
PF
s/f LQ SQ
Bypass
Decode
IL1
UCode
ROM
TLB Tags
IFQ
INT
prefetch
BAC
RAT
Decode uopQ
Decode
Rename Alloc
Decode
(Conplex)
RS
ROB
Commit
ARF
Intel Conroe Core 2 Duo processor die photo
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
19
Physical Design Evaluation
POD
PE
Intel Conroe Core 2 Duo processor die photo
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
20
Physical Design Evaluation @ 45nm
• PE
– 128KB dual-ported SRAM, Basic GP ALU, simplified SSE ALU
– ~3.2 mm2
• RRQ
– ~3.2 mm2 (conservative estimation)
• The host processor
– Core-based microarchitecture
– ~25.9 mm2
RRQ
DRB
RRQ
DRB
RRQ
DRB
MC
RRQ
DRB
MC
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
• 3MB LLC
– ~20 mm2
• Total: ~276.5 mm2
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
ARB
LLC
xTLB
Host Processor
21
Interconnection Fabric Among PEs
• Different links for different communication
– IBus / Mbus / 2D torus P2P links / ORTree
– No contention among different communication
• Fully synchronized and orchestrated by the
host processor (and RRQ for MBus)
–
–
–
–
No crossbar
No contention / arbitration
No buffer space for the router
Nothing unpredictable!
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
22
Design Space Exploration
• Baseline Design
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
23
Design Space Exploration
• 3D POD with many-core
Die 0
Die 1
You live and work
here
You get your beer here
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
24
Design Space Exploration
• 3D many-POD processor
Die 0
Die 1
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
25
Programming Model
Programming Model
• Example code: Reduction
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
N 1
S   ai
i 0
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
27
Simulation Result
Performance Result
860.3
GFLOPS
869.3
GFLOPS
Speedup
113.6
GFLOPS
91.4
GFLOPS
504.6
GFLOPS
64
134.8
GFLOPS
16
4
DenseMMM
(SP)
FFT
(SP)
1x1 POD
IDCT
(DP)
2x2 POD
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Option
Pricing
(SP)
4x4 POD
Down
Sampling
(SP)
K-means
(SP)
8x8 POD
29
Performance per mm2
14
Homogeneous
many-core
(ideal)
Performance per mm^2
12
10
8
6
4
2
DenseMMM
FFT
IDCT
1x1 POD
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Option
Pricing
2x2 POD
Down
Sampling
4x4 POD
K-means
8x8 POD
1 core
4 cores
16 cores
64 cores
0
30
10
DenseMMM
FFT
IDCT
1x1 POD
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Option
Pricing
2x2 POD
Down
Sampling
4x4 POD
K-means
8
1 core
4 cores
16 cores
64 cores
p=
0.
p= 125
0.
p= 250
0.
p= 500
0.
p= 125
0.
p= 250
0.
5
p= 00
0.
p= 125
0.
p= 250
0.
p= 500
0.
1
p= 25
0.
p= 250
0.
p= 500
0.
p= 125
0.
p= 250
0.
p= 500
0.
p= 125
0.
p= 250
0.
50
0
Performance per Watt
Performance per Watt
Homogeneous
many-core
(ideal)
6
4
2
0
8x8 POD
31
Performance per Joule
600
DenseMMM
FFT
IDCT
Option
Pricing
Down
Sampling
K-means
Performance per Joule
500
Homogeneous
many-core
(ideal)
400
300
200
100
1x1 POD
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
2x2 POD
4x4 POD
1 core
4 cores
16 cores
64 cores
p=
0.
p= 125
0.
p= 250
0.
p= 500
0.
p= 125
0.
p= 250
0.
5
p= 00
0.
p= 125
0.
p= 250
0.
p= 500
0.
p= 125
0.
2
p= 50
0.
p= 500
0.
p= 125
0.
p= 250
0.
5
p= 00
0.
p= 125
0.
p= 250
0.
50
0
0
8x8 POD
32
Interconnection Power
Input buffer
X-bar
arbiter
Link
RAW
31%
30%
~0%
39%
TRIPS
35%
33%
1%
31%
Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
33
2D Torus P2P Link Active Time
6%
DenseMMM
FFT
IDCT
N E W S
N E W S
N E W S
OptionPricing DownSampling
K-means
5%
Active Time
4%
3%
2%
1%
0%
1x1 POD
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
2x2 POD
N E W S
4x4 POD
N E W S
N E W S
8x8 POD
34
Conclusion
• We propose POD
–
–
–
–
Parallel-On-Die many-core architecture
Broad-purpose acceleration
High energy and area efficiency
Backward compatibility
RRQ
DRB
RRQ
DRB
RRQ
DRB
MC
RRQ
DRB
MC
RRQ
DRB
RRQ
DRB
RRQ
DRB
RRQ
DRB
ARB xTLB
LLC
Host Processor
• A single-chip POD can achieve up to 1.5
TFLOPS of IEEE SP FP operations.
• Interconnection architecture of POD is
extremely power- and area-efficient.
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
35
Let’s Use Power to Compute,
Not Commute!
Georgia Tech
ECE MARS Labs
http://arch.ece.gatech.edu
Back-up Slides
PODSIM
PE pipelining
Native
Memory subsystem
Accurate modeling of on-chip, off-chip bandwidth
x86
machine
PODSIM
Limitation:
Host processor overhead not modeled (I$ miss, branch misprediction)
LLC takeover of the MC ring by the host
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
38
Why multi-processing again?
• No more faster processor
– Power wall
– Increasing wire delay
– Diminishing return of deeply pipelined architecture
• No more illusion of a sequential processor
– Design complexity
– Diminishing return of wide-issue superscalar
processors
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
39
Out-of-order Host Issues
• Speculative execution
– No recovery mechanism in each PE
– Broadcast POD instructions in a non-speculative
manner
• As long as host-side code does not depend on the results
from POD, this approach does not affect the performance.
• Instruction Reordering
– POD instructions are strongly ordered.
• IBits queue
– Similar to the store queue
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
40
x86 Modification
• 5 new instructions:
– sendibits, getort, drainort, movl, getdrb,
• 3 modified instructions:
– lfence, sfence, mfence
• 4 possible new co-processor registers or else mem-mapped
addrs:
– ibits register, ort status, drb value, xTLB support
• IBits queue
• All other modifications external to the x86 core
– Arbiter for LLC priority can be external to LLC
• Deferred Design Issue: LLC-POD Coherence
– v1 Study: uncachable
– v2 Prototype: LLC Snoops on A/D rings
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
41
POD ISA
• “Instructions” are 3-wide VLIW bundles,
– One of each G,X,M (32 bits each)
• Integer Pipe (G):
– and,or,not,xor,shl,shr,add,sub,mul,cmp,cmov,xfer,xferdrb : !div
• SSE Pipe (X):
– pfp{fma,add,sub,mul,div,max,min},pi{add,sub,mul,and,or,not,c
mp},shuffle,cvt,xfer : !idiv
• Memory Pipe (M):
– ld,st,blkrd/wr,mask{set,pop,push},mov,xfer
• Hybrid (G+X+M):
– movl
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
42
Power Scalability!
Relative Performance Per Joule
14
12
10
8
6
4
2
0
0
10
20
30
40
50
60
Relative Chip Power Budget
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
43
Some Known Facts on Interconnection
• MIT RAW
– ~ 40% of die area
Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose
Programs”, IEEE MICRO Mar/Apr 2002
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
44
Some Known Facts on Interconnection
• One main energy hog!
– Alpha 21364:
• 20% (25W / 125W)
– MIT RAW:
• 40% of individual tile power
– UT TRIPS:
• 25%
Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose
Programs”, IEEE MICRO Mar/Apr 2002
Wang et al., “Orion: A Power-Performance Simulator for Interconnection Networks”, MICRO 2002
Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003
Peh, “Chip-Scale Power Scaling”, http://www.princeton.edu/~peh/talks/stanford_nws.pdf
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
45
Some Known Facts on Interconnection
• Mesh-based MIMD many-core?
– Unsustainable energy in its router logic
– Unpredictable communication pattern
• Difficult to apply common low-power techniques
Bokar, “Networks for Multi-core Chip—A Controversial View”, 2006 Workshop on On- and Off-Chip
Interconnection Networks for Multicore Systems
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
46
Programming Model
Programming Model
• Example code: Reduction
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
N 1
S   ai
i 0
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
48
Programming Model (2x2 POD)
• Malloc memory at the host memory space
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
49
Programming Model (2x2 POD)
• Initialize ai
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
base_addr: 16
16: 0
17: 0
18: 0
19: 0
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
50
Programming Model (2x2 POD)
• Load the address of my PE ID
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
51
Programming Model (2x2 POD)
• Load my PE ID
r15 2
r15 2
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
r15 2
r15 2
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
52
Programming Model (2x2 POD)
• Broadcast base_addr
r15 2
r15 3
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
r15 0
r15 1
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
53
Programming Model (2x2 POD)
• Calculate the pointer to my ai
r14 16
r15 2
r14 16
r15 3
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
r14 16
r15 0
r14 16
r15 1
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
54
Programming Model (2x2 POD)
• Load my ai from the host memory space
r14 16
r15 18
r14 16
r15 19
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
r14 16
r15 16
r14 16
r15 17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
55
Programming Model (2x2 POD)
• Wait until it is loaded
r14 16
r15 18
r14 16
r15 19
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
r14 16
r15 16
r14 16
r15 17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
56
Programming Model (2x2 POD)
• Update S
r1
3
r1
4
r14 16
r15 18
r14 16
r15 19
r1
r1
1
r14 16
r15 16
2
r14 16
r15 17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
57
Programming Model (2x2 POD)
• 1st iteration
r1
r2
r14
r15
3
3
16
18
r1
r2
r14
r15
4
4
16
19
r1
r2
r14
r15
1
1
16
16
r1
r2
r14
r15
2
2
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
58
Programming Model (2x2 POD)
• Transfer my ai to my east-side neighbor
r1
r2
r14
r15
3
3
16
18
r1
r2
r14
r15
4
4
16
19
r1
r2
r14
r15
1
1
16
16
r1
r2
r14
r15
2
2
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
59
Programming Model (2x2 POD)
• Update S
r1
r2
r14
r15
3
3
16
18
r1
r2
r14
r15
4
4
16
19
r1
r2
r14
r15
1
1
16
16
r1
r2
r14
r15
2
2
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
60
Programming Model (2x2 POD)
• Copy rowsum to r1
r1
r2
r14
r15
4
7
16
18
r1
r2
r14
r15
3
7
16
19
r1
r2
r14
r15
2
3
16
16
r1
r2
r14
r15
1
3
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
61
Programming Model (2x2 POD)
• 1st iteration
r1
r2
r14
r15
7
7
16
18
r1
r2
r14
r15
7
7
16
19
r1
r2
r14
r15
3
3
16
16
r1
r2
r14
r15
3
3
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
62
Programming Model (2x2 POD)
• Transfer my rowsum to my north-side neighbor
r1
r2
r14
r15
7
7
16
18
r1
r2
r14
r15
7
7
16
19
r1
r2
r14
r15
3
3
16
16
r1
r2
r14
r15
3
3
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
63
Programming Model (2x2 POD)
• Update S
r1
r2
r14
r15
7
7
16
18
r1
r2
r14
r15
7
7
16
19
r1
r2
r14
r15
3
3
16
16
r1
r2
r14
r15
3
3
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
64
Programming Model (2x2 POD)
• Done!
r1
r2
r14
r15
3
10
16
18
r1
r2
r14
r15
3
10
16
19
r1
r2
r14
r15
7
10
16
16
r1
r2
r14
r15
7
10
16
17
base_addr: 16
16: 1
17: 2
18: 3
19: 4
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
char* base_addr = (char*) malloc(npe);
for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;
}
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1
$ASM add r2 = r2, r1
}
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1
$ASM add r2 = r2, r1
}
65
Programming Model (2x2 POD)
• Broadcast the pointer to ‘result’
int result;
r1
r2
r14
r15
3
10
16
18
r1
r2
r14
r15
3
10
16
19
r1
r2
r14
r15
7
10
16
16
r1
r2
r14
r15
7
10
16
17
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
66
Programming Model (2x2 POD)
• Only PE0 will write-back!
int result;
r1
r2
r14
r15
3
10
128
18
r1
r2
r14
r15
3
10
128
19
r1
r2
r14
r15
7
10
128
16
r1
r2
r14
r15
7
10
128
17
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
67
Programming Model (2x2 POD)
• Load PE_ID
int result;
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
7
10
128
2
r1
r2
r14
r15
7
10
128
2
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
68
Programming Model (2x2 POD)
• Compare PE_ID
int result;
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
3
10
128
3
r1
r2
r14
r15
7
10
128
0
r1
r2
r14
r15
7
10
128
1
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
69
Programming Model (2x2 POD)
• Enable PE only when equal to zero
int result;
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
3
10
128
3
r1
r2
r14
r15
7
10
128
0
r1
r2
r14
r15
7
10
128
1
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
70
Programming Model (2x2 POD)
• Write-back
int result;
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
3
10
128
3
r1
r2
r14
r15
7
10
128
0
r1
r2
r14
r15
7
10
128
1
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
71
Programming Model (2x2 POD)
• Enable ALL PEs
int result;
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
3
10
128
3
r1
r2
r14
r15
7
10
128
0
r1
r2
r14
r15
7
10
128
1
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
72
Programming Model (2x2 POD)
• Wait until the result is written-back
int result;
r1
r2
r14
r15
3
10
128
2
r1
r2
r14
r15
3
10
128
3
r1
r2
r14
r15
7
10
128
0
r1
r2
r14
r15
7
10
128
1
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15
$ASM popmask
POD_mfence();
base_addr: 16
16: 1
17: 2
18: 3
19: 4
POD_movl( r14, &result );
128: <- addr of ‘result’
Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
73
Download