
Memory Organization and Data Layout
for Instruction Set Extensions with
Architecturally Visible Storage
Panagiotis Athanasopoulos (EPFL)
Philip Brisk (UCR)
Yusuf Leblebici (EPFL)
Paolo Ienne (EPFL)
École Polytechnique Fédérale de Lausanne (EPFL)
University of California, Riverside (UCR)
First_name.Second_name@{epfl.ch|ucr.edu}
Motivation
 Classic Challenge
 Increase performance while keeping area and cost constrained
 Typical solutions
 Customizable and extensible processors
 Instruction set extension (ISE)
 Custom functional units (CFU)
 Architecturally visible storage (AVS)
Typical embedded application extract
2D DCT 8x8 Matrix
Pseudo: dct {
    for (int i = 0; i < num_of_rows; i++) {
        /* ... 1D DCT slice on row i ... */
    }
    for (int j = 0; j < num_of_columns; j++) {
        /* ... 1D DCT slice on column j ... */
    }
}
Typical embedded application extract
2D DCT 8x8 Matrix
for (int i = 0; i < num_of_rows; i++) {
    /* ... 1D DCT slice on row i ... */
}
Row accesses: a 1D DCT slice covers one row of the matrix
(i,j: data accessed in row i, column j)
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
Typical embedded application extract
2D DCT 8x8 Matrix
for (int j = 0; j < num_of_columns; j++) {
    /* ... 1D DCT slice on column j ... */
}
Column accesses: a 1D DCT slice covers one column of the matrix
(i,j: data accessed in row i, column j)
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
Speeding up the execution
 ISE
 Extend the basic processor instruction set with a new
instruction: DCT_instr
 CFU
 Assign the execution of the new instruction to a
dedicated unit
Reasonable ISE/CFU implementation
Pseudo: dct {
    DCT_instr(0,1,2,...,7)
    DCT_instr(8,9,10,...,15)
    ...
    DCT_instr(56,57,58,...,63)
    DCT_instr(0,8,16,...,56)
    DCT_instr(1,9,17,...,57)
    ...
    DCT_instr(7,15,23,...,63)
}
16 executions
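The 16 operand lists follow from a row-major numbering of the 8x8 block. A minimal sketch, assuming element (i, j) has linear index 8*i + j; the helper name `dct_instr_operands` is hypothetical and does not come from the slides:

```python
N = 8  # the slides use an 8x8 block

def dct_instr_operands(n=N):
    """Return the operand index lists: n row slices, then n column slices."""
    # Assumption: element (i, j) is numbered n*i + j (row-major numbering).
    rows = [[n * i + j for j in range(n)] for i in range(n)]
    cols = [[n * i + j for i in range(n)] for j in range(n)]
    return rows + cols

ops = dct_instr_operands()
for operands in ops:
    print("DCT_instr(" + ",".join(map(str, operands)) + ")")
```

The first eight lists are the row pass and the last eight are the column pass, which gives the 16 executions counted on the slide.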
Speeding up the execution
 Memory bandwidth
 Usually limited to 2 read/write ports
 Caches, scratchpads, architecturally visible storage
 Area grows quadratically with the number of ports [ref]
 The new instruction's latency increases, because it cannot execute until all of its data is available
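The trade-off can be put in rough numbers. This is only a sketch under two assumptions that are not in the slides: each port moves one word per cycle, and area is taken as exactly the square of the port count with all constants dropped (the quadratic trend itself is the slide's claim). Both function names are hypothetical.

```python
import math

def fetch_cycles(operands, ports):
    """Cycles to move `operands` words through `ports` ports, one word per port per cycle."""
    return math.ceil(operands / ports)

def relative_area(ports):
    """Area relative to a single-port memory under a pure quadratic model (assumption)."""
    return ports ** 2

# One 1D DCT slice reads 8 operands.
for ports in (1, 2, 4, 8):
    print(ports, "ports:", fetch_cycles(8, ports), "cycles,", relative_area(ports), "x area")
```

Under these assumptions, a 2-port memory needs 4 cycles just to read the operands of one slice, while 8 ports would cost 64 times the single-port area.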
Speeding up the execution
 Ideally
 8 read and 8 write ports
 Minimum area
 Full bandwidth utilization
 Can we achieve this?
Speeding up the execution
 Minimum Area
 What is the minimum memory organization for 64 elements with 8 read and 8 write ports?
 8 individual single-port memory arrays of 8 words each (flip-flops)
Speeding up the execution
 Full bandwidth utilization
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
(two 1D DCT slices are marked in the figure)
Row Major Order
Good for row accesses
Bad for column accesses
Speeding up the execution
 Full bandwidth utilization
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3
0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4
0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5
0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6
0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7
(two 1D DCT slices are marked in the figure)
Column Major Order
Good for column accesses
Bad for row accesses
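Why each order favors one access direction can be checked by counting how many words a single access needs from the same memory. A minimal sketch, assuming that each of the 8 single-port memories (called banks here) delivers one word per cycle, that the row-major figure places element (i, j) in bank j, and that the column-major figure places it in bank i. These bank mappings are read off the figures and are not stated on the slides.

```python
from collections import Counter

N = 8  # 8x8 block, 8 single-port banks

def row_major_bank(i, j):
    # Assumption: in the row-major figure, element (i, j) sits in bank j.
    return j

def col_major_bank(i, j):
    # Assumption: in the column-major figure, element (i, j) sits in bank i.
    return i

def access_cycles(bank_of, cells):
    """Cycles to read `cells` if every bank delivers one word per cycle."""
    return max(Counter(bank_of(i, j) for i, j in cells).values())

row = [(3, j) for j in range(N)]  # one row slice
col = [(i, 3) for i in range(N)]  # one column slice

for name, bank_of in (("row-major", row_major_bank), ("column-major", col_major_bank)):
    print(name, "row:", access_cycles(bank_of, row), "column:", access_cycles(bank_of, col))
```

In this model the favorable direction takes a single cycle and the other direction serializes all 8 reads on one bank.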
Speeding up the execution
 Full bandwidth utilization
 Is there a data layout that allows row and column accesses with the same latency?
 Not with the existing organization
 What if we relax the requirements by tolerating misaligned data?
 Introduce alignment layers
 A form of register clustering that is cheap! [RWTH ICCAD’07]
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,1 1,0 1,3 1,2 1,5 1,4 1,7 1,6
2,2 2,3 2,0 2,1 2,6 2,7 2,4 2,5
3,3 3,2 3,1 3,0 3,7 3,6 3,5 3,4
4,4 4,5 4,6 4,7 4,0 4,1 4,2 4,3
5,5 5,4 5,7 5,6 5,1 5,0 5,3 5,2
6,6 6,7 6,4 6,5 6,2 6,3 6,0 6,1
7,7 7,6 7,5 7,4 7,3 7,2 7,1 7,0
(one 1D DCT slice is marked in the figure)
Crossbar
DCT Logic
Crossbar
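The grid above is consistent with storing element (i, j) at address i of bank i XOR j. That reading is an inference from the figure, not something the slides state, and the function names below are hypothetical. Under that assumption, the sketch checks that every row and every column touches all 8 banks exactly once, and shows the reordering that an alignment layer (crossbar) would have to perform for a column read.

```python
N = 8

def bank(i, j):
    """Bank holding element (i, j) under the assumed XOR-skewed layout."""
    return i ^ j

# layout[a][m] is the element stored at address a of bank m.
layout = [[(a, m ^ a) for m in range(N)] for a in range(N)]

def banks_touched(cells):
    return {bank(i, j) for i, j in cells}

rows_ok = all(len(banks_touched([(i, j) for j in range(N)])) == N for i in range(N))
cols_ok = all(len(banks_touched([(i, j) for i in range(N)])) == N for j in range(N))

def read_column(j):
    """Read column j from all banks at once, then undo the skew."""
    raw = [layout[m ^ j][m] for m in range(N)]  # bank m is read at address m ^ j
    return [raw[i ^ j] for i in range(N)]       # output slot i takes the word from bank i ^ j

print(rows_ok, cols_ok)
print(read_column(3))
```

The price of the conflict-free layout is visible in `read_column`: the words arrive permuted, which is the misalignment that the crossbars on either side of the DCT logic correct.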
Memory Area Comparison
[Figure: memory area comparison chart, area in mm²]
Methodology
 Optimizing the memory system
 Enumerate Memories
 Memory Organization
 Cost Estimation
 Data Layout: Limitedly Improper Constrained Color Assignment (LICCA)
 Alignment Layer
LICCA Formulation
 Input:
 Graph G = (V, E, I)
 Vertices V = {v_0, ..., v_{n-1}}
 Edges E = {e_0, ..., e_{m-1}}
 Set of sets of vertices I = {I_0, ..., I_{L-1}}
 Where:
E = {(v_x, v_y) | ∃ I_j ∈ I such that v_x ∈ I_j and v_y ∈ I_j}
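The edge set is fully determined by the sets in I: two vertices are adjacent exactly when some set contains both. A minimal sketch of that construction, with vertices written as integers; the function name `edges_from_sets` is hypothetical, and the three sets are the ones used in the LICCA example later in the deck.

```python
from itertools import combinations

def edges_from_sets(access_sets):
    """Return all unordered vertex pairs that share at least one set I_j."""
    edges = set()
    for s in access_sets:
        # Every pair inside one set becomes an edge; sorting gives a canonical order.
        edges.update(combinations(sorted(s), 2))
    return edges

I = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]
E = edges_from_sets(I)
print(sorted(E))
```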
LICCA Formulation
 Solution:
 Assignment of colors to vertices
 A function f: V → {0, ..., k-1}, subject to:
 A maximum of n_i vertices can receive color i, 0 ≤ i ≤ k-1; that is, |{v ∈ V | f(v) = i}| ≤ n_i
 For each set I_j ∈ I, at most a_i vertices of I_j can receive color i
 Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting I = {{v_x, v_y} | (v_x, v_y) ∈ E} and, for 0 ≤ i ≤ k-1, n_i = |V| and a_i = 1
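The two constraints translate directly into a checker. This is a sketch, not the authors' implementation; the name `is_legal_licca` and the dictionary and list representation are assumptions made for illustration. The usage lines replay the reduction from k-colorability on a triangle: with one set per edge, n_i = |V| and a_i = 1, a coloring is legal exactly when it is a proper coloring.

```python
from collections import Counter

def is_legal_licca(f, access_sets, n, a):
    """f: dict vertex -> color; n[i]: capacity of color i; a[i]: per-set limit of color i."""
    totals = Counter(f.values())
    if any(totals[c] > n[c] for c in totals):       # capacity constraint
        return False
    for s in access_sets:
        per_set = Counter(f[v] for v in s)
        if any(per_set[c] > a[c] for c in per_set):  # per-set constraint
            return False
    return True

# Reduction from 3-colorability on a triangle: one set per edge.
triangle_sets = [{0, 1}, {1, 2}, {0, 2}]
n = [3, 3, 3]  # n_i = |V|
a = [1, 1, 1]  # a_i = 1

print(is_legal_licca({0: 0, 1: 1, 2: 2}, triangle_sets, n, a))  # proper coloring
print(is_legal_licca({0: 0, 1: 0, 2: 1}, triangle_sets, n, a))  # v_0 and v_1 clash
```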
LICCA Relation to the problem
 Relation to the problem:
 An edge e = (v_x, v_y) indicates that v_x and v_y are read in the same cycle
 Each set of vertices I_j ∈ I is a set of vertices that are read in parallel
 k is the number of memories
 n_i is the capacity of the i-th memory
 a_i is the number of read/write ports of the i-th memory
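As an illustration of this mapping, the 8x8 DCT can be written as a LICCA instance: 64 vertices, 16 access sets (8 rows and 8 columns), and k = 8 memories with n_i = 8 and a_i = 1. The sketch below assumes these parameters and tests two candidate colorings. The XOR coloring is an assumed reading of the skewed layout figure, and all names are hypothetical.

```python
from collections import Counter

N = 8
cells = [(i, j) for i in range(N) for j in range(N)]

# One access set per row slice and one per column slice.
access_sets = ([[(i, j) for j in range(N)] for i in range(N)] +
               [[(i, j) for i in range(N)] for j in range(N)])

k = N          # number of memories
n = [N] * k    # each memory holds 8 words
a = [1] * k    # each memory has a single port

def is_legal(f):
    totals = Counter(f[v] for v in cells)
    if any(totals[c] > n[c] for c in totals):
        return False
    return all(count <= a[c]
               for s in access_sets
               for c, count in Counter(f[v] for v in s).items())

row_major = {(i, j): j for i, j in cells}      # assumption: memory index = column
xor_skew = {(i, j): i ^ j for i, j in cells}   # assumption: memory index = i XOR j

print("row-major legal:", is_legal(row_major))
print("XOR-skewed legal:", is_legal(xor_skew))
```

The row-major coloring fails because a column slice puts 8 vertices of one set into the same single-port memory, while the skewed coloring satisfies both constraints.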
LICCA Example
 V = {v_0, v_1, v_2, v_3, v_4, v_5}
 I_0 = {v_0, v_1, v_2}
 I_1 = {v_3, v_4, v_5}
 I_2 = {v_0, v_2, v_5}
 E = {(v_0,v_1), (v_0,v_2), (v_0,v_5), (v_1,v_2), (v_2,v_5), (v_3,v_4), (v_3,v_5), (v_4,v_5)}
[Figure: graph G on the vertices v_0 to v_5]
 Legal k-coloring?
 Legal LICCA coloring?
LICCA Example
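Access sets: I_0 = {v_0, v_1, v_2}, I_1 = {v_3, v_4, v_5}, I_2 = {v_0, v_2, v_5}
Memory M_0: n_0 = 4, a_0 = 2
Memory M_1: n_1 = 2, a_1 = 1
[Figure: assignment of the vertices v_0 to v_5 to the memories M_0 and M_1]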
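Because the example has only six vertices and two memories, every assignment can be enumerated. The sketch below assumes color 0 is M_0 (n_0 = 4, a_0 = 2) and color 1 is M_1 (n_1 = 2, a_1 = 1), and writes vertices as integers. It lists the legal LICCA colorings and, for contrast, looks for proper 2-colorings of G. It does not claim to reproduce the particular assignment drawn on the slide.

```python
from collections import Counter
from itertools import combinations, product

I = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]
n = [4, 2]  # capacities of M_0 and M_1
a = [2, 1]  # ports of M_0 and M_1

def is_legal(f):
    """f is a tuple: f[v] is the memory (color) assigned to vertex v."""
    totals = Counter(f)
    if any(totals[c] > n[c] for c in totals):
        return False
    return all(cnt <= a[c]
               for s in I
               for c, cnt in Counter(f[v] for v in s).items())

legal = [f for f in product(range(2), repeat=6) if is_legal(f)]

# Edges of G: all pairs that share an access set.
E = {pair for s in I for pair in combinations(sorted(s), 2)}
proper = [f for f in product(range(2), repeat=6)
          if all(f[x] != f[y] for x, y in E)]

for f in legal:
    print("M_1 holds:", [v for v in range(6) if f[v] == 1])
print("proper 2-colorings:", len(proper))
```

There are five legal LICCA assignments, each placing one vertex of I_0 and one of I_1 in M_1, whereas no proper 2-coloring exists because v_0, v_1 and v_2 form a triangle.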
Comparison Example
[Block diagram, labels: Baseline Processor; RF; Baseline Processor Ports; Main Memory (DMA); AVS (single/dual-port memory or 8x8 non-clustered RF); Memory Decoder; ISE Logic]
Comparison Example
[Block diagram, labels: Baseline Processor; RF; Baseline Processor Ports; Main Memory (DMA); AVS (8x8 clustered RF); Memory Decoder; Alignment Layer Decoders; ISE Logic between two Alignment Layers]
Comparison Example
 2D DCT, 8x8 Matrix
 DCT row/column slice vs. 2-point
 8x8 clustered RF vs. single-port memory
 150 MHz
 2D FFT, 8x8 Matrix
 12 butterflies vs. 1 butterfly
 8x8 clustered RF vs. single-port memory
 150 MHz
Comparison Example
 2D DCT, 8x8 Matrix
[Figure: comparison chart; annotations: 3x and 8x]
Comparison Example
 2D FFT, 8x8 Matrix
[Figure: area in mm² (axis from 0 to 0.8) for the 12-butterfly FFT and the 1-butterfly FFT; annotations: 2.5x and 12x]
Conclusion
 Methodology to efficiently increase bandwidth to AVS-enhanced ISEs
 LICCA
 Memory System Optimization
Future Work
 Commutativity
 LICCA Extension for multiple ISEs and shift registers
[Figure: data layouts for a 4x4 matrix; label: 4x4 non-clustered RF]
References
Thank you!
Questions?