Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage

Panagiotis Athanasopoulos (EPFL), Philip Brisk (UCR), Yusuf Leblebici (EPFL), Paolo Ienne (EPFL)
École Polytechnique Fédérale de Lausanne (EPFL), University of California, Riverside (UCR)
First_name.Second_name@{epfl.ch|ucr.edu}

Motivation
Classic challenge: increase performance while staying area/cost constrained.
Typical solutions:
- Customizable and extensible processors
- Instruction set extensions (ISE)
- Custom functional units (CFU)
- Architecturally visible storage (AVS)

Typical embedded application extract: 2D DCT on an 8x8 matrix
Pseudo-code:
    dct {
        for (int i = 0; i < num_of_rows; i++) {
            ... 1D DCT slice ...
        }
        for (int j = 0; j < num_of_columns; j++) {
            ... 1D DCT slice ...
        }
    }

The first loop performs one 1D DCT slice per row: row accesses.
[Figure: 8x8 matrix of elements (i,j), where (i,j) denotes the data in row i, column j; each slice of the first loop reads an entire row.]

The second loop performs one 1D DCT slice per column: column accesses.
[Figure: the same 8x8 matrix; each slice of the second loop reads an entire column.]

Speeding up the execution
- ISE: extend the base processor instruction set with a new instruction, DCT_instr.
- CFU: assign the execution of the new instruction to a dedicated unit.

Reasonable ISE/CFU implementation
    dct {
        DCT_instr(0, 1, 2, ..., 7)
        DCT_instr(8, 9, 10, ..., 15)
        ...
        DCT_instr(56, 57, 58, ..., 63)
        DCT_instr(0, 8, 16, ..., 56)
        DCT_instr(1, 9, 17, ..., 57)
        ...
        DCT_instr(7, 15, 23, ..., 63)
    }
16 executions of the new instruction in total (8 row slices plus 8 column slices).

Speeding up the execution: memory bandwidth
- Memory bandwidth is usually limited to 2 read/write ports (caches, scratchpads, architecturally visible storage).
- Area grows roughly quadratically with the number of ports [ref].
- The new instruction suffers increased latency until all of its data is available.

Speeding up the execution: ideally
- 8 read and 8 write ports
- Minimum area
- Full bandwidth utilization
Can we achieve this?

Minimum area
What is the minimum memory organization for 64 elements with 8 read and 8 write ports?
Eight individual single-port memory arrays of 8 words each (flip-flop based).

Full bandwidth utilization: row-major order
[Figure: the 8x8 matrix laid out in row-major order across the eight banks.]
Good for row accesses, bad for column accesses: each row spreads its eight elements over the eight single-port banks, while a column's eight elements all land in the same bank.

Full bandwidth utilization: column-major order
[Figure: the 8x8 matrix laid out in column-major order across the eight banks.]
Good for column accesses, bad for row accesses.

Full bandwidth utilization
Is there a data layout that allows row and column accesses with the same latency?
Not with the existing organization.
What if we relax the requirements by ignoring the misalignment of the data?
Introduce alignment layers: a form of register clustering that is cheap [RWTH ICCAD'07].
[Figure: a permuted 8x8 layout in which element (i,j) is stored in bank i XOR j, so every row and every column spans all eight banks.]

[Figure: the DCT logic sits between two crossbars (alignment layers) that realign the permuted operands on read and on write-back.]
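A minimal sanity-check sketch, assuming the bank mapping read off the permuted-layout figure above: element (i,j) sits in bank i XOR j (its local address can simply be i). The C code below checks that every operand set a DCT_instr would issue, eight row slices followed by eight column slices, touches eight distinct single-port banks; bank_of and conflict_free are illustrative names, not part of the design.

    #include <stdio.h>

    #define N 8  /* 8x8 matrix, eight single-port banks of eight words each */

    /* Assumed bank mapping, read off the permuted-layout figure:
     * element (i,j) is stored in bank i ^ j (its local address can be i). */
    static int bank_of(int i, int j) { return i ^ j; }

    /* An eight-element slice is conflict-free iff its elements
     * fall into eight distinct single-port banks. */
    static int conflict_free(const int rows[N], const int cols[N])
    {
        int used[N] = { 0 };
        for (int k = 0; k < N; k++) {
            int b = bank_of(rows[k], cols[k]);
            if (used[b])
                return 0;   /* two operands would need the same bank in one cycle */
            used[b] = 1;
        }
        return 1;
    }

    int main(void)
    {
        /* Operand sets of the 16 DCT_instr executions: 8 row slices, 8 column slices. */
        for (int i = 0; i < N; i++) {
            int rows[N], cols[N];
            for (int k = 0; k < N; k++) { rows[k] = i; cols[k] = k; }   /* row i    */
            printf("row %d: %s\n", i, conflict_free(rows, cols) ? "ok" : "conflict");
        }
        for (int j = 0; j < N; j++) {
            int rows[N], cols[N];
            for (int k = 0; k < N; k++) { rows[k] = k; cols[k] = j; }   /* column j */
            printf("column %d: %s\n", j, conflict_free(rows, cols) ? "ok" : "conflict");
        }
        return 0;
    }

With the plain row-major or column-major layouts of the previous slides, one of the two passes maps all eight operands to a single bank and the same check fails.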
Memory Area Comparison
[Figure: memory area (mm²) of the compared organizations.]

Methodology: optimizing the memory system
- Enumerate memory organizations
- Memory organization cost estimation
- Data layout: Limitedly Improper Constrained Color Assignment (LICCA)
- Alignment layer

LICCA formulation: input
A graph G = (V, E, I) with
- vertices V = {v0, ..., vn-1},
- edges E = {e0, ..., em-1},
- a set of sets of vertices I = {I0, ..., IL-1},
where E = {(vx, vy) | ∃ Ij ∈ I such that vx ∈ Ij and vy ∈ Ij}.

LICCA formulation: solution
An assignment of colors to vertices, i.e., a function f: V → {0, ..., k-1}, such that
- at most ni vertices receive color i, 0 ≤ i ≤ k-1; that is, |{v ∈ V | f(v) = i}| ≤ ni;
- for each set Ij ∈ I, at most ai of its vertices receive color i.
Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting I = {{vx, vy} | (vx, vy) ∈ E} and, for 0 ≤ i ≤ k-1, ni = |V| and ai = 1.

LICCA: relation to the problem
- An edge e = (vx, vy) indicates that vx and vy are read in the same cycle.
- Each set Ij ∈ I is a set of vertices that are read in parallel.
- k is the number of memories.
- ni is the capacity of the i-th memory.
- ai is the number of read/write ports of the i-th memory.

LICCA example
V = {v0, v1, v2, v3, v4, v5}
I0 = {v0, v1, v2}, I1 = {v3, v4, v5}, I2 = {v0, v2, v5}
E = {(v0,v1), (v0,v2), (v0,v5), (v1,v2), (v2,v5), (v3,v4), (v3,v5), (v4,v5)}
[Figure: the graph G.]
Is there a legal k-coloring? A legal LICCA coloring?

LICCA example (continued)
[Figure: the six vertices are packed into two memories, M0 with n0 = 4, a0 = 2 and M1 with n1 = 2, a1 = 1, so that no access set I0, I1, I2 places more vertices in a memory than it has ports. A sketch that checks such an assignment is given after the closing slide.]

Comparison example: baseline architecture
[Figure: main memory with DMA, an AVS implemented as a single/dual-port memory or an 8x8 non-clustered RF, ISE logic, memory decoder, and the baseline processor with its ports and RF.]

Comparison example: proposed architecture
[Figure: main memory with DMA, an AVS built as an 8x8 clustered RF with its decoders, alignment layers on both sides of the ISE logic, memory decoder, and the baseline processor with its ports and RF.]

Comparison example: benchmarks
- 2D DCT on an 8x8 matrix: row/column DCT slice vs. 2-point DCT; 8x8 clustered RF vs. single-port memory; 150 MHz.
- 2D FFT on an 8x8 matrix: 12 butterflies vs. 1 butterfly; 8x8 clustered RF vs. single-port memory; 150 MHz.

Comparison example: 2D DCT, 8x8 matrix
[Figure: results chart reporting factors of 3x and 8x between the compared configurations.]

Comparison example: 2D FFT, 8x8 matrix
[Figure: area (mm²) of the 12-butterfly and 1-butterfly FFT configurations, reporting factors of 2.5x and 12x between the compared configurations.]

Conclusion
- A methodology to efficiently increase the bandwidth to AVS-enhanced ISEs
- LICCA
- Memory system optimization
Future work:
- Commutativity
- LICCA extension for multiple ISEs and shift registers

[Backup figure: alternative data layouts of a 4x4 matrix on a 4x4 non-clustered RF.]

References

Thank you! Questions?
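Backup: a minimal C sketch of the legality check behind the LICCA formulation slides, using the data of the LICCA example (two memories, M0 with n0 = 4, a0 = 2 and M1 with n1 = 2, a1 = 1) and one candidate assignment chosen here purely for illustration; licca_legal and the bitmask encoding of the sets are assumptions of the sketch, not part of the methodology.

    #include <stdio.h>
    #include <stdbool.h>

    /* Sizes matching the LICCA example slide. */
    #define NUM_V 6   /* vertices v0..v5              */
    #define NUM_M 2   /* memories M0, M1              */
    #define NUM_I 3   /* parallel-access sets I0..I2  */

    /* Memory parameters: n[i] = capacity, a[i] = read/write ports. */
    static const int n[NUM_M] = { 4, 2 };
    static const int a[NUM_M] = { 2, 1 };

    /* Parallel-access sets as bitmasks over the vertices (bit v set => v in Ij). */
    static const unsigned I[NUM_I] = {
        0x07,  /* I0 = {v0, v1, v2} */
        0x38,  /* I1 = {v3, v4, v5} */
        0x25   /* I2 = {v0, v2, v5} */
    };

    /* Check whether f (vertex -> memory) is a legal LICCA assignment:
     *  - at most n[i] vertices are mapped to memory i;
     *  - for every set Ij, at most a[i] of its vertices are mapped to memory i. */
    static bool licca_legal(const int f[NUM_V])
    {
        for (int i = 0; i < NUM_M; i++) {
            int load = 0;
            for (int v = 0; v < NUM_V; v++)
                if (f[v] == i) load++;
            if (load > n[i]) return false;           /* capacity bound ni violated  */

            for (int j = 0; j < NUM_I; j++) {
                int ports = 0;
                for (int v = 0; v < NUM_V; v++)
                    if (((I[j] >> v) & 1) && f[v] == i) ports++;
                if (ports > a[i]) return false;      /* port bound ai violated      */
            }
        }
        return true;
    }

    int main(void)
    {
        /* One candidate assignment: {v0, v1, v4, v5} -> M0, {v2, v3} -> M1. */
        const int f[NUM_V] = { 0, 0, 1, 1, 0, 0 };
        printf("LICCA assignment is %s\n", licca_legal(f) ? "legal" : "illegal");
        return 0;
    }

The two tests mirror the two LICCA constraints: the capacity bound ni on each memory and the per-set port bound ai that keeps every parallel access within the available read/write ports.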