COARSE GRAINED RECONFIGURABLE ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION
03/26/2012

OUTLINE
• Introduction
• Motivation
• Network-on-Chip (NoC)
• ASIC based approaches
• Coarse grain architectures
• Proposed Architecture
• Results

INTRODUCTION
• Goal: an application-specific hybrid coarse-grained reconfigurable architecture built on a NoC
• Purpose: support Variable Block Size Motion Estimation (VBSME)
• First approach of its kind — no prior ASIC or coarse-grained reconfigurable architecture takes this route
• Key differences: use of intelligent NoC routers; support for both full and fast search algorithms

MOTIVATION
[Figure: H.264 encoder profile — motion estimation dominates the computation]

MOTION ESTIMATION
• For each 16x16 block of the current frame, find the best-matching block within a search window of the previous frame
• Match quality is measured by the Sum of Absolute Differences (SAD)
[Figure: current 16x16 block in the current frame, search window in the previous frame]

SYSTEM-ON-CHIP (SOC)
• Single-chip systems
• Common components: microprocessor, memory, co-processor, other blocks
• Processing power and data-intensive applications keep increasing
• Facilitating communication between the individual blocks has become a challenge

TECHNOLOGY ADVANCEMENT
[Figure: technology scaling trends]

DELAY VS. PROCESS TECHNOLOGY
[Figure: gate delay vs. wire delay across process technologies]

NETWORK-ON-CHIP (NOC)
• Efficient communication via transfer protocols
• Must respect the strict constraints of the SoC environment
• Types of communication structure: bus, point-to-point, network

COMMUNICATION STRUCTURES
[Figure: bus, point-to-point, and network topologies]

BUS VS. NETWORK
Bus                                              | Network
-------------------------------------------------|-----------------------------------------------
✗ Every unit attached adds parasitic capacitance | ✓ Local performance not degraded with scaling
✗ Bus timing is difficult                        | ✓ Network wires can be pipelined
✗ Bus arbitration can become a bottleneck        | ✓ Routing decisions are distributed
✗ Bus testability is problematic and slow        | ✓ Locally placed BIST is fast and easy
✗ Bandwidth is limited and shared by all         | ✓ Bandwidth scales with network size
✓ Bus latency is wire speed once granted         | ✗ Network contention may cause latency
✓ Very compatible                                | ✗ IPs need smart wrappers
✓ Simple to understand                           | ✗ Relatively complicated

EXAMPLE
[Figure: example bus-based SoC]

EXAMPLE OF NOC
[Figure: example NoC-based SoC]

ROUTER ARCHITECTURE
[Figure: generic NoC router]

BACKGROUND: ME ARCHITECTURES
• Prior work spans general-purpose processors, ASIC, FPGA, and coarse-grained fabrics
• Most support only fixed block size motion estimation (FBSME); VBSME is offered only with redundant hardware
• General-purpose processors can exploit parallelism, but are limited by their inherently sequential execution and by data access through registers

CONTINUED…
• ASIC: either no support for all H.264 block sizes, or support provided at the cost of high area overhead
• Coarse-grained fabrics: overcome the drawbacks of LUT-based FPGAs (coarser-granularity elements, fewer configuration bits) but suffer from under-utilization of resources

ASIC APPROACHES
Topology
• 1D systolic array: reference pixels are broadcast; each processing element computes a pixel difference, accumulates it into the previous partial SAD, and forwards the result to the next processing element; the direction of data transfer depends on the search pattern; high latency; no VBSME
• 2D systolic array: mesh-based architecture; all pixel differences of a 4x4 block computed in parallel; large number of registers
SAD accumulation
• Partial sum: partial SADs stored for each 4x4 block and reused; pipelined SAD computation; area overhead
• Parallel sum: large number of registers

OU'S APPROACH
• 16 SAD modules process 16 4x4 motion vectors in parallel
• VBSME processor: a chain of adders and comparators computes the larger-block SADs
• PE array: the basic computational element of a SAD module; a cascade of four 1D arrays
• 1D array: a 1D systolic array of four PEs, each computing a one-pixel SAD
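The block-matching operation that each SAD module implements can be sketched in software for reference. This is a minimal Python sketch of full-search block matching, not the hardware itself; the function names and the exhaustive search loop are illustrative assumptions.

```python
def sad_4x4(cur, ref, rx, ry):
    """Sum of absolute differences between a 4x4 current block and the
    4x4 reference block whose top-left corner is (rx, ry)."""
    return sum(abs(cur[i][j] - ref[rx + i][ry + j])
               for i in range(4) for j in range(4))

def full_search_4x4(cur, ref, search_range):
    """Exhaustive (full) search: evaluate every candidate displacement in
    the search window and keep the motion vector with the minimum SAD."""
    best_sad, best_mv = float('inf'), (0, 0)
    for dx in range(search_range):
        for dy in range(search_range):
            s = sad_4x4(cur, ref, dx, dy)
            if s < best_sad:
                best_sad, best_mv = s, (dx, dy)
    return best_mv, best_sad
```

In Ou's architecture, 16 instances of this computation (one per 4x4 sub-block of a macroblock) run concurrently, one per SAD module.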
[Figure: Ou's VBSME processor — 16 SAD modules, each fed current_block_data_i and search_block_data_i (32 bits) from two block strips (block_strip_A/B, selected by strip_sel), producing SAD_i and MV_i; a MUX selects among the 16 SAD/MV outputs]
[Figure: 1D array — four PEs chained through delay registers, feeding an accumulator (ACCM); 32-bit current and search data inputs]

PUTTING IT TOGETHER
• Each clock cycle, the columns of the current 4x4 sub-block are scheduled using a delay line
• Two sets of search-block columns are broadcast
• Four block-matching operations execute concurrently per SAD module
• The 4x4 SADs yield the 4x4 motion vectors; a chain of adders and comparators then combines them: 4x4 SADs → 4x8 SADs → … → 16x16 SADs
• Drawbacks: no reuse of search data between modules; resource wastage

ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES
• Issues: resource utilization, generic interconnect
Performance in clock cycles, for a frame of size M x 0.8M:
Architecture | Performance (clock cycles)
-------------|---------------------------
ChESS        | (M x 0.8M)/256 x 17 x 17
MATRIX       | (M x 0.8M)/256 x 17 x 17
RaPiD        | 272 + 32M + 14.45M^2

PROPOSED ARCHITECTURE
• 2D architecture: 16 CPEs, 4 PE2s, 1 PE3, main memory, memory interface
• CPE (Configurable Processing Element) = PE1 + NoC router + network interface
• Current and reference blocks are read from main memory
[Figure: 4x4 grid of CPEs (1,1)–(4,4) with PE2(1)–PE2(4) between the quadrants and PE3 in the center; the Memory Interface (MI) distributes current data c_d_(x,y) and reference data r_d_(x,y) (32 bits each) from main memory, steered by reference_block_id (5 bits) and data_load_control (16 bits)]
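Both Ou's adder/comparator chain and the PE2/PE3 elements of the proposed architecture build the larger-block SADs by summing the SADs of the constituent 4x4 sub-blocks. A minimal Python sketch of that accumulation, assuming the 16 4x4 SADs are arranged as a 4x4 grid (the grid layout and the 4x8-vs-8x4 orientation naming are illustrative assumptions, not from the slides):

```python
def merge_sads(sads_4x4):
    """sads_4x4: 4x4 grid of SADs, one per 4x4 sub-block of a 16x16
    macroblock (sads_4x4[r][c] = sub-block in row r, column c).
    Pairwise addition mirrors the adder chain:
    4x4 -> 4x8 -> 8x8 -> 8x16 -> 16x16."""
    # 4x8: add horizontally adjacent 4x4 SADs
    sad_4x8 = [[sads_4x4[r][2 * c] + sads_4x4[r][2 * c + 1]
                for c in range(2)] for r in range(4)]
    # 8x8: add vertically adjacent 4x8 SADs
    sad_8x8 = [[sad_4x8[2 * r][c] + sad_4x8[2 * r + 1][c]
                for c in range(2)] for r in range(2)]
    # 8x16: add vertically adjacent 8x8 SADs
    sad_8x16 = [sad_8x8[0][c] + sad_8x8[1][c] for c in range(2)]
    # 16x16: add the two 8x16 SADs
    sad_16x16 = sad_8x16[0] + sad_8x16[1]
    return sad_4x8, sad_8x8, sad_8x16, sad_16x16
```

A comparator stage after each addition level (omitted here) would track the minimum SAD per partition size to produce the corresponding motion vectors.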
[Figure: PE1 datapath — 16 current/reference pixel register pairs (CPR/RPR) feeding 16 8-bit subtractors; the differences are combined through a tree of 10-bit adders and a final 12-bit adder; a comparator (COMP) and register (REG) track the minimum SAD to produce the 4x4 motion vector; ports to/from the NI and the east and south neighbors]

NETWORK INTERFACE
• Control unit, packetization unit, and depacketization unit
• Sends reference_block_id and data_load_control to the MI
• Exchanges input/output control signals with PE1 and the router

NOC ROUTER
• Input controller: receives packets from the NI or an adjacent router
• Output controller: sends packets to the NI or an adjacent router
• Five ports: PE1, East, West, North, South
• Header decoder: XY routing protocol — extracts the direction of data transfer from the header packet and updates the number of hops
• Ring buffer (first index/last index) stores packets

Handshake between routers (req: 1 bit, ack: 1 bit, packet: 32 bits):
• Step 1: Router 1 has a message to send to Router 2
• Step 2: Router 1 sends a 1-bit request signal to Router 2
• Step 3: Router 2 first checks whether it is busy; if not, it checks for available buffer space
• Step 4: Router 2 sends an ack if space is available
• Step 5: Router 1 sends the packet

PE2 AND PE3
• Muxes, de-muxes, adders, comparators, registers

FAST SEARCH ALGORITHM
Diamond Search
• 9 candidate search points
• Numbers represent the order in which the reference frames are processed
• Directed edges are labeled with data-transmission equations derived from the data dependencies

EXAMPLE
[Figure: frame, macroblock, and SAD for the diamond-search example]

CONTINUED…
[Figure: diamond-search example, continued]

DATA TRANSFER
• Data transfer between PE1(1,1) and PE1(1,3)
• Individual points vs. intersecting points

DATA LOAD SCHEDULE
[Figure: data load schedule]

OTHER FAST SEARCH ALGORITHMS
• Hexagon
• Big Hexagon
• Spiral

FULL SEARCH
[Figure: mapping of full search onto the architecture]

CONTINUED…
[Figure: full-search mapping, continued]

RESULTS
[Figure: results]

CONTINUED…
[Figure: results, continued]
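As background on the routers' XY routing protocol described above, the per-hop routing decision can be sketched as follows. This is a minimal Python sketch; the (x, y) coordinate convention (x increasing eastward, y increasing northward) is an assumption, not specified in the slides.

```python
def xy_route(cur, dest):
    """Dimension-ordered XY routing decision for one hop.
    cur, dest: (x, y) coordinates of the current and destination routers.
    Route along X (East/West) until the column matches, then along Y
    (North/South); 'Local' delivers to the attached PE1 via the NI."""
    (cx, cy), (dx, dy) = cur, dest
    if dx > cx:
        return 'East'
    if dx < cx:
        return 'West'
    if dy > cy:
        return 'North'
    if dy < cy:
        return 'South'
    return 'Local'

def path(cur, dest):
    """Hop-by-hop output ports from cur to dest under XY routing."""
    hops = []
    while cur != dest:
        port = xy_route(cur, dest)
        hops.append(port)
        cx, cy = cur
        cur = {'East': (cx + 1, cy), 'West': (cx - 1, cy),
               'North': (cx, cy + 1), 'South': (cx, cy - 1)}[port]
    return hops
```

Because the X dimension is always exhausted before the Y dimension, XY routing is deterministic and deadlock-free on a mesh, which keeps the header decoder simple.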