Compiler-directed Synthesis of Programmable Loop Accelerators
Kevin Fan, Hyunchul Park, Scott Mahlke
September 25, 2004, EDCEP Workshop
University of Michigan Electrical Engineering and Computer Science

Loop Accelerators
• Hardware implementation of a critical loop nest
  – Hardwired state machine
  – Digital camera application
  – 1000x vs. Pentium III
  – Multiple accelerators hooked up in a pipeline
• Loop accelerator vs. customized processor
  – 1 block of code vs. multiple blocks
  – Trivial control flow vs. handling generic branches
  – Traditionally state-machine driven vs. instruction driven

Programmable Loop Accelerators
• Goals
  – Multifunction accelerators – accelerator hardware can handle multiple loops (re-use)
  – Post-programmable – to a degree, allow changes to the application
  – Use the compiler as an architecture synthesis tool
• But…
  – Don’t build a customized processor
  – Maintain ASIC-level efficiency

NPA (Nonprogrammable Accelerator) Synthesis in PICO

PICO Frontend
• Goals
  – Exploit loop-level parallelism
  – Map the loop to abstract hardware
  – Manage global memory bandwidth
• Steps
  – Tiling
  – Load/store elimination
  – Iteration mapping
  – Iteration scheduling
  – Virtual processor clustering

Original loop:

  for i = 1 to ni
    for j = 1 to nj
      y[i] += w[j] * x[i+j]

After transformation:

  for jt = 1 to 100 step 10
    for t = 0 to 502
      for p = 0 to 1
        (i,j) = function of (t,p)
        if (i>1)          W[t][p] = W[t-5][p]    else W[t][p] = w[jt+j]
        if (i>1 && j<bj)  X[t][p] = X[t-4][p+1]  else X[t][p] = x[i+jt+j]
        Y[t][p] += W[t][p] * X[t][p]

PICO Backend
• Resource allocation (II, operation graph)
• Synthesize a machine description for a “fake” fully connected processor with the allocated resources

Reduced VLIW Processor after Modulo Scheduling
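The backend's resource allocation step above chooses an initiation interval (II) and an FU mix for the operation graph. A minimal sketch of the standard resource-constrained lower bound on II (ResMII); the function name and the simplifying assumption that each FU executes one operation per cycle are mine, not from the slides:

```c
#include <assert.h>

/* Resource-constrained minimum II: with `ops` operations of a given
 * type and `fus` units able to execute them (assuming one op per unit
 * per cycle), the modulo schedule needs at least ceil(ops / fus)
 * cycles per iteration for that resource. */
static int res_mii(int ops, int fus) {
    return (ops + fus - 1) / fus;  /* integer ceiling division */
}
```

For instance, 41 operations spread over 12 FUs (3 clusters of 4) cannot be scheduled below II = ceil(41/12) = 4, which is consistent with the II=4 designs discussed later.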
Data/control-path Synthesis
[Figure: synthesized NPA datapath – load units for y, w, and x, multiplexer-selected inputs from registers (Xr-1, Yr-1) and temporaries (t1–t3), an adder chain, and a final store of y]

PICO Methodology – Why it Works
• Systematic design methodology
  1. Parameterized meta-architecture – all NPAs have the same general organization
  2. Performance/throughput is an input
  3. Abstract architecture – we know how to build compilers for this
  4. Mapping mechanism – determine architecture specifics from the schedule for the abstract architecture

Direct Generalization of PICO?
• Programmability would require full interconnect between elements
• Back to the meta-architecture!
• Generalize connectivity to enable post-programmability
• But stylize it

Programmable Loop Accelerator – Design Strategy
• Compile for a partially defined architecture
  – Build long-distance communication into the schedule
  – Limit global communication bandwidth
• Proposed meta-architecture
  – Multi-cluster VLIW
    • Explicit inter-cluster transfers (varying latency/BW)
    • Intra-cluster communication is complete
  – Hardware partially defined – expensive units

Programmable Loop Accelerator Schema
[Figure: accelerator schema – DRAM feeding stream units and stream buffers (SRAM); each accelerator datapath contains FUs, shift registers, a control unit, intra-cluster communication, and an inter-cluster register file; accelerators compose into a pipeline of tiled or clustered accelerators]

Flow Diagram
Assembly code, II → FU Alloc (# clusters, # expensive FUs) → Partition (FUs assigned to clusters, # cheap FUs) → Modulo Schedule (shift register depth, width, porting; inter-cluster bandwidth) → Loop Accelerator

Sobel
Kernel

  for (i = 0; i < N1; i++) {
    for (j = 0; j < N2; j++) {
      int t00, t01, t02, t10, t12, t20, t21, t22;
      int e, tmp;

      t00 = x[i  ][j  ];
      t01 = x[i  ][j+1];
      t02 = x[i  ][j+2];
      t10 = x[i+1][j  ];
      t12 = x[i+1][j+2];
      t20 = x[i+2][j  ];
      t21 = x[i+2][j+1];
      t22 = x[i+2][j+2];

      e1 = ((t00 + t01) + (t01 + t02)) - ((t20 + t21) + (t21 + t22));
      e2 = ((t00 + t10) + (t10 + t20)) - ((t02 + t12) + (t12 + t22));
      e12 = e1*e1;
      e22 = e2*e2;
      e = e12 + e22;

      if (e > threshold) tmp = 1; else tmp = 0;
      edge[i][j] = tmp;
    }
  }

FU Allocation
• Determine the number of clusters:
    # clusters = ceil( # ops / (4 · II) )
• Determine the number of expensive FUs (MPY, DIV, memory):
    # FUs of type = ceil( # ops of type / II )
• Sobel with II=4:
  – 41 ops → 3 clusters
  – 2 MPY ops → 1 multiplier
  – 9 memory ops → 3 memory units

Partitioning
• Multi-level approach consisting of two phases
  – Coarsening
  – Refinement
• Minimize inter-cluster communication
• Load balance
  – Max of 4·II operations per cluster
• Take FU allocation into account
  – Restricted # of expensive units
  – # of cheap units (ADD, logic) determined from the partition

Coarsening
• Group highly related operations together
  – Pair operations together at each step
  – Forces the partitioner to consider several operations as a single unit
• Coarsening the Sobel subgraph into 2 groups:
[Figure: Sobel dataflow subgraph of loads (L) and adds (+), successively coarsened by pairing nodes until 2 groups remain]

Refinement
• Move operations between clusters
• Good moves:
  – Reduce inter-cluster communication
  – Improve load balance
  – Reduce hardware cost
    • Reduce the number of expensive units to meet the limit
    • Collect similar-bitwidth operations together
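Both coarsening and refinement optimize the same objective: the number of dataflow edges crossing cluster boundaries. A small sketch of that cost function; the edge-list representation and function name are mine, not from the slides:

```c
/* Inter-cluster communication cost of a partition: count dataflow
 * edges whose two endpoint operations are assigned to different
 * clusters. Refinement accepts moves that lower this count (or the
 * hardware cost) without violating the 4*II-ops-per-cluster limit. */
static int edges_cut(int (*edges)[2], int n_edges, const int *cluster) {
    int cut = 0;
    for (int i = 0; i < n_edges; i++)
        if (cluster[edges[i][0]] != cluster[edges[i][1]])
            cut++;
    return cut;
}
```

A partition like the Sobel example below would report 6 for its cut; the toy test here uses a 4-node ring split into two clusters, cutting 2 edges.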
[Figure: refinement candidate – moving one operation between the two clusters of the load/add subgraph]

Partitioning Example
• From Sobel, II=4
• Place MPYs together
• Place each tree of ADD-LOAD-ADDs together
• Cuts 6 edges

Modulo Scheduling
• Determines shift register width, depth, and number of read ports
• Sobel, II=4:
[Table: per-FU schedule – for each FU (FU0–FU3), the cycle at which each operation (LD, ADD) issues, the maximum result lifetime, and the required shift register depth and read ports]

Test Cases
• Sobel and fsed kernels, II=4 designs
• Each machine has 4 clusters with 4 FUs per cluster
[Figure: sobel and fsed cluster diagrams – per-cluster FU mixes drawn from multipliers (*), adders (+-), memory units (M), branch units (B), shifters (<<), and logic (&)]

Cross Compile Results
• Computation is localized
  – sobel: 1.5 moves/cycle
  – fsed: 1 move/cycle
• Cross compile
  – Can still achieve II=4
  – More inter-cluster communication
  – May require more units
  – sobel on fsed machine: ~2 moves/cycle
  – fsed on sobel machine: ~3 moves/cycle

Concluding Remarks
• Programmable loop accelerator design strategy
  – Meta-architecture with stylized interconnect
  – Systematic compiler-directed design flow
• Costs of programmability:
  – Interconnect, inter-cluster communication
  – Control – “micro-instructions” are necessary
• Just scratching the surface of this work
• For more, see the CCCP group webpage
  – http://cccp.eecs.umich.edu
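The modulo scheduling slide derives storage from value lifetimes. A plausible sketch of the depth calculation, under my assumption (not stated explicitly on the slides) that each FU writes its results into a shift register that shifts every cycle, so a value must survive as many shifts as its lifetime in cycles:

```c
/* Required shift-register depth for one FU: a value produced at cycle
 * prod[i] and last read at cycle last_use[i] survives
 * last_use[i] - prod[i] shifts, so the register must be at least that
 * deep (assumption: one result written per cycle, no rotation). */
static int shift_reg_depth(const int *prod, const int *last_use, int n) {
    int depth = 0;
    for (int i = 0; i < n; i++) {
        int lifetime = last_use[i] - prod[i];
        if (lifetime > depth) depth = lifetime;
    }
    return depth;
}
```

The number of read ports would similarly be bounded by how many consumers read distinct register stages in the same cycle.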