Architecture-Specific Packing for Virtex-5 FPGAs
Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal
February 25th, 2008

Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary

Simplified FPGA Logic Element
[Figure: a 4-LUT with inputs A1 to A4 and output O4 driving a flip-flop (FF)]

Simplified FPGA Logic Block
[Figure: four 4-LUT/FF pairs connected to the general interconnect]

Virtex-5 Logic Block
[Figure: a CLB with two SLICEs, each holding four 6-LUT/FF pairs, connected to the general interconnect]

Dual-Output 6-LUT
[Figure: a 6-LUT with inputs A1 to A6 and outputs O6 and O5]

Dual-Output 6-LUT Usage
[Figure: the 6-LUT drawn as two 5-LUTs, with inputs A1 to A6 and outputs O6 and O5]

Dual-Output Packing
[Figure: logic functions X and Y, each in its own 6-LUT, are packed into the two 5-LUTs of a single 6-LUT with A6 tied to VCC; Y drives O6 and X drives O5]
Number of 6-LUTs used: 1!

Virtex-5 LUT/FF Pair
[Figure: a 6-LUT/FF pair showing the O6 and O5 outputs, the F7 mux, the carry (CY) and XOR logic, the AX and CIN inputs, and the A, AMUX, and AQ outputs]

Dual-Output Packing Tradeoff
[Figure: the LUT/FF pair with the O6 and O5 output paths through F7, the FF, and AX]

Dual-Output Packing in Placer
• Goal: reduce area without a performance hit
  – Can be done pre-placement
    • Will be sub-optimal without delay estimates
  – Use delay estimates available during placement to make good decisions on when to merge two LUTs
• Approach:
  – Allow the second 5-LUT to be used when the performance impact is small
  – Incorporate LUT packing in the placer's cost function

Placer Cost Function
• Previous cost function:
  – Cost = a * W + b * T
  – W: wirelength cost; T: timing performance cost
• Extend the cost function with two new terms
  – One based on 6-LUT utilization (L)
  – One based on SLICE utilization (S)
  – Cost = a * W + b * T + c * L + d * S

6-LUT Utilization Term
• L is computed based on all the used 6-LUT slots
  [Equation: L]
• Where: [definitions of the terms in L]

SLICE Utilization Term
• S is computed based on all the available SLICEs
  S = \sum_{i=0}^{m} S_i
• Let:
  – N_i = number of used 5-LUTs in SLICE i (at most 8)

Performance Recovery
• Helpful to prohibit packing in certain cases for performance reasons
• Other used
  elements in a SLICE may block the “good” path from the O5 output to external interconnect.

Performance Recovery: XOR
[Figure: LUT/FF pair (LUT6 with O6 and O5, F7, CY, XOR, AX, CIN, AMUX, FF, AQ) illustrating the XOR case]

Performance Recovery: F7
[Figure: LUT/FF pair (LUT6 with O6 and O5, F7, CY, XOR, AX, CIN, AMUX, FF, AQ) illustrating the F7 case]

6-LUT Reduction
[Chart: % 6-LUT reduction per benchmark design; annotation: 5.5% 6-LUT reduction]

SLICE Reduction
[Chart: % SLICE reduction per benchmark design; annotation: 10.23% SLICE reduction]

Performance Results
[Chart: performance loss (%) versus SLICE reduction (%); annotation: 3.3% performance degradation]

Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary

New Type of Packing Problem
• Traditionally, packing is considered to be a problem of just LUTs and flops
• However, Virtex-5 contains large IP blocks that present their own packing problem

Virtex-5 Block RAMs
• A 36 Kbit block RAM tile can store:
  a) a single 36 Kb RAM
  b) two independent 18 Kb RAMs
• Block RAM has a configurable “aspect ratio”
• An 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18
• Tools decide which independent 18 Kb block RAMs to locate in which tile
[Figure: a 36 Kb RAM tile shown as two 18 Kb RAMs]

Virtex-5 DSP48E Block
• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
• Multiple DSP48Es can be chained together through the PCIN and PCOUT ports to form more complex functions
[Figure: DSP48E block diagram with inputs A (25-bit), B (18-bit), and C (48-bit); optional pipeline registers and routing logic; a 25x18 multiplier; an ALU; pattern detect; the 48-bit output P; and the cascade ports PCIN and PCOUT]

Block RAM and DSP Floorplan
• Block RAM and DSP48E tiles are organized in columns
[Figure: floorplan with columns of block RAM tiles and a column of Virtex-5 DSP tiles containing DSP48E blocks]

Block RAM/DSP Packing
• Problem: Placer algorithms are heuristic and sometimes do not find an optimal block RAM packing
• Goal: Leverage preferred block RAM packing patterns to achieve high performance
• Target area: DSP designs
  – DSP designs make heavy use of block RAMs and DSP blocks

DSP Block RAM Designs
• The most common DSP application is the Finite Impulse Response (FIR) filter
  – FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs

FIR Filter
• A Finite Impulse Response (FIR) filter is a digital filter that takes a weighted average of the signals in a delay line
• An N-tap filter can be expressed as:
  y[n] = c_0*x[n] + c_1*x[n-1] + … + c_{N-1}*x[n-N+1]
  – Where:
    • y[n] is the output of the filter at time n
    • x[n] is the data input “signal” at time n
    • c_i is the i-th coefficient
• Each coefficient/data product in the sum is referred to as a “tap”
  – DSP units are used for the multiply and accumulate
  – Block RAMs are used to store the data and coefficients

FIR Designs – Use Case 1
• 2-tap FIR filter involving small block RAMs
[Figure: 2-tap FIR datapath. Data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 (18 Kb block RAMs within 36 Kb block RAM tiles) drive the A and B inputs of DSP0 (tap 0) and DSP1 (tap 1); the DSPs are chained through PCOUT and PCIN; labels: data input, data output]

Packing for Use Case 1
• Packing both 18 Kb block RAMs into one block RAM tile permits a natural alignment between the DSP and block RAMs
[Figure: a block RAM tile operating as two independent 18 Kb block RAMs]
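As a reference for the tap structure in the FIR slides, the N-tap expression y[n] = c_0*x[n] + … + c_{N-1}*x[n-N+1] can be sketched in software. This is a minimal illustration rather than the hardware mapping from the talk: the function name `fir_filter` is hypothetical, and samples before time 0 are assumed to be zero. Each pass of the inner loop corresponds to one tap, that is, one multiply-accumulate of the kind the slides assign to a DSP48E.

```python
from typing import List, Sequence


def fir_filter(coeffs: Sequence[float], samples: Sequence[float]) -> List[float]:
    """Direct-form FIR: y[n] = sum over i of coeffs[i] * samples[n - i].

    Assumption (not from the slides): x[k] = 0 for k < 0.
    """
    n_taps = len(coeffs)
    output = []
    for n in range(len(samples)):
        acc = 0.0  # running sum, the software analogue of the accumulate step
        for i in range(n_taps):
            if n - i >= 0:  # skip taps that would read before the first sample
                acc += coeffs[i] * samples[n - i]  # one tap: multiply, then accumulate
        output.append(acc)
    return output
```

For a 2-tap filter with both coefficients 0.5, `fir_filter([0.5, 0.5], [2.0, 4.0, 6.0])` returns `[1.0, 3.0, 5.0]`: each output is the average of the current and previous sample.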
[Figure: floorplan with a column of block RAM tiles aligned with the DSP48E blocks of the Virtex-5 DSP tile column; annotation: High Performance!]

FIR Designs – Use Case 2
• 2-tap FIR filter involving larger block RAMs
[Figure: 2-tap FIR datapath. Data RAMs RAMD0 and RAMD1 and coefficient RAMs RAMC0 and RAMC1 drive the A and B inputs of DSP0 (tap 0) and DSP1 (tap 1); the DSPs are chained through PCOUT and PCIN; labels: 18 Kb block RAM, 36 Kb block RAM]

Packing for Use Case 2
• Two block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP and block RAMs
[Figure: floorplan with two columns of block RAM tiles and one column of Virtex-5 DSP tiles containing DSP48E blocks]

Block RAM Chains
• Use Case: the data output pins of one 18 Kb block RAM are connected to the data input pins of another (e.g. FIFO)
[Figure: RAM0 and RAM1 (18 Kb block RAMs with ports dia, dib, doa, dob, addra, addrb) chained between “in” and “out”]
• Algorithm: Look for such chains and pack them together into a single block RAM tile
• Special Case: 18 Kb block RAMs separated by registers

Block RAM/DSP Packing Results
Circuit      Perf., RAM Packing (MHz)   Perf., Baseline (MHz)   Percent Improvement
Circuit 1    500                        400                     25%
Circuit 2    450                        365                     23%
Circuit 3    500                        470                     6%
Circuit 4    425                        435                     -2%
Circuit 5    215                        200                     8%
Geomean      400                        359                     11%

Summary
• Described two architecture-specific packing approaches for a 65nm commercial FPGA: Xilinx Virtex-5
  – Dual-output LUT packing in placement:
    • Achieves 10.2% SLICE reduction and 5.5% LUT reduction
  – Packing for DSPs and block RAMs:
    • Achieves 11% performance improvement

Questions
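The 11% performance improvement quoted in the summary corresponds to the Geomean row of the Block RAM/DSP packing results. The sketch below recomputes that row from the five per-circuit frequencies, assuming the improvement is the ratio of the two geometric means minus one. The data values are copied from the results table; the names `RESULTS`, `geomean`, and `improvement_percent` are illustrative only.

```python
import math

# (circuit, MHz with RAM packing, MHz baseline), copied from the results table
RESULTS = [
    ("Circuit 1", 500, 400),
    ("Circuit 2", 450, 365),
    ("Circuit 3", 500, 470),
    ("Circuit 4", 425, 435),
    ("Circuit 5", 215, 200),
]


def geomean(values):
    """Geometric mean, computed as exp of the mean of the logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def improvement_percent(packed_mhz, baseline_mhz):
    """Relative gain of the packed result over the baseline, in percent."""
    return (packed_mhz / baseline_mhz - 1.0) * 100.0


packed_gm = geomean(packed for _, packed, _ in RESULTS)
baseline_gm = geomean(base for _, _, base in RESULTS)
overall = improvement_percent(packed_gm, baseline_gm)

for name, packed, base in RESULTS:
    print(f"{name}: {improvement_percent(packed, base):+.1f}%")
print(f"Geomean: {packed_gm:.0f} MHz vs {baseline_gm:.0f} MHz ({overall:+.1f}%)")
```

Rounded to whole numbers, the computed geometric means are 400 MHz and 359 MHz, which matches the table. A geometric mean is the usual choice for this kind of summary because it averages ratios, so no single fast or slow circuit dominates the result.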