Graduate Computer Architecture I
Lecture 16: FPGA Design

Emergence of FPGA
• Great for Prototyping and Testing
  – Enables logic verification without the high cost of fabrication
  – Reprogrammable
• Research and Education
  – Meets most computational requirements
  – Options for transferring a design to an ASIC
• Technology Advances
  – Huge FPGAs are available
    • Up to 200,000 logic units
  – Clock rates above 500 MHz
• Competitive Pricing

2 - CSE/ESE 560M – Graduate Computer Architecture I

System on Chip (SoC)
• Large Embedded Memories
  – Up to 10 megabits of on-chip memory (Virtex 4)
  – High bandwidth and reconfigurable
• Processor IP Cores
  – Many soft processor cores (some open source)
  – Embedded processor cores
    • PowerPC, Nios RISC, etc.
  – 450+ MHz
  – Simple digital signal processing cores
    • Up to 512 DSPs on Virtex 4
• Interconnects
  – High-speed network I/O (10 Gbps)
  – Built-in Ethernet MACs (soft/hard core)
• Security
  – Embedded 256-bit AES encryption

Potential Advantages of FPGAs
[Figure only; not recoverable from the extracted text]

Designing with FPGAs
• Opportunities
  – Hardware logic is programmable
  – Immediate testing on the actual platform
• Challenges
  – Programming Environment
    • Think and design in 2-D instead of 1-D
    • Consider hardware limitations
  – Hardware Synthesis
    • Smart language interpreter and translator
    • Efficient HW resource utilization

Today
• Programming Environment
  – Object-oriented programming model
  – Template-based language editors
  – Hardware/software co-design
  – Still a disconnect between SW and HW methods
  – Lack of education to bring them together
• Hardware Synthesis
  – Getting smarter but not smart enough
  – Tuned specifically for each platform
  – Not able to take full advantage of resources
  – Manual tweaking and using templates

High Performance Design in FPGA
• Fine Grain Pipelining
  – Reducing the critical path
  – One level of look-up table (LUT) between D flip-flops
  – Works best for streaming data with little or no data dependencies
• Logic Resource
  – Smaller designs are often faster
  – Use all available resources
  – Fewer map and place resource conflicts
  – Quicker compilation
• Parallel Engines
  – Exploit parallelism in the application
  – Faster place and route

Pipelining
• DEFINITION:
  – A K-stage pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output.
  – A COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.
• CONVENTION:
  – Every pipeline stage, hence every K-stage pipeline, has a register on its OUTPUT (not on its input).
• ALWAYS:
  – The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths + (input) register propagation delay + (output) register setup time.

Bad pipelining
• You cannot just add registers at random
  – Successive inputs get mixed: e.g., B(A(X_{i+1}), Y_i)
  – This happens because some paths from inputs to outputs have 2 registers, and some have only 1!
• Not a well-formed K-pipeline!

Adding Pipelines
• Method
  – Draw a line that crosses every output in the circuit and mark the endpoints as terminal points.
  – Continue to draw new lines between the terminal points across various circuit connections, ensuring that every connection crosses each line in the same direction.
    • These lines represent pipeline stages.
• Adding a pipeline register at every point where a separating line crosses a connection will always generate a valid pipeline
• Focus on the slowest part of the circuit

Pipelining Example
• 8-bit to 256-bit decoder
  – 256 different combinations

    library ieee;
    use ieee.std_logic_1164.all;

    entity DECODER is
      port(
        I : in  std_logic_vector(7 downto 0);
        O : out std_logic_vector(255 downto 0));
    end DECODER;

    architecture behavioral of DECODER is
    begin
      -- One-hot decode. Each output literal is 256 bits wide (elided here with "...").
      process (I)
      begin
        case I is
          when "00000000" => O <= "1000...0000";
          when "00000001" => O <= "0100...0000";
          when "00000010" => O <= "0010...0000";
          ...
          when "11111110" => O <= "0000...0010";
          when "11111111" => O <= "0000...0001";
          -- std_logic also has metavalues ('U', 'X', ...), so the case needs a default branch.
          when others     => O <= (others => '0');
        end case;
      end process;
    end behavioral;

Hardware Synthesis
• Synthesis
  – Uses at least three 4-input, 1-output look-up tables (LUT4) per output bit to decode the 256 combinations of I(7:0)
• Resource Usage
  – 3 LUT4 × 256 outputs = 768 LUT4
• Critical Path
  – Input/output pin delays
  – 2 levels of LUT4
  – Sometimes 3 levels?!
  – Virtex 4, speed grade 11
    • 8.281 ns (121 MHz)
[Figure: I(7:0) fans out to 256 combinational logic blocks, one per input value "0" through "255", driving outputs O(0) through O(255)]

Pipelined Decoder
• Input/Output pin DFF
  – Already in most FPGAs
  – Minimizes pin latencies
• DFF after every LUT4
  – A LUT4 is always followed by a DFF (why not use it?)
  – Only when possible
  – Minimizes logic latency
• FPGA Resources
  – 768 LUT4 as before
  – Plus 768 DFFs and 264 pin DFFs (8 inputs + 256 outputs)
  – But not really…
• Critical Path
  – 1 level of LUT4
  – Plus small DFF propagation delay and setup time
  – Virtex 4, speed grade 11
    • 2.198 ns (455 MHz)
    • 3.76x speedup
[Figure: I(7:0) fans out to 256 combinational logic blocks, one per input value "0" through "255", driving outputs O(0) through O(255)]

Logic Resource
• Leveraging the FPGA Architecture
  – Similarity with Architecture
  – LUT and a few special logic elements followed by a DFF
• Smaller Design is often Faster
  – Easier for tools to map, place, and route
  – Optimize designs wherever possible
  – In an FPGA, each wire can have a large fanout limit
  – Reuse logic and results
[Figure: input, logic, output; fanout is the capacity of a wire to drive the inputs of other logic]

Reusing Logic
• Synthesis Tools
  – Obvious duplicate logic is automatically combined
  – Most other duplication is not optimized away
• Decoder Example
  – Two 4-bit to 16-bit decoders
  – Combining decoder outputs
  – Two 16-bit vectors to 256 bits
• Critical Path
  – 1 level of LUT4
  – Approximately the same
  – Differences in wire delay
• FPGA Resources
  – I/O DFFs remain the same
  – 2 × 16 = 32 LUT4 and DFF
  – Plus 256 LUT4 and DFF
  – Total 288 LUT4 and DFF!
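
The two-stage decoder described under Reusing Logic could be written in VHDL roughly as follows. This is a minimal sketch rather than the lecture's own source: the entity name DECODER_2STAGE, the clock port CLK, and the signals HI_Q and LO_Q are assumptions made for illustration, and the input/output pin DFFs are left out. For readability, O(n) goes high for input value n, which is the reverse of the bit order in the earlier DECODER listing.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; CLK is an assumed port that the original DECODER does not have.
entity DECODER_2STAGE is
  port(
    CLK : in  std_logic;
    I   : in  std_logic_vector(7 downto 0);
    O   : out std_logic_vector(255 downto 0));
end DECODER_2STAGE;

architecture rtl of DECODER_2STAGE is
  -- Registered one-hot outputs of the two 4-to-16 decoders (assumed names).
  signal HI_Q : std_logic_vector(15 downto 0) := (others => '0');
  signal LO_Q : std_logic_vector(15 downto 0) := (others => '0');
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      -- Stage 1: two 4-to-16 decoders. Each bit is a function of only
      -- 4 inputs, so there are 2 x 16 = 32 four-input functions, each registered.
      HI_Q <= (others => '0');
      HI_Q(to_integer(unsigned(I(7 downto 4)))) <= '1';
      LO_Q <= (others => '0');
      LO_Q(to_integer(unsigned(I(3 downto 0)))) <= '1';

      -- Stage 2: 256 two-input ANDs combine the registered stage-1 results.
      -- HI_Q and LO_Q are read before this edge updates them, so O lags I
      -- by two clock cycles.
      for H in 0 to 15 loop
        for L in 0 to 15 loop
          O(16*H + L) <= HI_Q(H) and LO_Q(L);
        end loop;
      end loop;
    end if;
  end process;
end rtl;
```

Every register-to-register path here contains one small logic function, which is the "1 level of LUT4" structure the slide aims for. Whether a given synthesis tool actually maps each function to a single LUT4 and its neighbouring DFF is tool- and device-dependent.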
[Figure: I(7:0) feeds two sets of 4-to-16 decoders; 256 AND gates, one per decoder output pair "0,0" through "15,15", drive outputs O(0) through O(255)]

Virtex 4 – Elementary Logic Block
[Figure: logic block with labeled 2-to-1 multiplexers, 4-input LUTs, and 1-bit D flip-flops]

Using MUXF as 2-input Gates
[Figure: two MUXF configurations with data inputs 0 and 1, select input sel, signals a and b, and output z, each wired to act as a 2-input gate]
• Inverters can be pushed into the LUT4 or DFF (by using inverted Q)

Using Unused Multiplexors
• Decoder Example
  – Replace all LUT4s in the 2nd decoder stage with MUX-based 2-input AND gates
• Critical Path
  – Same
  – 2.198 ns (455 MHz)
• FPGA Resources
  – I/O DFFs remain the same
  – 256 MUXF and DFF
  – 32 LUT4 and DFF
[Figure: I(7:0) feeds two sets of 4-to-16 decoders; 256 MUXF-based AND gates, one per decoder output pair "0,0" through "15,15", drive outputs O(0) through O(255)]

Parallel Design
• Use Area to Increase Performance
  – Increase the input bandwidth (input bus width)
    • Processing multiple data items at a time
  – Duplicate engines to process independent data sets
    • Thread/object-level parallelism
    • Instruction-level parallelism
  – Unroll loops to expose the parallelism
  – Excellent for streaming data applications
    • Multimedia
    • Network processing
• Performance Scalability
  – Linear performance increase with size
    • Achieved for many algorithms
  – Sometimes exponential hardware size
    • Try to scale using a higher level of parallelism

Summary
• FPGA Design Methods
  – Fine-grain pipelining to increase clock rate
    • If possible, 1 level of LUT followed by a DFF
  – Parallel engines to increase bandwidth
    • Duplicate logic to linearly increase the performance
  – Reducing logic resource usage
    • Reusing duplicate logic
    • Using all available embedded logic
    • There are other logic resources (e.g., embedded processors, large memories, optimized primitive gates, and IP cores)
• Best Methods Today
  – Learn about the internal architecture of the FPGA
  – Make your own templates and use them
  – Use IP cores
• Future Research Topics
  – Integration of generalized pipelining algorithms (in the works)
  – Smarter synthesis tools (understanding HDL)
  – Automatic platform-specific optimization techniques
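
The "parallel engines" method from the summary can be illustrated with a VHDL generate loop that duplicates one engine across a wider input bus. This is only a sketch under stated assumptions: the entity name PARALLEL_DECODERS, the generic N, and the choice of a registered 4-to-16 decoder as the duplicated engine are all hypothetical and not taken from the lecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: N independent 4-to-16 decoders sharing one clock.
entity PARALLEL_DECODERS is
  generic (N : positive := 4);  -- number of duplicated engines (assumed default)
  port(
    CLK : in  std_logic;
    I   : in  std_logic_vector(4*N - 1 downto 0);    -- N nibbles per cycle
    O   : out std_logic_vector(16*N - 1 downto 0));  -- N one-hot results
end PARALLEL_DECODERS;

architecture rtl of PARALLEL_DECODERS is
begin
  -- One engine per input nibble. The engines share no data, so input
  -- bandwidth grows linearly with N while the critical path stays the same.
  ENGINES : for K in 0 to N - 1 generate
    process (CLK)
      variable ONEHOT : std_logic_vector(15 downto 0);
    begin
      if rising_edge(CLK) then
        ONEHOT := (others => '0');
        ONEHOT(to_integer(unsigned(I(4*K + 3 downto 4*K)))) := '1';
        -- Each process assigns only its own static slice of O, so the
        -- engines do not create multiple drivers on the same bits.
        O(16*K + 15 downto 16*K) <= ONEHOT;
      end if;
    end process;
  end generate ENGINES;
end rtl;
```

Building the one-hot value in a local variable and assigning a statically indexed slice keeps each engine's driver confined to its own 16 output bits. Hardware size also grows linearly with N here, which matches the "use area to increase performance" trade-off.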