Digital Design: An Embedded Systems Approach Using Verilog

Chapter 9: Accelerators

Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.

Performance and Parallelism

- A processor core performs steps in sequence
  - Performance limited by the instruction rate
- Accelerating performance
  - Perform steps in parallel
  - Takes less time overall to complete an operation
- Instruction-level parallelism
  - Within a processor core
  - Pipelining, multiple-issue
- Accelerators
  - Custom hardware for parallel operations

Achievable Parallelism

- How many steps can be performed at once?
- Regularly structured data
  - Independent processing steps
- Examples
  - Video and image pixel processing
  - Audio or sensor signal processing
- Constrained by data dependencies
  - Operations that depend on results of previous steps

Algorithm Kernels

- Algorithm: specification of the required processing steps
  - Often expressed in a programming language
- Kernel: the part that involves the most intensive, repetitive processing
  - "10% of operations take 90% of the time"
- Accelerating a kernel with parallel hardware gives the best payback

Amdahl's Law

- Time for an algorithm is t
- Fraction f is spent on a kernel
  $$ t = ft + (1 - f)t $$
- Accelerator speeds up kernel by a factor s
  $$ t' = \frac{ft}{s} + (1 - f)t $$
- Overall speedup factor s'
  $$ s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)} $$
- For large f, s' → s
- For small f, s' → 1

Amdahl's Law Example

- An algorithm with two kernels
  - Kernel 1: 80% of time, can be sped up 10 times
  - Kernel 2: 15% of time, can be sped up 100 times
- Which speedup gives best overall improvement?
- For kernel 1:
  $$ s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57 $$
- For kernel 2:
  $$ s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17 $$
- Accelerating kernel 1 gives the better overall improvement, even though its own speedup factor is smaller

Parallel Architectures

- An architecture for an accelerator specifies
  - Processing blocks
  - Data flow between them
- Parallelism through replication
  - Multiple identical blocks operating on different data elements
  - Works well when elements can be processed independently

Parallel Architectures (continued)

- Parallelism through pipelining
  - Break a computation into steps, perform them in assembly-line fashion
  - Latency (time to complete a single operation) is not increased
  - Throughput (rate of completion of operations) is increased
    - Ideally by a factor equal to the number of pipeline stages

[Figure: pipeline, data in → step 1 → step 2 → step 3 → data out]

Direct Memory Access (DMA)

- Input/output data for accelerators must be transferred at high speed
  - Using the processor would be too slow
- Direct memory access
  - I/O controller and accelerator transfer data to and from memory autonomously
  - Program supplies starting address and length

Bus Arbitration

- Bus masters take turns to use bus to access slaves
  - Controlled by a bus arbiter
- Arbitration policies
  - Priority, round-robin, …

[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]

Block-Processing Accelerator

- Data arranged in regular groups of contiguous memory locations
  - Accelerator works block by block
  - E.g., images in blocks of 8 × 8 × 16-bit pixels
- Datapath comprises
  - Memory access: address generation, counters
  - Computation section
  - Control section: finite-state machine(s)

Stream-Processing Accelerator

- Streams of data from an input source
  - E.g., high-speed sensors
- Digital signal processing (DSP)
  - Analog sensor signal converted to
    stream of digital sample values
  - Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)

Processor/Accelerator Interface

- Embedded software controls an accelerator
  - Providing control parameters
  - Synchronizing operations
- Input/output registers and interrupts
  - Interact with the control sequencer

Case Study: Edge Detection

- Illustration of accelerator design
- Edge detection in video processing
  - Identify where image intensity changes abruptly
  - Typically at the boundary of objects
  - First step in identifying objects in a scene
- Application areas
  - Video surveillance, computer vision, …
- For this case study
  - Monochrome images of 640 × 480 × 8-bit pixels
  - Stored row-by-row in memory
  - Pixel values: 0 (black) – 255 (white)

Sobel Edge Detection

- Compute derivatives of intensity in x and y directions
- Look for minima and maxima (where intensity changes most rapidly)

The Sobel Algorithm

- Use convolution to approximate partial derivatives Dx and Dy at each position
  - Weighted sum of value of a pixel and its eight nearest neighbors
  - Coefficients represented using a 3×3 convolution mask
- Sobel masks for x and y derivatives
  $$ G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} $$
  $$ D_x(i, j) = O(i, j) * G_x, \qquad D_y(i, j) = O(i, j) * G_y $$

The Sobel Algorithm (continued)

- Combine partial derivatives
  $$ |D| = \sqrt{D_x^2 + D_y^2} $$
- Since we just want maxima and minima in magnitude, approximate as:
  $$ |D| = |D_x| + |D_y| $$
- Edge pixels don't have eight neighbors
  - Skip computation of |D| for edges
  - Just set them to 0 using software

The Algorithm in Pseudocode

    for (row = 1; row <= 478; row = row + 1) begin
      for (col = 1; col <= 638; col = col + 1) begin
        sumx = 0; sumy = 0;
        for (i = -1; i <= +1; i = i + 1) begin
          for (j = -1; j <= +1; j = j + 1) begin
            sumx = sumx + O[row+i][col+j] * Gx[i][j];
            sumy = sumy + O[row+i][col+j] * Gy[i][j];
          end
        end
        D[row][col] = abs(sumx) + abs(sumy);
      end
    end

Data Formats and Rates

- Pixel values: 0 to 255 (8 bits)
- Coefficients are 0, ±1 and ±2
  - Partial products: –510 to +510 (10 bits)
  - Dx and Dy: –1020 to +1020 (11 bits)
  - |D|: 0 to 2040 (11 bits)
  - Final pixel value: scale back to 8 bits
- Video rate: 30 frames/sec
  - 640 × 480 = 307,200 pixels
  - 307,200 × 30 ≈ 10 million pixels/sec

Data Dependencies

- Pixels can be computed independently
- For each pixel: [Figure]

System Architecture

- Data dependencies suggest a pipeline
- Coefficient multiplies are simple shift/negate, so merge with adder stage

Memory Bandwidth

- Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  - Memory is 32 bits wide, byte addressable
  - Bandwidth = 50M operations/sec
- Camera produces 10 Mpixels/sec
  - Accelerator needs to process at this rate
- (8 reads + 1 write) × 10 Mpixel/sec = 90M operations/sec
  - Greater than memory bandwidth

Memory Bandwidth (continued)

- Read 4 pixels at once from each of previous, current, and next rows
  - Store in accelerator to compute multiple derivative image pixels
- Produce derivative pixels row-by-row, left-to-right
  - Read 3 × 32-bit words for every 4th derivative pixel computed
  - Write 4 pixels at a time
- (3 reads + 1 write) / 4 × 10 Mpixel/sec = 10M operations/sec
  - = 20% of available memory bandwidth

Sobel Accelerator Architecture

[Figure]

Accelerator Sequence

- Steady state
  - Write 4 result pixels
  - Read 4 pixels for previous, current, next rows
  - Compute for 4 cycles
  - Repeat…
- Start of row
  - Omit writes until pipeline full
- End of row
  - Omit reads to drain pipeline
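The two estimates on the Memory Bandwidth slides can be written out step by step. This is only a worked restatement of the slide arithmetic, assuming (as the slides do) one memory operation per 20 ns and a pixel rate rounded to 10 Mpixel/s. The symbols B_mem, R_naive, and R_block are labels introduced here for illustration and do not appear in the textbook.

```latex
\begin{align*}
B_{\text{mem}} &= \frac{1\ \text{operation}}{20\ \text{ns}} = 50\ \text{M operations/s} \\
R_{\text{naive}} &= (8\ \text{reads} + 1\ \text{write}) \times 10\ \text{Mpixel/s}
                  = 90\ \text{M operations/s} > B_{\text{mem}} \\
R_{\text{block}} &= \frac{3\ \text{reads} + 1\ \text{write}}{4\ \text{pixels}} \times 10\ \text{Mpixel/s}
                  = 10\ \text{M operations/s} \\
\frac{R_{\text{block}}}{B_{\text{mem}}} &= \frac{10}{50} = 20\%
\end{align*}
```

Reading one 32-bit word from each of three rows and writing one 32-bit result word serves four pixels, which is where the division by 4 comes from.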
Memory Operation Timing

- Steady state

[Figure]

Pixel Datapath

    // Computation datapath signals
    reg [31:0] prev_row, curr_row, next_row;
    reg [7:0] O [-1:+1][-1:+1];
    reg signed [10:0] Dx, Dy, D;
    reg [7:0] abs_D;
    reg [31:0] result_row;
    ...
    // Computational datapath
    always @(posedge clk_i) // Previous row register
      if (prev_row_load) prev_row <= dat_i;
      else if (shift_en) prev_row[31:8] <= prev_row[23:0];
    ... // Current row register
    ... // Next row register
    function [10:0] abs (input signed [10:0] x);
      abs = x >= 0 ? x : -x;
    endfunction
    ...

Pixel Datapath (continued)

    always @(posedge clk_i) // Computation pipeline
      if (shift_en) begin
        D = abs(Dx) + abs(Dy);
        abs_D <= D[10:3];
        Dx <= - $signed({3'b000, O[-1][-1]})
              + $signed({3'b000, O[-1][+1]})
              - ($signed({3'b000, O[ 0][-1]}) << 1)
              + ($signed({3'b000, O[ 0][+1]}) << 1)
              - $signed({3'b000, O[+1][-1]})
              + $signed({3'b000, O[+1][+1]});
        Dy <=   $signed({3'b000, O[-1][-1]})
              + ($signed({3'b000, O[-1][ 0]}) << 1)
              + $signed({3'b000, O[-1][+1]})
              - $signed({3'b000, O[+1][-1]})
              - ($signed({3'b000, O[+1][ 0]}) << 1)
              - $signed({3'b000, O[+1][+1]});
        ...
Pixel Datapath (continued)

        O[-1][-1] <= O[-1][0];
        O[-1][ 0] <= O[-1][+1];
        O[-1][+1] <= prev_row[31:24];
        O[ 0][-1] <= O[0][ 0];
        O[ 0][ 0] <= O[0][+1];
        O[ 0][+1] <= curr_row[31:24];
        O[+1][-1] <= O[+1][ 0];
        O[+1][ 0] <= O[+1][+1];
        O[+1][+1] <= next_row[31:24];
      end

    always @(posedge clk_i) // Result row register
      if (shift_en) result_row <= {result_row[23:0], abs_D};

Address Generation

- Given an image in memory at base address B
  - Address for pixel in row r, column c is B + r × 640 + c
- Base address (B) is fixed
- Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
- Use word-aligned addresses
  - Two least-significant bits always 00
  - Increment word address by 1

Address Generation (continued)

[Figure]

Address Generation (continued)

    always @(posedge clk_i) // O base address register
      if (O_base_ce) O_base <= dat_i[21:2];

    always @(posedge clk_i) // O address offset counter
      if (offset_reset) O_offset <= 0;
      else if (O_offset_cnt_en) O_offset <= O_offset + 1;

    always @(posedge clk_i) // D base address register
      if (D_base_ce) D_base <= dat_i[21:2];

    always @(posedge clk_i) // D address offset counter
      if (offset_reset) D_offset <= 0;
      else if (D_offset_cnt_en) D_offset <= D_offset + 1;
    ...

Address Generation (continued)

    assign O_prev_addr = O_base + O_offset;
    assign O_curr_addr = O_prev_addr + 640/4;
    assign O_next_addr = O_prev_addr + 1280/4;
    assign D_addr      = D_base + D_offset;
    assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                         curr_row_load ? O_curr_addr :
                         next_row_load ? O_next_addr :
                         D_addr;
    assign adr_o[1:0]  = 2'b00;

Control/Status Registers

    Register  Offset  Read/Write  Purpose
    Int_en    0       Write-only  Interrupt enable (bit 0).
    Start     4       Write-only  Write causes image processing to start (value ignored).
    O_base    8       Write-only  Original image base address.
    D_base    12      Write-only  Derivative image base address + 640.
    Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.

Slave Bus Interface

    assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
    assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
    assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

    always @(posedge clk_i) // Interrupt enable register
      if (rst_i) int_en <= 1'b0;
      else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
        int_en <= dat_i[0];

    always @(posedge clk_i) // Status register
      if (rst_i) done <= 1'b0;
      else if (done_set)
        // This occurs when last write is acknowledged,
        // and so cannot coincide with a read of the status register.
        done <= 1'b1;
      else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
        // Acknowledged read of the status register clears done
        done <= 1'b0;

    assign int_req = int_en && done;
    ...

Slave Bus Interface (continued)

    always @(posedge clk_i) // Generate ack output
      ack_o <= cyc_i && stb_i && !ack_o;

    // Wishbone data output multiplexer
    always @*
      if (cyc_i && stb_i && !we_i)
        if (adr_i == 2'b00)
          dat_o = {31'b0, done}; // status register read
        else
          dat_o = 32'b0;         // other registers read as 0
      else
        dat_o = result_row;      // for master write

Control Sequencing

- Use a finite-state machine
- Counters keep track of rows (0 to 477) and columns (0 to 159)
- See textbook for details of FSM output functions

State Transition Diagram

[Figure]

Accelerator Verification

- Simulation-based verification of each section of the accelerator
  - Slave bus operations
  - Computation sequencing
  - Master bus operations
  - Address generation
  - Pixel computation
- Testbench including the accelerator
  - Bus functional processor model
  - Simplified memory and bus arbiter models

Sobel Verification Testbench

[Figure: testbench block diagram with Processor BFM, Arbiter, Sobel Accelerator, Multiplexed Bus (muxes and connections), and Memory Model]

Processor Bus Functional Model

    initial begin // Processor bus-functional model
      cpu_adr_o <= 23'h000000;
      cpu_sel_o <= 4'b0000;
      cpu_dat_o <= 32'h00000000;
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      @(negedge rst);
      @(posedge clk);
      // Write 008000 (hex) to O_base_addr register
      bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
      // Write 053000 + 280 (hex) to D_base_addr register
      bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
      // Write 1 to interrupt control register (enable interrupt)
      bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
      // Write to start register (data value ignored)
      bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
      // End of write operations
      ...

Processor Bus Functional Model (continued)

      cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
      begin: loop
        forever begin
          #10000;
          @(posedge clk);
          // Read status register
          cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
          cpu_sel_o <= 4'b1111;
          cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
          @(posedge clk);
          while (!cpu_ack_i) @(posedge clk);
          cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
          if (cpu_dat_i[0]) disable loop;
        end
      end
    end

Memory Bus Functional Model

    always begin // Memory bus-functional model
      mem_ack_o <= 1'b0;
      mem_dat_o <= 32'h00000000;
      @(posedge clk);
      while (!(bus_cyc && mem_stb_i)) @(posedge clk);
      if (!bus_we)
        mem_dat_o <= 32'h00000000; // in place of read data
      mem_ack_o <= 1'b1;
      @(posedge clk);
    end

Bus Arbiter

- Uses sobel_cyc_o and cpu_cyc_o as request inputs
- If both request at the same time, give accelerator priority
- Mealy FSM

Bus Arbiter (continued)

    always
    @(posedge clk) // Arbiter FSM register
      if (rst) arbiter_current_state <= sobel;
      else arbiter_current_state <= arbiter_next_state;

    always @* // Arbiter logic
      case (arbiter_current_state)
        sobel:
          if (sobel_cyc_o) begin
            sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
          end
          else if (!sobel_cyc_o && cpu_cyc_o) begin
            sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
          end
          else begin
            sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
          end
        cpu:
          if (cpu_cyc_o) begin
            sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
          end
          else if (sobel_cyc_o && !cpu_cyc_o) begin
            sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
          end
          else begin
            sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
          end
      endcase

Simulation Results

- See waveforms in textbook
  - Demonstrates sequencing and address generation
- But what about…
  - Data values computed correctly
  - Interactions between processor and accelerator
- Need to use more sophisticated verification techniques
  - Due to complexity of the design

Summary

- Accelerators boost performance using parallel hardware
  - Replication, pipelining, …
- Amdahl's Law
  - Best payback from accelerating a kernel
- DMA avoids processor overhead
- Verification requires advanced techniques
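The Simulation Results slide notes that the testbench shows sequencing and address generation but does not check that data values are computed correctly (the memory model returns zeros in place of read data). One possible way to start closing that gap is a small behavioral reference model that a testbench could compare against the accelerator's result pixels. The sketch below is an assumption-laden illustration, not code from the textbook: the module name `sobel_ref_demo`, the function `sobel_ref`, and the packing of the 3×3 window into a 72-bit vector are all hypothetical. It follows the Sobel masks and the `D[10:3]` scaling used in the pixel datapath.

```verilog
// Hypothetical reference model for checking pixel values. The module,
// function, and variable names are illustrative, not from the textbook.
module sobel_ref_demo;

  // Scaled Sobel magnitude for one 3x3 window.
  // win packs the window row by row, top-left pixel in the most
  // significant byte:
  //   {O[-1][-1], O[-1][0], O[-1][+1],
  //    O[ 0][-1], O[ 0][0], O[ 0][+1],
  //    O[+1][-1], O[+1][0], O[+1][+1]}
  function [7:0] sobel_ref (input [71:0] win);
    integer nw, n, ne, we, ea, sw, s, se;
    integer dx, dy, d;
    begin
      nw = win[71:64]; n = win[63:56]; ne = win[55:48];
      we = win[47:40];                 ea = win[31:24]; // center has weight 0
      sw = win[23:16]; s = win[15:8];  se = win[7:0];
      dx = -nw + ne - 2*we + 2*ea - sw + se;  // Gx mask
      dy =  nw + 2*n + ne - sw - 2*s - se;    // Gy mask
      if (dx < 0) dx = -dx;
      if (dy < 0) dy = -dy;
      d = dx + dy;           // |D| = |Dx| + |Dy|, range 0 to 2040
      sobel_ref = d[10:3];   // scale back to 8 bits, like abs_D <= D[10:3]
    end
  endfunction

  initial begin
    // Uniform window: no intensity change
    $display("flat window     -> %0d", sobel_ref({9{8'd100}}));
    // Black left and middle columns, white right column: a vertical edge
    $display("vertical edge   -> %0d",
             sobel_ref({8'd0, 8'd0, 8'd255,
                        8'd0, 8'd0, 8'd255,
                        8'd0, 8'd0, 8'd255}));
  end

endmodule
```

For the flat window both derivatives are 0, so the result is 0. For the vertical edge, Dx = 255 + 2 × 255 + 255 = 1020 and Dy = 0, so the scaled result is 1020 >> 3 = 127. In a fuller testbench, a function like this could be applied to image data held in the memory model and compared with each byte the accelerator writes back; that wiring is left out here because it depends on testbench details the slides do not show.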