09 Accelerators ppt

Digital Design: An Embedded Systems Approach Using Verilog
Chapter 9: Accelerators

Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.

Performance and Parallelism

- A processor core performs steps in sequence
  - Performance limited by the instruction rate
- Accelerating performance
  - Perform steps in parallel
  - Takes less time overall to complete an operation
- Instruction-level parallelism
  - Within a processor core
  - Pipelining, multiple-issue
- Accelerators
  - Custom hardware for parallel operations

Digital Design — Chapter 9 — Accelerators

Achievable Parallelism

- How many steps can be performed at once?
- Regularly structured data
  - Independent processing steps
  - Examples
    - Video and image pixel processing
    - Audio or sensor signal processing
- Constrained by data dependencies
  - Operations that depend on results of previous steps

Algorithm Kernels

- Algorithm: specification of the required processing steps
  - Often expressed in a programming language
- Kernel: the part that involves the most intensive, repetitive processing
  - "10% of operations take 90% of the time"
- Accelerating a kernel with parallel hardware gives the best payback

Amdahl's Law

- Time for an algorithm is t
  - Fraction f is spent on a kernel
  - $t = ft + (1 - f)t$
- Accelerator speeds up kernel by a factor s
  - $t' = \frac{ft}{s} + (1 - f)t$
- Overall speedup factor s'
  - $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
  - For large f, $s' \to s$
  - For small f, $s' \to 1$

Amdahl's Law Example

- An algorithm with two kernels
  - Kernel 1: 80% of time, can be sped up 10 times
  - Kernel 2: 15% of time, can be sped up 100 times
- Which speedup gives best overall improvement?
  - For kernel 1: $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
  - For kernel 2: $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$

Parallel Architectures

- An architecture for an accelerator specifies
  - Processing blocks
  - Data flow between them
- Parallelism through replication
  - Multiple identical blocks operating on different data elements
  - Works well when elements can be processed independently

Parallel Architectures

- Parallelism through pipelining
  - Break a computation into steps, perform them in assembly-line fashion
  - Latency (time to complete a single operation) is not increased
  - Throughput (rate of completion of operations) is increased
    - Ideally by a factor equal to the number of pipeline stages

[Figure: data in -> step 1 -> step 2 -> step 3 -> data out]

Direct Memory Access (DMA)

- Input/output data for accelerators must be transferred at high speed
  - Using the processor would be too slow
- Direct memory access
  - I/O controller and accelerator transfer data to and from memory autonomously
  - Program supplies starting address and length

Bus Arbitration

- Bus masters take turns to use bus to access slaves
  - Controlled by a bus arbiter
- Arbitration policies
  - Priority, round-robin, …

[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]

Block-Processing Accelerator

- Data arranged in regular groups of contiguous memory locations
  - Accelerator works block by block
  - E.g., images in blocks of 8 × 8 × 16-bit pixels
- Datapath comprises
  - Memory access: address generation, counters
  - Computation section
  - Control section: finite-state machine(s)

Stream-Processing Accelerator

- Streams of data from an input source
  - E.g., high-speed sensors
- Digital signal processing (DSP)
  - Analog sensor signal converted to stream of digital sample values
  - Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)

Processor/Accelerator Interface

- Embedded software controls an accelerator
  - Providing control parameters
  - Synchronizing operations
- Input/output registers and interrupts
  - Interact with the control sequencer

Case Study: Edge Detection

- Illustration of accelerator design
- Edge detection in video processing
  - Identify where image intensity changes abruptly
  - Typically at the boundary of objects
  - First step in identifying objects in a scene
- Application areas
  - Video surveillance, computer vision, …
- For this case study
  - Monochrome images of 640 × 480 × 8-bit pixels
  - Stored row-by-row in memory
  - Pixel values: 0 (black) to 255 (white)

Sobel Edge Detection

- Compute derivatives of intensity in x and y directions
  - Look for minima and maxima (where intensity changes most rapidly)

The Sobel Algorithm

- Use convolution to approximate partial derivatives Dx and Dy at each position
  - Weighted sum of value of a pixel and its eight nearest neighbors
  - Coefficients represented using a 3×3 convolution mask
- Sobel masks for x and y derivatives
  - $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$, with $D_x(i, j) = O(i, j) * G_x$
  - $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$, with $D_y(i, j) = O(i, j) * G_y$

The Sobel Algorithm

- Combine partial derivatives
  - $|D| = \sqrt{D_x^2 + D_y^2}$
- Since we just want maxima and minima in magnitude, approximate as:
  - $|D| \approx |D_x| + |D_y|$
- Edge pixels (those on the image border) don't have eight neighbors
  - Skip computation of |D| for edges
  - Just set them to 0 using software

The Algorithm in Pseudocode

    for (row = 1; row <= 478; row = row + 1) begin
      for (col = 1; col <= 638; col = col + 1) begin
        sumx = 0; sumy = 0;
        for (i = -1; i <= +1; i = i + 1) begin
          for (j = -1; j <= +1; j = j + 1) begin
            sumx = sumx + O[row+i][col+j] * Gx[i][j];
            sumy = sumy + O[row+i][col+j] * Gy[i][j];
          end
        end
        D[row][col] = abs(sumx) + abs(sumy);
      end
    end

Data Formats and Rates

- Pixel values: 0 to 255 (8 bits)
- Coefficients are 0, ±1 and ±2
  - Partial products: -510 to +510 (10 bits)
  - Dx and Dy: -1020 to +1020 (11 bits)
  - |D|: 0 to 2040 (11 bits)
  - Final pixel value: scale back to 8 bits
- Video rate: 30 frames/sec
  - 640 × 480 = 307,200 pixels
  - 307,200 × 30 ≈ 10 million pixels/sec

Data Dependencies

- Pixels can be computed independently
- For each pixel: [Figure: data dependencies for one pixel]

System Architecture

- Data dependencies suggest a pipeline
  - Coefficient multiplies are simple shift/negate, so merge with adder stage

Memory Bandwidth

- Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  - Memory is 32 bits wide, byte addressable
  - Bandwidth = 50M operations/sec
- Camera produces 10 Mpixels/sec
  - Accelerator needs to process at this rate
  - (8 reads + 1 write) × 10 Mpixel/sec = 90M operations/sec
  - Greater than memory bandwidth

Memory Bandwidth

- Read 4 pixels at once from each of previous, current, and next rows
  - Store in accelerator to compute multiple derivative image pixels
- Produce derivative pixels row-by-row, left-to-right
  - Read 3 × 32-bit words for every 4th derivative pixel computed
  - Write 4 pixels at a time
  - (3 reads + 1 write) / 4 × 10 Mpixel/sec = 10M operations/sec = 20% of available memory bandwidth

Sobel Accelerator Architecture

[Figure: Sobel accelerator architecture]

Accelerator Sequence

- Steady state
  - Write 4 result pixels
  - Read 4 pixels for previous, current, next rows
  - Compute for 4 cycles
  - Repeat…
- Start of row
  - Omit writes until pipeline full
- End of row
  - Omit reads to drain pipeline

Memory Operation Timing

- Steady state

[Figure: memory operation timing in steady state]

Pixel Datapath

    // Computation datapath signals

    reg [31:0] prev_row, curr_row, next_row;
    reg [7:0] O [-1:+1][-1:+1];
    reg signed [10:0] Dx, Dy, D;
    reg [7:0] abs_D;
    reg [31:0] result_row;

    ...

    // Computational datapath

    always @(posedge clk_i) // Previous row register
      if (prev_row_load)
        prev_row <= dat_i;
      else if (shift_en)
        prev_row[31:8] <= prev_row[23:0];

    ... // Current row register

    ... // Next row register

    function [10:0] abs (input signed [10:0] x);
      abs = x >= 0 ? x : -x;
    endfunction

    ...

Pixel Datapath (continued)

    always @(posedge clk_i) // Computation pipeline
      if (shift_en) begin
        D = abs(Dx) + abs(Dy);
        abs_D <= D[10:3];
        Dx <= - $signed({3'b000, O[-1][-1]})
              + $signed({3'b000, O[-1][+1]})
              - ($signed({3'b000, O[ 0][-1]}) << 1)
              + ($signed({3'b000, O[ 0][+1]}) << 1)
              - $signed({3'b000, O[+1][-1]})
              + $signed({3'b000, O[+1][+1]});
        Dy <=   $signed({3'b000, O[-1][-1]})
              + ($signed({3'b000, O[-1][ 0]}) << 1)
              + $signed({3'b000, O[-1][+1]})
              - $signed({3'b000, O[+1][-1]})
              - ($signed({3'b000, O[+1][ 0]}) << 1)
              - $signed({3'b000, O[+1][+1]});
        ...
Pixel Datapath (continued)

        O[-1][-1] <= O[-1][ 0];
        O[-1][ 0] <= O[-1][+1];
        O[-1][+1] <= prev_row[31:24];
        O[ 0][-1] <= O[ 0][ 0];
        O[ 0][ 0] <= O[ 0][+1];
        O[ 0][+1] <= curr_row[31:24];
        O[+1][-1] <= O[+1][ 0];
        O[+1][ 0] <= O[+1][+1];
        O[+1][+1] <= next_row[31:24];
      end

    always @(posedge clk_i) // Result row register
      if (shift_en) result_row <= {result_row[23:0], abs_D};

Address Generation

- Given an image in memory at base address B
  - Address for pixel in row r, column c is B + r × 640 + c
  - Base address (B) is fixed
  - Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
- Use word-aligned addresses
  - Two least-significant bits always 00
  - Increment word address by 1

Address Generation

[Figure: address generation datapath]

Address Generation

    always @(posedge clk_i) // O base address register
      if (O_base_ce) O_base <= dat_i[21:2];

    always @(posedge clk_i) // O address offset counter
      if (offset_reset) O_offset <= 0;
      else if (O_offset_cnt_en) O_offset <= O_offset + 1;

    always @(posedge clk_i) // D base address register
      if (D_base_ce) D_base <= dat_i[21:2];

    always @(posedge clk_i) // D address offset counter
      if (offset_reset) D_offset <= 0;
      else if (D_offset_cnt_en) D_offset <= D_offset + 1;

    ...

Address Generation

    assign O_prev_addr = O_base + O_offset;
    assign O_curr_addr = O_prev_addr + 640/4;
    assign O_next_addr = O_prev_addr + 1280/4;
    assign D_addr      = D_base + D_offset;

    assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                         curr_row_load ? O_curr_addr :
                         next_row_load ? O_next_addr :
                         D_addr;
    assign adr_o[1:0]  = 2'b00;

Control/Status Registers

    Register  Offset  Read/Write  Purpose
    Int_en    0       Write-only  Interrupt enable (bit 0).
    Start     4       Write-only  Write causes image processing to start (value ignored).
    O_base    8       Write-only  Original image base address.
    D_base    12      Write-only  Derivative image base address + 640.
    Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.

Slave Bus Interface

    assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
    assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
    assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

    always @(posedge clk_i) // Interrupt enable register
      if (rst_i)
        int_en <= 1'b0;
      else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
        int_en <= dat_i[0];

    always @(posedge clk_i) // Status register
      if (rst_i)
        done <= 1'b0;
      else if (done_set)
        // This occurs when last write is acknowledged,
        // and so cannot coincide with a read of the status register.
        done <= 1'b1;
      else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
        // Acknowledged read of the status register clears done
        done <= 1'b0;

    assign int_req = int_en && done;

    ...

Slave Bus Interface (continued)

    always @(posedge clk_i) // Generate ack output
      ack_o <= cyc_i && stb_i && !ack_o;

    // Wishbone data output multiplexer
    always @*
      if (cyc_i && stb_i && !we_i)
        if (adr_i == 2'b00)
          dat_o = {31'b0, done}; // status register read
        else
          dat_o = 32'b0;         // other registers read as 0
      else
        dat_o = result_row;      // for master write

Control Sequencing

- Use a finite-state machine
  - Counters keep track of rows (0 to 477) and columns (0 to 159)
  - See textbook for details of FSM output functions

State Transition Diagram

[Figure: state transition diagram of the control FSM]

Accelerator Verification

- Simulation-based verification of each section of the accelerator
  - Slave bus operations
  - Computation sequencing
  - Master bus operations
  - Address generation
  - Pixel computation
- Testbench including the accelerator
  - Bus functional processor model
  - Simplified memory and bus arbiter models
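
As one illustration of the section-by-section simulation listed above, the following is a minimal self-checking sketch for the address arithmetic from the Address Generation slides. The signal names and the 640/4 and 1280/4 row strides come from the slides. The module name addr_gen_tb, the 10 ns clock period, the test base address, and the PASS/FAIL messages are assumptions made for this example and are not taken from the textbook.

```verilog
`timescale 1ns/1ns
// Sketch only: module name, clock period, and test values are assumptions.
module addr_gen_tb;
  reg         clk_i           = 1'b0;
  reg         offset_reset    = 1'b1;
  reg         O_offset_cnt_en = 1'b0;
  reg  [19:0] O_base          = 20'h02000; // word address of byte address 0x008000
  reg  [19:0] O_offset;
  wire [19:0] O_prev_addr, O_curr_addr, O_next_addr;
  integer     errors = 0;

  always #5 clk_i = ~clk_i;                // assumed 100 MHz clock

  // Offset counter, as on the Address Generation slide
  always @(posedge clk_i)
    if (offset_reset)         O_offset <= 0;
    else if (O_offset_cnt_en) O_offset <= O_offset + 1;

  // Row addresses: one image row is 640 bytes = 160 words
  assign O_prev_addr = O_base + O_offset;
  assign O_curr_addr = O_prev_addr + 640/4;
  assign O_next_addr = O_prev_addr + 1280/4;

  // Compare one address against its expected value
  task check (input [19:0] actual, input [19:0] expected);
    begin
      if (actual !== expected) begin
        $display("FAIL: got %h, expected %h", actual, expected);
        errors = errors + 1;
      end
    end
  endtask

  initial begin
    @(posedge clk_i); #1;                  // counter is reset on this edge
    check(O_prev_addr, 20'h02000);
    offset_reset    = 1'b0;
    O_offset_cnt_en = 1'b1;
    repeat (3) @(posedge clk_i);           // three groups of 4 pixels
    #1;
    check(O_prev_addr, 20'h02003);
    check(O_curr_addr, 20'h020A3);         // previous row + 160 words
    check(O_next_addr, 20'h02143);         // previous row + 320 words
    if (errors == 0) $display("PASS");
    $finish;
  end
endmodule
```

The check samples the addresses 1 ns after each clock edge so that the nonblocking counter update has settled before the comparison. A fuller testbench would also cover the D address path and the adr_o multiplexer.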
Sobel Verification Testbench

[Figure: testbench block diagram with Processor BFM, Arbiter, Sobel Accelerator, Multiplexed Bus (Muxes and Connections), and Memory Model]

Processor Bus Functional Model

    initial begin // Processor bus-functional model
      cpu_adr_o <= 23'h000000;
      cpu_sel_o <= 4'b0000;
      cpu_dat_o <= 32'h00000000;
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      @(negedge rst);
      @(posedge clk);
      // Write 008000 (hex) to O_base_addr register
      bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
      // Write 053000 + 280 (hex) to D_base_addr register
      bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
      // Write 1 to interrupt control register (enable interrupt)
      bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
      // Write to start register (data value ignored)
      bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
      // End of write operations
      ...

Processor Bus Functional Model (continued)

      cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
      begin: loop
        forever begin
          #10000;
          @(posedge clk);
          // Read status register
          cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
          cpu_sel_o <= 4'b1111;
          cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
          @(posedge clk); while (!cpu_ack_i) @(posedge clk);
          cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
          if (cpu_dat_i[0]) disable loop;
        end
      end
    end

Memory Bus Functional Model

    always begin // Memory bus-functional model
      mem_ack_o <= 1'b0;
      mem_dat_o <= 32'h00000000;
      @(posedge clk);
      while (!(bus_cyc && mem_stb_i)) @(posedge clk);
      if (!bus_we)
        mem_dat_o <= 32'h00000000; // in place of read data
      mem_ack_o <= 1'b1;
      @(posedge clk);
    end

Bus Arbiter

- Uses sobel_cyc_o and cpu_cyc_o as request inputs
  - If both request at the same time, give accelerator priority
- Mealy FSM
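
The processor bus-functional model earlier in this section calls a bus_write task whose body does not appear on the slides. The following is a hypothetical sketch of what such a task might look like, modeled on the status-register read sequence from the same slides (assert cyc/stb, wait for ack, then deassert). The task body, the wrapper module bus_write_sketch, the stand-in slave that acknowledges after one cycle, and the example address are all assumptions, not the textbook's code.

```verilog
`timescale 1ns/1ns
// Sketch only: the task body and everything around it are assumptions.
module bus_write_sketch;
  reg         clk = 1'b0;
  reg  [22:0] cpu_adr_o;
  reg  [3:0]  cpu_sel_o;
  reg  [31:0] cpu_dat_o;
  reg         cpu_cyc_o = 1'b0, cpu_stb_o = 1'b0, cpu_we_o = 1'b0;
  reg         cpu_ack_i = 1'b0;

  always #5 clk = ~clk;                    // assumed 100 MHz clock

  // Stand-in slave: acknowledges one cycle after seeing a request,
  // using the same ack expression as the slave interface slide.
  always @(posedge clk)
    cpu_ack_i <= cpu_cyc_o && cpu_stb_o && !cpu_ack_i;

  // Hypothetical write task: drive address and data, hold the request
  // until the slave acknowledges, then release the bus.
  task bus_write (input [22:0] adr, input [31:0] dat);
    begin
      cpu_adr_o <= adr;
      cpu_sel_o <= 4'b1111;
      cpu_dat_o <= dat;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b1;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
    end
  endtask

  initial begin
    @(posedge clk);
    bus_write(23'h000008, 32'h00008000);   // example address is arbitrary
    $display("write acknowledged at t=%0t", $time);
    $finish;
  end
endmodule
```

Driving the bus signals with nonblocking assignments matches the style of the read sequence in the bus-functional model and avoids races with slave logic clocked on the same edge.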
Bus Arbiter

    always @(posedge clk) // Arbiter FSM register
      if (rst) arbiter_current_state <= sobel;
      else     arbiter_current_state <= arbiter_next_state;

    always @* // Arbiter logic
      case (arbiter_current_state)
        sobel: if (sobel_cyc_o) begin
                 sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
               end
               else if (!sobel_cyc_o && cpu_cyc_o) begin
                 sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
               end
               else begin
                 sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
               end
        cpu:   if (cpu_cyc_o) begin
                 sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
               end
               else if (sobel_cyc_o && !cpu_cyc_o) begin
                 sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
               end
               else begin
                 sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
               end
      endcase

Simulation Results

- See waveforms in textbook
  - Demonstrates sequencing and address generation
- But what about…
  - Data values computed correctly
  - Interactions between processor and accelerator
- Need to use more sophisticated verification techniques
  - Due to complexity of the design

Summary

- Accelerators boost performance using parallel hardware
  - Replication, pipelining, …
- Amdahl's Law
  - Best payback from accelerating a kernel
- DMA avoids processor overhead
- Verification requires advanced techniques
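
The arbitration policy from the Bus Arbiter slides (the accelerator wins simultaneous requests, and the current owner keeps the bus while it is still requesting) can be exercised on its own. The sketch below re-creates the arbiter logic with blocking assignments and default values, then checks four request patterns. The module name arbiter_tb, the one-bit state encoding, the clock period, and the check task are assumptions made for this example. Only the signal names and the transition rules come from the slides.

```verilog
`timescale 1ns/1ns
// Sketch only: state encoding, clocking, and checks are assumptions.
module arbiter_tb;
  localparam sobel = 1'b0, cpu = 1'b1;
  reg clk = 1'b0, rst = 1'b1;
  reg sobel_cyc_o = 1'b0, cpu_cyc_o = 1'b0;
  reg sobel_gnt, cpu_gnt;
  reg arbiter_current_state, arbiter_next_state;
  integer errors = 0;

  always #5 clk = ~clk;

  always @(posedge clk)                    // Arbiter FSM register
    if (rst) arbiter_current_state <= sobel;
    else     arbiter_current_state <= arbiter_next_state;

  always @* begin                          // Arbiter logic (Mealy outputs)
    sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
    case (arbiter_current_state)
      sobel: if (sobel_cyc_o) sobel_gnt = 1'b1;
             else if (cpu_cyc_o) begin
               cpu_gnt = 1'b1; arbiter_next_state = cpu;
             end
      cpu:   if (cpu_cyc_o) begin
               cpu_gnt = 1'b1; arbiter_next_state = cpu;
             end
             else if (sobel_cyc_o) sobel_gnt = 1'b1;
    endcase
  end

  // Compare both grant outputs against their expected values
  task check (input exp_sobel, input exp_cpu);
    begin
      if (sobel_gnt !== exp_sobel || cpu_gnt !== exp_cpu) begin
        $display("FAIL at t=%0t: sobel_gnt=%b cpu_gnt=%b",
                 $time, sobel_gnt, cpu_gnt);
        errors = errors + 1;
      end
    end
  endtask

  initial begin
    @(posedge clk); #1 rst = 1'b0;         // state is now sobel
    sobel_cyc_o = 1'b1; cpu_cyc_o = 1'b1;  // simultaneous requests
    #1 check(1'b1, 1'b0);                  // accelerator has priority
    sobel_cyc_o = 1'b0;                    // accelerator releases the bus
    #1 check(1'b0, 1'b1);                  // processor is granted
    @(posedge clk); #1;                    // state becomes cpu
    sobel_cyc_o = 1'b1;                    // accelerator requests again
    #1 check(1'b0, 1'b1);                  // processor keeps the bus
    cpu_cyc_o = 1'b0;                      // processor releases the bus
    #1 check(1'b1, 1'b0);                  // accelerator is granted
    if (errors == 0) $display("PASS");
    $finish;
  end
endmodule
```

Because the grants are Mealy outputs, they respond to request changes within the same cycle, which is why the checks can run 1 ns after each stimulus change rather than waiting for a clock edge.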
