Introduction to circuit design using Celoxica’s Handel-C

Presenter: Mr. David Sanders
Co-Sponsored by: The Internet Innovation Centre and IEEE Computer Society

Agenda

FPGA overview
Purpose of Handel-C
Comparison with ANSI C
  Handel-C data types
  Parallelism
  Special Handel-C constructs and data types
Hardware implementation of (some) Handel-C constructs
Optimization and retiming features of Celoxica’s design tool

FPGAs

Field Programmable Gate Array
A user-programmable logic device with a collection of Look-Up Tables (LUTs), routing resources, and Input/Output Blocks (IOBs).
LUTs have a varying number of inputs depending on the vendor and technology used. Usually at least 4 inputs, but state-of-the-art LUTs can have up to 8 (Altera’s 8-input fracturable LUT).
Modern FPGAs also include dedicated RAM blocks, ALUs, multipliers, or hard or soft-core processors (e.g. ARM, NIOS II, MicroBlaze, PPC).

So what does Handel-C do for us?

Those who have programmed in VHDL/Verilog know that you must think in terms of a state machine and write the code accordingly.
Handel-C is one level of abstraction higher than an HDL.
  The compiler deals with the state machine generation automatically.
It creates a netlist, but not the FPGA programming files.
  The FPGA vendor tools are still required.
  It does, however, provide scripts for automating place and route and bitstream creation with the Xilinx and Altera tools.

Still no free lunch!!

Just as in Professor McLeod’s talk last time, there is no free lunch.
The state machine generated by the Handel-C compiler uses one-hot encoding.
  Not necessarily optimal for every design, but it still gives good results in practice.

Other Capabilities of Handel-C Design Suite

Provides support for Altera, Xilinx, and Actel FPGAs.
The Handel-C compiler can take advantage of technology available in a particular device (RAMs, ALUs, multipliers, etc.).
The compiler can provide output in different formats:
  Netlist (EDIF)
  VHDL code from C code
  Debug (can be used with a SystemC or ANSI C front-end for verification)
Provides a Platform Abstraction Layer (PAL)
  A set of common utilities for hardware devices commonly found on development boards
  Video, keyboard, mouse, Ethernet, RS-232, LED output, general I/O, etc.
Provides support for integration with company-specific tools and/or intellectual property.
  Quartus II, SOPC Builder, NIOS II processor, MicroBlaze processor

Handel-C Data Types

Handel-C supports all of the primitive integral types provided by ANSI C (signed and unsigned).
  char, int, short, long
Variables are implemented as registers.
The depth of an array must be specified at compile time.
Variables of arbitrary width from 1 to 128 bits can also be declared, e.g.

    unsigned 8 myVariable;
    signed 25 myVariable2[15];

No native floating point types or calculations in the current version.
  The course instructor claims they will be included in the next release.

Operators

All the operators from ANSI C, plus a few others:
Relational: !=, ==, <, >, <=, >= (greater-than and less-than comparisons are expensive to evaluate with combinational logic).
  Operands must have the same width. The result is a 1-bit value.
Logical: &&, ||, !
  Take 1-bit unsigned operands; however, the compiler will take x || y as: x!=0 || y!=0
Bitwise: ^, |, &, ~
  Operands must have equal width.
Shift: <<, >>
  For a << b, b must have a width of ceil(log2(width(a)+1)).
  Macros provided by the Platform Developers Kit (PDK).

…the others

Bit manipulation
  Take: <-
  Drop: \\

    a          = 1 0 0 1 1
    b = a <- 3 = 0 1 1
    c = a \\ 3 = 1 0

  Very cheap in hardware since these operators are implemented as wires.
Range selection: Expression[n:m] (bits n to m)

    a[3:1] = 0 0 1

Concatenation: expression1 @ expression2

    d = a @ a[3:1] = 1 0 0 1 1 0 0 1

Parallelism

Since logic circuit operation is highly parallel by nature, a design tool must support parallelism.
This is accomplished in Handel-C with a par statement, as opposed to a seq statement, where the code is executed sequentially.

    static unsigned 8 a = 2;
    static unsigned 8 b = 1;
    par
    {
        a++;
        b = a + 10;
    }

Results: a = 3, b = 12
Each Handel-C assignment takes 1 clock cycle.
Both statements begin execution at the same time, so the two statements take only 1 clock cycle combined.
Operations are performed on the value that each variable held at the end of the previous clock cycle.

    static unsigned 8 a = 2;
    static unsigned 8 b = 1;
    seq
    {
        a++;
        b = a + 10;
    }

Results: a = 3, b = 13
The seq block operates in the same manner as you would expect from an ANSI C program.

Signals

Occasionally, however, we need to use a value immediately after assigning it in a par block.
This can be done by declaring a variable as a signal.
The value of a signal lasts only for the duration of the current clock cycle.

    signal unsigned 8 a;
    static unsigned 8 b;
    par
    {
        a = 7;
        b = a;
    }

Results: a = 0, b = 7

    signal unsigned 8 a;
    static unsigned 8 b;
    seq
    {
        a = 7;
        b = a;
    }

Results: a = 0, b = 0

Nesting

seq and par can be nested, as in the following example:

    par
    {
        seq
        {
            /* some statements to be executed sequentially */
        }
        seq
        {
            /* these statements are executed sequentially,
               but in parallel with the previous seq block */
        }
    }

par will not return until all of its statements/sub-blocks have completed.

Special Data Types

Input/Output
Obviously there must be a mechanism for performing I/O with the FPGA.
Handel-C has data types for buses or interfaces (input, output, tri-state).
It also supports ports: I/O between modules/components in a design, rather than a physical pin.
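As an aside on the par and seq results shown earlier, the difference can be reproduced with a small software model. The sketch below is written in Python rather than Handel-C, and the helper names run_par and run_seq are hypothetical, introduced only for illustration. It assumes that every assignment in a par block reads the values held at the end of the previous clock cycle, and that unsigned 8 arithmetic wraps at 256.

```python
def run_seq(state, assignments):
    """Apply assignments one after another; each one sees earlier results."""
    state = dict(state)
    for target, expr in assignments:
        state[target] = expr(state) % 256      # unsigned 8 wraps at 256
    return state


def run_par(state, assignments):
    """Apply all assignments in one clock cycle from a single snapshot."""
    snapshot = dict(state)                     # values from the previous cycle
    result = dict(state)
    for target, expr in assignments:
        result[target] = expr(snapshot) % 256  # every read uses the snapshot
    return result


# The two statements from the slide: a++; b = a + 10;
assignments = [
    ("a", lambda s: s["a"] + 1),
    ("b", lambda s: s["a"] + 10),
]
initial = {"a": 2, "b": 1}

par_result = run_par(initial, assignments)
seq_result = run_seq(initial, assignments)
print(par_result)  # {'a': 3, 'b': 12}
print(seq_result)  # {'a': 3, 'b': 13}
```

The only difference between the two helpers is which dictionary the right-hand sides read from, which is the whole distinction between par and seq in this model. It does not model signals or clock-cycle counts.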
I/O Declaration Examples

Input interface prototype:

    interface bus_in(type portName) Name() with {data = {Pin List}};

Input interface usage:

    interface bus_in(unsigned 2 val) myInput() with {data = {"P1","P2"}};
    unsigned 2 inData;
    inData = myInput.val; //read the value {P1 P2}

Examples cont’d

Output interface prototype:

    interface bus_out() Name(type portName=Expression) with {data = {Pin List}};

Output interface usage:

    static unsigned 8 counter = 0;
    interface bus_out() CountOut(unsigned 8 outVal=counter+1);
    while(1)
    {
        counter++;
    }

RAM and ROM

There is no such thing as malloc() on an FPGA.
Instead, Handel-C allows you to store variables in the FPGA’s dedicated RAM blocks:

    ram int 9 myRam[256]; /* a RAM block that holds 256 9-bit integers */
    static rom int 9 myRom[3] = {100,200,300}; /* must be static or global */

RAMs differ from arrays because declaring an array is the same as declaring multiple variables.
  This means that all of an array’s elements can be accessed simultaneously.
  RAMs cannot, because they only have 1 or 2 ports.

    myRam[25]++; /* read, modify, and write in one cycle = undefined results */
    par /* 2 writes during the same cycle -> this also won't work */
    {
        myRam[0] = 100;
        myRam[2] = 498;
    }

If/Else

Handel-C if/else syntax is almost the same as in ANSI C.
The exception: the condition of the if() must take 0 clock cycles to evaluate.
  This implies that there cannot be any variable assignment in the condition expression.

    if( (z = x + y) == 6) //legal in ANSI C, but not in Handel-C

Loops

while(), for(), do…while()
All have the same syntax as in ANSI C.
The same limitation applies to the loop conditions as with if/else.
When programming a PC, it is good practice to use a for loop when the context calls for it.
When writing C code for circuits, it is almost never good practice to use for() loops at all.
  They cost one clock cycle of overhead per iteration.

While Loop Optimization

The limitations of a for() loop can be avoided by updating a counter variable in parallel with the body of a while() loop.
    static unsigned 4 x = 15;
    do
    {
        par
        {
            //do something
            x--; //the counter update shares a clock cycle with the loop body
        }
    } while(x != 0);

Macros, Channels, Prialt, and Semaphores

Scenario:
  Suppose you need to design a circuit that calculates pixel values in a frame buffer, and that each calculation takes 4 or 5 clock cycles. However, you need to calculate one pixel every clock cycle to meet a display timing constraint.
Possible Solution:
  Duplicate the calculation code 5 times, and have each block store its values in the proper place in the frame buffer.

Macros

Macros can be used to implement parameterizable code, or to provide code re-use.
Like a regular function without parameter types.
For the solution to our scenario, declaring a macro would look like:

    macro proc myCalculation(dataSource)
    {
        //receive data from source
        //Perform 3-5 clock cycles worth of calculations
    }

Channels

Handel-C provides a channel type to allow for synchronization or communication between parallel processes.
Declaration: chan <type> <channelName>
  Must be declared with global scope.
Data can then be sent over the channel, or received from it, but only in one direction.
Each channel operation will block if the other party is not ready.
Two parallel blocks of code:

    chan unsigned 8 dataPipe;

    /* sender */
    static unsigned 8 someData = 5;
    …
    dataPipe ! someData;
    …

    /* receiver */
    static unsigned 8 recvData;
    …
    dataPipe ? recvData;
    …

Prialt

Now suppose we have 5 of our ‘worker’ processes running in parallel. How do we use them to achieve our goal?
Each operation will complete in 3-5 cycles, so we don’t know which of the 5 will be free to perform the next pixel calculation.
If we send data down a channel to each of the 5 processes in turn, we might block on one of them while another is doing nothing: wasted clock cycles.
Prialt is the solution for this.

Prialt

Similar to a case statement that chooses the first channel able to receive data.
In other words, it gives a priority to each channel.

    prialt
    {
        case channel1 ! data:
            break;
        case channel2 ! data:
            break;
        default:
            break;
    }

If default is not used, then prialt will block on the last case statement if a prior one was not taken.
Be careful that processes aren’t starved.
  Wasted resources

Semaphores

Once a process has finished its computation, we need to update the frame buffer (FB), which is typically implemented in a RAM block for FPGA area efficiency.
Recall that a RAM block typically has only one write port, so we can’t have each process write to the frame buffer directly, because we can’t guarantee that simultaneous accesses will not happen.
One solution is to have each process send its result down a separate channel to another process that deals with FB access.
But this is a section on semaphores, so we’ll go with them instead.

Semaphores

Semaphores can be used to guard critical sections of code against parallel access.
More like a mutex from POSIX threads.
trysema() checks whether the critical section is free, taking the semaphore if it is; releasesema() frees it again.
e.g.

    sema fbGuard;
    …
    while(trysema(fbGuard)==0)
        delay; /* loop until semaphore is free */
    /* critical section of code, i.e. frame buffer access */
    releasesema(fbGuard); /* skipping this step could result in deadlock */
    …

Putting it all together…

    #define NUM_CHANNELS 5
    #define SCR_WIDTH 4
    #define SCR_HEIGHT 4

    set clock = external;

    typedef struct point
    {
        unsigned 2 x;
        unsigned 2 y;
    } point; //just as in ANSI C

    sema fbGuard;
    //you can even send structures over channels
    chan point dataChannels[NUM_CHANNELS];
    ram unsigned 8 frameBuffer[SCR_WIDTH*SCR_HEIGHT];

    macro proc increment(p)
    {
        if(p.x==SCR_WIDTH-1)
        {
            par
            {
                p.x=0;
                p.y++;
            }
        }
        else
            p.x++;
    }

    macro proc coordGen()
    {
        point pGen;
        pGen.x = 0;
        pGen.y = 0;
        while(1)
        {
            prialt
            {
                case dataChannels[0] ! pGen: increment(pGen); break;
                case dataChannels[1] ! pGen: increment(pGen); break;
                case dataChannels[2] ! pGen: increment(pGen); break;
                case dataChannels[3] ! pGen: increment(pGen); break;
                case dataChannels[4] ! pGen: increment(pGen); break;
                default: delay; break;
            }
        }
    }

    macro proc worker(channel)
    {
        point p;
        static unsigned 8 pixel = 0;
        //loop forever waiting for data to compute pixels with
        while(1)
        {
            channel ? p;
            if((p.x <- 1) == 0 && (p.y <- 1) == 0) //x, y are even
            {
                pixel = 2;
                delay;
                delay;
            }
            else if((p.x <- 1) == 1 && (p.y <- 1) == 1) //both odd
            {
                pixel = 1;
                delay;
            }
            else //x is even/odd and y is odd/even
                pixel = 3;

            //critical section
            while(trysema(fbGuard) == 0)
                delay;
            frameBuffer[p.y@p.x] = pixel;
            releasesema(fbGuard);
        }
    }

    void main()
    {
        par
        {
            //create the coord generator and the worker processes
            coordGen();
            worker(dataChannels[0]);
            worker(dataChannels[1]);
            worker(dataChannels[2]);
            worker(dataChannels[3]);
            worker(dataChannels[4]);
            //will never return because at
            //least 1 process has an infinite loop
        }
    }

Mapping Handel-C to Logic

Ultimately, the statements you write in Handel-C must be mapped to logic by the compiler.
The following slides show the mapping for some of the constructs discussed so far:
  assignment
  seq and par
  if
  while
  do…while
The following logic circuits are taken from the course notes from Celoxica’s DK training course.

Assignment

    a = b;

Sequential Statements

    seq
    {
        statement1;
        statement2;
    }

Parallel Statements

    par
    {
        statement1;
        statement2;
    }

If Statements

    if (Condition)
        statement2;

While Loops

    while (Condition)
    {
        statement2;
    }

    do
    {
        statement2;
    } while(Condition);

Automatic Retiming

Why Retime?

Many designs will require a multiplier, divider, or other large combinational logic circuit.
The propagation delay through deep logic can be quite long.
Even one path in the design with a long delay can cause the maximum clock rate to drop significantly, to the point where timing constraints cannot be met.
Retiming involves moving/adding flip-flops around the data path to reduce the depth of logic, and ultimately reduce the critical path delay.
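The link between logic depth and clock rate can be put into numbers with a rough sketch. The Python model below assumes the maximum clock frequency is set purely by the slowest register-to-register stage, and it ignores setup time, clock-to-output delay, and routing. The 2 ns adder delay and the function name fmax_mhz are hypothetical, chosen only for illustration, and are not figures from the course or any device.

```python
def fmax_mhz(stage_delays_ns):
    """Maximum clock frequency in MHz, limited by the slowest stage."""
    critical_path_ns = max(stage_delays_ns)
    return 1000.0 / critical_path_ns   # a period of 1 ns corresponds to 1000 MHz


ADDER_DELAY_NS = 2.0   # hypothetical delay of a single adder

# One register-to-register stage containing two adders back to back.
unpipelined = [2 * ADDER_DELAY_NS]
# The same logic split by a register: one adder per stage.
pipelined = [ADDER_DELAY_NS, ADDER_DELAY_NS]

print(fmax_mhz(unpipelined))  # 250.0
print(fmax_mhz(pipelined))    # 500.0
```

In this idealized model, splitting the path halves the critical path and doubles the achievable clock rate, at the cost of one extra cycle of latency. Real gains are smaller because of the register and routing overheads left out here.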
Simple Example [1]

    x = a+b+c+d;

The result is calculated through two adder stages.
However, we can pipeline the result by inserting registers at intermediate locations.
The adder stages are split with two registers.
  This reduces the propagation delay of each stage, allowing a higher clock frequency.
  The consequence is that the result is delayed by one cycle.
[1] Example adapted from Celoxica’s Handel-C and DK training course notes.

Programming for Retiming

Retiming is not a trivial task; it is extremely time-consuming to do by hand, especially for large designs.
The Handel-C design tools can perform retiming automatically if the code is written properly.
The compiler will add/remove/move flip-flops as necessary, but will not alter the timing of the design.
Therefore, to use retiming, the design must be pipelined, or have extra pipelining stages built in.
The compiler can then shift logic and flip-flops around without altering the timing of the design.

Programming Example

Example: x = a*b+c*d;

    unsigned 8 x[3]; //3 retiming stages

    //Output is the last of the retiming stages.
    interface bus_out() sumOut(unsigned 8 out = x[2]) with
        {data = {"P2","P3","P4","P5","P6","P7","P8","P9"}};
    interface bus_clock_in(unsigned 8 in) input() with
        {data = {"P10","P11","P12","P13","P14","P15","P16","P17"}};

    void main()
    {
        unsigned 8 data[4];
        while(1)
        {
            par
            {
                //get the input and shift the previous inputs
                data[0] = input.in;
                data[1] = data[0];
                data[2] = data[1];
                data[3] = data[2];
                //Coded like you would without retiming.
                x[0] = data[0]*data[1] + data[2]*data[3];
                //extra stages: the result is shifted through the retiming registers
                x[1] = x[0];
                x[2] = x[1];
            }
        }
    }

FIR Example

One of the exercises at the training course was to code a nine-tap FIR filter that was pipelined and retimed automatically.
Nine multiplications of data and coefficients, followed by summation of the nine products.
  Very deep logic
A Xilinx Spartan™ 3 chip was targeted.
The fmax results were recorded for various numbers of extra retiming stages.

[Chart: Fmax (MHz) vs. # of Retiming Stages]

[Chart: Flip Flop Usage Before and After Retiming (FF Before, FF After) vs. # of Retiming Stages]

Final Notes

Not enough time to cover everything Handel-C has to offer.
There are ways to create parameterizable code:
  pointers, macro expressions
  These allow the designer to easily vary the # of worker processes, or pipeline/retiming stages, for example.
More information is available at www.celoxica.com

Thank-You!

Questions?