Kiwi: Synthesis of FPGA Circuits from Multi-Threaded C# Programs
Satnam Singh, Microsoft Research Cambridge, UK
David Greaves, Computer Lab, Cambridge University, UK

[Figures: XD2000i FPGA in-socket accelerator for Intel FSB; XD2000F FPGA in-socket accelerator for AMD socket F; XD1000 FPGA co-processor module for socket 940]

The Future is Heterogeneous

Example Speedup: DNA Sequence Matching

Why are regular computers not fast enough?

FPGAs are the Lego of Hardware
[Figure labels: LUT4 (OR), LUT4 (AND)]

[Figure labels: opportunity, challenge; scientific computing, data mining, search, image processing, financial analytics]

The Accidental Semi-colon

Kiwi Thesis
• Parallel programs are a good representation for circuit designs. (?)
• Separated at birth?

Objectives
• A system for software engineers.
• Model synchronous digital circuits in C# etc.
  – Software models offer greater productivity than models in VHDL or Verilog.
• Transform circuit models automatically into circuit implementations.
• Transform programs with dynamic memory allocation into their array equivalents.
• Exploit existing concurrent software verification tools.

Previous Work
• Starts with sequential C-style programs.
• Uses various heuristics to discover opportunities for parallelism, especially in nested loops.
• Good for certain idioms that can be recognized.
• However:
  – many parallelization opportunities are not discovered
  – lack of control
  – no support for dynamic memory allocation

[Figure labels: gate-level, structural, imperative (C), parallel imperative; VHDL/Verilog, C-to-gates, Kiwi; thread 1, thread 2, thread 3; jpeg.c]

Key Points
• We focus on compiling parallel C# programs into parallel hardware.
• Important because future processors will be heterogeneous and we need to find ways to model and program multi-core CPUs, GPUs, FPGAs etc.
• Previous work has had some success with compiling sequential programs into hardware.
• Our hypothesis: it’s much better to try to produce parallel hardware from parallel programs.
• Our approach involves compiling .NET concurrency constructs into gates.

Self Inflicted Constraints
• Use a standard programming language with no special extensions (C#).
• Use a standard mechanism for concurrency (System.Threading).
• Use concurrency to model circuit structure.

I2C Bus Control in VHDL

Ports and Clocks
(circuit ports are identified by custom attributes)

    public static class I2C
    {
        [OutputBitPort("scl")]
        static bool scl;
        [InputBitPort("sda_in")]
        static bool sda_in;
        [OutputBitPort("sda_out")]
        static bool sda_out;
        [OutputBitPort("rw")]
        static bool rw;

I2C Control

    private static void SendDeviceID()
    {
        Console.WriteLine("Sending device ID");
        // Send out 7-bit device ID 0x76
        int deviceID = 0x76;
        for (int i = 7; i > 0; i--)
        {
            // Set the i-th bit of the device ID
            scl = false;
            sda_out = (deviceID & 64) != 0;
            Kiwi.Pause();
            // Pulse SCL
            scl = true;
            Kiwi.Pause();
            scl = false;
            deviceID = deviceID << 1;
            Kiwi.Pause();
        }
    }

Generated Verilog

    module i2c_demo(clk, reset, I2CTest_I2C_scl, I2CTest_I2C_sda);
      input clk;
      input reset;
      reg i2c_demo_CS$4$0000;
      reg I2CTest_I2C_SendDeviceID_CS$4$0000;
      reg I2CTest_I2C_SendDeviceID_second_CS$4$0000;
      reg I2CTest_I2C_ProcessACK_ack1;
      reg I2CTest_I2C_ProcessACK_fourth_ack1;
      reg I2CTest_I2C_ProcessACK_second_ack1;
      reg I2CTest_I2C_ProcessACK_third_ack1;
      integer I2CTest_I2C_SendDeviceID_deviceID;
      integer I2CTest_I2C_SendDeviceID_second_deviceID;
      integer I2CTest_I2C_SendDeviceID_i;
      integer i2c_demo_i;
      integer I2CTest_I2C_SendDeviceID_second_i;
      integer i2c_demo_inBit;
      integer i2c_demo_registerID;
      output I2CTest_I2C_scl;
      output I2CTest_I2C_sda;

System Composition
• We need a way to separately develop components and then compose them together.
• Don’t invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).
Writing to a Channel

    public class Channel<T>
    {
        T datum;
        bool empty = true;

        public void Write(T v)
        {
            lock (this)
            {
                // Block until the single slot is free, then fill it
                while (!empty)
                    Monitor.Wait(this);
                datum = v;
                empty = false;
                Monitor.PulseAll(this);
            }
        }

Reading from a Channel

        public T Read()
        {
            T r;
            lock (this)
            {
                // Block until the single slot holds a value, then take it
                while (empty)
                    Monitor.Wait(this);
                empty = true;
                r = datum;
                Monitor.PulseAll(this);
            }
            return r;
        }

Our Implementation
• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
  – The .NET stack is analyzed and removed.
  – The control structure of the code is analyzed and broken into basic blocks which are then composed.
  – The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.

[Figure labels: user applications; domain specific languages; rendezvous, join patterns, transactional memory, data parallelism; systems level concurrency constructs: threads, events, monitors, condition variables]

Higher Level Concurrency Constructs
• By providing hardware semantics for the system level concurrency abstractions we hope to then automatically deal with other higher level concurrency constructs:
  – Join patterns (C-Omega, CCR, .NET Joins Library)
  – Rendezvous
  – Data parallel operations

[Figure labels: Kiwi Library (Kiwi.cs); circuit model (JPEG.cs); Visual Studio: multi-thread simulation, debugging, verification; Kiwi Synthesis: circuit implementation (JPEG.v)]

[Figure labels: parallel program (C#) with Thread 1, Thread 2, Thread 3; each thread passes through "C to gates" to give a circuit; Verilog for system]

C# source:

    public static int max2(int a, int b)
    {
        int result;
        if (a > b)
            result = a;
        else
            result = b;
        return result;
    }

.NET IL:

    .method public hidebysig static int32 max2(int32 a, int32 b) cil managed
    {
      // Code size 12 (0xc)
      .maxstack 2
      .locals init ([0] int32 result)
      IL_0000: ldarg.0
      IL_0001: ldarg.1
      IL_0002: ble.s IL_0008
      IL_0004: ldarg.0
      IL_0005: stloc.0
      IL_0006: br.s IL_000a
      IL_0008: ldarg.1
      IL_0009: stloc.0
      IL_000a: ldloc.0
      IL_000b: ret
    }

[Figure: evaluation of max2(3, 7), showing the contents of the stack and of local memory]

System.Threading
• We have decided to target hardware synthesis for a subset of the concurrency features in the .NET library System.Threading:
  – Monitors (synchronization)
  – Thread creation (circuit structure)

Kiwi Concurrency Library
• A conventional concurrency library, Kiwi, is exposed to the user. It has two implementations:
  – A software implementation which is defined purely in terms of the supported .NET concurrency mechanisms (events, monitors, threads).
  – A corresponding hardware semantics which is used to drive the .NET IL to Verilog flow to generate circuits.
• A Kiwi program should always be a sensible concurrent program but it may also be a sensible parallel circuit.
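The point that a Kiwi program is first of all an ordinary concurrent program can be tried out without any synthesis tools. The following is a minimal sketch, not the Kiwi library itself: OnePlaceChannel and SoftwareModelDemo are hypothetical names introduced here, the channel follows the monitor-based Write/Read pattern from the channel slides but locks a private object instead of `this`, and the clocking call Kiwi.Pause() is left out because only the software behaviour is modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for a one-place channel, built only on monitors.
public class OnePlaceChannel<T>
{
    private T datum;
    private bool empty = true;
    private readonly object gate = new object();

    public void Write(T v)
    {
        lock (gate)
        {
            // Wait until the single slot is free, then fill it.
            while (!empty) Monitor.Wait(gate);
            datum = v;
            empty = false;
            Monitor.PulseAll(gate);
        }
    }

    public T Read()
    {
        lock (gate)
        {
            // Wait until the single slot holds a value, then take it.
            while (empty) Monitor.Wait(gate);
            T r = datum;
            empty = true;
            Monitor.PulseAll(gate);
            return r;
        }
    }
}

public static class SoftwareModelDemo
{
    // Producer -> chan1 -> doubling stage -> chan2 -> caller.
    public static List<int> Run(int count)
    {
        var chan1 = new OnePlaceChannel<int>();
        var chan2 = new OnePlaceChannel<int>();

        var producer = new Thread(() =>
        {
            for (int i = 0; i < count; i++) chan1.Write(i);
        });
        var doubler = new Thread(() =>
        {
            for (int i = 0; i < count; i++) chan2.Write(2 * chan1.Read());
        });
        producer.Start();
        doubler.Start();

        // The calling thread drains chan2 in order.
        var results = new List<int>();
        for (int i = 0; i < count; i++) results.Add(chan2.Read());

        producer.Join();
        doubler.Join();
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Run(5))); // prints "0 2 4 6 8"
    }
}
```

Because each channel has a single writer and a single reader, values arrive in the order they were sent, so the software run is deterministic even though three threads are involved.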
    class FIFO2
    {
        [Kiwi.OutputWordPort("result", 31, 0)]
        public static int result;

        static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
        static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

        public static void Consumer()
        {
            while (true)
            {
                int i = chan1.Read();
                chan2.Write(2 * i);
                Kiwi.Pause();
            }
        }

        public static void Producer()
        {
            for (int i = 0; i < 10; i++)
            {
                chan1.Write(i);
                Kiwi.Pause();
            }
        }

        public static void Behaviour()
        {
            Thread ProducerThread = new Thread(new ThreadStart(Producer));
            ProducerThread.Start();
            Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
            ConsumerThread.Start();
        }
    }

[Figure annotations: two clock ticks per result; handshaking protocol]

Filter Example
[Figure labels: thread, one-place channel]

    public static int[] SequentialFIRFunction(int[] weights, int[] input)
    {
        int[] window = new int[size];
        int[] result = new int[input.Length];
        // Clear the window of x values to all zeros.
        for (int w = 0; w < size; w++)
            window[w] = 0;
        // For each sample...
        for (int i = 0; i < input.Length; i++)
        {
            // Shift in the new x value
            for (int j = size - 1; j > 0; j--)
                window[j] = window[j - 1];
            window[0] = input[i];
            // Compute the result value
            int sum = 0;
            for (int z = 0; z < size; z++)
                sum += weights[z] * window[z];
            result[i] = sum;
        }
        return result;
    }

Transposed Filter

    static void Tap(int i, byte w,
                    Kiwi.Channel<byte> xIn,
                    Kiwi.Channel<int> yIn,
                    Kiwi.Channel<int> yout)
    {
        byte x;
        int y;
        while (true)
        {
            y = yIn.Read();
            x = xIn.Read();
            yout.Write(x * w + y);
        }
    }

Inter-thread Communication and Synchronization

    // Create the channels to link together the taps
    for (int c = 0; c < size; c++)
    {
        Xchannels[c] = new Kiwi.Channel<byte>();
        Ychannels[c] = new Kiwi.Channel<int>();
        Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
    }

    // Connect up the taps for a transposed filter
    for (int i = 0; i < size; i++)
    {
        int j = i; // Quiz: why do we need the local j?
        Thread tapThread = new Thread(delegate()
        {
            Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]);
        });
        tapThread.Start();
    }

Performance
• Software
  – Dual-core Pentium 2.67GHz, 3GB
  – 6,562,500 pixels per second
• BEE3 FPGA Performance
  – Xilinx XC5VLX110T FPGA, 100MHz
  – DDR2 memory, 2 DIMMs per channel, 288 bits per read
  – 4 cycles per pixel
  – 429,000,000 pixels per second
• Hand optimized core
  – Xilinx CoreGenerator: 400MHz

Current Limitations
• Only integer arithmetic and string handling.
• Floating point could be added easily.
• Generation of statically allocated code:
  – Arrays must be dimensioned at compile time.
  – Number of objects on the heap is determined at compile time.
  – Recursive function calling must bottom out at compile time (so depth cannot be run-time dependent).

Next Steps
• Consider a series of concurrency constructs and their meaning in hardware:
  – Transactional memory
  – Rendezvous
  – Join patterns / chords
  – Data parallel descriptions
• Optimize away handshaking protocol.
• Allow non-trivial dynamic memory allocation.
• Solve impedance mismatch with back-end tools to improve performance.

Smith-Waterman Recurrence

SW Diagonal Dependencies
• Can perform all operations on an anti-diagonal in parallel.
• Can pass query and database data along channels between cells.
• However, each operation needs a scoring matrix read.

    for (int qpos = 0; qpos < height; qpos++)
    {
        short score = (dbval < 0 || seq[qpos] < 0) ? (short)0 : pam250[dbval, seq[qpos]];
        int left = prev[qpos];
        int above = (qpos == 0) ? aboveScore : here[qpos-1];
        int diag = (qpos == 0) ? prevAbove : prev[qpos-1];
        int nv = Math.Max(0, Math.Max(left - 10, Math.Max(above - 10, diag + score)));
        if (nv > (int)max) max = (short)nv;
        here[qpos] = (short)nv;
        if (qpos == height-1) below_score.Write((short)nv);
    }

[Figure labels: data parallel description of FFT-style operations in C#; FPGA hardware (VHDL); GPU code (CUDA); multi-core SMP (bytecode)]

Summary
• Circuits can be modelled as regular parallel programs.
• Automatically transform parallel circuit models into digital circuit implementations.
• Exploit shared memory and message passing idioms for codesign.
• We don’t need to invent a new language:
  – Exploit rich existing knowledge of concurrent programming.
• Apply recent innovations in shape analysis and region types to allow us to compile programs with lists and trees.
• Is there an application for this work at Sanger/EBI?
• More information about Kiwi synthesis at http://research.microsoft.com/~satnams

Synplify Pro FPGA Implementation
First, preliminary result:
• Device: Virtex-5 LX110T-2.
• Static timing: 20 logic layers, Fmax = 78MHz (12.7 ns).
• Utilisation: 3120 Virtex-5 slices, 17% of 17500.
• Clock cycles per streaming base: 10.

Future parameter exploration:
• QSL, search string query limit: increase to 256 or 512.
• N, search parallelism (number of units): 32 or 64.
• Clocks per cell: reduce to 4 or 2 (channel overheads then dominate).
• Extend Kiwi channels between the four chips on the BEE3 board.
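For reference, the cell update in the Smith-Waterman loop shown earlier can be written as a recurrence. This is a reading of that code rather than a formula taken from the slides. The symbols are assumptions introduced here: H_{i,j} is the score for database position i and query position j (qpos), d_i is dbval, q_j is seq[qpos], and s is the pam250 lookup, taken as 0 when either symbol is negative.

```latex
\[
H_{i,j} \;=\; \max\bigl(\,0,\;\; H_{i-1,j} - 10,\;\; H_{i,j-1} - 10,\;\; H_{i-1,j-1} + s(d_i, q_j)\,\bigr)
\]
% Correspondence with the code:
%   H_{i-1,j}   = left  = prev[qpos]
%   H_{i,j-1}   = above = here[qpos-1]   (aboveScore when qpos = 0)
%   H_{i-1,j-1} = diag  = prev[qpos-1]   (prevAbove when qpos = 0)
%   10          = the linear gap penalty used in the code
```

Each H_{i,j} depends only on its left, upper and upper-left neighbours, which is why all cells on one anti-diagonal can be computed in parallel.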