Kiwi: Synthesis of FPGA Circuits
from Multi-Threaded C# Programs
Satnam Singh, Microsoft Research Cambridge, UK
David Greaves, Computer Lab, Cambridge University, UK
XD2000i FPGA in-socket accelerator for Intel FSB
XD2000F FPGA in-socket accelerator for AMD socket F
XD1000 FPGA co-processor module for socket 940
The Future is Heterogeneous
Example Speedup: DNA Sequence Matching
Why are regular computers not fast enough?
FPGAs are the Lego of Hardware
LUT4 (OR)
LUT4 (AND)
opportunity / challenge
scientific computing, data mining, search, image processing, financial analytics
The Accidental Semi-colon
Kiwi Thesis
• Parallel programs are a good representation for circuit designs. (?)
• Separated at birth?
Objectives
• A system for software engineers.
• Model synchronous digital circuits in C# etc.
– Software models offer greater productivity than models in VHDL or Verilog.
• Transform circuit models automatically into circuit implementations.
• Transform programs with dynamic memory allocation into their array equivalents.
• Exploit existing concurrent software verification tools.
Previous Work
• Starts with sequential C-style programs.
• Uses various heuristics to discover opportunities for parallelism, especially in nested loops.
• Good for certain idioms that can be recognized.
• However:
– many parallelization opportunities are not discovered
– lack of control
– no support for dynamic memory allocation
Kiwi
[Diagram: levels of design description. Gate-level; structural (VHDL/Verilog); imperative C via C-to-gates (example: jpeg.c); parallel imperative (Kiwi), drawn as threads 1, 2 and 3 of sequential statements.]
Key Points
• We focus on compiling parallel C# programs into parallel hardware.
• Important because future processors will be heterogeneous and we need to find ways to model and program multi-core CPUs, GPUs, FPGAs etc.
• Previous work has had some success with compiling sequential programs into hardware.
• Our hypothesis: it’s much better to try to produce parallel hardware from parallel programs.
• Our approach involves compiling .NET concurrency constructs into gates.
Self Inflicted Constraints
• Use a standard programming language with no special extensions (C#).
• Use standard mechanism for concurrency (System.Threading).
• Use concurrency to model circuit structure.
I2C Bus Control in VHDL
Ports and Clocks
public static class I2C
{
    [OutputBitPort("scl")]
    static bool scl;
    [InputBitPort("sda_in")]
    static bool sda_in;
    [OutputBitPort("sda_out")]
    static bool sda_out;
    [OutputBitPort("rw")]
    static bool rw;
(Callout: circuit ports are identified by a custom attribute.)
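How a tool might find the ports marked this way can be sketched with ordinary .NET reflection. The attribute class below is a hypothetical stand-in declared only for this example (Kiwi's real attribute definitions are not shown on the slides), and `I2CPorts` and `PortScanner` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical stand-in for a port attribute; Kiwi's own definition may differ.
[AttributeUsage(AttributeTargets.Field)]
public sealed class OutputBitPortAttribute : Attribute
{
    public string Name { get; }
    public OutputBitPortAttribute(string name) { Name = name; }
}

public static class I2CPorts
{
    [OutputBitPort("scl")] static bool scl;
    [OutputBitPort("rw")]  static bool rw;
    static bool scratch;   // no attribute, so this is not treated as a port
}

public static class PortScanner
{
    // Collect the declared port name of every static field carrying the attribute.
    public static List<string> OutputPorts(Type t)
    {
        var names = new List<string>();
        BindingFlags flags = BindingFlags.Static | BindingFlags.Public | BindingFlags.NonPublic;
        foreach (FieldInfo f in t.GetFields(flags))
        {
            var attr = f.GetCustomAttribute<OutputBitPortAttribute>();
            if (attr != null) names.Add(attr.Name);
        }
        names.Sort(StringComparer.Ordinal);   // fixed order for printing
        return names;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", OutputPorts(typeof(I2CPorts))));   // prints "rw, scl"
    }
}
```

Because the port markers are plain attributes, the same class still compiles and runs as ordinary C#; only a tool that looks for the attributes treats the fields specially.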
I2C Control
private static void SendDeviceID()
{
    Console.WriteLine("Sending device ID");
    // Send out 7-bit device ID 0x76, most significant bit first
    int deviceID = 0x76;
    for (int i = 7; i > 0; i--)
    {
        // Drive the current top bit (bit 6) of the device ID onto SDA
        scl = false; sda_out = (deviceID & 64) != 0; Kiwi.Pause();
        scl = true; Kiwi.Pause(); // Pulse SCL
        // Shift the next bit into position
        scl = false; deviceID = deviceID << 1; Kiwi.Pause();
    }
}
Generated Verilog
module i2c_demo(clk, reset, I2CTest_I2C_scl, I2CTest_I2C_sda);
input clk;
input reset;
reg i2c_demo_CS$4$0000;
reg I2CTest_I2C_SendDeviceID_CS$4$0000;
reg I2CTest_I2C_SendDeviceID_second_CS$4$0000;
reg I2CTest_I2C_ProcessACK_ack1;
reg I2CTest_I2C_ProcessACK_fourth_ack1;
reg I2CTest_I2C_ProcessACK_second_ack1;
reg I2CTest_I2C_ProcessACK_third_ack1;
integer I2CTest_I2C_SendDeviceID_deviceID;
integer I2CTest_I2C_SendDeviceID_second_deviceID;
integer I2CTest_I2C_SendDeviceID_i;
integer i2c_demo_i;
integer I2CTest_I2C_SendDeviceID_second_i;
integer i2c_demo_inBit;
integer i2c_demo_registerID;
output I2CTest_I2C_scl;
output I2CTest_I2C_sda;
System Composition
• We need a way to separately develop components and then compose them together.
• Don’t invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).
Writing to a Channel
public class Channel<T>
{
    T datum;
    bool empty = true;
    public void Write(T v)
    {
        lock (this)
        {
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }
Reading from a Channel
    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
}
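Put together, the two halves of the channel give a class that runs as ordinary .NET code. The sketch below reassembles `Write` and `Read` from the slides and drives them with a small pipeline; `ChannelDemo` and `DoubleThroughChannels` are names invented for this example, and the pipeline only exercises the software behaviour, not the synthesized circuit.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// One-place channel, as on the slides: a monitor guards a single datum.
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty) Monitor.Wait(this);   // block until the slot is free
            datum = v;
            empty = false;
            Monitor.PulseAll(this);              // wake any waiting reader
        }
    }

    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty) Monitor.Wait(this);    // block until a value arrives
            empty = true;
            r = datum;
            Monitor.PulseAll(this);              // wake any waiting writer
        }
        return r;
    }
}

public static class ChannelDemo
{
    // Producer -> chan1 -> doubler -> chan2 -> caller (illustrative pipeline).
    public static List<int> DoubleThroughChannels(int count)
    {
        var chan1 = new Channel<int>();
        var chan2 = new Channel<int>();
        var producer = new Thread(() => { for (int i = 0; i < count; i++) chan1.Write(i); });
        var doubler = new Thread(() => { for (int i = 0; i < count; i++) chan2.Write(2 * chan1.Read()); });
        producer.Start();
        doubler.Start();

        var results = new List<int>();
        for (int i = 0; i < count; i++) results.Add(chan2.Read());
        producer.Join();
        doubler.Join();
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", DoubleThroughChannels(10)));   // 0 2 4 ... 18
    }
}
```

With a single writer and a single reader per channel, values arrive in the order they were sent, so the result is deterministic even though thread scheduling is not.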
Our Implementation
• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
– The .NET stack is analyzed and removed.
– The control structure of the code is analyzed and broken into basic blocks which are then composed.
– The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.
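One way to picture the "stack is analyzed and removed" step is symbolic execution of a straight-line block: pushes and pops are replayed on a stack of expression strings, so that only assignments to locals remain. This is a minimal sketch under that assumption, not the Kiwi implementation; it uses a simplified space-separated mnemonic form (`ldarg 0` rather than `ldarg.0`), supports only a handful of instructions, and ignores branches and reads of a local that is overwritten later in the same block.

```csharp
using System;
using System.Collections.Generic;

public static class StackRemoval
{
    // Replay one basic block symbolically and return the stack-free assignments.
    public static List<string> Eliminate(IEnumerable<string> block)
    {
        var stack = new Stack<string>();
        var assignments = new List<string>();
        foreach (string instr in block)
        {
            string[] parts = instr.Split(' ');
            switch (parts[0])
            {
                case "ldarg": stack.Push("arg" + parts[1]); break;
                case "ldloc": stack.Push("loc" + parts[1]); break;
                case "ldc":   stack.Push(parts[1]); break;
                case "add":
                case "mul":
                {
                    string rhs = stack.Pop();            // top of stack is the right operand
                    string lhs = stack.Pop();
                    string op = parts[0] == "add" ? "+" : "*";
                    stack.Push("(" + lhs + " " + op + " " + rhs + ")");
                    break;
                }
                case "stloc":
                    assignments.Add("loc" + parts[1] + " = " + stack.Pop());
                    break;
                default:
                    throw new ArgumentException("unsupported instruction: " + instr);
            }
        }
        return assignments;
    }

    public static void Main()
    {
        // Block for: result = a * 2 + b
        var block = new[] { "ldarg 0", "ldc 2", "mul", "ldarg 1", "add", "stloc 0" };
        foreach (string line in Eliminate(block))
            Console.WriteLine(line);                     // loc0 = ((arg0 * 2) + arg1)
    }
}
```

Each resulting assignment is a pure expression over arguments and locals, which is the form that maps naturally onto registers and combinational logic.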
[Diagram: concurrency layers. User applications and domain specific languages sit above rendezvous, join patterns, transactional memory and data parallelism, which in turn rest on systems level concurrency constructs: threads, events, monitors, condition variables.]
Higher Level Concurrency Constructs
• By providing hardware semantics for the system level concurrency abstractions we hope to then automatically deal with other higher level concurrency constructs:
– Join patterns (C-Omega, CCR, .NET Joins Library)
– Rendezvous
– Data parallel operations
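[Diagram: Kiwi tool flow. The Kiwi library (Kiwi.cs) and the circuit model (JPEG.cs) are built in Visual Studio, which provides multi-thread simulation, debugging and verification. Kiwi Synthesis turns the same model into a circuit implementation (JPEG.v).]
[Diagram: a parallel C# program is split into Thread 1, Thread 2 and Thread 3; each thread passes through a C-to-gates step to give a circuit, and the circuits are combined into Verilog for the system.]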
C# source:
public static int max2(int a, int b)
{
    int result;
    if (a > b)
        result = a;
    else
        result = b;
    return result;
}

Compiled .NET IL:
.method public hidebysig static int32 max2(int32 a, int32 b) cil managed
{
    // Code size 12 (0xc)
    .maxstack 2
    .locals init ([0] int32 result)
    IL_0000: ldarg.0
    IL_0001: ldarg.1
    IL_0002: ble.s IL_0008
    IL_0004: ldarg.0
    IL_0005: stloc.0
    IL_0006: br.s IL_000a
    IL_0008: ldarg.1
    IL_0009: stloc.0
    IL_000a: ldloc.0
    IL_000b: ret
}

[Animation: step-by-step evaluation of max2(3, 7), showing the contents of the stack and of local memory.]
System.Threading
• We have decided to target hardware synthesis for a subset of the concurrency features in the .NET library System.Threading
– Monitors (synchronization)
– Thread creation (circuit structure)
Kiwi Concurrency Library
• A conventional concurrency library Kiwi is exposed to the user which has two implementations:
– A software implementation which is defined purely in terms of the supported .NET concurrency mechanisms (events, monitors, threads).
– A corresponding hardware semantics which is used to drive the .NET IL to Verilog flow to generate circuits.
• A Kiwi program should always be a sensible concurrent program but it may also be a sensible parallel circuit.
class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();
        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
(Callouts on the waveform: two clock ticks per result; handshaking protocol.)
Filter Example
[Diagram labels: thread; one-place channel]
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
Transposed Filter
static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
Inter-thread Communication and Synchronization
// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}
// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate() { Tap(j, weights[j],
                                                   Xchannels[j],
                                                   Ychannels[j],
                                                   Ychannels[j+1]); });
    tapThread.Start();
}
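On the quiz: in a C# `for` loop there is a single loop variable shared by all iterations, so a delegate that captures `i` sees whatever value `i` holds when the delegate eventually runs, not the value at the time it was created. Copying into a local declared inside the loop body gives each delegate its own variable. The sketch below shows the difference without threads; `CaptureDemo` and its method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class CaptureDemo
{
    // Every delegate captures the same variable i, which is n once the loop ends.
    public static List<int> CaptureLoopVariable(int n)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < n; i++)
            actions.Add(() => i);
        return Run(actions);
    }

    // Each iteration declares a fresh j, so each delegate keeps its own value.
    public static List<int> CaptureLocalCopy(int n)
    {
        var actions = new List<Func<int>>();
        for (int i = 0; i < n; i++)
        {
            int j = i;
            actions.Add(() => j);
        }
        return Run(actions);
    }

    static List<int> Run(List<Func<int>> actions)
    {
        var results = new List<int>();
        foreach (Func<int> a in actions) results.Add(a());
        return results;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", CaptureLoopVariable(3)));   // 3 3 3
        Console.WriteLine(string.Join(" ", CaptureLocalCopy(3)));      // 0 1 2
    }
}
```

In the filter wiring, a tap thread started without the local copy could read `i` after the loop has already advanced it, and so connect itself to the wrong weight and channels.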
Performance
• Software
– Dual-core Pentium 2.67GHz, 3GB
– 6,562,500 pixels per second
• BEE3 FPGA Performance
– Xilinx XC5VLX110T FPGA, 100MHz
– DDR2 memory, 2 DIMMs per channel, 288 bits per read
– 4 cycles per pixel
– 429,000,000 pixels per second
• Hand optimized core
– Xilinx CoreGenerator: 400MHz
Current Limitations
• Only integer arithmetic and string handling.
• Floating point could be added easily.
• Generation of statically allocated code:
– Arrays must be dimensioned at compile time
– Number of objects on the heap is determined at compile time
– Recursive function calling must bottom out at compile time (so depth cannot be run-time dependent)
Next Steps
• Consider a series of concurrency constructs and their meaning in hardware:
– Transactional memory
– Rendezvous
– Join patterns / chords
– Data parallel descriptions
• Optimize away handshaking protocol.
• Allow non-trivial dynamic memory allocation.
• Solve impedance mismatch with back-end tools to improve performance.
Smith-Waterman Recurrence
SW Diagonal Dependencies
Can perform all operations on an anti-diagonal in parallel.
Can pass query and database data along channels between cells.
However, each operation needs a scoring matrix read.
for (int qpos = 0; qpos < height; qpos++)
{
    short score = (dbval < 0 || seq[qpos] < 0) ?
        (short)0 : pam250[dbval, seq[qpos]];
    int left = prev[qpos];
    int above = (qpos == 0) ? aboveScore : here[qpos-1];
    int diag = (qpos == 0) ? prevAbove : prev[qpos-1];
    int nv = Math.Max(0, Math.Max(left - 10,
                      Math.Max(above - 10, diag + score)));
    if (nv > (int)max) max = (short)nv;
    here[qpos] = (short)nv;
    if (qpos == height-1) below_score.Write((short)nv);
}
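The cell update in the loop above can be tried out in plain software. The sketch below keeps the same recurrence and the gap penalty of 10, but makes two assumptions that are not on the slides: a simple match/mismatch score replaces the `pam250` lookup, and a single unit scans the whole database string with zero boundary values instead of exchanging scores with neighbouring units over channels.

```csharp
using System;

public static class SmithWaterman
{
    public const int GapPenalty = 10;   // the constant 10 used in the slide code

    // One cell of the recurrence: max(0, left - gap, above - gap, diag + score).
    public static int Cell(int left, int above, int diag, int score)
    {
        return Math.Max(0, Math.Max(left - GapPenalty,
                           Math.Max(above - GapPenalty, diag + score)));
    }

    // Best local alignment score, processing one database symbol (column) at a time.
    public static int BestLocalScore(string query, string database, int match, int mismatch)
    {
        int height = query.Length;
        int[] prev = new int[height];   // scores for the previous database symbol
        int[] here = new int[height];   // scores for the current database symbol
        int best = 0;
        for (int dpos = 0; dpos < database.Length; dpos++)
        {
            for (int qpos = 0; qpos < height; qpos++)
            {
                int score = query[qpos] == database[dpos] ? match : mismatch;
                int left = prev[qpos];
                int above = qpos == 0 ? 0 : here[qpos - 1];
                int diag = qpos == 0 ? 0 : prev[qpos - 1];
                int nv = Cell(left, above, diag, score);
                if (nv > best) best = nv;
                here[qpos] = nv;
            }
            int[] tmp = prev; prev = here; here = tmp;   // current column becomes previous
        }
        return best;
    }

    public static void Main()
    {
        Console.WriteLine(BestLocalScore("ACGT", "ACGT", 5, -4));   // 20: four matches on the diagonal
    }
}
```

Each cell depends only on its left, above and diagonal neighbours, which is why all cells on one anti-diagonal can be computed at the same time in hardware.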
[Diagram: a data parallel description of FFT-style operations in C#, with targets FPGA hardware (VHDL), GPU code (CUDA) and bytecode on a multi-core SMP.]
Summary
• Circuits can be modelled as regular parallel programs.
• Automatically transform parallel circuit models into digital circuit implementations.
• Exploit shared memory and message passing idioms for co-design.
• We don’t need to invent a new language:
– Exploit rich existing knowledge of concurrent programming.
• Apply recent innovations in shape analysis and region types to allow us to compile programs with lists and trees.
• Is there an application for this work at Sanger/EBI?
• More information about Kiwi synthesis at http://research.microsoft.com/~satnams
Synplify Pro FPGA Implementation:
First, preliminary result:
Device: Virtex 5x110T-2:
Static timing: 20 logic layers, Fmax=78MHz (12.7 ns).
Utilisation = 3120 Virtex-5 slices, 17% of 17500.
Clock cycles per streaming base: 10.
Future parameter exploration:
QSL search string query limit increase = 256 or 512.
N search parallelism (number of units) = 32 or 64.
Clocks per cell : reduce to 4 or 2 (channel overheads then dominate).
Extend Kiwi channels between the four chips on the BEE3 board.