FASTCUDA
Project No: 286770

D2.4. Final integrated version of CUDA to SystemC translator and user's manual

31st August 2013
Abstract:
This deliverable describes the functionality and operation of the tool that translates a CUDA kernel into a SystemC model that can then be synthesized using any FPGA synthesis flow that accepts SystemC (e.g. Xilinx Vivado HLS, or Cadence CtoSilicon followed by Xilinx ISE). The tool also allows the designer to make some implementation trade-offs, such as whether or not to unroll some loops of the CUDA kernel which do not involve access to global or shared memory, and can thus be unrolled to increase performance at the expense of hardware cost. The graphical user interface that allows the user to perform the translation was designed to be friendly and easy to use, even for users who are not intimately familiar with hardware design.
Document Manager: Luciano Lavagno (Professor), POLITO
Document Id N°:   Final integrated version of CUDA to SystemC translator and user’s manual
Version:          V0.1
Filename:         FASTCUDA-D2.4_POLITO_V0.1-28082013.docx
Date:             28/08/13
Disclaimer
This document contains material, which is the copyright of certain FASTCUDA contractors, and may not be
reproduced or copied without permission. All FASTCUDA consortium partners have agreed to the full
publication of this document. The commercial use of any information contained in this document may require
a license from the proprietor of that information.
The FASTCUDA Consortium consists of the following companies:
Participant no.    Participant organisation name                                              Short name   Country
P1 (Coordinator)   Ingenieria de Sistemas Intensivos en Software Ltd                          ISIS         Spain
P2                 Politecnico di Torino                                                      POLITO       Italy
P3                 Universidad Politécnica de Madrid                                          UPM          Spain
P4                 Telecommunication Systems Institute (TSI, Technical University of Crete)   TSI          Greece
P5                 Ardoran                                                                    ARD          Estonia
P6                 FSResult GmbH                                                              FSR          Germany
The information in this document is provided “as is” and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

An up-to-date version of this document can be found on FASTCUDA's website (http://fastcuda.eu/techreports.html).
1. INTRODUCTION
In this deliverable we describe in detail how a CUDA kernel can be translated into a SystemC
implementation that follows the interfacing mechanism with the host processor and shared
memory defined in D2.2 and D5.1.
The graphical user interface that guides the user through this process requires minimal intervention.
The only step in which a human decision may be needed is whether or not to unroll some loops.
By default, no loop is unrolled, and hence in the final hardware implementation each iteration
requires at least one clock cycle. It may however be advantageous, in order to speed up the
hardware implementation of the kernel, to unroll some loops that involve only internal
computations, in order to reduce control overhead and offer more parallelism to the hardware
scheduler. This decision cannot be automated, because it involves trade-offs that only a human
can make. The GUI makes it easy by allowing the designer to refer directly to loops in the
original code.
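As a purely illustrative sketch (the kernel below is hypothetical and not one of the FASTCUDA test cases), the inner loop of the following CUDA kernel performs only register-level computation, so it is a good unrolling candidate; loops whose bodies read or write global or shared memory at every iteration usually benefit much less:

// Hypothetical CUDA kernel, for illustration only.
__global__ void poly_eval(float *y, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi  = x[i];      // one global-memory read per thread
        float acc = 0.0f;
        // Purely internal computation: no global/shared memory access in
        // the loop body, so unrolling it only costs extra logic/DSP units.
        for (int k = 0; k < 8; ++k)
            acc = acc * xi + (float)(k + 1);
        y[i] = acc;            // one global-memory write per thread
    }
}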
The reader is referred to D2.1 for a definition of the supported CUDA subset. The restriction to
only one CUDA kernel per file, as described in earlier internal versions of this document, has
been lifted, as mentioned in D2.3.
Since D2.3, a major change has been introduced in the workflow, namely the support for the
execution of Vivado HLS (the Xilinx high-level synthesis tool) directly from the GUI.
Moreover, several bugs have been fixed, and the flow has been extensively tested on a number
of CUDA test cases.
2. TOOL INSTALLATION
The final version of the tools can be found in the following SVN repository:
svn://wormtongue.polito.it/svn/fastcuda
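For example, the sources can be checked out into a local directory (here called fastcuda, an arbitrary name) with:
$ svn checkout svn://wormtongue.polito.it/svn/fastcuda fastcuda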
Please contact fastcuda-synthesis@gforge.inetsis.es in order to obtain support for installation or
usage, including bug reports.
The FASTCUDA graphical user interface (GUI) is developed with Qt, a cross-platform application
framework. In order to compile the application you should first install the Qt widget
toolkit library, e.g. by executing:
$ sudo apt-get install libqt4-core libqt4-dev libqt4-gui qt4-devtools
or whatever command is required by your Linux distribution to install these packages.
Moreover, the GUI relies on a modified version of the MCUDA translator from CUDA to C,
originally from http://impact.crhc.illinois.edu/mcuda.aspx (please see the MCUDA web site for
details on the MCUDA license). MCUDA generates:
• SystemC code for the body of each kernel to be implemented in hardware, and
• the information that the FASTCUDA GUI needs in order to synthesize the SC_MODULE interface in SystemC.
Note that MCUDA is already precompiled (from Java) when checking out the FASTCUDA
software from the repository mentioned above.
The commands required to compile the CUDA GUI are:
$ cd src/GUI
$ qmake -project QT+=core QT+=gui QT+=xml
$ qmake
$ make clean
$ make
Then you can run the application with:
$ ./GUI
Please note that the executable name has changed since D2.3, to better qualify its functionality:
the GUI used to be called “FastCuda”.
Execution of Vivado HLS from the GUI requires that the tool is installed on the machine on
which the GUI runs, and that it is executable from the command line as:
$ vivado_hls
3. TOOL USAGE
The GUI uses six main windows to control the FastCuda project, as shown in Figure 1.
Figure 1 Initial screenshot of FASTCUDA GUI
In the left window there are two Text Editors, called "cuda" and "m-cuda", organized in a tab
widget.
(1) The “cuda” text editor is used for the CUDA input file, which is the starting point of the
SystemC code generation. This text editor works like a normal Linux text editor (e.g.
gedit). The source code can also be loaded with the "File->open" menu or the open icon,
as shown in Figure 2.
Figure 2 CUDA file opening dialogue
(2) The “m-cuda” text editor is used to show the generated C file that forms the body of the
SystemC thread implementing the kernel in hardware, as obtained from the MCUDA
compiler. Its content generally does not need to be modified, but it is useful for selecting the
loop(s) to be unrolled in the hardware implementation. In the example shown in Figure
3, the loop called “LOOP_11” is shown in the MCUDA editor and selected in the
right-hand window to be unrolled.
Figure 3 MCUDA output and loop structure selection
In the middle window there are three tabs that show the results of executing MCUDA.
(1) The “tree” tab represents the top level of the abstract syntax tree of the CUDA kernels.
For each kernel it shows:
• cuda function: the name of the kernel.
• function declarator: the name and type of the input arguments (which will become signals in SystemC, as described in D2.2 and D5.1).
• compound statement: a list of loops (labelled by MCUDA for convenience) in the CUDA source. All these loops can be selected to be “unrolled” by the SystemC synthesis tool in order to improve performance. Profiling information should be used as guidance to select which loops should be unrolled. Please note that unrolling a loop causes both an improvement in performance and an increase in area.
(2) The “xml” tab is a text editor that shows the xml file produced by MCUDA (this is
useful mostly for debugging purposes).
(3) The “vivado” tab will show, after synthesis is performed in the next stage, a summary of
the results of the FPGA implementation of the kernels.
In the bottom left window there is a log that shows some information about the state of the
FASTCUDA SystemC translation process, as shown in Figure 4. The same figure also shows
statistics on the cost of the FPGA implementation of the kernel and its local memories in terms
of:
- Block RAM (BRAM) blocks to implement CUDA shared memory and some local variables,
- DSP units to implement multiplications and additions,
- Flip-Flops to implement inter-thread and intra-thread control,
- Look-Up Tables (LUTs) to implement random logic.
Figure 4 Final results after SystemC code generation and synthesis
This data, collected and shown for each kernel, together with the profiling and design space
exploration data summarized in the two rightmost tabs, helps the designer make
decisions on the best HW/SW partitioning. Changes to the earlier decisions about loop
unrolling may of course also be necessary in order to meet the performance targets for HW
kernels. More unrolling exposes more parallelism to the tool, and thus often improves
performance. However, it increases the HW resources required, and may not be beneficial if the
bottleneck of the loop being unrolled is due to memory accesses (which are limited by the
number of BRAM ports).
Figure 5 shows the Vivado HLS synthesis results for a CUDA test case containing two (very
simple) kernels.
Figure 5 Final results for two HW kernels
The GUI uses both a menu and an icon-based toolbar to offer the user the main commands, in
order to manage the CUDA source file and run the FASTCUDA translation steps:
• “File” and “Edit” control the CUDA text editor.
• “Project” controls the FASTCUDA translation steps:
- "Compile" runs MCUDA on the CUDA file shown in the CUDA editor. The MCUDA output is displayed in the “mcuda” text editor, as well as in the “tree” and “xml” tab windows.
- “Synthesize” creates the SystemC and TCL files used by the synthesis tools (Vivado HLS from Xilinx and CtoSilicon from Cadence are currently supported). The contents of these files are displayed in the “Log” window. The “Synthesize” button also executes Vivado HLS, if it is installed on the host. The results are shown in the “vivado” tab on the central column.
- “Estimation” performs software performance estimation, to evaluate which kernels represent the performance bottlenecks of the application. This is described in more detail in D4.1 (Implementation of estimation tools).
- “Exploration” starts the design space exploration step, which, given area and performance numbers, chooses the best HW/SW partitioning. This is also described in more detail in D4.3 (Final implementation of exploration tool).
4. OUTPUT FILES
The FASTCUDA GUI creates four files for each processed CUDA kernel, with name
<kernel_name>, in a sub-directory called <kernel_name>__MCUDA_kernel :
(1) defines.h contains the macros that define:
a. the kernel name, derived from the CUDA function name by appending
_MCUDA_kernel to it.
b. the module name, derived from the CUDA source file name without the .cu
extension, by appending the kernel name to it.
(2) decl.h contains the kernel input argument names, declared as sc_in.
(3) unroll.tcl and directives.tcl contain the loop unrolling commands (if any)
for the synthesis tool. The former uses the syntax of the CtoSilicon synthesis tool,
and the latter the syntax of the Vivado HLS tool. They can be easily converted to the
format used by other tools; please contact the support team to use them with other tools.
A sketch of these generated files is shown below.
Note: currently these two files are created in the CUDA file directory, not in the kernel
sub-directory.
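As a purely illustrative sketch (for a hypothetical kernel MyKernel in a source file mysource.cu; the real contents are generated by the GUI and depend on the kernel, and the macro names below are illustrative rather than the exact ones used by the tool), the first two files look roughly like:

// defines.h (sketch): kernel and module name macros
#define KERNEL_NAME MyKernel_MCUDA_kernel
#define MODULE_NAME mysource_MyKernel_MCUDA_kernel

// decl.h (sketch): kernel input arguments declared as SystemC input ports
sc_in<sc_uint<32> > A;
sc_in<sc_uint<32> > wA;

while directives.tcl contains standard Vivado HLS unrolling directives, e.g.:

# directives.tcl (sketch): unroll the loop labelled LOOP_11 by MCUDA
set_directive_unroll "MyKernel_MCUDA_kernel/LOOP_11"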
These files are meant to be included into the fastcuda.cc file, which is found in the
repository under the experiments/matmul/fpga directory, and which contains (a simplified
structural sketch is shown after this list):
• the top-level SystemC module interface,
• the start/ready protocol for synchronization with the host processor, and
• the interface with the FASTCUDA global memory controller (GMC), described in D5.1 (the interface is actually modelled in the gmem.h file).
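The following is a heavily simplified structural sketch of how these pieces fit together; it is not the actual fastcuda.cc from the repository, and the port names, macro names and include scheme are only indicative of the structure described above:

// Simplified sketch (NOT the real fastcuda.cc); matmul.cu is assumed to be
// the CUDA source file, so matmul.c holds the MCUDA-generated kernel body.
#include <systemc.h>
#include "defines.h"                 // kernel and module name macros
#include "gmem.h"                    // global memory (GMC or AXI) interface

SC_MODULE(MODULE_NAME) {
    sc_in<bool>  clk;
    sc_in<bool>  reset;
    sc_in<bool>  start;              // asserted by the host to launch the kernel
    sc_out<bool> ready;              // asserted by the kernel when it is done
#include "decl.h"                    // kernel arguments as sc_in ports

    void run() {
        ready.write(false);
        wait();
        while (true) {
            while (!start.read()) wait();   // start/ready handshake with the host
            ready.write(false);
#include "matmul.c"                  // kernel body generated by MCUDA
            ready.write(true);
            wait();
        }
    }

    SC_CTOR(MODULE_NAME) {
        SC_CTHREAD(run, clk.pos());
        reset_signal_is(reset, true);
    }
};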
Moreover, <CUDA_file_name>.c is created in the CUDA file directory (one level above the
above-mentioned sub-directories). It contains the kernel bodies, translated into C++ and ready
for inclusion as a SystemC thread.
The following files are automatically copied into each kernel sub-directory, in order to enable its
synthesis:
• A file called gmem.h, which contains the interface between the synthesized kernel and global memory (a usage sketch is shown after this list), using:
o either the FASTCUDA global memory interface controller,
o or a TLM-2 AXI bus transactor interfacing directly to the Xilinx DDR3 controller.
• Some sample TCL scripts which can be used to synthesize the SystemC kernel using the CtoSilicon tool (the top-level script is called ctos.tcl, and it includes build.tcl and setup.tcl).
• A Vivado HLS project setup, in the sub-directory matmul, ready for synthesis using Vivado HLS.
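For reference, the sketch below shows how a kernel thread might access global memory through the ports listed in the Appendix (RD_Req_GM, RD_Address_GM, ACK_GM, DataIn_GM, WR_Req_GM, WR_Address_GM, DataOut_GM); it assumes a simple request/acknowledge handshake and is only an approximation of what gmem.h actually implements:

// Hypothetical helpers, assumed to be members of the kernel SC_MODULE,
// which already declares the global-memory ports, e.g.:
//   sc_out<bool>         RD_Req_GM, WR_Req_GM;
//   sc_out<sc_uint<32> > RD_Address_GM, WR_Address_GM, DataOut_GM;
//   sc_in<bool>          ACK_GM;
//   sc_in<sc_uint<32> >  DataIn_GM;

sc_uint<32> gm_read(sc_uint<32> addr) {
    RD_Address_GM.write(addr);
    RD_Req_GM.write(true);
    do { wait(); } while (!ACK_GM.read());   // wait for the memory to answer
    RD_Req_GM.write(false);
    return DataIn_GM.read();
}

void gm_write(sc_uint<32> addr, sc_uint<32> data) {
    WR_Address_GM.write(addr);
    DataOut_GM.write(data);
    WR_Req_GM.write(true);
    do { wait(); } while (!ACK_GM.read());   // wait for the write to complete
    WR_Req_GM.write(false);
}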
Again, please contact the support team for help with using tools other than Vivado HLS and
CtoSilicon.
5. EXAMPLE DESIGN: MATRIX MULTIPLICATION
The experiments/matmul directory in the repository contains an example design, namely
the matrix multiplication kernel that was already described in the deliverables describing the
FASTCUDA hardware synthesis strategy. The main source file is called matmul.cu, and it is
taken directly from the CUDA programmer’s guide. The user can choose it with the
FASTCUDA GUI in order to illustrate the SystemC synthesis steps.
The experiments/matmul/MatrixMulKernel__MCUDA_kernel sub-directory
(created for the MatrixMulKernel kernel contained in matmul.cu) also contains:
• A simulation testbench, called tb.cc, that stimulates the matrix multiplication kernel to perform the multiplication of two 128x128 matrices (a minimal stand-alone simulation sketch is shown after this list).
• The constants.h file, which contains some constants that are used by tb.cc and by fastcuda.cc to size the AXI burst cache, the AXI burst length, the memory latency when simulating the global memory interface, etc.
• Several files modelling:
o the AXI master interface, and
o the AXI slave and DDR3 controller provided by Xilinx.
These files can be used to perform a stand-alone simulation of the matrix multiplication kernel without the rest of the FASTCUDA infrastructure (multi-processor, memory controllers, etc.).
• The run_sc.common file, which contains the command line to simulate this setup using the Incisive simulator. Please note that valid licenses from Xilinx and Cadence are needed to use these tools and files.
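A minimal stand-alone simulation sketch is shown below; it is not the actual tb.cc, it only illustrates how the start/ready protocol described above would be driven from the testbench side (the kernel module instantiation and its port bindings are left as comments, since they depend on fastcuda.cc):

// Sketch of an sc_main driving the start/ready protocol (not the real tb.cc).
#include <systemc.h>

int sc_main(int argc, char *argv[]) {
    sc_clock        clk("clk", 10, SC_NS);
    sc_signal<bool> reset, start, ready;

    // The kernel module would be instantiated and bound here, e.g.:
    //   matmul_MatrixMulKernel__MCUDA_kernel dut("dut");
    //   dut.clk(clk); dut.reset(reset); dut.start(start); dut.ready(ready);
    //   ... plus the kernel arguments and the global-memory ports ...

    reset = true;  start = false;
    sc_start(100, SC_NS);                 // apply reset for a few cycles
    reset = false;
    start = true;                         // launch the kernel
    sc_start(10, SC_NS);
    start = false;
    for (int i = 0; i < 1000000 && !ready.read(); ++i)
        sc_start(10, SC_NS);              // run until the kernel signals ready
    return 0;
}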
The TCL scripts that drive CtoSilicon to generate a synthesizable RTL file in Verilog for the
kernel are:
• ctos.tcl, the top-level synthesis script.
• build.tcl, the script (called by ctos.tcl) that reads in the SystemC code and builds the internal database (meant to be used for interactive synthesis with the CtoSilicon GUI).
• setup.tcl, the script (called by build.tcl) that defines the synthesis options, e.g.:
o whether to use a Block RAM (BRAM) or registers to implement local and shared arrays,
o whether to use the GMC or directly read/write the DRAM via the AXI bus.
The Vivado HLS project setup files are ready for a Virtex 7 synthesis run. They include the
directives.tcl file that is generated by the FASTCUDA GUI for Vivado HLS, specifying the
loops to unroll. Please edit them to change, for example, the Xilinx FPGA platform.
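For reference, a batch-mode Vivado HLS run over such a project can be scripted with standard Vivado HLS Tcl commands; the sketch below is illustrative only (the project, solution, part and file names are taken from the Appendix report, but the actual scripts shipped in the repository may differ):

# Sketch of a Vivado HLS batch script, e.g. run as: vivado_hls -f run_hls.tcl
open_project project1
set_top matmul_MatrixMulKernel_MCUDA_kernel
add_files fastcuda.cc
open_solution solution1
set_part {xc7vx330tffg1157-2}    ;# Virtex 7 part used in the Appendix report
create_clock -period 10
source directives.tcl            ;# loop unrolling directives generated by the GUI
csynth_design
exit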
Figure 4 above shows the results of SystemC synthesis for the matrix multiplication test case.
6. CONCLUSION
This report showed how to use the FASTCUDA translation tool GUI to implement a CUDA kernel
in SystemC. Integration of the tool with Vivado HLS has been implemented in the last period of the
project, and it has been used to perform a variety of design experiments.
APPENDIX
The appendix shows the synthesis report which is provided by Vivado HLS and which is used to generate
the synthesis summary results in the “vivado” tab of the FASTCUDA GUI. The example comes from the
matrix multiplication test case discussed above.
================================================================
== Report Version
================================================================
* Tool:          Vivado(TM) HLS - High-Level Synthesis from C, C++ and SystemC
* Version:       2012.3
* Build date:    Fri Oct 12 10:57:10 AM 2012
* Copyright (C): 2012 Xilinx Inc. All rights reserved.
================================================================
== General Information
================================================================
* Project:  project1
* Solution: solution1
* Date:     Wed Feb 27 16:37:30 2013
================================================================
== User Assignments
================================================================
* Product Family:           virtex7 virtex7_fpv6
* Part:                     xc7vx330tffg1157-2
* Top Model name:           matmul_MatrixMulKernel_MCUDA_kernel
* Target clock period (ns): 10.00
* Clock uncertainty (ns):   1.25
================================================================
== Performance Estimates
================================================================
+ Summary of timing analysis:
    * Estimated clock period (ns): 8.53
+ Summary of overall latency (clock cycles):
    * Best-case latency:    ?
    * Average-case latency: ?
    * Worst-case latency:   ?
================================================================
== Area Estimates
================================================================
* Summary:
(Target device: xc7vx330tffg1157-2)
+---+-----------------+---------+-------+--------+--------+-------+
| ID|             Name| BRAM_18K| DSP48E|      FF|     LUT|  SLICE|
+---+-----------------+---------+-------+--------+--------+-------+
|  0|        Component|       13|     24|    2215|    3136|      -|
|  1|       Expression|        -|      -|       -|       -|      -|
|  2|             FIFO|        -|      -|       -|       -|      -|
|  3|           Memory|        -|      -|       -|       -|      -|
|  4|      Multiplexer|        -|      -|       -|       -|      -|
|  5|         Register|        -|      -|      99|       -|      -|
|  6|      ShiftMemory|        -|      -|       -|       -|      -|
+---+-----------------+---------+-------+--------+--------+-------+
|  -|            Total|       13|     24|    2314|    3136|      0|
+---+-----------------+---------+-------+--------+--------+-------+
|  -|        Available|     1500|   1120|  408000|  204000|  51000|
+---+-----------------+---------+-------+--------+--------+-------+
|  -|  Utilization (%)|       ~0|      2|      ~0|       1|      0|
+---+-----------------+---------+-------+--------+--------+-------+
+ Details:
* Component:
+---+-----------------------------------------------------------------------------------------------+---------+-------+------+------+
| ID|                                                                                           Name| BRAM_18K| DSP48E|    FF|   LUT|
+---+-----------------------------------------------------------------------------------------------+---------+-------+------+------+
|  0| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)  |       13|     24|  2215|  3136|
+---+-----------------------------------------------------------------------------------------------+---------+-------+------+------+
|  -|                                                                                          Total|       13|     24|  2215|  3136|
+---+-----------------------------------------------------------------------------------------------+---------+-------+------+------+
* Expression: N/A
* FIFO:       N/A
* Memory:      N/A
* Multiplexer: N/A
* Register:
+---+---------------+-----+-------+----+
| ID|           Name| Bits| Consts|  FF|
+---+---------------+-----+-------+----+
|  0|     DataOut_GM|   32|      0|  32|
|  1|  RD_Address_GM|   32|      0|  32|
|  2|      RD_Req_GM|    1|      0|   1|
|  3|  WR_Address_GM|   32|      0|  32|
|  4|      WR_Req_GM|    1|      0|   1|
|  5|          ready|    1|      0|   1|
+---+---------------+-----+-------+----+
|  -|          Total|   99|      0|  99|
+---+---------------+-----+-------+----+
* ShiftMemory: N/A
* Hierarchical Multiplexer Count:
+---+-----------------------------------------------------------------------------------------------+-----+------+------+
| ID|                                                                                           Name| Size|  Bits| Count|
+---+-----------------------------------------------------------------------------------------------+-----+------+------+
|  0|                                                                                   (This level)|    0|     0|     0|
|  1| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)  |  157|  1028|  2490|
+---+-----------------------------------------------------------------------------------------------+-----+------+------+
|  -|                                                                                          Total|  157|  1028|  2490|
+---+-----------------------------------------------------------------------------------------------+-----+------+------+
================================================================
== Power Estimate
================================================================
* Summary:
+---+-------------+------+
| ID|         Name| Power|
+---+-------------+------+
|  0|    Component|   537|
|  1|   Expression|     -|
|  2|         FIFO|     -|
|  3|       Memory|     -|
|  4|  Multiplexer|     -|
|  5|     Register|     9|
|  6|  ShiftMemory|     -|
+---+-------------+------+
|  -|        Total|   546|
+---+-------------+------+
* Hierarchical Register Count:
+---+-----------------------------------------------------------------------------------------------+------+
| ID|                                                                                           Name| Count|
+---+-----------------------------------------------------------------------------------------------+------+
|  0|                                                                                   (This level)|    99|
|  1| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)  |  1945|
+---+-----------------------------------------------------------------------------------------------+------+
|  -|                                                                                          Total|  2044|
+---+-----------------------------------------------------------------------------------------------+------+
================================================================
== Interface Summary
================================================================
* Interfaces:
+---+---------------+--------------+-----+-----+-----------------------------------------------------------------------------+
| ID| RTL Ports     | Type         | Dir | Bits| Object                                                                      |
+---+---------------+--------------+-----+-----+-----------------------------------------------------------------------------+
|  0| A             | pointer      | in  |   32| A                                                                           |
|  1| wB            | pointer      | in  |   32| wB                                                                          |
|  2| wA            | pointer      | in  |   32| wA                                                                          |
|  3| C             | pointer      | in  |   32| C                                                                           |
|  4| B             | pointer      | in  |   32| B                                                                           |
|  5| oblockIdx_x   | pointer      | in  |   32| oblockIdx_x                                                                 |
|  6| oblockIdx_y   | pointer      | in  |   32| oblockIdx_y                                                                 |
|  7| oblockIdx_z   | pointer      | in  |   32| oblockIdx_z                                                                 |
|  8| oblockDim_x   | pointer      | in  |   32| oblockDim_x                                                                 |
|  9| oblockDim_y   | pointer      | in  |   32| oblockDim_y                                                                 |
| 10| oblockDim_z   | pointer      | in  |   32| oblockDim_z                                                                 |
| 11| clk           | return value | in  |    1| matmul_MatrixMulKernel__MCUDA_kernel::matmul_MatrixMulKernel__MCUDA_kernel |
| 12| reset         | return value | in  |    1| matmul_MatrixMulKernel__MCUDA_kernel::matmul_MatrixMulKernel__MCUDA_kernel |
| 13| start         | pointer      | in  |    1| start                                                                       |
| 14| ready         | pointer      | out |    1| ready                                                                       |
| 15| RD_Req_GM     | pointer      | out |    1| RD_Req_GM                                                                   |
| 16| RD_Address_GM | pointer      | out |   32| RD_Address_GM                                                               |
| 17| ACK_GM        | pointer      | in  |    1| ACK_GM                                                                      |
| 18| DataIn_GM     | pointer      | in  |   32| DataIn_GM                                                                   |
| 19| WR_Req_GM     | pointer      | out |    1| WR_Req_GM                                                                   |
| 20| WR_Address_GM | pointer      | out |   32| WR_Address_GM                                                               |
| 21| DataOut_GM    | pointer      | out |   32| DataOut_GM                                                                  |
+---+---------------+--------------+-----+-----+-----------------------------------------------------------------------------+