FASTCUDA
Project No: 286770

D2.4. Final integrated version of CUDA to SystemC translator and user's manual

31st August 2013

Abstract: This deliverable describes the functionality and operation of the tool that translates a CUDA kernel into a SystemC model, which can then be synthesized using any FPGA synthesis tool starting from SystemC (e.g. Xilinx Vivado, or Cadence CtoSilicon and Xilinx ISE). The tool also allows the designer to make some implementation trade-offs, such as whether to unroll loops of the CUDA kernel that do not access global or shared memory; such loops can be unrolled to increase performance at the expense of hardware cost. The Graphical User Interface that allows the user to perform the translation was designed to be friendly and easy to use, even for users who are not intimately familiar with hardware design.

Document Manager: Luciano Lavagno, POLITO, Professor
Document Id N°: Final integrated version of CUDA to SystemC translator and user's manual
Version: V0.1
Date: 28/08/13
Filename: FASTCUDA-D2.4_POLITO_V0.1-28082013.docx

Disclaimer

This document contains material, which is the copyright of certain FASTCUDA contractors, and may not be reproduced or copied without permission. All FASTCUDA consortium partners have agreed to the full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.

This document is produced under the EC contract 286770. It is the property of the FASTCUDA consortium and shall not be distributed or reproduced without the formal approval of the FASTCUDA Steering Committee.

The FASTCUDA Consortium consists of the following companies:

Participant no.    Participant organisation name                                             Short name   Country
P1 (Coordinator)   Ingenieria de Sistemas Intensivos en Software Ltd                         ISIS         Spain
P2                 Politecnico di Torino                                                     POLITO       Italy
P3                 Universidad Politécnica de Madrid                                         UPM          Spain
P4                 Telecommunication Systems Institute (TSI Technical University of Crete)   TSI          Greece
P5                 Ardoran                                                                   ARD          Estonia
P6                 FSResult GmbH                                                             FSR          Germany

The information in this document is provided “as is” and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

An up-to-date version of this document can be found on FASTCUDA's website (http://fastcuda.eu/techreports.html).

1. INTRODUCTION

In this deliverable we describe in detail how a CUDA kernel can be translated into a SystemC implementation that follows the mechanism for interfacing with the host processor and shared memory defined in D2.2 and D5.1.

The Graphical User Interface that guides the user through this process requires minimal intervention. The only step in which a human decision may be needed is whether or not to unroll some loops. By default, no loop is unrolled, and hence in the final hardware implementation each iteration requires at least one clock cycle. It may however be advantageous, in order to speed up the hardware implementation of the kernel, to unroll some loops that involve only internal computations, so as to reduce control overhead and offer more parallelism to the hardware scheduler. This decision is not automated, because it involves trade-offs that only a human can judge. The GUI makes it easy by allowing the designer to refer directly to loops in the original code.

The reader is referred to D2.1 for a definition of the supported CUDA subset.
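The unrolling trade-off described above can be illustrated with a small, purely hypothetical C++ sketch; it is not output of the FASTCUDA tool, and the names N, dot_rolled and dot_unrolled4 are assumptions made for this example. The rolled loop performs one multiply-accumulate per iteration, while the version unrolled by four exposes four independent multiply-accumulate chains per iteration, which a hardware scheduler may be able to execute in parallel if enough multipliers are available.

```cpp
#include <array>
#include <cstddef>

// Hypothetical trip count, fixed at compile time so that the loop
// can be unrolled by a constant factor.
constexpr std::size_t N = 8;

// Rolled form: one multiply-accumulate per loop iteration.
int dot_rolled(const std::array<int, N>& a, const std::array<int, N>& b) {
    int acc = 0;
    for (std::size_t i = 0; i < N; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Manually unrolled by a factor of four: each iteration contains four
// independent multiply-accumulate operations. Fewer loop-control steps
// are needed, at the price of more arithmetic operators if the four
// operations are mapped to parallel hardware.
int dot_unrolled4(const std::array<int, N>& a, const std::array<int, N>& b) {
    static_assert(N % 4 == 0, "sketch assumes the trip count is a multiple of 4");
    int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (std::size_t i = 0; i < N; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Both functions compute the same integer result. In the FASTCUDA flow the designer does not rewrite the loop by hand as done here; the loop is only selected in the GUI and the synthesis tool performs the unrolling.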
The restriction to only one CUDA kernel per file, as described in earlier internal versions of this document, has been lifted, as mentioned in D2.3. Since D2.3, a major change has been introduced in the workflow, namely support for executing Vivado HLS (the Xilinx high-level synthesis tool) directly from the GUI. Moreover, several bugs have been fixed, and the flow has been extensively tested on a number of CUDA test cases.

2. TOOL INSTALLATION

The final version of the tools can be found in the following SVN repository:

svn://wormtongue.polito.it/svn/fastcuda

Please contact fastcuda-synthesis@gforge.inetsis.es to obtain support for installation or usage, or to report bugs.

The FASTCUDA graphical user interface (GUI) is developed with a cross-platform application framework called Qt. In order to compile the application you should install the Qt widget toolkit library, e.g. by executing:

$ sudo apt-get install libqt4-core libqt4-dev libqt4-gui qt4-dev-tools

or whatever equivalent command your Linux distribution requires to install these packages.

Moreover, the GUI relies on a modified version of the MCUDA translator from CUDA to C, originally from http://impact.crhc.illinois.edu/mcuda.aspx (please see the MCUDA web site for details on the MCUDA license). MCUDA generates:
- SystemC code for the body of each kernel to be implemented in hardware, and
- the information that the FASTCUDA GUI needs in order to synthesize the SC_MODULE interface in SystemC.

Note that MCUDA is already precompiled (from Java) when the FASTCUDA software is checked out from the repository mentioned above.
The commands required to compile the GUI are:

$ cd src/GUI
$ qmake -project QT+=core QT+=gui QT+=xml
$ qmake
$ make clean
$ make

Then you can run the application with:

$ ./GUI

Please note that the name has changed since D2.3, to better qualify its functionality. The GUI used to be called “FastCuda”.

Execution of Vivado HLS from the GUI requires that the tool is installed on the machine on which the GUI runs, and that it is executable from the command line as:

$ vivado_hls

3. TOOL USAGE

The GUI uses six main windows to control the FastCuda project, as shown in Figure 1.

Figure 1 Initial screenshot of FASTCUDA GUI

In the left window there are two text editors, called "cuda" and "m-cuda", organized in a tab widget.

(1) The “cuda” text editor is used for the CUDA input file, which is the starting point of the SystemC code generation. This text editor works like a normal Linux text editor (e.g. gedit). The source code can also be loaded with the "File->open" menu or the open icon, as shown in Figure 2.

Figure 2 CUDA file opening dialogue

(2) The “m-cuda” text editor shows the generated C file that is the body of the SystemC thread implementing the kernel in hardware, as obtained from the MCUDA compiler. Its content generally does not need to be modified, but it is useful for selecting the loop(s) to be unrolled in the hardware implementation. In the example shown in Figure 3, the loop called “LOOP_11” is shown in the MCUDA editor, and selected in the right hand window to be unrolled.
Figure 3 MCUDA output and loop structure selection

In the middle window there are three tabs that show the results of executing MCUDA.

(1) The “tree” tab represents the top level of the abstract syntax tree of the CUDA kernels. For each kernel it shows:
- cuda function: name of the kernel.
- function declaratory: name and type of the input arguments (which will become signals in SystemC, as described in D2.2 and D5.1).
- compound statement: a list of the loops (labelled by MCUDA for convenience) in the CUDA source.
All those loops can be selected to be “unrolled” by the SystemC synthesis tool in order to improve performance. Profiling information should be used as guidance to select which loops to unroll. Please note that unrolling a loop generally improves performance but also increases area.

(2) The “xml” tab is a text editor that shows the XML file produced by MCUDA (this is useful mostly for debugging purposes).

(3) The “vivado” tab shows, after synthesis is performed in the next stage, a summary of the results of the FPGA implementation of the kernels.

In the bottom left window there is a log that shows some information about the state of the FASTCUDA SystemC translation process, as shown in Figure 4. The same figure also shows statistics on the cost of the FPGA implementation of the kernel and its local memories in terms of:
- Block RAM (BRAM) blocks to implement CUDA shared memory and some local variables,
- DSP units to implement multiplications and additions,
- Flip-Flops to implement inter-thread and intra-thread control,
- Look-Up Tables (LUTs) to implement random logic.

Figure 4 Final results after SystemC code generation and synthesis

This data, collected and shown for each kernel, together with the profiling and design space exploration data summarized in the two rightmost tabs, helps the designer decide on the best HW/SW partitioning. Earlier decisions about loop unrolling may of course also need to be changed in order to meet the performance targets for HW kernels. More unrolling exposes more parallelism to the tool, and thus often improves performance. However, it increases the HW resources required, and may not be beneficial if the bottleneck of the loop being unrolled is due to memory accesses (which are limited by the number of BRAM ports).

Figure 5 shows the Vivado HLS synthesis results for a CUDA test case containing two (very simple) kernels.

Figure 5 Final results for two HW kernels

The GUI uses both a menu and an icon-based toolbar to offer the user the main commands needed to manage the CUDA source file and run the FASTCUDA translation steps:

“File” and “Edit” control the CUDA text editor.

“Project” controls the FASTCUDA translation steps:
- "Compile" runs MCUDA on the CUDA file shown in the CUDA editor. The MCUDA output is displayed in the “m-cuda” text editor, as well as in the “tree” and “xml” tab windows.
- “Synthesize” creates the SystemC and TCL files used by the synthesis tools (Vivado HLS from Xilinx and CtoSilicon from Cadence are currently supported). The contents of these files are displayed in the “Log” window. The “Synthesize” button also executes Vivado HLS, if it is installed on the host. The results are shown in the “vivado” tab in the central column.
- “Estimation” performs software performance estimation, to evaluate which kernels represent the performance bottlenecks of the application. This is described in more detail in D4.1 (Implementation of estimation tools).
- “Exploration” starts the design space exploration step, which chooses the best HW/SW partitioning given area and performance numbers. This is described in more detail in D4.3 (Final implementation of exploration tool).

4. OUTPUT FILES

The FASTCUDA GUI creates four files for each processed CUDA kernel, with name <kernel_name>, in a sub-directory called <kernel_name>__MCUDA_kernel:

(1) defines.h contains the macros that define:
a. the kernel name, derived from the CUDA function name by appending _MCUDA_kernel to it.
b. the module name, derived from the CUDA source file name without the .cu extension, by appending the kernel name to it.

(2) decl.h contains the kernel input argument names, declared as sc_in.

(3) unroll.tcl and directives.tcl contain the loop unrolling commands (if any) for the synthesis tool. The former uses the syntax of the CtoSilicon synthesis tool, and the latter the syntax of the Vivado HLS tool. Either file can easily be converted to the format used by other tools; please contact the support team to use it with other tools. Note: currently these two files are created in the CUDA file directory, not in the kernel sub-directory.
These files are meant to be included into the fastcuda.cc file, which is found in the repository under the experiments/matmul/fpga directory, and which contains:
- the top-level SystemC module interface,
- the start/ready protocol for synchronization with the host processor,
- the interface with the FASTCUDA global memory controller (GMC), described in D5.1 (the interface is actually modelled in the gmem.h file).

Moreover, <CUDA_file_name>.c is created in the CUDA file directory (one level above the kernel sub-directories mentioned above). It contains the kernel bodies, translated into C++, and ready for inclusion as a SystemC thread.

The following files are automatically copied into each kernel sub-directory, in order to enable its synthesis:
- A file called gmem.h, which contains the interface between the synthesized kernel and global memory, using:
  o either the FASTCUDA global memory interface controller,
  o or a TLM-2 AXI bus transactor interfacing directly to the Xilinx DDR3 controller.
- Some sample TCL scripts which can be used to synthesize the SystemC kernel using the CtoSilicon tool (the top-level is called ctos.tcl, and it includes build.tcl and setup.tcl).
- A Vivado HLS project setup, in the sub-directory matmul, ready for synthesis using Vivado HLS.

Again, please contact the support team for help with using tools other than Vivado HLS and CtoSilicon.

5. EXAMPLE DESIGN: MATRIX MULTIPLICATION

The experiments/matmul directory in the repository contains an example design, namely the matrix multiplication kernel that was already described in the deliverables describing the FASTCUDA hardware synthesis strategy. The main source file is called matmul.cu, and it is taken directly from the CUDA programmer’s guide.
The user can open it in the FASTCUDA GUI to walk through the SystemC synthesis steps.

The experiments/matmul/MatrixMulKernel__MCUDA_kernel sub-directory (created for the MatrixMulKernel contained in matmul.cu) also contains:
- A simulation testbench, called tb.cc, that stimulates the matrix multiplication kernel to perform the multiplication of two 128x128 matrices.
- The constants.h file, which contains some constants that are used by tb.cc and by fastcuda.cc to size the AXI burst cache, the AXI burst length, the memory latency when simulating the global memory interface, etc.
- Several files modelling:
  o the AXI master interface, and
  o the AXI slave and DDR3 controller provided by Xilinx.
  These files can be used to perform a stand-alone simulation of the matrix multiplication kernel without the rest of the FASTCUDA infrastructure (multi-processor, memory controllers, etc.).
- The run_sc.common file, which contains the command line to simulate this setup using the Incisive simulator.

Please note that valid licenses from Xilinx and Cadence are needed to use these tools and files.

The TCL scripts that drive CtoSilicon to generate a synthesizable RTL file in Verilog for the kernel are:
- ctos.tcl, the top-level synthesis script.
- build.tcl, the script (called by ctos.tcl) that reads in SystemC and builds the internal database (meant to be used for interactive synthesis with the CtoSilicon GUI).
- setup.tcl, the script (called by build.tcl) that defines the synthesis options, e.g.:
  o Whether to use a Block RAM (BRAM) or registers to implement local and shared arrays.
  o Whether to use the GMC or directly read/write the DRAM via the AXI bus.

The Vivado HLS project setup files are ready for a Virtex 7 synthesis run. They include the directives.tcl file that is generated by the FASTCUDA GUI for Vivado HLS, specifying the loops to unroll. Please edit them to change, for example, the Xilinx FPGA platform.
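The exact contents that the GUI writes into directives.tcl are not reproduced in this report. As a hedged sketch only, the C++ helper below builds loop unrolling directives in the form accepted by the Vivado HLS Tcl command set_directive_unroll, whose location argument is written as function/label. The helper name make_unroll_directives and the function name passed to it are hypothetical; the location string actually required for a SystemC process in a given design may differ and should be checked against the generated file.

```cpp
#include <string>
#include <vector>

// Builds one "set_directive_unroll" line per selected loop label.
// 'function' is the (hypothetical) name of the synthesized function that
// contains the labelled loops; 'loop_labels' are labels such as "LOOP_11".
// Without a -factor option, Vivado HLS interprets the directive as a
// request to unroll the loop completely.
std::string make_unroll_directives(const std::string& function,
                                   const std::vector<std::string>& loop_labels) {
    std::string out;
    for (const std::string& label : loop_labels) {
        out += "set_directive_unroll \"" + function + "/" + label + "\"\n";
    }
    return out;
}
```

For instance, calling the helper with the assumed function name my_kernel and the single label LOOP_11 yields the line set_directive_unroll "my_kernel/LOOP_11". An empty selection yields an empty file, which corresponds to the default behaviour of leaving every loop rolled.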
Figure 4 above shows the results of SystemC synthesis for the matrix multiplication test case.

6. CONCLUSION

This report showed how to use the FASTCUDA translation tool GUI to implement a CUDA kernel in SystemC. Integration of the tool with Vivado HLS was implemented in the last period of the project, and it has been used to perform a variety of design experiments.

APPENDIX

The appendix shows the synthesis report that is provided by Vivado HLS and that is used to generate the synthesis summary results in the “vivado” tab of the FASTCUDA GUI. The example comes from the matrix multiplication test case discussed above.

================================================================
== Report Version
================================================================
* Tool: Vivado(TM) HLS - High-Level Synthesis from C, C++ and SystemC
* Version: 2012.3
* Build date: Fri Oct 12 10:57:10 AM 2012
* Copyright (C): 2012 Xilinx Inc. All rights reserved.
================================================================
== General Information
================================================================
* Project: project1
* Solution: solution1
* Date: Wed Feb 27 16:37:30 2013

================================================================
== User Assignments
================================================================
* Product Family: virtex7 virtex7_fpv6
* Part: xc7vx330tffg1157-2
* Top Model name: matmul_MatrixMulKernel_MCUDA_kernel
* Target clock period (ns): 10.00
* Clock uncertainty (ns): 1.25

================================================================
== Performance Estimates
================================================================
+ Summary of timing analysis:
    * Estimated clock period (ns): 8.53
+ Summary of overall latency (clock cycles):
    * Best-case latency: ?
    * Average-case latency: ?
    * Worst-case latency: ?

================================================================
== Area Estimates
================================================================
* Summary:
(Target device: xc7vx330tffg1157-2)
+---+-----------------+---------+-------+--------+--------+-------+
| ID|             Name| BRAM_18K| DSP48E|      FF|     LUT|  SLICE|
+---+-----------------+---------+-------+--------+--------+-------+
|  0|        Component|       13|     24|    2215|    3136|      -|
|  1|       Expression|        -|      -|       -|       -|      -|
|  2|             FIFO|        -|      -|       -|       -|      -|
|  3|           Memory|        -|      -|       -|       -|      -|
|  4|      Multiplexer|        -|      -|       -|       -|      -|
|  5|         Register|        -|      -|      99|       -|      -|
|  6|      ShiftMemory|        -|      -|       -|       -|      -|
+---+-----------------+---------+-------+--------+--------+-------+
|  -|            Total|       13|     24|    2314|    3136|      0|
+---+-----------------+---------+-------+--------+--------+-------+
|  -|        Available|     1500|   1120|  408000|  204000|  51000|
+---+-----------------+---------+-------+--------+--------+-------+
|  -|  Utilization (%)|       ~0|      2|      ~0|       1|      0|
+---+-----------------+---------+-------+--------+--------+-------+

+ Details:
* Component:
+---+---------------------------------------------------------------------------------------------+---------+-------+------+------+
| ID|                                                                                         Name| BRAM_18K| DSP48E|    FF|   LUT|
+---+---------------------------------------------------------------------------------------------+---------+-------+------+------+
|  0| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)|       13|     24|  2215|  3136|
+---+---------------------------------------------------------------------------------------------+---------+-------+------+------+
|  -|                                                                                        Total|       13|     24|  2215|  3136|
+---+---------------------------------------------------------------------------------------------+---------+-------+------+------+

* Expression: N/A
* FIFO: N/A
* Memory: N/A
* Multiplexer: N/A
* Register:
+---+---------------+-----+-------+----+
| ID|           Name| Bits| Consts|  FF|
+---+---------------+-----+-------+----+
|  0|     DataOut_GM|   32|      0|  32|
|  1|  RD_Address_GM|   32|      0|  32|
|  2|      RD_Req_GM|    1|      0|   1|
|  3|  WR_Address_GM|   32|      0|  32|
|  4|      WR_Req_GM|    1|      0|   1|
|  5|          ready|    1|      0|   1|
+---+---------------+-----+-------+----+
|  -|          Total|   99|      0|  99|
+---+---------------+-----+-------+----+

* ShiftMemory: N/A

* Hierarchical Multiplexer Count:
+---+---------------------------------------------------------------------------------------------+-----+------+------+
| ID|                                                                                         Name| Size|  Bits| Count|
+---+---------------------------------------------------------------------------------------------+-----+------+------+
|  0|                                                                                 (This level)|    0|     0|     0|
|  1| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)|  157|  1028|  2490|
+---+---------------------------------------------------------------------------------------------+-----+------+------+
|  -|                                                                                        Total|  157|  1028|  2490|
+---+---------------------------------------------------------------------------------------------+-----+------+------+

================================================================
== Power Estimate
================================================================
* Summary:
+---+-------------+------+
| ID|         Name| Power|
+---+-------------+------+
|  0|    Component|   537|
|  1|   Expression|     -|
|  2|         FIFO|     -|
|  3|       Memory|     -|
|  4|  Multiplexer|     -|
|  5|     Register|     9|
|  6|  ShiftMemory|     -|
+---+-------------+------+
|  -|        Total|   546|
+---+-------------+------+

* Hierarchical Register Count:
+---+---------------------------------------------------------------------------------------------+------+
| ID|                                                                                         Name| Count|
+---+---------------------------------------------------------------------------------------------+------+
|  0|                                                                                 (This level)|    99|
|  1| grp_matmul_MatrixMulKernel_MCUDA_kernel_run_fu_126 (matmul_MatrixMulKernel_MCUDA_kernel_run)|  1945|
+---+---------------------------------------------------------------------------------------------+------+
|  -|                                                                                        Total|  2044|
+---+---------------------------------------------------------------------------------------------+------+

================================================================
== Interface Summary
================================================================
* Interfaces:
+---+---------------+----------------------------------------------------------------------------+-------------+------+------------+----------+-----+-----+
| ID|      RTL Ports|                                                                      Object|         Type| Scope| IO Protocol| IO Config|  Dir| Bits|
+---+---------------+----------------------------------------------------------------------------+-------------+------+------------+----------+-----+-----+
|0-3|   A, wA, wB, C|                                                                A, wA, wB, C|      pointer|     -|           -|         -|   in|   32|
|  4|              B|                                                                           B|      pointer|     -|           -|         -|   in|   32|
|  5|    oblockIdx_x|                                                                 oblockIdx_x|      pointer|     -|           -|         -|   in|   32|
|  6|    oblockIdx_y|                                                                 oblockIdx_y|      pointer|     -|           -|         -|   in|   32|
|  7|    oblockIdx_z|                                                                 oblockIdx_z|      pointer|     -|           -|         -|   in|   32|
|  8|    oblockDim_x|                                                                 oblockDim_x|      pointer|     -|           -|         -|   in|   32|
|  9|    oblockDim_y|                                                                 oblockDim_y|      pointer|     -|           -|         -|   in|   32|
| 10|    oblockDim_z|                                                                 oblockDim_z|      pointer|     -|           -|         -|   in|   32|
| 11|            clk|  matmul_MatrixMulKernel__MCUDA_kernel::matmul_MatrixMulKernel__MCUDA_kernel| return value|     -|           -|         -|   in|    1|
| 12|          reset|                                                                           -|            -|     -|           -|         -|   in|    1|
| 13|          start|                                                                       start|      pointer|     -|           -|         -|   in|    1|
| 14|          ready|                                                                       ready|      pointer|     -|           -|         -|  out|    1|
| 15|      RD_Req_GM|                                                                   RD_Req_GM|      pointer|     -|           -|         -|  out|    1|
| 16|  RD_Address_GM|                                                               RD_Address_GM|      pointer|     -|           -|         -|  out|   32|
| 17|         ACK_GM|                                                                      ACK_GM|      pointer|     -|           -|         -|   in|    1|
| 18|      DataIn_GM|                                                                   DataIn_GM|      pointer|     -|           -|         -|   in|   32|
| 19|      WR_Req_GM|                                                                   WR_Req_GM|      pointer|     -|           -|         -|  out|    1|
| 20|  WR_Address_GM|                                                               WR_Address_GM|      pointer|     -|           -|         -|  out|   32|
| 21|     DataOut_GM|                                                                  DataOut_GM|      pointer|     -|           -|         -|  out|   32|
+---+---------------+----------------------------------------------------------------------------+-------------+------+------------+----------+-----+-----+
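The summary figures in the report above can be cross-checked with simple arithmetic. The C++ sketch below assumes that each entry of the "Utilization (%)" row is the integer part of 100 x Total / Available, printed as ~0 when the resource is used but the result is below 1%; this rule is inferred from the numbers in the area table and is not taken from the Vivado HLS documentation. It also checks that the estimated clock period fits within the target clock period reduced by the clock uncertainty. The function names are hypothetical.

```cpp
#include <string>

// Integer utilization percentage, truncated toward zero
// (assumed rule, inferred from the values shown in the report).
int utilization_percent(long used, long available) {
    return static_cast<int>((100 * used) / available);
}

// Formats a utilization cell the way the report appears to:
// "~0" when a resource is used but the percentage truncates to zero.
std::string utilization_cell(long used, long available) {
    const int pct = utilization_percent(used, available);
    if (pct == 0 && used > 0) {
        return "~0";
    }
    return std::to_string(pct);
}

// True when the estimated clock period does not exceed the target
// period reduced by the clock uncertainty (all values in ns).
bool meets_timing(double target_ns, double uncertainty_ns, double estimated_ns) {
    return estimated_ns <= target_ns - uncertainty_ns;
}
```

With the values from the report, 13 of 1500 BRAM_18K blocks and 2314 of 408000 flip-flops both give ~0, 24 of 1120 DSP48E units give 2, and 3136 of 204000 LUTs give 1, matching the table. The estimated clock period of 8.53 ns is below 10.00 - 1.25 = 8.75 ns, so the timing target is met. The register total is likewise consistent: 32 + 32 + 1 + 32 + 1 + 1 = 99 flip-flops at the top level, and 99 + 1945 = 2044 in the hierarchical register count.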