SoC CAD
DDMCPP: The Data-Driven
Multithreading C Pre-Processor
徐子傑 Hsu, Zi Jei
Department of Electrical Engineering
National Cheng Kung University
Tainan, Taiwan, R.O.C
NCKU

Introduction(1/3)
Architecture and technology advances have resulted in
microprocessors that are able to achieve very high
performance.
Exploiting this performance for real-world applications is an ongoing challenge.
A common way to enhance performance is to overlap different tasks,
i.e. to execute them in parallel.



Substantial effort has been put into developing hardware
techniques [3] as well as compilers [10] that transparently exploit
parallelism at different levels (e.g. instruction and task).
This type of parallelism that is exploited transparently, i.e.
without user intervention, is known as implicit parallelism.
SoC & ASIC Lab
Introduction(2/3)
While implicit parallelism is the preferred way to exploit
performance in real-world applications, it is currently limited
to a moderate level of parallelism.
To achieve high degrees of parallelism, user intervention
is required, leading to what is known as explicit parallelism.


With this approach the user identifies the parallel sections of the
code and inserts commands that allow the system to execute
the tasks in parallel.

Introduction(3/3)
In order to evaluate a new architecture, researchers are
usually forced to hand-code the applications in order to
efficiently exploit the benefits of their proposal.

This is usually a very time consuming process that becomes
impractical and unfeasible when evaluation is done using a
large number of applications.
The alternative option, pursued by some projects, is to develop a
compiler for the particular architecture that automatically
generates the parallel code from the existing serial applications.
Although this is the ideal solution, it is a very time-consuming
effort that may delay the development of the architecture.


We decided to start with a Pre-Processor that takes as input
regular C code augmented with special directives and
produces the parallel code for the target architecture.

We call this tool the Data-Driven Multithreading C Pre-Processor, or DDMCPP.

Data-Driven Multithreading(1/7)
2.1. DDM Model of Execution

DDM provides effective latency tolerance by allowing the
computation processor to produce useful work while a
long-latency event is in progress.


This is achieved by scheduling a thread for execution only when its
input data have been produced i.e. scheduling in a Data-Driven
manner.
A program in DDM is a collection of Code-Blocks.

Each Code-Block comprises several threads, where a thread is a
sequence of instructions of arbitrary length. A producer/consumer
relationship exists among threads.

In a typical program, a set of threads, called the producers, create data
used by other threads, called the consumers.

Scheduling of Code-Blocks, as well as scheduling of threads within
a Code-Block, is done dynamically at runtime according to data
availability.

This task is performed with the help of the Thread Synchronization
Unit (TSU).
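
The data-driven rule above can be sketched in a few lines of C. This is only an illustrative software model, not the TSU design: the struct ddm_thread descriptor, the ddm_run and ddm_demo functions, and the size limits are all hypothetical names introduced here. Each thread keeps a count of producers that have not finished yet, and it becomes eligible to run only when that count reaches zero.

```c
#include <assert.h>

#define MAX_CONSUMERS 4   /* hypothetical limit, for the sketch only */

/* Hypothetical descriptor: one entry per DDM thread. */
struct ddm_thread {
    int pending;                    /* producers that have not completed yet */
    int consumers[MAX_CONSUMERS];   /* indices of threads that use our output */
    int n_consumers;
    int done;                       /* set once the thread has executed */
};

/* Execute all threads in data-driven order and record that order.
 * Returns the number of threads executed. */
int ddm_run(struct ddm_thread *t, int n, int *order)
{
    int executed = 0;
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n; i++) {
            if (t[i].done || t[i].pending != 0)
                continue;                    /* already run, or inputs missing */
            order[executed++] = i;           /* "execute" thread i */
            t[i].done = 1;
            for (int c = 0; c < t[i].n_consumers; c++) {
                int k = t[i].consumers[c];
                assert(t[k].pending > 0);
                t[k].pending--;              /* one more input of k is ready */
            }
            progress = 1;
        }
    }
    return executed;
}

/* Example graph: thread 0 feeds threads 1 and 2, which both feed thread 3. */
int ddm_demo(int *order)
{
    struct ddm_thread t[4] = {
        { 0, {1, 2}, 2, 0 },
        { 1, {3},    1, 0 },
        { 1, {3},    1, 0 },
        { 2, {0},    0, 0 },
    };
    return ddm_run(t, 4, order);
}
```

Calling ddm_demo with an array of four ints runs all four threads; thread 0 is always first and thread 3 is always last, because thread 3 has to wait for both of its producers.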

Data-Driven Multithreading(2/7)
2.2. Hardware Support for Data-Driven Multithreading:
Thread Synchronization Unit
The Thread Synchronization Unit (TSU) is the hardware unit
responsible for scheduling the DDM threads [7].
The CPU communicates with the TSU through simple read and
write instructions, as the TSU is implemented as a
memory-mapped device.
The sequence in which the processors execute the program
threads is defined dynamically by the TSU according to data
availability: a thread can be executed only when all its
producers have completed their execution.
When a thread completes its execution, it notifies the TSU. This
event, together with the program synchronization information, allows
the TSU to dynamically identify the next threads to be executed.
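
As a rough illustration of what "simple read and write instructions" to a memory-mapped device look like in C, the sketch below accesses a register block through volatile fields. The register layout (struct tsu_regs), its field names, and the helper functions are assumptions made for this example; the actual TSU interface is not described here. An ordinary object stands in for the device so the code runs on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout; the real TSU layout is not given in the text. */
struct tsu_regs {
    volatile uint32_t ready_thread;  /* read: ID of the next ready thread, 0 if none */
    volatile uint32_t thread_done;   /* write: ID of the thread that just completed */
};

/* On real hardware the pointer would refer to a fixed device address.
 * Here a plain object plays the role of the device. */
static struct tsu_regs fake_tsu;

/* A plain load is enough to query the device. */
static inline uint32_t tsu_read_ready(struct tsu_regs *tsu)
{
    return tsu->ready_thread;
}

/* A plain store is enough to notify the device. */
static inline void tsu_notify_done(struct tsu_regs *tsu, uint32_t id)
{
    tsu->thread_done = id;
}

/* Pretend the device made thread 7 ready, run it, and report completion.
 * Returns 1 if the completion notice reached the (fake) device. */
int tsu_demo(void)
{
    fake_tsu.ready_thread = 7;
    uint32_t id = tsu_read_ready(&fake_tsu);
    tsu_notify_done(&fake_tsu, id);
    return fake_tsu.thread_done == 7;
}
```

The volatile qualifier keeps the compiler from caching or removing the accesses, which is the usual requirement when loads and stores have side effects in a device.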


Data-Driven Multithreading(3/7)
2.3. The Data-Driven Multithreading Chip Multiprocessor

DDM-CMP is a chip multiprocessor able to support the Data-Driven Multithreading model of execution.
Figure 1. The layout of the DDM-CMP chip with 4 cores

Data-Driven Multithreading(4/7)
2.4 DDM-CMP Runtime System and Support

A primary target of the DDM-CMP architecture is to be able to
execute not only DDM applications, but also conventional, non-DDM binaries.
To meet this goal, a Runtime Support System (RSS) that does not
require modifications of the Operating System or the CPU cores
has been designed [7].
As such, the RSS has to satisfy two important requirements.




First, when an application is executed in parallel in a shared-memory
multiprocessor, the execution CPUs need to have access to the same
virtual address space. This is also true for the DDM-CMP architecture.
Secondly, as the TSU space is limited, a mechanism that dynamically
loads and unloads its internal structures with the proper data is
required.
To meet these requirements, we designed a simple, lightweight
user level process, the DDM-Kernel [7].

Data-Driven Multithreading(5/7)
2.4.1 The DDM-Kernel

A DDM application starts its execution by launching n DDM-Kernels.

Each Kernel is executed by a different process on a different CPU.
The application completes its execution when all its Kernels have
done so. This approach guarantees a common virtual address space for
all CPUs, the first requirement the RSS must meet.
Figure 2 depicts the pseudo-code of the DDM-Kernel. Its first
operation is to transfer execution to the address of the first
instruction of the Inlet Thread of the first Code-Block it will
execute (the first thread of each Code-Block is called the "Inlet Thread").
goto (firstInstructionINLET_THREAD)
THREAD_SELECT:
    address = readReadyThreadFromTSU();
    goto address;
Figure 2. The pseudo-code of the DDM-Kernel
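
Standard C cannot jump to an address held in a variable, so a portable approximation of the loop in Figure 2 dispatches through a table of function pointers instead. The sketch below is only an illustration under that substitution: the mock TSU (a fixed queue of ready thread IDs), the thread bodies, and every name in it are hypothetical. The mock queue hands out the Inlet Thread first, which plays the role of the initial goto in Figure 2.

```c
#include <assert.h>

#define EXIT_KERNEL (-1)

/* Mock TSU: a fixed queue of ready thread indices stands in for the hardware. */
static const int ready_queue[] = { 0, 1, 2 };
static int next_ready = 0;
static int n_run = 0;               /* number of threads executed so far */

static int read_ready_thread_from_tsu(void)
{
    return ready_queue[next_ready++];
}

/* Each thread body returns 0 to keep going or EXIT_KERNEL to stop the kernel. */
static int inlet_thread(void)  { n_run++; return 0; }
static int work_thread(void)   { n_run++; return 0; }
static int outlet_thread(void) { n_run++; return EXIT_KERNEL; }

typedef int (*thread_fn)(void);
static const thread_fn thread_table[] = {
    inlet_thread, work_thread, outlet_thread
};

/* Counterpart of the THREAD_SELECT loop: ask the TSU for a ready thread,
 * transfer control to it, and repeat until a thread requests the exit. */
int ddm_kernel(void)
{
    for (;;) {
        int id = read_ready_thread_from_tsu();
        if (thread_table[id]() == EXIT_KERNEL)
            return n_run;
    }
}
```

With this mock queue, ddm_kernel runs the three threads in order and returns 3. A function-pointer table costs one indirect call per thread; the generated code shown later in Figure 7 uses labels and a switch statement instead.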
Data-Driven Multithreading(6/7)
The primary responsibility of Inlet Threads is to load the TSU
with all threads of their Code-Block (Figure 3-(a)). The last
thread of a Code-Block, on the other hand, is the block's Outlet
Thread (Figure 3-(b)). Its primary operation is to clear the
resources allocated on the TSU for that block.
The THREAD_SELECT loop, combined with the operation of the
TSU, guarantees that execution will be transferred to subsequent
threads.

Specifically, the last operation of every thread is to notify its TSU
that it has completed its execution and to jump to a special
loop in the DDM-Kernel named the THREAD_SELECT loop (Figure
3-(a)-(b)-(d)).
Thread completion is acknowledged by sending a
special flag to the corresponding TSU. This triggers the TSU
operation that identifies the next ready thread.

However, the Outlet Thread of the last block is set statically to
force the DDM-Kernel to exit (Figure 3-(c)).
Data-Driven Multithreading(7/7)
Figure 3. The pseudo-code of DDM threads. The first thread of a block is named the Inlet Thread and the last the Outlet Thread.
Data-Driven Multithreading
C Directives(1/6)
3.1. DDM Code-Block
#pragma ddm block B_ID
[import (T1 A1, Tn An)]
[export (B1:T_ID1, Bn:T_IDn)]
#pragma ddm endblock

Defines the start of the DDM Code-Block B_ID,
where B_ID may have any value from 1 to MAX_BLOCKS.
Optionally, the programmer may define variables to be imported
to the Code-Block or exported from it.
Imported variables are given by the import list Ti Ai, where Ti is
the type of the variable (e.g. int or float) and Ai is the
name of the variable.
Exported variables are given by the export list Bi:T_IDi, where Bi
is the name of the variable and T_IDi is the thread that
produced it.

Data-Driven Multithreading
C Directives(2/6)
3.2. DDM Thread
#pragma ddm thread T_ID kernel K_ID
[import (B1:T_ID1, Bn:T_IDn)]
[export (T1 A1, Tn An)]
#pragma ddm endthread
Defines the DDM Thread T_ID. This value must be between 1
and MAX_THREADS.
This Thread will be executed by DDM-Kernel K_ID.
K_ID must be between 1 and KERNELS, where KERNELS is
a command-line option of the pre-processor that gives the number
of DDM-Kernels.
Data-Driven Multithreading
C Directives(3/6)
Optionally, the programmer may define a list of variables that
are imported from other threads.

These are represented by the import list, where each element is of
the form Bi:T_IDi, with Bi the name of the variable and T_IDi
the identifier of the thread that produced Bi.

Note that if a thread needs to consume a variable that has been
imported into the block through the import list of the block directive, the
programmer must also list that variable in the thread import list,
with a T_ID of zero for that variable.

The programmer may also specify a list of variables to be
exported from the thread, i.e. the variables produced by this
thread and consumed by some other thread.

In this list each element has the form Ti Ai, where Ti
is the type of the variable (e.g. int or float) and Ai is the
name of the variable.
Data-Driven Multithreading
C Directives(4/6)
3.3. DDM Loop
#pragma ddm for thread ( t1, ... tn ) kernel ( k1, ... kn ) index <var> <num1> <num2>
#pragma ddm endfor

Defines the loop body code, which will be executed by threads
t1 to tn on kernels k1 to kn.

The loop index variable var ranges between the values num1 and
num2 and is incremented by one on each iteration. All iterations of
such a loop are assumed to be independent.
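
Because the iterations are independent, they can be divided among the kernels in any way. The text does not say how DDMCPP distributes them, so the sketch below simply assumes contiguous block partitioning and inclusive bounds. The functions ddm_loop_range and ddm_loop_demo are hypothetical helpers written for this example.

```c
#include <assert.h>

/* Compute the inclusive sub-range [*lo, *hi] of num1..num2 assigned to
 * kernel k (1-based, as in the directives) when n kernels share the loop.
 * Block partitioning and inclusive bounds are assumptions of this sketch.
 * Requires n > 0 and num2 >= num1. */
void ddm_loop_range(int num1, int num2, int k, int n, int *lo, int *hi)
{
    int total = num2 - num1 + 1;
    int base  = total / n;          /* iterations every kernel gets */
    int extra = total % n;          /* the first `extra` kernels get one more */
    int idx   = k - 1;
    *lo = num1 + idx * base + (idx < extra ? idx : extra);
    *hi = *lo + base - 1 + (idx < extra ? 1 : 0);
}

/* Use "sum += i" as the loop body and visit each kernel's slice in turn.
 * The total must match a plain serial loop over num1..num2. */
long ddm_loop_demo(int num1, int num2, int n)
{
    long sum = 0;
    for (int k = 1; k <= n; k++) {
        int lo, hi;
        ddm_loop_range(num1, num2, k, n, &lo, &hi);
        for (int i = lo; i <= hi; i++)
            sum += i;
    }
    return sum;
}
```

For the range 0 to 9 on three kernels, the slices are 0-3, 4-6 and 7-9, and the combined sum is 45, the same as the serial loop.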
Data-Driven Multithreading
C Directives(5/6)
3.4. DDM Function
#pragma ddm func <name>


Defines a function that includes one or more DDM Code-Blocks
in its body.
3.5. User Defined Shared Variables
#pragma ddm var <type> <name> <size>

Defines a variable as shared and allocates memory for it in the
shared memory address space.
name is the name of the variable,
type is the type of the variable, and
size is the number of elements to be allocated (useful for array
structures).
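
How DDMCPP expands this directive is not shown here. One common way to obtain memory that stays shared across processes created with fork() on a POSIX system is an anonymous shared mapping, and the sketch below uses that technique as a stand-in. The functions shared_int_array and shared_demo are hypothetical names for this example.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate `count` ints that remain shared with children forked afterwards.
 * Returns NULL on failure. */
int *shared_int_array(size_t count)
{
    void *p = mmap(NULL, count * sizeof(int), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (int *)p;
}

/* Fork n child "kernels" (n >= 1); kernel k stores k in slot k-1.
 * Returns the sum the parent reads back, or -1 on error. */
int shared_demo(int n)
{
    int *a = shared_int_array((size_t)n);
    if (a == NULL)
        return -1;
    for (int k = 1; k <= n; k++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {             /* child: write its slot and leave */
            a[k - 1] = k;
            _exit(0);
        }
    }
    while (wait(NULL) > 0)          /* parent: wait for every child */
        ;
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    munmap(a, (size_t)n * sizeof(int));
    return sum;
}
```

With two kernels the parent reads back 1 + 2 = 3, which shows that writes made in the child processes are visible in the parent. An ordinary malloc'ed array would not behave this way, since each forked process gets its own private copy.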

Data-Driven Multithreading
C Directives(6/6)
3.6. System Configuration/Variables
#pragma ddm kernelid <var>

Assigns the kernel ID to variable var.
#pragma ddm kernel <number>


Defines the number of DDM Kernels in the program.
3.7. Debugging Primitives
#pragma ddmdebug print tsu <number>
#pragma ddmdebug print tsu all
#pragma ddmdebug print all
#pragma ddmdebug flush all
#pragma ddmdebug stats tsu <number>

These pragmas are related to debug functionality offered by the
DDM-CMP simulator, such as printing the contents of the TSU.
Data-Driven Multithreading
C Pre-Processor(1/2)
The Data-Driven Multithreading C Pre-Processor (DDMCPP)
is a tool that takes as input a regular C program with the
directives described in the previous section and outputs a
C program that includes all the library calls necessary for the
program to execute on the DDM-CMP architecture.

In addition, it embeds the Runtime Support System code. The
tool is logically divided into two modules, the front-end and the
back-end, which are described next.
4.1. DDMCPP Front-End

The DDMCPP front-end is a parser built with the flex and
bison tools.
The parser recognizes the directives presented in the previous
section.
The task of the front-end is to parse the DDM directives and then pass
the information to the back-end, which produces the code
for the target architecture.

Data-Driven Multithreading
C Pre-Processor(2/2)
4.2. DDMCPP Back-end
The back-end is built as the actions of the bison grammar for
the DDM directives.
The task of the back-end is to generate the code required for
the DDM-CMP Runtime Support System, such as the DDM-Kernel code,
the THREAD_SELECT loop, and the load operations to the
TSU, among others.


4.3. DDMCPP Usage
ddmcpp -K n [-debug] [-o outfile] infile
The pre-processor produces code for n Kernels from the source
code in infile.
Optionally, the produced code may include debugging
information (-debug), and the output may be written to outfile (-o).

Example of Code Transformation(1/7)
01 main(){
02 int x=4; double y,z,k; y=8.1;
03 x++;
04 y++;
05 z=x+y;
06 x=z*x;
07 y=z*y;
08 k=x*y;
09 printf( "k=%g\n", k );
10 }
Figure 4. The original code of the example program.

01 main(){
02 int x=4; double y,z,k; y=8.1;
03
04 #pragma ddm kernel 2
05
06 #pragma ddm block 1 \
07 import(int x,double y ) export( x:1, y:2 )
08
09 #pragma ddm thread 1 kernel 1 \
10 import ( x:0 ) export ( int x )
11 x++;
12 #pragma ddm endthread
13
14 #pragma ddm thread 2 kernel 2 \
15 import(y:0) export(double y)
16 y++;
17 #pragma ddm endthread
18
19 #pragma ddm endblock
20
21 z=x+y;
22
23 #pragma ddm block 2 \
24 import(int x,double y,double z) export(x:3, y:4)
25
26 #pragma ddm thread 3 kernel 1 \
27 import (x:0, z:0) export(int x)
28 x=z*x;
29 #pragma ddm endthread
30
31 #pragma ddm thread 4 kernel 2 \
32 import (y:0, z:0) export(double y)
33 y=z*y;
34 #pragma ddm endthread
35
36 #pragma ddm endblock
37
38 k=x*y;
39 printf( "k=%g\n", k );
40 }
Figure 5. The code with the necessary pre-processor directives of the example program.
Example of Code Transformation(2/7)
In this section we describe the process of transforming the C
example application depicted in Figure 4 into a DDM binary.

It is easy to identify that two sets of independent instructions
exist: x++; y++; (lines 03/04) and x=z*x; y=z*y; (lines 06/07).
Figure 5 shows the same code augmented with the
pre-processor directives necessary to express the existing
parallelism.
Note that DDM threads normally contain more than one
instruction; the single-instruction threads here are for explanation
purposes only.
In line 04 the user defines that the program will be executed by
2 DDM-Kernels.
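
The independence claim can be checked with a small sketch: the statements of Figure 4 are run once in their original order and once with the two statements of each pair swapped, which is one of the orders that could arise when the threads of a block run on different kernels. The function names k_serial and k_reordered are introduced only for this illustration.

```c
#include <assert.h>

/* The statements of Figure 4 in their original order. */
double k_serial(void)
{
    int x = 4; double y, z, k; y = 8.1;
    x++;
    y++;
    z = x + y;
    x = z * x;      /* the double product is truncated when stored in int x */
    y = z * y;
    k = x * y;
    return k;
}

/* The same statements with each independent pair swapped. */
double k_reordered(void)
{
    int x = 4; double y, z, k; y = 8.1;
    y++;            /* thread 2 */
    x++;            /* thread 1 */
    z = x + y;      /* inter-block code, needs both results */
    y = z * y;      /* thread 4 */
    x = z * x;      /* thread 3 */
    k = x * y;
    return k;
}
```

Both functions return the same value, roughly 8981.7. The statement z=x+y cannot be moved into either pair because it consumes the results of both x++ and y++, which is why it sits between the two blocks in Figure 5.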


Example of Code Transformation(3/7)
The first block starts in line 06 with the pragma ddm block
directive.
This directive also defines the variables that will be used in its
body, x and y, with the import part.
This block ends at line 19 with the pragma ddm endblock directive.

The first DDM thread of the program starts in line 09
with the pragma ddm thread directive.
The import part defines that variable x is used in the thread and is
imported from the containing block (the zero after variable x).
The export part defines that variable x must be exported by the
DDM-Block, as it is used by threads in the subsequent DDM-Blocks.
Thread 1 ends with the pragma ddm endthread directive at line 12.

Figure 6 depicts the Data-Flow Graph of the original program
and its Data-Flow Graph after the preprocessing phase.

Note that the DDM-Preprocessor has automatically added the
required Inlet and Outlet threads.
Example of Code Transformation(4/7)
Figure 6. The original program Data-Flow Graph (left) and the Data-Flow Graph after preprocessing (right).
Example of Code Transformation(5/7)
01 //Variables for passing thread results
02 struct sharedResults *sharedResultsArray;
03
04 main(){
05 int x=4; double y,z,k; y=8.1;
06
07 initializeTSUs();
08
09 //Create the ddm kernels
10 for(_index=0;_index<2;_index++) {
11 pid = fork();
12 if( pid == 0 ){
13 tsuID = _index; kernelNumber = _index + 1;
14 goto DDM_BLOCK1;
15 }
16 }
17
18 wait();
19 return;
20
21 INLET_THREAD_BLOCK1_KERNEL1:
22 loadThread(tsuID, 1, consumersArray);
23 loadThread(tsuID, 20101,consumersArray);
24 //Import variables
25 sharedResultsArray->result[0].t_int = x;
26 sharedResultsArray->result[1].t_double = y;
27
28 threadCompletedExecution(tsuID);
29 goto THREAD_SELECT;
30
31 THREAD1_BLOCK1:
32 x = sharedResultsArray->result[0].t_int;
33 x++;
34 sharedResultsArray->result[4].t_int = x;
35
36 threadCompletedExecution(tsuID);
37 goto THREAD_SELECT;
38
39 OUTLET_THREAD_BLOCK1_KERNEL1:
40 //Export variables
41 x = sharedResultsArray->result[4].t_int;
42 y = sharedResultsArray->result[8].t_double;
43
44 threadCompletedExecution(tsuID);
45 clearTsuState(tsuID);
46 goto END_BLOCK1;
47
48 //DDM BLOCK01
49 DDM_BLOCK1:
50
51 //Select the first thread according to the kernel
52 switch ( kernelNumber ) {
53 case 1:
54 loadThread(tsuID, 10101, consumerList);
55 goto THREAD_SELECT;
56
57 case 2:
58 loadThread(tsuID, 10102, consumerList);
59 goto THREAD_SELECT;
60 }
61
62 END_BLOCK1:
63 z=x+y;
64
65 goto DDM_BLOCK2;
66
67 // Goto the next ready thread.
68 THREAD_SELECT:
69 threadUnderExecution = readFromTSU( tsuID );
70
71 switch ( threadUnderExecution ) {
72 // If no thread is ready wait
73 case 0: goto THREAD_SELECT;
74
75 case 10201: goto INLET_THREAD_BLOCK2_KERNEL1;
76
77 ...
78 }
79
80 DDM_BLOCK2:
Figure 7. The produced C code.

Example of Code Transformation(6/7)
Figure 7 depicts a snapshot of the produced C code that, after
being compiled with an ordinary compiler (e.g. gcc), can be
executed by the DDM-CMP architecture (note that only part
of the program is shown, for clarity).

Lines 09-16 create the DDM-Kernels.

Upon creation, the DDM-Kernels are redirected to DDM_BLOCK1
(line 49).

There, each Kernel loads its TSU with the Inlet thread of its first block
and then redirects to the THREAD_SELECT loop.
In addition to loading the TSU with the block's threads (lines 22 and 23
for the Inlet thread of the first DDM-Kernel for block 1), the Inlet
threads copy the imported variables to a special structure, the
sharedResultsArray.
In line 28, the Inlet thread notifies the TSU about its completion and
then redirects execution to the THREAD_SELECT loop (line 29).
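
Figure 7 uses sharedResultsArray without showing its declaration. Judging only from the accesses result[i].t_int and result[i].t_double, one plausible layout is an array of unions, sketched below. The union name result_slot, the capacity MAX_RESULTS, and the three helper functions are assumptions made for this example; only the slot indices 0 and 4 are taken from the figure.

```c
#include <assert.h>

#define MAX_RESULTS 16   /* hypothetical capacity */

/* One slot holds either an int or a double result. */
union result_slot {
    int    t_int;
    double t_double;
};

struct sharedResults {
    union result_slot result[MAX_RESULTS];
};

/* In the generated program this storage would have to be shared by all kernels;
 * a static object is enough for a single-process illustration. */
static struct sharedResults storage;
static struct sharedResults *sharedResultsArray = &storage;

/* Inlet thread: copy the imported variable into its slot. */
void inlet_import(int x)
{
    sharedResultsArray->result[0].t_int = x;
}

/* Execution thread: import, compute, export (mirrors THREAD1_BLOCK1). */
void thread1_block1(void)
{
    int x = sharedResultsArray->result[0].t_int;
    x++;
    sharedResultsArray->result[4].t_int = x;
}

/* Outlet thread: copy the exported value back into a local variable. */
int outlet_export(void)
{
    return sharedResultsArray->result[4].t_int;
}

/* Run the three steps in the order the TSU would schedule them. */
int shared_results_demo(int x)
{
    inlet_import(x);
    thread1_block1();
    return outlet_export();
}
```

Starting from x = 4, as in the example program, shared_results_demo returns 5. A union keeps every slot the same size regardless of the variable type, at the cost of the reader having to know which member was last written.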

Example of Code Transformation(7/7)
The first operation of execution threads, for example
THREAD1_BLOCK1 (lines 31-37), is to import the necessary variables (line
32).
Exported variables are copied back to the sharedResultsArray
structure.
Like the Inlet threads, execution threads notify the TSU after
completing their execution (line 36) and
redirect execution to the THREAD_SELECT loop.

The Outlet threads first export the shared variables (lines 41-42),
then notify the TSU about their completion and clear the
allocated resources on the TSU (line 45).

One of the Outlet threads, here OUTLET_THREAD_BLOCK1_KERNEL1,
executes the inter-block code; as such, this thread is
redirected to END_BLOCK1 and not to the THREAD_SELECT loop.

Conclusions
In this paper we presented DDMCPP, a C Pre-Processor for
the Data-Driven Multithreading model of execution. This tool
offers two major benefits.
First, it makes it easier to port applications to the DDM-CMP
architecture, as the original applications only need to be
augmented with simple compiler directives.
Second, it is a valuable tool that can be put to use quickly in the
development of the new architecture, and it allows for the
discovery of patterns in applications that will later be
recognized by a fully automatic compiler.

Although such a compiler is the ultimate goal for every
developer of a new architecture, the solution presented here is
feasible, and therefore allows for better development and
validation of the architecture than the more common
hand-coded application development.