OSCAR Multi-Grain Architecture and Its Evaluation

H. Kasahara, W. Ogata, K. Kimura, G. Matsui, H. Matsuzaki
Dept. of EECE, Waseda University, Shinjuku, Tokyo 169, Japan

M. Okamoto
Toshiba Corporation, Fuchu, Tokyo 183, Japan

A. Yoshida
Dept. of IS, Toho University, Funabashi, Chiba 274, Japan

H. Honda
Dept. of IS, Univ. of Electro-Communications, Chofu, Tokyo 182, Japan

Abstract
OSCAR (Optimally Scheduled Advanced Multiprocessor) was designed to efficiently realize multi-grain parallel processing using static and dynamic scheduling. It is a shared memory multiprocessor system having centralized and distributed shared memories in addition to local memory on each processor, with a data transfer controller for overlapping of data transfer and task processing. Its Fortran multi-grain compiler hierarchically exploits coarse grain parallelism among loops, subroutines and basic blocks, conventional medium grain parallelism among loop iterations in a Doall loop, and near fine grain parallelism among statements. In the coarse grain parallel processing, data localization (automatic data distribution) has been employed to minimize data transfer overhead. In the near fine grain processing of a basic block, explicit synchronization can be removed by use of a clock-level accurate code scheduling technique with architectural supports. This paper describes OSCAR's architecture, its compiler and the performance of the multi-grain parallel processing. OSCAR's architecture and compilation technology will be more important in future High Performance Computers and single chip multiprocessors.
1 Introduction

Currently, many multiprocessor systems exploit loop parallelism using parallelizing compilers [1]-[5]. Those compilers parallelize many types of Do loop using strong data dependency analysis techniques and program restructuring techniques. There still exist, however, sequential loops that can not be parallelized efficiently because of loop carried dependencies and conditional branches to the outside of loops. Also, most existing compilers for multiprocessor systems can not effectively exploit fine grain parallelism inside a basic block and coarse grain parallelism among loops, subroutines and basic blocks. Therefore, to improve the effective performance of multiprocessor systems, it is important to exploit the fine grain parallelism and the coarse grain parallelism in addition to the medium grain parallelism among loop iterations exploited by the loop parallelization.

A multiprocessor system OSCAR [24] was developed to efficiently realize compiler parallelization techniques to handle the above problems, such as:

1. dynamic scheduling for coarse grain parallel processing, or macro-dataflow [6], using a scheduling routine generated by OSCAR compiler [18, 19, 20, 21],

2. static scheduling for (near) fine grain parallel processing inside a basic block [17, 18, 25], medium grain parallel processing among loop iterations, and macro-dataflow processing,

3. efficient use of local memory by decomposition of tasks and data, or data localization [26, 27, 28],

4. overlapping of task processing and data transfer using a data transfer controller [11].

On OSCAR, a multi-grain compiler [18, 19, 23] realizing techniques 1 to 3 above was implemented and evaluated. The last technique, though originally developed for OSCAR, has been evaluated on the Fujitsu VPP500, which has a stronger data transfer controller.

This paper describes OSCAR's architecture, its multi-grain parallelizing compiler and the evaluation of the performance.
2 OSCAR's Architecture
This section describes the architecture of OSCAR (Optimally Scheduled Advanced Multiprocessor), which was designed to support multi-grain parallel processing. OSCAR itself was designed ten years ago. However, its architecture gives us many suggestions for future high performance multiprocessor systems requiring a new parallelizing technology like macro-dataflow and for single chip multiprocessors searching for post-instruction-level parallelism.

Figure 1 shows the architecture of OSCAR. OSCAR is a shared memory multiprocessor system with both centralized and distributed shared memories, in which sixteen processor elements (PEs) having distributed shared memory (DSM) and local program and data memories are uniformly connected to three modules of centralized shared memory (CSM) by three buses.
Figure 1: OSCAR's architecture.

Figure 2: OSCAR's processor element (DMA: DMA controller; LPM: local program memory, 128KW x 2 banks; INSC: instruction control unit; DSM: distributed shared memory, 2KW; LSM: local stack memory, 4KW; LDM: local data memory, 256KW; IPU: integer processing unit; FPU: floating point processing unit; REG: register file, 64 registers; DP: data path).
Each PE shown in Figure 2 has a custom-made 32-bit RISC processor with a throughput of 5 MFLOPS. It consists of the processor having sixty-four registers, an integer processing unit and a floating point processing unit, a data memory, two banks of program memories for instruction preloading, a dual port memory used as a distributed shared memory (DSM), a stack memory (SM) and a DMA controller used for data pre-loading and post-storing to CSMs. The PE executes every instruction, including a floating point addition and a multiplication, in one clock cycle. The distributed shared memory on each PE can be accessed simultaneously by the PE itself and another PE. Also, OSCAR provides the following three types of data transfer modes by using the DSMs and the CSMs:

1. one PE to one PE direct data transfers using DSMs,

2. one PE to all PEs data broadcasting using the DSMs,

3. one PE to several PEs indirect data transfers through CSMs.

Each module of the centralized shared memory (CSM) is a simultaneously readable memory of which the same address or different addresses can be read by three PEs in the same clock cycle.

OSCAR's memory space consists of a local memory space on each PE and a system memory space, as shown in Figure 3. The local memory space on each PE consists of the DSM space, two banks of program memory (PM) for program preloading, the data memory (DM) and a control area. The system memory space consists of an area for data broadcast onto all DSMs and areas for all PEs and the CSMs. Therefore, memories on each PE can be accessed by a local memory address through the local bus by the PE itself and by a system memory address through the interconnection buses by every PE.

Figure 3: OSCAR's memory space.
2.1 Architectural Supports for Dynamic Scheduling

In OSCAR's multi-grain compiler, dynamic scheduling is adopted to handle run-time uncertainty caused by conditional branches among coarse grain tasks, or macrotasks, mainly for macro-dataflow processing, since the dynamic scheduling overhead can be kept relatively low as the processing times of macrotasks are relatively large [18, 19]. When macrotasks are assigned to processors or processor clusters at run time, optimal allocation of the shared data among macrotasks to DSMs is very difficult. To simplify this problem, OSCAR provides the CSM to hold shared data used by the macrotasks to be dynamically scheduled.

Also, OSCAR can simulate a multiple processor cluster (PC) system with the global shared memory. The number of PCs and the number of PEs inside a PC can be changed even at run time according to the parallelism of the target program, or the macrotask graph mentioned later, because the partitioning of PEs into PCs is made by the compiler. Furthermore, each bus has hardware for fast barrier synchronization. By using this hardware, each PC can take barrier synchronization in a few clocks.
2.2 Architectural Supports for Static Scheduling

On OSCAR, static scheduling at compile time is used for near fine grain parallel processing, loop parallel processing and macro-dataflow processing as much as possible to minimize run-time overheads.

For the near fine grain parallel processing [17], OSCAR provides the three data transfer modes mentioned above. The one-to-one direct data transfer or the data broadcast needs only 4 clock cycles to write one word of data from a register of a sender PE to the DSM on a receiver PE or to the DSMs on all PEs. On the other hand, the indirect data transfer requires 8 clock cycles to write one word of data from a register of a sender PE onto a CSM and read the data from the CSM into a register of a receiver PE. Therefore, the optimal selection among the above three modes using static scheduling allows us to reduce data transfer overhead markedly. Also, synchronization using DSMs reduces synchronization overhead, because assigning synchronization flags onto the DSMs prevents the degradation of bus bandwidth that is caused by busy waits checking synchronization flags on CSMs.

Furthermore, the fixed-clock execution of every instruction by OSCAR's RISC processor and a single reference clock for PEs and buses allow the compiler to generate the most efficient parallel machine code, precisely scheduled at the clock level. In the optimized parallel machine code, data transfer timing, including bus accesses and remote memory accesses, is determined by the compiler, and finally all synchronization codes inside a basic block can be removed [25].
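To make the mode selection concrete, the following minimal Python sketch compares the clock costs quoted above. It is illustrative only, not OSCAR compiler code; the even split of the 8-cycle indirect transfer into a CSM write and a CSM read, and the function and mode names, are assumptions for this sketch.

# Minimal sketch (not OSCAR compiler code): picking the cheapest of the
# three data transfer modes from the clock costs quoted in the text.
# Assumption: the 8-cycle indirect transfer is a 4-cycle CSM write plus
# a 4-cycle CSM read per receiver.

DSM_WRITE = 4   # sender register -> DSM of one receiver (or all, broadcast)
CSM_WRITE = 4   # sender register -> CSM
CSM_READ  = 4   # CSM -> register of one receiver PE

def transfer_cost(mode, num_receivers):
    """Clock cycles to deliver one word to num_receivers PEs."""
    if mode == "dsm_direct":      # repeated one-PE-to-one-PE writes
        return DSM_WRITE * num_receivers
    if mode == "dsm_broadcast":   # one write reaches every DSM
        return DSM_WRITE
    if mode == "csm_indirect":    # one CSM write, one CSM read per receiver
        return CSM_WRITE + CSM_READ * num_receivers
    raise ValueError(mode)

def best_mode(num_receivers):
    # A real scheduler would also weigh DSM capacity and bus contention;
    # here only the raw clock counts are compared.
    modes = ("dsm_direct", "dsm_broadcast", "csm_indirect")
    return min(modes, key=lambda m: transfer_cost(m, num_receivers))

for n in (1, 2, 8):
    print(n, best_mode(n), transfer_cost(best_mode(n), n))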
3 Multi-grain Compilation Scheme

This section briefly describes the OSCAR Fortran multi-grain compilation scheme [17, 18, 23], which mainly consists of the macro-dataflow processing, the loop parallelization and the near fine grain parallel processing.

3.1 Compilation for Macro-dataflow

The macro-dataflow compilation scheme [18, 20, 21] is mainly composed of the following four steps: 1) generation of macrotasks, 2) control-flow and data-flow analysis among macrotasks, 3) earliest executable condition analysis [18, 20, 21] of macrotasks to detect parallelism among macrotasks considering control and data dependencies, and 4) code generation for PCs and for dynamic schedulers.

3.1.1 Generation of macrotasks

A Fortran program is decomposed into macrotasks. The macrotasks are generated so that they have relatively large processing times compared with the dynamic scheduling overhead and the data transfer overhead. OSCAR compiler generates three types of macrotasks, namely, Block of Pseudo Assignment statements (BPA), Repetition Block (RB) and Subroutine Block (SB). A BPA is usually defined as an ordinary basic block (BB). However, it is sometimes defined as a block generated by decomposing a BB into independent parts to extract larger parallelism, or by fusing BBs into a coarser macrotask to reduce dynamic scheduling overhead.

An RB is a Do loop or a loop generated by a backward branch, namely, an outermost natural loop. RBs can be defined for reducible flow graphs and, with code copying, for irreducible flow graphs.

An RB can be hierarchically decomposed into sub-macrotasks. For the sub-macrotasks, the macro-dataflow processing scheme is hierarchically applied by using sub-processor clusters defined inside a processor cluster. In the decomposition of an RB into sub-macrotasks, overlapped loops are structured into nested loops by code copying to exploit parallelism.

In the above definition of RB, a Doall loop is treated as a macrotask assigned to a processor cluster. In other words, a Doall loop is not processed by all processors even though it may have enough parallelism to use all processors or all processor clusters. Therefore, in the proposed compilation scheme, a Doall loop is decomposed into "k" smaller Doall loops. The decomposed Doall loops are assigned to processor clusters so that the original Doall loop is processed by using all processor clusters or all processors. Here, "k" is usually determined as the number of processor clusters in the multiprocessor system or a multiple of the number of processor clusters.
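The following minimal sketch illustrates this Doall decomposition step; it is not compiler output, and the even block partitioning is an assumption for illustration (k is taken to equal the number of processor clusters).

# Minimal sketch (illustrative): decomposing a Doall loop with iterations
# 1..n into k smaller Doall loops, one per processor cluster.

def decompose_doall(n, k):
    """Return k (lower, upper) iteration ranges covering 1..n."""
    chunk, rest = divmod(n, k)
    ranges, low = [], 1
    for i in range(k):
        size = chunk + (1 if i < rest else 0)
        ranges.append((low, low + size - 1))
        low += size
    return ranges

# e.g. a 100-iteration Doall split for 3 processor clusters:
print(decompose_doall(100, 3))   # [(1, 34), (35, 67), (68, 100)]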
Furthermore, in the generation of RBs using this loop decomposition, a loop aligned decomposition method is applied for data localization among data dependent loops; it minimizes data transfer among processor clusters by using the local memory on each processor when there exist array data dependencies among the loops prior to the decomposition [26, 27, 28].

As to subroutines, in-line expansion is applied as much as possible, taking code length into account. Subroutines for which in-line expansion can not be applied efficiently are defined as SBs. SBs can also be hierarchically decomposed into sub-macrotasks, as RBs are.

3.1.2 Generation of macroflow graph (MFG)

A macroflow graph represents both control flow and data dependencies among macrotasks. Figure 4 shows an example of a macroflow graph. In this macroflow graph, nodes represent macrotasks. Dotted edges represent control flow. Solid edges represent data dependencies among macrotasks. Small circles inside nodes represent conditional branch statements inside macrotasks. In this graph, the directions of the edges are assumed to be downward though arrows are omitted. The MFG is a directed acyclic graph because all back-edges are contained in RBs.

Figure 4: Macroflow graph (MFG).
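As a concrete data structure, a macroflow graph can be held as nodes with separate control-flow and data-dependence successor lists. The sketch below is an assumed representation (not OSCAR's internal one); the node names follow the MT6 example discussed in the next subsection, but the exact structure of Figure 4 is not reproduced.

# Minimal sketch (assumed representation) of a macroflow graph: dotted
# edges in Figure 4 become "control" edges, solid edges become "data"
# edges, and macrotasks containing a conditional branch are flagged.

from dataclasses import dataclass, field

@dataclass
class Macrotask:
    name: str                       # e.g. "MT1"
    kind: str                       # "BPA", "RB" or "SB"
    has_branch: bool = False        # small circle in the MFG
    control_succ: list = field(default_factory=list)  # control-flow edges
    data_succ: list = field(default_factory=list)     # data-dependence edges

# A fragment consistent with the MT6 example in the text: MT1 branches to
# MT2 or MT3, MT2 branches to MT4, and MT3 produces data used by MT6.
mt = {n: Macrotask(n, "BPA") for n in ("MT1", "MT2", "MT3", "MT4", "MT6")}
mt["MT1"].has_branch = True
mt["MT1"].control_succ += ["MT2", "MT3"]
mt["MT2"].has_branch = True
mt["MT2"].control_succ += ["MT4"]
mt["MT3"].data_succ += ["MT6"]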
3.1.3 Macrotask parallelism extraction

The MFG represents control flow and data dependencies among macrotasks, but it does not show any parallelism among macrotasks. The program dependence graph [16] represents the maximum parallelism among macrotasks with control dependencies and data dependencies. In practice, however, the macrotask scheduler needs to know when a macrotask can start execution. In this earliest executable macro-dataflow computation scheme, earliest executable conditions of macrotasks [18]-[21] are used to show the maximum parallelism among macrotasks considering control dependencies and data dependencies. The earliest executable condition of macrotask i, MTi, is a condition on which MTi may begin its execution earliest. For example, the earliest executable condition of MT6 in Figure 4, which is control-dependent on MT1 and on MT2 and is data-dependent on MT3, is:

(MT3 completes execution) OR (MT2 branches to MT4)

Here, "MT3 completes execution" means that the data dependence of MT6 on MT3 is satisfied, because the following conditions for macro-dataflow execution are assumed in this paper:

1. If macrotask i (MTi) is data-dependent on macrotask j (MTj), MTi can not start execution before MTj completes execution.

2. A conditional branch statement in a macrotask may be executed as soon as the data dependencies of the branch statement are satisfied. This is because statements in a macrotask are processed in parallel by using the near fine grain parallel processing described later. Therefore, MTi, which is control-dependent on a conditional statement in MTj, can begin execution as soon as the branch direction is determined, even if MTj has not completed.

The above earliest executable condition of MT6 is the simplest form of the condition. The original form of the condition of an MTi which is control-dependent on MTj and data-dependent on MTk (0 <= k <= N) can be represented as follows:

(MTj branches to MTi)
AND
{(MTk completes execution) OR (it is determined that MTk will not be executed)}

For example, the original form of the earliest executable condition of MT6 is:

{(MT1 branches to MT3) OR (MT2 branches to MT4)}
AND
{(MT3 completes execution) OR (MT1 branches to MT2)}

The first partial condition before the AND represents the earliest executable condition determined by the control dependencies. The second partial condition after the AND represents the earliest executable condition to satisfy the data dependence. In this condition, the execution of MT3 implies that MT1 has branched to MT3, and the execution of MT2 implies that MT1 has branched to MT2. Therefore, the condition is redundant and can be simplified into the form described above.

The simplified earliest executable conditions of macrotasks are derived by OSCAR compiler automatically. The simplest condition is important to reduce dynamic scheduling overhead. Girkar and Polychronopoulos [22] proposed a similar algorithm to obtain the earliest executable conditions based on the original research [18]-[21]. They solved a simplified problem to obtain the earliest executable conditions by assuming that a conditional branch inside a macrotask is executed at the end of the macrotask.

The earliest executable conditions of MTs are represented by a directed acyclic graph named a macrotask graph, or MTG, as shown in Figure 5. In the MTG, nodes represent macrotasks. Dotted edges represent extended control dependencies. Solid edges represent data dependencies.

The extended control dependence edges are classified into two types, namely, ordinary control dependence edges and co-control dependence edges. The co-control dependence edges represent the conditions on which a data dependence predecessor of MTi, namely the MTk mentioned above on which MTi is data-dependent, will not be executed [20].

Also, a data dependence edge, or a solid edge, originating from a small circle has two meanings, namely, an extended control dependence edge and a data dependence edge. Arcs connecting edges at their tails or heads have two different meanings: a solid arc represents that the edges connected by the arc are in an AND relationship, and a dotted arc represents that the edges connected by the arc are in an OR relationship. Small circles inside nodes represent conditional branch statements. In the MTG, the directions of the edges are also assumed to be downward though most arrows are omitted. Edges with arrows show that they are the original control flow edges originating from the small circles in the MFG.

Figure 5: Macrotask graph (MTG).
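To show how such a condition behaves at run time, the sketch below evaluates the simplified earliest executable condition of MT6 against a stream of events. It is illustrative only; the event strings are hypothetical names chosen for this sketch.

# Minimal sketch (illustrative): evaluating the simplified earliest
# executable condition of MT6,
#     (MT3 completes execution) OR (MT2 branches to MT4),
# against run-time events.

def mt6_ready(events):
    return "MT3 completed" in events or "MT2 branched to MT4" in events

# On the path where MT1 branches to MT2 and MT2 branches to MT4,
# MT6 becomes ready although MT3 is never executed:
events = set()
for ev in ("MT1 branched to MT2", "MT2 branched to MT4"):
    events.add(ev)
    print(ev, "->", "MT6 ready" if mt6_ready(events) else "MT6 waiting")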
3.1.4 Dynamic scheduling of macrotasks

In the macro-dataflow computation, macrotasks are dynamically scheduled to processor clusters (PCs) at run time to cope with run-time uncertainties, such as conditional branches among macrotasks and variations of macrotask execution times. The use of dynamic scheduling for coarse grain tasks keeps the relative scheduling overhead small. Furthermore, the dynamic scheduling in this scheme is performed not by OS calls, as in popular multiprocessor systems, but by a special scheduling routine generated by the compiler. In other words, the compiler generates an efficient dynamic scheduling code exclusively for each Fortran program based on the earliest executable conditions, or the macrotask graph. The scheduling routine is executed by a processor element. The Dynamic-CP algorithm, a dynamic scheduling algorithm that uses the longest path length from each macrotask to the exit node of the MTG as its priority, is employed, taking into consideration the scheduling overhead and the quality of the generated schedule.
3.1.5 Data localization

The data localization scheme [26, 27, 28] reduces data transfer overhead among macrotasks composed of Doall and sequential loops. Here, data localization means decomposing multiple loops, or array data, and assigning them to processors (PEs) so that shared data among the macrotasks can be transferred through the local memory on the PEs.

This compilation method consists of the following three steps: loop aligned decomposition, which decomposes loop indices and arrays to minimize data transfer among processors based on inter-loop data dependence analysis; generation of a dynamic scheduling routine that assigns a set of decomposed loops, among which large data transfer may occur, onto the same PE using the macrotask fusion and partial static assignment methods; and generation of parallel machine code that transfers data via local memory among the decomposed loops assigned onto the same PE.

Figure 6: Inter-loop data dependence. (a) A target loop group (TLG); (b) inter-loop data dependence, showing the iterations of the preceding loops on which the L-th iteration of RB3 is data dependent.

In this method, for example, when the RBs in Figure 6(a) are executed on two PEs, RB1 in Figure 6(a) is decomposed into the partial loops shown in Figure 7(b); RB2 and RB3 are decomposed in the same manner. In this case, the array data inside each group of decomposed partial loops in Figure 7(b) are passed through local memory. This loop aligned decomposition method can also be applied to multiple loops including a sequential loop, such as the loops in Figure 8(a), which are decomposed into the partial loops shown in Figure 8(b) when three PEs are used.

Figure 7: Loop aligned decomposition for task fusion (LR: localizable region; CAR: commonly accessed region; D: data transfer from CSM to LM; U: data transfer from LM to CSM).

Figure 8: Loop aligned decomposition for the partial static task assignment.
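The sketch below illustrates the idea of loop aligned decomposition on a simple two-loop example of the kind shown in the figures. It is not the compiler's algorithm: the loops, array access pattern and the way the commonly accessed boundary element is reported are assumptions chosen to show how aligned cuts keep most array elements local to one PE.

# Minimal sketch (illustrative): loop aligned decomposition of two
# data-dependent loops for two PEs.  Loop RB1 writes A(i) for i = 1..n;
# loop RB2 reads A(i) and A(i-1).  Both loops are cut at the same index so
# each PE keeps its own A elements in local memory; only the boundary
# element needed by both PEs (a commonly accessed region, CAR) must be
# made reachable by both.

def loop_aligned_decomposition(n, num_pes=2):
    cut = n // num_pes                     # align both loops at the same cut
    rb1_parts = [(1, cut), (cut + 1, n)]   # who produces which A(i)
    rb2_parts = [(1, cut), (cut + 1, n)]   # who consumes A(i), A(i-1)
    # RB2 iteration cut+1 reads A(cut), which the other PE produced.
    car = [cut]
    return rb1_parts, rb2_parts, car

rb1, rb2, car = loop_aligned_decomposition(100)
print("RB1 chunks:", rb1)                    # [(1, 50), (51, 100)]
print("RB2 chunks:", rb2)                    # [(1, 50), (51, 100)]
print("commonly accessed A indices:", car)   # [50]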
3.2 Medium Grain Parallel Processing

Macrotasks are assigned to processor clusters (PCs) dynamically, as mentioned in the previous section. If a macrotask assigned to a PC is a Doall loop, the macrotask is processed at the medium grain, or iteration level grain, by the processors inside the PC. For Doall loops, several dynamic scheduling schemes have been proposed. On OSCAR, however, a simple static scheduling scheme is used, because OSCAR does not have hardware support for dynamic iteration scheduling and static scheduling allows us to realize data localization among loops. If a macrotask assigned to a PC is a loop having data dependencies among iterations, the compiler first tries to apply Doacross with restructuring to minimize the synchronization overhead. Next, the compiler compares the estimated processing time of the Doacross with that of the near fine grain parallel processing of the loop body mentioned later. If the processing time of the Doacross is shorter than that of the near fine grain processing, the compiler generates machine code for the Doacross.
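The decision rule can be phrased as a comparison of two estimated times, as in the sketch below. The cost model here is an assumption made for illustration only; the compiler's actual estimates account for OSCAR's real instruction and synchronization costs.

# Minimal sketch (assumed cost model): choosing between Doacross execution
# and near fine grain processing of the loop body for a loop with
# cross-iteration dependences.

def doacross_time(iters, body_time, delay, num_pes):
    # pipelined iterations: each PE starts 'delay' clocks after its neighbor
    rounds = -(-iters // num_pes)              # ceiling division
    return rounds * body_time + (num_pes - 1) * delay

def near_fine_grain_time(iters, body_critical_path):
    # iterations run one after another, each with its statements parallelized
    return iters * body_critical_path

def choose(iters, body_time, delay, body_critical_path, num_pes):
    t_da = doacross_time(iters, body_time, delay, num_pes)
    t_nf = near_fine_grain_time(iters, body_critical_path)
    return ("doacross", t_da) if t_da <= t_nf else ("near fine grain", t_nf)

print(choose(iters=100, body_time=40, delay=10,
             body_critical_path=25, num_pes=4))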
3.3 Near Fine Grain Parallel Processing

A BPA is decomposed into near fine grain tasks [17], each of which consists of a statement, and is processed in parallel by the processors inside a PC.

3.3.1 Generation of tasks and task graph

To efficiently process a BPA in parallel, the computation in the BPA must be decomposed into tasks in such a way that parallelism is fully exploited and the overhead related to data transfer and synchronization is kept small. In the proposed scheme, statement level granularity is chosen as the finest granularity for OSCAR, taking into account OSCAR's processing capability and data transfer capability.

Figure 9 shows an example of statement level tasks, or near fine grain tasks, generated for a basic block that solves a sparse matrix. Such a large basic block is generated by the symbolic generation technique, which has been used in electronic circuit simulators like SPICE, and by partial evaluation.

The data dependencies, or precedence constraints, among the generated tasks can be represented by the edges of a task graph [12]-[15], as shown in Figure 10, in which each task corresponds to a node. In the graph, the number inside a node circle represents the task number, i, and the number beside it the processing time of the task on a PE, ti. An edge directed from node Ni toward Nj represents the partial ordering constraint that task Ti precedes task Tj. When we also consider data transfer time between tasks, each edge generally has a variable weight. Its weight, tij, will be a data transfer time between tasks Ti and Tj if Ti and Tj are assigned to different PEs. It will be zero, or a time to access registers or local data memories, if the tasks are assigned to the same PE.
Figure 9: Near fine grain tasks (LU decomposition example).

Figure 10: Task graph for the near fine grain tasks.

3.3.2 Static scheduling algorithm
To process a set of (near fine grain) tasks on a multiprocessor system efficiently, an assignment of tasks onto PEs and an execution order among the tasks assigned to the same PE must be determined optimally. The problem of determining the optimal assignment and the optimal execution order can be treated as a traditional minimum execution time multiprocessor scheduling problem [12, 15]. Stated formally, the scheduling problem is to determine a non-preemptive schedule whose execution time, or schedule length, is minimum, given a set of n computational tasks, precedence relations among them, and m processors with the same processing capability. This scheduling problem, however, is known to be a "strong" NP-hard problem [13].

Considering this fact, a variety of heuristic algorithms and a practical optimization algorithm have been proposed [15]. In OSCAR compiler, a heuristic scheduling algorithm, CP/DT/MISF (Critical Path / Data Transfer / Most Immediate Successors First), which considers data transfer [17], has been adopted, taking into account compilation time and the quality of the generated schedules.
3.3.3 Machine code generation

For efficient parallel execution of near fine grain tasks on an actual multiprocessor system, optimal machine code must be generated by using the statically scheduled result. A statically scheduled result gives us the following information:

1. which tasks are executed on each PE,

2. in which order the tasks assigned to the same PE are executed,

3. when and where data transfers and synchronization among PEs are required,

and so on. Therefore, we can generate the machine code for each PE by putting together the instructions for the tasks assigned to the PE and inserting instructions for data transfer and synchronization into the required places. The "version number" method is used for synchronization among tasks. At the end of a BPA, instructions for the barrier synchronization, which is supported by OSCAR's hardware, are inserted into the program code on each PE. The compiler can also optimize the code by making full use of all the information obtained from the static scheduling. For example, when a task should pass shared data to other tasks assigned to the same PE, the data can be passed through registers on the PE.

In addition, the compiler minimizes the synchronization overhead by eliminating redundant synchronization, considering the information about the tasks to be synchronized, the task assignment and the execution order.

In addition to the elimination of redundant synchronization codes, OSCAR compiler has realized the elimination of all synchronization codes inside a basic block and a sequential loop to which the near fine grain processing is applied [25]. In this optimization, the compiler estimates the start and completion times of every task execution and data transfer, or the bus access and memory access timing, exactly at the machine clock level with the architectural support of OSCAR. Next, the compiler, or machine code scheduler, generates parallel machine code that controls memory and bus access timing by inserting NOP (no operation) instructions to delay reading shared data on the distributed shared memory that are to be written by another processor, and to delay bus accesses until data transfers are finished by other processors. Also, the compiler inserts NOP instructions into the program codes of PEs that reach a barrier point before the last PE reaches it, to realize barrier synchronization without the explicit barrier instruction supported by hardware.
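The essence of this synchronization-free code generation is NOP padding driven by exact clock estimates, as in the sketch below. It is illustrative only: the instruction mnemonics, DSM address and clock figures are hypothetical, and one instruction is assumed to take one clock, as OSCAR's fixed-clock execution allows.

# Minimal sketch (illustrative): padding a consumer PE's instruction stream
# with NOPs so a DSM read cannot happen before the producer PE's write has
# completed, removing the need for an explicit synchronization flag.

def pad_with_nops(consumer_code, read_index, consumer_clock_at_read,
                  producer_write_done_clock):
    """Insert NOPs before 'read_index' so the DSM read starts only after
    the producer's write has completed (1 instruction = 1 clock)."""
    nops_needed = max(0, producer_write_done_clock - consumer_clock_at_read)
    return (consumer_code[:read_index]
            + ["NOP"] * nops_needed
            + consumer_code[read_index:])

code_pe1 = ["FADD r1,r2,r3", "LOAD r4,DSM[0x10]", "FMUL r5,r4,r1"]
# The LOAD at index 1 would issue at clock 1, but PE0 finishes writing
# DSM[0x10] only at clock 4 of the common reference clock:
print(pad_with_nops(code_pe1, read_index=1,
                    consumer_clock_at_read=1, producer_write_done_clock=4))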
Figure 11: A macroflow graph for a Fortran program with 17 macrotasks.

Figure 12: Macrotask graph for the macroflow graph of Figure 11.

4 Performance Evaluation on OSCAR

This section briefly describes the performance of the OSCAR multi-grain compiler. Figure 11 is a macroflow graph of an example Fortran program composed of 17 macrotasks including RBs, SBs and BPAs. Figure 12 represents the macrotask graph for this macroflow graph. The sequential execution time on 1 PE of OSCAR was 9.63[sec]. The execution time for macro-dataflow using 3 PEs was 3.32[sec]. This result shows that the macro-dataflow computation was realized very efficiently with negligibly small overhead. Also, the execution time of multi-grain processing using 3 PCs, each of which has 2 PEs, namely 6 PEs in total, was 1.83[sec]. In this case, a macrotask composed of a Doall loop was processed in parallel by the 2 PEs inside a PC, and a macrotask composed of a sequential loop or a BPA was processed by using the near fine grain parallel processing scheme. These results show that the multi-grain parallel processing allows us to parallelize such a program effectively.

Next, to evaluate the data localization with partial static task assignment, a Fortran program for spline interpolation having 9 Doall loops, 2 sequential loops with loop carried data dependences and 3 basic blocks is used. Figure 13 shows the performance of the data localization on OSCAR. Conventional Doall processing reduces the execution time from 632[ms] for 1 PE to 218[ms] (1/2.90) for 6 PEs. On the other hand, macro-dataflow processing without data localization reduces the execution time to 188[ms] (1/3.36) for 6 PEs because coarse grain parallelism among the sequential loops and the other macrotasks can be exploited. Furthermore, when the data localization method is applied, the execution time is reduced to 152[ms] (1/4.16) for 6 PEs. In other words, a speedup of 30% for 6 PEs is obtained by the data localization compared with the conventional Doall processing.

In the above evaluation, OSCAR needs only 4 clock cycles to access the CSM and 1 clock cycle to access local memory. However, since the ratio of CSM access time to local memory access time on multiprocessor systems available on the market is larger than that on OSCAR, the proposed data localization scheme may be even more effective on those machines.

Figure 14 shows the performance of the near fine grain parallel processing using static scheduling for a typical loop body of a CFD program called NAL test, developed by the National Aerospace Laboratory. The processing time on OSCAR was reduced from 0.85[sec] for 1 PE to 0.34[sec] for 3 PEs and 0.20[sec] for 6 PEs. From this example, it is understood that near fine grain parallel processing has been successfully realized on OSCAR.

Figure 15 represents the effectiveness of near fine grain parallel processing without explicit synchronization, namely the elimination of all the synchronization instructions inside a sequential loop with 24 statements that calculates PAI. In the figure, the upper curve shows the processing time with all synchronizations, the dotted curve shows the processing time after elimination of the redundant synchronization, and the lower curve shows the processing time after elimination of all explicit synchronization. When three PEs are used, the processing time is reduced from 92.63[us] for all synchronization, with 18 synchronization flag sets and 26 flag checks, to 61.76[us] for no explicit synchronization (33% speedup). This result shows that, on OSCAR, near fine grain parallel processing with elimination of all synchronization inside a basic block by precise instruction scheduling can be realized, and a large performance improvement can be obtained.

Figure 13: Performance of data localization for a spline interpolation program (execution times of Doall processing, macro-dataflow, and macro-dataflow with localization versus the number of processors).

Figure 14: Performance of near fine grain parallel processing for a loop body of a CFD program.

Figure 15: Performance of elimination of all synchronization by precise code scheduling (processing times with synchronization code, after redundant synchronization code elimination, and without synchronization code, versus the number of processors; S-n: number of flag sendings, R-n: number of flag receivings).
5 Conclusions

This paper has described the OSCAR multi-grain architecture and its performance evaluation using the OSCAR Fortran parallelizing compiler. The performance evaluation showed that the compiler can efficiently realize multi-grain parallel processing, which combines the macro-dataflow computation, the loop parallelization and the near fine grain parallel processing, on OSCAR. Furthermore, it has been confirmed that the data localization techniques for automatic data and task decomposition and assignment in macro-dataflow processing, and the elimination of all synchronization inside a basic block in near fine grain parallel processing, give us a large performance improvement.

These compilation techniques and architectural supports will become even more important for High Performance Computers, including multi-vector processors like Fujitsu's VPP, and for coming single chip multiprocessors.
References

[1] U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua, "Automatic program parallelization," Proc. IEEE, Vol.81, No.2, pp.211-243, Feb. 1993.

[2] U. Banerjee, Loop Parallelization, Boston: Kluwer Academic Pub., 1994.

[3] D. J. Lilja, "Exploiting the Parallelism Available in Loops," IEEE Computer, Vol.27, No.2, pp.13-26, Feb. 1994.

[4] M. Wolfe, High Performance Compilers for Parallel Computing, Redwood City: Addison-Wesley, 1996.

[5] B. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu and S. Weatherford, "Polaris: Improving the Effectiveness of Parallelizing Compilers," Proc. 7th Annual Workshop on Languages and Compilers for Parallel Computing, pp.141-154, 1993.

[6] D. J. Kuck, E. S. Davidson, D. H. Lawrie and A. H. Sameh, "Parallel Supercomputing Today and the Cedar Approach," Science, Vol.231, pp.967-974, Feb. 1986.

[7] P. Tu and D. Padua, "Automatic Array Privatization," Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, pp.500-521, 1993.

[8] M. Gupta and P. Banerjee, "Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers," IEEE Trans. Parallel and Distributed Systems, Vol.3, No.2, pp.179-193, 1992.

[9] J. M. Anderson and M. S. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. SIGPLAN '93 Conference on Programming Language Design and Implementation, pp.112-125, 1993.

[10] A. Agarwal, D. A. Kranz and V. Natarajan, "Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors," IEEE Trans. Parallel and Distributed Systems, Vol.6, No.9, pp.943-962, 1995.

[11] K. Fujiwara, K. Shiratori, S. Suzuki and H. Kasahara, "Multiprocessor scheduling algorithms considering data-preloading and post-storing," Trans. IEICE, Vol.J75-D-I, pp.495-503, Aug. 1992 (in Japanese).

[12] E. G. Coffman Jr. (ed.), Computer and Job-shop Scheduling Theory, New York: Wiley, 1976.

[13] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco: Freeman, 1979.

[14] C. D. Polychronopoulos, Parallel Programming and Compilers, Boston: Kluwer Academic Pub., 1988.

[15] H. Kasahara and S. Narita, "Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing," IEEE Trans. Comput., Vol.C-33, No.11, pp.1023-1029, Nov. 1984.

[16] F. Allen, M. Burke, R. Cytron, J. Ferrante, W. Hsieh and V. Sarkar, "A Framework for Determining Useful Parallelism," Proc. 2nd ACM Int'l. Conf. on Supercomputing, 1988.

[17] H. Kasahara, H. Honda and S. Narita, "Parallel Processing of Near Fine Grain Tasks Using Static Scheduling on OSCAR," Proc. IEEE ACM Supercomputing '90, pp.856-864, Nov. 1990.

[18] H. Kasahara, Parallel Processing Technology, Tokyo: Corona Publishing, Jun. 1991 (in Japanese).

[19] H. Kasahara, H. Honda, A. Mogi, A. Ogura, K. Fujiwara and S. Narita, "A Multi-grain Parallelizing Compilation Scheme on OSCAR," Proc. 4th Workshop on Languages and Compilers for Parallel Computing, pp.283-297, Aug. 1991.

[20] H. Honda, M. Iwata and H. Kasahara, "Coarse Grain Parallelism Detection Scheme of Fortran Programs," Trans. IEICE, Vol.J73-D-I, No.12, pp.951-960, Dec. 1990 (in Japanese).

[21] H. Kasahara, H. Honda, M. Iwata and M. Hirota, "A Macro-dataflow Compilation Scheme for Hierarchical Multiprocessor Systems," Proc. Int'l. Conf. on Parallel Processing, pp.II-294-295, Aug. 1990.

[22] M. Girkar and C. D. Polychronopoulos, "Optimization of Data/Control Conditions in Task Graphs," Proc. 4th Workshop on Languages and Compilers for Parallel Computing, pp.152-168, Aug. 1991.

[23] H. Kasahara, H. Honda and S. Narita, "A Fortran Parallelizing Compilation Scheme for OSCAR Using Dependence Graph Analysis," IEICE Trans., Vol.E74, No.10, pp.3105-3114, Oct. 1991.

[24] H. Kasahara, S. Narita and S. Hashimoto, "OSCAR's Architecture," Trans. IEICE, Vol.J71-D, No.8, pp.1440-1445, Aug. 1988 (in Japanese).

[25] W. Ogata, A. Yoshida, K. Aida, M. Okamoto and H. Kasahara, "Near Fine Grain Parallel Processing without Explicit Synchronization on a Multiprocessor System," Proc. 6th Workshop on Compilers for Parallel Computers, pp.359-370, Dec. 1996.

[26] A. Yoshida and H. Kasahara, "Data-Localization for Macro-Dataflow Computation Using Static Macrotask Fusion," Proc. 5th Workshop on Compilers for Parallel Computers, pp.440-453, Jul. 1995.

[27] A. Yoshida and H. Kasahara, "Data-Localization for Fortran Macro-dataflow Computation Using Partial Static Task Assignment," Proc. ACM Int. Conf. on Supercomputing, pp.61-68, May 1996.

[28] A. Yoshida and H. Kasahara, "Data Localization Using Loop Aligned Decomposition for Macro-Dataflow Processing," Proc. 9th Workshop on Languages and Compilers for Parallel Computers, pp.56-74, Aug. 1996.