OSCAR Multi-grain Architecture and Its Evaluation

H. Kasahara, W. Ogata, K. Kimura, G. Matsui, H. Matsuzaki
Dept. of EECE, Waseda University, Shinjuku, Tokyo 169, Japan

M. Okamoto
Toshiba Corporation, Fuchu, Tokyo 183, Japan

A. Yoshida
Dept. of IS, Toho University, Funabashi, Chiba 274, Japan

H. Honda
Dept. of IS, Univ. of Electro-Communications, Chofu, Tokyo 182, Japan

Abstract

OSCAR (Optimally Scheduled Advanced Multiprocessor) was designed to efficiently realize multi-grain parallel processing using static and dynamic scheduling. It is a shared memory multiprocessor system having centralized and distributed shared memories in addition to local memory on each processor, with a data transfer controller for overlapping data transfer and task processing. Its Fortran multi-grain compiler hierarchically exploits coarse grain parallelism among loops, subroutines and basic blocks, conventional medium grain parallelism among loop iterations in a Doall loop, and near fine grain parallelism among statements. In the coarse grain parallel processing, data localization (automatic data distribution) has been employed to minimize data transfer overhead. In the near fine grain processing of a basic block, explicit synchronization can be removed by use of a clock-level accurate code scheduling technique with architectural support. This paper describes OSCAR's architecture, its compiler and the performance of the multi-grain parallel processing. OSCAR's architecture and compilation technology will be more important in future High Performance Computers and single chip multiprocessors.

1 Introduction

Currently, many multiprocessor systems exploit loop parallelism using parallelizing compilers [1]-[5]. Those compilers parallelize many types of Do loops using strong data dependence analysis and program restructuring techniques. There still exist, however, sequential loops that cannot be parallelized efficiently because of loop carried dependencies and conditional branches to the outside of loops. Also, most existing compilers for multiprocessor systems cannot effectively exploit fine grain parallelism inside a basic block or coarse grain parallelism among loops, subroutines and basic blocks. Therefore, to improve the effective performance of multiprocessor systems, it is important to exploit the fine grain parallelism and the coarse grain parallelism in addition to the medium grain parallelism among loop iterations exploited by loop parallelization.

A multiprocessor system, OSCAR [24], was developed to efficiently realize compiler parallelization techniques that handle the above problems, such as:

1. dynamic scheduling for coarse grain parallel processing, or macro-dataflow [6], using a scheduling routine generated by the OSCAR compiler [18, 19, 20, 21],
2. static scheduling for (near) fine grain parallel processing inside a basic block [17, 18, 25], for medium grain parallel processing among loop iterations, and for macro-dataflow processing,
3. efficient use of local memory by decomposition of tasks and data, or data localization [26, 27, 28],
4. overlapping of task processing and data transfer using a data transfer controller [11].

On OSCAR, a multi-grain compiler [18, 19, 23] realizing techniques 1 to 3 above was implemented and evaluated. The last technique, though originally developed for OSCAR, has been evaluated on the Fujitsu VPP 500, which has a stronger data transfer controller. This paper describes OSCAR's architecture, its multi-grain parallelizing compiler and the evaluation of their performance.
2 OSCAR's Architecture

This section describes the architecture of OSCAR (Optimally Scheduled Advanced Multiprocessor), designed to support multi-grain parallel processing. OSCAR itself was designed ten years ago. However, its architecture offers many suggestions for future high performance multiprocessor systems that require a new parallelizing technology like macro-dataflow, and for single chip multiprocessors searching for post-instruction-level parallelism.

Figure 1 shows the architecture of OSCAR. OSCAR is a shared memory multiprocessor system with both centralized and distributed shared memories, in which sixteen processor elements (PEs), each having a distributed shared memory (DSM) and local program and data memories, are uniformly connected to three modules of centralized shared memory (CSM) by three buses.

Figure 1: OSCAR's architecture.

Figure 2: OSCAR's processor element (integer and floating point processing units, 64-register file, local program memory of 128KW x 2 banks, local data memory of 256KW, local stack memory of 4KW, distributed shared memory of 2KW, and a DMA controller).

Each PE, shown in Figure 2, has a custom-made 32-bit RISC processor with a throughput of 5 MFLOPS. It consists of the processor with sixty-four registers, an integer processing unit and a floating point processing unit, a data memory, two banks of program memory for instruction preloading, a dual port memory used as a distributed shared memory (DSM), a stack memory (SM), and a DMA controller used for data pre-loading and post-storing to the CSMs. The PE executes every instruction, including a floating point addition and a multiplication, in one clock cycle. The distributed shared memory on each PE can be accessed simultaneously by the PE itself and by another PE.

Also, OSCAR provides the following three types of data transfer modes using the DSMs and the CSMs:

1. one-PE-to-one-PE direct data transfers using the DSMs,
2. one-PE-to-all-PEs data broadcasting using the DSMs,
3. one-PE-to-several-PEs indirect data transfers through the CSMs.

Each module of the centralized shared memory (CSM) is a simultaneously readable memory, of which the same address or different addresses can be read by three PEs in the same clock cycle.

OSCAR's memory space consists of a local memory space on each PE and a system memory space, as shown in Figure 3. The local memory space on each PE consists of the DSM space, two banks of program memory (PM) for program preloading, data memory (DM) and control space. The system memory space consists of an area for data broadcast onto all DSMs, areas for all PEs and the CSMs. Therefore, memories on each PE can be accessed by a local memory address through the local bus by the PE itself, and by a system memory address through the interconnection buses by every PE.

Figure 3: OSCAR's memory space.
2.1 Architectural Supports for Dynamic Scheduling

In OSCAR's multi-grain compiler, dynamic scheduling is adopted to handle the run time uncertainty caused by conditional branches among coarse grain tasks, or macrotasks, mainly for macro-dataflow processing; the dynamic scheduling overhead can be kept relatively low because the processing times of macrotasks are relatively large [18, 19].

When macrotasks are assigned to processors or processor clusters at run time, optimal allocation of the shared data among macrotasks to DSMs is very difficult. To simplify this problem, OSCAR provides the CSM, to which the shared data used by the dynamically scheduled macrotasks are assigned. Also, OSCAR can simulate a multiple processor cluster (PC) system with the global shared memory. The number of PCs and the number of PEs inside a PC can be changed even at run time according to the parallelism of the target program, or the macrotask graph mentioned later, because the partitioning of PEs into PCs is made by the compiler. Furthermore, each bus has hardware for fast barrier synchronization. By using this hardware, each PC can take a barrier synchronization in a few clocks.

2.2 Architectural Supports for Static Scheduling

On OSCAR, static scheduling at compile time is used for near fine grain parallel processing, loop parallel processing and macro-dataflow processing as much as possible to minimize runtime overheads. For the near fine grain parallel processing [17], OSCAR provides the three data transfer modes mentioned above. The one-to-one direct data transfer or the data broadcast needs only 4 clock cycles to write one word of data from a register of a sender PE to the DSM on a receiver PE or to the DSMs on all PEs. On the other hand, the indirect data transfer requires 8 clock cycles to write one word of data from a register of a sender PE onto a CSM and to read the data from the CSM into a register of a receiver PE. Therefore, the optimal selection among the above three modes using static scheduling allows us to reduce data transfer overhead markedly. Also, synchronization using the DSMs reduces synchronization overhead, because assigning synchronization flags onto the DSMs prevents the degradation of bus bandwidth caused by busy waits checking synchronization flags on the CSMs. Furthermore, the fixed clock execution of every instruction by the OSCAR RISC processor and a single reference clock for the PEs and buses allow the compiler to generate the most efficient parallel machine code, precisely scheduled at the clock level. In the optimized parallel machine code, data transfer timing, including bus accesses and remote memory accesses, is determined by the compiler, and finally all synchronization codes inside a basic block can be removed [25].
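The mode selection can be viewed as a simple compile-time cost comparison. The following Python sketch illustrates the idea using the cycle counts quoted above; the function name, the cost model and its simplifications (for example, treating the 8-cycle indirect cost as independent of the number of readers) are our own assumptions for illustration, not OSCAR compiler code.

```python
# Minimal sketch: pick the cheapest of the three data transfer modes for one word,
# using the clock counts quoted in Section 2.2 (illustrative cost model only).

DIRECT_DSM_CLOCKS = 4      # one-PE-to-one-PE write into a receiver's DSM
BROADCAST_CLOCKS = 4       # one-PE-to-all-PEs broadcast into every DSM
INDIRECT_CSM_CLOCKS = 8    # CSM write by sender plus CSM read by receiver

def choose_transfer_mode(receivers, num_pes):
    """Choose a mode for sending one word to the PEs listed in `receivers`."""
    if len(receivers) == 1:
        return "direct_dsm", DIRECT_DSM_CLOCKS
    if len(receivers) == num_pes:
        return "broadcast_dsm", BROADCAST_CLOCKS
    # Several (but not all) receivers: compare repeated direct writes against
    # a single indirect transfer through a CSM.
    direct_cost = DIRECT_DSM_CLOCKS * len(receivers)
    if direct_cost <= INDIRECT_CSM_CLOCKS:
        return "direct_dsm", direct_cost
    return "indirect_csm", INDIRECT_CSM_CLOCKS

print(choose_transfer_mode(receivers=[3], num_pes=16))        # ('direct_dsm', 4)
print(choose_transfer_mode(receivers=[1, 2, 5], num_pes=16))  # ('indirect_csm', 8)
```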
3 Multi-grain Compilation Scheme

This section briefly describes the OSCAR Fortran multi-grain compilation scheme [17, 18, 23], which mainly consists of the macro-dataflow processing, the loop parallelization and the near fine grain parallel processing.

3.1 Compilation for Macro-dataflow

The macro-dataflow compilation scheme [18, 20, 21] is mainly composed of the following four steps: 1) generation of macrotasks, 2) control-flow and data-flow analysis among macrotasks, 3) earliest executable condition analysis [18, 20, 21] of macrotasks to detect parallelism among macrotasks considering control and data dependencies, and 4) code generation for PCs and for dynamic schedulers.

3.1.1 Generation of macrotasks

A Fortran program is decomposed into macrotasks. The macrotasks are generated so that they have relatively large processing times compared with the dynamic scheduling overhead and the data transfer overhead. The OSCAR compiler generates three types of macrotasks, namely, Block of Pseudo Assignment statements (BPA), Repetition Block (RB) and Subroutine Block (SB).

A BPA is usually defined as an ordinary basic block (BB). However, it is sometimes defined as a block generated by decomposing a BB into independent parts to extract larger parallelism, or by fusing BBs into a coarser macrotask to reduce dynamic scheduling overhead.

A RB is a Do loop or a loop generated by a backward branch, namely, an outermost natural loop. RBs can be defined for reducible flow graphs and, with code copying, for irreducible flow graphs. A RB can be hierarchically decomposed into sub-macrotasks. For the sub-macrotasks, the macro-dataflow processing scheme is hierarchically applied by using sub-processor clusters defined inside a processor cluster. In the decomposition of a RB into sub-macrotasks, overlapped loops are structured into nested loops by code copying to exploit parallelism.

In the above definition of RB, a Doall loop is treated as a macrotask assigned to a single processor cluster. In other words, a Doall loop is not processed by all processors even though it may have enough parallelism to use all processors or all processor clusters. Therefore, in the proposed compilation scheme, a Doall loop is decomposed into "k" smaller Doall loops. The decomposed Doall loops are assigned to processor clusters so that the original Doall loop is processed by all processor clusters, or all processors. Here "k" is usually chosen to be the number of processor clusters in the multiprocessor system, or a multiple of that number. Furthermore, in the generation of RBs using this loop decomposition, a loop aligned decomposition method is applied for data localization among data dependent loops; it minimizes data transfer among processor clusters by using the local memory on each processor when array data dependencies exist among the loops prior to the decomposition [26, 27, 28]. A simple sketch of the Doall decomposition is given below.

As to subroutines, in-line expansion is applied as much as possible, taking code length into account. Subroutines to which in-line expansion cannot be applied efficiently are defined as SBs. SBs can also be hierarchically decomposed into sub-macrotasks, as RBs can.
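As a rough illustration of the Doall decomposition described above, the sketch below splits an iteration range into k contiguous sub-loops, one per processor cluster. The function and its even-split policy are assumptions for illustration only, not the compiler's actual algorithm.

```python
def decompose_doall(lower, upper, k):
    """Split a Doall loop with iterations lower..upper (inclusive) into k
    smaller Doall loops, one per processor cluster (illustrative only)."""
    n = upper - lower + 1
    chunk = (n + k - 1) // k          # ceiling division for near-even chunks
    sub_loops = []
    for c in range(k):
        lo = lower + c * chunk
        hi = min(lower + (c + 1) * chunk - 1, upper)
        if lo <= hi:
            sub_loops.append((c, lo, hi))   # (cluster id, sub-loop bounds)
    return sub_loops

# A 100-iteration Doall split across 3 processor clusters:
print(decompose_doall(1, 100, 3))   # [(0, 1, 34), (1, 35, 68), (2, 69, 100)]
```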
3.1.2 Generation of the macroflow graph (MFG)

A macroflow graph represents both the control flow and the data dependencies among macrotasks. Figure 4 shows an example of a macroflow graph. In this graph, nodes represent macrotasks, dotted edges represent control flow, and solid edges represent data dependencies among macrotasks. Small circles inside nodes represent conditional branch statements inside the macrotasks. The directions of the edges are assumed to be downward, though arrows are omitted. The MFG is a directed acyclic graph because all back-edges are contained inside RBs.

Figure 4: Macroflow graph (MFG).

3.1.3 Macrotask parallelism extraction

The MFG represents control flow and data dependencies among macrotasks, but it does not show any parallelism among them. The program dependence graph [16] represents the maximum parallelism among macrotasks with control dependencies and data dependencies. In practice, however, the macrotask scheduler needs to know when a macrotask can start execution. In this macro-dataflow computation scheme, earliest executable conditions of macrotasks [18]-[21] are therefore used to show the maximum parallelism among macrotasks considering control dependencies and data dependencies.

The earliest executable condition of macrotask i, MTi, is the condition on which MTi may begin its execution earliest. For example, the earliest executable condition of MT6 in Figure 4, which is control-dependent on MT1 and on MT2 and is data-dependent on MT3, is:

  (MT3 completes execution) OR (MT2 branches to MT4)

Here, "MT3 completes execution" satisfies the data dependence of MT6 on MT3, because the following conditions for macro-dataflow execution are assumed in this paper:

1. If macrotask i (MTi) is data-dependent on macrotask j (MTj), MTi cannot start execution before MTj completes execution.

2. A conditional branch statement in a macrotask may be executed as soon as the data dependencies of the branch statement are satisfied. This is because the statements in a macrotask are processed in parallel using the near fine grain parallel processing described later. Therefore, MTi, which is control-dependent on a conditional statement in MTj, can begin execution as soon as the branch direction is determined, even if MTj has not completed.

The above earliest executable condition of MT6 is the simplest form of the condition. The original form of the condition of MTi, which is control-dependent on MTj and data-dependent on MTk (0 ≤ k ≤ N), can be represented as follows:

  (MTj branches to MTi)
  AND { (MTk completes execution) OR (it is determined that MTk will not be executed) }

For example, the original form of the earliest executable condition of MT6 is:

  { (MT1 branches to MT3) OR (MT2 branches to MT4) }
  AND { (MT3 completes execution) OR (MT1 branches to MT2) }

The first partial condition, before the AND, is the earliest executable condition determined by the control dependencies. The second partial condition, after the AND, is the earliest executable condition to satisfy the data dependence. In this condition, the execution of MT3 implies that MT1 has branched to MT3, and the execution of MT2 implies that MT1 has branched to MT2. Therefore, the condition is redundant and can be simplified to the form described above.
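To make the boolean structure of these conditions concrete, the following sketch evaluates MT6's simplified earliest executable condition from a set of observed run-time events. The event names and the set-based representation are ours, purely for illustration; they do not reflect the compiler's internal encoding.

```python
# Illustrative only: evaluate MT6's simplified earliest executable condition,
#   (MT3 completes execution) OR (MT2 branches to MT4),
# from a set of run-time events recorded by a hypothetical scheduler.

def mt6_ready(events):
    return ("MT3 completed" in events) or ("MT2 branched to MT4" in events)

# MT6 may start as soon as either event is observed, even if MT2 itself
# has not finished executing all of its statements.
print(mt6_ready({"MT1 branched to MT3"}))                      # False
print(mt6_ready({"MT1 branched to MT3", "MT3 completed"}))      # True
print(mt6_ready({"MT2 branched to MT4"}))                       # True
```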
The simplest earliest executable conditions of the macrotasks are derived automatically by the OSCAR compiler. This simplification is important to reduce dynamic scheduling overhead. Girkar and Polychronopoulos [22] proposed a similar algorithm for obtaining earliest executable conditions based on the original research [18]-[21]. They solved a simplified version of the problem by assuming that a conditional branch inside a macrotask is executed at the end of the macrotask.

The earliest executable conditions of the MTs are represented by a directed acyclic graph named a macrotask graph, or MTG, as shown in Figure 5. In the MTG, nodes represent macrotasks, dotted edges represent extended control dependencies, and solid edges represent data dependencies. The extended control dependence edges are classified into two types, namely ordinary control dependence edges and co-control dependence edges. A co-control dependence edge represents the condition on which a data dependence predecessor of MTi, namely the MTk mentioned before on which MTi is data-dependent, will not be executed [20]. Also, a data dependence edge, or a solid edge, originating from a small circle has two meanings, namely, an extended control dependence edge and a data dependence edge. Arcs connecting edges at their tails or heads have two different meanings: a solid arc represents that the edges connected by the arc are in an AND relationship, and a dotted arc represents that they are in an OR relationship. Small circles inside nodes represent conditional branch statements. In the MTG, the directions of the edges are also assumed to be downward, though most arrows are omitted. Edges with arrows show that they are the original conditional flow edges originating from the small circles in the MFG.

Figure 5: Macrotask graph (MTG).

3.1.4 Dynamic scheduling of macrotasks

In the macro-dataflow computation, macrotasks are dynamically scheduled to processor clusters (PCs) at run time to cope with runtime uncertainties, such as conditional branches among macrotasks and variations in macrotask execution time. The use of dynamic scheduling for coarse grain tasks keeps the relative scheduling overhead small. Furthermore, the dynamic scheduling in this scheme is performed not by OS calls, as in popular multiprocessor systems, but by a special scheduling routine generated by the compiler. In other words, the compiler generates an efficient dynamic scheduling code exclusively for each Fortran program based on the earliest executable conditions, or the macrotask graph. The scheduling routine is executed by a processor element. The Dynamic-CP algorithm, a dynamic scheduling algorithm that uses the longest path length from each macrotask to the exit node of the MTG as its priority, is employed, taking into consideration the scheduling overhead and the quality of the generated schedule.
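The following sketch shows the general shape of such a dynamic scheduling loop using a longest-path-to-exit priority, in the spirit of the Dynamic-CP algorithm described above. The data structures, the ready set and the assignment policy are simplified assumptions, not the actual compiler-generated routine.

```python
import heapq

def dynamic_cp_schedule(macrotasks, ready, idle_clusters):
    """Assign ready macrotasks to idle processor clusters, highest
    longest-path-to-exit-node priority first (illustrative sketch)."""
    # Larger longest path length to the MTG exit node => higher priority.
    heap = [(-macrotasks[mt]["cp_to_exit"], mt) for mt in ready]
    heapq.heapify(heap)
    assignments = []
    while heap and idle_clusters:
        _, mt = heapq.heappop(heap)
        pc = idle_clusters.pop()
        assignments.append((mt, pc))
    return assignments

# Hypothetical macrotasks with longest path lengths to the exit node:
mts = {"MT3": {"cp_to_exit": 120}, "MT4": {"cp_to_exit": 45}, "MT6": {"cp_to_exit": 80}}
print(dynamic_cp_schedule(mts, ready={"MT3", "MT4", "MT6"}, idle_clusters=["PC0", "PC1"]))
# e.g. [('MT3', 'PC1'), ('MT6', 'PC0')] -- the two longest-path macrotasks go first
```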
3.1.5 Data localization

The data localization scheme [26, 27, 28] reduces the data transfer overhead among macrotasks composed of Doall and sequential loops. Here, data localization means decomposing multiple loops, or array data, and assigning them to processors (PEs) so that the data shared among the macrotasks can be passed through the local memory on the PEs. This compilation method consists of the following three steps:

1. loop aligned decomposition, which decomposes loop indices and arrays so as to minimize data transfer among processors based on inter-loop data dependence analysis,
2. generation of a dynamic scheduling routine that assigns a set of decomposed loops, among which large data transfers may occur, onto the same PE using the macrotask fusion and partial static assignment methods, and
3. generation of parallel machine code that transfers data via local memory among the decomposed loops assigned to the same PE.

Figure 6: Inter-loop data dependence: (a) a target loop group (TLG); (b) inter-loop data dependence, showing the iterations on which the L-th iteration of RB3 is data-dependent.

In this method, for example, when the RBs in Figure 6(a) are executed on two PEs, RB1 in Figure 6(a) is decomposed into the partial loops RB1^1, RB1^(1,2) and RB1^2 shown in Figure 7(b), and RB2 and RB3 are decomposed in the same manner. In this case, the array data inside the group composed of RB1^1, RB2^1 and RB3^1 and the group composed of RB1^2, RB2^2 and RB3^2 in Figure 7(b) are passed through local memory. This loop aligned decomposition method can also be applied to multiple loops including a sequential loop, such as in Figure 8(a), where the loops are decomposed into partial loops as shown in Figure 8(b) when three PEs are used.

Figure 7: Loop aligned decomposition for task fusion, showing localizable regions (LRs) and a commonly accessed region (CAR).

Figure 8: Loop aligned decomposition for the partial static task assignment.
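The index arithmetic behind the loop aligned decomposition can be pictured with a small sketch. The code below is our own simplified model, assuming a one-iteration dependence span: it splits an index range so that most of each loop stays local to one PE and only a small boundary region, corresponding to the commonly accessed region (CAR) in Figure 7, is shared between neighbouring PEs.

```python
def loop_aligned_decomposition(n_iters, n_pes, dep_span):
    """Illustrative sketch of loop aligned decomposition: split the index range
    1..n_iters over n_pes PEs, and report the boundary iterations (a 'commonly
    accessed region', CAR) that two neighbouring partitions both need when a
    consumer iteration i depends on producer iterations i-dep_span..i
    (assumed dependence pattern, for illustration only)."""
    chunk = n_iters // n_pes
    partitions, cars = [], []
    for p in range(n_pes):
        lo = p * chunk + 1
        hi = n_iters if p == n_pes - 1 else (p + 1) * chunk
        partitions.append((p, lo, hi))            # localizable region on PE p
        if p > 0:
            cars.append((lo - dep_span, lo - 1))  # shared with PE p-1
    return partitions, cars

parts, cars = loop_aligned_decomposition(n_iters=100, n_pes=2, dep_span=1)
print(parts)  # [(0, 1, 50), (1, 51, 100)]
print(cars)   # [(50, 50)] -- iteration 50's data is needed by both PEs
```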
3.2 Medium Grain Parallel Processing

Macrotasks are assigned to processor clusters (PCs) dynamically, as mentioned in the previous section. If a macrotask assigned to a PC is a Doall loop, the macrotask is processed at the medium grain, or iteration level grain, by the processors inside the PC. For Doall loops, several dynamic scheduling schemes have been proposed. On OSCAR, however, a simple static scheduling scheme is used, because OSCAR does not have hardware support for dynamic iteration scheduling and static scheduling allows us to realize data localization among loops. If a macrotask assigned to a PC is a loop having data dependencies among iterations, the compiler first tries to apply Doacross with restructuring to minimize the synchronization overhead. Next, the compiler compares the estimated processing time of the Doacross with that of the near fine grain parallel processing of the loop body mentioned later. If the processing time of the Doacross is shorter than that of the near fine grain processing, the compiler generates machine code for the Doacross.

3.3 Near Fine Grain Parallel Processing

A BPA is decomposed into near fine grain tasks [17], each of which consists of a statement, and is processed in parallel by the processors inside a PC.

3.3.1 Generation of tasks and task graph

To efficiently process a BPA in parallel, the computation in the BPA must be decomposed into tasks in such a way that parallelism is fully exploited and the overhead related to data transfer and synchronization is kept small. In the proposed scheme, statement level granularity is chosen as the finest granularity for OSCAR, taking into account OSCAR's processing capability and data transfer capability.

Figure 9 shows an example of statement level tasks, or near fine grain tasks, generated for a basic block that solves a sparse matrix. Such a large basic block is generated by the symbolic generation technique, which has been used in electronic circuit simulators like SPICE, and by partial evaluation.

Figure 9: Near fine grain tasks for an LU decomposition basic block.

The data dependencies, or precedence constraints, among the generated tasks can be represented by the arcs of a task graph [12]-[15], as shown in Figure 10, in which each task corresponds to a node. In the graph, the figure inside a node circle represents the task number i, and the figure beside it represents the task processing time on a PE, ti. An edge directed from node Ni toward Nj represents the partially ordered constraint that task Ti precedes task Tj. When we also consider the data transfer time between tasks, each edge generally has a variable weight. Its weight, tij, will be the data transfer time between tasks Ti and Tj if they are assigned to different PEs; it will be zero, or the time to access registers or local data memories, if the tasks are assigned to the same PE.

Figure 10: Task graph for near fine grain tasks.
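The weight rule just described is easy to express directly. The sketch below uses hypothetical task numbers, processing times and a 4-clock transfer cost purely to illustrate how an edge's weight depends on the PE assignment; it is our own representation, not the compiler's.

```python
# A minimal sketch of a near fine grain task graph: nodes carry processing
# times t_i, edges carry the data transfer time t_ij that is paid only when
# the two tasks land on different PEs.

tasks = {1: 4, 2: 3, 3: 5}         # task number -> processing time t_i (clocks)
edges = {(1, 3): 4, (2, 3): 4}     # (Ti, Tj) -> transfer time t_ij if on different PEs

def edge_cost(i, j, assignment):
    """Return the effective weight of edge (i, j) for a given PE assignment."""
    if assignment[i] == assignment[j]:
        return 0                   # data passed through registers / local memory
    return edges[(i, j)]           # one-word transfer between PEs

print(edge_cost(1, 3, {1: "PE0", 2: "PE1", 3: "PE0"}))  # 0  (same PE)
print(edge_cost(2, 3, {1: "PE0", 2: "PE1", 3: "PE0"}))  # 4  (different PEs)
```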
3.3.2 Static scheduling algorithm

To process a set of (near fine grain) tasks efficiently on a multiprocessor system, an assignment of the tasks onto PEs and an execution order among the tasks assigned to the same PE must be determined optimally. The problem of determining the optimal assignment and the optimal execution order can be treated as a traditional minimum execution time multiprocessor scheduling problem [12, 15]. Stated formally, the scheduling problem is to determine a non-preemptive schedule whose execution time, or schedule length, is minimum, given a set of n computational tasks, precedence relations among them, and m processors with the same processing capability. This scheduling problem, however, is known to be a "strong" NP-hard problem [13]. Considering this fact, a variety of heuristic algorithms and a practical optimization algorithm have been proposed [15]. In the OSCAR compiler, the heuristic scheduling algorithm CP/DT/MISF (Critical Path / Data Transfer / Most Immediate Successors First), which takes data transfer into account [17], has been adopted in consideration of compilation time and the quality of the generated schedules.

3.3.3 Machine code generation

For efficient parallel execution of near fine grain tasks on an actual multiprocessor system, optimal machine code must be generated from the statically scheduled result. The statically scheduled result gives us the following information:

1. which tasks are executed on each PE,
2. in which order the tasks assigned to the same PE are executed,
3. when and where data transfers and synchronization among PEs are required,

and so on. Therefore, we can generate the machine code for each PE by putting together the instructions for the tasks assigned to the PE and inserting instructions for data transfer and synchronization into the required places. The "version number" method is used for synchronization among tasks. At the end of a BPA, instructions for the barrier synchronization, which is supported by OSCAR's hardware, are inserted into the program code on each PE.

The compiler can also optimize the code by making full use of all the information obtained from the static scheduling. For example, when a task should pass shared data to other tasks assigned to the same PE, the data can be passed through registers on the PE. In addition, the compiler minimizes synchronization overhead by eliminating redundant synchronization, considering the information about the tasks to be synchronized, the task assignment and the execution order.

In addition to the elimination of redundant synchronization codes, the OSCAR compiler has realized the elimination of all synchronization codes inside a basic block and inside a sequential loop to which the near fine grain processing is applied [25]. In this optimization, the compiler estimates the start and completion time of every task execution and data transfer, that is, the bus access and memory access timing, exactly at the machine clock level, with the architectural support of OSCAR. Next, the compiler, or machine code scheduler, generates parallel machine code that controls memory and bus access timing by inserting NOP (no operation) instructions to delay reading shared data on a distributed shared memory that is to be written by another processor, and to delay bus accesses until data transfers are finished by other processors. Also, the compiler inserts NOP instructions into the program codes of PEs that reach a barrier point before the last PE does, to realize barrier synchronization without an explicit barrier instruction supported by hardware.
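The clock-level reasoning behind this synchronization-free code can be sketched as a simple delay calculation. The function below is illustrative only (the clock numbers are made up and the model ignores bus contention); it is not OSCAR compiler output.

```python
# Illustrative sketch: if the consumer's scheduled read of a DSM word would
# occur before the producer's write completes, pad the consumer with NOPs
# instead of emitting an explicit synchronization flag check.

def nops_needed(producer_write_done_clock, consumer_read_clock):
    """Number of NOPs to insert before the consumer's read instruction."""
    return max(0, producer_write_done_clock - consumer_read_clock)

# Hypothetical timings from a static schedule:
# PE0 finishes writing the shared word into PE1's DSM at clock 42,
# PE1 would otherwise issue the read at clock 39.
print(nops_needed(producer_write_done_clock=42, consumer_read_clock=39))  # 3 NOPs
print(nops_needed(producer_write_done_clock=42, consumer_read_clock=45))  # 0 NOPs
```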
4 Performance Evaluation on OSCAR

This section briefly describes the performance of the OSCAR multi-grain compiler.

Figure 11 is the macroflow graph of an example Fortran program composed of 17 macrotasks including RBs, SBs and BPAs. Figure 12 represents the macrotask graph for this macroflow graph. The sequential execution time on 1 PE of OSCAR was 9.63[sec]. The execution time for macro-dataflow processing using 3 PEs was 3.32[sec]. This result shows that the macro-dataflow computation was realized very efficiently, with negligibly small overhead. Also, the execution time for multi-grain processing using 3 PCs, each of which has 2 PEs, namely 6 PEs in total, was 1.83[sec]. In this case, a macrotask composed of a Doall loop was processed in parallel by the 2 PEs inside a PC, and a macrotask composed of a sequential loop or a BPA was processed using the near fine grain parallel processing scheme. These results show that the multi-grain parallel processing allows programs to be parallelized effectively.

Figure 11: A macroflow graph for a Fortran program with 17 macrotasks.

Figure 12: Macrotask graph of Figure 11.

Next, to evaluate the data localization with partial static task assignment, a Fortran program for spline interpolation, having 9 Doall loops, 2 sequential loops with loop carried data dependencies and 3 basic blocks, is used. Figure 13 shows the performance of the data localization on OSCAR. Conventional Doall processing reduces the execution time from 632[ms] for 1 PE to 218[ms] (1/2.90) for 6 PEs. On the other hand, macro-dataflow processing without data localization reduces the execution time to 188[ms] (1/3.36) for 6 PEs, because the coarse grain parallelism among the sequential loops and the other macrotasks can be exploited. Furthermore, when the data localization method is applied, the execution time is reduced to 152[ms] (1/4.16) for 6 PEs. In other words, a speedup of 30% for 6 PEs is obtained by the data localization compared with conventional Doall processing. In the above evaluation, OSCAR needs only 4 clock cycles to access the CSM and 1 clock cycle to access the LM. However, since the ratio of CSM access time to LM access time on multiprocessor systems available on the market is larger than that on OSCAR, the proposed data localization scheme may be even more effective on those machines.

Figure 13: Performance of data localization for a spline interpolation program.

Figure 14 shows the performance of the near fine grain parallel processing using static scheduling for a typical loop body of a CFD program called the NAL test, developed by the National Aerospace Laboratory. The processing time on OSCAR was reduced from 0.85[sec] for 1 PE to 0.34[sec] for 3 PEs and 0.20[sec] for 6 PEs. This example shows that near fine grain parallel processing has been successfully realized on OSCAR.

Figure 14: Performance of near fine grain parallel processing for a loop body of a CFD program.

Figure 15 shows the effectiveness of near fine grain parallel processing without explicit synchronization, namely the elimination of all synchronization instructions inside a sequential loop with 24 statements that calculates PAI. In the figure, the upper curve shows the processing time with all synchronizations, the dotted curve shows the processing time after elimination of the redundant synchronization, and the lower curve shows the processing time after elimination of all explicit synchronization. When three PEs are used, the processing time is reduced from 92.63[us] with all synchronization, using 18 synchronization flag sets and 26 flag checks, to 61.76[us] with no explicit synchronization (a 33% speedup). This result shows that, on OSCAR, near fine grain parallel processing with elimination of all synchronization inside a basic block by precise instruction scheduling can be realized, and a large performance improvement can be obtained.

Figure 15: Performance of elimination of all synchronization by precise code scheduling.
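As a cross-check, the speedup ratios quoted for the spline interpolation program follow directly from the measured times; the short calculation below only reproduces the numbers reported above.

```python
# Worked check of the spline interpolation speedups quoted in this section.
seq = 632  # ms on 1 PE
for label, t in [("Doall", 218), ("Macro-dataflow", 188), ("Macro-dataflow + localization", 152)]:
    print(f"{label}: {seq / t:.2f}x")                 # 2.90x, 3.36x, 4.16x
print(f"Improvement of localization over Doall: {(218 - 152) / 218:.0%}")  # 30%
```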
5 Conclusions

This paper has described the OSCAR multi-grain architecture and its performance evaluation using the OSCAR Fortran parallelizing compiler. The performance evaluation showed that the compiler can efficiently realize multi-grain parallel processing, which combines the macro-dataflow computation, the loop parallelization and the near fine grain parallel processing, on OSCAR. Furthermore, it has been confirmed that the data localization techniques for automatic data and task decomposition and assignment in macro-dataflow processing, and the elimination of all synchronization inside a basic block in near fine grain parallel processing, give us large performance improvements. These compilation techniques and architectural supports will become more important for High Performance Computers, including multi-vector processors like Fujitsu's VPP, and for the coming single chip multiprocessors.

References

[1] U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua, "Automatic program parallelization," Proc. IEEE, Vol.81, No.2, pp.211-243, Feb. 1993.

[2] U. Banerjee, Loop Parallelization, Boston: Kluwer Academic Pub., 1994.

[3] D. J. Lilja, "Exploiting the Parallelism Available in Loops," IEEE Computer, Vol.27, No.2, pp.13-26, Feb. 1994.

[4] M. Wolfe, High Performance Compilers for Parallel Computing, Redwood City: Addison-Wesley, 1996.

[5] B. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu and S. Weatherford, "Polaris: Improving the Effectiveness of Parallelizing Compilers," Proc. 7th Annual Workshop on Languages and Compilers for Parallel Computing, pp. 141-154, 1993.

[6] D. J. Kuck, E. S. Davidson, D. H. Lawrie and A. H. Sameh, "Parallel Supercomputing Today and the Cedar Approach," Science, Vol.231, pp.967-974, Feb. 1986.

[7] P. Tu and D. Padua, "Automatic Array Privatization," Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, pp. 500-521, 1993.

[8] M. Gupta and P. Banerjee, "Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers," IEEE Trans. Parallel and Distributed Systems, Vol.3, No.2, pp. 179-193, 1992.

[9] J. M. Anderson and M. S. Lam, "Global Optimizations for Parallelism and Locality on Scalable Parallel Machines," Proc. SIGPLAN '93 Conference on Programming Language Design and Implementation, pp. 112-125, 1993.

[10] A. Agarwal, D. A. Kranz and V. Natarajan, "Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors," IEEE Trans. Parallel and Distributed Systems, Vol.6, No.9, pp.943-962, 1995.

[11] K. Fujiwara, K. Shiratori, S. Suzuki and H. Kasahara, "Multiprocessor scheduling algorithms considering data-preloading and post-storing," Trans. IEICE, Vol.J75-D-I, pp.495-503, Aug. 1992.

[12] E. G. Coffman Jr. (ed.), Computer and Job-shop Scheduling Theory, New York: Wiley, 1976.

[13] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco: Freeman, 1979.

[14] C. D. Polychronopoulos, Parallel Programming and Compilers, Boston: Kluwer Academic Pub., 1988.
[15] H. Kasahara and S. Narita, "Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing," IEEE Trans. Comput., Vol.C-33, No.11, pp. 1023-1029, Nov. 1984.

[16] F. Allen, M. Burke, R. Cytron, J. Ferrante, W. Hsieh and V. Sarkar, "A Framework for Determining Useful Parallelism," Proc. 2nd ACM Int'l. Conf. on Supercomputing, 1988.

[17] H. Kasahara, H. Honda and S. Narita, "Parallel Processing of Near Fine Grain Tasks Using Static Scheduling on OSCAR," Proc. IEEE/ACM Supercomputing '90, pp. 856-864, Nov. 1990.

[18] H. Kasahara, Parallel Processing Technology, Tokyo: Corona Publishing, Jun. 1991 (in Japanese).

[19] H. Kasahara, H. Honda, A. Mogi, A. Ogura, K. Fujiwara and S. Narita, "A Multi-grain Parallelizing Compilation Scheme on OSCAR," Proc. 4th Workshop on Languages and Compilers for Parallel Computing, pp. 283-297, Aug. 1991.

[20] H. Honda, M. Iwata and H. Kasahara, "Coarse Grain Parallelism Detection Scheme of Fortran Programs," Trans. IEICE, Vol.J73-D-I, No.12, pp. 951-960, Dec. 1990 (in Japanese).

[21] H. Kasahara, H. Honda, M. Iwata and M. Hirota, "A Macro-dataflow Compilation Scheme for Hierarchical Multiprocessor Systems," Proc. Int'l. Conf. on Parallel Processing, pp. II-294-295, Aug. 1990.

[22] M. Girkar and C. D. Polychronopoulos, "Optimization of Data/Control Conditions in Task Graphs," Proc. 4th Workshop on Languages and Compilers for Parallel Computing, pp. 152-168, Aug. 1991.

[23] H. Kasahara, H. Honda and S. Narita, "A Fortran Parallelizing Compilation Scheme for OSCAR Using Dependence Graph Analysis," IEICE Trans., Vol.E74, No.10, pp.3105-3114, Oct. 1991.

[24] H. Kasahara, S. Narita and S. Hashimoto, "OSCAR's Architecture," Trans. IEICE, Vol.J71-D, No.8, pp. 1440-1445, Aug. 1988 (in Japanese).

[25] W. Ogata, A. Yoshida, K. Aida, M. Okamoto and H. Kasahara, "Near Fine Grain Parallel Processing without Explicit Synchronization on a Multiprocessor System," Proc. 6th Workshop on Compilers for Parallel Computers, pp. 359-370, Dec. 1996.

[26] A. Yoshida and H. Kasahara, "Data-Localization for Macro-Dataflow Computation Using Static Macrotask Fusion," Proc. 5th Workshop on Compilers for Parallel Computers, pp. 440-453, Jul. 1995.

[27] A. Yoshida and H. Kasahara, "Data-Localization for Fortran Macro-dataflow Computation Using Partial Static Task Assignment," Proc. ACM Int. Conf. on Supercomputing, pp. 61-68, May 1996.

[28] A. Yoshida and H. Kasahara, "Data Localization Using Loop Aligned Decomposition for Macro-Dataflow Processing," Proc. 9th Workshop on Languages and Compilers for Parallel Computers, pp. 56-74, Aug. 1996.