COSC513 Operating System Research Paper
Fundamental Properties of Programming for Parallelism
Student: Feng Chen (134192)

Conditions of Parallelism
- Needs in three key areas:
    Computation models
    Inter-processor communication
    System integration
- Tradeoffs exist among time, space, performance, and cost factors

Data and resource dependences
- Flow dependence: S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and at least one output of S1 feeds in as input to S2
- Antidependence: S2 is antidependent on S1 if S2 follows S1 in program order and the output of S2 overlaps the input to S1
- Output dependence: S1 and S2 produce (write) the same output variable
- I/O dependence: the same file is referenced by more than one I/O statement
- Unknown dependence: the relation cannot be determined, e.g. the index is itself indexed (indirect addressing), the index contains no loop variable, the index is nonlinear in the loop variable, etc.

Example of data dependence
    S1: Load R1, A      / move mem(A) to R1
    S2: Add R2, R1      / R2 = (R1) + (R2)
    S3: Move R1, R3     / move (R3) to R1
    S4: Store B, R1     / move (R1) to mem(B)
- S2 is flow-dependent on S1
- S3 is antidependent on S2
- S3 is output-dependent on S1
- S2 and S4 are totally independent
- S4 is flow-dependent on S3 (a pairwise register comparison also flags S1, but S3 overwrites R1 before S4 reads it)

Example of I/O dependence
    S1: Read(4), A(I)   / read array A from tape unit 4
    S2: Rewind(4)       / rewind tape unit 4
    S3: Write(4), B(I)  / write array B into tape unit 4
    S4: Rewind(4)       / rewind tape unit 4
- S1 and S3 are I/O dependent on each other
- This relation should not be violated during execution; otherwise, errors occur
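The register-level data dependence example above can be checked mechanically by comparing read and write sets. The following is a minimal sketch, not a full dependence analyzer: the names `STATEMENTS`, `classify`, and `dependences` are chosen here for illustration, and the read/write sets are transcribed by hand from S1 to S4. It compares statements pairwise only, so it reports S1 -> S4 as a flow dependence even though S3 overwrites R1 in between; it does not track which write actually reaches a read.

```python
from itertools import combinations

# Read and write sets for the four statements of the example.
# "A" and "B" stand for memory locations, R1..R3 for registers.
# These sets are transcribed by hand (an assumption of this sketch).
STATEMENTS = {
    "S1": {"reads": {"A"}, "writes": {"R1"}},         # Load R1, A
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},  # Add R2, R1
    "S3": {"reads": {"R3"}, "writes": {"R1"}},        # Move R1, R3
    "S4": {"reads": {"R1"}, "writes": {"B"}},         # Store B, R1
}


def classify(earlier, later):
    """Return the kinds of dependence of `later` on `earlier` (program order)."""
    kinds = []
    if earlier["writes"] & later["reads"]:
        kinds.append("flow")    # earlier writes what later reads
    if earlier["reads"] & later["writes"]:
        kinds.append("anti")    # later overwrites what earlier reads
    if earlier["writes"] & later["writes"]:
        kinds.append("output")  # both write the same location
    return kinds


def dependences(statements):
    """Pairwise dependences, keyed by (earlier, later) statement names."""
    names = list(statements)  # dicts keep insertion order, i.e. program order
    result = {}
    for first, second in combinations(names, 2):
        kinds = classify(statements[first], statements[second])
        if kinds:
            result[(first, second)] = kinds
    return result


if __name__ == "__main__":
    for (first, second), kinds in dependences(STATEMENTS).items():
        print(f"{second} depends on {first}: {', '.join(kinds)}")
```

Pairs that do not appear in the result, such as S2 and S4, share no conflicting location and are independent under this check.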
Control dependence
- The situation where the order of execution of statements cannot be determined before run time
- Different paths taken after a conditional branch may change data dependences
- May exist between operations performed in successive iterations of a loop
- Control dependence often prohibits parallelism from being exploited

Example of control dependence
- Successive iterations of this loop are control-independent:
    for (I = 0; I < N; I++) {
        A[I] = C[I];
        if (A[I] < 0) A[I] = 1;
    }
- The following loop has control-dependent iterations:
    for (I = 1; I < N; I++) {
        if (A[I-1] == 0) A[I] = 0;
    }

Resource dependence
- Concerned with conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas
- ALU dependence: the ALU is the conflicting resource
- Storage dependence: each task must work on independent storage locations or use protected access to share a writable memory area
- Detection of parallelism requires a check of the various dependence relations

Bernstein's conditions for parallelism
- Define:
    Ii as the input set of a process Pi
    Oi as the output set of a process Pi
- P1 and P2 can execute in parallel (denoted P1 || P2) under the conditions:
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
- Note that I1 ∩ I2 ≠ ∅ does not prevent parallelism
- Input set: also called the read set or domain of a process
- Output set: also called the write set or range of a process
- A set of processes can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || ... || PK if and only if Pi || Pj for all i ≠ j
- The parallelism relation is commutative: Pi || Pj implies Pj || Pi
- The relation is not transitive: Pi || Pj and Pj || Pk do not necessarily mean Pi || Pk
- Associativity: Pi || Pj || Pk implies that (Pi || Pj) || Pk = Pi || (Pj || Pk)
- For n processes, there are 3n(n-1)/2 conditions; violation of any of
  them prohibits parallelism collectively or partially
- Statements or processes that depend on run-time conditions (IF statements or conditional branches) are not transformed into parallel form
- The analysis of dependences can be conducted at code, subroutine, process, task, and program levels; higher-level dependence can be inferred from that of subordinate levels

Example of parallelism using Bernstein's conditions
    P1: C = D * E
    P2: M = G + C
    P3: A = B + C
    P4: C = L + M
    P5: F = G / E
- Assuming no pipelining is used, five steps are needed in sequential execution
[Figure: dataflow graphs of P1 to P5 along a time axis, for sequential and for parallel execution]
- There are 10 pairs of statements to check against Bernstein's conditions
- Collectively, only P2 || P3 || P5 is possible, because P2 || P3, P3 || P5, and P2 || P5 all hold
- If two adders are available simultaneously, the parallel execution requires only three steps

Implementation of parallelism
- Special hardware and software support is needed to implement parallelism
- There is a distinction between hardware and software parallelism
- Parallelism cannot be achieved for free

Hardware parallelism
- Often a function of cost and performance tradeoffs
- If a processor issues k instructions per machine cycle, it is called a k-issue processor
- A conventional processor takes one or more machine cycles to issue a single instruction: a one-issue processor
- A multiprocessor system built with n k-issue processors should be able to handle a maximum of nk threads of instructions simultaneously

Software parallelism
- Defined by the control and data dependences of programs
- A function of algorithm, programming style, and compiler optimization
- Two most cited types of parallel programming:
    Control parallelism: in the form of pipelining and multiple functional units
    Data parallelism: similar operations performed over many data elements by multiple processors; practiced in SIMD and
    MIMD systems

Hardware vs. Software parallelism

Software parallelism
- Eight instructions in total: 4 loads (L), 2 multiplications (X), 1 addition (+), and 1 subtraction (-)
- Theoretically, the computation can be completed in 3 cycles (steps)
    Step 1: L L L L
    Step 2: X X
    Step 3: + -      (results A and B)

Hardware parallelism (Example 1)
- A 2-issue processor that can execute one memory access and one arithmetic operation simultaneously
- The computation needs 7 cycles (steps)
- Mismatch between HW and SW parallelism
    Step 1: L
    Step 2: L
    Step 3: X L
    Step 4: L
    Step 5: X
    Step 6: +        (result A)
    Step 7: -        (result B)

Hardware parallelism (Example 2)
- A dual-processor system in which each processor is single-issue
- 6 cycles are needed to execute 12 instructions; 2 store operations and 2 load operations are inserted for interprocessor communication through the shared memory
    Step 1: L L
    Step 2: L L
    Step 3: X X
    Step 4: S S
    Step 5: L L
    Step 6: + -      (results A and B)
- S: instructions added for interprocessor communication
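The cycle counts for the software-parallelism case (3 cycles) and the 2-issue case (7 cycles) can be reproduced with a small greedy scheduler. This is a sketch under stated assumptions: every instruction takes one cycle, a result is usable in the following cycle, and ready instructions are issued in listing order. The instruction labels (`L1`, `X1`, `ADD`, `SUB`) and the function name `schedule` are made up for illustration, and the dependence graph (each multiplication consumes two loads, the addition and subtraction consume both products) is the assumed shape of the example. The dual-processor case with inserted store and load operations is not modeled.

```python
# Assumed dependence graph of the eight-instruction example.
# Each entry maps an instruction label to (unit kind, predecessors).
GRAPH = {
    "L1": ("mem", []),
    "L2": ("mem", []),
    "L3": ("mem", []),
    "L4": ("mem", []),
    "X1": ("alu", ["L1", "L2"]),
    "X2": ("alu", ["L3", "L4"]),
    "ADD": ("alu", ["X1", "X2"]),  # produces A
    "SUB": ("alu", ["X1", "X2"]),  # produces B
}


def schedule(graph, limits=None):
    """Greedy list scheduling with unit latency.

    `limits` maps a unit kind to the number of instructions of that kind
    that may issue per cycle; None means no hardware limit at all.
    Returns a list of cycles, each a list of instruction labels.
    """
    done = set()
    cycles = []
    while len(done) < len(graph):
        issued = []
        used = {}
        for name, (kind, preds) in graph.items():
            if name in done:
                continue
            # Operands must come from earlier cycles, not the current one.
            if not all(p in done for p in preds):
                continue
            if limits is not None and used.get(kind, 0) >= limits.get(kind, 0):
                continue
            used[kind] = used.get(kind, 0) + 1
            issued.append(name)
        if not issued:
            raise ValueError("no instruction can issue; check graph and limits")
        done.update(issued)
        cycles.append(issued)
    return cycles


if __name__ == "__main__":
    for label, limits in [("software parallelism", None),
                          ("2-issue (1 mem + 1 alu)", {"mem": 1, "alu": 1})]:
        plan = schedule(GRAPH, limits)
        print(f"{label}: {len(plan)} cycles")
        for step, names in enumerate(plan, start=1):
            print(f"  Step {step}: {' '.join(names)}")
```

The gap between the two results comes entirely from the issue limits: the dependence graph is the same in both runs, which is the mismatch between software and hardware parallelism described above.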