PPT - University of Delaware

Optimizing Compilers
CISC 673, Spring 2009
Instruction Scheduling
John Cavazos
University of Delaware
UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT

Instruction Scheduling
  - Reordering instructions to improve performance
  - Takes into account anticipated latencies
      - Machine-specific
  - Performed late in optimization pass
  - Instruction-Level Parallelism (ILP)

Modern Architectures Features
  - Superscalar
      - Multiple logic units
  - Multiple issue
      - 2 or more instructions issued per cycle
  - Speculative execution
      - Branch predictors
      - Speculative loads
  - Deep pipelines

Types of Instruction Scheduling
  - Local Scheduling
      - Basic Block Scheduling
  - Global Scheduling
      - Trace Scheduling
      - Superblock Scheduling
      - Software Pipelining

Scheduling for different Computer Architectures
  - Out-of-order issue
      - Scheduling is useful
  - In-order issue
      - Scheduling is very important
  - VLIW
      - Scheduling is essential!
Challenges to ILP
  - Structural hazards
      - Insufficient resources to exploit parallelism
  - Data hazards
      - Instruction depends on result of previous instruction still in pipeline
  - Control hazards
      - Branches & jumps modify PC
      - Affect which instructions should be in pipeline

Recall from Architecture…
  - IF – Instruction Fetch
  - ID – Instruction Decode
  - EX – Execute
  - MA – Memory access
  - WB – Write back

    IF  ID  EX  MA  WB
        IF  ID  EX  MA  WB
            IF  ID  EX  MA  WB

Structural Hazards
Instruction latency: execute takes > 1 cycle

    addf R3,R1,R2    IF  ID  EX     EX  MA  WB
    addf R3,R3,R4        IF  ID     stall   EX  EX  MA  WB

Assumes floating point ops take 2 execute cycles

Data Hazards
Memory latency: data not ready

    lw  R1,0(R2)     IF  ID  EX     MA  WB
    add R3,R1,R4         IF  ID     stall   EX  MA  WB

Control Hazards

    Taken Branch         IF  ID  EX  MA  WB
    Instr + 1                IF  --- --- --- ---
    Branch Target                IF  ID  EX  MA  WB
    Branch Target + 1                IF  ID  EX  MA  WB

Basic Block Scheduling
  - For each basic block:
      - Construct directed acyclic graph (DAG) using dependences between statements
          - Node = statement / instruction
          - Edge (a,b) = statement a must execute before b
      - Schedule instructions using the DAG

Data Dependences
  - If two operations access the same register and one access is a write, they are dependent
  - Types of data dependences

    RAW (Read after Write)    WAW (Write after Write)    WAR (Write after Read)
    r1 = r2 + r3              r1 = r2 + r3               r1 = r2 + r3
    r4 = r1 * 6               r1 = r4 * 6                r2 = r5 * 6

Cannot reorder two dependent instructions

Basic Block Scheduling Example

Original Schedule
    a) lw R2, (R1)
    b) lw R3, (R1) 4
    c) R4 ← R2 + R3
    d) R5 ← R2 - 1

Dependence DAG (each edge has latency 2)
    a → c
    b → c
    a → d

Schedule 1 (5 cycles)
    a) lw R2, (R1)
    b) lw R3, (R1) 4
       --- nop ---
    c) R4 ← R2 + R3
    d) R5 ← R2 - 1

Schedule 2 (4 cycles)
    a) lw R2, (R1)
    b) lw R3, (R1) 4
    d) R5 ← R2 - 1
    c) R4 ← R2 + R3

Scheduling Algorithm
  - Construct dependence DAG on basic block
  - Put roots in candidate set
  - Use scheduling heuristics (in order) to select instruction
  - While candidate set not empty
      - Evaluate all candidates and select best one
      - Delete scheduled instruction from candidate set
      - Add newly-exposed candidates

Instruction Scheduling Heuristics
  - Scheduling is NP-complete, so we need heuristics
  - Bias scheduler to prefer instructions:
      - Earliest execution time
      - Have many successors
          - More flexibility in scheduling
      - Progress along critical path
      - Free registers
          - Reduce register pressure
  - Can be a combination of heuristics

Computing Priorities

\[
\mathrm{height}(n) =
\begin{cases}
\mathrm{exec}(n) & \text{if } n \text{ is a leaf} \\
\max_{m \in \mathrm{succ}(n)} \mathrm{height}(m) + \mathrm{exec}(n) & \text{otherwise, where } m \text{ ranges over the successors of } n
\end{cases}
\]

Critical path(s) = path through the dependence DAG with longest latency

Example – Determine Height and CP

Assume: memory instrs = 3 cycles, mult = 2 cycles (to have result in register), rest = 1 cycle

    Node  Code               Latency
    a     lw   r1, w         3
    b     add  r1,r1,r1      1
    c     lw   r2, x         3
    d     mult r1,r1,r2      2
    e     lw   r2, y         3
    f     mult r1,r1,r2      2
    g     lw   r2, z         3
    h     mult r1,r1,r2      2
    i     sw   r1, a         3

[Figure: dependence DAG over nodes a through i, each node annotated with its latency]

Critical path: _______

Example

    Node  Code               Latency  Height
    a     lw   r1, w         3        13
    b     add  r1,r1,r1      1        10
    c     lw   r2, x         3        12
    d     mult r1,r1,r2      2         9
    e     lw   r2, y         3        10
    f     mult r1,r1,r2      2         7
    g     lw   r2, z         3         8
    h     mult r1,r1,r2      2         5
    i     sw   r1, a         3         3

[Figure: the same dependence DAG, each node annotated with its height, starting from node a]

Schedule: ___ cycles
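The height computation and the candidate-set loop from the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's reference scheduler: the successor lists in `SUCCS` are reconstructed from the flow of values through r1 and r2 (true dependences only, so the reuse of r2 is assumed to be handled by renaming), the machine is assumed to issue one instruction per cycle, and the only heuristic used is "largest height first". The names `compute_heights`, `critical_path`, and `list_schedule` are hypothetical.

```python
# Node latencies from the slide's assumptions: memory ops 3, mult 2, rest 1.
LATENCY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# ASSUMPTION: edges reconstructed from true (RAW) dependences only.
# An entry u: [v, ...] means u must produce its result before v can issue.
SUCCS = {
    "a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
    "f": ["h"], "g": ["h"], "h": ["i"], "i": [],
}


def compute_heights(succs, latency):
    """height(n) = exec(n) + max height over successors (0 for a leaf)."""
    heights = {}

    def visit(n):
        if n not in heights:
            heights[n] = latency[n] + max((visit(m) for m in succs[n]), default=0)
        return heights[n]

    for n in succs:
        visit(n)
    return heights


def critical_path(succs, heights):
    """Follow the tallest successor from the tallest node down to a leaf."""
    node = max(heights, key=heights.get)
    path = [node]
    while succs[node]:
        node = max(succs[node], key=heights.get)
        path.append(node)
    return path


def list_schedule(succs, latency, heights):
    """Single-issue list scheduling; returns the issue cycle of each node."""
    preds = {n: [] for n in succs}
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)

    result_ready = {}          # node -> first cycle its result can be used
    issue = {}                 # node -> cycle in which it is issued
    unscheduled = set(succs)
    cycle = 1
    while unscheduled:
        # Candidates: every predecessor has issued and its latency has elapsed.
        candidates = [
            n for n in unscheduled
            if all(p in result_ready and result_ready[p] <= cycle for p in preds[n])
        ]
        if candidates:
            # Heuristic: largest height first; ties broken by name.
            best = min(candidates, key=lambda n: (-heights[n], n))
            issue[best] = cycle
            result_ready[best] = cycle + latency[best]
            unscheduled.remove(best)
        cycle += 1             # an empty candidate set models a stall (nop)
    return issue


heights = compute_heights(SUCCS, LATENCY)
path = critical_path(SUCCS, heights)
issue = list_schedule(SUCCS, LATENCY, heights)
finish = max(issue[n] + LATENCY[n] - 1 for n in issue)

print("heights:", heights)
print("critical path:", " ".join(path))
for n in sorted(issue, key=issue.get):
    print(f"cycle {issue[n]:2d}: {n}")
print("last instruction completes in cycle", finish)
```

Under these assumptions the computed heights agree with the height table above. A different tie-breaking rule, a wider issue width, or extra anti-dependence edges on r2 would change the resulting schedule.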
Global Scheduling: Superblock
  - Definition:
      - single trace of contiguous, frequently executed blocks
      - a single entry and multiple exits
  - Formation algorithm:
      - pick a trace of frequently executed basic blocks
      - eliminate side entrances (tail duplication)
  - Scheduling and optimization:
      - speculate operations in the superblock
      - apply optimization to scope defined by superblock

Superblock Formation
[Figure: two control-flow graphs with execution counts. "Select a trace": A (100), B (90), C (10), D (0), E (90), F (100). "Tail duplicate": the same blocks, with F (90) and a duplicate F' (10).]

Optimizations within Superblock
  - By limiting the scope of optimization to superblock:
      - optimize for the frequent path
      - may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion, ...)
  - For example: CSE

[Figure: three stages of a small control-flow graph. Trace selection: r1 = r2*3 followed by r3 = r2*3 on the trace, with r2 = r2 + 1 on the side path. Tail duplication: r3 = r2*3 is duplicated for the side path. CSE within superblock (no merge since single entry): r3 = r1 on the trace, r3 = r2*3 in the duplicate.]

Scheduling Algorithm Complexity
  - Time complexity: O(n^2)
      - Building dependence DAG: worst-case O(n^2)
          - n = max number of instructions in basic block
          - Each instruction must be compared to every other instruction
      - Scheduling then requires each instruction be inspected at each step = O(n^2)
  - Average-case: small constant (e.g., 3)

Very Long Instruction Word (VLIW)
  - Compiler determines exactly what is issued every cycle (before the program is run)
  - Schedules also account for latencies
  - All hardware changes result in a compiler change
  - Usually embedded systems (hence simple HW)
  - Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code
VLIW processor: 5 issue
  - 2 Add/Sub units (1 cycle)
  - 1 Mul/Div unit (2 cycle, unpipelined)
  - 1 LD/ST unit (2 cycle, pipelined)
  - 1 Branch unit (no delay slots)

    Add/Sub    Add/Sub    Mul/Div    Ld/St         Branch
    c=a+b      d=a-b      e=a*b      ld j = [x]    nop
    g=c+d      h=c-d      nop        ld k = [y]    nop
    nop        nop        i=j*c      ld f = [z]    br g

Next Time
  - Phase-ordering
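As a supplement to the Data Dependences and Scheduling Algorithm Complexity slides, the sketch below builds a dependence DAG by comparing every instruction with every later one, which is where the worst-case O(n^2) cost comes from. It is a minimal illustration, assuming each instruction is described only by the registers it writes and reads (memory dependences are ignored, matching the register-based definition on the slide). `Instr`, `classify`, and `build_dag` are hypothetical names chosen for this example.

```python
from collections import namedtuple

# Hypothetical instruction record: a label plus the register sets it touches.
Instr = namedtuple("Instr", ["name", "writes", "reads"])


def classify(earlier, later):
    """Return the dependence kinds that force `earlier` to stay before `later`."""
    kinds = set()
    if earlier.writes & later.reads:
        kinds.add("RAW")   # later reads what earlier wrote
    if earlier.writes & later.writes:
        kinds.add("WAW")   # both write the same register
    if earlier.reads & later.writes:
        kinds.add("WAR")   # later overwrites what earlier still reads
    return kinds


def build_dag(block):
    """Pairwise comparison of all instructions: n*(n-1)/2 checks, i.e. O(n^2)."""
    edges = {}
    for i, earlier in enumerate(block):
        for later in block[i + 1:]:
            kinds = classify(earlier, later)
            if kinds:
                edges[(earlier.name, later.name)] = kinds
    return edges


# The four-instruction block from the Basic Block Scheduling Example slide.
block = [
    Instr("a", {"R2"}, {"R1"}),         # lw R2, (R1)
    Instr("b", {"R3"}, {"R1"}),         # lw R3, (R1) 4
    Instr("c", {"R4"}, {"R2", "R3"}),   # R4 <- R2 + R3
    Instr("d", {"R5"}, {"R2"}),         # R5 <- R2 - 1
]
edges = build_dag(block)
for (src, dst), kinds in sorted(edges.items()):
    print(f"{src} -> {dst}: {', '.join(sorted(kinds))}")
```

A production scheduler would typically avoid the full pairwise scan by tracking the last writer and the readers of each register, and it would add edges for memory operations that may alias. This sketch keeps the quadratic loop because it mirrors the slide's description directly.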
