Optimizing Compilers
CISC 673, Spring 2009
Instruction Scheduling

John Cavazos
University of Delaware

UNIVERSITY OF DELAWARE • COMPUTER & INFORMATION SCIENCES DEPARTMENT

Instruction Scheduling
  Reordering instructions to improve performance
  Takes into account anticipated latencies
  Machine-specific
  Performed late in optimization pass
  Instruction-Level Parallelism (ILP)

Modern Architectures Features
  Superscalar
    Multiple logic units
  Multiple issue
    2 or more instructions issued per cycle
  Speculative execution
    Branch predictors
    Speculative loads
  Deep pipelines

Types of Instruction Scheduling
  Local Scheduling
    Basic Block Scheduling
  Global Scheduling
    Trace Scheduling
    Superblock Scheduling
    Software Pipelining

Scheduling for different Computer Architectures
  Out-of-order issue: scheduling is useful
  In-order issue: scheduling is very important
  VLIW: scheduling is essential!

Challenges to ILP
  Structural hazards
    Insufficient resources to exploit parallelism
  Data hazards
    Instruction depends on result of previous instruction still in pipeline
  Control hazards
    Branches & jumps modify the PC and affect which instructions should be in the pipeline

Recall from Architecture…
  IF – Instruction Fetch
  ID – Instruction Decode
  EX – Execute
  MA – Memory access
  WB – Write back

  IF    ID    EX    MA    WB
        IF    ID    EX    MA    WB
              IF    ID    EX    MA    WB

Structural Hazards
  Instruction latency: execute takes > 1 cycle

  addf R3,R1,R2    IF    ID    EX    EX    MA    WB
  addf R3,R3,R4          IF    ID    stall EX    EX    MA    WB

  Assumes floating point ops take 2 execute cycles

Data Hazards
  Memory latency: data not ready

  lw  R1,0(R2)     IF    ID    EX    MA    WB
  add R3,R1,R4           IF    ID    stall EX    MA    WB

Control Hazards

  Taken Branch         IF    ID    EX    MA    WB
  Instr + 1                  IF    ---   ---   ---   ---
  Branch Target                    IF    ID    EX    MA    WB
  Branch Target + 1                      IF    ID    EX    MA    WB

Basic Block Scheduling
  For each basic block:
    Construct directed acyclic graph (DAG) using dependences between statements
      Node = statement / instruction
      Edge (a,b) = statement a must execute before b
    Schedule instructions using the DAG

Data Dependences
  If two operations access the same register and one access is a write, they are dependent
  Types of data dependences

    RAW = Read after Write    WAW = Write after Write    WAR = Write after Read
    r1 = r2 + r3              r1 = r2 + r3               r1 = r2 + r3
    r4 = r1 * 6               r1 = r4 * 6                r2 = r5 * 6

  Cannot reorder two dependent instructions

Basic Block Scheduling Example
  Original Schedule
    a) lw R2, (R1)
    b) lw R3, 4(R1)
    c) R4 = R2 + R3
    d) R5 = R2 - 1

  Schedule 1 (5 cycles)
    a) lw R2, (R1)
    b) lw R3, 4(R1)
       --- nop ---
    c) R4 = R2 + R3
    d) R5 = R2 - 1

  Dependence DAG (edge labels are latencies)
    a -> c (2)
    b -> c (2)
    a -> d (2)

  Schedule 2 (4 cycles)
    a) lw R2, (R1)
    b) lw R3, 4(R1)
    d) R5 = R2 - 1
    c) R4 = R2 + R3

Scheduling Algorithm
  Construct dependence DAG on basic block
  Put roots in candidate set
  Use scheduling heuristics (in order) to select instruction
  While candidate set not empty
    Evaluate all candidates and select best one
    Delete scheduled instruction from candidate set
    Add newly-exposed candidates

Instruction Scheduling Heuristics
  Optimal scheduling is NP-complete, so we need heuristics
  Bias scheduler to prefer instructions:
    Earliest execution time
    Have many successors
      More flexibility in scheduling
    Progress along critical path
    Free registers
      Reduce register pressure
  Can be a combination of heuristics

Computing Priorities
  Height(n) = exec(n)                        if n is a leaf
            = max(Height(m)) + exec(n)       otherwise, over all successors m of n
  Critical path(s) = path through the dependence DAG with longest latency

Example – Determine Height and CP
  Code
    a  lw   r1, w
    b  add  r1,r1,r1
    c  lw   r2,x
    d  mult r1,r1,r2
    e  lw   r2,y
    f  mult r1,r1,r2
    g  lw   r2,z
    h  mult r1,r1,r2
    i  sw   r1, a
  Assume (cycles to have result in register):
    memory instrs = 3
    mult = 2
    rest = 1 cycle
  Dependence DAG (edge labels are latencies)
    a -> b (3)
    b -> d (1)
    c -> d (3)
    d -> f (2)
    e -> f (3)
    f -> h (2)
    g -> h (3)
    h -> i (2)
  Critical path: _______

Example
  Code and dependence DAG as on the previous slide
  Heights
    a 13    b 10    c 12
    d  9    e 10    f  7
    g  8    h  5    i  3
  Schedule (from start): ___ cycles

Global Scheduling: Superblock
  Definition:
    single trace of contiguous, frequently executed blocks
    a single entry and multiple exits
  Formation algorithm:
    pick a trace of frequently executed basic blocks
    eliminate side entrances (tail duplication)
  Scheduling and optimization:
    speculate operations in the superblock
    apply optimization to scope defined by superblock

Superblock Formation
  [Figure: two control-flow graphs over blocks A to F annotated with execution counts. Left panel: "Select a trace". Right panel: "Tail duplicate", where F is duplicated as F'.]

Optimizations within Superblock
  By limiting the scope of optimization to superblock:
    optimize for the frequent path
    may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion, ...)
  For example: CSE
    Original code
      r1 = r2*3
      r2 = r2 + 1        (side path)
      r3 = r2*3
    After trace selection and tail duplication
      superblock:  r1 = r2*3 ; r3 = r2*3
      side path:   r2 = r2 + 1 ; r3 = r2*3
    After CSE within superblock (no merge since single entry)
      superblock:  r1 = r2*3 ; r3 = r1
      side path:   r2 = r2 + 1 ; r3 = r2*3

Scheduling Algorithm Complexity
  Time complexity: O(n^2)
    n = max number of instructions in basic block
  Building dependence DAG: worst-case O(n^2)
    Each instruction must be compared to every other instruction
  Scheduling then requires each instruction be inspected at each step = O(n^2)
  Average-case: small constant (e.g., 3)

Very Long Instruction Word (VLIW)
  Compiler determines exactly what is issued every cycle (before the program is run)
  Schedules also account for latencies
  All hardware changes result in a compiler change
  Usually embedded systems (hence simple HW)
  Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code
  VLIW processor: 5 issue
    2 Add/Sub units (1 cycle)
    1 Mul/Div unit (2 cycle, unpipelined)
    1 LD/ST unit (2 cycle, pipelined)
    1 Branch unit (no delay slots)

  Add/Sub   Add/Sub   Mul/Div   Ld/St        Branch
  c=a+b     d=a-b     e=a*b     ld j = [x]   nop
  g=c+d     h=c-d     nop       ld k = [y]   nop
  nop       nop       i=j*c     ld f = [z]   br g

Next Time
  Phase-ordering
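
Worked sketch: height priorities and list scheduling

The Scheduling Algorithm and Computing Priorities slides can be tried out on the nine-instruction example (a through i). The Python below is a minimal sketch, not the course's reference implementation. It assumes a single-issue machine, uses Height(n) as the only heuristic, breaks ties by instruction name, and treats an instruction as ready once every predecessor's latency has elapsed. The edge list is one reading of the example's dependence DAG (true dependences only). The names LATENCY, SUCCS, heights, critical_path and list_schedule are made up for this illustration.

```python
from typing import Dict, List, Tuple

# Latencies assumed in the example: memory instructions 3, mult 2, rest 1.
LATENCY: Dict[str, int] = {
    "a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3,
}

# Dependence DAG of the example: instruction -> instructions that depend on it.
SUCCS: Dict[str, List[str]] = {
    "a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
    "f": ["h"], "g": ["h"], "h": ["i"], "i": [],
}


def heights(succs: Dict[str, List[str]], latency: Dict[str, int]) -> Dict[str, int]:
    """Height(n) = exec(n) for a leaf, else max over successors + exec(n)."""
    memo: Dict[str, int] = {}

    def height(n: str) -> int:
        if n not in memo:
            memo[n] = latency[n] + max((height(m) for m in succs[n]), default=0)
        return memo[n]

    for n in succs:
        height(n)
    return memo


def critical_path(succs: Dict[str, List[str]], latency: Dict[str, int]) -> List[str]:
    """Follow the tallest successor from the tallest root down to a leaf."""
    h = heights(succs, latency)
    has_pred = {m for ms in succs.values() for m in ms}
    node = max((n for n in succs if n not in has_pred), key=lambda n: h[n])
    path = [node]
    while succs[node]:
        node = max(succs[node], key=lambda m: h[m])
        path.append(node)
    return path


def list_schedule(
    succs: Dict[str, List[str]], latency: Dict[str, int]
) -> Tuple[Dict[str, int], int]:
    """Single-issue list scheduling; returns (start cycle per instruction, length)."""
    prio = heights(succs, latency)
    preds_left = {n: 0 for n in succs}
    for n in succs:
        for m in succs[n]:
            preds_left[m] += 1
    ready_at = {n: 1 for n in succs}            # first cycle operands are available
    candidates = {n for n in succs if preds_left[n] == 0}   # roots of the DAG
    start: Dict[str, int] = {}
    cycle = 1
    while candidates:
        ready = [n for n in candidates if ready_at[n] <= cycle]
        if ready:                               # otherwise this cycle is a stall
            best = min(ready, key=lambda n: (-prio[n], n))
            start[best] = cycle
            candidates.remove(best)
            for m in succs[best]:               # expose newly available candidates
                ready_at[m] = max(ready_at[m], cycle + latency[best])
                preds_left[m] -= 1
                if preds_left[m] == 0:
                    candidates.add(m)
        cycle += 1
    length = max(start[n] + latency[n] - 1 for n in start)
    return start, length


if __name__ == "__main__":
    print(heights(SUCCS, LATENCY))
    print(critical_path(SUCCS, LATENCY))
    print(list_schedule(SUCCS, LATENCY))
```

Under these assumptions the sketch issues the instructions in the order a, c, e, b, d, g, f, h, i, with stalls before h and before i. A different tie-break or a multi-issue machine model would give a different schedule.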