Lectures 14 & 15: Instruction Scheduling
6.035, Fall 2005
Saman Amarasinghe, ©MIT Fall 1998

Simple Machine Model

• Instructions are executed in sequence
  – Fetch, decode, execute, store results
  – One instruction at a time
• For branch instructions, start fetching from a different location if needed
  – Check branch condition
  – Next instruction may come from a new location given by the branch instruction

Simple Execution Model

• 5-stage pipeline: fetch, decode, execute, memory, writeback
  – Fetch (IF): get the next instruction
  – Decode (DE): figure out what that instruction is
  – Execute (EXE): perform the ALU operation (address calculation in a memory op)
  – Memory (MEM): do the memory access in a memory op
  – Write Back (WB): write the results back
• Consecutive instructions overlap in the pipeline, one stage apart:

    Inst 1:  IF  DE  EXE MEM WB
    Inst 2:      IF  DE  EXE MEM WB
    Inst 3:          IF  DE  EXE MEM WB
    Inst 4:              IF  DE  EXE MEM WB
    Inst 5:                  IF  DE  EXE MEM WB

From a Simple Machine Model to a Real Machine Model

• Many pipeline stages
  – Pentium               5
  – Pentium Pro          10
  – Pentium IV (130nm)   20
  – Pentium IV (90nm)    31
• Different instructions take different amounts of time to execute
• Hardware stalls the pipeline if an instruction uses a result that is not ready

Real Machine Model cont.

• Most modern processors have multiple execution units (superscalar)
  – If the instruction sequence is right, multiple operations will happen in the same cycles
  – Even more important to have the right instruction sequence

Constraints On Scheduling

• Data dependencies
• Control dependencies
• Resource constraints

Data Dependency between Instructions

• If two instructions access the same variable, they can be dependent
• Kinds of dependencies
  – True: write → read
  – Anti: read → write
  – Output: write → write
• What to do if two instructions are dependent?
  – The order of execution cannot be reversed
  – Dependencies reduce the possibilities for scheduling

Computing Dependencies

• For basic blocks, compute dependencies by walking through the instructions
• Identifying register dependencies is simple
  – is it the same register?
• For memory accesses, progressively harder analyses are needed:
  – simple: base + offset1 ?= base + offset2
  – data dependence analysis: a[2i] ?= a[2i+1]
  – interprocedural analysis: global ?= parameter
  – pointer alias analysis: p1 ?= p

Representing Dependencies

• Use a dependence DAG, one per basic block
• Nodes are instructions, edges represent dependencies
• Each edge is labeled with a latency:
  – v(i → j) = delay required between the initiation times of i and j,
    minus the execution time required by i

Example

    1: r2 = *(r1 + 4)
    2: r3 = *(r1 + 8)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

  Edges (each labeled with the load latency 2): 1 → 3, 1 → 4, 2 → 3

Another Example

    1: r2 = *(r1 + 4)
    2: *(r1 + 4) = r3
    3: r3 = r2 + r3
    4: r5 = r2 - 1

  The store adds memory and anti dependencies to the DAG as well
  (e.g. 1 → 2 through the memory location, 2 → 3 through r3)

Control Dependencies and Resource Constraints

• For now, let's worry only about basic blocks
• For now, let's look at simple pipelines

Example

                               Results in
    1: lea  var_a, %rax        1 cycle
    2: add  $4, %rax           1 cycle
    3: inc  %r11               1 cycle
    4: mov  4(%rsp), %r10      3 cycles
    5: add  %r10, 8(%rsp)
    6: and  16(%rsp), %rbx     4 cycles
    7: imul %rax, %rbx         3 cycles

List Scheduling Algorithm

• Idea
  – Do a topological sort of the dependence DAG
  – Consider when an instruction can be scheduled without causing a stall
  – Schedule the instruction if it causes no stall and all its
    predecessors are already scheduled
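Before any scheduling happens, the dependence DAG from the slides above has to be built. A minimal Python sketch (my own illustration, not code from the lecture): registers only, with the memory-disambiguation questions above assumed away, and instructions encoded as hypothetical (defs, uses) sets.

```python
def build_dag(instrs):
    """instrs: list of (defs, uses) register-name sets, in program order.
    Returns {(i, j): kind} for every dependence edge i -> j with i < j."""
    edges = {}
    for j, (defs_j, uses_j) in enumerate(instrs):
        for i in range(j):
            defs_i, uses_i = instrs[i]
            if defs_i & uses_j:
                edges[(i, j)] = "true"     # write -> read
            elif uses_i & defs_j:
                edges[(i, j)] = "anti"     # read -> write
            elif defs_i & defs_j:
                edges[(i, j)] = "output"   # write -> write
    return edges

# The first example block: 1: r2 = *(r1+4)  2: r3 = *(r1+8)
#                          3: r4 = r2 + r3  4: r5 = r2 - 1
block = [({"r2"}, {"r1"}), ({"r3"}, {"r1"}),
         ({"r4"}, {"r2", "r3"}), ({"r5"}, {"r2"})]
deps = build_dag(block)
print(deps)  # {(0, 2): 'true', (1, 2): 'true', (0, 3): 'true'}
```

The three true dependences recovered here are exactly the load-to-use edges labeled with latency 2 in the example DAG above.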
• Optimal list scheduling is NP-complete
  – Use heuristics when necessary
• Issued strictly in order, the example above stalls:

    1  2  3  4  st  st  5  6  st  st  st  7

List Scheduling Algorithm

• Create a dependence DAG of a basic block
• Topological sort:

    READY = nodes with no predecessors
    Loop until READY is empty
      Schedule each node in READY when it causes no stall
      Update READY

Heuristics for selection

• Heuristics for selecting from the READY list
  – pick the node with the longest path to a leaf in the dependence graph
  – pick the node with the most immediate successors
  – pick a node that can go to a less busy pipeline (in a superscalar)
• Longest path to a leaf: algorithm (for node x)
  – if x has no successors, dx = 0
  – otherwise dx = MAX(dy + cxy) over all successors y of x
  – computed in reverse breadth-first visitation order
• Most immediate successors: algorithm (for node x)
  – fx = number of successors of x

Example

                               Results in
    1: lea  var_a, %rax        1 cycle
    2: add  $4, %rax           1 cycle
    3: inc  %r11               1 cycle
    4: mov  4(%rsp), %r10      3 cycles
    5: add  %r10, 8(%rsp)
    6: and  16(%rsp), %rbx     4 cycles
    7: imul %rax, %rbx         3 cycles
    8: mov  %rbx, 16(%rsp)
    9: lea  var_b, %rax

• Each node annotated with d (longest path to a leaf) and
  f (number of immediate successors):

    1: d=5 f=1   2: d=4 f=1   3: d=0 f=0
    4: d=3 f=1   5: d=0 f=0   6: d=7 f=1
    7: d=3 f=2   8: d=0 f=0   9: d=0 f=0

Resource Constraints

• Modern machines have many resource constraints
• Superscalar architectures:
  – can run a few operations in parallel
  – but have constraints
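The longest-path values d annotated in the example above can be computed bottom-up, exactly as the heuristic defines them: d(x) = 0 at a leaf, otherwise the maximum of d(y) + c(x,y) over successors y. A small sketch (the DAG and latencies here are illustrative, mirroring the earlier two-load example, not the 9-instruction block):

```python
def critical_path(succs, latency, nodes):
    """succs: {node: [successor nodes]}; latency: {(x, y): cycles};
    nodes: all nodes in topological order. Returns {node: d}."""
    d = {}
    for x in reversed(nodes):   # leaves first, so d[y] is known when x needs it
        d[x] = max((d[y] + latency[(x, y)] for y in succs.get(x, [])),
                   default=0)
    return d

# 1: load, 2: load, 3: add (uses both), 4: sub (uses 1); load latency 2
succs = {1: [3, 4], 2: [3]}
latency = {(1, 3): 2, (1, 4): 2, (2, 3): 2}
d = critical_path(succs, latency, [1, 2, 3, 4])
print(d)  # {4: 0, 3: 0, 2: 2, 1: 2}
```

Ties in d are then broken by f, the immediate-successor count.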
Example cont.

• In-order issue with stalls takes 14 cycles; the list-scheduled order
  takes 9 (14 cycles vs 9 cycles)

Resource Constraints of a Superscalar Processor

• Represent the superscalar architecture as multiple pipelines
  – Each pipeline represents some resource
• Example:
  – One fully pipelined reg-to-reg unit
    • all integer operations take one cycle
  – In parallel with one fully pipelined memory-to/from-reg unit
    • data loads take two cycles
    • data stores take one cycle

List Scheduling Algorithm with resource constraints

• Create a dependence DAG of a basic block
• Topological sort:

    READY = nodes with no predecessors
    Loop until READY is empty
      Let n ∈ READY be the node with the highest priority
      Schedule n in the earliest slot that satisfies
        precedence + resource constraints
      Update READY

• Example machine:
  – One single-cycle reg-to-reg ALU unit
  – One two-cycle pipelined reg-to/from-memory unit

Example

    1: lea  var_a, %rax
    2: add  4(%rsp), %rax
    3: inc  %r11
    4: mov  4(%rsp), %r10
    5: mov  %r10, 8(%rsp)
    6: and  $0x00ff, %rbx
    7: imul %rax, %rbx
    8: lea  var_b, %rax
    9: mov  %rbx, 16(%rsp)

• Node annotations (d = longest path to a leaf, f = immediate successors):

    1: d=4 f=1   2: d=3 f=1   3: d=0 f=0
    4: d=2 f=1   5: d=0 f=0   6: d=2 f=1
    7: d=1 f=2   8: d=0 f=0   9: d=0 f=0

• Slots to fill: ALUop, MEM 1, MEM 2 (the two stages of the memory pipeline)
• Initially READY = { 1, 6, 4, 3 }
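The scheduling loop above can be sketched directly. In this sketch (the helper names are my own, not from the lecture), every unit is assumed fully pipelined, so a unit only conflicts on its issue cycle, and latencies live on the DAG edges:

```python
def schedule(nodes, preds, latency, unit, priority):
    """Greedy list scheduling with resource constraints.
    preds: {node: [predecessors]}; latency: {(p, n): cycles};
    unit: {node: unit name}. Returns {node: issue cycle}."""
    start = {}
    busy = set()                      # (unit, cycle) issue slots already taken
    ready = [n for n in nodes if not preds.get(n)]
    while ready:
        ready.sort(key=priority, reverse=True)
        n = ready.pop(0)              # highest-priority ready node
        # earliest cycle satisfying precedence constraints ...
        t = max((start[p] + latency[(p, n)] for p in preds.get(n, [])),
                default=0)
        # ... then push past resource conflicts (one issue per unit per cycle)
        while (unit[n], t) in busy:
            t += 1
        busy.add((unit[n], t))
        start[n] = t
        ready += [m for m in nodes
                  if m not in start and m not in ready
                  and all(p in start for p in preds.get(m, []))]
    return start

# Two loads feeding an add, on a machine with one MEM and one ALU pipeline:
# the loads contend for MEM's issue slot, the add waits out the 2-cycle latency.
s = schedule([1, 2, 3], {3: [1, 2]}, {(1, 3): 2, (2, 3): 2},
             {1: "MEM", 2: "MEM", 3: "ALU"}, priority=lambda n: -n)
print(s)  # {1: 0, 2: 1, 3: 3}
```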
• The scheduler repeatedly picks the highest-priority node in READY and
  places it in the earliest slot that satisfies precedence and resource
  constraints:
  – schedule 1; 2 becomes ready:        READY = { 2, 6, 4, 3 }
  – schedule 2:                         READY = { 6, 4, 3 }
  – schedule 6; 7 becomes ready:        READY = { 4, 7, 3 }
  – schedule 4; 5 becomes ready:        READY = { 7, 3, 5 }
  – schedule 7; 8 and 9 become ready:   READY = { 3, 5, 8, 9 }
• Scheduling continues until READY is empty:
  – schedule 3:                         READY = { 5, 8, 9 }
  – schedule 5:                         READY = { 8, 9 }
  – schedule 8, then 9:                 READY = { }
• All nine instructions are now placed across the ALUop and MEM slots

Scheduling across basic blocks

• The number of instructions in a basic block is small
  – Cannot keep multiple units with long pipelines busy by scheduling
    within a basic block alone
• Need to handle control dependence
  – Scheduling constraints across basic blocks
  – Scheduling policy

Moving across basic blocks

• Downward to an adjacent basic block (from A into its successor B)
  – Is there a path to B that does not execute A?
• Upward to an adjacent basic block (from C into its predecessor A)
  – Is there a path from A that does not reach C?

Control Dependencies

• Constraints in moving instructions across basic blocks:

    if ( ... )
      a = b op c

  – the assignment must not move above the branch if doing so changes
    observable behavior

    if ( valid address? )
      d = *(a1)

  – the load must not be hoisted above the address-validity check

Trace Scheduling

• Find the most common trace of basic blocks
  – Use profile information
• Combine the basic blocks in the trace and schedule them as one block
• Create clean-up code if the execution goes off-trace

Large Basic Blocks via Code Duplication

• Create large extended basic blocks by duplication
  – e.g. a join block is copied onto each incoming path
• Schedule the larger blocks

Scheduling Loops

• Loop bodies are small
• But a lot of time is spent in loops due to the large number of iterations
• Need better ways to schedule loops

Loop Example

• Machine
  – One load/store unit
    • load: 2 cycles
    • store: 2 cycles
  – Two arithmetic units
    • add: 2 cycles
    • branch: 2 cycles
    • multiply: 3 cycles
  – Both units are pipelined (initiate one op each cycle)
• Source code:

    for i = 1 to N
      A[i] = A[i] * b

• Assembly code:

    loop: mov  (%rdi,%rax), %r10
          imul %r11, %r10
          mov  %r10, (%rdi,%rax)
          sub  $4, %rax
          bge  loop

• Longest-path values in the body's dependence DAG:
  mov (load) d=7, imul d=5, mov (store) d=2, sub d=2, bge d=0
• Schedule: 9 cycles per iteration

Loop Unrolling

• Unroll the loop body a few times
• Pros:
  – Creates a much larger basic block for the body
  – Eliminates a few loop-bounds checks
• Cons:
  – Much larger program
  – Setup code needed (# of iterations < unroll factor)
  – The beginning and end of the schedule can still have unused slots
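A source-level sketch of what 2-way unrolling does to the loop above (for i = 1 to N: A[i] = A[i] * b), including the remainder code that the cons mention for when the iteration count is not a multiple of the unroll factor:

```python
def scale(A, b):
    """Original loop: one multiply, one bounds check per iteration."""
    for i in range(len(A)):
        A[i] *= b

def scale_unrolled(A, b):
    """Unrolled by 2: a larger body, half as many bounds checks."""
    n = len(A)
    i = 0
    while i + 1 < n:        # main loop: two iterations per pass
        A[i] *= b
        A[i + 1] *= b
        i += 2
    while i < n:            # remainder when n is not a multiple of 2
        A[i] *= b
        i += 1

A = [1, 2, 3, 4, 5]
scale_unrolled(A, 3)
print(A)  # [3, 6, 9, 12, 15]
```

The compiler does the same thing to the assembly body, which is why the unrolled listing that follows repeats the load/multiply/store/subtract sequence twice before the branch.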
• Unrolled twice, the body becomes:

    loop: mov  (%rdi,%rax), %r10
          imul %r11, %r10
          mov  %r10, (%rdi,%rax)
          sub  $4, %rax
          mov  (%rdi,%rax), %r10
          imul %r11, %r10
          mov  %r10, (%rdi,%rax)
          sub  $4, %rax
          bge  loop

  with longest-path values mov d=14, imul d=12, mov d=9, sub d=9,
  mov d=7, imul d=5, mov d=2, sub d=2, bge d=0
• Schedule: 8 cycles per iteration

• Rename registers
  – Use different registers in different iterations
    (%rcx replaces the second copy of %r10)
• Eliminate unnecessary dependencies
  – again, use more registers to eliminate true, anti, and output
    dependencies
  – eliminate dependent chains of calculations when possible
    (here, a second address register %rbx):

    loop: mov  (%rdi,%rax), %r10
          imul %r11, %r10
          mov  %r10, (%rdi,%rax)
          sub  $8, %rax
          mov  (%rdi,%rbx), %rcx
          imul %r11, %rcx
          mov  %rcx, (%rdi,%rbx)
          sub  $8, %rbx
          bge  loop

• Schedule: 4.5 cycles per iteration

Software Pipelining

• Try to overlap multiple iterations so that the empty slots get filled
• Find the steady-state window so that:
  – all the instructions of the loop body are executed
  – but from different iterations
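To show the shape software pipelining produces, here is a hand-pipelined source-level sketch of the loop above (my own illustration). The three-stage body (load, multiply, store) is split so that, in steady state, one pass executes the store of iteration i, the multiply of iteration i+1, and the load of iteration i+2, with a prologue to fill and an epilogue to drain:

```python
def swp_scale(A, b):
    """Software-pipelined A[i] *= b with three iterations in flight."""
    n = len(A)
    if n < 2:
        for i in range(n):
            A[i] *= b
        return
    # prologue: start the first two iterations
    t0 = A[0] * b          # iteration 0: load + multiply done
    t1 = A[1]              # iteration 1: load done
    for i in range(n - 2): # steady-state window
        A[i] = t0          # store of iteration i
        t0 = t1 * b        # multiply of iteration i+1
        t1 = A[i + 2]      # load of iteration i+2
    # epilogue: drain the last two iterations
    A[n - 2] = t0
    A[n - 1] = t1 * b

A = [1, 2, 3, 4]
swp_scale(A, 10)
print(A)  # [10, 20, 30, 40]
```

The real scheduler does this at the instruction level, which is why it needs extra registers: t0 and t1 here play the role of the renamed copies of %r10.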
Software Pipelining: Loop Example

    loop: mov  (%rdi,%rax), %r10
          imul %r11, %r10
          mov  %r10, (%rdi,%rax)
          sub  $4, %rax
          bge  loop

• 4 iterations are overlapped in the software-pipelined schedule
  – the value of %r11 doesn't change
  – 4 registers for (%rdi,%rax); each address is incremented by 4*4
  – 4 registers to keep the value of %r10
  – the same registers can be reused after 4 of these blocks;
    generate code for 4 blocks, otherwise values need to be moved
• In the steady state, each block issues the load, multiply, store,
  subtract, and branch of different in-flight iterations

Software Pipelining cont.

• Optimal use of resources
• Needs a lot of registers
  – values from multiple iterations need to be kept
• Issues in dependencies
  – a store instruction of one iteration executes before the branch
    instruction of a previous iteration (writing when it should not have)
  – loads and stores are issued out of order (need to figure out
    dependencies before doing this)
• Code generation issues
  – generate preamble and postamble code
  – multiple blocks, so no register copies are needed

Register Allocation and Instruction Scheduling

• If register allocation is done before instruction scheduling
  – it restricts the choices for scheduling

Example

    1: mov 4(%rbp), %rax
    2: add %rax, %rbx
    3: mov 8(%rbp), %rax
    4: add %rax, %rcx

• Reusing %rax creates an anti dependence from 2 to 3: the second load
  cannot start until the first add has read %rax

    1: mov 4(%rbp), %rax
    2: add %rax, %rbx
    3: mov 8(%rbp), %r10
    4: add %r10, %rcx

• With %r10, the two loads can overlap in the memory pipeline
Register Allocation and Instruction Scheduling cont.

• If register allocation is done before instruction scheduling
  – it restricts the choices for scheduling
• If instruction scheduling is done before register allocation
  – register allocation may spill registers
  – spill code will change the carefully done schedule!!!

Superscalar: Where have all the transistors gone?

• Out-of-order execution
  – If an instruction stalls, go beyond it and start executing
    non-dependent instructions
  – Pros:
    • hardware scheduling
    • tolerates unpredictable latencies
  – Cons:
    • the instruction window is small
• Register renaming
  – If an anti or output dependency on a register stalls the pipeline,
    use a different hardware register
  – Pros:
    • avoids anti and output dependencies
  – Cons:
    • cannot do the more complex transformations that eliminate
      dependencies

Hardware vs. Compiler

• In a superscalar, hardware and compiler scheduling can work hand in hand
• Hardware can reduce the burden when latencies are not predictable by
  the compiler
• The compiler can still greatly enhance performance
  – large instruction window for scheduling
  – many program transformations that increase parallelism
• The compiler is even more critical when there is no hardware support
  – VLIW machines (Itanium, DSPs)