Compiling for EDGE Architectures: The TRIPS Prototype Compiler
Kathryn McKinley
Doug Burger, Steve Keckler, Jim Burrill (1), Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.
The University of Texas at Austin
(1) University of Massachusetts, Amherst
July 13, 2016    ASPLOS XII

Technology Scaling Hitting the Wall
- Analytically …
- Qualitatively …
- Either way … partitioning for on-chip communication is key
[Figure: 20 mm chip edge at 130 nm, 100 nm, 70 nm, and 35 nm]

OO Superscalars Out of Steam
- Clock ride is over
  - Wire and pipeline limits
  - Quadratic out-of-order issue logic
  - Power, a first-order constraint
- Problems for any architectural solution
  - ILP (instruction-level parallelism)
  - Memory and on-chip latency
- Major vendors ending processor lines
- What's next?

Post-RISC Solutions
- CMP: an evolutionary path
  - Replicate what we already have 2 to N times on a chip
  - Coarse-grain parallelism
  - Exposes the resources to the programmer and compiler
- Explicit Data Graph Execution (EDGE)
  1. Program graph is broken into a sequence of blocks
     - Blocks commit atomically or not at all; a block never partially commits
  2. Dataflow within a block, ISA support for direct producer-consumer communication
     - No shared named registers (point-to-point dataflow edges only)
     - Memory is still a shared namespace
     - The block's dataflow graph (DFG) is explicit in the architecture

Outline
- TRIPS Execution Model & ISA
- TRIPS Architectural Constraints
- Compiler Structure
- Spatial Path Scheduling

Block Atomic Execution Model
- TRIPS block: single-entry constrained hyperblock
- Dataflow execution w/ target position encoding
[Figure: a TRIPS block shown as a flow graph and as a dataflow graph, mapped onto the execution substrate (register file, G-tile, data caches)]

TRIPS Block Constraints
- Registers: 32 reads and 32 writes, 8 to each of 4 banks (in addition to the 128 instructions)
- Memory: 32 load/store queue identifiers
  - More than 32 static loads and stores are possible
- Fixed size: 128 instructions
  - Padded with no-ops if needed
- Constant output: all stores and writes execute, one branch
  - Simplifies hardware logic for detecting block completion
  - Every path of execution through a block must produce the same stores and register writes
- Simplifies the hardware, more work for the compiler
[Figure: a block as a 1-128 instruction DFG with a PC, 32 reads, 32 writes, 32 loads, 32 stores, and a terminating branch, connected to the register banks and memory]

Compiler Phases (Classic)
Scale Compiler (UTexas/UMass)
- Inputs: C, FORTRAN
- Flow: Frontend -> Inlining -> Unrolling/Flattening -> Scalar Optimizations -> Code Generation
- Scalar optimizations: PRE, Global Value Numbering, Scalar Replacement, Global Variable Replacement, SCC, Copy Propagation, Array Access Strength Reduction, LICM, Tree Height Reduction, Useless Copy Removal, Dead Variable Elimination
- Targets: Alpha, SPARC, PPC, TRIPS TIL
- TIL: TRIPS Intermediate Language, RISC-like three-address form
- TASL: TRIPS Assembly Language, dataflow target form w/ locations encoded

Backend Compiler Flow
TIL ->
- Hyperblock Formation: if-conversion, loop peeling, while loop unrolling, instruction merging, predicate optimizations
- Resource Allocation: register allocation, reverse if-conversion & split, load/store ID assignment, SSA for constant outputs
- Scheduling: fanout insertion, instruction placement, target form generation
-> TASL

Correctness: Progressively Satisfy Constraints
- Same phases as the backend compiler flow, checked against each constraint:
  - 128 instructions
  - 32 load/store IDs
  - 32 reg. read/write (8 per each of 4 banks)
  - constant output

Predication & Hyperblock Formation
- Predication
  - Convert control dependence to data dependence
  - Improves instruction fetch bandwidth
  - Eliminates branch mispredictions
  - Adds overhead
  - Any instruction can have a predicate, but...
  - Predicate head (low power) or bottom (speculative)
- Hyperblock
  - Scheduling region (set of basic blocks)
  - Single entry, multiple exit, predicated instructions
  - Expose parallelism w/o oversaturating resources
  - Must satisfy block constraints
[Figure: predicate (P) applied at the head vs. at the bottom of a dependence chain]

Accuracy?
[Same phase and constraint table as "Correctness: Progressively Satisfy Constraints"]

Spatial Scheduling Problem
- Partitioned microarchitecture
- Anchor points
- Balance latency and concurrency
[Figure: a dataflow graph of add, mul, ld, and st instructions placed onto the partitioned substrate]

Outline
- Background
- Spatial Path Scheduling
- Simulated Annealing
- Extending SPS
- Conclusions and Future Work

Dissecting the Problem
- Scheduling can have two components
  - Placement: where an instruction executes
  - Issue: when an instruction executes

                     Static placement     Dynamic placement
    Static issue     VLIW (SPSI)          Bad idea (DPSI)
    Dynamic issue    TRIPS (SPDI)         Superscalars (DPDI)

- EDGE: the TRIPS (SPDI) quadrant

Explicit Data Graph Execution
- Block-atomic execution
  - Instruction groups fetch, execute, and commit atomically
- Direct instruction communication
  - Explicitly encode dataflow graph by specifying targets

    RISC                    EDGE
    add r1, r4, r5          i1: add i3
    add r2, r5, r6          i2: add i3
    add r3, r1, r2          i3: add i4

[Figure: in RISC the three adds communicate through a centralized register file; in EDGE the reads of R4, R5, and R6 and the instructions name their consumers directly]

Scheduling for TRIPS
- TRIPS ISA
  - Up to 128 instructions/block
  - Any instruction can be in any slot
- TRIPS microarchitecture
  - Up to 8 blocks in flight
  - 1 cycle latency between adjacent ALUs
- Known
  - Execution latencies
  - Lower bound for communication latency
- Unknown (estimated)
  - Memory access latencies
  - Resource conflicts
[Figure: TRIPS tile grid: control tile, register file tiles R0-R3 across the top, data cache tiles D0-D3 down the side, execution tiles E0-E15 in a 4x4 array]

Greedy Scheduling for TRIPS
- GRST [PACT '04]: based on VLIW list-scheduling
- Augmented with five heuristics
  1. Prioritizes critical path (C)
  2. Reprioritizes after each placement (R)
  3. Accounts for data cache locality (L)
  4. Accounts for register output locality (O)
  5. Load balancing for local issue contention (B)
- Drawbacks
  - Unnecessary restrictions on scheduling order
  - Inelegant and overly specific
- Replace heuristics with elegant approach designed for spatial scheduling
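The limits on the "TRIPS Block Constraints" slide lend themselves to a mechanical check. The following is a minimal sketch, not the TRIPS compiler's code: the function name `check_block`, its argument shapes, and the `reg % 4` bank interleaving are assumptions made for illustration. Only the numeric limits (128 instructions, 32 load/store IDs, 32 reads and 32 writes with 8 per bank, constant output) come from the slides.

```python
from collections import Counter

MAX_INSTRUCTIONS = 128   # fixed block size
MAX_LSIDS = 32           # load/store queue identifiers per block
MAX_REG_ACCESSES = 32    # limit for reads, and separately for writes
NUM_BANKS = 4
MAX_PER_BANK = 8


def bank_of(reg):
    """Assumption for this sketch: registers are interleaved across banks."""
    return reg % NUM_BANKS


def check_block(num_instructions, reads, writes, lsids, path_outputs):
    """Return a list of violated constraints; an empty list means the block fits.

    path_outputs holds one set of outputs (stores and register writes) per
    execution path through the block; constant output requires them all equal.
    """
    problems = []
    if num_instructions > MAX_INSTRUCTIONS:
        problems.append(f"{num_instructions} instructions > {MAX_INSTRUCTIONS}")
    if len(set(lsids)) > MAX_LSIDS:
        problems.append(f"{len(set(lsids))} load/store IDs > {MAX_LSIDS}")
    for kind, regs in (("read", reads), ("write", writes)):
        regs = list(regs)
        if len(regs) > MAX_REG_ACCESSES:
            problems.append(f"{len(regs)} register {kind}s > {MAX_REG_ACCESSES}")
        per_bank = Counter(bank_of(r) for r in regs)
        for bank, count in sorted(per_bank.items()):
            if count > MAX_PER_BANK:
                problems.append(f"{count} {kind}s in bank {bank} > {MAX_PER_BANK}")
    if len({frozenset(p) for p in path_outputs}) > 1:
        problems.append("paths produce different stores/register writes")
    return problems


# A block that fits, and one that breaks four constraints at once.
fits = check_block(100, [0, 1, 2, 3], [4], range(10), [{"W4", "S0"}, {"W4", "S0"}])
too_big = check_block(130, range(0, 36, 4), [], range(33), [{"S0"}, set()])
print(fits)
print(too_big)
```

Returning every violation rather than stopping at the first mirrors the "progressively satisfy constraints" idea: a phase can see which limits remain unmet and act on each.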
Spatial Path Scheduling Overview
[Figure: the scheduler takes a dataflow graph and a topology and produces a placement. Legend: register, data cache, execution, control]
- Initialize all known anchor points
- Until all instructions are scheduled:
  1. Populate the open list
  2. Find placement costs
  3. Choose the minimum cost location
  4. Schedule the instruction whose minimum placement cost is largest (choose the max of the mins)

Spatial Path Scheduling Example
- Initialize all known anchor points
  [Figure: example dataflow graph (read R2, read R1, add, mul, br, ld, ld, mul, add, write R1) beside the topology with register file, control, and data cache tiles. Legend: register, data cache, execution, control, unplaced]
- Populate the open list (marked in yellow)
  - Open list: instructions that are candidates for scheduling
  - We include: instructions with no parents, or with at least one placed parent
- Calculate placement cost for each instruction in the open list at each slot
  - Placement cost(i, slot): longest path length through i if placed at slot
  - cost = inputCost + execCost + outputCost (includes communication and execution latencies)
  - Worked example from the figure: total placement cost = 16 + 3 + 3 = 22
  [Figure: one 4x4 grid of placement costs per open-list instruction; costs range from 8 to 16 for one add and from 22 to 30 for the other candidates]
- Choose the minimum cost location for each instruction
- Break ties
  - Example heuristics: links consumed, ALU utilization
- Place the instruction with the highest minimum cost (choose the max of the mins)

Spatial Path Scheduling Algorithm

    Schedule (block, topology)
      initialize known anchor points
      while (not all instructions scheduled)
        for each instruction in open list, i
          for each available location, n
            calculate placement cost for (i, n)
            keep track of n with min placement cost
          keep track of i with highest min placement cost
        schedule i with highest min placement cost

- Per-block complexity:
  - SPS: O(i^2 * n)
  - GRST: O(i^2 + i * n)
  - Exhaustive search: i!
  - i = # of instructions, n = # of ALUs

SPS Benefits and Limitations
- Benefits
  - Automatically exploits known communication latencies
  - Designed for spatial scheduling
  - Minimizes critical path length at each step
  - Naturally encompasses four of five GRST heuristics
- Limitations of basic algorithm
  - Does not account for resource contention
  - Uses no global information
  - Minimum communication latencies may be optimistic

Experimental Methodology
- 26 hand-optimized microbenchmarks
  - Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
  - Average dynamic instructions fetched/block: 67.3 (ranges from 14.5 to 117.5)
- Cycle-accurate simulator
  - Within 4% of RTL on average
  - Models communication and contention delays
- Comparison points
  - Greedy Scheduling for TRIPS (GRST)
  - Simulated annealing

SPS Performance
- Geometric mean of speedup over GRST: 1.19
[Figure: bar chart of Basic SPS speedup over GRST for each hand-coded microbenchmark and the geometric mean; y-axis: Speedup, 0.8 to 2]
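The loop on the "Spatial Path Scheduling Algorithm" slide can be sketched in Python. This is an illustration under stated assumptions, not the TRIPS scheduler. Assumptions: the name `sps_schedule`, a 4x4 ALU grid with 8 slots each, Manhattan hop count as the communication lower bound, register anchors at row -1, execution-only path lengths for neighbours that are not yet placed, no data cache anchors, and tie-breaking by tuple order instead of the slide's link and ALU utilization heuristics.

```python
from graphlib import TopologicalSorter
from itertools import product

ROWS, COLS, SLOTS_PER_ALU = 4, 4, 8  # 16 ALUs x 8 slots = 128 instructions


def hops(a, b):
    """Lower-bound communication latency: one cycle per hop between tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sps_schedule(children, lat, anchors):
    """children: instruction -> consumers (every instruction must be a key).
    lat: instruction -> execution latency.
    anchors: instruction -> fixed (row, col), e.g. register reads and writes.
    Returns instruction -> (row, col)."""
    parents = {i: [] for i in children}
    for p, consumers in children.items():
        for c in consumers:
            parents[c].append(p)
    order = list(TopologicalSorter(parents).static_order())

    # Execution-only longest paths, used for neighbours that are not placed yet.
    top, bottom = {}, {}
    for i in order:
        top[i] = lat[i] + max((top[p] for p in parents[i]), default=0)
    for i in reversed(order):
        bottom[i] = lat[i] + max((bottom[c] for c in children[i]), default=0)

    loc = dict(anchors)                     # initialize known anchor points
    finish = {i: top[i] for i in anchors}   # estimated completion times
    free = {alu: SLOTS_PER_ALU for alu in product(range(ROWS), range(COLS))}

    def cost(i, alu):
        """Longest path through i if placed at alu: input + exec + output."""
        in_cost = max((finish[p] + hops(loc[p], alu) if p in loc else top[p]
                       for p in parents[i]), default=0)
        out_cost = max((bottom[c] + (hops(alu, loc[c]) if c in loc else 0)
                        for c in children[i]), default=0)
        return in_cost + lat[i] + out_cost, in_cost

    while len(loc) < len(children):
        # Open list: unplaced instructions with no parents or a placed parent.
        open_list = [i for i in children if i not in loc and
                     (not parents[i] or any(p in loc for p in parents[i]))]
        best = None
        for i in open_list:
            candidates = [(*cost(i, alu), alu) for alu in free if free[alu] > 0]
            c, in_cost, alu = min(candidates)     # min cost location for i
            if best is None or c > best[0]:       # max of the mins
                best = (c, in_cost, alu, i)
        _, in_cost, alu, i = best
        loc[i], finish[i] = alu, in_cost + lat[i]
        free[alu] -= 1
    return loc


# Hypothetical block: two register reads feed a mul, whose result is added to
# one of the reads and written back. Latencies are illustrative only.
children = {"read_a": ["mul", "add"], "read_b": ["mul"],
            "mul": ["add"], "add": ["write"], "write": []}
lat = {"read_a": 1, "read_b": 1, "mul": 3, "add": 1, "write": 1}
anchors = {"read_a": (-1, 1), "read_b": (-1, 2), "write": (-1, 1)}
placement = sps_schedule(children, lat, anchors)
print(placement["mul"], placement["add"])
```

Each pass scans every open instruction against every free ALU, and there is one pass per instruction, which is where the O(i^2 * n) per-block bound comes from.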
How well can we do?
- Simulated annealing
  - Artificial intelligence search technique
  - Uses random perturbations to avoid local optima
  - Approximates a global optimum
- Cost function: simulated cycles
  - Uncertainty makes static cost functions insufficient
  - Best cost function
- Purpose
  - Optimization
  - Discover performance upper bound
  - Tool to improve scheduler

Speedup with Simulated Annealing
- Geometric mean of speedup over GRST
  - Basic SPS: 1.19
  - Annealed: 1.40
[Figure: bar chart of Basic SPS and Annealed speedup over GRST for each hand-coded microbenchmark and the geometric mean; y-axis: Speedup, 0.8 to 2.2]

Extending SPS
- Contention
  - Network link contention
  - Local and global ALU contention
- Global register prioritization
- Path volume scheduling

ALU Contention
- What if two instructions are ready to execute on the same ALU at the same time?
[Figure: the example dataflow graph with two instructions competing for one ALU in the placement]

Local vs. Global ALU Contention
- Local ALU contention
  - Keep track of expected issue time
  - Increase placement cost if conflict occurs
- Global ALU contention
  - Resource utilization in previous/next block
  - Weighting function
  - Modify placement cost

Speedup over GRST
- Geometric mean of speedup over GRST
  - Basic SPS: 1.19
  - SPS extended: 1.31
  - Annealed: 1.40
[Figure: bar chart of Basic SPS, SPS extended, and Annealed speedup over GRST for each hand-coded microbenchmark and the geometric mean; y-axis: Speedup, 0.8 to 2.2]

Related Work
- Scheduling for VLIW [Ellis, Fisher]
- Scheduling for other partitioned architectures
  - Partitioned VLIW [Gilbert, Kailas, Kessler, Özer, Qian, Zalamea]
  - RAW [Lee]
  - WaveScalar [Mercaldi]
- ASIC and FPGA place and route [Paulin]
  - Resource conflicts known statically
  - Substrate may not be fixed
- Simulated annealing [Betz]

Conclusions and Future Work
- Conclusions
  - General spatial instruction scheduling algorithm
  - Reasons explicitly about anchor points
  - Performance within 4% of annealed results
- Future work
  - Register allocation
  - Memory placement
  - Reliability-aware scheduling

Questions?
Mapping Instructions to Physical Locations
- Scheduler converts operand format to target format, and assigns IDs
- ID assigned to each instruction indicates physical location
- The microarchitecture can interpret this ID in many different ways
- To schedule well, the scheduler must understand how the microarchitecture translates ID -> physical location

    TIL (operand format)        TASL (target format)
    read  t0, g1                R[1]  read, G[1], N[5]
    read  t1, g2                R[2]  read, N[2], N[6]
    muli  t2, t1, 4             N[2]  muli, N[34], N[1]
    ld    t3, 0(t2)             N[34] ld, N[32]
    ld    t4, 4(t2)             N[1]  ld, N[32]
    mul   t5, t3, t4            N[32] mul, N[5]
    add   t6, t5, t0            N[5]  add, W[1]
    addi  t7, t1, 8             N[6]  addi, N[0]
    br    t7                    N[0]  br
    write g1, t6                W[1]  write, G[1]

ID -> physical location (register IDs across the top row, instruction IDs in the 4x4 array):

    ctrl   R0,R4, … R28     R1,R5, … R29     R2,R6, … R30     R3,R7, … R31
    D0     0,4,8, … 28      1,5,9, … 29      2,6,10, … 30     3,7,11, … 31
    D1     32,36, … 60      33,37, … 61      34,38, … 62      35,39, … 63
    D2     64,68, … 92      65,69, … 93      66,70, … 94      67,71, … 95
    D3     96,100, … 124    97,101, … 125    98,102, … 126    99,103, … 127

Simulated Annealing Over Time
[Figure: simulation cycles (60000 to 100000) vs. annealing iterations (1 to 1680); series: random accepted, random best, guided accepted, guided best]

Simulated Annealing
- Cost function: simulated cycles
- Prune space further with critical path tool
[Figure: guided vs. unguided annealing for memset_hand; simulation cycles (76000 to 83000) vs. annealing times (1 to 101); series: Random Move, Guided Move]

Contention
- ALU contention
  - Local (within a block): estimate temporal schedule
  - Global (between blocks): probabilistic, use weighting function
- Network link contention
  - Precise measurements too inaccurate
  - Estimate with threshold, weighting function
- Weight network link and global ALU contention based on annealed results

    weight = (1 - fullness) * (1 [?] criticality / concurrency)

  (the operator between 1 and the fraction is missing in the source)

Global Register Prioritization
- Problem: any register dependence may be important with speculative execution
- Solution: extend path lengths through registers
- Register prioritization:
  1) Schedule smaller loops before larger loops
  2) Schedule loop-carried dependences first
  3) Extend placement cost through registers to previous/next block

Path Volume Scheduling
- Problem: the basic SPS algorithm does not account for the number of instructions in the path
- Solution: perform a depth-first search with iterative deepening to find the shortest path that holds all instructions
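The ID table on the "Mapping Instructions to Physical Locations" slides follows a regular pattern that a scheduler could encode directly. Here is a minimal sketch of that one interpretation. The function names `locate` and `hops` are hypothetical, and treating hop count as Manhattan distance is an assumption based on the "1 cycle latency between adjacent ALUs" bullet.

```python
TILE_COLS = 4        # execution tiles per row
IDS_PER_ROW = 32     # 4 tiles x 8 slots in each row of the array
NUM_IDS = 128        # instruction IDs N[0] .. N[127]


def locate(n):
    """Map instruction ID N[n] to (row, col, slot) as laid out in the table."""
    if not 0 <= n < NUM_IDS:
        raise ValueError(f"instruction ID out of range: {n}")
    row = n // IDS_PER_ROW                   # 0..3, the row next to D0..D3
    col = n % TILE_COLS                      # 0..3, the column under R0..R3
    slot = (n % IDS_PER_ROW) // TILE_COLS    # 0..7, the slot within the tile
    return row, col, slot


def hops(a, b):
    """Tile-to-tile hops between two instruction IDs (assumed Manhattan)."""
    ra, ca, _ = locate(a)
    rb, cb, _ = locate(b)
    return abs(ra - rb) + abs(ca - cb)


# Producer -> consumer edges taken from the TASL example on the slide.
for src, dst in [(2, 34), (2, 1), (34, 32), (1, 32), (32, 5)]:
    print(f"N[{src}] -> N[{dst}]: {hops(src, dst)} hop(s)")
```

Under this reading, the `muli` at N[2] sits one hop from each of the two loads it feeds, which is the kind of locality the placement is meant to achieve.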