C. Kessler, IDA, Linköpings Universitet, 2004.
ADVANCED COMPILER CONSTRUCTION - Background Material

BACKGROUND MATERIAL (Prerequisites)

Topics:
  Some microprocessor architecture concepts (memory hierarchy, addressing modes, pipelining, ILP) and their consequences for compiler design
  System software tool chain: assembler, linker, loader
Usually covered in undergraduate courses on computer architecture and operating systems.


A short review of microprocessor architecture concepts


Basic principle: von-Neumann architecture

[Figure: CPU (with PC) connected to PROGRAM MEMORY and DATA MEMORY; labels: addresses, instructions, data]

von-Neumann cycle:
  IF  instruction fetch from PM[PC]
  ID  instruction decode
  FO  fetch operands
  EX  compute (ALU) (typed: int / float / ...)
  WB  write result
  PC++, repeat

von-Neumann bottleneck
Harvard vs. Princeton (Non-Harvard) architecture


Memory hierarchy (1)

  Level                                     typical capacity   typical access time
  Register                                  64..1024 byte      2..5 ns
  Primary cache ("L1", on-chip)             8kB..256kB         4..10 ns
  Secondary cache ("L2", off-chip, SRAMs)   512kB..4MB         20..100 ns
  Main memory (DRAMs)                       8MB..4GB           50..1000 ns
  Secondary memory (hard disk)              500MB..1TB         5..15 ms
  Archive memory (disks, tapes)             many TB            1..50 s

usable explicitly: register, main memory -> addressing modes
usable implicitly: caches
since 1985: CPU cycle rate grows annually by 55%, memory cycle rate by 7%


Instruction classes

  Compute: arithmetic or logic computations
    e.g., addition, bitwise AND, shift left, ...
  Load: load contents of a memory location into a CPU register
    e.g., load, pop
  Store: write a value from the CPU into a memory location
    e.g., store, push
  Branch: computation modifies the program counter PC
    e.g., jump (unconditional), branch (conditional), call (subroutine)
  Special instructions
    e.g., system call (trap), status queries, ...


Addressing modes

possible combinations of register and memory addresses or constants for the operands of an instruction

  register             Rdest <- Rsrc1 + Rsrc2          (Add-register)
  absolute             Rdest <- Rsrc + M[constant]     (Add-absolute)
  immediate            Rdest <- constant               (Set-immediate)
  register-indirect    Rdest <- Rsrc1 + M[Rsrc2]       (Add-reg-indirect)
  displacement         Rdest <- M[Rsrc+offset]         (Load-direct-displ.)
  auto-inc/decrement   Rdest <- M[Rsrc--]              (Pop = load-autodecrement)
  memory-indirect      Rdest <- M[M[Rsrc1] + Rsrc2]    (Load-memory-indirect)
  PC-relative, etc. ...

Challenge for the compiler: instruction / addressing mode selection
  exploit complex addressing modes to reduce number or cost of instructions generated


RISC versus CISC processor architectures

Motivation for RISC:
Simple operations and addressing modes occur much more often in relevant benchmark programs (e.g., SPEC int, SPEC fp) than more complex operations.
Usually, complex addressing modes are rarely used.
Rare, complex operations and addressing modes (e.g., division, sqrt, ...) can be replaced by an equivalent sequence of simple instructions
  -> better utilization of hardware
  -> higher clock rate possible
Rem.: if using a benchmark for quantitative argumentation like this, make sure it represents your typical profile of computational load.


RISC versus CISC processor architectures

RISC
  few, simple instructions
  many registers (general purpose)
  few addressing modes
  Compute instr. only on registers ("Load-Store architecture")
  simple instruction formats -> simple decoding logic -> higher clock rate -> cache (L1) on chip
  alignment to 32 (64) bit words -> optimized for access time
  floating-point arithmetic on chip
  SPARC, MIPS, PowerPC, Alpha, ...

CISC
  many, more complex instructions
  few registers, partly special registers
  many addressing modes
  compute also with memory operands -> longer cycle time: -> M -> ALU ->
  "packed" instruction formats
  extensive decoding logic or microcode
  no space for cache on chip
  but: shorter programs
  alignment to 8 / 16 bit -> optimized for space economy
  sometimes floating-point coprocessor
  Intel 80x86, Motorola 680x0, NSC 32000, ...


Example: A simple RISC processor: DLX (1)

DLX = "predecessor" of the SGI MIPS line, see [Hennessy/Patterson'96]

data types:
   8 bit  int        byte
  16 bit  int        half-word
  32 bit  int/float  word
  64 bit  double     double-word

Registers:
  31 general purpose registers R1 ... R31, R0 = 0
  32/16 float/double registers F0, F1, ..., F31 resp. F0, F2, ..., F30
  special registers, e.g. floating-point status register


Example: A simple RISC processor: DLX (2)

Instruction set, addressing modes:
  + Load: only one addr.-mode, direct (with offset)
  + Store: likewise
  + Compute: only on registers resp. immediates -> "Load-Store architecture"
  + Branches: J, JR absolute, direct, B, BR PC-relative, JALR with pushing the PC (call subroutine), BEQZ, BNEZ, BFPT, BFPF conditional branches

Simplest case: all instructions take one cycle to execute (no cache)
  -> longest data path dominates the cycle time (usually, Load)
  -> CPUtime = # executed instructions * cycle-time

Idea: increase clock rate by decomposing instructions into phases


Pipelining, at example DLX (1)

Instruction Ik consists of 5 phases:
  IFk   instruction fetch
  IDk   instruction decode + read operand registers
  EXk   compute operand address or compute part 1
  MEMk  load operand from memory or compute part 2
  WBk   write result back in register (jump/branch/store: no WB)

Consider n subsequent instructions I1, I2, ...


Pipelining, at example DLX (2)

without pipelining:

  ins.  phase  cycle
  I1    IF1    1
        ID1    2
        EX1    3
        MEM1   4
        WB1    5
  I2    IF2    6
        ID2    7
        EX2    8
        MEM2   9
        WB2    10
  I3    IF3    11
        ID3    12
        EX3    13
        MEM3   14
        WB3    15

Units: PM, Decoder, ALU1, DM/ALU2, Regs: bad utilization!


Pipelining, at example DLX (3)

with pipelining: each phase takes 1 (faster) cycle

  issue  cycle  PM   Decoder  ALU1  DM/ALU2  Regs
  I1     1      IF1
  I2     2      IF2  ID1
  I3     3      IF3  ID2      EX1
  I4     4      IF4  ID3      EX2   MEM1
  I5     5      IF5  ID4      EX3   MEM2     WB1
  I6     6      IF6  ID5      EX4   MEM3     WB2
  ...

-> faster by factor 5 ... (in theory, BUT ...)
pipeline depth = number of phases, here 5


Pipelining, at example DLX (4)

-> Structural hazards
   for cost reasons the hardware cannot admit arbitrary combinations of instructions in the pipeline, so that the same component may be needed simultaneously by multiple instructions
-> Data hazards
   Instruction Ik may need a value computed by Ik-1, but at the desired point in time (EXk) that computation is not yet complete (WBk-1)
-> Control hazards
   branch instruction may wish to continue at other than next instruction

In these cases: "pipeline stall"
  do not issue a new instruction until problem disappears


Pipelining, at example DLX (5)

Example: Data hazard
  S1: ADD R1, R2, R3   ; overwrites register R1 in WB1
  S2: SUB R4, R5, R1   ; reads register R1 in ID2

         cycle  PM   Dec/Opnd  ALU1  DM/ALU2  Regs
  t=1    1      IF1
  t=2    2      IF2  ID1
  t=3    3      IF3  ID2       EX1
  t=4    4      IF4  ID3       EX2   MEM1
  t=5    5      IF5  ID4       EX3   MEM2     WB1
  ...

-> wrong result!


Pipelining, at example DLX (6)

Processor logic must delay ID2 by (at least) 2 cycles

  issue  cycle  PM     Dec/Opnd  ALU1   DM/ALU2  Regs
  I1     1      IF1
  I2     2      IF2    ID1
  -      3      stall  stall     EX1
  -      4      stall  stall     stall  MEM1
  I3     5      IF3    ID2       stall  stall    WB1
  I4     6      IF4    ID3       EX2    stall    stall

(assuming that WB writes registers early and ID reads registers late)

$$\text{Speedup} = \frac{\text{pipeline depth}}{1 + \text{pipeline stall cycles per instruction}}$$


Pipelining, at example DLX (7)

In some cases the hazard is exposed explicitly to the (assembler) programmer / compiler for efficiency reasons:
  delayed Load (data hazard)
    result written to destination register only after d > 0 delay cycles
  delayed branch, branch target prediction (control hazard)
    Ik+1 executed in any case, indep. of outcome of branch Ik
  unit occupation / issue interval k > 1 cycle (structural hazard)
    special unit (e.g., float mul, div) available at most every kth cycle
-> constraints for code placement / instruction scheduling, fill delay slots with useful instructions or with NOP

Most superscalar processors handle such hazards internally (dynamic instruction dispatch).
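The cycle counts in the pipelining tables and the speedup formula can be checked with a short calculation. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and it assumes that one instruction is issued per cycle, that every stall cycle simply delays completion by one cycle, and that a pipelined cycle is `depth` times shorter than an unpipelined instruction.

```python
def unpipelined_cycles(n, depth=5):
    """Cycles for n instructions when each runs all phases before the next starts."""
    return n * depth


def pipelined_cycles(n, depth=5, stall_cycles=0):
    """Cycles until the last of n instructions leaves the pipeline.

    The first instruction needs `depth` cycles; each further instruction
    completes one cycle later. Every stall cycle delays completion by one cycle
    (simplifying assumption of this sketch).
    """
    if n == 0:
        return 0
    return depth + (n - 1) + stall_cycles


def ideal_speedup(depth, stalls_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return depth / (1 + stalls_per_instruction)


if __name__ == "__main__":
    print(unpipelined_cycles(3))                # I1..I3 without pipelining
    print(pipelined_cycles(3))                  # same instructions, pipelined
    print(pipelined_cycles(2, stall_cycles=2))  # ADD/SUB data hazard, 2 stalls
    print(ideal_speedup(5, 0.0), ideal_speedup(5, 1.0))
```

With three instructions the unpipelined count matches the table above (WB3 in cycle 15). With the two-cycle stall of the ADD/SUB example, the second instruction completes in cycle 8 instead of cycle 6.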
Pipelining, at example DLX (8): data / control hazard example

Example: [Hennessy/Patterson'03]
for single-issue MIPS processor:

  producer ins.   consumer ins.   latency
  FP ALU op       FP ALU op       3
  FP ALU op       store double    2
  Load double     FP ALU op       1
  Load double     Store double    0

  for (i=1000; i>0; i--)
      x[i] += s;

  Loop:  L.D     F0,0(R1)     ;F0=array elem.
         ADD.D   F4,F0,F2     ;add scalar in F2
         S.D     F4,0(R1)     ;store result
         DADDUI  R1,R1,#-8    ;decrem. pointer
         BNE     R1,R2,Loop   ;branch if R1!=R2

Execution takes 10 clock cycles:

  t=1   L.D     F0,0(R1)
    2   (stall)
    3   ADD.D   F4,F0,F2
    4   (stall)
    5   (stall)
    6   S.D     F4,0(R1)
    7   DADDUI  R1,R1,#-8
    8   (stall)
    9   BNE     R1,R2,Loop
   10   (stall)               ;delay slot

reschedule loop -> 6 clock cycles:

  t=1   L.D     F0,0(R1)
    2   DADDUI  R1,R1,#-8
    3   ADD.D   F4,F0,F2
    4   (stall)
    5   BNE     R1,R2,Loop
    6   S.D     F4,8(R1)      ;(altered)


Architectures for instruction-level parallelism

VLIW (very long instruction word), EPIC
  multiple functional units (integer ALU, shifter, float mul, load/store unit, ...)
  direct, static specification of parallel operations in the corresponding sections of the long instruction word

superscalar processor
  multiple functional units
  single stream of ordinary instructions (sequential program code)
  multiple issue: k-way superscalar = up to k > 1 subsequent instructions may start execution simultaneously
  dynamic instruction dispatcher maps instructions to available units


VLIW and EPIC processors

VLIW: (very long instruction word)
  multiple functional units work in parallel
  single stream of long instruction words with explicitly parallel subinstructions
  statically bound to functional units
  requires static scheduling (-> compiler)
  e.g.: Philips Trimedia, Intel i860

[Figure: VLIW processor: PC; long instruction word with slots addi, load, FMUL, NOP, NOP; functional units ADD, MEM, FMUL, SHIFT; REGISTER FILE]

EPIC: (explicitly parallel instruction computer)
  one-dim. stream of instructions
  grouped statically (by markup bits)
  instructions in a group may be issued in parallel (as in VLIW)
  requires static scheduling (-> compiler)
  e.g.: Intel Itanium


Superscalar processors

Example: Motorola 88110, 2-way superscalar

[Figure: I-cache -> internal instruction buffer (2 instructions; place 1: In, place 2: In+1) -> DISPATCHER -> Unit 1, Unit 2, Unit 3, ..., Unit 10]

  if no conflicts: issue both In, In+1
  if only 1 instruction can be issued: issue In, advance In+1 to place 1
  if no instruction can be issued: wait (pipeline stall)

Conflicts:
  - pipeline conflicts
  - competition for same unit
  - out-of-order completion
recognized + resolved by dispatcher
(reservation stations, out-of-order execution, reorder buffer, ...)


VLIW / EPIC / Superscalar processors

Consequences for code generation:

VLIW / EPIC:
  static scheduling to keep units busy ("code compaction")
  analyze and resolve hazards at compile time

superscalar:
  generate just 1 linear instruction stream as usual, instruction dispatcher guarantees sequential semantics
  explicit, static mapping instructions -> units not possible
  but: dispatcher has limited lookahead
  static reordering of instructions to remove hazards may help!


Predicated execution

Branch misprediction -> large penalty (stall time)
  e.g. Pentium IV: 20 pipeline stages, at misprediction -> discard up to 126 in-flight instructions
Not all branches can be predicted well (statically or dynamically)

Code with branch:

  branch
    T:  d = buf&0x1
    F:  buf=*inp
        inp=inp+1
        t=buf>>4
        d=t&0xf
  ..........

Alternative: predicated execution, conversion to predicated code: "if-conversion"

        cmp p2, p3
  [p2]  d=buf&0x1
  [p3]  buf=*inp
  [p3]  inp=inp+1
  [p3]  t=buf>>4
  [p3]  d=t&0xf
  ..........

removes branch, no pipeline stall, but more work performed


Memory hierarchy (2)

Caches (Instruction cache, data cache)
  cache hit = accessed word already in cache, get it fast.
  cache miss = not in cache, load from main memory (slower)
  Cache line size: from 16 bytes (Dash) ...

Cache-based systems profit from
  + spatial access locality (access also other data in same cache line)
  + temporal access locality (access same location multiple times)
  + dynamic adaptivity of cache contents
-> suitable for applications with high (also dynamic) data locality
-> Compiler issue: optimize for cache utilization


Memory hierarchy (3)

Mapping memory blocks -> cache lines / page frames:
  direct mapped: $\forall j \; \exists! i : B_j \mapsto C_i$, namely where $i \equiv j \bmod m$.
  fully-associative: any memory block may be placed in any cache line
  set-associative


Steps and tools for generating an executable program

  myprog.c   source program (ASCII)
     |
  COMPILER:  FRONT END (lexical, syntactical analysis, type checking)
             -> intermediate representation
             -> BACK END (code generation)
     |
  myprog.s   assembler program (ASCII)
     |
  ASSEMBLER
     |
  myprog.o   object program (not executable) (COFF / ELF / ...)
     |
  LINKER
     |
  a.out      object program (executable) (COFF / ELF / ...)
     |
  LOADER
     |
  (program in execution)


Object code format

COFF = Common Object File Format (Unix)

A COFF program consists (mainly) of 3 segments:
  Text segment: the instructions
  Data segment: initialized variables
  BSS segment: non-initialized (global) variables
-> protection units
more: header, string table, relocation info, symbolic debug info, ...

Other object code formats: ELF (Linux), Java bytecode (JVM), ...

Reference: Gircys: Understanding and Using COFF. O'Reilly, 1988.


Assembler syntax (1)

ASCII textfile

Instruction format:
  [Label :] instruction-mnemonic operands [! comment]

switch to other segment: .text, .data, .bss

Example:
  .globl _main                ! makes label _main globally visible
  _main:                      ! convention: _ prefix to
          ...                 ! compiler-generated labels
  start:  addi r4,r5,r4       ! integer addition r4 = r4+r5
          beqz r4, start      ! if r4==0 goto start
          ...


Assembler syntax (2)

Abbreviations e.g. for registers or instructions

Example:
  define(nop, ori r1,r1,r1)   ! nop instruction
  define(length,r17)
  addi %length,r5,%length     ! expands to r17 += r5
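The abbreviation mechanism from the assembler syntax slides can be illustrated with a toy expander. This is a minimal sketch and not how any particular assembler or macro processor implements `define`: the function name `expand_abbreviations` is hypothetical, and the rules that only `%name` references are substituted and that everything after `!` is a comment are assumptions taken from the slide example.

```python
import re

# Matches a line of the form: define(name, replacement text)
DEFINE = re.compile(r"^define\(\s*(\w+)\s*,\s*(.*)\)\s*$")


def expand_abbreviations(lines):
    """Drop define(...) lines and replace %name references in the other lines."""
    table = {}
    out = []
    for line in lines:
        # Split off the comment so that it is never expanded.
        code, sep, comment = line.partition("!")
        match = DEFINE.match(code.strip())
        if match:
            table[match.group(1)] = match.group(2).strip()
            continue
        # Unknown %names are left untouched.
        expanded = re.sub(
            r"%(\w+)",
            lambda ref: table.get(ref.group(1), ref.group(0)),
            code,
        )
        out.append(expanded + sep + comment)
    return out


if __name__ == "__main__":
    source = [
        "define(length,r17)",
        "addi %length,r5,%length",
    ]
    for line in expand_abbreviations(source):
        print(line)
```

The sketch treats every `!` as the start of a comment, so it would mishandle a `!` inside a string constant; a real assembler front end tokenizes the line first.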
Assembler syntax (3)

Constants
  myfconst: .float 3.1415
            (similar: .int, .long, .byte, .double, .word)
  mystring: .asciz "hello world"    ! trailing \0
  myblock:  .space 48, 0            ! 48 bytes filled with 0

Make names globally visible with .globl
  .globl mystring

.align k
  next value starts at a byte address divisible by 2^(k-1)
  .byte 17
  .align 3
  .word 74


Assembler

The assembler program
  expands abbreviations and macros
  resolves local labels
  creates a relocation table
  sorts into .text, .data, .bss
  generates output in COFF format (instructions and data in target machine format)


Relocation (1)

Relocation table
  generated by the assembler
  list of locations in the (object code) program where (local) addresses of data or functions are referenced which must later be patched by the linker or loader (replaced by the corresponding (global) addresses) to make the object code executable


Assembler, Linker, Archiver, Libraries

[Figure: file1.c -> cc -> file1.s -> as -> file1.o and file2.c -> cc -> file2.s -> as -> file2.o (each with .text, .data, .bss, relocation table, symbol table); ar builds the libraries libc.a, libm.a, ... (library routines printf, sqrt, ...); ld (LINKER) combines object files and libraries into a.out (.text, .data, .bss, relocation table, symbol table); LOADER places a.out in target memory (a.out in execution)]


Relocation (2)

Relocation is necessary to be able to program independently of the code's position in the target machine's address space
  + for the programmer, each .s assembler program starts at local address 0 and is a contiguous block
  + multiple object programs can be linked together to one -> modularity
  + program can be decomposed into blocks, blocks can have different levels of protection
  + program can be moved to another block in memory


Linker

Linker
  1. reads all object codes to be linked, including library archives
  2. merges .text, .data, .bss segments of these (may filter out unused functions)
  3. resolves global symbols, checks for duplicate (global) labels and undefined labels
  4. writes the resulting object file, adds a new relocation table to make it relocatable
  5. and marks it as executable.

Variants:
  + shared libraries
  + dynamic linking


Shared Libraries

  + space economy
  + consistent update
  - run-time overhead for dynamic linking
  - shared objects are not patchable, must consist of position-independent code, e.g.
      1. PC-relative addressing of routine entries etc.,
      2. global offset table in shared library, pointed to by a global pointer gp,
      3. stubs for global acc. / procedure linkage table in the non-shared code to be patched (lazily: call invokes dynamic linker)
requires static check for undefined globals / multiple definitions against a table of contents for each shared library
[Muchnick 5.7]


Loader

Loader
  1. copy a.out to a free memory block in the target machine
     (for Harvard architectures: text segment to program memory, data/bss to data memory)
  2. allocate space for runtime stack and heap in a free memory block
  3. relocate addresses according to relocation table
  4. start execution (call startup code in executable)


Startup code

  set up protection for segments (if applicable)
  set up initial stack frame and heap
  execute global initializers
  copy main's arguments to stack and call main().
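The relocation steps described for the linker and the loader can be sketched on a toy object format. This is a minimal sketch under strong assumptions, not the COFF layout. Each object module is a flat list of integer words that starts at local address 0, and its relocation table is a list of word positions that hold local addresses. Segments and global symbol resolution are left out, and the names `relocate` and `link` are hypothetical.

```python
def relocate(words, relocation_table, base):
    """Return a copy of `words` in which every listed position has `base` added."""
    patched = list(words)
    for position in relocation_table:
        patched[position] += base
    return patched


def link(modules):
    """Concatenate modules given as (words, relocation_table) pairs.

    Each module is moved to the end of what has been merged so far, and its
    local addresses are patched accordingly. The returned relocation table
    refers to the merged image, so the result is itself still relocatable.
    """
    merged = []
    new_table = []
    for words, table in modules:
        base = len(merged)
        merged.extend(relocate(words, table, base))
        new_table.extend(base + position for position in table)
    return merged, new_table


if __name__ == "__main__":
    module1 = ([10, 2, 99], [1])   # word 1 holds the local address 2
    module2 = ([0, 7, 1], [2])     # word 2 holds the local address 1
    image, table = link([module1, module2])
    print(image, table)
    # The loader copies the image to a free block and relocates once more.
    print(relocate(image, table, base=100))
```

In this example the address stored in the second module changes from 1 to 4 when the module is placed behind the three words of the first one, and both addresses are shifted by 100 when the loader moves the image to address 100.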
Virtual machines as compilation target

Virtual machine: abstract execution platform simulated in software (by interpretation or by a further compilation step)

Examples:

stack machines, e.g. Java Virtual Machine JVM, C-machine
  + simple code generation for expression evaluation
  + addressing via SP implicit -> more compact code [Davis et al.'03]
  - stack addressing not applicable to global and heap data structures
  - all modern processors are register machines -> convert

register machines, e.g. Random Access Machine RAM
  more flexible, closer to HW, but needs register allocation
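The claim that stack machines allow simple code generation for expression evaluation can be made concrete with a small sketch. It assumes a hypothetical stack machine with the instructions `push`, `add`, `sub` and `mul`, which is not the JVM instruction set, and it represents expressions as nested tuples. A post-order traversal of the expression tree yields the code, and no register allocation is needed because all operands are addressed implicitly via the stack.

```python
OPERATIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def compile_expr(expr):
    """Translate a nested (op, left, right) tuple into stack machine code.

    Post-order traversal: code for both operands first, then the operator.
    """
    if isinstance(expr, (int, float)):
        return [("push", expr)]
    op, left, right = expr
    return compile_expr(left) + compile_expr(right) + [(op,)]


def run(code):
    """Interpret the generated code on an operand stack and return the result."""
    stack = []
    for instruction in code:
        if instruction[0] == "push":
            stack.append(instruction[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATIONS[instruction[0]](left, right))
    return stack.pop()


if __name__ == "__main__":
    code = compile_expr(("add", 2, ("mul", 3, 4)))   # 2 + 3 * 4
    for instruction in code:
        print(instruction)
    print(run(code))
```

Compiling the same tree for a register machine would additionally have to decide which register holds each intermediate value, which is the register allocation problem mentioned above.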