C. Kessler, IDA, Linköpings Universitet, 2004.
ADVANCED COMPILER CONSTRUCTION — Background Material
A short review of microprocessor architecture concepts

BACKGROUND MATERIAL (Prerequisites)
Topics:
Some microprocessor architecture concepts
(memory hierarchy, addressing modes, pipelining, ILP)
and their consequences for compiler design
System software tool chain: Assembler, linker, loader
usually covered in undergraduate courses on computer architecture and operating systems.

Basic principle: von-Neumann architecture
[Figure: CPU with program counter PC, connected to PROGRAM MEMORY (addresses, instructions) and DATA MEMORY (addresses, data)]
von-Neumann cycle:
IF instruction fetch from PM[PC]
ID instruction decode
FO fetch operands (typed: int / float / ...)
EX compute (ALU)
WB write result
PC++, repeat
von-Neumann bottleneck
Harvard vs. Princeton (Non-Harvard) architecture

Memory hierarchy (1)
level                                      typical capacity   typical access time
Register                                   64..1024 byte      2..5 ns
Primary cache (“L1”, on-chip)              8kB..256kB         4..10 ns
Secondary cache (“L2”, off-chip, SRAMs)    512kB..4MB         20..100 ns
Main memory (DRAMs)                        8MB..4GB           50..1000 ns
Secondary memory (hard disk)               500MB..1TB         5..15 ms
Archive memory (disks, tapes)              many TB            1..50 s
usable explicitly: register, main memory → addressing modes
usable implicitly: caches
since 1985:
CPU cycle rate grows annually by 55%, memory cycle rate by 7%

Instruction classes
Compute: arithmetic or logic computations
e.g., addition, bitwise AND, shift left, ...
Load: load contents of a memory location into a CPU register
e.g., load, pop
Store: write a value from the CPU into a memory location
e.g., store, push
Branch: computation modifies the program counter PC
e.g., jump (nonconditional), branch (conditional), call (subroutine)
Special instructions
e.g., system call (trap), status queries, ...
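The von-Neumann cycle and the four main instruction classes can be tried out on a toy machine. The sketch below is only an illustration: the tuple encoding (op, a, b, c), the mnemonics and the eight-register file are assumptions made for this example and do not correspond to any real instruction set.

```python
def run(program, data, max_steps=1000):
    """Interpret a toy program with the fetch / decode / execute / PC++ cycle."""
    regs = [0] * 8          # hypothetical register file, regs[0] is never written here
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, a, b, c = program[pc]       # IF + ID: fetch from program memory, decode
        pc += 1                         # PC++
        if op == "add":                 # compute: regs[a] = regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
        elif op == "addi":              # compute with immediate c
            regs[a] = regs[b] + c
        elif op == "load":              # load: regs[a] = data[regs[b] + c]
            regs[a] = data[regs[b] + c]
        elif op == "store":             # store: data[regs[b] + c] = regs[a]
            data[regs[b] + c] = regs[a]
        elif op == "bnez":              # branch: modifies the PC if regs[a] != 0
            if regs[a] != 0:
                pc = b
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown instruction {op}")
    return regs, data


# Sum data[0..2] into data[3].
program = [
    ("addi", 1, 0, 3),    # r1 = 3 (loop counter)
    ("load", 4, 3, 0),    # r4 = data[r3]
    ("add", 2, 2, 4),     # r2 += r4
    ("addi", 3, 3, 1),    # r3 += 1
    ("addi", 1, 1, -1),   # r1 -= 1
    ("bnez", 1, 1, 0),    # if r1 != 0 goto instruction 1
    ("store", 2, 0, 3),   # data[3] = r2
    ("halt", 0, 0, 0),
]
regs, data = run(program, [5, 7, 9, 0])
print(data[3])
```

Program and data live in separate Python lists here, which corresponds to the Harvard variant; a Princeton-style machine would keep both in one memory.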
Addressing modes
possible combinations of register- and memory addresses or constants
for the operands of an instruction
mode                 example                            name
register             Rdest ← Rsrc1 + Rsrc2              (Add-register)
absolute             Rdest ← Rsrc + M[constant]         (Add-absolute)
immediate            Rdest ← constant                   (Set-immediate)
register-indirect    Rdest ← Rsrc1 + M[Rsrc2]           (Add-reg-indirect)
displacement         Rdest ← M[Rsrc+offset]             (Load-direct-displ.)
auto-inc/decrement   Rdest ← M[Rsrc--]                  (Pop = load-autodecrement)
memory-indirect      Rdest ← M[M[Rsrc1] + Rsrc2]        (Load-memory-indirect)
PC-relative, etc. ...
Challenge for the compiler: instruction / addressing mode selection
exploit complex addressing modes to reduce number or cost
of instructions generated
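The addressing modes listed above can be written down as small operand computations. This is a minimal sketch under assumptions: the function name execute, the mode strings and the list-based register file R and memory M are hypothetical, and the auto-decrement is modelled as a post-decrement.

```python
def execute(mode, R, M, dest, a, b=0):
    """Apply one toy instruction; R is the register file, M the memory."""
    if mode == "register":              # Rdest <- Ra + Rb
        R[dest] = R[a] + R[b]
    elif mode == "immediate":           # Rdest <- constant a
        R[dest] = a
    elif mode == "absolute":            # Rdest <- Ra + M[constant b]
        R[dest] = R[a] + M[b]
    elif mode == "register-indirect":   # Rdest <- Ra + M[Rb]
        R[dest] = R[a] + M[R[b]]
    elif mode == "displacement":        # Rdest <- M[Ra + offset b]
        R[dest] = M[R[a] + b]
    elif mode == "memory-indirect":     # Rdest <- M[M[Ra] + Rb]
        R[dest] = M[M[R[a]] + R[b]]
    elif mode == "autodecrement":       # Rdest <- M[Ra], then Ra <- Ra - 1 (pop)
        R[dest] = M[R[a]]
        R[a] -= 1
    else:
        raise ValueError(f"unknown addressing mode {mode}")
    return R[dest]


M = [1, 20, 30, 40, 50, 60]
R = [0, 2, 3, 0]
print(execute("displacement", R, M, dest=3, a=1, b=3))   # reads M[R1 + 3]
```

A single memory-indirect load replaces two displacement loads plus an add on a machine without that mode, which is the trade-off instruction selection has to weigh.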
RISC versus CISC processor architectures
Motivation for RISC:
Simple operations and addressing modes occur much more often
in relevant benchmark programs (e.g., SPEC int, SPEC fp)
than more complex operations.
Rare, complex operations and addressing modes (e.g., division, sqrt, ...)
can be replaced by an equivalent sequence of simple instructions
→ better utilization of hardware
→ higher clock rate possible
Usually, complex addressing modes are rarely used.
Rem.: if using a benchmark for quantitative argumentation like this,
make sure it represents your typical profile of computational load.
RISC versus CISC processor architectures
RISC                                     | CISC
few, simple instructions                 | many, more complex instructions
many registers (general purpose)         | few registers, partly special registers
few addressing modes                     | many addressing modes
Compute instr. only on registers         | compute also with memory operands
(“Load-Store architecture”)              | → longer cycle time: → M → ALU →
simple instruction formats               | “packed” instruction formats
→ simple decoding logic                  | extensive decoding logics or microcode
→ higher clock rate                      | no space for cache on chip
→ cache (L1) on chip                     | but: shorter programs
alignment to 32 (64) bit words           | alignment to 8 / 16 bit
→ optimized for access time              | → optimized for space economy
floatingpoint arithmetics on chip        | sometimes floatingpoint-coprocessor
SPARC, MIPS, PowerPC, Alpha, ...         | Intel 80x86, Motorola 680x0, NSC 32000, ...
Example: A simple RISC processor: DLX (1)
DLX = “predecessor” of the SGI MIPS line, see [Hennessy/Patterson’96]
data types:
8 bit    int         byte
16 bit   int         half-word
32 bit   int/float   word
64 bit   double      double-word
Registers: 31 general purpose registers R1 ... R31, R0 = 0
32/16 float/double registers F0, F1, ..., F31 resp. F0, F2, ..., F30
special registers, e.g. floatingpoint status register
Example: A simple RISC processor: DLX (2)
Instruction set, addressing modes:
+ Load: only one addr.-mode, direct (with offset)
+ Store: same as Load
+ Compute: only on registers resp. immediates → “Load-Store architecture”
+ Branches: J, JR absolute, direct, B, BR PC-relative,
JALR with pushing the PC (call subroutine),
BEQZ, BNEZ, BFPT, BFPF conditional branches

Pipelining, at example DLX (1)
Simplest case: all instructions take one cycle to execute (no cache)
→ longest data path dominates the cycle time (usually, Load)
→ CPUtime = # executed instructions * cycle-time
Idea: increase clock rate by decomposing instructions in phases
Instruction Ik consists of 5 phases:
IFk instruction fetch
IDk instruction decode + read operand registers
EXk compute operand address or compute part 1
MEMk load operand from memory or compute part 2
WBk write result back in register (jump/branch/store: no WB)
Each phase takes 1 (faster) cycle

Pipelining, at example DLX (2)
Consider n subsequent instructions I1, I2, ...
without pipelining:
ins.   phase (cycle), by unit:
       PM         Decoder    ALU1       DM/ALU2     Regs
I1     IF1 (1)    ID1 (2)    EX1 (3)    MEM1 (4)    WB1 (5)
I2     IF2 (6)    ID2 (7)    EX2 (8)    MEM2 (9)    WB2 (10)
I3     IF3 (11)   ID3 (12)   EX3 (13)   MEM3 (14)   WB3 (15)
bad utilization!

Pipelining, at example DLX (3)
with pipelining:
issue  cycle   PM     Decoder   ALU1   DM/ALU2   Regs
I1     1       IF1
I2     2       IF2    ID1
I3     3       IF3    ID2       EX1
I4     4       IF4    ID3       EX2    MEM1
I5     5       IF5    ID4       EX3    MEM2      WB1
I6     6       IF6    ID5       EX4    MEM3      WB2
...
→ faster by factor 5 ... (in theory, BUT ...)
pipeline depth = number of phases, here 5
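The two tables can be condensed into a cycle count. The sketch below assumes an ideal pipeline without any stalls and the same (fast) cycle length in both cases; the function names are chosen for this example only.

```python
def cycles_unpipelined(n, depth=5):
    """Each of the n instructions runs all its phases before the next one starts."""
    return n * depth


def cycles_pipelined(n, depth=5):
    """The first instruction needs depth cycles, every further one completes 1 cycle later."""
    return depth + (n - 1)


def ideal_speedup(n, depth=5):
    return cycles_unpipelined(n, depth) / cycles_pipelined(n, depth)


for n in (3, 6, 1000):
    print(n, cycles_unpipelined(n), cycles_pipelined(n), round(ideal_speedup(n), 2))
```

For n = 3 the unpipelined count is the 15 cycles of the first table, and for large n the ratio approaches the pipeline depth 5, which is the "in theory" part of the claim.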
Pipelining, at example DLX (4)
→ Structural hazards
for cost reasons the hardware cannot admit arbitrary combinations
of instructions in the pipeline, so that the same component
may be needed simultaneously by multiple instructions
→ Data hazards
Instruction Ik may need a value computed by Ik-1,
but at the desired point in time (EXk) that computation
is not yet complete (WBk-1)
→ Control hazards
branch instruction may wish to continue at other than next instruction
In these cases: “pipeline stall”
do not issue a new instruction until problem disappears

Pipelining, at example DLX (5)
Example: Data hazard
S1: ADD R1, R2, R3 ; overwrites register R1 in WB1
S2: SUB R4, R5, R1 ; reads register R1 in ID2
...
issue  cycle   PM     Dec/Opnd   ALU1   DM/ALU2   Regs
I1     t=1     IF1
I2     t=2     IF2    ID1
I3     t=3     IF3    ID2        EX1
I4     t=4     IF4    ID3        EX2    MEM1
I5     t=5     IF5    ID4        EX3    MEM2      WB1
...
→ wrong result!

Pipelining, at example DLX (6)
Processor logic must delay ID2 by (at least) 2 cycles
issue  cycle   PM      Dec/Opnd   ALU1    DM/ALU2   Regs
I1     1       IF1
I2     2       IF2     ID1
-      3       stall   stall      EX1
-      4       stall   stall      stall   MEM1
I3     5       IF3     ID2        stall   stall     WB1
I4     6       IF4     ID3        EX2     stall     stall
(assuming that WB writes registers early and ID reads registers late)
Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)

Pipelining, at example DLX (7)
In some cases the hazard is exposed explicitly
to the (assembler) programmer / compiler for efficiency reasons
unit occupation / issue interval k > 1 cycle (structural hazard)
special unit (e.g., float mul, div) available at most every kth cycle
delayed branch, branch target prediction (control hazard)
Ik+1 executed in any case, indep. of outcome of branch Ik
delayed Load (data hazard)
result written to destination register only after d > 0 delay cycles
→ constraints for code placement / instruction scheduling,
fill delay slots with useful instructions or with NOP
Most superscalar processors handle such hazards internally
(dynamic instruction dispatch).
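The stall rule of DLX (6) can be checked with a few lines of code. This sketch assumes the 5-phase pipeline from above, no forwarding, and that a consumer's ID may happen in the same cycle as its producer's WB (WB writes early, ID reads late). The program encoding as (destination, sources) pairs and the function names are made up for this illustration.

```python
def id_cycles(program):
    """Return the cycle in which each instruction performs its ID phase.

    program: list of (dest_register_or_None, tuple_of_source_registers).
    """
    ids = []
    wb_cycle = {}                     # register -> WB cycle of its latest producer
    for k, (dest, sources) in enumerate(program):
        earliest = 2 if k == 0 else ids[-1] + 1     # in-order: one ID per cycle
        for reg in sources:
            earliest = max(earliest, wb_cycle.get(reg, 0))
        ids.append(earliest)
        if dest is not None:
            wb_cycle[dest] = earliest + 3           # ID, then EX, MEM, WB
    return ids


def speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup formula from DLX (6)."""
    return pipeline_depth / (1 + stall_cycles_per_instruction)


# S1: ADD R1, R2, R3   S2: SUB R4, R5, R1
print(id_cycles([("R1", ("R2", "R3")), ("R4", ("R5", "R1"))]))
print(speedup(5, 1.0))
```

For the ADD/SUB pair the second ID moves from cycle 3 to cycle 5, i.e. the two stall cycles shown in the table.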
Pipelining, at example DLX (8): data / control hazard example
Example: [Hennessy/Patterson’03]
for (i=1000; i>0; i--)
    x[i] += s;

Loop:   L.D     F0,0(R1)      ;F0=array elem.
        ADD.D   F4,F0,F2      ;add scalar in F2
        S.D     F4,0(R1)      ;store result
        DADDUI  R1,R1,#-8     ;decrem. pointer
        BNE     R1,R2,Loop    ;branch if R1!=R2

for single-issue MIPS processor:
producer ins.   consumer ins.   latency
FP ALU op       FP ALU op       3
FP ALU op       store double    2
Load double     FP ALU op       1
Load double     Store double    0

Execution takes 10 clock cycles:
t=1:    L.D     F0,0(R1)
2       (stall)
3       ADD.D   F4,F0,F2
4       (stall)
5       (stall)
6       S.D     F4,0(R1)
7       DADDUI  R1,R1,#-8
8       (stall)
9       BNE     R1,R2,Loop
10      (stall)               ;delay slot

reschedule loop → 6 clock cycles:
t=1:    L.D     F0,0(R1)
2       DADDUI  R1,R1,#-8
3       ADD.D   F4,F0,F2
4       (stall)
5       BNE     R1,R2,Loop
6       S.D     F4,8(R1)      ;(altered)

[Figure (VLIW processor): PC, MEM, REGISTER FILE; long instruction word with the slots addi, ADD, load, FMUL, NOP, NOP, SHIFT]
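The cycle counts of the DLX (8) loop example can be recomputed from the latency table. The sketch below is a simplified in-order, single-issue model. Two things in it are assumptions that do not appear in the table but are implied by the 10-cycle schedule: a latency of 1 between an integer operation and a dependent branch, and an explicit NOP filling the empty branch delay slot. The tuple encoding (kind, dest, sources) is hypothetical.

```python
LATENCY = {                      # (producer kind, consumer kind) -> stall cycles
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,        # assumption, not part of the table
}


def total_cycles(program):
    """Issue cycle of the last instruction of one loop iteration."""
    last_writer = {}             # register -> (issue cycle, kind) of latest producer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1     # in-order, one instruction per cycle
        for reg in sources:
            if reg in last_writer:
                p_cycle, p_kind = last_writer[reg]
                earliest = max(earliest, p_cycle + LATENCY.get((p_kind, kind), 0) + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


original = [
    ("load", "F0", ("R1",)),          # L.D    F0,0(R1)
    ("fpalu", "F4", ("F0", "F2")),    # ADD.D  F4,F0,F2
    ("store", None, ("F4", "R1")),    # S.D    F4,0(R1)
    ("int", "R1", ("R1",)),           # DADDUI R1,R1,#-8
    ("branch", None, ("R1", "R2")),   # BNE    R1,R2,Loop
    ("nop", None, ()),                # empty delay slot
]
rescheduled = [
    ("load", "F0", ("R1",)),          # L.D    F0,0(R1)
    ("int", "R1", ("R1",)),           # DADDUI R1,R1,#-8
    ("fpalu", "F4", ("F0", "F2")),    # ADD.D  F4,F0,F2
    ("branch", None, ("R1", "R2")),   # BNE    R1,R2,Loop
    ("store", None, ("F4", "R1")),    # S.D    F4,8(R1) in the delay slot
]
print(total_cycles(original), total_cycles(rescheduled))
```

In this model the branch of the rescheduled loop issues one cycle earlier than in the schedule shown above and the remaining stall falls before the store instead; the total per iteration is the same.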
Architectures for instruction-level parallelism
VLIW (very long instruction word), EPIC
multiple functional units
(integer ALU, shifter, float mul, load/store unit, ...)
direct, static specification of parallel operations
in the corresponding sections of the long instruction word
superscalar processor
multiple functional units
single stream of ordinary instructions (sequential program code)
multiple issue:
k-way superscalar =
up to k > 1 subsequent instructions may start execution simultaneously

VLIW and EPIC processors
VLIW: (very long instruction word)
multiple functional units work in parallel
single stream of long instruction words
with explicitly parallel subinstructions
statically bound to functional units
requires static scheduling (→ compiler)
e.g.: Philips Trimedia, Intel i860
EPIC: (explicitly parallel instruction computer)
one-dim. stream of instructions grouped statically (by markup bits)
instructions in a group may be issued in parallel (as in VLIW)
requires static scheduling (→ compiler)
e.g.: Intel Itanium
Superscalar processors
Example:
Motorola 88110
2-way superscalar
[Figure: I-cache → internal instruction buffer (2 instructions; place 1 holds In, place 2 holds In+1) → DISPATCHER → Unit 1, Unit 2, Unit 3, ..., Unit 10]
dynamic instruction dispatcher maps instructions to available units
if no conflicts: issue both In, In+1
if only 1 instruction can be issued:
issue In, advance In+1 to place 1
if no instruction can be issued:
wait (pipeline stall)
Conflicts:
- pipeline conflicts
- competition for same unit
- out-of-order completion
recognized + resolved by dispatcher
........
(reservation stations, out-of-order
execution, reorder buffer, ...)
VLIW / EPIC / Superscalar processors
Consequences for code generation:
VLIW / EPIC:
static scheduling to keep units busy (“code compaction”)
analyze and resolve hazards at compile time
superscalar:
generate just 1 linear instruction stream as usual,
instruction dispatcher guarantees sequential semantics
explicit, static mapping instructions → units not possible
but:
dispatcher has limited lookahead
static reordering of instructions to remove hazards may help!
Predicated execution
Branch misprediction → large penalty (stall time)
e.g. Pentium IV: 20 pipeline stages,
at misprediction → discard up to 126 in-flight instructions
Not all branches can be predicted well (statically or dynamically)
[Figure: control flow graph with a branch; the T successor is "d = buf&0x1", the F successor is "buf=*inp; inp=inp+1; t=buf>>4; d=t&0xf"; both continue at the same join point]
Alternative: predicated execution,
conversion to predicated code: “if-conversion”
        cmp p2, p3
[p2]    d=buf&0x1
[p3]    buf=*inp
[p3]    inp=inp+1
[p3]    t=buf>>4
[p3]    d=t&0xf
        ..........
removes branch, no pipeline stall, but more work performed
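The effect of if-conversion can be imitated in software: every operation is computed, but its result is only committed when its predicate is true. This is a sketch for illustration; the helper commit, the list mem standing in for the memory behind inp, and the initial values of d and t are assumptions, and it models the semantics only, not the timing.

```python
def branching(cond, buf, inp, mem, d=0):
    """Original code with a branch."""
    if cond:
        d = buf & 0x1
    else:
        buf = mem[inp]
        inp = inp + 1
        t = buf >> 4
        d = t & 0xF
    return d, buf, inp


def commit(pred, new, old):
    """A predicated operation: the new value takes effect only if pred holds."""
    return new if pred else old


def predicated(cond, buf, inp, mem, d=0):
    """If-converted code: no branch, all operations are executed."""
    p2, p3 = cond, not cond                 # cmp p2, p3
    t = 0
    d = commit(p2, buf & 0x1, d)            # [p2] d = buf & 0x1
    buf = commit(p3, mem[inp], buf)         # [p3] buf = *inp
    inp = commit(p3, inp + 1, inp)          # [p3] inp = inp + 1
    t = commit(p3, buf >> 4, t)             # [p3] t = buf >> 4
    d = commit(p3, t & 0xF, d)              # [p3] d = t & 0xf
    return d, buf, inp


mem = [0xAB]
for cond in (True, False):
    print(cond, branching(cond, 0x3, 0, mem), predicated(cond, 0x3, 0, mem))
```

Both versions return the same values for either outcome of the condition; the predicated one always performs all five operations, which is the "more work performed" noted above.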
Memory hierarchy (2)
Caches (Instruction cache, data cache)
cache hit = accessed word already in cache, get it fast.
cache miss = not in cache, load from main memory (slower)
Cache line size: from 16 bytes (Dash) ...
Cache-based systems profit from
+ spatial access locality
(access also other data in same cache line)
+ temporal access locality
(access same location multiple times)
+ dynamic adaptivity of cache contents
→ suitable for applications with high (also dynamic) data locality
→ Compiler issue: optimize for cache utilization

Memory hierarchy (3)
Mapping memory blocks → cache lines / page frames:
direct mapped: ∀j ∃!i : Bj ↦ Ci, namely where i ≡ j mod m.
fully-associative: any memory block may be placed in any cache line
set-associative
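The direct-mapped rule i ≡ j mod m is easy to simulate. The sketch below counts hits and misses for a sequence of byte addresses; the parameter names and the tiny cache geometry (4 lines of 16 bytes) are assumptions for this example.

```python
def simulate_direct_mapped(addresses, num_lines=4, line_size=16):
    """Return (hits, misses) for a direct-mapped cache with num_lines lines."""
    lines = [None] * num_lines          # lines[i] = number of the block held in Ci
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size       # memory block Bj containing the address
        i = block % num_lines           # direct mapped: Bj may only go to Ci, i = j mod m
        if lines[i] == block:
            hits += 1
        else:
            misses += 1
            lines[i] = block            # load the block, evicting the previous one
    return hits, misses


# 0, 4, 8 share block 0 (spatial locality); 64 maps to the same line and evicts it.
print(simulate_direct_mapped([0, 4, 8, 64, 0]))
```

Addresses 0 and 64 belong to different blocks that compete for the same line, so alternating between them misses every time even though the rest of the cache is empty. A set-associative or fully-associative mapping would avoid this conflict.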
Steps and tools for generating an executable program
myprog.c     source program (ASCII)
    ↓ COMPILER:
      FRONT END (lexical, syntactical analysis, type checking)
      → intermediate representation
      → BACK END (code generation)
myprog.s     assembler program (ASCII)
    ↓ ASSEMBLER
myprog.o     object program (not executable) (COFF / ELF / ... )
    ↓ LINKER
a.out        object program (executable) (COFF / ELF / ... )
    ↓ LOADER
(program in execution)
Object code format
COFF = Common Object File Format (Unix)
A COFF program consists (mainly) of 3 segments: (→ protection units)
Text segment – the instructions
Data segment – initialized variables
BSS segment – non-initialized (global) variables
more: header, string table, relocation info, symbolic debug info, ...
Reference:
Gircys: Understanding and Using COFF. O’Reilly, 1988.
Other object code formats:
ELF (Linux), Java bytecode (JVM), ...

Assembler syntax (1)
ASCII textfile
Instruction format:
[ Label : ] instruction-mnemonic operands [ ! comment ]
Example:
start:  addi r4,r5,r4     ! integer addition r4 = r4+r5
        beqz r4, start    ! if r4==0 goto start
        ...
        .globl _main      ! makes label _main globally visible
_main:  ...               ! convention: _ prefix to compiler-generated labels
switch to other segment: .text, .data, .bss

Assembler syntax (2)
Abbreviations
e.g. for registers or instructions
Example:
define(nop, ori r1,r1,r1)    ! nop instruction
define(length,r17)
addi %length,r5,%length      ! expands to r17 += r5
Assembler syntax (3)
Constants
myfconst: .float 3.1415
mystring: .asciz "hello world"    ! trailing \0
myblock:  .space 48, 0            ! 48 bytes filled with 0
(similar: .int, .long, .byte, .double, .word)
Make names globally visible with .globl
          .globl mystring
.align k
next value starts at a byte address divisible by 2^(k-1)
          .byte 17
          .align 3
          .word 74

Assembler
The assembler program
expands abbreviations and macros
resolves local labels
creates a relocation table
sorts into .text, .data, .bss
generates output in COFF format
(instructions and data in target machine format)

Assembler, Linker, Archiver, Libraries
[Figure: file1.c and file2.c are translated by cc to file1.s and file2.s and by as to file1.o and file2.o; ar builds library archives (libc.a, libm.a, ...) with library routines (printf, sqrt, ...); ld (LINKER) combines object files and archives into a.out; the LOADER places a.out in target memory (a.out in execution). Each object file and a.out consists of .text, .data, .bss, a relocation table and a symbol table.]

Relocation (1)
Relocation table
generated by the assembler
list of locations in the (object code) program
where (local) addresses of data or functions are referenced
which must later be patched by the linker or loader
(replaced by the corresponding (global) addresses)
to make the object code executable
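How a relocation table is used can be shown on a toy object program. In this sketch the object code is a list of integer words, the relocation table is a list of word positions that hold local addresses, and base is the address at which the code is finally placed. All three representations are simplifications chosen for this example and do not reflect the actual COFF layout.

```python
def relocate(code, relocation_table, base):
    """Return a copy of code in which every listed location is patched.

    Each entry of relocation_table is a position in code whose word is a
    local address (relative to 0); patching replaces it by local + base.
    """
    patched = list(code)
    for location in relocation_table:
        patched[location] += base
    return patched


# Words 1 and 3 hold the local addresses 4 and 0; words 0 and 2 are plain data.
code = [10, 4, 20, 0]
print(relocate(code, relocation_table=[1, 3], base=1000))
```

Words that are not listed in the table are left untouched, which is why the assembler has to record exactly where addresses occur.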
Relocation (2)
Relocation is necessary to be able to program independently of
the code’s position in the target machine’s address space
+ for the programmer, each .s assembler program starts at local address 0
and is a contiguous block
+ multiple object programs can be linked together to one
→ modularity
+ program can be decomposed into blocks,
blocks can have different levels of protection
+ program can be moved to another block in memory

Linker
1. reads all object codes to be linked, including library archives
2. merges .text, .data, .bss segments of these
(may filter out unused functions)
3. resolves global symbols
checks for duplicate (global) labels and undefined labels
4. writes the resulting object file,
adds a new relocation table to make it relocatable
5. and marks it as executable.
Variants:
+ shared libraries
+ dynamic linking
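Steps 2 and 3 of the linker can be sketched on toy object files. The dictionary layout used here ("text", "defined", "undefined") is an assumption made for this illustration; only the text segment is merged, and relocation of the merged code is left out.

```python
def link(objects):
    """Merge text segments and resolve global symbols.

    Each object is a dict with
      "text":      list of words, addressed locally from 0,
      "defined":   {symbol: local offset},
      "undefined": set of symbols referenced but not defined locally.
    Returns (merged_text, {symbol: offset in merged_text}).
    """
    merged, symbols, needed = [], {}, set()
    for obj in objects:
        base = len(merged)                      # where this object's text starts
        merged.extend(obj["text"])
        for name, offset in obj["defined"].items():
            if name in symbols:
                raise ValueError(f"duplicate symbol: {name}")
            symbols[name] = base + offset
        needed |= obj["undefined"]
    missing = needed - set(symbols)
    if missing:
        raise ValueError(f"undefined symbols: {sorted(missing)}")
    return merged, symbols


main_o = {"text": [1, 2, 3], "defined": {"_main": 0}, "undefined": {"_f"}}
f_o = {"text": [4, 5], "defined": {"_f": 1}, "undefined": set()}
print(link([main_o, f_o]))
```

Linking main_o alone fails with an undefined-symbol error, and linking f_o twice fails with a duplicate-symbol error, matching the two checks named in step 3.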
Loader
1. copy a.out to a free memory block in the target machine
(for Harvard architectures:
text segment to program memory, data/bss to data memory)
2. allocate space for runtime stack and heap in a free memory block
3. relocate addresses according to relocation table
4. start execution (call startup code in executable)

Shared Libraries
[Muchnick 5.7]
+ space economy
+ consistent update
– run-time overhead for dynamic linking
– shared objects are not patchable,
must consist of position-independent code, e.g.
1. PC-relative addressing of routine entries etc.,
2. global offset table in shared library, pointed to by a global pointer gp,
3. stubs for global acc. / procedure linkage table in the non-shared code
to be patched (lazily: call invokes dynamic linker)
requires static check for undefined globals / multiple definitions
against a table of contents for each shared library
Startup code
set up protection for segments (if applicable)
set up initial stack frame and heap
execute global initializers
copy main’s arguments to stack
and call main().
Virtual machines as compilation target
Virtual machine: abstract execution platform simulated in software
(by interpretation or by a further compilation step)
Examples:
stack machines e.g. Java Virtual Machine JVM, C-machine
+ simple code generation for expression evaluation
+ addressing via SP implicit → more compact code
– stack addressing not applicable to global and heap data structures
– all modern processors are register machines → convert [Davis et al.’03]
register machines e.g. Random Access Machine RAM
more flexible, closer to HW, but needs register allocation
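Why code generation for expression evaluation is simple on a stack machine can be seen in a few lines. The sketch below uses a made-up instruction set (push, add, sub, mul) and represents expressions as nested tuples; neither is taken from the JVM or any other real virtual machine.

```python
def gen(expr):
    """Generate stack code by a postorder walk of the expression tree.

    expr is either an int constant or a tuple (op, left, right).
    """
    if isinstance(expr, int):
        return [("push", expr)]
    op, left, right = expr
    return gen(left) + gen(right) + [(op, None)]


def run(code):
    """Interpret the stack code; operands are addressed implicitly via the stack."""
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        else:
            b = stack.pop()
            a = stack.pop()
            if op == "add":
                stack.append(a + b)
            elif op == "sub":
                stack.append(a - b)
            elif op == "mul":
                stack.append(a * b)
            else:
                raise ValueError(f"unknown instruction {op}")
    return stack.pop()


code = gen(("add", 2, ("mul", 3, 4)))    # 2 + 3 * 4
print(code)
print(run(code))
```

No instruction names a register or an operand location, which is what makes the code compact; a register machine would need a register allocation step to produce the equivalent code.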