Intel Itanium IA64 - Merced

Barbora Petrtýlová
Tomáš Kubeš
LS 2002/2003
Presentation for E36APS
presented: 24. 4. 2003
Introduction
Basics
• Brand new Intel architecture (designed from the ground up)
• Not compatible with x86
• 64 bit (only 44bit physical addressing, 54bit virtual addressing)
• RISC + Superscalar
• EPIC (Explicitly Parallel Instruction
Computing)
• Speeds 733, 800MHz
Performance
• EPIC technology enables up to 20
operations/clock (peak)
• BUT needs optimized code
• First tests: running x86 code (SW emulation) gave performance at the level of a 150 MHz Pentium (hence the nickname "Itanic")
• Generally server or workstation processor;
up to 32 processors can be in one
machine (512 processors can work
together)
HW Overview I
Massive HW resources
• 17 execution units
• 128 integer registers
• 128 floating point registers
• 64 predicate registers
• 8 branch registers
• Supports register stacking, register rotation, predication, branch hints, speculation, parallelism
• Processor also contains its own ROM (CPU information) and EEPROM (can be programmed)
HW Overview II
• Execution: 6 inst. per clock (effective value up to 20 ops.)
• 10 stage pipeline
• 4 ALU, 4 Multimedia ALU, 4 FP (up to 8 FP ops./cycle), 2 Load/Store, 3 Branch units
• Allows instruction templates that can increase the effective number of executed instructions
• Cache: 32KB (split) L1 / 96KB L2 / 2-4MB L3
• Transistor count: 25 million transistors in CPU; 300 million in cache.
Instruction Set & Predicating
Basics
• Instruction set: based on classical RISC, but
offering new instructions for branch
prediction, prefetching, and parallelism
• Has SIMD instructions (1 inst. operating with
multiple single prec. FP or integer data)
• Inst. set architecture: "Revolutionary": each triplet of instructions is packed into a bundle, whose template can give the instructions special properties
• Each instruction has a predicate, a reference to a predicate register: if its value is 0, the instruction is carried out as a NOP.
Instruction Format
[qp] mnemonic [comp1] [comp2] dest src
• qp: qualifying predicate register
• mnemonic: instruction name
• comp: completer (qualifier)
• dest, src: destination and source registers, generally 3 in total
Bundle (128b): instruction 2 (41b) | instruction 1 (41b) | instruction 0 (41b) | template (5b)
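The bundle layout can be checked with a little bit arithmetic. The following is a minimal sketch in Python, assuming the template occupies the lowest 5 bits and the three 41-bit slots follow in order (instruction 0 lowest). The names pack_bundle and unpack_bundle are illustrative only and do not come from any Intel tool.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit instruction slots into 128 bits."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot must fit in 41 bits")
    return (template
            | slot0 << TEMPLATE_BITS
            | slot1 << (TEMPLATE_BITS + SLOT_BITS)
            | slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, slot0, slot1, slot2)."""
    template = bundle & TEMPLATE_MASK
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
                  for i in range(3))
    return (template, *slots)
```

Note that 5 + 3*41 = 128, so the three instructions and the template fill the bundle exactly, with no spare bits.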
EPIC “architecture“
• Compiler knows about parallelism.
• Compiler supports parallelism and is able to express it.
• Dependences must be marked explicitly: groups of instructions that can be executed in parallel are separated by an explicit marker, the cycle break.
Instruction Bundles
• Machine fetches two bundles each clock.
• Each bundle can have its own template.
• Depending on the template, an instruction can represent more than one operation.
Ex: Standard: up to 8 ops. – 2 branches, 2 load/store, 2 ALU, 2 post increment
Scientific: 12 ops. – 4 dp load, 4 dp FP, 2 ALU, 2 post increment
Digital content: 20 ops. (SIMD) – 8 sp load, 8 sp FP, 2 ALU, 2 post inc.
*there are 10 template formats all together
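The peak figures in the examples above are simple sums of the per-category operation counts. A small Python sketch of that arithmetic follows; the dictionary keys merely restate the slide's categories, and the names are illustrative.

```python
BUNDLES_PER_CLOCK = 2
INSTRUCTIONS_PER_BUNDLE = 3

# Operation counts per clock, copied from the three examples on the slide.
PEAK_EXAMPLES = {
    "standard": {"branch": 2, "load/store": 2, "ALU": 2, "post increment": 2},
    "scientific": {"dp load": 4, "dp FP": 4, "ALU": 2, "post increment": 2},
    "digital content": {"sp load": 8, "sp FP": 8, "ALU": 2, "post increment": 2},
}


def instructions_per_clock():
    """Instructions fetched per clock: two bundles of three instructions."""
    return BUNDLES_PER_CLOCK * INSTRUCTIONS_PER_BUNDLE


def peak_ops(example):
    """Total operations per clock for one of the slide's examples."""
    return sum(PEAK_EXAMPLES[example].values())
```

So the 6 instructions issued per clock correspond to 8, 12 or 20 operations, depending on how many operations each instruction carries.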
Predicating
• Itanium has 64 one-bit predicate registers.
• Each physical instruction has a predicate.
• The predicate determines whether the instruction is executed normally or as a NOP.
• This makes it possible to implement program branching without jumps, which disturb the pipeline so much.
Predicating – Motivation I
• Why predicate? Let's look at an example!
(note: the example does not correspond exactly to a real Itanium program; that would make it too complicated)
Ex:
some independent instructions A
add r2, r3, r4 ;;
cmp r2, r7 ;;
je equal
some instructions B
equal:
some instructions C
(";;" marks a dependence of the next instruction)
• What happens if the branch is taken, but the processor predicted it would not be?
Predicating – Motivation II
• On a pipelined processor, a mispredicted branch usually means flushing the whole pipeline!
• For Itanium, it would also mean emptying buffers and queues = throwing away 9*6 instructions from the pipeline + those from the buffer, which could mean losing up to 200 effective operations (not counting the probable need for a memory access)
• Tests show that 5%-10% of wrong predictions can decrease performance by 25%!
NIGHTMARE OF ALL HW DEVELOPERS
Predicating – Motivation III
• So what would Itanium do? (assuming an optimizing compiler)
- Use scheduling and predication!
Original code:
some indep. instructions A
add r2, r3, r4 ;;
cmp r2, r7 ;;
je equal
some instructions B
equal:
some instructions C
other instructions D
Scheduled and predicated code:
add r2, r3, r4
some indep. instructions A
cmp.eq p1, p2=r2, r7
some more indep. insts.
(p1) some instructions B
(p2) some instructions C
other instructions D
• This results only in the loss of the few instructions that are predicated off.
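To make the idea concrete, here is a small Python sketch that models predicated execution. It assumes plain if/else semantics (block B runs when r2 equals r7, block C otherwise), which is a simplification of the slide's example, and all function names are hypothetical.

```python
def run_with_branch(r3, r4, r7):
    """Conventional version: a conditional jump selects block B or block C."""
    r2 = r3 + r4
    trace = []
    if r2 == r7:
        trace.append("B")
    else:
        trace.append("C")
    trace.append("D")
    return trace


def run_predicated(r3, r4, r7):
    """Predicated version: no jump; every instruction carries a predicate."""
    r2 = r3 + r4
    p1 = r2 == r7          # models: cmp.eq p1, p2 = r2, r7
    p2 = not p1            # p2 receives the complement of p1
    program = [(p1, "B"), (p2, "C"), (True, "D")]
    trace = []
    for predicate, block in program:
        if predicate:      # predicate 1: the instruction executes normally
            trace.append(block)
        # predicate 0: the instruction flows through the pipeline as a NOP
    return trace
```

Both versions produce the same trace for any input, but the predicated one contains no branch that could be mispredicted; its cost is the issue slots spent on the predicated-off block.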
Predicating - Effects
• Removes unpredictable branches
• The basic block of parallel instructions grows
• ILP in a block increases
• Thus it allows much better resource utilization
(Figure: code before and after predication; predication transforms the former into the latter.)
Predicating - Conclusion
• Predication means that both outcomes of a branch are put into execution and the wrong one is discarded
• The reduction of the branch misprediction penalty is HUGE
• This effect is significant for short branches
• Drawback: instructions for both outcomes need to be fetched, so this method is effective only for short branches
Instruction Pipeline
INSTRUCTION PIPELINE
BLOCK DIAGRAM
INSTRUCTION PIPELINE
DESCRIPTION
• 10 stage in-order core pipeline
• executes up to 6 instructions in parallel each cycle
• 2-3 stages shorter than Pentium III but 2-3 stages longer than Alpha 21264
• queue of instructions
-> instruction fetch continues even when the execution part is stalled, and vice versa
• 8 bundles of instructions, i.e. 24 instructions, in the queue
-> enough to cover a resteer
-> insufficient to fully cover a misprediction
INSTRUCTION PIPELINE
FRONT-END
• IPG - address calculation
• FET - instruction cache access
• ROT - instruction rotation
-> instruction fetch and instruction delivery into a decoupling buffer in ROT
-> the bold line in the diagram is the point of decoupling
INSTRUCTION DELIVERY
• EXP - expand
• REN - register rename/remapping
-> dispersal and register renaming
INSTRUCTION PIPELINE
OPERAND DELIVERY
• WLD - word-line decode
• REG - register read
-> operand delivery
EXECUTION CORE
• EXE - instruction execution
• DET - exception detection
• WRB - writeback
-> wide parallel execution, followed by exception management and retirement
• Itanium does not change the order of instructions, BUT they may finish in a different order
Instruction Fetching & Jump Prediction
Instruction Fetching
Fetch:
• Itanium has 16K of L1 instruction cache (4-way set associative).
• Fetches 2 bundles (6 inst.) per clock
• They are fed to the decoupling buffer (holds 8 bundles)
• From the decoupling buffer, they are sent to the inst. issue & reg. rename logic, depending on the availability of resources (and on possible dependencies)
• Instructions are issued strictly in order
Instruction Prefetching
Prefetch
• Itanium has sophisticated prediction logic (its principles are explained 2 slides later)
• Probable target addresses can be stored in 4 target address registers
• Itanium tries to speculatively fetch instructions for the possible branch outcome into the decoupling buffer
• Prefetch can also be initiated by SW, which can probe whether instructions for the branch target are in the L1 cache
• SW-prefetched instructions are taken from the L2 cache, filled into a streaming buffer and eventually stored in the L1I cache
Why Predict?
• The penalty for a wrong branch prediction is very high: 9 cycles are lost = 9*6 instructions; a memory access might be required = major slowdown
• Short branches can be fully eliminated by predication, but long ones cannot (not effective)
• Longer branches need to be accurately predicted: tests show that 5%-10% of wrong predictions can decrease performance by 25%!
• So Itanium is equipped with complex prediction mechanisms, both on HW and compiler level
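The cost of a misprediction can be estimated with a back-of-the-envelope model. The Python sketch below uses the slide's figures (9 cycles lost, 6 instructions per cycle); the base CPI and the fraction of branches among instructions are free parameters chosen by the caller, so the model is an illustration only and is not meant to reproduce the measured 25% figure.

```python
def flushed_issue_slots(stages_lost=9, issue_width=6):
    """Instruction slots thrown away by one pipeline flush (9 * 6 on the slide)."""
    return stages_lost * issue_width


def relative_runtime(base_cpi, branch_fraction, mispredict_rate, penalty_cycles=9):
    """Runtime relative to perfect prediction under a simple additive CPI model.

    base_cpi         -- cycles per instruction with perfect prediction (assumed)
    branch_fraction  -- share of instructions that are branches (assumed)
    mispredict_rate  -- share of branches that are mispredicted
    penalty_cycles   -- cycles lost per misprediction
    """
    extra_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return (base_cpi + extra_cpi) / base_cpi
```

For example, relative_runtime(1.0, 0.2, 0.05) models a machine where every fifth instruction is a branch and 5% of branches are mispredicted; the lower the base CPI of a wide machine, the larger the relative damage done by the same number of lost cycles.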
HW for Branch Prediction
• 4 TAR registers: each stores a branch target address and the address of the branch instruction; the latter is compared with the current Program Counter, and when the PC reaches this value, the instruction at the target address is fetched in the next cycle
• 8 item RSB: return stack buffer, to know where to return from procedure calls
• 512 item BHT: branch history table (20Kbit). Itanium does not store branches designated as statically predicted, which increases BHT efficiency
• 64 item BTAC: branch target address cache
• 2 branch address calculation units
Branch Prediction I
• Itanium employs a hierarchy of branch prediction
structures to deliver high-accuracy predictions
• It is assisted by branch hint directives: branch prediction instructions (brp) and hints directly in the branch instruction code
• Those provide: branch target address, static hint, and an indication of where to use dynamic prediction
• Machine provides 4 progressive predictions and corrections to the fetch pointer
Resteer 1: single-cycle, using the address from the compiler-fed Target Address Registers (Itanium has 4 TARs)
Resteer 2: two-level multiway predictor and return predictor
Resteer 3, 4: branch address calculation and prediction (BAC1, BAC2)
Branch Prediction II
• Compiler is able to load items into BTAC
• A branch that is in BTAC or RSB goes directly to the pipeline (fetch)
• If a branch is in BHT but not in BTAC, the target address needs to be calculated in one of the BACs (branch address calculators)
• If a branch is not in BHT, the BAC will use info from the static prediction hint
• BAC1 is able to detect the end of a loop (and suppress the TAR)
• BAC2 can compute the target address of any branch
Reminder (block diagram of the front-end prediction structures: PC, I-Cache and I-TLB, 4-entry TAR, 512-entry BHT, 64-entry MBHT, 8-entry RSB, 64-entry BTAC, BAC1, BAC2 and the 8-bundle instr. queue, across stages IPG, FET, ROT, EXP):
TAR – target address register
RSB – return stack buffer
BHT – branch history table
BTAC – branch target address cache
Branch Prediction
• Using a prediction from BTAC or RSB causes a 1-cycle bubble in instruction fetch
• BAC1 causes a 2-cycle bubble
• BAC2 even a 3-cycle bubble
BUT:
• Instruction fetch is decoupled from instruction execution, so bubbles can usually be compensated by instructions waiting in the decoupling buffer
• So the pipeline only stops if the prediction was wrong, since the true result of a branch is known in the DET stage
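How the decoupling buffer hides fetch bubbles can be illustrated with a toy cycle-by-cycle model. This Python sketch assumes an 8-bundle buffer, a front-end that delivers 2 bundles per cycle unless it is in a bubble, and a back-end that always wants 2 bundles per cycle; the function name and the strict "deliver first, then consume" ordering within a cycle are simplifying assumptions.

```python
def backend_starved_cycles(bubbles, initial_fill, capacity=8, width=2):
    """Count the cycles in which the back-end finds too few bundles to issue.

    bubbles      -- sequence of booleans, True = front-end delivers nothing
    initial_fill -- bundles already waiting in the decoupling buffer
    """
    fill = initial_fill
    starved = 0
    for bubble in bubbles:
        if not bubble:
            fill = min(capacity, fill + width)   # front-end delivers 2 bundles
        if fill >= width:
            fill -= width                        # back-end issues 2 bundles
        else:
            starved += 1                         # nothing to issue this cycle
    return starved
```

With a full buffer, even the 3-cycle BAC2 bubble is absorbed completely: backend_starved_cycles([True, True, True], initial_fill=8) reports no starved cycle, whereas the same bubble with an empty buffer stalls execution for all three cycles.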
Branch Prediction Conclusion
• The branch misprediction penalty is high for Itanium
• So Itanium possesses powerful tools for branch prediction: various prediction hints on compiler level and complex prediction logic on HW level
• In most cases this ensures that the proper instructions for the branch outcome are fed to the pipeline before the true result is known
• Thus the pipeline only needs to stop when the prediction was wrong, which is rare if the code is optimized properly
Instruction Queue & Execution,
Work with Registers
INSTRUCTION QUEUE
• buffer between the first (instruction fetch) and the second (execution) part of the pipeline
-> one part can keep working even while the other is stalled
• queue is sized for 8 bundles (triplets of instructions)
• up to 2 bundles can be selected in each cycle
INSTRUCTION EXECUTION
• takes place in the execution part of the pipeline
• in-order instruction execution
• the instruction stream is divided into so-called instruction groups
• the end of an instruction group is encoded in the bundle template
• if instructions selected from the instruction queue belong to the same instruction group:
-> they are assigned to functional units
-> operands are chosen for those instructions
-> renaming/remapping of registers (if necessary) is performed
-> contents of the specific (now renamed!) registers are read
-> the instruction is executed
-> possible exceptions are detected and misprediction is checked
-> the result is written
WORK WITH REGISTERS
• large number of registers
-> register file
AVAILABLE REGISTERS
• 128 integer registers (64 bits + 1 NaT bit)
• 128 floating point registers
• 64 predicate registers (1 bit fire/do not fire)
• 8 branch registers
• 128 application registers
• CPUID registers
WORK WITH REGISTERS
GENERAL PURPOSE REGISTERS
• 65 bits wide: 64 bits for data and one NaT bit
• accessible from all privilege levels
• 2 subgroups
-> static GRs
• GR0 – GR31 -> visible and shared by all subprograms (procedures)
• GR0 has permanent value 0
-> stacked GRs
• GR32 – GR127
• register stack frame
WORK WITH REGISTERS
FLOATING POINT REGISTERS
• IA-64 fully implements IEEE 754 standard
• accessible from all privilege levels
• usage: floating point operations
• 2 subgroups
-> static FRs
• FR0 – FR31
• FR0 and FR1 have permanent value 0.0 and +1.0, respectively
-> rotating FRs
• FR32 – FR127
• can be renamed -> acceleration of loop execution
WORK WITH REGISTERS
PREDICATE REGISTERS
• accessible from all privilege levels
• usage: store results of compare instructions
• 2 subgroups
-> static PRs
• PR0 – PR15
• PR0
- if used as a source operand, it has the permanent value 1
- if used as a destination operand, the result of the operation is ignored
-> rotating PRs
• PR16 – PR63
• can be renamed -> acceleration of loop execution
WORK WITH REGISTERS
BRANCH REGISTERS
• usage: store information about branching of the program (target addresses of indirect branches)
• accessible from all privilege levels
INSTRUCTION POINTER
• holds the address of the bundle containing the currently executed IA-64 instruction
-> its value can be read directly but not modified directly
• lowest four bits are zero (bundles are 16 bytes long and aligned)
CURRENT FRAME MARKER
• describes the current state of the stack frame
• cannot be read or written directly
CPU IDENTIFICATION REGISTERS (CPUID)
• there are at least 4 of them
• registers 0 – 3 contain information about the processor
WORK WITH REGISTERS
APPLICATION REGISTERS
• usage: special-purpose registers for specific operations
• kernel registers (AR0 – 7)
• previous function state register (AR64)
• loop count register (AR65)
• etc.
USER MASK
• information about the addressing method and the arrangement of addressable units in memory
• byte order of multi-byte units (big-endian, little-endian) and user-defined performance monitors
PERFORMANCE MONITOR DATA REGISTERS
• information about instruction execution
• can be written only by privileged instructions
• readable from all privilege levels
WORK WITH REGISTERS
ROTATING REGISTERS
• general purpose (GR32 – GR127), floating-point (FR32 – FR127), and predicate registers (PR16 – PR63)
-> the RRB value for each type is stored in register CFM
• enable more effective loop execution
• loop execution: sequential or parallel
• problem with parallel execution: each loop iteration works with the same registers
-> loop unrolling
• most modern processors use so-called register renaming
• on the hardware level: usually complicated
• on the software level: done by the compiler
-> may lead to code enlargement
• IA-64 uses loop unrolling but reduces the code enlargement by rotating registers
• inside the loop, offsets from a base (saved in the Register Rename Base register) are used instead of "absolute" register numbers
-> the value of this register is decremented after each rotation
WORK WITH REGISTERS
ROTATING REGISTERS
• one iteration of a loop:
-> if a value A is stored in some register X, then after one rotation the value A will be located in register X+1
-> if X is the highest register, for example general purpose register GR127, then the value A will be located in register GR32 after one iteration
ex. simple loop (the register "rXX" in code represents a GRXX):
loop: ld4 r34=[r10],4
; load 4B into "r34" from the address
; stored in r10
st4 [r11]=r36,4
; store the value loaded into "r34" two iterations
; earlier (now visible as "r36") to the address stored in r11
br.ctop loop
; decrement loop counter, rotate registers and
; branch
note: "r34" denotes only an offset; in reality (if rrb.gr=40) it could be register GR74
• registers are renamed automatically in hardware to improve software loop performance without the additional overhead required in traditional models
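The renaming rule can be sketched in a few lines of Python. This is a simplified model that assumes the whole range GR32 – GR127 (96 registers) rotates and that the base is decremented once per loop iteration; the names physical, rotate and copy_with_rotation are illustrative, and the stage predicates that real code uses to suppress the first stores are left out.

```python
ROT_BASE = 32    # first rotating general register
ROT_SIZE = 96    # GR32 .. GR127 (assumed: the whole range rotates)


def physical(logical, rrb):
    """Map a register number written in the code to a physical register."""
    if logical < ROT_BASE:
        return logical                       # static registers are not renamed
    return ROT_BASE + (logical - ROT_BASE + rrb) % ROT_SIZE


def rotate(rrb):
    """One rotation: the rename base is decremented (wrapping around)."""
    return (rrb - 1) % ROT_SIZE


def copy_with_rotation(src):
    """Model of the ld4 r34 / st4 r36 / br.ctop loop from the slide."""
    regs = [None] * 128
    rrb = 0
    stored = []
    for word in list(src) + [None, None]:    # two extra iterations drain the loop
        regs[physical(34, rrb)] = word       # ld4 r34 = [r10], 4
        stored.append(regs[physical(36, rrb)])  # st4 [r11] = r36, 4
        rrb = rotate(rrb)                    # br.ctop rotates the registers
    return stored[2:]                        # the first two stores see no data yet
```

With rrb = 40, physical(34, 40) gives 74, matching the note on the slide; after one rotation the same physical register is addressed as "r35", and after two as "r36", which is why the store reads r36.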
WORK WITH REGISTERS
REGISTER STACK
• more efficient handling of subprograms (procedure calls)
• registers GR32 – GR127
• each subroutine (procedure) has a set of registers reserved for itself
-> stack frame (0 – 96 registers)
• achieved by register renaming
• appears to be infinite from the outside!!!
• Register Stack Engine (RSE): saves registers to memory and restores them when the physical registers run out
• stack frame is generally divided into two sections:
-> local data part
-> output data part
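The overlap between a caller's output part and the callee's frame can be modelled as follows. This Python sketch is a deliberately simplified model: the backing storage is unbounded (standing in for the RSE spilling to memory), and the class and method names are hypothetical rather than taken from the architecture manuals.

```python
class RegisterStack:
    """Toy model of stacked registers r32.. with overlapping frames."""

    def __init__(self):
        self.storage = {}     # unbounded backing store (stands in for RSE + memory)
        self.base = 0         # storage index that the current r32 maps to
        self.n_locals = 0     # size of the local part of the current frame
        self.n_outputs = 0    # size of the output part of the current frame
        self.saved = []       # saved frame markers of the callers

    def alloc(self, n_locals, n_outputs):
        """Resize the current frame (at most 96 registers in total)."""
        if n_locals + n_outputs > 96:
            raise ValueError("a frame holds at most 96 registers")
        self.n_locals, self.n_outputs = n_locals, n_outputs

    def _index(self, reg):
        if not 32 <= reg < 32 + self.n_locals + self.n_outputs:
            raise IndexError("r%d is outside the current frame" % reg)
        return self.base + (reg - 32)

    def write(self, reg, value):
        self.storage[self._index(reg)] = value

    def read(self, reg):
        return self.storage.get(self._index(reg))

    def call(self):
        """Enter a procedure: the caller's outputs become the callee's r32.."""
        self.saved.append((self.base, self.n_locals, self.n_outputs))
        self.base += self.n_locals
        self.n_locals, self.n_outputs = 0, self.n_outputs

    def ret(self):
        """Return: restore the caller's frame."""
        self.base, self.n_locals, self.n_outputs = self.saved.pop()
```

A caller with 2 local registers and 1 output writes its argument to r34; after call() the callee reads the same value from r32 without any copying, which is the effect of the renaming.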
Instructions Load/Store,
Speculative Execution
LOAD AND STORE INSTRUCTIONS
LOAD
• transfers data into general-purpose (GR) and floating-point (FR) registers and
possibly floating-point pairs (pairs of floating-point registers)
• general register loads: data of size 1, 2, 4 and 8 bytes can be transferred
• floating-point loads:
-single precision (4B)
-double precision (8B)
-double-extended precision (10B)
-single precision pair (8B)
-double precision pair (16B)
• load instructions can be speculative
LOAD AND STORE INSTRUCTIONS
STORE
• the reverse of the load instructions: transfer data from registers to memory
• NOT defined for floating-point pairs!
• data of the same sizes as for loads can be stored (except for single and double precision pairs)
• all store instructions are NON-speculative
SPECULATIVE EXECUTION
• enables reduction of memory latency
• at compile time, the compiler places an instruction that is to be executed speculatively earlier in the code than its original position
• speculative instructions are those that may be executed this way
• any instruction that stores its result into GRs or FRs can be speculative
• an instruction that modifies other registers is non-speculative
• speculation as implemented in Itanium has two forms
SPECULATIVE EXECUTION
CONTROL SPECULATION
• optimisation in which an instruction is executed before the dynamic flow of the program reaches the place where its result is needed
-> for instructions with a long execution time
• the rest of the program is executed in parallel
• basic necessary condition: if a speculatively executed instruction raises an exception, the exception is deferred and the erroneous state is marked in the target register
-> if it is a general purpose register, its NaT bit is set to 1
-> if it is a floating point register, the NaT value is stored in it
• implementation: the speculative instruction (e.g. ld.s) is placed in the code before the position of the original instruction (in this case ld), and an instruction that checks the result of the speculative execution (chk.s) is placed at the original position
SPECULATIVE EXECUTION
CONTROL SPECULATION
• result(s) of a speculative instruction can be used for further speculative execution
• a deferred exception is propagated as an exception token
• usually applied to instructions in different branches of a program, which are then executed speculatively before the branch itself
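The deferral mechanism can be imitated in Python by carrying a NaT flag next to every register value. The sketch below is an analogy only: MEMORY, ld_s, add and chk_s are hypothetical helpers modelled loosely on the ld.s and chk.s instructions, and a "fault" is simply an address missing from the dictionary.

```python
MEMORY = {0x1000: 42}   # toy memory: any other address "faults"


def ld_s(address):
    """Speculative load: a fault is deferred by returning NaT = True."""
    if address in MEMORY:
        return (MEMORY[address], False)
    return (0, True)


def add(a, b):
    """Speculative use of a result: the NaT flag propagates to the sum."""
    return (a[0] + b[0], a[1] or b[1])


def chk_s(register, recovery):
    """Check at the original position: run recovery code if NaT is set."""
    value, nat = register
    return recovery() if nat else value
```

For instance, chk_s(add(ld_s(0x1000), (1, False)), recovery=lambda: -1) yields 43, while the same expression with a faulting address yields the recovery result -1; the exception only matters if the program actually reaches the check.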
SPECULATIVE EXECUTION
DATA SPECULATION
• optimisation which enables speculative execution of instructions whose operands could depend on the results of other (non-speculative) instructions -> data dependencies
• example: a store instruction precedes the load instruction which we would like to execute speculatively, ahead of the store
-> question: how do we know whether the result of the load depends on the store that now follows it?
-> ALAT: Advanced Load Address Table
(Merced: the ALAT has 32 entries and is indexed by a 7-bit register ID)
• no deferral of exceptions
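The role of the ALAT can be sketched as a small table keyed by the target register. In this Python model an advanced load records its address, a store to the same address removes the entry, and the later check succeeds only if the entry survived. The class is a simplification: addresses must match exactly, and the eviction of the oldest entry when the table is full is an assumption, not documented Merced behaviour.

```python
class ALAT:
    """Toy Advanced Load Address Table: register id -> load address."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}            # dicts keep insertion order (oldest first)

    def advanced_load(self, reg, address):
        """Model of an advanced load: remember which address reg was loaded from."""
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]             # assumed eviction policy
        self.entries[reg] = address

    def store(self, address):
        """Model of a store: invalidate every entry with a matching address."""
        for reg in [r for r, a in self.entries.items() if a == address]:
            del self.entries[reg]

    def check(self, reg):
        """Model of the check: True means the speculative value is still valid."""
        return reg in self.entries
```

If check() returns False, the load has to be repeated at its original position, because an intervening store may have changed the data.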
SPECULATIVE EXECUTION
SUMMARY
• improves performance by allowing the compiler to schedule load instructions
ahead of branches and stores to reduce memory latency
• the basis of the mechanism is the physical execution of an instruction before its actual location in the program
-> results of the instruction are already known at the moment they are needed
-> accelerates program execution in the case of time-consuming instructions
• if the result of the speculatively executed instruction is incorrect -> the instruction must be executed again at its actual location in the program
Conclusion
Itanium IA64 - Merced:
• Processor designed from the ground up, based on the latest knowledge (design thoughts started in about 1998), using many new techniques and having new features: 64bit, massive HW resources, superscalar execution (6 inst/clock), RISC & EPIC, register renaming + stacking, sophisticated jump predictions, predication...
• The drawback of these is that code optimization is vital, absolutely necessary.
• It was quite big piece to handle – even for
References
• [1] Intel: Itanium Product Brief (www.intel.com)
• [2] Intel: Itanium Hardware Developer's Manual
(www.intel.com)
• [3] Intel: Itanium Data Sheet (www.intel.com)
• [4] Intel: IA64 Assembler: Users Guide
(www.intel.com)
• [5] Intel: Understanding Itanium Architecture –
presentation slides (www.intel.com)
Presentation can be downloaded also from:
www.tomaskubes.net/download.html