Lecture 6

Computer architecture
Lecture 6: Processor’s structure
Piotr Bilski
Processor's tasks:
• Instruction fetching
• Instruction interpretation
• Data fetching
• Data processing
• Data saving
These tasks justify the existence of registers (temporary
storage space inside the processor)
Internal processor’s structure
• ALU: status flags, shifter, complementer, arithmetic and Boolean logic
• Registers
• Control Unit
Block Diagram of the Pentium 3 Processor
Block Diagram of the P6 Core (Pentium Pro), 1995
• Front-end of the
processor
• Core
• Completion unit
Register types
• Accessible to the user (addressing, data, etc.)
• Inaccessible to the user (control, status)
• This categorization is not formal!
Registers accessible to the user
• General Purpose Registers (GPR)
• Data
• Addressing (segment pointer, stack pointer, index)
• Condition codes (status indicators, flags): read-only!
Control and status registers
• Basic:
– Program Counter (PC)
– Instruction Register (IR)
– Memory Address Register (MAR)
– Memory Buffer Register (MBR)
• Program Status Word (PSW)
• Interrupt Vector Register
• Page Table Pointer
Program Status Word
[Figure: 16-bit Program Status Word, bits 0 to 15. Fields shown: S, Z, P, R, O, I, N; the remaining bits are marked OTHER.]
S – sign bit
Z – bit set if the operation result is zero
P – carry bit
R – logical comparison result bit
O – overflow bit
I – enables/disables interrupt execution
N – supervisor mode
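How the S, Z, P and O bits could be derived from an addition can be sketched as follows. This is a minimal illustration, not the behaviour of any specific processor: the 8-bit operand width and the function name `add8_with_flags` are assumptions made for the example, and the bit letters follow the legend above.

```python
def add8_with_flags(a: int, b: int) -> tuple[int, dict[str, int]]:
    """Add two 8-bit values and derive PSW-style condition bits (illustrative only)."""
    a &= 0xFF
    b &= 0xFF
    raw = a + b                 # up to 9 bits wide
    result = raw & 0xFF         # what actually fits in the 8-bit register
    flags = {
        "S": (result >> 7) & 1,     # sign: most significant bit of the result
        "Z": int(result == 0),      # zero: result is all zeros
        "P": int(raw > 0xFF),       # carry out of the most significant bit
        # overflow: both operands have the same sign, the result has the other one
        "O": int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }
    return result, flags


result, flags = add8_with_flags(0x7F, 0x01)   # 127 + 1 in signed 8-bit arithmetic
print(hex(result), flags)
```

The example operands show why the carry and overflow bits are separate: 0x7F + 0x01 produces no carry, yet the signed result changes sign.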
Registers in the Motorola MC68000
processor
• Data and address registers (32-bit)
• Specialization: 8 data registers (D0-D7) and 9 address registers (A0-A6 plus two stack pointers, used alternately in the user and supervisor modes)
• Address bus 24-bit, data bus 16-bit
• A7 register used as the Stack Pointer (SP)
• Status register (SR) 16-bit; its lower byte is the condition code register (CCR)
• Program counter (PC) 32-bit
• Instructions are stored at even addresses
Registers in the Intel 8086
Processor
• 16-bit address and data registers
• Data/General Purpose Registers (AX, BX,
CX, DX)
• Pointer and index registers (SP, BP, SI, DI)
• Segment registers (CS, DS, SS, ES)
• Instruction pointer
• Status (flags) register
Intel 8086 Registers (cont.)
AX  Accumulator      SP  Stack pointer
BX  Base             BP  Base pointer
CX  Count            SI  Source index
DX  Data             DI  Destination index
Intel 386 - Pentium Processors
Registers Organization
• 32-bit data and address registers
• Eight General Purpose Registers (EAX,
EBX, ECX, EDX, ESP, EBP, ESI, EDI)
• For backward compatibility, the lower half of each register
is accessible as a 16-bit register (e.g. AX within EAX)
• 32-bit status register
• 32-bit instruction pointer
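The backward-compatible register overlay can be illustrated with plain bit masking. This is only a sketch of the idea: the helper names `sub_registers` and `write_ax` are made up for the example, while AX, AH and AL are the usual x86 names for the low 16 bits of EAX and its two bytes.

```python
def sub_registers(eax: int) -> dict[str, int]:
    """Return the 16-bit and 8-bit views that overlay a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF               # lower 16 bits: the 8086-compatible AX
    return {"EAX": eax, "AX": ax, "AH": ax >> 8, "AL": ax & 0xFF}


def write_ax(eax: int, value: int) -> int:
    """Write a 16-bit value to AX; the upper 16 bits of EAX stay unchanged."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)


views = sub_registers(0x12345678)
print({name: hex(v) for name, v in views.items()})
print(hex(write_ax(0x12345678, 0xABCD)))
```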
Floating-point registers of the
Pentium processor
• Eight 80-bit numerical registers
• 16-bit control register
• 16-bit status register
• 16-bit tag word (content type of each floating-point register)
• 48-bit instruction pointer
• 48-bit data pointer
EFLAGS register
[Figure: EFLAGS bit layout. Bits 21 to 16: ID, VIP, VIF, AC, VM, RF. Bits 15 to 0: NT, IOPL, OF, DF, IF, TF, SF, ZF, AF, PF, CF.]
• TF – trap flag
• IF – interrupt enable flag
• DF – direction flag
• IOPL – I/O privilege level
• RF – resume flag
• AC – alignment check
• ID – identification flag
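Reading individual flags out of a 32-bit EFLAGS value comes down to shifting and masking. The sketch below uses the standard x86 bit positions; the table name `EFLAGS_BITS` and the function `decode_eflags` are illustrative names, not part of any library.

```python
# Bit positions of the single-bit EFLAGS fields (IOPL is a 2-bit field, handled separately).
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9, "DF": 10,
    "OF": 11, "NT": 14, "RF": 16, "VM": 17, "AC": 18, "VIF": 19, "VIP": 20, "ID": 21,
}


def decode_eflags(value: int) -> dict[str, int]:
    """Split an EFLAGS value into named fields."""
    fields = {name: (value >> bit) & 1 for name, bit in EFLAGS_BITS.items()}
    fields["IOPL"] = (value >> 12) & 0b11   # bits 12-13: I/O privilege level (0..3)
    return fields


flags = decode_eflags(0x246)                # an arbitrary sample value
print([name for name, bit in flags.items() if bit])
```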
Registers in the Athlon 64
processor
• Compatibility with the x86-64 architecture (40-bit physical address space, 48-bit virtual address space)
• Data and address registers 64-bit
• 8 general purpose registers (RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP), whose 32-bit halves (EAX etc.) are used in the 32-bit compatibility mode
• 8 additional general purpose registers (R8-R15), available in 64-bit mode (on the Athlon 64 as well as the Opteron)
• 16 SSE registers (XMM0-XMM15)
• 8 floating-point registers x87, 80-bit
Registers in the PowerPC
processor
• 32 general purpose registers (64-bit) + exception register (XER)
• 32 registers for the floating-point unit (64-bit) + status and control register (FPSCR)
• Branch processing unit registers: 32-bit condition register, 64-bit count and link registers
Instruction mode
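[Figure: instruction cycle state diagram. States, in order:]
• Instruction address calculation
• Instruction fetch
• Instruction decoding
• Argument address calculation (repeated for indirect addressing)
• Argument fetching (repeated for multiple arguments)
• Data operation
• Argument address calculation (repeated for indirect addressing)
• Writing the argument (repeated for multiple results)
• Interrupt checking
• Interrupt handling; with no interrupts, the instruction is executed and the next one is fetched
• Return to data (the cycle loops back to process further data)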
Instruction fetching cycle
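[Figure: data flow in the instruction fetching cycle. Inside the processor: PC, MAR, MBR, IR and CU. The processor is connected to memory through the address bus, data bus and control bus.]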
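The register transfers of the fetching cycle can be sketched with a toy model. The order of transfers below (PC to MAR, memory to MBR, PC increment, MBR to IR) follows the common textbook description and is an assumption here, since the slide's figure itself is not recoverable; the class `ToyCPU` and the word-addressed memory list are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ToyCPU:
    memory: list[int]   # word-addressed main memory (assumption for the sketch)
    pc: int = 0         # Program Counter
    mar: int = 0        # Memory Address Register
    mbr: int = 0        # Memory Buffer Register
    ir: int = 0         # Instruction Register

    def fetch(self) -> int:
        """Perform one instruction fetching cycle and return the fetched word."""
        self.mar = self.pc                  # PC -> MAR, address goes onto the address bus
        self.mbr = self.memory[self.mar]    # CU requests a read; data bus -> MBR
        self.pc += 1                        # PC now points at the next instruction
        self.ir = self.mbr                  # MBR -> IR, ready for decoding
        return self.ir


cpu = ToyCPU(memory=[0x10, 0x20, 0x30])
print(hex(cpu.fetch()), cpu.pc)
```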
Indirect mode
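[Figure: data flow in the indirect mode. Inside the processor: MAR, MBR and CU. The processor is connected to memory through the address bus, data bus and control bus.]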
Interrupt mode
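[Figure: data flow in the interrupt mode. Inside the processor: PC, MAR, MBR and CU. The processor is connected to memory through the address bus, data bus and control bus.]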
Pipeline
• Problem: during the instruction cycle only
one instruction is processed
• Solution: divide the cycle into smaller
stages that can overlap
• Condition: there must be moments in the cycle when
no main memory access is required!
Pipeline example - laundry
Without pipelining: 3 hours / cycle, 9 hours for all
Hour      1  2  3  4  5  6  7  8  9
CYCLE 1  LA DR PA
CYCLE 2           LA DR PA
CYCLE 3                    LA DR PA
With pipelining: 3 hours / cycle, 5 hours for all!
Hour      1  2  3  4  5
CYCLE 1  LA DR PA
CYCLE 2     LA DR PA
CYCLE 3        LA DR PA
Prefetch
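[Figure: two-stage pipeline with instruction fetch and execution. Simplified view: instruction, instruction fetch, instruction, execution, result. Expanded view: both stages may wait for each other, execution passes a new address back to instruction fetching, and a prefetched instruction may be rejected.]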
• NOTE: the speed-up is less than twofold, because the two stages take unequal time
(memory access lasts longer than instruction execution)
Basic phases of the instruction cycle:
• Instruction fetching (FI)
• Instruction decoding (DI)
• Operand address calculation (CO)
• Operand fetching (FO)
• Instruction execution (EI)
• Writing the result (WO)
      1  2  3  4  5  6  7  8  9 10 11
I1   FI DI CO FO EI WO
I2      FI DI CO FO EI WO
I3         FI DI CO FO EI WO
I4            FI DI CO FO EI WO
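A timing diagram of this kind can be generated mechanically: instruction i enters stage s in clock cycle i + s. The sketch below assumes an ideal pipeline (one cycle per stage, no stalls or branches); the function name `pipeline_diagram` is made up for the example, and the stage names are the six phases listed above.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def pipeline_diagram(n_instructions: int, stages: list[str] = STAGES) -> list[str]:
    """Build one text row per instruction for an ideal, stall-free pipeline."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        cells = ["  "] * total_cycles          # one two-character cell per clock cycle
        for s, name in enumerate(stages):
            cells[i + s] = name                # instruction i is in stage s at cycle i + s
        rows.append(f"I{i + 1:<3}" + " ".join(cells).rstrip())
    return rows


for row in pipeline_diagram(4):
    print(row)
```

Each row is shifted by one cycle relative to the previous one, so four instructions occupy 6 + 4 - 1 = 9 cycles in total.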
Branches and pipelining
      1  2  3  4  5  6  7  8  9 10 11 12 13
I1   FI DI CO FO EI WO
I2      FI DI CO FO EI WO
I3         FI DI CO FO
I4            FI DI CO
I5               FI DI
I6                  FI
I21                    FI DI CO FO EI WO
I22                       FI DI CO FO EI WO
Pipeline implementation algorithm
Problems of the pipelining
• Successive pipeline stages do not take the same
amount of time
• Transferring data between the buffers may
significantly increase pipeline execution time
• Dependencies between registers and memory can be
handled, and the pipeline optimized, only at a
high cost
Efficiency of the pipelining
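Cycle execution time: $\tau$
Time required to execute all $n$ instructions in a pipeline with $k$ stages:
$T_k = [k + (n - 1)]\,\tau$
Instruction pipeline acceleration ratio:
$S_k = \frac{T_1}{T_k} = \frac{nk\tau}{[k + (n - 1)]\,\tau} = \frac{nk}{k + (n - 1)}$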
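A quick numerical check of these formulas, as a sketch: `pipeline_time` and `speedup` are illustrative names, with k the number of stages, n the number of instructions and tau the cycle time.

```python
def pipeline_time(k: int, n: int, tau: float = 1.0) -> float:
    """T_k = [k + (n - 1)] * tau: time to run n instructions through k stages."""
    return (k + (n - 1)) * tau


def speedup(k: int, n: int) -> float:
    """S_k = T_1 / T_k = n*k / (k + (n - 1)); tau cancels out."""
    return (n * k) / (k + (n - 1))


print(pipeline_time(6, 4))          # four instructions in a six-stage pipeline
print(pipeline_time(3, 3))          # three jobs of three one-hour stages each
for n in (4, 100, 10_000):
    print(n, round(speedup(6, n), 3))
```

As n grows, the ratio approaches k, so the number of stages is the upper bound on the speed-up of an ideal pipeline.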
Example of the pipeline efficiency
Modern Processors Pipelines
• Pentium 3 – 10 stages
• Athlon – 10 stages for ALU, 15 stages for FPU
• Pentium M – 12 stages
• Athlon 64 / 64 X2 – 12 stages for ALU, 17 stages for FPU
• Pentium 4 Northwood – 20 stages (hyperpipeline!)
• Pentium 4 Prescott – 31 stages
• Core 2 Duo – 14 stages
Hazards
• Hazards are disturbances in pipeline operation
• There are data, resource and control
hazards
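One kind of data hazard, where an instruction reads a register that an earlier instruction still in the pipeline has not yet written, can be detected with a simple scan. This is a sketch under stated assumptions: the (destination, sources) tuple format, the `window` parameter (how many following instructions are considered to overlap) and the function name `raw_hazards` are all hypothetical.

```python
Instruction = tuple[str, tuple[str, ...]]   # (destination register, source registers)


def raw_hazards(program: list[Instruction], window: int = 2) -> list[tuple[int, int, str]]:
    """Return (writer index, reader index, register) for each read-after-write dependency."""
    hazards = []
    for i, (dest, _sources) in enumerate(program):
        # Only instructions close behind the writer overlap with it in the pipeline.
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("R1", ("R2", "R3")),   # R1 <- R2 op R3
    ("R4", ("R1", "R5")),   # reads R1 before the first instruction has written it
    ("R6", ("R7", "R8")),   # independent
]
print(raw_hazards(program))
```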
Branch handling
• Multiple pipelines
• Prefetch of the target instruction
• Loop buffer
• Branch prediction
• Delayed branch
Multiple pipelines
• Both instructions that may follow a branch (the next
sequential one and the branch target) are
loaded into two pipelines and processed simultaneously
• The main problem is contention for memory
access between the two pipelines
Prefetch and loop buffer
Prefetch
• When a branch instruction is decoded, the
target instruction is fetched as well. It is stored
until the branch is executed
Loop buffer
• A small, fast memory buffer stores a sequence of
consecutive instructions
• It is useful when conditional
branch instructions and loops are involved
Conditional Branch Prediction
• Static
– Branch never taken (Sun SPARC, MIPS)
– Branch always taken
– Prediction by operation code
• Dynamic
– Taken/not taken switch
– Branch history table
Static prediction
• The simplest method, used as a fallback,
for instance in the Motorola MPC7450
processor
• The Pentium 4 allows code to be inserted that
suggests whether static prediction should
assume the branch is taken or not (a so-called
prediction hint)
Dynamic prediction of the conditional
branches
• The history of each conditional branch instruction is
stored
• It is represented by bits stored in the
cache memory
• Every instruction has its own history bits
• Another solution is a table storing
information about conditional branch
results
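History bits are often illustrated with a two-bit saturating counter, sketched below. Whether the lecture's figure uses exactly this scheme is not known; the two-bit width, the initial state and the names `TwoBitPredictor` and `accuracy` are assumptions made for the example.

```python
class TwoBitPredictor:
    """Two history bits: states 0-1 predict 'not taken', states 2-3 predict 'taken'."""

    def __init__(self, state: int = 0) -> None:
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of correct predictions for one branch over a sequence of outcomes."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict() == taken
        predictor.update(taken)
    return hits / len(outcomes)


loop_branch = [True] * 9 + [False]      # a loop-closing branch: taken 9 times, then exit
print(accuracy(loop_branch))
```

With two bits, a single loop exit does not flip the prediction, so the next run of the same loop starts from a correct guess.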
History bits prediction
Branch history table
Table columns: branch instruction address | history bits | target instruction
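The three table columns map naturally onto a lookup structure keyed by the branch address. The sketch below is hypothetical: real tables have a fixed number of entries, while this one uses an unbounded dictionary, stores the target address rather than the target instruction, and encodes the history bits as a two-bit counter; the class name `BranchHistoryTable` is invented for the example.

```python
class BranchHistoryTable:
    """Maps a branch instruction address to [history counter, target address]."""

    def __init__(self) -> None:
        self.entries: dict[int, list[int]] = {}

    def lookup(self, address: int, fallthrough: int) -> int:
        """Return the predicted address of the next instruction to fetch."""
        entry = self.entries.get(address)
        if entry is not None and entry[0] >= 2:   # history says 'taken'
            return entry[1]
        return fallthrough                        # unknown branch or 'not taken'

    def update(self, address: int, taken: bool, target: int) -> None:
        """Record the real outcome once the branch has been executed."""
        counter = self.entries.get(address, [1, target])[0]
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.entries[address] = [counter, target]


bht = BranchHistoryTable()
bht.update(0x100, taken=True, target=0x200)
print(hex(bht.lookup(0x100, fallthrough=0x104)))
```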
Local Branch Prediction
• Requires a separate history buffer for each
instruction, although the history table can
be common to all instructions
• Pentium MMX, Pentium 2 and Pentium 3 processors
have local prediction circuits with 4 history
bits and 16 positions for every type of
instruction
• Local prediction efficiency is estimated at
97%
Global Branch Prediction
• A common history for all branch instructions
is stored in memory. This makes it possible to account for
dependencies between different branch
instructions
• Rarely a better solution than local
prediction
• Hybrid solutions: a shared global prediction unit
combined with the history table (AMD
processors, Pentium M, Core, Core 2)
Branch Prediction Unit
• A processor circuit responsible for predicting
departures from sequential code execution
• Often connected with the micro-operation cache
memory
• In the Pentium 4 processor, the branch prediction
buffer has 4096 entries; in the Pentium 3, only 512.
Therefore the former has a 33 percent better hit
ratio than the latter
Location of the Branch Prediction
Unit