Pentium Summary

Chapter 3 : P5 Architecture
Features of Pentium
• Introduced in 1993 with clock frequencies ranging from 60 to 66 MHz
• The primary changes in the Pentium Processor were:
– Superscalar Architecture
– Dynamic Branch Prediction
– Pipelined Floating-Point Unit
– Separate 8K Code and Data Caches
– Write-back MESI Protocol in the Data Cache
– 64-Bit Data Bus
– Bus Cycle Pipelining
UQ: Explain with block diagram how superscalar operation is carried out in Pentium Processor
UQ: Draw and explain Pentium Processor architecture. Highlight architectural features
• It has a 64-bit data bus and a 32-bit address bus.
• There are two separate 8KB caches – one for code and one for data.
• Each cache has a separate address translation TLB which translates linear addresses to physical addresses.
• Code Cache:
– It is an 8KB cache dedicated to supplying instructions to the processor's execution pipeline.
– 2-way set associative cache with a line size of 32 bytes.
– A 256-bit wide path between the code cache and the prefetch buffer permits prefetching of 32 bytes (256/8) of instructions at a time.
• Prefetch Buffers:
– Four prefetch buffers within the processor work as two independent pairs.
– When instructions are prefetched from the cache, they are placed into one set of prefetch buffers.
– The other set is used when a branch operation is predicted.
– The prefetch buffer sends a pair of instructions to the instruction decoder.
• Instruction Decode Unit:
– Decoding occurs in two stages – Decode1 (D1) and Decode2 (D2).
– D1 checks whether instructions can be paired.
– D2 calculates the addresses of memory-resident operands.
• Control Unit:
– This unit interprets the instruction word and the microcode entry point fed to it by the Instruction Decode Unit.
– It handles exceptions, breakpoints and interrupts.
– It controls the integer pipelines and floating-point sequences.
• Microcode ROM:
– Stores microcode sequences.
• Arithmetic/Logic Units (ALUs):
– There are two parallel integer instruction pipelines: the u-pipeline and the v-pipeline.
– The u-pipeline has a barrel shifter.
– The two ALUs perform the arithmetic and logical operations specified by the instructions in their respective pipelines.
• Address Generators:
– Two address generators (one for each pipeline) form the addresses specified by the instructions in their respective pipelines.
– They are equivalent to the segmentation unit.
• Paging Unit:
– If enabled, it translates linear addresses (from the address generators) to physical addresses.
– It can handle two linear addresses at the same time to support both pipelines, with one TLB per cache.
• Floating Point Unit:
– It can accept up to two floating-point operations per clock when one of the instructions is an exchange instruction.
– Three types of floating-point operations can operate simultaneously within the FPU: addition, division and multiplication.
• Data Cache:
– It is an 8KB write-back, two-way set associative cache with a line size of 32 bytes.
• Bus Unit:
– Address Drivers and Receivers: push addresses onto the processor's local address bus (A31:A3 and BE7:BE0).
– Data Bus Transceivers: gate data onto and off of the processor's local data bus.
– Bus Control Logic: controls whether a standard (8/16/32-bit) or burst (32-byte) bus cycle is to be run.
• Branch Target Buffer: supplies jump target prefetch addresses to the code cache.
Superscalar Operation
• The prefetcher sends an address to the code cache and, if present, a line of 32 bytes is sent to one of the prefetch buffers.
• The prefetch buffer transfers instructions to the decode unit.
• Initially it checks whether the instructions can be paired.
• If paired, one goes to the 'u' and the other goes to the 'v' pipeline, as long as no dependencies exist between them.
• The pair of instructions enters and exits each stage of the pipeline in unison.
• The Pentium uses a five-stage execution pipeline as shown:
Integer Pipeline
UQ: Explain in brief integer instruction pipeline stages of Pentium
• The pipelines are called the "u" and "v" pipes.
• The u-pipe can execute any instruction, while the v-pipe can execute "simple" instructions as defined in the "Instruction Pairing Rules".
• When instructions are paired, the instruction issued to the v-pipe is always the next sequential instruction after the one issued to the u-pipe.
• The integer pipeline stages are as follows:
1. Prefetch (PF):
– Instructions are prefetched from the on-chip instruction cache.
2. Decode1 (D1):
– Two parallel decoders attempt to decode and issue the next two sequential instructions.
– It checks whether the instructions can be paired.
– It decodes the instruction to generate a control word.
– A single control word causes direct execution of an instruction.
– Complex instructions require microcoded control sequencing.
3. Decode2 (D2):
– Decodes the control word.
– Addresses of memory-resident operands are calculated.
4. Execute (EX):
– The instruction is executed in the ALU.
– The data cache is accessed at this stage.
– Instructions requiring both an ALU operation and a data cache access require more than one clock.
5. Writeback (WB):
– The CPU stores the result and updates the flags.
Integer Instruction Pairing Rules
UQ: Explain the instruction pairing rules for Pentium
• To issue two instructions simultaneously, they must satisfy the following conditions:
– Both instructions in the pair must be "simple".
– There must be no read-after-write (RAW) or write-after-write (WAW) register dependencies between them.
RAW:
i1. R2 ← R1 + R3
i2. R4 ← R2 + R3
WAW:
i1. R2 ← R4 + R7
i2. R2 ← R1 + R3
– Neither instruction may contain both a displacement and an immediate.
– Instructions with prefixes (lock, repne) can only occur in the u-pipe.
Simple Instructions
• They are entirely hardwired.
• They do not require any microcode control.
• They execute in one clock cycle.
– Exception: ALU mem,reg and ALU reg,mem are 3-clock and 2-clock operations respectively.
• The following integer instructions are considered simple and may be paired:
1. mov reg, reg/mem/imm
2. mov mem, reg/imm
3. alu reg, reg/mem/imm
4. alu mem, reg/imm
5. inc reg/mem
6. dec reg/mem
7. push reg/mem
8. pop reg
9. lea reg,mem
10. jmp/call/jcc near
11. nop
12. test reg, reg/mem
13. test acc, imm
Instruction Issue Algorithm
UQ: List the steps in instruction issue algorithm
• Decode the two consecutive instructions I1 and I2.
• If the following are all true:
– I1 and I2 are simple instructions
– I1 is not a jump instruction
– Destination of I1 is not a source of I2
– Destination of I1 is not a destination of I2
• Then issue I1 to the u pipeline and I2 to the v pipeline.
• Else issue only I1 to the u pipeline.
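The steps above can be sketched in code. This is a minimal illustration using a hypothetical instruction record – the field names are illustrative, not Intel's:

```python
# Sketch of the dual-issue decision: pair only simple, non-jump I1
# with an I2 that has no RAW or WAW dependency on I1.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    name: str
    simple: bool = True              # hardwired, no microcode needed
    is_jump: bool = False
    dest: Optional[str] = None       # destination register, if any
    srcs: Tuple[str, ...] = ()       # source registers

def issue(i1: Instr, i2: Instr) -> str:
    """Return 'u+v' when I1 and I2 can issue together, else 'u'."""
    raw = i1.dest is not None and i1.dest in i2.srcs   # I2 reads I1's result
    waw = i1.dest is not None and i1.dest == i2.dest   # both write same register
    if i1.simple and i2.simple and not i1.is_jump and not raw and not waw:
        return "u+v"
    return "u"

# RAW hazard: the second add reads R2, which the first add writes
print(issue(Instr("add", dest="R2", srcs=("R1", "R3")),
            Instr("add", dest="R4", srcs=("R2", "R3"))))   # u
# Independent pair -> dual issue
print(issue(Instr("add", dest="R2", srcs=("R1", "R3")),
            Instr("mov", dest="R5", srcs=("R6",))))        # u+v
```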
Floating Point Pipeline Stages
UQ: Explain the floating point pipeline stages.
• The floating point pipeline has 8 stages as follows:
1. Prefetch (PF):
– Instructions are prefetched from the on-chip instruction cache.
2. Instruction Decode (D1):
– Two parallel decoders attempt to decode and issue the next two sequential instructions.
– It checks whether the instructions can be paired.
– It decodes the instruction to generate a control word.
– A single control word causes direct execution of an instruction.
– Complex instructions require microcoded control sequencing.
3. Address Generate (D2):
– Decodes the control word.
– Addresses of memory-resident operands are calculated.
4. Memory and Register Read (Execution Stage) (EX):
– A register read or memory read is performed as required by the instruction to access an operand.
5. Floating Point Execution Stage 1 (X1):
– Information from the register or memory is written into an FP register.
– Data is converted to floating-point format before being loaded into the floating point unit.
6. Floating Point Execution Stage 2 (X2):
– The floating-point operation is performed within the floating point unit.
7. Write FP Result (WF):
– Floating-point results are rounded and the result is written to the target floating point register.
8. Error Reporting (ER):
– If an error is detected, an error reporting stage is entered where the error is reported and the FPU status word is updated.
Instruction Issue for Floating Point Unit
• The rules for how floating-point (FP) instructions get issued on the Pentium processor are:
1. FP instructions do not get paired with integer instructions.
2. When a pair of FP instructions is issued to the FPU, only the FXCH instruction can be the second instruction of the pair. The first instruction of the pair must be one of the set F, where F = [FLD, FADD, FSUB, FMUL, FDIV, FCOM, FUCOM, FTST, FABS, FCHS].
3. FP instructions other than FXCH and the instructions belonging to set F always get issued singly to the FPU.
4. FP instructions that are not directly followed by an FXCH instruction are issued singly to the FPU.
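The FP pairing rules above reduce to a simple check, sketched here:

```python
# Sketch of the FP issue rules: a pair is issued only when the first
# instruction belongs to set F and the second is FXCH; everything
# else issues singly.
SET_F = {"FLD", "FADD", "FSUB", "FMUL", "FDIV",
         "FCOM", "FUCOM", "FTST", "FABS", "FCHS"}

def fp_issue(first: str, second: str) -> str:
    """Return 'pair' when the two FP instructions issue together."""
    if first in SET_F and second == "FXCH":
        return "pair"
    return "single"

print(fp_issue("FADD", "FXCH"))  # pair
print(fp_issue("FADD", "FMUL"))  # single
```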
Branch Prediction Logic
UQ: Explain how the flushing of pipeline can be minimized in Pentium Architecture
Flushing of pipeline problem
• The performance gain from pipelining can be reduced by the presence of program transfer instructions (such as JMP, CALL, RET and conditional jumps).
• They change the execution sequence, making all the instructions that entered the pipeline after the program transfer instruction invalid.
• Suppose instruction I3 is a conditional jump to I50 at some other address (the target address); then the instructions that entered after I3 are invalid, and a new sequence beginning with I50 needs to be loaded in.
• This causes bubbles in the pipeline, where no work is done while the pipeline stages are reloaded.
Branch Prediction Logic
• To avoid this problem, the Pentium uses a scheme called Dynamic Branch Prediction.
• In this scheme, a prediction is made concerning the branch instruction currently in the pipeline.
• The prediction will be either taken or not taken.
• If the prediction turns out to be true, the pipeline will not be flushed and no clock cycles will be lost.
• If the prediction turns out to be false, the pipeline is flushed and started over with the correct instruction.
• A misprediction results in a 3-cycle penalty if the branch is executed in the u-pipeline and a 4-cycle penalty in the v-pipeline.
• It is implemented using a 4-way set associative cache with 256 entries, referred to as the Branch Target Buffer (BTB).
• The directory entry (tag) for each line contains the following information:
– Valid Bit: indicates whether or not the entry is in use.
– History Bits: track how often the branch has been taken.
– Source memory address that the branch instruction was fetched from (the address of I3).
• If its directory entry is valid, the target address of the branch (the address of I50) is stored in the corresponding data entry in the BTB.
The BTB is a look-aside cache that sits off to the side of the D1 stages of the two pipelines and monitors them for branch instructions. The first time that a branch instruction enters either pipeline, the BTB uses its source memory address to perform a lookup in the cache. Since the instruction has not been seen before, this results in a BTB miss, meaning the prediction logic has no history on the instruction. It then predicts that the branch will not be taken, and program flow is not altered. Even unconditional jumps will be predicted as not taken the first time that they are seen by the BTB.

When the instruction reaches the execution stage, the branch will be either taken or not taken. If taken, the next instruction to be executed should be the one fetched from the branch target address. If not taken, the next instruction is the one at the next sequential memory address. When the branch is taken for the first time, the execution unit provides feedback to the branch prediction logic: the branch target address is sent back and recorded in the BTB, and a directory entry is made containing the source memory address, with the history bits set to strongly taken.
History Bits | Description        | Resulting Prediction Made | If branch is taken           | If branch is not taken
11           | Strongly Taken     | Branch Taken              | Remains Strongly Taken       | Downgrades to Weakly Taken
10           | Weakly Taken       | Branch Taken              | Upgrades to Strongly Taken   | Downgrades to Weakly Not Taken
01           | Weakly Not Taken   | Branch Not Taken          | Upgrades to Weakly Taken     | Downgrades to Strongly Not Taken
00           | Strongly Not Taken | Branch Not Taken          | Upgrades to Weakly Not Taken | Remains Strongly Not Taken
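These history bits form a 2-bit saturating counter; a minimal sketch of the predict/update behavior, using the state encoding from the table above:

```python
# 2-bit saturating history counter: 11=strongly taken, 10=weakly taken,
# 01=weakly not taken, 00=strongly not taken.
STRONGLY_NOT, WEAKLY_NOT, WEAKLY, STRONGLY = 0b00, 0b01, 0b10, 0b11

def predict(history: int) -> bool:
    """Predict taken when the history bits are 10 or 11."""
    return history >= WEAKLY

def update(history: int, taken: bool) -> int:
    """Upgrade or downgrade one step, saturating at 00 and 11."""
    if taken:
        return min(history + 1, STRONGLY)
    return max(history - 1, STRONGLY_NOT)

h = STRONGLY                 # first taken branch is recorded strongly taken
h = update(h, False)         # downgrades to weakly taken (10)
print(predict(h))            # True: still predicted taken
h = update(h, False)         # downgrades to weakly not taken (01)
print(predict(h))            # False
```

Note that after a single mispredict a strongly-taken branch is still predicted taken – it takes two consecutive not-taken outcomes to flip the prediction.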
Cache Organization of Pentium
Code Cache Organization
Code Cache Structure
• It is 8KB in size and 2-way set associative.
• The cache banks are referred to as Way0 and Way1.
• The code cache is read-only.
• Each cache line is 32 bytes.
• Each cache way contains 128 cache lines, and there is a separate 128-entry cache directory associated with each of the cache ways.
• The cache directories are triple ported:
– Two ports support the split-line access capability.
– The third is used for bus snooping.
Bus Snooping: It is used to maintain consistent data in a multiprocessor system.
Split-line Access
• In the Pentium (since it is CISC), instructions are of variable length (1-15 bytes).
• Multibyte instructions may straddle two sequential lines stored in the code cache.
• The cache would then need two sequential accesses, which degrades performance.
• Solution: Split-line Access.
• It permits the upper half of one line and the lower half of the next to be fetched from the code cache in one clock cycle.
• When a split line is read, the information is not correctly aligned.
• The bytes need to be rotated so that the prefetch queue receives the instructions in the proper order.
Code Cache Entry Format
• A 20-bit tag identifies the page in memory.
• A state bit indicates whether the line in the cache contains valid or invalid information.
• A P (Parity) bit is used to detect errors when reading each entry.
• Each cache line holds 4 quadwords of information.
• A code cache miss results in the transfer of 4 quadwords from memory.
• As each quadword (QW) is received, it is placed in the cache line fill buffer.
• When the buffer is full, a directory entry is made to record the line's presence.
• As each QW is received from memory, it is also supplied directly to the prefetcher to fulfill its request as soon as possible.
Data Cache
• The 8KB data cache is organized into two 4KB ways referred to as Way0 and Way1.
• Each way contains 128 lines (0:127) and each line holds 32 bytes of data.
• Each directory entry has a tag field used to record the page number of the memory page that the line of information came from.
• Each data cache line consists of 8 doublewords, and the cache ways are interleaved on doubleword boundaries.
• Parity is generated for each byte within a line. During a read operation, parity is checked and an error is signaled if one is found.
• Each memory address sent to the cache controller is examined to determine the page, line and doubleword that contain the target location.
• The cache controller interprets the address sent from the pipelines as follows:
– A31:A12 identify the page of the target location
– A11:A5 identify the line within the page
– A4:A2 identify the doubleword within the line
– A1:A0 are not used
• The cache directories are triple ported to allow access from both pipelines and for snooping.
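The address breakdown above can be checked with a few shifts and masks; a sketch:

```python
# Split a 32-bit address into the data-cache fields listed above:
# 32-byte lines, 128 lines per way, doublewords within a line.
def split_address(addr: int):
    page       = addr >> 12          # A31:A12 - 4KB page number (tag)
    line       = (addr >> 5) & 0x7F  # A11:A5  - line within the page (0..127)
    doubleword = (addr >> 2) & 0x7   # A4:A2   - doubleword within the line (0..7)
    return page, line, doubleword    # A1:A0 are not used

page, line, dw = split_address(0x00012FE8)
print(hex(page), line, dw)           # 0x12 127 2
```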
Translation Lookaside Buffers
• They translate virtual addresses to physical addresses.
Data Cache:
– The data cache contains two TLBs.
• First:
– 4-way set associative with 64 entries
– Translates addresses for 4KB pages of main memory
– The lower 12 address bits are the same in the virtual and physical addresses
– The upper 20 bits of the virtual address are checked against four tags and translated into the upper 20 bits of the physical address on a hit
– Since translation needs to be quick, the TLB is kept small
• Second:
– 4-way set associative with 8 entries
– Used to handle 4MB pages
• Both TLBs are parity protected and dual ported.
Instruction Cache:
– Uses a single 4-way set associative TLB with 32 entries
– Both 4KB and 4MB pages are supported (4MB pages in 4KB chunks)
• Parity bits are used on tags and data to maintain data integrity.
• Entries in all 3 TLBs are replaced using a 3-bit LRU counter stored in each set.
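An illustrative sketch of the first data-cache TLB (64 entries, 4-way set associative, 4KB pages). The set indexing and replacement here are simplifying assumptions, not the exact hardware behavior:

```python
# 16 sets x 4 ways = 64 entries; each entry maps a 20-bit virtual
# page number to a 20-bit physical page number.
SETS, WAYS, PAGE_SHIFT = 16, 4, 12

tlb = [[None] * WAYS for _ in range(SETS)]   # entry: (virt_page, phys_page)

def lookup(vaddr: int):
    vpage = vaddr >> PAGE_SHIFT              # upper 20 bits checked against tags
    s = vpage % SETS                         # assumed set-index function
    for entry in tlb[s]:
        if entry and entry[0] == vpage:      # tag hit
            # lower 12 bits pass through unchanged
            return (entry[1] << PAGE_SHIFT) | (vaddr & 0xFFF)
    return None                              # TLB miss -> walk the page tables

tlb[0x1234 % SETS][0] = (0x1234, 0x9ABC)     # preload one translation
print(hex(lookup(0x1234567)))                # 0x9abc567
```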
Cache Coherency
Cache Coherency in Multiprocessor System
• When multiple processors are used in a single system, there needs to be a mechanism whereby all processors agree on the contents of shared cache information.
• For example, two or more processors may utilize data from the same memory location, X.
• Each processor may change the value of X; which value of X is then the correct one?
A multiprocessor system with incoherent cache data
• Intel's mechanism for maintaining cache coherency in its data cache is called the MESI (Modified/Exclusive/Shared/Invalid) Protocol.
• This protocol uses two bits stored with each line of data to keep track of the state of the cache line.
• The four states are defined as follows:
• Modified:
– The current line has been modified and is only available in a single cache.
• Exclusive:
– The current line has not been modified and is only available in a single cache.
– Writing to this line changes its state to Modified.
• Shared:
– Copies of the current line may exist in more than one cache.
– A write to this line causes a writethrough to main memory and may invalidate the copies in the other caches.
• Invalid:
– The current line is empty.
– A read from this line will generate a miss.
– A write will cause a writethrough to main memory.
• Only the Shared and Invalid states are used in the code cache.
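The state definitions above can be sketched as a small transition function. This is a simplified illustration; real Pentium behavior (e.g. writing a Modified line back to memory when it is snooped) is more involved:

```python
# Simplified per-line MESI transitions for the events described above.
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def next_state(state: str, event: str) -> str:
    """event: 'local_read', 'local_write', or 'snoop_write'
    (another processor writes the same line on the bus)."""
    if event == "snoop_write":
        return I                     # our copy is now stale -> invalidate
    if event == "local_write":
        if state in (E, M):
            return M                 # exclusive line: modify in place
        return state                 # Shared/Invalid: write-through to memory
    # local_read: an Invalid line misses and is refilled; modeled as Exclusive
    return E if state == I else state

print(next_state(E, "local_write"))  # Modified
print(next_state(S, "snoop_write"))  # Invalid
```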
• The MESI protocol requires the Pentium to monitor all accesses to main memory in a multiprocessor system. This is called bus snooping.
• Consider the above example.
• If Processor 3 writes its local copy of X (30) back to memory, the memory write cycle will be detected by the other 3 processors.
• Each processor will then run an internal inquire cycle to determine whether its data cache contains the address of X.
• Processors 1 and 2 then update their caches based on their individual MESI states.
• Inquire cycles examine the code cache as well (as the code cache supports bus snooping).
• The Pentium's address lines are used as inputs during an inquire cycle to accomplish bus snooping.