Computer Architecture - Princess Sumaya University for Technology

Princess Sumaya University for Technology
Computer Architecture
Dr. Esam Al_Qaralleh
Review
computer arctecture ~ PSUT
The Von Neumann Machine, 1945
The Von Neumann model consists of five
major components:
input unit
 output unit
 ALU
memory unit
control unit.
Sequential Execution
Von Neumann Model
 A refinement of the Von Neumann model, the system bus model
has a CPU (ALU and control), memory, and an input/output unit.
 Communication among components is handled by a shared
pathway called the system bus, which is made up of the data
bus, the address bus, and the control bus. There is also a power
bus, and some architectures may also have a separate I/O bus.
Performance
Both hardware and software affect performance:
• Algorithm determines the number of source-level statements
• Language/compiler/architecture determine the number of machine instructions
• Processor/memory determine how fast instructions are executed
Computer Architecture
Instruction Set Architecture - ISA refers to
the actual programmer-visible machine
interface such as instruction set, registers,
memory organization and exception
handling. Two main approaches: RISC and
CISC architectures.
Applications Change over Time
• Data sets & memory requirements → larger
  • Cache & memory architecture become more critical
• Standalone → networked
  • I/O integration & system software become more critical
• Single task → multiple tasks
  • Parallel architectures become critical
• Limited I/O requirements → rich I/O requirements
  • 60s: tapes & punch cards
  • 70s: character-oriented displays
  • 80s: video displays, audio, hard disks
  • 90s: 3D graphics, networking, high-quality audio
  • 00s: real-time video, immersion, …
Application Properties to
Exploit in Computer Design
 Locality in memory/IO references
 Programs work on subset of instructions/data at any point in time
 Both spatial and temporal locality
 Parallelism
 Data-level (DLP): same operation on every element of a data
sequence
 Instruction-level (ILP): independent instructions within sequential
program
 Thread-level (TLP): parallel tasks within one program
 Multi-programming: independent programs
 Pipelining
 Predictability
 Control-flow direction, memory references, data values
Levels of Machines
There are a number of levels in a computer,
from the user level down to the transistor level.
How Do the Pieces Fit Together?
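[Figure: layered view of a computer system. Software layers: Application, Operating System, Compiler, Firmware. The Instruction Set Architecture is the interface between software and hardware. Hardware layers: Instr. Set Proc., I/O system, and Memory system; Datapath & Control; Digital Design; Circuit Design.]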
Instruction Set Architecture (ISA)
Complex Instruction Set (CISC)
 Single instructions for complex tasks (string
search, block move, FFT, etc.)
 Usually have variable length instructions
 Registers have specialized functions
Reduced Instruction Set (RISC)
 Instructions for simple operations only
 Usually fixed length instructions
 Large orthogonal register sets
RISC Architecture
RISC designers focused on two critical
performance techniques in computer
design:
 the exploitation of instruction-level
parallelism, first through pipelining and later
through multiple instruction issue,
 the use of cache, first in simple forms and
later using sophisticated organizations and
optimizations.
RISC ISA Characteristics
 All operations on data apply to data in registers and
typically change the entire register;
 The only operations that affect memory are load and
store operations that move data from memory to a
register or to memory from a register, respectively;
 A small number of memory addressing modes;
 The instruction formats are few in number with all
instructions typically being one size;
 Large number of registers;
 These simple properties lead to dramatic
simplifications in the implementation of advanced
pipelining techniques, which is why RISC architecture
instruction sets were designed this way.
Performance & Cost
Computer Designers and Chip Costs
The computer designer affects die size,
and hence cost, both by what functions
are included on or excluded from the die
and by the number of I/O pins
Measuring and Reporting Performance
Performance
Time to do the task (Execution Time)
– execution time, response time, latency
Tasks per day, hour, week, sec, ns, ... (Performance)
– performance, throughput, bandwidth
Response time – the time between the start and the completion of a task
Thus, to maximize performance, we need to minimize execution time:
$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$
If X is n times faster than Y, then
$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$
Throughput – the total amount of work done in a given time
• Important to data center managers
Decreasing response time almost always improves throughput
Calculating CPU Performance
 Want to distinguish elapsed time and the time spent on
our task
 CPU execution time (CPU time) – time the CPU spends
working on a task
 Does not include time waiting for I/O or running other
programs
$$\text{CPU\_Time} = \text{CPU\_clock\_cycles\_for\_a\_program} \times \text{Clock\_cycle\_time}$$
OR
$$\text{CPU\_Time} = \frac{\text{CPU\_clock\_cycles\_for\_a\_program}}{\text{Clock\_rate}}$$
 Can improve performance by reducing either the length
of the clock cycle or the number of clock cycles required
for a program
Calculating CPU Performance (Cont.)
We tend to count instructions executed = IC
 Note looking at the object code is just a start
 What we care about is the dynamic count - e.g. don’t
forget loops, recursion, branches, etc.
CPI (clock cycles per instruction) is a figure of merit
$$\text{CPI} = \frac{\text{CPU\_clock\_cycles\_for\_a\_program}}{\text{IC}}$$
$$\text{CPU\_Time} = \text{IC} \times \text{CPI} \times \text{Clock\_cycle\_time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock\_rate}}$$
Calculating CPU Performance (Cont.)
 3 Focus Factors -- Cycle Time, CPI, IC
 Sadly - they are interdependent and making one better often
makes another worse (but small or predictable impacts)
• Cycle time depends on HW technology and organization
• CPI depends on organization (pipeline, caching...) and ISA
• IC depends on ISA and compiler technology
 Often CPI’s are easier to deal with on a per instruction
basis
$$\text{CPU\_clock\_cycles} = \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i$$
$$\text{Overall\_CPI} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction\_count}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction\_count}}$$
Example of Computing CPU time
 If a computer has a clock rate of 50 MHz, how long
does it take to execute a program with 1,000
instructions, if the CPI for the program is 3.5?
 Using the equation
CPU time = Instruction count x CPI / clock rate
gives
CPU time = 1000 x 3.5 / (50 x 10^6) = 70 x 10^-6 s = 70 microseconds
 If a computer’s clock rate increases from 200 MHz to
250 MHz and the other factors remain the same, how
many times faster will the computer be?
$$\frac{\text{CPU time}_{old}}{\text{CPU time}_{new}} = \frac{\text{clock rate}_{new}}{\text{clock rate}_{old}} = \frac{250\ \text{MHz}}{200\ \text{MHz}} = 1.25$$
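The two examples above can be checked with a short script. This is a minimal sketch; the function name `cpu_time` is an illustrative choice, not something from the slides, and the second example reuses the first example's instruction count and CPI only because they cancel in the ratio.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds: IC * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# First example: 1,000 instructions, CPI of 3.5, 50 MHz clock.
t = cpu_time(1000, 3.5, 50e6)
print(f"CPU time = {t * 1e6:.1f} microseconds")

# Second example: IC and CPI unchanged, clock raised from 200 MHz to 250 MHz.
speedup = cpu_time(1000, 3.5, 200e6) / cpu_time(1000, 3.5, 250e6)
print(f"speedup = {speedup:.2f}")
```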
Evaluating ISAs
 Design-time metrics:
 Can it be implemented, in how long, at what cost?
 Can it be programmed? Ease of compilation?
 Static Metrics:
 How many bytes does the program occupy in memory?
 Dynamic Metrics:
 How many instructions are executed? How many bytes does the
processor fetch to execute the program?
• How many clocks are required per instruction (CPI)?
Best Metric: Time to execute the program!
It depends on the instruction set, the processor organization, and compilation techniques, through Inst. Count, CPI, and Cycle Time.
Quantitative Principles of
Computer Design
Amdahl’s Law
Defines speedup gained from a particular feature
Depends on 2 factors
 Fraction of original computation time that can take
advantage of the enhancement - e.g. the commonality
of the feature
 Level of improvement gained by the feature
 Amdahl’s law
Quantification of the
diminishing return principle
Amdahl's Law (Cont.)
Suppose that enhancement E accelerates a
fraction F of the task by a factor S,
and the remainder of the task is unaffected
Simple Example
Important Application:
• FPSQRT 20%
• FP instructions account for 50%
• Other 30%
Designers say same cost to speed up:
• FPSQRT by 40x
• FP by 2x
• Other by 8x
Which one should you invest in?
Straightforward: plug in the numbers & compare
BUT what's your guess??
Note: Amdahl's Law says nothing about cost
And the Winner Is…?
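The slide's answer did not survive extraction, so here is a minimal sketch that plugs the three options into Amdahl's Law, speedup = 1 / ((1 - F) + F / S). The function and variable names are illustrative assumptions, not from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the time is accelerated by a factor S."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# (fraction of execution time, speedup of that fraction) for each option
options = {
    "FPSQRT by 40x": amdahl_speedup(0.20, 40),
    "FP by 2x": amdahl_speedup(0.50, 2),
    "Other by 8x": amdahl_speedup(0.30, 8),
}

for name, overall in options.items():
    print(f"{name}: overall speedup {overall:.3f}")

best = max(options, key=options.get)
print("best investment:", best)
```

Under these numbers, speeding up "Other" by 8x gives the largest overall gain (about 1.36), ahead of FP by 2x (about 1.33) and FPSQRT by 40x (about 1.24), even though FPSQRT has by far the largest local speedup.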
Example of Amdahl’s Law
 Floating point instructions are improved to run twice as fast, but only
10% of the time was spent on these instructions originally. How
much faster is the new machine?
$$\text{Speedup} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
$$\text{Speedup} = \frac{1}{(1 - 0.1) + 0.1/2} = 1.053$$
• The new machine is 1.053 times as fast, or 5.3% faster.
• How much faster would the new machine be if floating point instructions become 100 times faster?
$$\text{Speedup} = \frac{1}{(1 - 0.1) + 0.1/100} \approx 1.110$$
Estimating Performance Improvements
 Assume a processor currently requires 10 seconds to
execute a program and processor performance
improves by 50 percent per year.
 By what factor does processor performance improve in
5 years?
(1 + 0.5)^5 = 7.59
 How long will it take a processor to execute the
program after 5 years?
ExTimenew = 10/7.59 = 1.32 seconds
Performance Example
 Computers M1 and M2 are two implementations of the
same instruction set.
 M1 has a clock rate of 50 MHz and M2 has a clock rate of
75 MHz.
 M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given
program.
 How many times faster is M2 than M1 for this program?
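$$\frac{\text{ExTime}_{M1}}{\text{ExTime}_{M2}} = \frac{\text{IC}_{M1} \times \text{CPI}_{M1} / \text{Clock Rate}_{M1}}{\text{IC}_{M2} \times \text{CPI}_{M2} / \text{Clock Rate}_{M2}} = \frac{2.8/50}{3.2/75} = 1.31$$
• What would the clock rate of M1 have to be for them to have the same execution time?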
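Both questions can be worked through numerically. The sketch below assumes both machines execute the same instruction count (they implement the same instruction set and run the same program), so the count cancels; the value of `IC` is an arbitrary placeholder.

```python
def exec_time(ic, cpi, clock_mhz):
    """Execution time in seconds: IC * CPI / clock rate."""
    return ic * cpi / (clock_mhz * 1e6)


IC = 1_000_000  # placeholder instruction count; it cancels in the ratio

ratio = exec_time(IC, 2.8, 50) / exec_time(IC, 3.2, 75)
print(f"M2 is {ratio:.2f} times faster than M1")

# Equal execution time requires CPI_M1 / f_M1 = CPI_M2 / f_M2,
# so f_M1 = CPI_M1 * f_M2 / CPI_M2.
m1_clock_mhz = 2.8 * 75 / 3.2
print(f"M1 would need a clock rate of {m1_clock_mhz:.3f} MHz")
```

With these inputs M1 would need roughly 65.6 MHz to match M2.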
Simple Example
Suppose we have made the following
measurements:
 Frequency of FP operations (other than FPSQR)
=25%
 Average CPI of FP operations=4.0
 Average CPI of other instructions=1.33
 Frequency of FPSQR=2%
 CPI of FPSQR=20
Two design alternatives
 Reduce the CPI of FPSQR to 2
 Reduce the average CPI of all FP operations to 2
And The Winner is…
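$$\text{CPI}_{original} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction\_count}} = (4 \times 25\%) + (1.33 \times 75\%) = 2.0$$
$$\text{CPI}_{with\_new\_FPSQR} = \text{CPI}_{original} - 2\% \times (\text{CPI}_{old\ FPSQR} - \text{CPI}_{of\ new\ FPSQR\ only}) = 2.0 - 2\% \times (20 - 2) = 1.64$$
$$\text{CPI}_{new\ FP} = (75\% \times 1.33) + (25\% \times 2.0) = 1.5$$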
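The comparison can be reproduced with a few lines of arithmetic. This is a sketch using the slide's measurements; the helper name `overall_cpi` is illustrative. Note that the unrounded values are slightly lower than the slide's figures, which start from a CPI rounded to 2.0.

```python
def overall_cpi(mix):
    """Weighted CPI: sum of (frequency * CPI) over instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


original = {"FP": (0.25, 4.0), "other": (0.75, 1.33)}
cpi_original = overall_cpi(original)                      # 1.9975, about 2.0

# Alternative 1: FPSQR (2% of instructions) drops from CPI 20 to CPI 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20 - 2)

# Alternative 2: all FP operations drop to an average CPI of 2.
cpi_new_fp = overall_cpi({"FP": (0.25, 2.0), "other": (0.75, 1.33)})

print(f"original CPI  = {cpi_original:.4f}")
print(f"new FPSQR CPI = {cpi_new_fpsqr:.4f}")
print(f"new FP CPI    = {cpi_new_fp:.4f}")
print(f"speedup with new FP = {cpi_original / cpi_new_fp:.2f}")
```

Since the clock rate and instruction count are unchanged, the lower CPI wins: improving all FP operations beats improving FPSQR alone.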
Instruction Set Architecture (ISA)
Outline
Introduction
Classifying instruction set architectures
Instruction set measurements
Memory addressing
Addressing modes for signal processing
Type and size of operands
Operations in the instruction set
Operations for media and signal processing
Instructions for control flow
Encoding an instruction set
MIPS architecture
Instruction Set Principles and
Examples
Basic Issues in Instruction Set Design
• What operations, and how many?
  • Load/store/increment/branch are sufficient to do any computation, but not useful (programs too long!)
• How (many) operands are specified?
  • Most operations are dyadic (e.g., A ← B + C); some are monadic (e.g., A ← ~B)
• How to encode them into an instruction format?
  • Instructions should be multiples of bytes
• Typical instruction set
  • 32-bit word
  • Basic operand addresses are 32 bits long
  • Basic operands (like integers) are 32 bits long
  • In general, an instruction could refer to 3 operands (A ← B + C)
• Challenge: encode operations in a small number of bits
Brief Introduction to ISA
 Instruction Set Architecture: a set of instructions
 Each instruction is directly executed by the CPU’s hardware
 How is it represented?
 By a binary format since the hardware understands only bits
opcode (6 bits) | rs (5 bits) | rt (5 bits) | Immediate (16 bits)
 Options - fixed or variable length formats
 Fixed - each instruction encoded in same size field (typically 1
word)
 Variable – half-word, whole-word, multiple word instructions are
possible
What Must be Specified?
Instruction Format (encoding)
 How is it decoded?
Location of operands and result
 Where other than memory?
 How many explicit operands?
 How are memory operands located?
Data type and Size
Operations
 What are supported?
Classifying Instruction Set Architecture
Instruction Set Design
$$\text{CPU\_Time} = \text{IC} \times \text{CPI} \times \text{Cycle\_time}$$
The instruction set influences everything
Instruction Characteristics
 Usually a simple operation
 Which operation is identified by the op-code field
 But operations require operands - 0, 1, or 2
 To identify where they are, they must be addressed
• Address is to some piece of storage
• Typical storage possibilities are main memory, registers, or a stack
 2 options explicit or implicit addressing
 Implicit - the op-code implies the address of the operands
• ADD on a stack machine - pops the top 2 elements of the stack,
then pushes the result
• HP calculators work this way
 Explicit - the address is specified in some field of the instruction
• Note the potential for 3 addresses - 2 operands + the destination
Operand Locations for Four ISA Classes
C = A + B
• Stack
    Push A
    Push B
    Add      (pop the top-2 values of the stack (A, B) and push the result value onto the stack)
    Pop C
• Accumulator (AC)
    Load A
    Add B    (add AC (A) with B and store the result into AC)
    Store C
• Register (register-memory)
    Load R1, A
    Add R3, R1, B
    Store R3, C
• Register (load-store)
    Load R1, A
    Load R2, B
    Add R3, R1, R2
    Store R3, C
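To make the difference between implicit (stack) and explicit (load-store) operands concrete, here is a toy interpreter for two of the four code sequences. It is a sketch under simplifying assumptions: memory is a Python dict keyed by variable name, and the tuple-based instruction format is invented for illustration.

```python
def run_stack(program, memory):
    """Stack machine: operands are implicit, taken from the top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "pop":
            memory[args[0]] = stack.pop()
    return memory


def run_load_store(program, memory):
    """Load-store machine: ALU operations name registers explicitly."""
    regs = {}
    for op, *args in program:
        if op == "load":
            regs[args[0]] = memory[args[1]]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":
            memory[args[1]] = regs[args[0]]
    return memory


stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
load_store_prog = [("load", "R1", "A"), ("load", "R2", "B"),
                   ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

mem_stack = run_stack(stack_prog, {"A": 2, "B": 3})
mem_ls = run_load_store(load_store_prog, {"A": 2, "B": 3})
print(mem_stack["C"], mem_ls["C"])
```

Both programs compute the same C; the stack version names no operands in Add, while the load-store version names three registers.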
Modern Choice – Load-store Register
(GPR) Architecture
 Reasons for choosing GPR (general-purpose registers)
architecture
 Registers (stacks and accumulators…) are faster than memory
 Registers are easier and more effective for a compiler to use
• (A+B) – (C*D) – (E*F)
– May be evaluated in any order (for pipelining concerns or …)
» But on a stack machine → must be evaluated left to right
 Registers can be used to hold variables
• Reduce memory traffic
• Speed up programs
• Improve code density (fewer bits are used to name a register)
 Compiler writers prefer that all registers be equivalent
and unreserved
 The number of GPR: at least 16
Memory Addressing
Memory Addressing Basics
All architectures must address memory
What is accessed - byte, word, multiple words?
• Today's machines are byte addressable
• Main memory is organized in 32 - 64 byte lines
• Big-Endian or Little-Endian addressing
Hence there is a natural alignment problem
 Size s bytes at byte address A is aligned if
A mod s = 0
 Misaligned access takes multiple aligned memory
references
Memory addressing mode influences instruction
counts (IC) and clock cycles per instruction (CPI)
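The alignment rule (A mod s = 0) and the cost of a misaligned access can be sketched as follows. The 4-byte memory width is an assumption chosen for illustration, as are the function names.

```python
def is_aligned(address, size):
    """An access of `size` bytes at byte address `address` is aligned if A mod s == 0."""
    return address % size == 0


def aligned_references(address, size, width=4):
    """Number of aligned `width`-byte memory references needed to cover the access.

    `width` is an assumed memory access width in bytes.
    """
    first_word = address // width
    last_word = (address + size - 1) // width
    return last_word - first_word + 1


print(is_aligned(8, 4), aligned_references(8, 4))   # aligned word access
print(is_aligned(6, 4), aligned_references(6, 4))   # misaligned word access
```

A 4-byte access at address 6 straddles two aligned words, so it needs two memory references instead of one.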
Big-Endian and Little-Endian Assignments
Big-Endian: lower byte addresses are used for the most significant bytes of the word
Little-Endian: opposite ordering. lower byte addresses are used for the less significant
bytes of the word
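[Figure: Byte and word addressing. (a) Big-endian assignment: the word at address 0 holds byte addresses 0, 1, 2, 3 from most to least significant byte, the word at address 4 holds 4, 5, 6, 7, and the last word, at address 2^k - 4, holds 2^k - 4 through 2^k - 1. (b) Little-endian assignment: the word at address 0 holds byte addresses 3, 2, 1, 0 from most to least significant byte, the word at address 4 holds 7, 6, 5, 4, and the last word holds 2^k - 1 down to 2^k - 4.]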
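The two byte orderings can be observed directly with Python's built-in `int.to_bytes`. This is a small illustration; the example value is arbitrary.

```python
value = 0x0A0B0C0D  # arbitrary 32-bit example word

big = value.to_bytes(4, byteorder="big")        # most significant byte at the lowest address
little = value.to_bytes(4, byteorder="little")  # least significant byte at the lowest address

for offset in range(4):
    print(f"byte address {offset}: big-endian {big[offset]:#04x}, "
          f"little-endian {little[offset]:#04x}")
```

Index 0 of each byte string plays the role of the lowest byte address in the word.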
Addressing Modes
• Immediate: Add R4, #3
    Regs[R4] ← Regs[R4] + 3
• Register: Add R4, R3
    Regs[R4] ← Regs[R4] + Regs[R3]
• Register Indirect: Add R4, (R1)
    Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
Addressing Modes(Cont.)
• Direct: Add R4, (1001)
    Regs[R4] ← Regs[R4] + Mem[1001]
• Memory Indirect: Add R4, @(R3)
    Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]
Addressing Modes(Cont.)
• Displacement: Add R4, 100(R1)
    Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
• Scaled: Add R1, 100(R2)[R3]
    Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3]*d]
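The addressing modes on these slides differ only in how the operand is located. The sketch below models each mode as a lookup over a toy register file and memory; the register and memory contents, the scale factor `D`, and the `operand` function are all made up for illustration.

```python
regs = {"R1": 8, "R2": 4, "R3": 2, "R4": 10}   # toy register file
mem = {2: 8, 8: 7, 108: 9, 112: 11, 1001: 5}   # toy memory: address -> value
D = 4  # assumed operand size in bytes, used by the scaled mode


def operand(mode, **f):
    """Return the operand value selected by the given addressing mode."""
    if mode == "immediate":
        return f["imm"]
    if mode == "register":
        return regs[f["reg"]]
    if mode == "register_indirect":
        return mem[regs[f["reg"]]]
    if mode == "direct":
        return mem[f["addr"]]
    if mode == "memory_indirect":
        return mem[mem[regs[f["reg"]]]]
    if mode == "displacement":
        return mem[f["disp"] + regs[f["reg"]]]
    if mode == "scaled":
        return mem[f["disp"] + regs[f["base"]] + regs[f["index"]] * D]
    raise ValueError(f"unknown addressing mode: {mode}")


# Add R4, 100(R1): Regs[R4] <- Regs[R4] + Mem[100 + Regs[R1]]
result = regs["R4"] + operand("displacement", disp=100, reg="R1")
print(result)
```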
Typical Address Modes (I)
Typical Address Modes (II)
Operand Type & Size
Typical types: assume word= 32 bits
 Character - byte - ASCII or EBCDIC (IBM) - 4
per word
 Short integer - 2- bytes, 2’s complement
 Integer - one word - 2’s complement
 Float - one word - usually IEEE 754 these
days
 Double precision float - 2 words - IEEE 754
 BCD or packed decimal - 4- bit values packed
8 per word
ALU Operations
What Operations are Needed
 Arithmetic + Logical
 Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT
 Logical operation: AND, OR, XOR, NOT
 Data Transfer - copy, load, store
 Control - branch, jump, call, return, trap
 System - OS and memory management
 We’ll ignore these for now - but remember they are needed
 Floating Point
 Same as arithmetic but usually take bigger operands
 Decimal
 String - move, compare, search
 Graphics – pixel and vertex,
compression/decompression operations
Top 10 Instructions for 80x86
 load: 22%
 conditional branch: 20%
 compare: 16%
 store: 12%
 add: 8%
 and: 6%
 sub: 5%
 move register-register:
4%
 call: 1%
 return: 1%
The most widely executed instructions are the simple operations of an instruction set
The top-10 instructions for 80x86 account for 96% of instructions executed
Make them fast, as they are the common case
Control Instructions are a Big Deal
Jumps - unconditional transfer
Conditional Branches
 How is condition code set? – by flag or part of the
instruction
 How is target specified? How far away is it?
Calls
 How is target specified? How far away is it?
 Where is return address kept?
 How are the arguments passed? Callee vs. Caller
save!
Returns
 Where is the return address? How far away is it?
 How are the results passed?
Branch Address Specification
Known at compile time for unconditional and
conditional branches - hence specified in the
instruction
 As a register containing the target address
 As a PC-relative offset
Consider word length addresses, registers, and
instructions
 Full address desired? Then pick the register option.
• BUT - setup and effective address will take longer.
 If you can deal with smaller offset then PC relative
works
• PC relative is also position independent - so simple linker
duty
Returns and Indirect Jumps
Branch target is not known at compile time
Need a way to specify the target
dynamically
 Use a register
 Permit any addressing mode
 Regs[R4]  Regs[R4] + Mem[Regs[R1]]
Also useful for
 case or switch
 Dynamically shared libraries
 High-order functions or function pointers
Encoding an Instruction Set
Encoding the ISA
 Encode instructions into a binary representation for
execution by CPU
 Can pick anything but:
 Affects the size of code - so it should be tight
 Affects the CPU design - in particular the instruction decode
 So it may have a big influence on the CPI or cycle-time
 Must balance several competing forces
 Desire for lots of addressing modes and registers
 Desire to make average program size compact
 Desire to have instructions encoded into lengths that will be easy
to handle in a pipelined implementation (multiple of bytes)
3 Popular Encoding Choices
 Variable (compact code but difficult to encode)




Primary opcode is fixed in size, but opcode modifiers may exist
Opcode specifies number of arguments - each used as address fields
Best when there are many addressing modes and operations
Use as few bits as possible, but individual instructions can vary widely in
length
 e. g. VAX - integer ADD versions vary between 3 and 19 bytes
 Fixed (easy to encode, but lengthy code)
 Every instruction looks the same - some field may be interpreted
differently
 Combine the operation and the addressing mode into the opcode
 e. g. all modern RISC machines
 Hybrid
 Set of fixed formats
 e. g. IBM 360 and Intel 80x86
Trade-off between size of program vs. ease of decoding
3 Popular Encoding Choices (Cont.)
An Example of Variable Encoding -- VAX
addl3 r1, 737(r2), (r3): 32-bit integer add instruction with 3 operands → needs 6 bytes to represent it
 Opcode for addl3: 1 byte
 A VAX address specifier is 1 byte (4-bits: addressing
mode, 4-bits: register)
• r1: 1 byte (register addressing mode + r1)
• 737(r2)
– 1 byte for address specifier (displacement addressing + r2)
– 2 bytes for displacement 737
• (r3): 1 byte for address specifier (register indirect + r3)
Length of VAX instructions: 1—53 bytes
Short Summary – Encoding the
Instruction Set
Choice between variable and fixed instruction encoding
• Code size matters more than performance → variable encoding
• Performance matters more than code size → fixed encoding
Role of Compilers
Critical goals in ISA from the compiler
viewpoint
 What features will lead to high-quality code
 What makes it easy to write efficient
compilers for an architecture
Compiler and ISA
ISA decisions are no longer made just to ease assembly language (AL) programming
Because of high-level languages (HLLs), the ISA is a compiler target today
Performance of a computer will be significantly
affected by compiler
Understanding compiler technology today is
critical to designing and efficiently implementing
an instruction set
Architecture choice affects the code quality and
the complexity of building a compiler for it
Optimization Observations
Hard to reduce branches
Biggest reduction is often memory
references
Some ALU operation reduction happens
but it is usually a few %
Implication:
 Branch, Call, and Return become a larger
relative % of the instruction mix
 Control instructions among the hardest to
speed up
The MIPS Architecture
MIPS Instruction Format
Encode addressing mode into the opcode
All instructions are 32 bits with 6-bit
primary opcode
MIPS Instruction Format (Cont.)
I-Type Instruction
opcode (6 bits) | rs (5 bits) | rt (5 bits) | Immediate (16 bits)
 Loads and Stores
LW R1, 30(R2), S.S F0, 40(R4)
 ALU ops on immediates
DADDIU R1, R2, #3
 rt <-- rs op immediate
 Conditional branches
BEQZ R3, offset
 rs is the register checked
 rt unused
 immediate specifies the offset
 Jump registers ,jump and link register
JR R3
 rs is target register
 rt and immediate are unused but = 011
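The I-type layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate) can be exercised by packing and unpacking a 32-bit word. This is a sketch; the helper names are illustrative, and the only outside fact used is that the MIPS opcode for LW is 0x23.

```python
def encode_i_type(opcode, rs, rt, immediate):
    """Pack the four I-type fields into one 32-bit instruction word."""
    assert 0 <= opcode < 64 and 0 <= rs < 32 and 0 <= rt < 32
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


def decode_i_type(word):
    """Unpack a 32-bit word into (opcode, rs, rt, immediate)."""
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF


LW_OPCODE = 0x23  # MIPS opcode for load word

# LW R1, 30(R2): rs is the base register (R2), rt is the destination (R1).
word = encode_i_type(LW_OPCODE, rs=2, rt=1, immediate=30)
print(f"{word:#010x}")
print(decode_i_type(word))
```

Because every field sits at a fixed bit position, decoding is a handful of shifts and masks, which is part of why fixed-length encodings are easy to decode.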
MIPS Instruction Format (Cont.)
R-Type Instruction
opcode (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | func (6 bits)
• Register-register ALU operations: rd ← rs funct rt, e.g., DADDU R1, R2, R3
• Function field encodes the data path operations: Add, Sub...
• Read/write special registers
• Moves
J-Type Instruction: Jump, Jump and Link, Trap and return from exception
opcode (6 bits) | Offset added to PC (26 bits)
Data path
The processor : Data Path and Control
[Figure: abstract view of the datapath: the PC supplies an address to the instruction memory; the instruction supplies register numbers to the register bank; register data feeds the ALU; the ALU result is used as an address for the data memory or as data written back to the registers.]
Two types of functional units:
• elements that operate on data values (combinational)
• elements that contain state (state elements)
Single Cycle Implementation
[Figure: state element 1 feeds combinational logic, which feeds state element 2, all within one clock cycle.]
Typical execution:
• read contents of some state elements,
• send values through some combinational logic,
• write results to one or more state elements
Using a clock signal for synchronization
Edge-triggered methodology
A portion of the datapath used for fetching instructions
The datapath for R-type instructions
The datapath for load and store instructions
The datapath for branch instructions
Complete Data Path
Control
 Selecting the operations to perform (ALU, read/write,
etc.)
 Controlling the flow of data (multiplexor inputs)
 Information comes from the 32 bits of the instruction
 Example: lW $1, 100($2)
 Value of control signals is dependent upon:
 what instruction is being executed
 which step is being performed
Data Path with Control
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
memory (2ns), ALU and adders (2ns), register file access
(1ns)
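One way to work this exercise is to add up the delays along each instruction class's path and take the maximum. The delays come from the slide; the list of functional units each class uses is an assumption based on the usual textbook single-cycle MIPS datapath, since the slide's figure is not available here.

```python
MEM, ALU, REG = 2, 2, 1  # delays in ns, from the slide

# Assumed functional-unit usage per instruction class (not stated on the slide):
# instruction fetch, register read, ALU, data memory access, register write.
paths = {
    "R-type": [MEM, REG, ALU, REG],
    "load":   [MEM, REG, ALU, MEM, REG],
    "store":  [MEM, REG, ALU, MEM],
    "branch": [MEM, REG, ALU],
    "jump":   [MEM],
}

latency = {name: sum(units) for name, units in paths.items()}
cycle_time = max(latency.values())  # the clock must fit the slowest instruction

for name, ns in latency.items():
    print(f"{name}: {ns} ns")
print(f"single-cycle clock period: {cycle_time} ns")
```

Under these assumptions the load instruction sets the clock period, and every other instruction wastes the difference.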
Pipelining
Basic DataPath
•What do we need to add to actually split the datapath into stages?
Pipeline DataPath
The Five Stages of the Load Instruction
Load: Cycle 1 Ifetch | Cycle 2 Reg/Dec | Cycle 3 Exec | Cycle 4 Mem | Cycle 5 Wr
• Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
Pipelined Execution
[Figure: time vs. program flow for six instructions, each passing through IFetch, Dcd, Exec, Mem, WB, with each instruction starting one cycle after the previous one.]
On a pipelined processor, multiple instructions are in various stages at the same time.
Assume each instruction takes five cycles
Single Cycle, Multiple Cycle, vs. Pipeline
Graphically Representing Pipelines
Can help with answering questions like:
• How many cycles does it take to execute this code?
• What is the ALU doing during cycle 4?
• Are two instructions trying to use the same resource at the same time?
Why Pipeline? Because the resources are there!
Why Pipeline?
Suppose
 100 instructions are executed
 The single cycle machine has a cycle time of 45 ns
 The multicycle and pipeline machines have cycle times of 10
ns
 The multicycle machine has a CPI of 4.6
Single Cycle Machine
 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
Ideal pipelined machine
 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Ideal pipelined vs. single cycle speedup
 4500 ns / 1040 ns = 4.33
What has not yet been
considered?
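The three totals above follow from time = cycle time x number of cycles. The sketch below reproduces them; the variable names are illustrative.

```python
n = 100  # instructions executed

single_cycle_ns = 45 * 1 * n      # one long cycle per instruction
multicycle_ns = 10 * 4.6 * n      # short cycles, but 4.6 of them per instruction
pipelined_ns = 10 * (1 * n + 4)   # one instruction per cycle plus 4 cycles to drain

print(f"single cycle: {single_cycle_ns} ns")
print(f"multicycle:   {multicycle_ns:.0f} ns")
print(f"pipelined:    {pipelined_ns} ns")
print(f"pipelined vs. single-cycle speedup: {single_cycle_ns / pipelined_ns:.2f}")
```

The 4-cycle drain matters less as the instruction count grows, so for long programs the ideal speedup approaches 45 / 10 = 4.5.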
Compare Performance
Compare: single-cycle, multicycle, and pipelined control using SPECint2000
• Single-cycle: memory access = 200 ps, ALU = 100 ps, register file read and write = 50 ps
  • 200 + 50 + 100 + 200 + 50 = 600 ps
• Multicycle: 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU
  • CPI = 4.12, clock cycle = 200 ps (longest functional unit)
• Pipelined
  • Loads: 1 clock cycle when there is no load-use dependence, 2 when there is; average 1.5 per load
  • Stores and ALU instructions take 1 clock cycle
  • Branches: 1 when predicted correctly and 2 when not; average 1.25
  • Jumps: 2
  • CPI = 1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17
• Average instruction time: single-cycle = 600 ps, multicycle = 4.12 x 200 = 824 ps, pipelined = 1.17 x 200 = 234 ps
• Memory access (200 ps) is the bottleneck. How to improve?
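Both CPI figures are weighted averages over the instruction mix. In the sketch below the pipelined cycle counts come from the slide, while the multicycle counts per class (5, 4, 4, 3, 3) are an assumption taken from the usual textbook multicycle design; they are chosen because they reproduce the stated CPI of 4.12.

```python
mix = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}

# Assumed cycles per class for the multicycle design (not listed on the slide).
multicycle_cycles = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}
# Average cycles per class for the pipelined design, from the slide.
pipelined_cycles = {"load": 1.5, "store": 1, "alu": 1, "branch": 1.25, "jump": 2}

CLOCK_PS = 200  # multicycle and pipelined clock period


def average_cpi(mix, cycles):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(mix[kind] * cycles[kind] for kind in mix)


multicycle_cpi = average_cpi(mix, multicycle_cycles)
pipelined_cpi = average_cpi(mix, pipelined_cycles)

print(f"multicycle: CPI {multicycle_cpi:.2f}, {multicycle_cpi * CLOCK_PS:.0f} ps per instruction")
print(f"pipelined:  CPI {pipelined_cpi:.4f}, {pipelined_cpi * CLOCK_PS:.1f} ps per instruction")
```

The unrounded pipelined CPI is 1.1725, so the average instruction time is 234.5 ps; the slide's 234 ps comes from rounding the CPI to 1.17 first.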
Can pipelining get us into trouble?
Yes: Pipeline Hazards
• structural hazards: attempt to use the same resource two different ways at the same time
  • E.g., two instructions try to read the same memory at the same time
• data hazards: attempt to use an item before it is ready
  • instruction depends on the result of a prior instruction still in the pipeline
      add r1, r2, r3
      sub r4, r2, r1
• control hazards: attempt to make a decision before the condition is evaluated
  • branch instructions
      beq r1, loop
      add r1, r2, r3
Can always resolve hazards by waiting
• pipeline control must detect the hazard
• take action (or delay action) to resolve hazards
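Detecting the data hazard in the add/sub pair amounts to checking whether an instruction reads a register that a recent, still-in-flight instruction writes. The sketch below is a simplified model: the `Instr` tuple and the `window` parameter (how many earlier instructions are assumed to still be in the pipeline) are illustrative assumptions, not a description of real hazard-detection hardware.

```python
from collections import namedtuple

# op: mnemonic, dest: register written (or None), srcs: registers read
Instr = namedtuple("Instr", "op dest srcs")


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for read-after-write hazards.

    `window` is the assumed number of preceding instructions whose results
    have not yet been written back.
    """
    hazards = []
    for i, consumer in enumerate(program):
        for j in range(max(0, i - window), i):
            producer = program[j]
            if producer.dest is not None and producer.dest in consumer.srcs:
                hazards.append((j, i, producer.dest))
    return hazards


program = [
    Instr("add", "r1", ("r2", "r3")),  # add r1, r2, r3
    Instr("sub", "r4", ("r2", "r1")),  # sub r4, r2, r1 reads r1 before it is written back
]
print(raw_hazards(program))
```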
Single Memory is a Structural Hazard
[Figure: pipeline diagram, time (clock cycles) vs. instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4), each instruction passing through Mem, Reg, ALU, Mem, Reg. With a single memory, the Load's data access and a later instruction's fetch need the memory in the same cycle.]
Detection is easy in this case!
(right half highlight means read, left half write)
Structural Hazards limit performance
Example: if 1.3 memory accesses per instruction and only one memory access per cycle then
• average CPI ≥ 1.3
• otherwise resource is more than 100% utilized
Solution 1: Use separate instruction and data memories
Solution 2: Allow memory to read and write more than one word per cycle
Solution 3: Stall
Control Hazard Solutions
Stall: wait until decision is clear
• It's possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
[Figure: pipeline diagram, time (clock cycles) vs. instruction order (Add, Beq, Load), with the Load delayed until the branch decision is known.]
Impact: 2 clock cycles per branch instruction => slow
Control Hazard Solutions
Predict: guess one direction, then back up if wrong
• Predict not taken
[Figure: pipeline diagram, time (clock cycles) vs. instruction order (Add, Beq, Load).]
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)
More dynamic scheme: history of 1 branch (≈ 90%)
Control Hazard Solutions
Redefine branch behavior (takes place after next instruction): “delayed branch”
[Figure: pipeline diagram, time (clock cycles) vs. instruction order (Add, Beq, Misc, Load).]
Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time)
Launching more instructions per clock cycle => less useful
Data Hazard
Data Hazard---Forwarding
Use temporary results, don’t wait for them to be written
register file forwarding to handle read/write to same register
ALU forwarding
Can’t always forward
• Load word can still cause a hazard:
  • an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to “stall” the load instruction
Stalling
 We can stall the pipeline by keeping an instruction in the same
stage
Memory Hierarchy Design
5.1 Introduction
Memory Hierarchy Design
Motivated by the principle of locality - A 90/10
type of rule
 Take advantage of 2 forms of locality
• Spatial - nearby references are likely
• Temporal - same reference is likely soon
Also motivated by cost/performance structures
 Smaller hardware is faster: SRAM, DRAM, Disk, Tape
 Access vs. bandwidth variations
 Fast memory is more expensive
Goal – Provide a memory system with cost
almost as low as the cheapest level and speed
almost as fast as the fastest level
Memory relevance in Computer Design ?
A computer’s performance is given by the number of
instructions executed per time unit
The time for executing an instruction depends on:
 The ALU speed (I.e. the data-path cycle duration)
 The time it takes for each instruction to load/store its
operands/result from/into the memory (in brief, the time to
access memory)
The processing speed (CPU speed) grows faster than
the memory speed. As a result the CPU speed cannot
be fully exploited. This speed gap leads to an
Unbalanced System !
Levels in A Typical Memory
Hierarchy
Unit of Transfer / Addressable Unit
 Unit of Transfer: Number of bits read from, or written
into memory at a time
 Internal : usually governed by data bus width
 External : usually a block of words, e.g. 512 or
more
 Addressable unit: smallest location which can be
uniquely addressed
 Internal : word or byte
 External : device dependent e.g. a disk “cluster”
Access Method
Sequential
 Data is stored in records, access is in linear sequence
(tape)
Direct
 Data blocks have a unique and direct access, data
within block is sequential (disk)
Random
 Data has unique and direct access (RAM)
Associative
 Data retrieved based on (partial) match rather than
address (cache)
5.2 Review of the
ABCs of Caches
36 Basic Terms on Caches
Cache
Fully associative
Write allocate
Virtual memory
dirty bit
unified cache
memory stall cycles
block offset
misses per instruction
direct mapped
write back
block
valid bit
data cache
locality
block address
hit time
address trace
write through
cache miss
set
instruction cache
page fault
random placement
average memory access time
miss rate
index field
cache hit
n-way set associative
no-write allocate
page
least-recently used
write buffer
miss penalty
tag field
write stall
Cache
The first level of the memory hierarchy
encountered once the address leaves the CPU
 Persistent mismatch between CPU and main-memory
speeds
 Exploit the principle of locality by providing a small,
fast memory between CPU and main memory -- the
cache memory
The term cache is now applied whenever buffering is
employed to reuse commonly occurring items
(e.g. file caches)
Caching – copying information into faster
storage system
 Main memory can be viewed as a cache for
secondary storage
General Hierarchy Concepts
 At each level - block concept is present (block is the
caching unit)
 Block size may vary depending on level
• Amortize longer access by bringing in larger chunk
• Works if locality principle is true
 Hit - access where block is present - hit rate is the probability
 Miss - access where block is absent (in lower levels) - miss rate
 Mirroring and consistency
 Data residing in higher level is subset of data in lower level
 Changes at a higher level must be reflected down at some point
• The policy that decides when this happens is the consistency mechanism
 Addressing
 Whatever the organization you have to know how to get at it!
 Address checking and protection
Physical Address Structure
Key is that you want different block sizes
at different levels
Latency and Bandwidth
The time required for the cache miss depends
on both latency and bandwidth of the memory
(or lower level)
Latency determines the time to retrieve the first
word of the block
Bandwidth determines the time to retrieve the
rest of this block
A cache miss is handled by hardware and
causes processors following in-order execution
to pause or stall until the data are available
Predicting Memory Access Times
On a hit: simple access time to the cache
On a miss: access time + miss penalty
 Miss penalty = access time of lower + block transfer
time
 Block transfer time depends on
• Block size - bigger blocks mean longer transfers
• Bandwidth between the two levels of memory
– Bandwidth usually dominated by the slower memory and the
bus protocol
Performance
 Average-Memory-Access-Time = Hit-Access-Time +
Miss-Rate * Miss-Penalty
 Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
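The two performance formulas above translate directly into code. The following C sketch is only an illustration; the function and parameter names are chosen here and are not part of the slides.

```c
/* Average memory access time, in the same unit as hit_time and
   miss_penalty (clock cycles or ns). */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* Total memory stall cycles for a program of instr_count instructions. */
double memory_stall_cycles(double instr_count, double refs_per_instr,
                           double miss_rate, double miss_penalty)
{
    return instr_count * refs_per_instr * miss_rate * miss_penalty;
}
```

With assumed values of a 1-cycle hit time, a 2% miss rate and a 50-cycle miss penalty, `amat(1.0, 0.02, 50.0)` gives 2 cycles.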
Four Standard Questions
Block Placement
 Where can a block be placed in the upper
level?
Block Identification
 How is a block found if it is in the upper level?
Block Replacement
 Which block should be replaced on a miss?
Write Strategy
 What happens on a write?
Answer the four questions for the first level of the memory hierarchy
Block Placement Options
Direct Mapped
 (Block address) MOD (# of cache blocks)
Fully Associative
 Can be placed anywhere
Set Associative
 Set is a group of n blocks -- each block is called a
way
 Block first mapped into a set → (Block address)
MOD (# of cache sets)
 Placed anywhere in the set
Most caches are direct mapped, 2- or 4-way
set associative
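The placement rules above reduce to one modulo operation each. A minimal C sketch follows; the function names are illustrative, and the cache is assumed to hold a number of blocks that is a multiple of the number of ways.

```c
/* Direct mapped: each block has exactly one candidate frame. */
unsigned direct_mapped_frame(unsigned block_addr, unsigned num_blocks)
{
    return block_addr % num_blocks;
}

/* Set associative: each block has one candidate set and may occupy any
   way inside it. With ways == num_blocks there is a single set, which
   is the fully associative case. */
unsigned set_index(unsigned block_addr, unsigned num_blocks, unsigned ways)
{
    unsigned num_sets = num_blocks / ways;
    return block_addr % num_sets;
}
```

For block address 12 in a cache of 8 blocks, the direct-mapped frame is 12 MOD 8 = 4, the 2-way set is 12 MOD 4 = 0, and the fully associative cache has only set 0.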
Block Placement Options (Cont.)
Block Identification
Each cache block carries tags
 Address Tags: which block am I?
 Many memory blocks may map to the same cache block
 Physical address now: address tag ## set index ## block offset
 Note relationship of block size, cache size, and tag
size
 The smaller the set tag the cheaper it is to find
 Status Tags: what state is the block in?
 valid, dirty, etc.
Physical address = r + m + n bits: r (address tag), m (set index), n (block offset)
2^m addressable sets in the cache, 2^n bytes per block
Block Identification (Cont.)
Physical address = r + m + n bits: r (address tag), m (set index), n (block offset)
2^m addressable sets in the cache, 2^n bytes per block
• Caches have an address tag on each block frame that gives the block address.
• A valid bit says whether or not this entry contains a valid address.
• The block frame address can be divided into the tag field and the index field.
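Splitting an address into its r, m and n bit fields can be sketched with shifts and masks. This C sketch assumes a 32-bit physical address with m + n < 32; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct addr_fields {
    uint32_t tag;    /* upper r bits */
    uint32_t index;  /* middle m bits: selects one of 2^m sets */
    uint32_t offset; /* lower n bits: selects a byte in a 2^n-byte block */
};

/* Split a 32-bit physical address into tag, set index and block offset. */
struct addr_fields split_address(uint32_t addr, unsigned m, unsigned n)
{
    struct addr_fields f;
    f.offset = addr & ((1u << n) - 1u);
    f.index  = (addr >> n) & ((1u << m) - 1u);
    f.tag    = addr >> (n + m);
    return f;
}
```

With 64-byte blocks (n = 6) and 256 sets (m = 8), address 0x12345678 splits into offset 0x38, index 0x59 and tag 0x48D1.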
Block Replacement
Random: just pick one and chuck it
 Simple hash game played on target block frame
address
 Some use truly random
• But lack of reproducibility is a problem at debug time
LRU - least recently used
 Need to keep time since each block was last
accessed
• Expensive if number of blocks is large due to global compare
• Hence approximation is often used = Use bit tag and LFU
FIFO
Only one choice for direct-mapped placement
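LRU victim selection inside one set can be sketched as follows. This is exact LRU with a per-way timestamp, not the use-bit approximation mentioned above; the 4-way set size and all names are assumptions made for the example.

```c
#define WAYS 4

/* One set of a hypothetical 4-way cache; last_used is a logical timestamp. */
struct cache_set {
    int           valid[WAYS];
    unsigned      tag[WAYS];
    unsigned long last_used[WAYS];
};

/* Choose the way to fill on a miss: an invalid way if there is one,
   otherwise the way with the oldest timestamp (least recently used). */
int pick_victim(const struct cache_set *s)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w])
            return w;
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    }
    return victim;
}

/* Record a hit or a fill so that the way becomes the most recently used. */
void touch(struct cache_set *s, int way, unsigned long now)
{
    s->last_used[way] = now;
}
```

The compare across all ways is the cost the slide refers to: it grows with the number of blocks that compete for replacement.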
Short Summaries from the Previous
Figure
More-way associative is better for small cache
2- or 4-way associative perform similar to 8-way
associative for larger caches
Larger cache size is better
LRU is the best for small block sizes
Random works fine for large caches
FIFO outperforms random in smaller caches
Little difference between LRU and random for
larger caches
Improving Cache Performance
MIPS mix is 10% stores and 37% loads
 Writes are about 10%/(100%+10%+37%) = 7% of
overall memory traffic, and 10%/(10%+37%)=21% of
data cache traffic
Make the common case fast
 Implies optimizing caches for reads
Read optimizations
 Block can be read concurrent with tag comparison
 On a hit the read information is passed on
 On a miss - discard the block that was read and start the miss
access
Write optimizations
 Can’t modify until after tag check - hence writes take longer
Write Options
 Write through: write posted to cache line and through to next lower
level
 Incurs write stall (use an intermediate write buffer to reduce the stall)
 Write back
 Only write to cache not to lower level
 Implies that cache and main memory are now inconsistent
• Mark the line with a dirty bit
• If this block is replaced and dirty then write it back
 Pros and cons → both are useful
 Write through
• No write on read miss, simpler to implement, no inconsistency with main
memory
 Write back
• Uses less main memory bandwidth, write times independent of main
memory speeds
• Multiple writes within a block require only one write to the main memory
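The difference in main-memory traffic between the two policies can be sketched by counting writes for a single cache line. This is a simplified C model with hypothetical names; it ignores write buffers and tracks only the valid and dirty bits.

```c
#include <stdbool.h>

struct line {
    bool     valid;
    bool     dirty;
    unsigned tag;
};

/* Write hit under write back: update only the cache and mark the line
   dirty. Returns the number of writes sent to main memory (none). */
int write_hit_write_back(struct line *l)
{
    l->dirty = true;
    return 0;
}

/* Write hit under write through: the line stays consistent with memory,
   and every write costs one main-memory write. */
int write_hit_write_through(struct line *l)
{
    l->dirty = false;
    return 1;
}

/* Replace a line: memory is written only if the old contents were dirty. */
int replace_line(struct line *l, unsigned new_tag)
{
    int mem_writes = (l->valid && l->dirty) ? 1 : 0;
    l->valid = true;
    l->dirty = false;
    l->tag = new_tag;
    return mem_writes;
}
```

Three write hits followed by a replacement cost one memory write under write back and three under write through, which is the bandwidth saving listed above.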
5.3 Cache
Performance
Cache Performance
Cache Performance Example
 Each instruction takes 2 clock cycles (ignoring memory
stalls)
 Cache miss penalty – 50 clock cycles
 Miss rate = 2%
 Average 1.33 memory references per instruction
• Ideal – IC * 2 * cycle-time
• With cache – IC * (2 + 1.33 * 2% * 50) * cycle-time = IC * 3.33 * cycle-time
• No cache – IC * (2 + 1.33 * 100% * 50) * cycle-time = IC * 68.5 * cycle-time
• The importance of cache for CPUs with lower CPI and higher clock
rates is greater – Amdahl’s Law
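The three cases in this example differ only in the miss rate fed into the same expression. A small C sketch, with a function name chosen here for illustration:

```c
/* Effective cycles per instruction once memory stalls are included. */
double effective_cpi(double base_cpi, double refs_per_instr,
                     double miss_rate, double miss_penalty)
{
    return base_cpi + refs_per_instr * miss_rate * miss_penalty;
}
```

`effective_cpi(2.0, 1.33, 0.02, 50.0)` reproduces the 3.33 cycles of the cached case, a miss rate of 0 gives the ideal 2 cycles, and a miss rate of 1.0 models the machine without a cache.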
Average Memory Access Time VS
CPU Time
 Compare two different cache organizations
 Miss rate – direct-mapped (1.4%), 2-way associative (1.0%)
 Clock-cycle-time – direct-mapped (2.0ns), 2-way associative
(2.2ns)
 CPI with a perfect cache – 2.0, average memory
references per instruction – 1.3; miss-penalty – 70ns; hit-time – 1 CC
• Average Memory Access Time (Hit time + Miss_rate * Miss_penalty)
• AMAT(Direct) = 1 * 2 + (1.4% * 70) = 2.98ns
• AMAT(2-way) = 1 * 2.2 + (1.0% * 70) = 2.90ns
• CPU Time
• CPU(Direct) = IC * (2 * 2 + 1.3 * 1.4% * 70) = 5.27 * IC
• CPU(2-way) = IC * (2 * 2.2 + 1.3 * 1.0% * 70) = 5.31 * IC
Since CPU time is our bottom-line evaluation, and since direct mapped is
simpler to build, the preferred cache is direct mapped in this example
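The comparison can be checked with two small functions. This C sketch uses the numbers from the example above; the function names are illustrative.

```c
/* AMAT in ns: the hit time is one clock cycle, the miss penalty is in ns. */
double amat_ns(double cycle_ns, double miss_rate, double miss_penalty_ns)
{
    return 1.0 * cycle_ns + miss_rate * miss_penalty_ns;
}

/* CPU time per instruction in ns: base CPI at the given cycle time
   plus the memory stall time per instruction. */
double cpu_time_per_instr_ns(double cpi, double cycle_ns,
                             double refs_per_instr, double miss_rate,
                             double miss_penalty_ns)
{
    return cpi * cycle_ns + refs_per_instr * miss_rate * miss_penalty_ns;
}
```

The 2-way cache wins on AMAT (2.90 ns against 2.98 ns), but its longer clock cycle applies to every instruction, so the direct-mapped cache wins on CPU time (5.27 ns against 5.31 ns per instruction).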
Unified and Split Cache
 Unified – 32KB cache, Split – 16KB IC and 16KB DC
 Hit time – 1 clock cycle, miss penalty – 100 clock cycles
 Load/Store hit takes 1 extra clock cycle for unified cache
 36% load/store – reference to cache: 74% instruction, 26% data
• Miss rate (16KB instruction) = 3.82/1000/1.0 = 0.004
• Miss rate (16KB data) = 40.9/1000/0.36 = 0.114
• Miss rate for split cache – (74%*0.004) + (26%*0.114) = 0.0324
• Miss rate for unified cache – 43.3/1000/(1+0.36) = 0.0318
• Average-memory-access-time = % inst * (hit-time + inst-miss-rate *
miss-penalty) + % data * (hit-time + data-miss-rate * miss-penalty)
• AMAT(Split) = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24
• AMAT(Unified) = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318 * 100)
= 4.44
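The split and unified results can be recomputed from the misses per 1000 instructions. The C sketch below uses hypothetical function names. The stated 0.0324 and 4.24 are reproduced when the intermediate miss rates are carried unrounded (0.00382 and 0.1136) rather than as the displayed 0.004 and 0.114.

```c
/* Miss rate from misses per 1000 instructions and memory references
   per instruction. */
double miss_rate(double misses_per_1000_instr, double refs_per_instr)
{
    return misses_per_1000_instr / 1000.0 / refs_per_instr;
}

/* AMAT in clock cycles with a 1-cycle hit time. extra_data_hit is the
   additional hit cycle a load/store pays in the unified cache (0 for split). */
double amat_mixed(double frac_instr, double instr_miss_rate,
                  double data_miss_rate, double miss_penalty,
                  double extra_data_hit)
{
    double frac_data = 1.0 - frac_instr;
    return frac_instr * (1.0 + instr_miss_rate * miss_penalty) +
           frac_data * (1.0 + extra_data_hit + data_miss_rate * miss_penalty);
}
```

For the split cache, pass `miss_rate(3.82, 1.0)` and `miss_rate(40.9, 0.36)` with `extra_data_hit` set to 0. For the unified cache, pass `miss_rate(43.3, 1.36)` for both rates with `extra_data_hit` set to 1.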
Improving Cache Performance
Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
Strategies for improving cache
performance
 Reducing the miss penalty
 Reducing the miss rate
 Reducing the miss penalty or miss rate via
parallelism
 Reducing the time to hit in the cache
5.4
Reducing Cache
Miss Penalty
Techniques for Reducing Miss
Penalty
Multilevel Caches (the most important)
Critical Word First and Early Restart
Giving Priority to Read Misses over Writes
Merging Write Buffer
Victim Caches
Multi-Level Caches
Probably the best miss-penalty reduction
Performance measurement for 2-level
caches
 AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
 Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 *
Miss-penalty-L2
 AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
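The nested formula can be written as a short function. In this C sketch the function name is illustrative, and the numbers in the usage note are assumed values, not figures from the slides.

```c
/* AMAT for a two-level cache. miss_rate_l2 is the local L2 miss rate:
   the fraction of accesses reaching L2 that also miss there. */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double miss_rate_l2,
                      double miss_penalty_l2)
{
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}
```

With an assumed 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate and 100-cycle L2 miss penalty, the L1 miss penalty is 60 cycles and the AMAT is 1 + 0.04 * 60 = 3.4 cycles.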
Critical Word First and Early
Restart
 Do not wait for full block to be loaded before restarting
CPU
 Critical Word First – request the missed word first from memory
and send it to the CPU as soon as it arrives; let the CPU
continue execution while filling the rest of the words in the block.
Also called wrapped fetch and requested word first
 Early restart -- as soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
 Benefits of critical word first and early restart depend on
 Block size: generally useful only in large blocks
 Likelihood of another access to the portion of the block that has
not yet been fetched
• Spatial locality problem: the CPU tends to want the next sequential word, so the
benefit is not clear
Victim Caches
 Remember what was just discarded in case it is needed
again
 Add small fully associative cache (called victim cache)
between the cache and the refill path
 Contain only blocks discarded from a cache because of a miss
 Are checked on a miss to see if they have the desired data
before going to the next lower-level of memory
• If yes, swap the victim block and cache block
 Addressing both victim and regular cache at the same time
• The penalty will not increase
 Jouppi (DEC SRC) shows miss reduction of 20 - 95%
 For a 4KB direct mapped cache with 1-5 victim blocks
Victim Cache Organization
5.5 Reducing Miss
Rate
Classify Cache Misses - 3 C’s
Compulsory → independent of cache size
 First access to a block → no choice but to load it
 Also called cold-start or first-reference misses
Capacity → decrease as cache size increases
 The cache cannot contain all the blocks needed during
execution, so blocks are discarded and later
retrieved
Conflict (Collision) → decrease as associativity
increases
 Side effect of set associative or direct mapping
 A block may be discarded and later retrieved if too
many blocks map to the same cache block
Techniques for Reducing Miss Rate
Larger Block Size
Larger Caches
Higher Associativity
Way Prediction Caches
Compiler optimizations
Larger Block Sizes
Obvious advantage: reduces compulsory
misses
 The reason is spatial locality
Obvious disadvantages
 Higher miss penalty: larger block takes longer
to move
 May increase conflict misses and capacity
misses if cache is small
Don’t let increase in miss penalty outweigh the decrease in miss rate
Large Caches
Help with both conflict and capacity
misses
May need longer hit time AND/OR higher
HW cost
Popular in off-chip caches
Higher Associativity
8-way set associative is for practical purposes
as effective in reducing misses as fully
associative
2:1 rule of thumb
 A 2-way set-associative cache of size N/2 has about the same
miss rate as a direct-mapped cache of size N (held for cache
sizes < 128 KB)
Greater associativity comes at the cost of
increased hit time
 Lengthen the clock cycle
 Hill [1988] suggested hit time for 2-way vs. 1-way:
external cache +10%, internal + 2%
Way Prediction
Extra bits are kept in cache to predict the way, or
block within the set of the next cache access
Multiplexor is set early to select the desired
block, and only a single tag comparison is
performed that clock cycle
A miss results in checking the other blocks for
matches in subsequent clock cycles
Alpha 21264 uses way prediction in its 2-way
set-associative instruction cache. Simulation
using SPEC95 suggested way prediction
accuracy is in excess of 85%
Compiler Optimization for Data
Idea – improve the spatial and temporal locality
of the data
Lots of options
 Array merging – Allocate arrays so that paired
operands show up in same cache block
 Loop interchange – Exchange inner and outer loop
order to improve cache performance
 Loop fusion – For independent loops accessing the
same data, fuse these loops into a single aggregate
loop
 Blocking – Do as much as possible on a sub-block
before moving on
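Loop interchange, one of the options listed above, can be illustrated in C. The array dimensions and function names are assumptions made for this sketch; both functions compute the same result and differ only in traversal order.

```c
#define ROWS 100
#define COLS 50

/* Before: the inner loop walks down a column. C stores arrays row by row,
   so consecutive accesses are COLS ints apart in memory. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: the inner loop walks along a row, so consecutive accesses touch
   adjacent words and use every word of a cache block before moving on. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

The interchange is legal here because each element is updated independently; loops that carry dependences between iterations need more care.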
Merging Arrays Example
/* Before: 2 sequential arrays (memory layout: val ... val, key ... key) */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures (memory layout: val key val key val key) */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
Reduces conflicts between val & key; improves
spatial locality
5.7 Reducing Hit
Time
Reducing Hit Time
Hit time is critical because it affects the
clock cycle time
 On many machines, cache access time limits
the clock cycle rate
A fast hit time is multiplied in importance
beyond the average memory access time
formula because it helps everything
 Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
• Miss-penalty is clock-cycle dependent
Techniques for Reducing Hit Time
Small and Simple Caches
Avoid Address Translation during Indexing
of the Cache
Pipelined Cache Access
Trace Caches
Cache Optimization Summary
5.9 Main Memory
Main Memory -- 3 important issues
 Capacity
 Latency
 Access time: time between when a read is requested and when the word
arrives
 Cycle time: minimum time between requests to memory (> access
time)
• Memory needs the address lines to be stable between accesses
 Amortize the latency by addressing big chunks, like an entire cache
block
 Critical to cache performance when the miss goes to main memory
 Bandwidth -- # of bytes read or written per unit time
 Affects the time it takes to transfer the block
3 Examples of Bus Width, Memory Width, and
Memory Interleaving to Achieve Memory Bandwidth
Wider Main Memory
 Doubling or quadrupling the width of the cache or
memory will double or quadruple the memory
bandwidth
 Miss penalty is reduced correspondingly
 Cost and Drawback
 More cost on memory bus
 Multiplexer between the cache and the CPU may be on the
critical path (the CPU still accesses the cache one word at a time)
• Multiplexers can be put between L1 and L2
 The design of error correction becomes more complicated
• If only a portion of the block is updated, all other portions must be
read to calculate the new error correction code
 Since main memory is traditionally expandable by the customer,
the minimum increment is doubled or quadrupled
5.10 Virtual
Memory
Virtual Memory
Virtual memory divides physical memory into
blocks (called pages or segments) and allocates
them to different processes
With virtual memory, the CPU produces virtual
addresses that are translated by a combination
of HW and SW to physical addresses, which
access main memory. The process is called
memory mapping or address translation
Today, the two memory-hierarchy levels
controlled by virtual memory are DRAMs and
magnetic disks
Example of Virtual to Physical
Address Mapping
Mapping by a
page table
Address Translation Hardware for
Paging
Physical address: frame number f (l-n bits), frame offset d (n bits)
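The translation step can be sketched with a flat page table. This C sketch assumes 32-bit addresses and 4 KB pages (n = 12 offset bits); the slides do not fix a page size, and the names are hypothetical.

```c
#include <stdint.h>

#define OFFSET_BITS 12u /* assumed 4 KB pages */

/* Translate a virtual address using a flat page table that maps
   virtual page number -> physical frame number. The offset within
   the page is copied unchanged. */
uint32_t translate(uint32_t vaddr, const uint32_t *page_table)
{
    uint32_t page   = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1u);
    uint32_t frame  = page_table[page];
    return (frame << OFFSET_BITS) | offset;
}
```

If page 1 maps to frame 9, virtual address 0x1ABC translates to physical address 0x9ABC. A real table entry would also carry valid and protection bits, which this sketch leaves out.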
Cache vs. VM Differences
Replacement
 Cache miss handled by hardware
 Page fault usually handled by OS
Addresses
 VM space is determined by the address size of the
CPU
 Cache space is independent of the CPU address size
Lower level memory
 For caches - the main memory is not shared by
something else
 For VM - most of the disk contains the file system
• File system addressed differently - usually in I/O space
• VM lower level is usually called SWAP space
2 VM Styles - Paged or Segmented?
 Virtual systems can be categorized into two classes: pages (fixed-size
blocks), and segments (variable-size blocks)
Property | Page | Segment
Words per address | One | Two (segment and offset)
Programmer visible? | Invisible to application programmer | May be visible to application programmer
Replacing a block | Trivial (all blocks are the same size) | Hard (must find contiguous, variable-size, unused portion of main memory)
Memory use inefficiency | Internal fragmentation (unused portion of page) | External fragmentation (unused pieces of main memory)
Efficient disk traffic | Yes (adjust page size to balance access time and transfer time) | Not always (small segments may transfer just a few bytes)
Virtual Memory – The Same 4
Questions
Block Placement
 Choice: lower miss rates and complex placement or
vice versa
• Miss penalty is huge, so choose low miss rate → place
anywhere
• Similar to fully associative cache model
Block Identification - both use additional data
structure
 Fixed size pages - use a page table
 Variable sized segments - segment table
Physical address: frame number f (l-n bits), frame offset d (n bits)
Virtual Memory – The Same 4
Questions (Cont.)
Block Replacement -- LRU is the best
 However true LRU is a bit complex – so use
approximation
• Page table contains a use tag, and on access the use tag is
set
• OS checks them every so often - records what it sees in a
data structure - then clears them all
• On a miss the OS decides which page has been used the least and
replaces that one
Write Strategy -- always write back
 Due to the access time to the disk, write through is
silly
 Use a dirty bit to only write back pages that have
been modified
Techniques for Fast Address
Translation
 Page table is kept in main memory (kernel memory)
 Each process has a page table
 Every data/instruction access requires two memory
accesses
 One for the page table and one for the data/instruction
 Can be solved by the use of a special fast-lookup hardware
cache called associative registers or translation look-aside
buffers (TLBs)
 If locality applies then cache the recent translation
 TLB = translation look-aside buffer
 TLB entry: virtual page no, physical page no, protection bit, use
bit, dirty bit
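A TLB lookup can be sketched as a search over a small table of recent translations. This C sketch assumes an 8-entry fully associative TLB and omits the protection, use and dirty bits; the size and all names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 8 /* illustrative size */

struct tlb_entry {
    bool     valid;
    uint32_t vpn; /* virtual page number */
    uint32_t ppn; /* physical page number */
};

/* Fully associative lookup. Hardware compares all entries in parallel;
   this loop performs the same comparisons one after another.
   Returns true on a hit and stores the physical page number in *ppn_out. */
bool tlb_lookup(const struct tlb_entry tlb[TLB_ENTRIES], uint32_t vpn,
                uint32_t *ppn_out)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn_out = tlb[i].ppn;
            return true;
        }
    }
    return false; /* miss: read the page table in memory, then refill */
}
```

On a hit the page-table access is skipped, so the data or instruction access needs only one memory access instead of two.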
TLB = Translation Look-aside
Buffer
The TLB must be on chip; otherwise it is
worthless
 Fully associative – parallel search
Typical TLBs
 Hit time - 1 cycle
 Miss penalty - 10 to 30 cycles
 Miss rate - 0.1% to 2%
 TLB size - 32 B to 8 KB
Paging Hardware with TLB
TLB of Alpha 21264
Address Space Number: process
ID, used to avoid flushing the TLB on a context switch
A total of 128 TLB entries
Page Size – An Architectural Choice
Large pages are good:
 Reduces page table size
 Amortizes the long disk access
 If spatial locality is good then hit rate will
improve
 Reduces the number of TLB misses
Large pages are bad:
 More internal fragmentation
• If everything is random each structure’s last page
is only half full
 Process start-up time takes longer