OMSE 510: Computing Foundations
3: Caches, Assembly, CPU Overview
Chris Gilmore <grimjack@cs.pdx.edu>
Portland State University/OMSE
Today
Caches
DLX Assembly
CPU Overview
Computer System (Idealized)
[Figure: CPU, Memory, and Disk Controller (with Disk) connected by the System Bus]
The Big Picture: Where are We Now?
The Five Classic Components of a Computer
[Figure: Processor (Control + Datapath), Memory, Input, Output]
Next Topic:
Simple caching techniques
Many ways to improve cache performance
Recap: Levels of the Memory Hierarchy
[Figure: Processor at the upper level, then Cache, Memory, Disk, and Tape at the lower levels; the units moved between levels are instructions/operands, blocks, pages, and files; levels get larger and slower moving from the upper level to the lower level]
Recap: exploit locality to achieve fast memory
Two Different Types of Locality:
 Temporal Locality (Locality in Time): If an item is referenced, it will
tend to be referenced again soon.
 Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality:
 Present the user with as much memory as is available in the
cheapest technology.
 Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense:
 Good choice for presenting the user with a BIG memory system
SRAM is fast but expensive and not very dense:
 Good choice for providing the user FAST access time.
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
 Hit Rate: the fraction of memory accesses found in the upper level
 Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
 Miss Rate = 1 - (Hit Rate)
 Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
[Figure: the processor exchanges Block X with the upper-level memory; on a miss, Block Y is brought in from the lower-level memory]
The Art of Memory System Design
[Figure: workload or benchmark programs drive the processor, which generates a reference stream <op,addr>, <op,addr>, <op,addr>, ... where op is i-fetch, read, or write; the stream feeds the memory system (cache $ plus main memory MEM)]
Optimize the memory system organization to minimize the average memory access time for typical workloads.
Example: Fully Associative
Fully Associative Cache
 No Cache Index
 Compare the Cache Tags of all cache entries in parallel
 Example: Block Size = 32 B; we need N 27-bit comparators
By definition: Conflict Miss = 0 for a fully associative cache
[Figure: a 32-bit address split into a 27-bit Cache Tag (bits 31-5) and a 5-bit Byte Select (bits 4-0, e.g. 0x01); every entry holds a valid bit, a Cache Tag, and 32 bytes of cache data, and all tags are compared in parallel]
Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2^N byte cache:
 The uppermost (32 - N) bits are always the Cache Tag
 The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: a 32-bit address split into a Cache Tag (bits 31-10, e.g. 0x50), a Cache Index (bits 9-5, e.g. 0x01), and a Byte Select (bits 4-0, e.g. 0x00); the tag and index together form the Block Address; a valid bit and the Cache Tag are stored as part of the cache "state" alongside 32 lines of cache data (Byte 0 through Byte 1023)]
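A minimal sketch of this address breakdown in C (same 1 KB direct mapped cache with 32 B blocks; the constants and example address are illustrative, not from the lecture):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5   /* 32 B block  -> 5 Byte Select bits */
#define INDEX_BITS 5   /* 1 KB / 32 B -> 32 lines -> 5 Cache Index bits */

int main(void) {
    uint32_t addr = 0xA020;   /* arbitrary example address */
    uint32_t byte_select = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag = addr >> (BLOCK_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_select);
    return 0;
}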
Set Associative Cache
N-way set associative: N entries for each Cache Index
 N direct mapped caches operate in parallel
Example: Two-way set associative cache
 Cache Index selects a "set" from the cache
 The two tags in the set are compared to the input in parallel
 Data is selected based on the tag result
[Figure: the Cache Index selects one set from two banks of (Valid, Cache Tag, Cache Data); both tags are compared against the address tag in parallel, the compare results are ORed into Hit, and a mux (Sel0/Sel1) picks the matching bank's Cache Block]
Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache:
 N comparators vs. 1
 Extra MUX delay for the data
 Data comes AFTER Hit/Miss decision and set selection
In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
 Possible to assume a hit and continue. Recover later if miss.
[Figure: the same two-way set associative datapath as on the previous slide]
Block Size Tradeoff
Larger block sizes take advantage of spatial locality BUT:
 Larger block size means larger miss penalty:
 Takes longer to fill up the block
 If block size is too big relative to cache size, miss rate will go up:
 Too few cache blocks
In general:
Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: three curves vs. Block Size — Miss Rate first falls (exploits spatial locality), then rises (fewer blocks compromises temporal locality); Miss Penalty grows with block size; Average Access Time is therefore U-shaped due to the increased miss penalty and miss rate]
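As a quick numeric check of the formula above, a minimal sketch in C (the hit time, miss rate, and miss penalty are made-up illustrative numbers, not measurements from the lecture):

#include <stdio.h>

/* Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate */
static double avg_access_time(double hit_time, double miss_rate,
                              double miss_penalty) {
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
}

int main(void) {
    /* e.g. 1-cycle hit, 5% miss rate, 40-cycle penalty:
     * 1.0 x 0.95 + 40 x 0.05 = 2.95 cycles on average */
    printf("%.2f cycles\n", avg_access_time(1.0, 0.05, 40.0));
    return 0;
}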
A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
 "Cold" fact of life: not a whole lot you can do about it
 Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
Conflict (collision): multiple memory locations mapped to the same cache location
 Solution 1: increase cache size
 Solution 2: increase associativity
Capacity: cache cannot contain all blocks accessed by the program
 Solution: increase cache size
Coherence (Invalidation): other process (e.g., I/O) updates memory
Source of Cache Misses Quiz
Assume constant cost. For each organization — Direct Mapped, N-way Set Associative, Fully Associative — rate:
 Cache Size: Small, Medium, Big?
 Compulsory Miss:
 Conflict Miss:
 Capacity Miss:
 Coherence Miss:
Choices: Zero, Low, Medium, High, Same
Sources of Cache Misses Answer

                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Big             Medium                  Small
Compulsory Miss   Same            Same                    Same
Conflict Miss     High            Medium                  Zero
Capacity Miss     Low             Medium                  High
Coherence Miss    Same            Same                    Same

Note:
If you are going to run "billions" of instructions, Compulsory Misses are insignificant.
Recap: Four Questions for Caches and Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache:
 Fully associative, direct mapped, 2-way set associative
 S.A. Mapping = Block Number Modulo Number of Sets
[Figure: the 8-block cache drawn three ways — fully associative: block 12 can go anywhere; direct mapped: block 12 can go only into block 4 (12 mod 8); 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4), with the 8 blocks grouped into sets 0-3. Below, memory block-frame addresses 0-31 with block 12 highlighted]
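A minimal sketch of the placement rule for this 8-block example (plain C; the printout just mirrors the figure):

#include <stdio.h>

/* set index = block number mod number of sets */
int main(void) {
    int block = 12, cache_blocks = 8;

    int dm_slot = block % cache_blocks;        /* direct mapped: 8 sets of 1 */
    int sa_set  = block % (cache_blocks / 2);  /* 2-way: 4 sets of 2 */
    /* fully associative: 1 set of 8, so block 12 can go anywhere */

    printf("direct mapped: block %d\n", dm_slot);       /* 4 = 12 mod 8 */
    printf("2-way set associative: set %d\n", sa_set);  /* 0 = 12 mod 4 */
    return 0;
}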
Q2: How is a block found if it is in the upper level?
[Figure: the Block Address is split into Tag and Index, followed by the Block Offset; the Index does the set select, the Tag compare identifies the block within the set, and the Block Offset does the data select]
Direct indexing (using index and block offset), tag compares, or combination
Increasing associativity shrinks index, expands tag
Q3: Which block should be replaced on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
 Random
 FIFO
 LRU (Least Recently Used)
 LFU (Least Frequently Used)

Miss rates, LRU vs. Random, by associativity:

Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB    5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB    1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB   1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
Q4: What happens on a write?
Write through—The information is written to both the block in the cache and to the block in the lower-level memory.
Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
 Is the block clean or dirty?
Pros and Cons of each?
 WT: read misses cannot result in writes; coherency is easier
 WB: repeated writes to a block cause no extra writes to lower-level memory
WT is always combined with write buffers so that writes don't wait for lower-level memory
Write Buffer for Write Through
[Figure: Processor -> Cache -> DRAM, with a Write Buffer between the Cache and DRAM]
A Write Buffer is needed between the Cache and Memory
 Processor: writes data into the cache and the write buffer
 Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO:
 Typical number of entries: 4
 Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
 Store frequency (w.r.t. time) > 1 / DRAM write cycle
 Write buffer saturation
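A minimal model of this FIFO in C, for intuition only (not any real memory controller's interface; saturation shows up as wb_push failing):

#include <stdint.h>

#define WB_ENTRIES 4   /* typical depth, per the slide */

struct wb_entry { uint32_t addr, data; };   /* one pending store */
struct write_buffer {
    struct wb_entry e[WB_ENTRIES];
    int head, count;                        /* FIFO state */
};

/* Processor side: returns 0 (must stall) when the buffer is saturated. */
int wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return 0;  /* write buffer saturation */
    struct wb_entry *slot = &wb->e[(wb->head + wb->count) % WB_ENTRIES];
    slot->addr = addr;
    slot->data = data;
    wb->count++;
    return 1;
}

/* Memory-controller side: drains one store per DRAM write cycle. */
int wb_drain(struct write_buffer *wb, struct wb_entry *out) {
    if (wb->count == 0) return 0;
    *out = wb->e[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return 1;
}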
Write Buffer Saturation
[Figure: Processor -> Cache -> DRAM with a Write Buffer between the Cache and DRAM]
Store frequency (w.r.t. time) > 1 / DRAM write cycle
 If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
 The store buffer will overflow no matter how big you make it
 The CPU Cycle Time <= DRAM Write Cycle Time
Solutions for write buffer saturation:
 Use a write back cache
 Install a second-level (L2) cache (does this always work?)
[Figure: Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM]
Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 causes a miss
 Do we read in the block?
 Yes: Write Allocate
 No: Write Not Allocate
[Figure: the same 1 KB direct mapped cache as before — Cache Tag (e.g. 0x00), Cache Index (e.g. 0x00), and Byte Select (e.g. 0x00), with valid bits, tags, and 32 lines of cache data]
Impact on Cycle Time
Cache Hit Time:
 directly tied to clock rate
 increases with cache size
 increases with associativity
[Figure: pipeline sketch — PC -> I-Cache (miss, invalid) -> IR -> IRex -> A, B -> IRm -> R -> D-Cache (miss) -> IRwb -> T]
Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)
What happens on a Cache miss?
For an in-order pipeline, 2 options:
 Freeze pipeline in Mem stage (popular early on: Sparc, R4000)

   IF  ID  EX  Mem stall stall stall ... stall Mem Wr
       IF  ID  EX  stall stall stall ... stall stall Ex Wr

 Use Full/Empty bits in registers + MSHR queue
 MSHR = "Miss Status/Handler Registers" (Kroft)
 Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
 Per cache line: keep info about the memory address.
 For each word: the register (if any) that is waiting for the result.
 Used to "merge" multiple requests to one memory line
 A new load creates an MSHR entry and sets the destination register to "Empty". The load is "released" from the pipeline.
 An attempt to use the register before the result returns causes the instruction to block in the decode stage.
 Limited "out-of-order" execution with respect to loads. Popular with in-order superscalar architectures.
Out-of-order pipelines already have this functionality built in (load queues, etc.). A sketch of an MSHR entry follows.
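A minimal sketch in C of what one MSHR entry might track, based on the description above (the line size and field names are assumptions, not from the lecture):

#include <stdint.h>

#define WORDS_PER_LINE 8   /* e.g. a 32 B line of 4 B words (assumed) */

/* One Miss Status/Handler Register: tracks one outstanding miss to one
 * memory line and merges later requests to the same line. */
struct mshr_entry {
    int      valid;                       /* entry in use? */
    uint32_t line_addr;                   /* memory line in flight */
    int8_t   waiting_reg[WORDS_PER_LINE]; /* per word: destination register
                                             waiting for it, or -1 if none */
};

/* Merging: a second load to the same line records its destination register
 * here instead of issuing another memory request. */
static void mshr_merge(struct mshr_entry *m, int word, int dest_reg) {
    m->waiting_reg[word] = (int8_t)dest_reg;  /* reg marked "Empty" elsewhere */
}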
Improving Cache Performance: 3 general options
Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
                           = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity; the conflict component shrinks with associativity, the capacity component dominates at small sizes, and the compulsory component is vanishingly small]
2:1 Cache Rule
miss rate of a 1-way associative cache of size X
 = miss rate of a 2-way associative cache of size X/2
[Figure: the same miss-rate-per-type plot (0 to 0.14) vs. cache size (1 KB to 128 KB), 1-way through 8-way, broken into conflict, capacity, and compulsory components]
3Cs Relative Miss Rate
[Figure: miss rate per type normalized to 100%, vs. cache size (1 KB to 128 KB) for 1-way through 8-way; conflict shrinks with associativity, capacity dominates, compulsory is negligible]
Flaws: for fixed block size
Good: insight => invention
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K; miss rate first falls with block size, then rises again for the smaller caches]
2. Reduce Misses via Higher Associativity
2:1 Cache Rule:
 Miss Rate of a direct mapped cache of size N ≈ Miss Rate of a 2-way cache of size N/2
Beware: Execution time is the only final measure!
 Will Clock Cycle time increase?
 Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(Green means A.M.A.T. not improved by more associativity)
(AMAT = Average Memory Access Time)
3. Reducing Misses via a "Victim Cache"
How to combine the fast hit time of direct mapped yet still avoid conflict misses?
Add a small buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
Used in Alpha, HP machines
[Figure: a direct mapped cache (TAGS, DATA) backed by a small fully associative victim cache of four lines, each with its own tag and comparator, sitting between the cache and the next lower level in the hierarchy]
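A minimal sketch of the lookup path in C, assuming a direct mapped main cache with a 4-entry fully associative victim cache (all sizes and names here are illustrative):

#include <stdint.h>

#define LINES   128   /* direct mapped main cache (illustrative size) */
#define VICTIMS 4     /* 4-entry victim cache, as in Jouppi [1990] */

struct line  { int valid; uint32_t tag;       };  /* main cache: tag per index */
struct vline { int valid; uint32_t line_addr; };  /* victim: full line address */

static struct line  cache[LINES];
static struct vline victim[VICTIMS];

/* Returns 1 on a hit in either structure. On a victim hit the two lines
 * swap, so the direct mapped cache keeps its fast common-case hit time. */
int lookup(uint32_t line_addr) {
    uint32_t idx = line_addr % LINES;
    uint32_t tag = line_addr / LINES;

    if (cache[idx].valid && cache[idx].tag == tag)
        return 1;                       /* ordinary fast direct mapped hit */

    for (int i = 0; i < VICTIMS; i++) { /* hardware compares all 4 in parallel */
        if (victim[i].valid && victim[i].line_addr == line_addr) {
            /* Conflict miss rescued: swap with the resident line. */
            victim[i].valid     = cache[idx].valid;
            victim[i].line_addr = cache[idx].tag * LINES + idx;
            cache[idx].valid = 1;
            cache[idx].tag   = tag;
            return 1;
        }
    }
    return 0;  /* true miss: go to the next lower level (refill not shown) */
}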
4. Reducing Misses by Hardware Prefetching
E.g., Instruction Prefetching
 Alpha 21064 fetches 2 blocks on a miss
 Extra block placed in "stream buffer"
 On miss, check the stream buffer
Works with data blocks too:
 Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
 Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty
5. Reducing Misses by Software Prefetching Data
Data Prefetch
 Load data into register (HP PA-RISC loads)
 Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
 Special prefetching instructions cannot cause faults; a form of speculative execution
Issuing prefetch instructions takes time
 Is the cost of prefetch issues < savings in reduced misses?
 Wider superscalar machines reduce the difficulty of issue bandwidth
6. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4-byte blocks, in software
Instructions
 Reorder procedures in memory so as to reduce conflict misses
 Profiling to look at conflicts (using tools they developed)
Data
 Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
 Loop Interchange: change nesting of loops to access data in the order stored in memory (see the sketch below)
 Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
 Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
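For instance, a minimal loop-interchange sketch in C (row-major layout, so the inner loop should walk the rightmost index; the function names and array size are illustrative):

#define N 1024

/* Before: the inner loop strides down a column, touching a new cache line
 * on every iteration (C stores arrays row-major). */
void scale_before(int x[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks consecutive addresses, so each
 * cache line is fully used before it is evicted. */
void scale_after(int x[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}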
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
0. Reducing Penalty: Faster DRAM / Interface
New DRAM Technologies
 RAMBUS - same initial latency, but much higher bandwidth
 Synchronous DRAM
Better bus interfaces
CRAY technique: only use SRAM
1. Reducing Penalty: Read Priority over Write on Miss
[Figure: Processor -> Cache -> DRAM with a Write Buffer between the Cache and DRAM]
A write buffer allows reads to bypass writes
 Processor: writes data into the cache and the write buffer
 Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO:
 Typical number of entries: 4
 Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
 Store frequency (w.r.t. time) > 1 / DRAM write cycle
 Write buffer saturation
1. Reducing Penalty: Read Priority over Write on Miss
Write-Buffer Issues:
 Write through with write buffers offers RAW conflicts with main memory reads on cache misses
 If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
 Check write buffer contents before a read; if no conflicts, let the memory access continue
Write Back?
 Read miss replacing a dirty block
 Normal: write the dirty block to memory, and then do the read
 Instead: copy the dirty block to a write buffer, then do the read, and then do the write
 CPU stalls less since it restarts as soon as the read is done
2. Reduce Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
 Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
 Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
Generally useful only with large blocks
Spatial locality is a problem: we tend to want the next sequential word of the block anyway, so it is not clear whether early restart helps
3. Reduce Penalty: Non-blocking Caches
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
 requires F/E bits on registers or out-of-order execution
 requires multi-bank memories
"Hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
 Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
 Requires multiple memory banks (otherwise cannot support)
 The Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC
[Figure: average memory access time (0 to 2, normalized) under "hit under n misses" for n = 0->1, 1->2, 2->64, and the blocking base case, across SPEC92 benchmarks — integer: eqntott, espresso, xlisp, compress, mdljsp2, ear; floating point: fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora]
FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
4. Reduce Penalty: Second-Level Cache
L2 Equations:
AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
AMAT = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
[Figure: Proc -> L1 Cache -> L2 Cache]
Definitions:
 Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L2))
 Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1) x Miss Rate(L2))
 Global Miss Rate is what matters
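A small sketch evaluating the two-level AMAT formula above (the numbers are illustrative, not from the lecture):

#include <stdio.h>

/* AMAT = HitTime(L1) + MissRate(L1) x (HitTime(L2) + MissRate(L2) x MissPenalty(L2)) */
static double amat_2level(double l1_hit, double l1_mr,
                          double l2_hit, double l2_mr, double l2_penalty) {
    return l1_hit + l1_mr * (l2_hit + l2_mr * l2_penalty);
}

int main(void) {
    /* e.g. 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
     * 20% local L2 miss rate, 100-cycle memory:
     * 1 + 0.05 x (10 + 0.2 x 100) = 2.5 cycles.
     * Global L2 miss rate = 0.05 x 0.2 = 1% of all accesses. */
    printf("%.2f cycles\n", amat_2level(1.0, 0.05, 10.0, 0.20, 100.0));
    return 0;
}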
Reducing Misses: which apply to L2 Cache?
Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Misses by HW Prefetching Instr, Data
5. Reducing Misses by SW Prefetching Data
6. Reducing Capacity/Conf. Misses by Compiler
Optimizations
L2 cache block size & A.M.A.T.
[Figure: relative CPU time (1 to 2) vs. L2 block size (16 to 512 bytes): 1.95 at 16, 1.54 at 32, 1.36 at 64, 1.28 at 128, 1.27 at 256, 1.34 at 512; 32KB L1, 8-byte path to memory]
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache:
- Lower Associativity (+victim caching)?
- 2nd-level cache
- Careful Virtual Memory Design
Summary #1/3:
The Principle of Locality:
 Program likely to access a relatively small portion of the address
space at any instant of time.
 Temporal Locality: Locality in Time
 Spatial Locality: Locality in Space
Three (+1) Major Categories of Cache Misses:
 Compulsory Misses: sad facts of life. Example: cold start misses.
 Conflict Misses: increase cache size and/or associativity.
Nightmare Scenario: ping pong effect!
 Capacity Misses: increase cache size
 Coherence Misses: Caused by external processors or I/O devices
Cache Design Space
 total size, block size, associativity
 replacement policy
 write-hit policy (write-through, write-back)
 write-miss policy
Summary #2/3: The Cache Design Space
Several interacting dimensions
 cache size
 block size
 associativity
 replacement policy
 write-through vs write-back
 write allocation
The optimal choice is a compromise
 depends on access characteristics
 workload
 use (I-cache, D-cache, TLB)
 depends on technology / cost
Simplicity often wins
[Figure: the design space sketched as Good/Bad curves against Factor A (less to more) and Factor B, over the axes Cache Size, Associativity, and Block Size]
Summary #3/3: Cache Miss Optimization
Lots of techniques people use to improve the miss rate of caches:

Technique                        MR  MP  HT  Complexity
Larger Block Size                +   –       0
Higher Associativity             +       –   1
Victim Caches                    +           2
Pseudo-Associative Caches        +           2
HW Prefetching of Instr/Data     +           2
Compiler Controlled Prefetching  +           3
Compiler Reduce Misses           +           0
Onto Assembler!
What is assembly language?
Machine-Specific Programming Language
 one-to-one correspondence between statements and native machine language
 matches machine instruction set and architecture
What is an assembler?
Systems Level Program
 usually works in conjunction with the compiler
 translates assembly language source code to machine language
 object file - contains machine instructions, initial data, and information used when loading the program
 listing file - contains a record of the translation process, line numbers, addresses, generated code and data, and a symbol table
Why learn assembly?
Learn how a processor works
Understand basic computer architecture
Explore the internal representation of data and
instructions
Gain insight into hardware concepts
Allows creation of small and efficient programs
Allows programmers to bypass high-level language
restrictions
Might be necessary to accomplish certain operations
Machine Representation
A language of numbers, called the Processor’s
Instruction Set

The set of basic operations a processor can perform
Each instruction is coded as a number
Instructions may be one or more bytes
Every number corresponds to an instruction
Assembly vs Machine
Machine Language Programming
 Writing a list of numbers representing the bytes of machine instructions to be executed and data constants to be used by the program
Assembly Language Programming
 Using symbolic instructions to represent the raw data that will form the machine language program and initial data constants
Assembly
Mnemonics represent Machine Instructions
 Each mnemonic used represents a single machine instruction
 The assembler performs the translation
Some mnemonics require operands
 Operands provide additional information
 register, constant, address, or variable
Assembler Directives
Instruction Set Architecture: a Critical Interface
[Figure: software above, hardware below, with the instruction set as the interface between them]
The portion of the machine that is visible to the programmer or the compiler writer.
Good ISA
Lasts through many implementations (portability, compatibility)
Can be used for many different applications (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels
Von Neumann Machines
Von Neumann "invented" the stored program computer in 1945
Instead of program code being hardwired, the program code (instructions) is placed in memory along with data
[Figure: CPU (Control + ALU) connected to a memory holding both Program and Data]
Basic ISA Classes
Memory to Memory Machines
 Every instruction contains a full memory address for each operand
 Maybe the simplest ISA design
 However memory is slow
 Memory is big (lots of address bits)
Memory-to-memory machine
Assumptions:
 Two operands per operation
 first operand is also the destination
 Memory address = 16 bits (2 bytes)
 Operand size = 32 bits (4 bytes)
 Instruction code = 8 bits (1 byte)
Example A = B+C (hypothetical code):
mov A, B    ; A <= B
add A, C    ; A <= B+C
 5 bytes for each instruction
 4 bytes each to fetch the 1st and 2nd operands
 4 bytes to store the result
 add needs 17 bytes and mov needs 13 bytes
 Total: 30 bytes of memory traffic
Why CPU Storage?
A small amount of storage in the CPU
 To reduce memory traffic by keeping repeatedly used operands in the CPU
 Avoid re-referencing memory
 Avoid having to specify the full memory address of the operand
 This is a perfect example of "make the common case fast".
Simplest Case
 A machine with 1 cell of CPU storage: the accumulator
Accumulator Machine
Assumptions:
 Two operands per operation
 1st operand in the accumulator
 2nd operand in the memory
 accumulator is also the destination (except for store)
 Memory address = 16 bits (2 bytes)
 Operand size = 32 bits (4 bytes)
 Instruction code = 8 bits (1 byte)
Example A = B+C (hypothetical code):
Load B     ; acc <= B
Add C      ; acc <= B+C
Store A    ; A <= acc
 3 bytes for each instruction
 4 bytes to load or store the second operand
 7 bytes per instruction
 21 bytes total memory traffic
Stack Machines
Instruction sets are based on a stack model of execution.
Aimed for compact instruction encoding
Most instructions manipulate the top few data items (mostly top 2) of a pushdown stack.
Top few items of the stack are kept in the CPU
Ideal for evaluating expressions (stack holds intermediate results)
Were thought to be a good match for high level languages
Awkward:
 Become very slow if the stack grows beyond CPU local storage
 No simple way to get data from the "middle of the stack"
Stack Machines
Binary arithmetic and logic operations
 Operands: top 2 items on the stack
 Operands are removed from the stack
 Result is placed on top of the stack
Unary arithmetic and logic operations
 Operand: top item on the stack
 Operand is replaced by the result of the operation
Data move operations
 Push: place memory data on top of the stack
 Pop: move top of stack to memory
General Purpose Register Machines
With stack machines, only the top two elements of the stack are directly available to instructions. In general purpose register machines, the CPU storage is organized as a set of registers which are equally available to the instructions
Frequently used operands are placed in registers (under program control)
 Reduces instruction size
 Reduces memory traffic
General Purpose Registers Dominate
1975-present: all machines use general purpose registers
Advantages of registers:
 registers are faster than memory
 registers are easier for a compiler to use
 e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order
 registers can hold variables
 memory traffic is reduced, so the program is sped up (since registers are faster than memory)
 code density improves (since a register is named with fewer bits than a memory location)
Classifying General Purpose Register Machines
General purpose register machines are subclassified based on whether or not memory operands can be used by typical ALU instructions
Register-memory machines: machines where some ALU instructions can specify at least one memory operand and one register operand
Load-store machines: the only instructions that can access memory are the "load" and the "store" instructions
Comparing number of instructions
Code sequence for A = B+C for five classes of instruction sets:

Memory to Memory:
mov A B
add A C

Accumulator:
load B
add C
store A

Stack:
push B
push C
add
pop A

Register (Register-memory):
load R1 B
add R1 C
store A R1

Register (Load-store):
Load R1 B
Load R2 C
Add R1 R1 R2
Store A R1

DLX/MIPS is one of these (load-store)
Instruction Set Definition
Objects = architecture entities = machine state
 Registers
 General purpose
 Special purpose (e.g. program counter, condition code, stack pointer)
 Memory locations
 Linear address space: 0, 1, 2, …,2^s -1
Operations = instruction types
 Data operation
 Arithmetic
 Logical
 Data transfer
 Move (from register to register)
 Load (from memory location to register)
 Store (from register to memory location)
 Instruction sequencing
 Branch (conditional)
 Jump (unconditional)
Topic: DLX
Instructional Architecture
Much nicer and easier to understand than x86 (barf)
The Plan: Teach DLX, then move to x86/y86
DLX: RISC ISA, very similar to MIPS
Great links to learn more DLX:
http://www.softpanorama.org/Hardware/architecture.shtml#DLX
DLX Architecture
Based on observations about instruction set architecture
Emphasizes:
 Simple load-store instruction set
 Design for pipeline efficiency
 Design for compiler target
DLX registers
 32 32-bit GPRs named R0, R1, ..., R31
 32 32-bit FPRs named F0, F2, ..., F30
 Accessed independently for 32-bit data
 Accessed in pairs for 64-bit (double-precision) data
 Register R0 is hard-wired to zero
 Other status registers, e.g., floating-point status register
Byte addressable in big-endian with 32-bit addresses
Arithmetic instruction operands must be registers
MIPS: Software conventions for Registers

0      zero   constant 0
1      at     reserved for assembler
2-3    v0-v1  expression evaluation & function results
4-7    a0-a3  arguments
8-15   t0-t7  temporary: caller saves (callee can clobber)
16-23  s0-s7  callee saves (callee must save)
24-25  t8-t9  temporary (cont'd)
26-27  k0-k1  reserved for OS kernel
28     gp     pointer to global area
29     sp     stack pointer
30     fp     frame pointer
31     ra     return address (HW)
Addressing Modes
This table shows the most common modes.

Addressing Mode    Example Instruction  Meaning                         When Used
Register           Add R4, R3           R[R4] <- R[R4] + R[R3]          When a value is in a register.
Immediate          Add R4, #3           R[R4] <- R[R4] + 3              For constants.
Displacement       Add R4, 100(R1)      R[R4] <- R[R4] + M[100+R[R1]]   Accessing local variables.
Register Deferred  Add R4, (R1)         R[R4] <- R[R4] + M[R[R1]]       Using a pointer or a computed address.
Absolute           Add R4, (1001)       R[R4] <- R[R4] + M[1001]        Used for static data.
Memory Organization
Viewed as a large, single-dimension array, with an address.
A memory address is an index into the array
"Byte addressing" means that the index points to a byte of memory.
[Figure: memory drawn as an array of byte cells at addresses 0, 1, 2, ..., each holding 8 bits of data]
Memory Addressing
Bytes are nice, but most data items use larger "words"
For DLX, a word is 32 bits or 4 bytes.
2 questions for the design of an ISA:
 Since one could read a 32-bit word as four loads of bytes from sequential byte addresses or as one load word from a single byte address:
 How do byte addresses map to word addresses?
 Can a word be placed on any byte boundary?
Addressing Objects: Endianness and Alignment
Big Endian: address of most significant byte = word address (xx00 = Big End of word)
 IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
Little Endian: address of least significant byte = word address (xx00 = Little End of word)
 Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
[Figure: one 32-bit word; little endian numbers its bytes 3-2-1-0 from msb to lsb, big endian numbers them 0-1-2-3]
Alignment: require that objects fall on an address that is a multiple of their size.
[Figure: aligned words start at byte addresses that are multiples of 4; words starting at byte offsets 1, 2, or 3 are not aligned]
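A minimal sketch that checks which convention the machine running it uses (standard C; assumes only a 4-byte uint32_t):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0x03020100;     /* each byte's value encodes its "rank" */
    uint8_t *p = (uint8_t *)&word;  /* inspect the word one byte at a time */

    /* Little endian stores the least significant byte (0x00) at the word
     * address; big endian stores the most significant byte (0x03) there. */
    printf("byte at word address: 0x%02x -> %s-endian\n",
           p[0], p[0] == 0x00 ? "little" : "big");
    return 0;
}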
Assembly Language vs. Machine Language
Assembly provides convenient symbolic representation
 much easier than writing down numbers
 e.g., destination first
Machine language is the underlying reality
 e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'
 e.g., "move r10, r11" exists only in Assembly
 would be implemented using "add r10, r11, r0"
When considering performance you should count real instructions
DLX arithmetic
ALU instructions can have 3 operands
 add $R1, $R2, $R3
 sub $R1, $R2, $R3
Operand order is fixed (destination first)
Example:
C code:   A = B + C
DLX code: add r1, r2, r3
(registers associated with variables by compiler)
DLX arithmetic
Design Principle: simplicity favors regularity. Why?
Of course this complicates some things...
C code:
A = B + C + D;
E = F - A;
DLX code:
add r1, r2, r3
add r1, r1, r4
sub r5, r6, r1
Operands must be registers; only 32 registers provided
Design Principle: smaller is faster. Why?
Executing assembly instructions
The program counter holds the instruction address
The CPU fetches the instruction from memory and puts it onto the instruction register
Control logic decodes the instruction and tells the register file, ALU, and other registers what to do
For an ALU operation (e.g. add), data flows from the register file, through the ALU, and back to the register file
[Figures: ALU execution examples]
Memory Instructions
Load and store instructions
 lw r11, offset(r10)
 sw r11, offset(r10)
Example:
C code: A[8] = h + A[8];
assume h is in r2 and the base address of array A is in r3
DLX code:
lw r4, 32(r3)
add r4, r2, r4
sw r4, 32(r3)
Store word has the destination last
Remember: arithmetic operands are registers, not memory!

Memory Operations - Loads
Load data from memory (assuming here that R5 = 0x14):
lw R6, 0(R5)    # R6 <= mem[0x14]
Memory Operations - Stores
Storing data to memory works essentially the same way (assume R6 = 200 and R5 = 0x18):
sw R6, 0(R5)    # mem[0x18] <-- 200
So far we've learned:
DLX
— loading words but addressing bytes
— arithmetic on registers only

Instruction       Meaning
add r1, r2, r3    r1 = r2 + r3
sub r1, r2, r3    r1 = r2 – r3
lw r1, 100(r2)    r1 = Memory[r2+100]
sw r1, 100(r2)    Memory[r2+100] = r1
Use of Registers
Example:
a = (b + c) - (d + e);   // C statement
# r1 – r5 : a - e
add r10, r2, r3
add r11, r4, r5
sub r1, r10, r11

a = b + A[4];   // add an array element to a var
// r3 has address of A
lw r4, 16(r3)
add r1, r2, r4
Use of Registers: load and store
Example:
A[8] = a + A[6];   // A is in r3, a is in r2
lw r1, 24(r3)    # r1 gets A[6] contents
add r1, r2, r1   # r1 gets the sum
sw r1, 32(r3)    # sum is put in A[8]
load and store
Ex:
a = b + A[i];   // A is in r3; a, b, i in r1, r2, r4
add r11, r4, r4    # r11 = 2 * i
add r11, r11, r11  # r11 = 4 * i
add r11, r11, r3   # r11 = addr. of A[i] (r3 + (4*i))
lw  r10, 0(r11)    # r10 = A[i]
add r1, r2, r10    # a = b + A[i]
Example: Swap
Swapping words (r2 has the base address of the array v):
temp = v[0];
v[0] = v[1];
v[1] = temp;

swap:
lw r10, 0(r2)
lw r11, 4(r2)
sw r10, 4(r2)
sw r11, 0(r2)
DLX Instruction Format
Instruction formats: I-type, R-type, J-type

I-type Instructions:
| Opcode (6) | Rs1 (5) | rd (5) | Immediate (16) |

R-type Instructions:
| Opcode (6) | Rs1 (5) | Rs2 (5) | rd (5) | func (11) |

J-type Instructions:
| Opcode (6) | Offset added to PC (26) |
Machine Language
Instructions, like registers and words of data, are also 32 bits long
 Example: add r10, r1, r2
 registers have numbers: 10, 1, 2
Instruction Format:
R-type: | Opcode (6) | Rs1 (5) | Rs2 (5) | rd (5) | func (11) |
        | 000000     | 00001   | 00010   | 01010  | 0…000100000 |
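A minimal sketch packing those fields into a 32-bit word (field layout as in the slide; the opcode and func values are just the ones shown above):

#include <stdint.h>
#include <stdio.h>

/* R-type: | opcode(6) | rs1(5) | rs2(5) | rd(5) | func(11) | */
static uint32_t encode_rtype(uint32_t opcode, uint32_t rs1, uint32_t rs2,
                             uint32_t rd, uint32_t func) {
    return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func;
}

int main(void) {
    /* add r10, r1, r2: opcode 0, rs1 = 1, rs2 = 2, rd = 10, func = 0x20 */
    printf("0x%08x\n", encode_rtype(0, 1, 2, 10, 0x20));  /* 0x00225020 */
    return 0;
}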
Machine Language
Consider the load-word and store-word instructions:
 What would the regularity principle have us do?
 New principle: Good design demands a compromise
Introduce a new type of instruction format:
 I-type for data transfer instructions
 the other format was R-type for register
Example: lw r10, 32(r2)
I-type (Loads/stores): | Opcode (6) | Rs1 (5) | rd (5) | Immediate (16) |
                       | 100011     | 00010   | 01010  | 0…000100000    |
Machine Language
Jump instructions
Example: j .L1
J-type (Jump, Jump and Link, Trap, return from exception):
| Opcode (6) | Offset added to PC (26) |
| 000010     | offset to .L1           |
Instructions for Making Decisions
beq reg1, reg2, L1
 Go to the statement labeled L1 if the value in reg1 equals the value in reg2
bne reg1, reg2, L1
 Go to the statement labeled L1 if the value in reg1 does not equal the value in reg2
j L1
 Unconditional jump
jr r10
 "jump register": jump to the instruction address held in register r10
Making Decisions
Example:
if (a != b) goto L1;
x = y + z;
L1: x = x – a;
// x, y, z, a, b mapped to r1-r5

bne r4, r5, L1      # goto L1 if a != b
add r1, r2, r3      # x = y + z (skipped if a != b)
L1: sub r1, r1, r4  # x = x – a (always executed)
if-then-else
Example:
if (a == b) x = y + z;
else x = y – z;

      bne r4, r5, Else   # goto Else if a != b
      add r1, r2, r3     # x = y + z
      j   Exit           # goto Exit
Else: sub r1, r2, r3     # x = y – z
Exit:
Example: Loop with array index
Loop:
g = g + A[i];
i = i + j;
if (i != h) goto Loop
....
r1, r2, r3, r4 = g, h, i, j; array base in r5

LOOP: add r11, r3, r3    # r11 = 2 * i
      add r11, r11, r11  # r11 = 4 * i
      add r11, r11, r5   # r11 = addr. of A[i]
      lw  r10, 0(r11)    # load A[i]
      add r1, r1, r10    # g = g + A[i]
      add r3, r3, r4     # i = i + j
      bne r3, r2, LOOP   # goto LOOP if i != h
Other decisions
Set R1 on R2 less than R3:
slt R1, R2, R3
 Compares two registers, R2 and R3
 R1 = 1 if R2 < R3
 R1 = 0 if R2 >= R3
Example: slt r11, r1, r2
Branch less than
 Example: if (A < B) goto LESS
slt r11, r1, r2     # r11 = 1 if A < B
bne r11, $0, LESS   # branch if r11 != 0
Loops
Example:
while (A[i] == k)
    i = i + j;
// i, j, k in r3, r4, r5
// A is in r6

Loop: sll r11, r3, 2     # r11 = 4 * i
      add r11, r11, r6   # r11 = addr. of A[i]
      lw  r10, 0(r11)    # r10 = A[i]
      bne r10, r5, Exit  # goto Exit if A[i] != k
      add r3, r3, r4     # i = i + j
      j   Loop           # goto Loop
Exit:
Addresses in Branches and Jumps
Instructions:
bne r14, r15, Label    # next instruction is at Label if r14 ≠ r15
beq r14, r15, Label    # next instruction is at Label if r14 = r15
j   Label              # next instruction is at Label
Formats:
I-type: | op | rs | rt | 16-bit address |
J-type: | op | 26-bit address |
Addresses are not 32 bits
— How do we handle this with large programs?
— First idea: limit branch targets to the first 2^16 addresses of the address space
Addresses in Branches
Instructions:
bne r14, r15, Label    # next instruction is at Label if r14 ≠ r15
beq r14, r15, Label    # next instruction is at Label if r14 = r15
Format:
I-type: | op | rs | rt | 16-bit address |
Treat the 16-bit number as an offset to the PC register: PC-relative addressing
 Word offset instead of byte offset — why?
 most branches are local (principle of locality)
Jump instructions just use the high order bits of the PC: pseudodirect addressing
 32-bit jump address = 4 most significant bits of the PC concatenated with the 26-bit word address (i.e. a 28-bit byte address)
 Address boundaries of 256 MB
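A minimal sketch of both target computations (bit layout as above; this assumes the offset is applied to the incremented PC, as in MIPS/DLX):

#include <stdint.h>
#include <stdio.h>

/* PC-relative branch: sign-extended 16-bit word offset, scaled to bytes,
 * added to the incremented PC (PC+4 assumed here). */
static uint32_t branch_target(uint32_t pc, int16_t word_offset) {
    return (pc + 4) + ((int32_t)word_offset << 2);
}

/* Pseudodirect jump: top 4 bits of the incremented PC concatenated with the
 * 26-bit word address (a 28-bit byte address) -> 256 MB boundaries. */
static uint32_t jump_target(uint32_t pc, uint32_t word_addr26) {
    return ((pc + 4) & 0xF0000000u) | (word_addr26 << 2);
}

int main(void) {
    printf("branch: 0x%08x\n", branch_target(0x00400000, -3));
    printf("jump:   0x%08x\n", jump_target(0x00400000, 0x0100000));
    return 0;
}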
Conditional Branch Distance
[Figure: distribution of branch displacement in words (0 to 15), integer average vs. FP average, y-axis 0% to 40%]
• 65% of integer branches are 2 to 4 instructions
Conditional Branch Addressing
PC-relative, since most branches are relatively close to the current PC
At least 8 bits suggested (128 instructions)
Compare Equal/Not Equal most important for integer programs (86%)
[Figure: frequency of comparison types in branches — LT/GE: 7% Int Avg., 40% FP Avg.; GT/LE: 7% Int, 23% FP; EQ/NE: 86% Int, 37% FP]
PC-relative addressing
For larger distances a jump register (jr) is required.
Example:
LOOP: mult $9, $19, $10   # R9 = R19*R10
      lw   $8, 1000($9)   # R8 = @(R9+1000)
      bne  $8, $21, EXIT
      add  $19, $19, $20  # i = i + j
      j    LOOP
EXIT: ...
Assume the address of LOOP is 0x8000. Encodings:

Address  op  rs  rt  rd/imm/offset
80000    0   19  10  9   0   24    (mult)
80004    35  9   8   1000          (lw)
80008    5   8   21  2             (bne: word offset 2)
80012    0   19  20  19  0   32    (add)
80016    2   0x8000                (j)
80020    ...
Procedure calls
Procedures or subroutines:
 Needed for structured programming
Steps followed in executing a procedure call:
 Place parameters in a place where the procedure (callee) can access them
 Transfer control to the procedure
 Acquire the storage resources needed for the procedure
 Perform the desired task
 Place results in a place where the calling program (caller) can access them
 Return control to the point of origin
Resources Involved
Registers used for procedure calling:
 $a0 - $a3: four argument registers in which to pass parameters
 $v0 - $v1: two value registers in which to return values
 r31: one return address register to return to the point of origin
Transferring control to the callee:
 jal ProcedureAddress
 jump-and-link to the procedure address
 the return address (PC+4) is saved in r31
 Example: jal 20000
Returning control to the caller:
 jr r31
 the instruction following the jal is executed next
Memory Stacks
Useful for stacked environments/subroutine call & return even if an operand stack is not part of the architecture
Stacks that Grow Up vs. Stacks that Grow Down:
[Figure: a memory region from low address (0, "Little") to high address ("inf. Big") holding entries a, b, c and a stack pointer SP; the stack can grow up toward high addresses or grow down toward low addresses]
Calling conventions
int func(int g, int h, int i, int j)
{
    int f;
    f = (g + h) – (i + j);
    return (f);
}
// g, h, i, j in $a0, $a1, $a2, $a3; f in r8

func:
addi $sp, $sp, -12   # make room in stack for 3 words
sw   r11, 8($sp)     # save the regs we want to use
sw   r10, 4($sp)
sw   r8, 0($sp)
add  r10, $a0, $a1   # r10 = g + h
add  r11, $a2, $a3   # r11 = i + j
sub  r8, r10, r11    # r8 has the result
add  $v0, r8, r0     # return reg $v0 has f
Calling (cont.)
lw   r8, 0($sp)     # restore r8
lw   r10, 4($sp)    # restore r10
lw   r11, 8($sp)    # restore r11
addi $sp, $sp, 12   # restore sp
jr   $ra
• we did not have to restore r10-r19 (caller save)
• we do need to restore r1-r8 (must be preserved by the callee)
Nested Calls
Stacking of Subroutine Calls & Returns and Environments:
[Figure: A calls B, B calls C; the environment stack grows A -> A B -> A B C, then shrinks back to A B and A as each RET executes]
Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM)
Sometimes stacks are implemented via software convention
Compiling a String Copy Proc.
void strcpy(char x[], char y[])
{
    int i = 0;
    while ((x[i] = y[i]) != 0)
        i++;
}
// x and y base addresses are in $a0 and $a1

strcpy:
addi $sp, $sp, -4      # reserve 1 word of space in stack
sw   r8, 0($sp)        # save r8
add  r8, $zero, $zero  # i = 0
L1:
add  r11, $a1, r8      # addr. of y[i] in r11
lb   r12, 0(r11)       # r12 = y[i]
add  r13, $a0, r8      # addr. of x[i] in r13
sb   r12, 0(r13)       # x[i] = y[i]
beq  r12, $zero, L2    # if y[i] = 0 goto L2
addi r8, r8, 1         # i++
j    L1                # go to L1
L2:
lw   r8, 0($sp)        # restore r8
addi $sp, $sp, 4       # restore $sp
jr   $ra               # return
IA - 32
1978: The Intel 8086 is announced (16 bit architecture)
1980: The 8087 floating point coprocessor is added
1982: The 80286 increases address space to 24 bits, +instructions
1985: The 80386 extends to 32 bits, new addressing modes
1989-1995: The 80486, Pentium, Pentium Pro add a few instructions
(mostly designed for higher performance)
1997: 57 new “MMX” instructions are added, Pentium II
1999: The Pentium III added another 70 instructions (SSE)
2001: Another 144 instructions (SSE2)
2003: AMD extends the architecture to increase address space to 64 bits,
widens all registers to 64 bits and other changes (AMD64)
2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds
more media extensions
"This history illustrates the impact of the 'golden handcuffs' of compatibility"
"adding new features as someone might add clothing to a packed bag"
"an architecture that is difficult to explain and impossible to love"
IA-32 Overview
Complexity:
 Instructions from 1 to 17 bytes long
 one operand must act as both a source and destination
 one operand can come from memory
 complex addressing modes, e.g., "base or scaled index with 8 or 32 bit displacement"
Saving grace:
 the most frequently used instructions are not too difficult to build
 compilers avoid the portions of the architecture that are slow
"what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective"
IA32 Registers
Oversimplified Architecture
 Four 32-bit general purpose registers: eax, ebx, ecx, edx
 al is a register name meaning "the lower 8 bits of eax"
 Stack Pointer: esp
Fun fact:
 Once upon a time, x86 was only a 16-bit CPU
 So, when they upgraded x86 to 32 bits...
 They added an "e" in front of every register name and called it "extended"
Intel 80x86 Integer Registers

GPR0    EAX     Accumulator
GPR1    ECX     Count register: string, loop
GPR2    EDX     Data register: multiply, divide
GPR3    EBX     Base Address Register
GPR4    ESP     Stack Pointer
GPR5    EBP     Base Pointer – for base of stack seg.
GPR6    ESI     Index Register
GPR7    EDI     Index Register
CS              Code Segment Pointer
SS              Stack Segment Pointer
DS              Data Segment Pointer
ES              Extra Data Segment Pointer
FS              Data Seg. 2
GS              Data Seg. 3
EIP             Instruction Counter (PC)
Eflags          Condition Codes
x86 Assembly
mov <dest>, <src>
Move the value from <src> into <dest>
Used to set initial values
add <dest>, <src>
Add the value from <src> to <dest>
sub <dest>, <src>
Subtract the value from <src> from <dest>
x86 Assembly
push <target>
Push the value in <target> onto the stack
Also decrements the stack pointer, ESP (remember: the stack grows from high to low)
pop <target>
Pops the value from the top of the stack and puts it in <target>
Also increments the stack pointer, ESP
x86 Assembly
jmp <address>
Jump to an instruction (like goto)
Changes the EIP to <address>
call <address>
A function call.
Pushes the return address (the address of the next instruction) onto the stack, and jumps to <address>
x86 Assembly
lea <dest>, <src>
Load Effective Address of <src> into register <dest>.
Used for pointer arithmetic (not an actual memory reference)
int <value>
interrupt – hardware signal to the operating system kernel, with flag <value>
int 0x80 means "Linux system call"
x86 Assembly
Condition Codes:
CF: Carry Flag – Overflow Detection (Unsigned)
ZF: Zero Flag
SF: Sign Flag
OF: Overflow Flag – Overflow Detection (Signed)
Condition codes are usually accessed through conditional branches, not directly
Interrupt convention
int 0x80 – system call interrupt
eax – system call number (e.g. 1-exit, 2-fork, 3-read, 4-write)
ebx – argument #1
ecx – argument #2
edx – argument #3
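A minimal sketch of that convention using GCC inline assembly (32-bit Linux only; compile with gcc -m32; this invokes write(1, buf, len), syscall number 4):

#include <stddef.h>

/* int 0x80 convention: eax = syscall number, ebx/ecx/edx = arguments. */
static long sys_write(int fd, const char *buf, size_t len) {
    long ret;
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)                    /* result back in eax */
                      : "a"(4),                      /* eax = 4 (write) */
                        "b"(fd), "c"(buf), "d"(len)  /* ebx, ecx, edx */
                      : "memory");
    return ret;
}

int main(void) {
    sys_write(1, "hello\n", 6);
    return 0;
}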
CISC vs RISC
RISC = Reduced Instruction Set Computer (DLX)
CISC = Complex Instruction Set Computer (x86)
Both have their advantages.
RISC
 Not very many instructions
 All instructions are the same length, in both execution time and bit length
 Results in simpler CPUs (easier to optimize)
 Usually takes more instructions to express a useful operation
CISC
 Very many instructions (the x86 instruction set runs over 700 pages)
 Variable length instructions
 More complex CPUs
 Usually takes fewer instructions to express a useful operation (lower memory requirements)
Some links
Great stuff on x86 asm:
http://www.cs.dartmouth.edu/~cs37/
Instruction Architecture Ideas
If code size is most important, use variable length instructions
If simplicity is most important, use fixed length instructions
Simpler ISAs are easier to optimize!
Recent embedded machines (ARM, MIPS) added an optional mode to execute a subset of 16-bit wide instructions (Thumb, MIPS16); per procedure, decide performance or density
Summary
Instruction complexity is only one variable
 lower instruction count vs. higher CPI / lower clock rate
Design Principles:
 simplicity favors regularity
 smaller is faster
 good design demands compromise
 make the common case fast
Instruction set architecture
 a very important abstraction indeed!