Lecture 2: Memory Systems

 Basic components
 Memory hierarchy
 Cache memory
 Virtual Memory

Internal and External Memories
[Diagram: the CPU exchanges data and control signals with the main memory, which in turn exchanges data and control signals with the secondary memory.]

Main Memory Model
[Diagram: a main memory as an array of words (8, 16, 32, or 64 bits each), addressed from 0 upwards. The MAR (in the CPU) drives the address selection, the MBR (in the CPU) holds the word being transferred, and the control unit issues the read/write control signals.]

Memory Characteristics
The most important characteristics of a memory:
 speed — as fast as possible;
 size — as large as possible;
 cost — reasonable price.
They are determined by the technology used for implementation.
[Illustration: your personal library, used as a memory analogy.]

Memory Access Bottleneck
[Diagram: the path between the CPU and the memory forms a bottleneck. The quantitative measure of the capacity of this bottleneck is the memory bandwidth.]

Memory Bandwidth
 Memory bandwidth denotes the amount of data that can be accessed from a memory per second:

    M-Bandwidth = (1 / memory cycle time) × (amount of data per access)

Ex. MCT = 100 nanoseconds and 4 bytes (a word) per access:
    M-Bandwidth = 4 bytes / 100 ns = 40 megabytes per second.
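
A minimal sketch of this calculation (Python; the variable names are my own):

    # Memory bandwidth = (1 / memory cycle time) * data per access.
    # Numbers taken from the example above.
    memory_cycle_time = 100e-9   # 100 ns per access, in seconds
    bytes_per_access = 4         # one 4-byte word per access

    bandwidth = (1 / memory_cycle_time) * bytes_per_access
    print(f"{bandwidth / 1e6:.0f} MB/s")  # -> 40 MB/s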

 There are two basic techniques to increase the bandwidth of a given memory:
   Reduce the memory cycle time:
    • Expensive.
    • Memory size limitation.
   Divide the memory into several banks, each of which has its own control unit.

Memory Banks
Interleaved placement of program and data: consecutive addresses are spread across the banks.

[Diagram: four memory banks, each with its own control unit, connected to the CPU. Bank 0 holds addresses 0, 4, 8, 12; bank 1 holds 1, 5, 9, 13; bank 2 holds 2, 6, 10, 14; bank 3 holds 3, 7, 11, 15.]
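
A small sketch of this address-to-bank mapping (Python; the four-bank layout follows the diagram above):

    # With interleaving across 4 banks, address a lives in bank (a % 4)
    # at offset (a // 4) within that bank.
    NUM_BANKS = 4

    def bank_of(address: int) -> tuple[int, int]:
        """Return (bank index, offset within the bank) for an address."""
        return address % NUM_BANKS, address // NUM_BANKS

    for a in range(8):
        bank, offset = bank_of(a)
        print(f"address {a} -> bank {bank}, offset {offset}")

Consecutive words fall into different banks, so sequential accesses can be served by several banks in parallel.
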
Lecture 2: Memory Systems

 Basic components
 Memory hierarchy
 Cache memory
 Virtual Memory

Motivation
 What do we need?
   A memory to store very large programs and to work at a speed comparable to that of the CPU.
 The reality is:
   The larger a memory, the slower it will be;
   The faster a memory, the greater the cost per bit.
 A solution:
   Build a composite memory system which combines a small and fast memory with a large and slow memory, and behaves, most of the time, like a large and fast memory.
   This two-level principle can be extended to a hierarchy of many levels.

Memory Hierarchy
[Diagram: the memory hierarchy, from top to bottom: CPU registers, cache, main memory, secondary memory of direct access type, and secondary memory of archive type.]

Memory Hierarchy

Level                              Access time example   Capacity example
CPU registers                      1-10 ns               16-256
Cache                              10-50 ns              4-512K
Main memory                        40-500 ns             4-256M
Secondary memory (direct access)   5-100 ms (for 4KB)    40G/unit
Secondary memory (archive)         0.5-5 s (for 8KB)     50M/tape

As one goes down the hierarchy, the following occur:
 Decreasing cost/bit.
 Increasing capacity.
 Increasing access time.
 Decreasing frequency of access by the CPU.

Lecture 2: Memory Systems

 Basic components
 Memory hierarchy
 Cache memory
 Virtual Memory

Mismatch of CPU and MM Speeds
[Plot: cycle time (ns, log scale from 10^0 to 10^4) of CPUs and main memories over the years 1955-2005. The two curves diverge, leaving a speed gap of roughly one order of magnitude, i.e., 10 times.]

Cache Memory
[Diagram: the cache sits between the CPU and the main memory; the CPU sends addresses to the cache, the cache passes addresses on to the main memory, and instructions and data flow back through the cache to the CPU.]

 A cache is a very fast memory which is put between the main memory and the CPU, and used to hold segments of program and data of the main memory.

Zebo’s Cache Memory Model
Personal library for a high-speed reader.

[Diagram: the cache consists of storage cells plus a memory controller.]

 A computer is a “predictable and iterative reader”; therefore a high cache hit ratio, e.g., 96%, is achievable even with a relatively small cache.

Cache Memory Features
 It is transparent to the programmers.
 Only a small part of the program/data in the main memory has a copy in the cache (e.g., an 8KB cache with an 8MB memory).
 If the CPU wants to access program/data not in the cache (called a cache miss), the relevant block of the main memory will be copied into the cache.
 Memory accesses in the immediate future will usually refer to the same word, or to words in its neighborhood, and will not have to involve the main memory.
   This property of program executions is denoted as locality of reference.

Locality of Reference
 Temporal locality: if an item is referenced, it will tend to be referenced again soon.
 Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
 This access pattern is referred to as the locality of reference principle, which is an intrinsic feature of the von Neumann architecture:
   Sequential instruction storage.
   Loops and iterations (e.g., subroutine calls).
   Sequential data storage (e.g., arrays).

Layered Memory Performance
Average Access Time =
    Phit × Tcache_access
    + (1 - Phit) × (Tmm_access + Tcache_access) × Block_size
    + Tchecking
where
    Phit          = the probability of a cache hit (cache hit ratio);
    Tcache_access = cache access time;
    Tmm_access    = main memory access time;
    Block_size    = number of words in a cache block; and
    Tchecking     = the time needed to check for cache hit or miss.

Ex. A computer has an 8MB MM with 100 ns access time and an 8KB cache with 10 ns access time. With Block_size = 4, Tchecking = 2.1 ns and Phit = 0.97, the AAT will be 25 ns.
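
A minimal check of this formula (Python, plugging in the numbers from the example):

    # Average access time for a cache + main memory system, per the
    # formula above. All times in nanoseconds.
    def average_access_time(p_hit, t_cache, t_mm, block_size, t_checking):
        # Hit: one cache access. Miss: the whole block (block_size words)
        # is fetched from main memory into the cache.
        return (p_hit * t_cache
                + (1 - p_hit) * (t_mm + t_cache) * block_size
                + t_checking)

    aat = average_access_time(p_hit=0.97, t_cache=10, t_mm=100,
                              block_size=4, t_checking=2.1)
    print(f"{aat:.1f} ns")  # -> 25.0 ns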
Cache Design
 The size and nature of the copied block must be carefully designed, as well as the algorithm that decides which block to remove from the cache when it is full:
   Cache block size (line size).
   Total cache size.
   Mapping function.
   Replacement method.
   Write policy.
   Number of caches:
    • Single, two-level, or three-level cache.
    • Unified vs. split cache.

Split Data and Instruction Caches?
 Split caches (Harvard architecture):
   + Competition for the cache between instruction processing and execution units is eliminated.
   + Instruction fetch can proceed in parallel with memory access from the CPU for operands.
   - One may be overloaded while the other is underutilized.
 Unified caches:
   + Better balance of the load between instruction and data fetches, depending on the dynamics of the program execution.
   + Design and implementation are cheaper.
   - Lower performance.

Direct Mapping Cache
 Direct mapping: each block of the main memory is mapped into a fixed cache slot.

[Diagram: two memory blocks, 1 and 2, that map to the same slot of the cache (storage cells plus memory controller).]

Direct Mapping Cache Example
We have a 10,000-word MM and a 100-word cache. 10 memory cells are grouped into a block, so the cache has 10 slots and a decimal address splits into three fields:

    Memory address = | Tag (2 digits) | Slot (1 digit) | Word (1 digit) |

[Diagram: slot 0 of the cache holds words 00-09, slot 1 holds 10-19, ..., slot 9 holds 90-99. Memory blocks 0000-0009, 0010-0019, 0020-0029, ..., 0100-0109, 0110-0119, 0120-0129, ..., 9990-9999 map to the slot given by their third address digit; e.g., block 0110-0119 goes to slot 1 with tag 01.]
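
A small sketch of this decimal mapping (Python; the helper name is my own):

    # Direct mapping for the example above: 10,000-word memory,
    # 100-word cache, 10 words per block, hence 10 cache slots.
    WORDS_PER_BLOCK = 10
    NUM_SLOTS = 10

    def map_address(address: int) -> tuple[int, int, int]:
        """Split a decimal address into (tag, slot, word offset)."""
        block, word = divmod(address, WORDS_PER_BLOCK)
        tag, slot = divmod(block, NUM_SLOTS)
        return tag, slot, word

    print(map_address(115))   # -> (1, 1, 5): block 0110-0119, slot 1, tag 01
    print(map_address(9995))  # -> (99, 9, 5): block 9990-9999, slot 9, tag 99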
Direct Mapping Pros & Cons
 Simple to implement and therefore inexpensive.
 Fixed location for blocks:
   If a program repeatedly accesses two blocks that map to the same cache slot, the cache miss rate will be very high.

[Diagram: two memory blocks, 1 and 2, contending for the same cache slot.]

Associative Mapping
 A main memory block can be loaded into any slot of the cache.
 To determine if a block is in the cache, a mechanism is needed to simultaneously examine every slot’s tag.

[Diagram: an associative memory example with the same 10,000-word memory and 100-word cache. The tag is now the three high-order address digits (the block number); e.g., the block holding words 0106 and 0107 is stored in some slot under tag 010.]

Fully Associative Organization
[Figure: fully associative cache organization.]

Set Associative Organization
 The cache is divided into a number of sets (K).
 Each set contains a number of slots (W).
 A given block maps to any slot in a given set.
   E.g., block i can be in any slot of set j.
 For example, 2 slots per set (W = 2):
   2-way associative mapping.
   A given block can be in one of 2 slots.
 Direct mapping: W = 1 (no alternative).
 Fully associative: K = 1 (W = total number of slots in the cache, all mappings possible).
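
A minimal sketch of the placement rule (Python; it assumes the usual choice j = i mod K and set-major slot numbering):

    # A block may be placed in any of the W slots of its set.
    # Direct mapping is the special case W = 1; fully associative is K = 1.
    def candidate_slots(block: int, num_sets: int, ways: int) -> list[int]:
        """Return the cache slot numbers where this block may be placed."""
        set_index = block % num_sets
        return [set_index * ways + w for w in range(ways)]

    print(candidate_slots(block=13, num_sets=4, ways=2))  # 2-way: [2, 3]
    print(candidate_slots(block=13, num_sets=1, ways=8))  # fully associative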
Replacement Algorithms
With direct mapping, no replacement algorithm is needed.
With associative mapping, a replacement algorithm is needed in order to determine which block to replace:
 First-in first-out (FIFO).
 Least-recently used (LRU): replace the block that has been in the cache longest with no reference to it.
 Least-frequently used (LFU): replace the block that has experienced the fewest references.
 Random.
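
A toy LRU sketch (Python, using an OrderedDict as the recency queue; the cache size and block numbers are made up):

    from collections import OrderedDict

    # Tiny LRU cache of main-memory blocks. On a miss with a full
    # cache, the least-recently used block is evicted.
    class LRUCache:
        def __init__(self, num_slots: int):
            self.num_slots = num_slots
            self.slots = OrderedDict()  # block number -> block data

        def access(self, block: int) -> str:
            if block in self.slots:
                self.slots.move_to_end(block)  # now most recently used
                return "hit"
            if len(self.slots) == self.num_slots:
                self.slots.popitem(last=False)  # evict the LRU block
            self.slots[block] = f"data of block {block}"
            return "miss"

    cache = LRUCache(num_slots=2)
    for b in [1, 2, 1, 3, 2]:
        print(b, cache.access(b))  # 3 evicts 2 (the LRU block), so 2 misses again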
Write Policy
 The problem:
   How to keep cache content and main memory content consistent without losing too much performance?
 Write through:
   All write operations are passed to main memory. If the addressed location is currently in the cache, the cache is updated so that it is coherent with the main memory.
   For writes, the processor always slows down to main memory speed.
   Since the percentage of writes is small (ca. 15%), this scheme doesn’t lead to a large performance reduction.

Write Policy (Cont’d)
 Write through with buffered write:
   The same as write through, but instead of slowing the processor down by writing directly to main memory, the write address and data are stored in a high-speed write buffer; the buffer transfers the data to main memory while the processor continues its task.
   Higher speed, but more complex hardware.
 Write back (see the sketch below):
   Write operations update only the cache memory, which is not kept coherent with main memory. When a slot is replaced from the cache, its content has to be copied back to memory.
   Good performance (usually several writes are performed on a cache block before it is replaced), but more complex hardware is needed.
Cache coherence problems are very complex and difficult to solve in multiprocessor systems (to be discussed later)!
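
A hedged sketch contrasting write through and write back (Python; the one-slot cache and the dirty flag are invented for illustration):

    # One cache slot plus a stand-in for main memory.
    memory = {0x10: 0}
    slot = {"addr": 0x10, "data": 0, "dirty": False}

    def write(value: int, policy: str) -> None:
        slot["data"] = value
        if policy == "write-through":
            memory[slot["addr"]] = value  # memory updated on every write
        else:                             # write-back
            slot["dirty"] = True          # memory updated only on eviction

    def evict() -> None:
        if slot["dirty"]:
            memory[slot["addr"]] = slot["data"]  # copy back on replacement
            slot["dirty"] = False

    write(42, "write-back")
    print(memory[0x10])  # still 0: memory is stale until eviction
    evict()
    print(memory[0x10])  # 42 after the block is written back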
Cache Architecture Examples
 Intel 80486 (introduced 1989):
   a single on-chip cache of 8 Kbytes
   line size: 16 bytes
   4-way set associative organization
 Intel Pentium (introduced 1993):
   two on-chip caches, one for data and one for instructions
   each cache: 8 Kbytes
   line size: 32 bytes
   2-way set associative organization
 IBM PowerPC 620 (introduced 1995):
   two on-chip caches, one for data and one for instructions
   each cache: 32 Kbytes
   line size: 64 bytes
   8-way set associative organization

Cache Architecture Examples (Cont’d)
 Intel Itanium 2 (introduced 2002), with three levels of cache:

                  L1               L2              L3
    Contents      Split D and I    Unified D + I   Unified D + I
    Size          16 Kbytes each   256 Kbytes      3 Mbytes
    Line size     64 bytes         128 bytes       128 bytes
    Associativity 4-way            8-way           12-way
    Access time   1 cycle          5-7 cycles      14-17 cycles
    Store policy  Write-back       Write-back      Write-through

Lecture 2: Memory Systems

 Basic components
 Memory hierarchy
 Cache memory
 Virtual Memory

Motivation for Virtual Memory
 The physical main memory (RAM) is very limited in space.
 It may not be big enough to store all the executing programs at the same time.
 Some programs may need more memory than the main memory provides, but not all of a program needs to be kept in the main memory at the same time.
 Virtual memory takes advantage of the fact that, at any given instant of time, an executing program needs only a fraction of the memory that the whole program occupies.
 The basic idea: load only the pieces of each executing program that are currently needed.

Paging
 Divide programs (processes) into equal-sized, small blocks, called pages.
 Divide the primary memory into equal-sized, small blocks called page frames.
 Allocate the required number of page frames to a program.
 A program does not require contiguous page frames!
 The operating system (OS) is responsible for:
   Maintaining a list of free frames.
   Using a page table to keep track of the mapping between pages and page frames, as in the sketch below.
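
A minimal address-translation sketch (Python; the page size, the table contents, and the fault handling are invented for illustration):

    # Translate a logical address to a physical address via a page table.
    PAGE_SIZE = 1024                 # bytes per page / page frame
    page_table = {0: 5, 1: 2, 2: 7}  # page number -> page frame number

    def translate(logical_addr: int) -> int:
        page, offset = divmod(logical_addr, PAGE_SIZE)
        if page not in page_table:
            raise LookupError(f"page fault: page {page} not in memory")
        return page_table[page] * PAGE_SIZE + offset

    print(translate(1030))  # page 1, offset 6 -> frame 2 -> 2054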
Logical and Physical Addresses
Implementation of the page tables:
 Main memory: slow, since an extra memory access is needed.
 Separate registers: fast but expensive.
 Cache.

[Diagram: a page table with entries 0-3 mapping logical page numbers to physical page frames.]

Objective of Virtual Memory
 To provide the user/programmer with a much bigger memory than the main memory, with the help of the operating system.
   Virtual memory size >> main memory size.

[Diagram: program (virtual) addresses mapped partly onto main memory addresses and partly onto secondary memory.]

Page Fault
When accessing a VM page which is not in the main memory, a page fault occurs.
The page must then be loaded from the secondary memory into the main memory by the OS.

[Diagram: a virtual address is split into a page number and an offset; the page map either points to the page in MM or raises a page fault, an interrupt to the OS.]

Page Replacement
 When a page fault occurs and all page frames are occupied, one of them must be replaced.
 If the replaced page has been modified during the time it resided in the main memory, the updated version should be written back to the secondary memory.
 Our wish is to replace the page which will not be accessed for the longest time in the future.
 Problem: we don’t know exactly what will happen in the future.
 Solution: we predict the future by studying the access patterns up till now (“learn from history”).

Replacement Algorithms
 FIFO (First In First Out): replace the page that has been in MM the longest time.
 LRU (Least Recently Used): replace the page that has not been accessed for the longest time.
 LFU (Least Frequently Used): replace the page that has the smallest number of accesses during the latest time period.
Random replacement (as used for caches) is not used for VM!

Summary
 A memory system has to store very large programs and a lot of data, and still provide fast access.
 No single type of memory can provide all the needs of a computer system.
 Usually, several different storage mechanisms are organized in a layered hierarchy:
   Cache is a hardware solution to improve memory access, which is transparent to the programmers.
   Virtual memory provides a much larger address space than the available physical space, with the help of the OS (a software solution).
 The layered structure works very well due to the locality of reference principle.