PART5

CMPE 421 Parallel Computer Architecture
More Elaborations with Cache & Virtual Memory
Cache Optimization into categories

Reducing Miss Penalty

Multilevel caches

Critical word first: Don’t wait for the full block to be loaded before
sending the requested word and restarting the CPU

Read miss before write miss (giving reads priority over writes): this optimization
serves read misses before earlier buffered writes have completed.
SW R3, 512(R0) ; M[512] ← R3 (cache index 0)
LW R1, 1024(R0) ; R1 ← M[1024] (cache index 0)
LW R2, 512(R0) ; R2 ← M[512] (cache index 0)
- If the write buffer hasn’t completed writing to location 512 in memory, the
read of location 512 will put the old, wrong value into the cache block, and
then into R2. To avoid this, a read miss must either wait until the write buffer
is empty or check the write buffer for a matching address first.
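
The hazard above can be sketched in a few lines. This is a minimal model, not a description of any real cache controller: the names `WriteBuffer` and `read_miss` and the memory contents (111, 222, 999) are made up for illustration.

```python
class WriteBuffer:
    """Stores that have left the cache but have not yet reached memory."""

    def __init__(self):
        self.pending = {}  # address -> value, newest store wins

    def add(self, addr, value):
        self.pending[addr] = value

    def drain(self, memory):
        """Complete all buffered writes."""
        memory.update(self.pending)
        self.pending.clear()


def read_miss(addr, memory, write_buffer, check_buffer):
    """Service a read miss, optionally checking the write buffer first."""
    if check_buffer and addr in write_buffer.pending:
        return write_buffer.pending[addr]   # newest value is still buffered
    return memory.get(addr, 0)              # otherwise read from memory


memory = {512: 111, 1024: 222}   # old memory contents (assumed values)
wb = WriteBuffer()
wb.add(512, 999)                 # SW R3, 512(R0): new value is still buffered

stale = read_miss(512, memory, wb, check_buffer=False)  # old, wrong value
fresh = read_miss(512, memory, wb, check_buffer=True)   # value just stored
```

Without the check the read returns the old memory contents; with the check it returns the value the store just wrote.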

Victim Caches
Victim Caches

One approach to lowering the miss penalty is to remember what was discarded in
case it is needed again.

The victim cache contains only blocks that were discarded from the cache because
of a miss (the “victims”). On a miss, it is checked for the desired data before
going to the next lower-level memory.

Jouppi [1990] found that victim caches of one to five entries are effective at
reducing misses, especially for small, direct-mapped data caches. Depending on
the program, a four-entry victim cache might remove one quarter of the misses
in a 4-KB direct-mapped data cache.

The AMD Athlon has a victim cache with eight entries.
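
A rough sketch of the idea, assuming a direct-mapped cache backed by a small FIFO victim cache. The function name `count_misses`, the FIFO policy and the access trace are illustrative choices, not taken from Jouppi's design.

```python
from collections import OrderedDict


def count_misses(trace, num_sets, victim_entries):
    """Return how many block accesses must go to the next memory level."""
    cache = {}                 # set index -> block address held in that set
    victims = OrderedDict()    # recently discarded blocks, oldest first
    misses = 0
    for block in trace:
        index = block % num_sets
        if cache.get(index) == block:
            continue                       # hit in the main cache
        if block in victims:
            del victims[block]             # hit in the victim cache
        else:
            misses += 1                    # fetch from the next level
        evicted = cache.get(index)
        cache[index] = block
        if evicted is not None and victim_entries > 0:
            victims[evicted] = True        # remember the discarded block
            if len(victims) > victim_entries:
                victims.popitem(last=False)  # drop the oldest victim


    return misses


# Blocks 0 and 8 map to the same set of an 8-set direct-mapped cache.
trace = [0, 8] * 10
without_victim = count_misses(trace, num_sets=8, victim_entries=0)
with_victim = count_misses(trace, num_sets=8, victim_entries=4)
```

On this ping-pong trace every access misses without the victim cache, while with it only the first access to each block goes to the next level.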
Cache Optimization into categories

Reducing the miss rate

Larger block size,

Larger cache size,

Higher associativity,

Way prediction and pseudo-associativity,
- In way prediction, extra bits are kept in the cache to predict the way (the
block within the set) of the next cache access.


Compiler optimizations
Reducing the time to hit in the cache

small and simple caches,

avoiding address translation,

and pipelined cache access.
Cache Optimization

Compiler-based cache optimization reduces the miss rate
without any hardware change

For Instructions


Reorder procedures in memory to reduce conflict misses

Profiling to determine likely conflicts among groups of instructions
For Data

Merging Arrays: improve spatial locality by single array of compound
elements vs. two arrays

Loop Interchange: change nesting of loops to access data in order
stored in memory

Loop Fusion: Combine two independent loops that have the same
loop bounds and access some of the same variables

Blocking: Improve temporal locality by accessing “blocks” of data
repeatedly vs. going down whole columns or rows
Examples

Merging arrays: reduces misses by improving spatial locality through combined
arrays that are accessed simultaneously

Loop interchange: sequential accesses instead of striding through memory every
100 words; improved spatial locality
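
Both effects can be checked with a toy miss counter. The sketch below assumes a direct-mapped cache with 64 sets of 8-word blocks and made-up array sizes and base addresses; the function `count_misses` and all constants are illustrative, not part of the original examples.

```python
def count_misses(addresses, block_words=8, num_sets=64):
    """Count misses of a direct-mapped cache for a trace of word addresses."""
    cache = [None] * num_sets          # one block address per set
    misses = 0
    for addr in addresses:
        block = addr // block_words
        index = block % num_sets
        if cache[index] != block:      # miss: fetch block, evict the old one
            cache[index] = block
            misses += 1
    return misses


# Merging arrays: val[i] and key[i] are always accessed together.
N = 512                                # elements per array (= cache size in words)
separate = [a for i in range(N) for a in (i, N + i)]        # val at 0, key at N
merged = [a for i in range(N) for a in (2 * i, 2 * i + 1)]  # array of (val, key)

# Loop interchange on a row-major array x[ROWS][COLS].
ROWS, COLS = 200, 100
strided = [i * COLS + j for j in range(COLS) for i in range(ROWS)]     # j outer
sequential = [i * COLS + j for i in range(ROWS) for j in range(COLS)]  # i outer

separate_misses = count_misses(separate)
merged_misses = count_misses(merged)
strided_misses = count_misses(strided)
sequential_misses = count_misses(sequential)
```

With these assumed base addresses the two separate arrays map to the same sets and evict each other on every access, so merging them helps both conflicts and spatial locality. The exact counts depend on the cache parameters chosen here.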
Examples

Some programs have separate sections of code that access the
same arrays (performing different computation on common data)

Fusing multiple loops into a single loop allows the data in cache to
be used repeatedly before being swapped out

Loop fusion reduces misses through improved temporal locality
(rather than spatial locality in array merging and loop interchange)

Accessing array “a” and “c” would have caused twice the number of
misses without loop fusion
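
A small simulation illustrates the claim. It assumes a fully associative LRU cache of 64 eight-word blocks and four 1024-word arrays a, b, c, d laid out back to back; the loop bodies (a = b * c, then d = a + c) are an assumed example of two loops sharing arrays “a” and “c”.

```python
from collections import OrderedDict


def count_misses(addresses, block_words=8, num_blocks=64):
    """Count misses of a fully associative LRU cache for a word-address trace."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        block = addr // block_words
        if block in cache:
            cache.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > num_blocks:
                cache.popitem(last=False)   # evict the least recently used
    return misses


N = 1024
A, B, C, D = 0, N, 2 * N, 3 * N             # assumed base addresses

unfused = []
for i in range(N):                          # first loop: a[i] = b[i] * c[i]
    unfused += [B + i, C + i, A + i]
for i in range(N):                          # second loop: d[i] = a[i] + c[i]
    unfused += [A + i, C + i, D + i]

fused = []
for i in range(N):                          # fused loop does both statements
    fused += [B + i, C + i, A + i, A + i, C + i, D + i]

unfused_misses = count_misses(unfused)
fused_misses = count_misses(fused)
```

In the unfused version the blocks of “a” and “c” are fetched twice, once per loop, because the arrays are larger than the cache. After fusion they are fetched once.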
Blocking Example
Example

B is called the blocking factor

Conflict misses can go down too

Blocking is also useful for register allocation
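
As a sketch of what the blocking factor B does, the code below runs a matrix multiply x = y * z with and without blocking and counts misses in a small simulated cache. The matrix size, the cache model (fully associative LRU, 64 blocks of 8 words) and the names `LRUCache` and `multiply` are assumptions made for illustration.

```python
from collections import OrderedDict

N, B = 48, 8                                   # matrix size, blocking factor
Y_BASE, Z_BASE, X_BASE = 0, N * N, 2 * N * N   # assumed word addresses of y, z, x


class LRUCache:
    """Fully associative LRU cache that only counts misses."""

    def __init__(self, num_blocks=64, block_words=8):
        self.num_blocks = num_blocks
        self.block_words = block_words
        self.blocks = OrderedDict()
        self.misses = 0

    def access(self, addr):
        block = addr // self.block_words
        if block in self.blocks:
            self.blocks.move_to_end(block)     # mark as most recently used
            return
        self.misses += 1
        self.blocks[block] = True
        if len(self.blocks) > self.num_blocks:
            self.blocks.popitem(last=False)    # evict the least recently used


def multiply(step):
    """Compute x = y * z in tiles of size step; step == N means no blocking."""
    y = [[i + k for k in range(N)] for i in range(N)]
    z = [[k - j for j in range(N)] for k in range(N)]
    x = [[0] * N for _ in range(N)]
    cache = LRUCache()
    for jj in range(0, N, step):
        for kk in range(0, N, step):
            for i in range(N):
                for j in range(jj, jj + step):
                    r = 0
                    for k in range(kk, kk + step):
                        cache.access(Y_BASE + i * N + k)
                        cache.access(Z_BASE + k * N + j)
                        r += y[i][k] * z[k][j]
                    cache.access(X_BASE + i * N + j)
                    x[i][j] += r
    return x, cache.misses


x_plain, plain_misses = multiply(N)     # ordinary i, j, k loop nest
x_blocked, blocked_misses = multiply(B) # B x B submatrices of z stay in cache
```

Both versions compute the same product. The blocked one reuses each B x B piece of z for every row of y while it is still cached, instead of sweeping through all of z for each row.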
Summary of performance equations
VIRTUAL MEMORY

You’re running a huge program that requires 32MB

Your PC has only 16MB available...
• Rewrite your program so that it implements overlays
• Execute the first portion of code (fit it in the available
memory)
• When you need more memory...
• Find some memory that isn’t needed right now
• Save it to disk
• Use the memory for the latter portion of code
• So on...
• The memory is to disk as registers are to memory
• Disk as an extension of memory
• Main memory can act as a “cache” for the secondary
storage (magnetic disk)
A Memory Hierarchy

Extend the hierarchy

CPU (registers): Load, Store or I-Fetch
Cache: 512KB typical, about $20/MByte, <2ns access time
Main Memory (DRAM): 256MB typical, about $0.15/MByte, 50ns access time
Disk: 40GB typical, about $0.0015/MByte, 15ms (15,000,000 ns) access time

HW manages movement between the cache and main memory.
SW manages movement between main memory and disk.

Main memory acts like a cache for the disk.

The operating system is responsible for managing the movement of memory between
disk and main memory, and for keeping the address translation table accurate.
Virtual Memory
• Idea: Keep only the portions of a program (code, data)
that are currently needed in Main Memory
• Currently unused data is saved on disk, ready to be brought in
when needed
• Appears as a very large virtual memory (limited only by the disk
size)
• Advantages:
• Programs that require large amounts of memory can be run (As
long as they don’t need it all at once)
• Multiple programs can be in virtual memory at once, only active
programs will be loaded into memory
• A program can be written (linked) to use whatever addresses it
wants to! It doesn’t matter where it is physically loaded!
• When a program is loaded, it doesn’t need to be placed in
contiguous memory locations
• Disadvantages:
• The memory a program needs may all be on disk
• The operating system has to manage virtual memory
Virtual Memory

We will focus on using the disk as a storage area for chunks of
main memory that are not being used.

The basic concepts are similar to providing a cache for
main memory, although we now view part of the hard
disk as being the memory.

Only a few programs are active at any time

An active program might not need all the memory that has been reserved
for it (the rest is stored on the hard disk)
The Virtual Memory Concept
Virtual Memory Space: all possible memory addresses (4GB in 32-bit systems).
All that can be conceived.

Disk Swap Space: area on hard disk that can be used as an extension of memory
(typically equal to RAM size). All that can be used.

Main Memory: physical memory (typically 1GB). All that physically exists.
The Virtual Memory Concept
Virtual Memory Space only (Error): This address can be conceived of, but doesn’t
correspond to any memory. Accessing it will produce an error.

Disk Swap Space (Disk Address: 58984, not in main memory): This address can be
accessed. However, it currently is only on disk and must be read into main
memory before being used. A table maps from its virtual address to the disk
location.

Main Memory (Physical Address: 883232, Disk Address: 322321): This address can
be accessed immediately since it is already in memory. A table maps from its
virtual address to its physical address. There will also be a back-up location
on disk.
The Process

The CPU deals with Virtual Addresses
• Steps to accessing memory with a virtual address
1. Convert the virtual address to a physical address
• Need a special table (Virtual Addr-> Physical Addr.)
• The table may indicate that the desired address is
on disk, but not in physical memory
• Read the location from the disk into memory (this may require
moving something else out of memory to make room)
2. Do the memory access using the physical address
• Check the cache first (note: cache uses only
physical addresses)
• Update the cache if needed
Structure of Virtual Memory
Returning to our library analogy:
• A virtual address is like the title of a book
• A physical address is like the location of that book in the library

From Processor → Virtual Address → Address Translator → Physical Address → To Memory
On a page fault, an elaborate software page-fault handling algorithm is used.
Translation (hardware that translates these virtual addresses to physical addresses)

Since the hardware accesses memory, we need to convert from a
logical address to a physical address in hardware

The Memory Management Unit (MMU) provides this functionality

CPU → Virtual (logical) address → MMU → Physical (real) address → Physical memory
(physical memory addresses run from 0 to 2^n - 1)
Address Translation
In Virtual Memory, blocks of memory (called pages) are mapped from
one set of addresses (called virtual addresses) to another set (called
physical addresses)
Page Faults
If the valid bit for a virtual page is off, a page fault occurs. The operating system
must be given control. Once the operating system gets control, it must find the
page in the next level of the hierarchy (usually magnetic disk) and decide
where to place the requested page in main memory.
Terminology

page: The unit of memory transferred between disk and
the main memory.

page fault: when a program accesses a virtual memory
location that is not currently in the main memory.

address translation: the process of finding the physical
address that corresponds to a virtual address.
Cache               Virtual memory
Block            ⇒  Page
Cache miss       ⇒  Page fault
Block addressing ⇒  Address translation
Difference between virtual and cache memory

The miss penalty is huge (millions of clock cycles)

Solution: Increase the block size (page size) to around 8KB
- Because transfers have a large startup time, but data transfer is
relatively fast once started

Even on faults (misses) VM must provide info on the
disk location

VM system must have an entry for all possible locations

When there is a hit, the VM system provides the physical
address in memory (not the actual data, in the cache we have
data itself )
- Saves room – one address rather than 8 KB data

Since the miss penalty is so large, VM systems typically
have a miss (page fault) rate of 0.00001% to 0.0001%
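
A quick calculation shows why the fault rate has to be this small. It uses the 50 ns memory and 15 ms disk access times quoted earlier in these slides; the simple model (memory time plus fault rate times disk time) and the function name are assumptions for illustration.

```python
def average_access_time_ns(fault_rate, memory_ns=50.0, disk_ns=15_000_000.0):
    """Average access time when a fraction fault_rate of accesses goes to disk."""
    return memory_ns + fault_rate * disk_ns


low = average_access_time_ns(0.00001 / 100)     # 0.00001% page fault rate
high = average_access_time_ns(0.0001 / 100)     # 0.0001% page fault rate
one_percent = average_access_time_ns(1 / 100)   # a 1% rate, for comparison
```

Under this model the two typical fault rates add roughly 1.5 ns and 15 ns to each access on average, whereas a 1% fault rate would push the average to about 150,000 ns.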
In Virtual Memory Systems

Pages should be large enough to amortize the high
access time. (from 4 kB to 16 kB are typical, and some
designers are considering size as large as 64 kB.)

Organizations that reduce the page fault rate are
attractive. The primary technique used here is to allow
flexible placement of pages. (e.g. fully associative)

A sophisticated LRU replacement policy is preferable

Page faults can be handled in software.

Write-back is used (a write-through scheme does not work): we need a scheme
that reduces the number of disk writes.
Keeping track of pages: The page table

All programs use the same virtual addressing space

Each program must have its own memory mapping

Each program has its own page table to map virtual
addresses to physical addresses
Virtual Address → Page Table → Physical Address

The page table resides in memory, and is pointed to by
the page table register

The page table has an entry for every possible page (in
principle, not in practice...), no tags are necessary.

A valid bit indicates whether the page is in memory or on
disk.
Virtual to Physical Mapping
Both virtual and physical addresses are broken down into a page number and a
page offset (index)

Example
• 4GB (32-bit) Virtual Address Space
• 32MB (25-bit) Physical Address Space
• 8 KB (13-bit) page size (block size)

Virtual address:  bits 31-13 Virtual Page Number  | bits 12-0 Page Offset
Translation (note: may involve reading from disk; page tables are stored in main memory)
Physical address: bits 24-13 Physical Page Number | bits 12-0 Page Offset

No tag: all entries are unique

• A 32-bit virtual address is given to the V.M. hardware
• The virtual page number (index) is derived from this by removing
the page (block) offset
• The Virtual Page Number is looked up in a page table
• When found, the entry is either:
• The physical page number, if in memory (V = 1)
• The disk address, if not in memory (V = 0, a page fault)
• If not found, the address is invalid
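Virtual Memory (32-bit system): 8KB page size, 16MB Mem

Virtual Address:  bits 31-13 Index (19 bits) | bits 12-0 Page offset (13 bits)
Page table: 4GB / 8KB = 2^19 = 512K entries, indexed by virtual page number
(0, 1, 2, ..., 512K); each entry holds a valid bit (V) and either a physical
page number (11 bits) or a disk address
Physical Address: bits 23-13 Physical page number (11 bits) | bits 12-0 Page offset (13 bits)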
Virtual Memory Consists

Bits for page offset (the address within a page)

Bits for virtual page number

Number of virtual pages

Entries in the page table

Bits for physical page number

Number of physical pages

Bits per page table line

Total page table size
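
The quantities in this list can be worked out for the 32-bit, 8KB-page, 16MB-memory system shown earlier. The sketch assumes power-of-two sizes and a page table line made of the physical page number plus one valid bit; the function and key names are invented for illustration, and a real entry would carry more status bits.

```python
def page_table_parameters(virtual_bits, page_bytes, physical_bytes, status_bits=1):
    """Derive page table sizes; all sizes are assumed to be powers of two."""
    offset_bits = page_bytes.bit_length() - 1        # log2(page size)
    vpn_bits = virtual_bits - offset_bits
    virtual_pages = 1 << vpn_bits                    # also the number of entries
    physical_pages = physical_bytes // page_bytes
    ppn_bits = physical_pages.bit_length() - 1       # log2(physical pages)
    entry_bits = ppn_bits + status_bits              # PPN + valid bit (assumed)
    return {
        "offset_bits": offset_bits,
        "vpn_bits": vpn_bits,
        "virtual_pages": virtual_pages,
        "physical_pages": physical_pages,
        "ppn_bits": ppn_bits,
        "entry_bits": entry_bits,
        "table_bytes": virtual_pages * entry_bits // 8,
    }


params = page_table_parameters(32, 8 * 1024, 16 * 1024 * 1024)
```

For this system the result is a 13-bit offset, a 19-bit virtual page number, 512K entries, 2048 physical pages with an 11-bit physical page number, and a table of 768 KB under the 12-bit-per-line assumption.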
Write issues

Write Through - Update both disk and memory
+ Easy to implement
- Requires a write buffer
- Requires a separate disk write for every write to memory
- A write miss requires reading in the page first, then writing back the
single word

Write Back - Write only to main memory. Write to the disk
only when the page is replaced.
+ Writes are fast
+ Multiple writes to a page are combined into one disk write
- Must keep track of when a page has been written (dirty bit)
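
The difference in disk traffic can be sketched by counting writes. This is a simplified model: it assumes 8KB pages, that every written page is already resident, and that each dirty page is replaced exactly once afterwards. The function name and the store addresses are made up.

```python
def disk_writes(store_addresses, policy, page_bytes=8192):
    """Count disk writes caused by a list of stores under a write policy."""
    if policy == "write-through":
        return len(store_addresses)            # one disk write per store
    if policy == "write-back":
        # A dirty bit per page: each modified page is written once on replacement.
        dirty_pages = {addr // page_bytes for addr in store_addresses}
        return len(dirty_pages)
    raise ValueError("unknown policy: " + policy)


stores = [0, 8, 16, 8192, 8200]                # five stores touching two pages
through = disk_writes(stores, "write-through")
back = disk_writes(stores, "write-back")
```

Here five stores cost five disk writes with write-through but only two with write-back, one per dirty page.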
Page replacement policy

Exact Least Recently Used (LRU) is the ideal policy, but it is expensive to implement.

So, use Approximate LRU:

a use bit (or reference bit) is added to every page table
line

If there is a hit, the PPN is used to form the address and the reference
bit is turned on, so the bit is set at every access

the OS periodically clears all use bits

the page to replace is chosen among the ones with their
use bit at zero; choose one such entry as a victim randomly

If the OS chooses to replace a page, the dirty bit
indicates whether the page must be written out before its
location in memory can be given to another page
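
The use-bit scheme above can be sketched as follows. The class and function names are hypothetical, and the fallback when every use bit is set (pick from all pages) is an assumption the slides do not specify.

```python
import random


class PageTableEntry:
    """One page table line with its status bits."""

    def __init__(self, ppn):
        self.ppn = ppn
        self.use = False     # set on every access, cleared periodically by the OS
        self.dirty = False   # set on every write


def access(entry, is_write=False):
    entry.use = True
    if is_write:
        entry.dirty = True


def clear_use_bits(table):
    """What the OS does periodically."""
    for entry in table.values():
        entry.use = False


def choose_victim(table, rng=random):
    """Pick a random page whose use bit is zero; report if it needs a write-back."""
    candidates = [vpn for vpn, entry in table.items() if not entry.use]
    if not candidates:
        candidates = list(table)     # assumed fallback: all pages recently used
    vpn = rng.choice(candidates)
    return vpn, table[vpn].dirty


table = {vpn: PageTableEntry(ppn) for vpn, ppn in [(0, 9), (1, 2), (2, 11)]}
access(table[0], is_write=True)
access(table[1])
first_victim = choose_victim(table)      # only page 2 has its use bit at zero

clear_use_bits(table)
access(table[1])
access(table[2])
second_victim = choose_victim(table)     # only page 0 is unused; it is dirty
```

The second victim is dirty, so it would have to be written to disk before its physical page is reused.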
Virtual memory example
System with 20-bit V.A., 16KB pages, 256KB of physical memory
Page offset takes 14 bits, 6 bits for V.P.N. and 4 bits for P.P.N.

Page Table (an arrow shows an entry before → after the page fault below is handled):

Virtual Page # (index)   Valid Bit   Physical Page # / Disk address
000000                   1           1001
000001                   0           sector 5000...
000010                   1           0010
000011                   0           sector 4323…
000100                   1           1011
000101                   1 → 0       1010 → sector xxxx...
000110                   0 → 1       sector 1239... → 1010
000111                   1           0001

Access to: 0000 1000 1100 1010 1010
VPN = 000010, PPN = 0010
Physical Address: 00 1000 1100 1010 1010

Access to: 0001 1001 0011 1100 0000
VPN = 000110, PPN = Page Fault to sector 1239...
Pick a page to “kick out” of memory (use LRU). Assume LRU is VPN 000101 for this example.
Read data from sector 1239 into PPN 1010.
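
The two accesses can be replayed in code. This is a sketch using the page table values from the example; the `translate` function and the tuple layout are illustrative, and the disk addresses are kept as the same elided strings the slide uses.

```python
OFFSET_BITS = 14                      # 16KB pages

# VPN -> (valid bit, physical page number or disk address)
page_table = {
    0b000000: (True, 0b1001),
    0b000001: (False, "sector 5000..."),
    0b000010: (True, 0b0010),
    0b000011: (False, "sector 4323..."),
    0b000100: (True, 0b1011),
    0b000101: (True, 0b1010),
    0b000110: (False, "sector 1239..."),
    0b000111: (True, 0b0001),
}


def translate(virtual_address):
    """Return ("hit", physical address) or ("page fault", disk address)."""
    vpn = virtual_address >> OFFSET_BITS
    offset = virtual_address & ((1 << OFFSET_BITS) - 1)
    valid, where = page_table[vpn]
    if not valid:
        return ("page fault", where)
    return ("hit", (where << OFFSET_BITS) | offset)


first = translate(0b00001000110010101010)     # VPN 000010 is in memory
second = translate(0b00011001001111000000)    # VPN 000110 is on disk

# Handle the fault as in the example: VPN 000101 is the LRU page, so its
# physical page 1010 is given to VPN 000110 after reading the data from disk.
page_table[0b000101] = (False, "sector xxxx...")
page_table[0b000110] = (True, 0b1010)
retry = translate(0b00011001001111000000)
```

The first access yields physical address 00 1000 1100 1010 1010 as on the slide. After the fault is handled, the retried access maps to 10 1001 0011 1100 0000.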