The Buffer Cache
Jeff Chase
Duke University
The kernel
syscall trap/return
fault/return
system call layer: file API
fault entry: VM page faults
memory management: block/page cache
I/O completions
interrupt/return
policy
timer ticks
DeFiler interfaces: overview
create, destroy, read, write a dfile
list dfiles
DFS
DBuffer dbuf = getBlock(blockID)
releaseBlock(dbuf)
DBufferCache
read(), write()
startFetch(), startPush()
waitValid(), waitClean()
DBuffer
ioComplete()
startRequest(dbuf, r/w)
VirtualDisk
Memory Allocation
How should an OS allocate its memory resources
among contending demands?
– Virtual address spaces: fork, exec, sbrk, page fault.
– The kernel controls how many machine memory frames back
the pages of each virtual address space.
– The kernel can take memory away from a VAS at any time.
– The kernel always gets control if a VAS (or rather a thread
running within a VAS) asks for more.
– The kernel controls how much machine memory to use as a
cache for data blocks whose home is on slow storage.
– Policy choices: which pages or blocks to keep in memory? And
which ones to evict from memory to make room for others?
Memory/storage hierarchy
Terms to know
cache index/directory
cache line/entry, associativity
cache hit/miss, hit ratio
spatial locality of reference
temporal locality of reference
eviction / replacement
write-through / writeback
dirty/clean
Hierarchy levels, from small and fast (ns) to big and slow (ms):
registers
caches: L1/L2
off-core: L3
off-chip: main memory (RAM)
off-module: disk, other storage, network RAM
• In general, each layer is a cache over the layer below.
– inclusion property
• Technology trends → rapid change
• The triangle is expanding vertically → bigger gaps, more levels
Memory as a cache
Processes access external storage objects through file APIs and the VM abstraction. The OS kernel manages caching of pages/blocks in main memory.
[Figure: virtual address spaces and data (files and filesystems, databases, other storage objects) reach memory (frames) through page/block read/write accesses; memory caches the backing storage volumes (pages and blocks) on disk, other storage, and network RAM.]
The Buffer Cache
Proc
Memory
File
cache
Ritchie and Thompson
The UNIX Time-Sharing
System, 1974
Editing Ritchie/Thompson
The system maintains a buffer cache (block cache, file
cache) to reduce the number of I/O operations.
Suppose a process makes a system call to access a
single byte of a file. UNIX determines the affected
disk block, and finds the block if it is resident in the
cache. If it is not resident, UNIX allocates a cache
buffer and reads the block into the buffer from the disk.
Then, if the op is a write, it replaces the affected byte
in the buffer. A buffer with modified data is marked
dirty: an entry is made in a list of blocks to be written.
The write call may then return. The actual write may
not be completed until a later time.
If the op is a read, it picks the requested byte out of the
buffer and returns it, leaving the block in the cache.
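The read and write paths described above can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the UNIX implementation: the class name ByteCacheSketch, the 8-byte block size, and the in-memory array standing in for the disk are all hypothetical, and the cache never evicts.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the single-byte read/write path; every name here is hypothetical. */
class ByteCacheSketch {
    static final int BLOCK_SIZE = 8;                       // tiny block size, for illustration only
    final byte[] disk;                                     // stands in for the disk
    final Map<Integer, byte[]> cache = new HashMap<>();    // resident blocks by block number
    final Set<Integer> dirtyList = new LinkedHashSet<>();  // blocks waiting to be written
    int diskReads = 0;

    ByteCacheSketch(int nBlocks) {
        disk = new byte[nBlocks * BLOCK_SIZE];
    }

    /** Find the affected block in the cache, or allocate a buffer and read it in. */
    private byte[] lookup(int offset) {
        int blockNo = offset / BLOCK_SIZE;
        byte[] buf = cache.get(blockNo);
        if (buf == null) {                                 // not resident: one disk read
            buf = new byte[BLOCK_SIZE];
            System.arraycopy(disk, blockNo * BLOCK_SIZE, buf, 0, BLOCK_SIZE);
            diskReads++;
            cache.put(blockNo, buf);
        }
        return buf;
    }

    /** Read: pick the byte out of the buffer and leave the block cached. */
    byte readByte(int offset) {
        return lookup(offset)[offset % BLOCK_SIZE];
    }

    /** Write: replace the byte in the buffer, mark the block dirty, return at once. */
    void writeByte(int offset, byte b) {
        lookup(offset)[offset % BLOCK_SIZE] = b;
        dirtyList.add(offset / BLOCK_SIZE);
    }

    /** The actual disk write happens later, here on an explicit flush. */
    void flush() {
        for (int blockNo : dirtyList) {
            System.arraycopy(cache.get(blockNo), 0, disk, blockNo * BLOCK_SIZE, BLOCK_SIZE);
        }
        dirtyList.clear();
    }
}
```

After writeByte(10, 7) the simulated disk still holds the old byte until flush() runs, which is the delayed write the passage describes.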
Memory
File
cache
The DeFiler buffer cache
File abstraction implemented in upper DFS layer.
All knowledge of how files are laid out on disk is at this layer.
Access underlying disk volume through buffer cache API.
Obtain buffers (dbufs), write/read to/from buffers, orchestrate I/O.
DBuffer dbuf = getBlock(blockID)
releaseBlock(dbuf)
DBufferCache
Device I/O interface
Asynchronous I/O to/from buffers
block read and write
Blocks numbered by blockIDs
DBuffer
read(), write()
startFetch(), startPush()
waitValid(), waitClean()
Page/block cache internals
HASH(blockID)
Each frame/buffer of memory is
described by a meta-object (header).
Resident pages or blocks are
accessible through a global
hash table.
An ordered list of eviction candidates
winds through the hash chains.
Some frames/buffers are free (no
valid data). These are on a free list.
DBufferCache internals
HASH(blockID)
Any given block (blockID) is either resident or not. If
resident, then it has exactly one copy (dbuf) in the cache.
If it is resident then getBlock finds the dbuf (cache hit).
This requires some kind of cache index, e.g., a hash table.
DBuffer dbuf = getBlock(blockID)
DBufferCache
I/O cache buffers
Each is byte[blocksize]
DBuffer
Buffer headers
DBuffer dbuf
There is a one-to-one correspondence of dbufs to buffers.
DBufferCache internals
HASH(blockID)
If the requested block is not resident, then getBlock
allocates a dbuf for the block and places the correct
block contents in its buffer (cache miss). If there are
no free dbufs in the cache, then we must evict some
other block from the cache and reuse its dbuf.
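A minimal sketch of the hit and miss paths of getBlock, assuming a least-recently-used eviction order (the slides do not fix a replacement policy). The names DBufSketch and GetBlockSketch are hypothetical, and the miss path only zeroes the buffer where a real cache would fetch the block contents.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical buffer header: one header per byte[blocksize] buffer. */
class DBufSketch {
    int blockID;
    final byte[] buffer;

    DBufSketch(int blockID, int blockSize) {
        this.blockID = blockID;
        this.buffer = new byte[blockSize];
    }
}

class GetBlockSketch {
    private final int capacity;
    private final int blockSize;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<Integer, DBufSketch> index =
            new LinkedHashMap<>(16, 0.75f, true);
    int hits = 0;
    int misses = 0;

    GetBlockSketch(int capacity, int blockSize) {
        this.capacity = capacity;
        this.blockSize = blockSize;
    }

    synchronized DBufSketch getBlock(int blockID) {
        DBufSketch dbuf = index.get(blockID);
        if (dbuf != null) {                       // cache hit: exactly one copy exists
            hits++;
            return dbuf;
        }
        misses++;
        if (index.size() < capacity) {            // a free dbuf is still available
            dbuf = new DBufSketch(blockID, blockSize);
        } else {                                  // evict a block and reuse its dbuf
            Iterator<Map.Entry<Integer, DBufSketch>> it = index.entrySet().iterator();
            dbuf = it.next().getValue();
            it.remove();
            dbuf.blockID = blockID;
        }
        Arrays.fill(dbuf.buffer, (byte) 0);       // placeholder for fetching the contents
        index.put(blockID, dbuf);
        return dbuf;
    }

    synchronized boolean isResident(int blockID) {
        return index.containsKey(blockID);        // does not disturb the LRU order
    }
}
```

This sketch ignores two things a full cache must handle: a dirty victim has to be pushed to disk before its dbuf is reused, and a dbuf that is in active use must not be chosen as the victim.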
Page/block cache internals
HASH(blockID)
cache directory
List(s) of free buffers (dbufs) or
eviction candidates. These dbufs
might be listed in the cache directory
if they contain useful data, or not, if
they are truly free.
To replace a dbuf
Remove from free/eviction list.
Remove from cache directory.
Change dbuf blockID and status.
Enter in directory w/ new blockID.
Re-register on eviction list.
Beware of concurrent accesses.
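The five replacement steps can be written out one per line. The sketch below is an illustration under assumptions: ReplaceSketch and Header are hypothetical names, the eviction list is a plain FIFO queue, the victim is assumed to be clean, and a single lock on the cache object guards against concurrent accesses.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class ReplaceSketch {
    /** Hypothetical buffer header. blockID -1 means truly free (not in the directory). */
    static class Header {
        int blockID = -1;
        boolean valid = false;
        boolean dirty = false;
    }

    private final Map<Integer, Header> directory = new HashMap<>();
    private final Deque<Header> evictionList = new ArrayDeque<>();  // head = next victim

    ReplaceSketch(int nBuffers) {
        for (int i = 0; i < nBuffers; i++) {
            evictionList.addLast(new Header());
        }
    }

    /** Reuse a free or evictable header for newBlockID. Assumes the victim is clean. */
    synchronized Header replace(int newBlockID) {
        if (directory.containsKey(newBlockID)) {
            throw new IllegalStateException("block already resident");
        }
        Header h = evictionList.pollFirst();          // 1. remove from free/eviction list
        if (h == null) {
            throw new IllegalStateException("no eviction candidates");
        }
        if (h.blockID != -1) {
            directory.remove(h.blockID);              // 2. remove from cache directory
        }
        h.blockID = newBlockID;                       // 3. change blockID and status
        h.valid = false;
        h.dirty = false;
        directory.put(newBlockID, h);                 // 4. enter in directory w/ new blockID
        evictionList.addLast(h);                      // 5. re-register on eviction list
        return h;
    }

    synchronized Header lookup(int blockID) {
        return directory.get(blockID);
    }
}
```

Holding one lock across all five steps keeps other threads from seeing a header that is in neither the directory nor the list. A finer-grained design is possible but needs more care.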
DBuffer (dbuf) states
DFS
A DBuffer dbuf returned by
getBlock is always associated
with exactly one block in the
disk volume. But it might or
might not be “in sync” with the
underlying disk contents.
read(…)
write(...)
startFetch(), startPush()
waitValid(), waitClean()
DBuffer
A dbuf is valid iff it has the “correct” copy of the data. A dbuf is
dirty iff it is valid and has an update (a write) that has not yet been
written to disk. A valid dbuf is clean if it is not dirty.
Your DeFiler should return only valid data to a client. That may require
you to zero the dbuf or fetch data from the disk. Your DeFiler should
ensure that all dirty data is eventually pushed to disk.
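The valid, dirty, and clean definitions reduce to two flags. The sketch below is one possible encoding, not the assigned DBuffer class: DBufStateSketch is a hypothetical name, the read and write signatures are assumptions, and the rule that a partial write needs a valid dbuf is a simplification chosen for the example.

```java
import java.util.Arrays;

class DBufStateSketch {
    private final byte[] buffer;
    private boolean valid = false;   // holds the "correct" copy of the block
    private boolean dirty = false;   // valid, with an update not yet written to disk

    DBufStateSketch(int blockSize) {
        buffer = new byte[blockSize];
    }

    synchronized boolean isValid() { return valid; }
    synchronized boolean isDirty() { return valid && dirty; }
    synchronized boolean isClean() { return valid && !dirty; }

    /** A block that was never written: zeroed contents count as the correct copy. */
    synchronized void zero() {
        Arrays.fill(buffer, (byte) 0);
        valid = true;
        dirty = false;
    }

    /** Copy up to count bytes of the block into dst, starting at dst[startOffset]. */
    synchronized int read(byte[] dst, int startOffset, int count) {
        if (!valid) {
            throw new IllegalStateException("dbuf is not valid: zero it or fetch it first");
        }
        int n = Math.min(count, buffer.length);
        System.arraycopy(buffer, 0, dst, startOffset, n);
        return n;
    }

    /** Copy up to count bytes from src[startOffset] into the block and mark it dirty. */
    synchronized int write(byte[] src, int startOffset, int count) {
        if (!valid && count < buffer.length) {
            throw new IllegalStateException("partial write needs a valid dbuf");
        }
        int n = Math.min(count, buffer.length);
        System.arraycopy(src, startOffset, buffer, 0, n);
        valid = true;
        dirty = true;
        return n;
    }

    /** Called once the contents have reached the disk. */
    synchronized void pushComplete() {
        dirty = false;
    }
}
```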
Asynchronous I/O on dbufs
Start I/O on a dbuf by posting it to a
producer/consumer queue for service by a
device thread.
startFetch(), startPush()
Client threads may wait on the dbuf
for asynchronous I/O to complete.
waitValid(), waitClean()
DBuffer
startRequest(dbuf, r/w)
Device I/O interface
Async I/O on dbufs
device threads
VirtualDisk
startFetch(), startPush()
waitValid(), waitClean()
ioComplete()
The device thread upcalls the dbuf's
ioComplete() when the I/O
operation is done.
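The producer/consumer arrangement can be sketched with a BlockingQueue and one device thread. This is an illustration under assumptions, not the provided VirtualDisk: AsyncDBuf and AsyncDisk are hypothetical names, the disk is an in-memory array, and ioComplete takes a flag saying whether the finished request was a write.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class AsyncDBuf {
    final int blockID;
    final byte[] buffer;
    private final AsyncDisk disk;
    private boolean valid = false;
    private boolean dirty = false;

    AsyncDBuf(int blockID, int blockSize, AsyncDisk disk) {
        this.blockID = blockID;
        this.buffer = new byte[blockSize];
        this.disk = disk;
    }

    synchronized void markDirty() { valid = true; dirty = true; }
    synchronized void startFetch() { disk.startRequest(this, false); }
    synchronized void startPush() { disk.startRequest(this, true); }

    /** Upcall from the device thread when the I/O operation is done. */
    synchronized void ioComplete(boolean wasWrite) {
        if (wasWrite) { dirty = false; } else { valid = true; }
        notifyAll();                              // wake clients in waitValid/waitClean
    }

    synchronized boolean waitValid() { return await(true); }
    synchronized boolean waitClean() { return await(false); }

    /** Caller holds the monitor. Returns false if the thread was interrupted. */
    private boolean await(boolean forValid) {
        try {
            while (forValid ? !valid : dirty) { wait(); }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class AsyncDisk {
    private static final class Request {
        final AsyncDBuf dbuf;
        final boolean isWrite;
        Request(AsyncDBuf dbuf, boolean isWrite) { this.dbuf = dbuf; this.isWrite = isWrite; }
    }

    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
    private final byte[] store;                   // stands in for the disk volume
    private final int blockSize;

    AsyncDisk(int nBlocks, int blockSize) {
        this.store = new byte[nBlocks * blockSize];
        this.blockSize = blockSize;
        Thread device = new Thread(this::serve, "device");
        device.setDaemon(true);                   // do not keep the JVM alive
        device.start();
    }

    /** Producer side: post the request and return without waiting. */
    void startRequest(AsyncDBuf dbuf, boolean isWrite) {
        queue.add(new Request(dbuf, isWrite));
    }

    /** Consumer side: the device thread services requests in FIFO order. */
    private void serve() {
        try {
            while (true) {
                Request r = queue.take();
                int base = r.dbuf.blockID * blockSize;
                if (r.isWrite) {
                    System.arraycopy(r.dbuf.buffer, 0, store, base, blockSize);
                } else {
                    System.arraycopy(store, base, r.dbuf.buffer, 0, blockSize);
                }
                r.dbuf.ioComplete(r.isWrite);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A client that needs the data calls startFetch() and then waitValid(). A client that only wants the write to happen eventually can call startPush() and carry on without waiting.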
More dbuf states
Do not evict a dbuf that is in active use (busy)!
DFS
A dbuf is pinned if I/O is in progress, i.e., a
disk request has started but not yet completed.
dbuf = getBlock(blockID)
releaseBlock(dbuf)
A dbuf is held if DFS obtained a reference to
the dbuf from getBlock but has not yet
released the dbuf.
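One way to track the pinned and held states is a flag plus a counter per dbuf. The sketch below assumes a hold count so that several callers may hold the same dbuf at once. That is a design choice for the example, and PinHoldSketch and firstEvictable are hypothetical names.

```java
import java.util.List;

class PinHoldSketch {
    private int holdCount = 0;              // getBlock calls not yet matched by releaseBlock
    private boolean ioInProgress = false;   // a disk request has started but not completed

    synchronized void hold() { holdCount++; }

    synchronized void release() {
        if (holdCount == 0) {
            throw new IllegalStateException("release without a matching hold");
        }
        holdCount--;
    }

    synchronized void ioStarted() { ioInProgress = true; }
    synchronized void ioComplete() { ioInProgress = false; }

    synchronized boolean isHeld() { return holdCount > 0; }
    synchronized boolean isPinned() { return ioInProgress; }
    synchronized boolean isBusy() { return holdCount > 0 || ioInProgress; }

    /** Index of the first dbuf that may be evicted, or -1 if every one is busy. */
    static int firstEvictable(List<PinHoldSketch> candidates) {
        for (int i = 0; i < candidates.size(); i++) {
            if (!candidates.get(i).isBusy()) {
                return i;
            }
        }
        return -1;
    }
}
```

In a real cache the busy check and the eviction must happen under the same lock, otherwise a dbuf could become held between the check and its reuse.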
DBufferCache
DBuffer
startRequest(dbuf, r/w);
VirtualDisk
ioComplete()
File system layer (DFS)
create, destroy, read, write a dfile
list dfiles
Allocate blocks to files and file metadata.
Allocate DFileIDs to files.
Track which blockIDs and DFileIDs are free
and which are in use.
“inode”
Maintain a block map “inode” for each file,
as metadata stored on disk.
DBuffer dbuf = getBlock(blockID)
releaseBlock(dbuf)
sync()
DBufferCache
DBuffer
read(), write()
startFetch(), startPush()
waitValid(), waitClean()
A Filesystem On Disk
sector 0
sector 1
allocation
bitmap file
wind: 18
0
directory
file
11100010
00101101
10111101
snow: 62
0
once upo
n a time
/n in a l
10011010
00110001
00010101
00101110
00011001
01000100
rain: 32
hail: 48
and far
far away
, lived th
Data
A Filesystem On Disk
Metadata
Managing files
create, destroy, read, write a dfile
list dfiles
Each file has a size: it is the first byte offset in
the file that has never been written. Never
return data past a file’s size.
Fetch blocks for data and metadata (or zero
freshly allocated ones), read and write in place, and
push dirty blocks back to the disk.
“inode”
Serialize DFS read/write on each inode.
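The file-size rule can be captured in two small methods. In this sketch, DFileSizeSketch is a hypothetical stand-in for the in-memory part of an inode, and its synchronized methods are one simple way to serialize reads and writes on that inode. Treating the size as the highest offset written so far is an assumption for files that are written with gaps.

```java
class DFileSizeSketch {
    private int size = 0;   // first byte offset that has never been written

    synchronized int size() {
        return size;
    }

    /** Number of bytes a read of count bytes at offset may return. */
    synchronized int readableBytes(int offset, int count) {
        if (offset < 0 || count < 0) {
            throw new IllegalArgumentException("negative offset or count");
        }
        if (offset >= size) {
            return 0;                           // nothing to return past the size
        }
        return Math.min(count, size - offset);  // clamp at the file's size
    }

    /** Record a write of count bytes at offset; the size only ever grows. */
    synchronized void recordWrite(int offset, int count) {
        if (offset < 0 || count < 0) {
            throw new IllegalArgumentException("negative offset or count");
        }
        size = Math.max(size, offset + count);
    }
}
```

For a 10-byte file, a read of 5 bytes at offset 8 may return only 2 bytes, and a read at offset 10 returns none.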
DBuffer dbuf = getBlock(blockID)
releaseBlock(dbuf)
sync()
DBufferCache
DBuffer
read(), write()
startFetch(), startPush()
waitValid(), waitClean()
Representing a File On Disk
file attributes
e.g., size
block map
Indexed by logical block number;
each entry maps to a blockID
blockID
access blocks through
the block cache with
getBlock, startFetch,
waitValid, read,
releaseBlock.
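The block map lookup is integer division and remainder. The sketch below assumes a flat array map with no indirect blocks. BlockMapSketch is a hypothetical name, and the block size and blockIDs in the usage note are made up.

```java
class BlockMapSketch {
    final int blockSize;
    final int size;          // file attribute: size in bytes
    final int[] blockMap;    // logical block number -> blockID

    BlockMapSketch(int blockSize, int size, int... blockMap) {
        this.blockSize = blockSize;
        this.size = size;
        this.blockMap = blockMap;
    }

    /** Logical block number that contains the given byte offset. */
    int logicalBlock(int fileOffset) {
        return fileOffset / blockSize;
    }

    /** Position of the byte inside its block. */
    int offsetInBlock(int fileOffset) {
        return fileOffset % blockSize;
    }

    /** blockID to pass to getBlock for the given byte offset. */
    int blockIDFor(int fileOffset) {
        if (fileOffset < 0 || fileOffset >= size) {
            throw new IndexOutOfBoundsException("offset " + fileOffset + " is outside the file");
        }
        return blockMap[logicalBlock(fileOffset)];
    }
}
```

With 8-byte blocks and a map of {7, 3, 12}, byte offset 9 lies in logical block 1, so it is byte 1 of the block with blockID 3.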
“inode”
once upo
n a time
/nin a l
logical
block 0
and far
far away
,/nlived t
logical
block 1
he wise
and sage
wizard.
logical
block 2
Filesystem layout on disk
inode 0
bitmap file
inode 1
root directory
fixed
locations
on disk
11100010
00101101
10111101
wind: 18
0
snow: 62
0
once upo
n a time
/n in a l
10011010
00110001
00010101
allocation
bitmap file
blocks
rain: 32
hail: 48
file
blocks
00101110
00011001
01000100
and far
far away
, lived th
inode
This is a toy example (Nachos).
Filesystem layout on disk
[Figure: the layout from the previous slide, with the on-disk bitmap file and root directory marked X; the inode and file blocks remain.]
Your DeFiler volume is small. You can keep the free block/inode maps in memory. You don’t need metadata structures on disk for that. But you have to scan the disk to rebuild the in-memory structures on initialization.
DeFiler must be able to find all valid inodes on disk.
DeFiler has no directories. You just need to keep track of which DFileIDs are currently valid, and return a list.
Disk layout: the easy way
DeFiler must be able
to find all valid
inodes on disk.
Given a list of valid inodes,
you can determine which
inodes and blocks are free and
which are in use.
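Deriving the in-use maps from the list of valid inodes can be sketched as a single pass. The names InodeSketch and FreeMapSketch are hypothetical, each inode is reduced to a file ID plus its block map, and the blocks that hold the inodes themselves are not modelled. A real layout would mark that reserved region as used too.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical in-memory view of one valid inode found by the disk scan. */
class InodeSketch {
    final int fileID;
    final int[] blockMap;

    InodeSketch(int fileID, int... blockMap) {
        this.fileID = fileID;
        this.blockMap = blockMap;
    }
}

class FreeMapSketch {
    final BitSet usedInodes;
    final BitSet usedBlocks;
    private final int nBlocks;

    /** Rebuild both maps from the valid inodes. Anything not marked is free. */
    FreeMapSketch(List<InodeSketch> validInodes, int maxFiles, int nBlocks) {
        this.usedInodes = new BitSet(maxFiles);
        this.usedBlocks = new BitSet(nBlocks);
        this.nBlocks = nBlocks;
        for (InodeSketch ino : validInodes) {
            if (ino.fileID < 0 || ino.fileID >= maxFiles || usedInodes.get(ino.fileID)) {
                throw new IllegalStateException("bad or duplicate file ID " + ino.fileID);
            }
            usedInodes.set(ino.fileID);
            for (int blockID : ino.blockMap) {
                if (blockID < 0 || blockID >= nBlocks) {
                    throw new IllegalStateException("blockID out of range: " + blockID);
                }
                if (usedBlocks.get(blockID)) {   // sanity check: two files share a block
                    throw new IllegalStateException("block claimed twice: " + blockID);
                }
                usedBlocks.set(blockID);
            }
        }
    }

    /** Allocate the lowest-numbered free block, or return -1 if the volume is full. */
    int allocateBlock() {
        int b = usedBlocks.nextClearBit(0);
        if (b >= nBlocks) {
            return -1;
        }
        usedBlocks.set(b);
        return b;
    }
}
```

The consistency checks are optional, but they are cheap and catch a corrupted volume at initialization instead of during a later write.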
once upo
n a time
/n in a l
file
blocks
and far
far away
, lived th
inode