Internals Of PostgreSQL Checkpoint_and_BufferMgmt

advertisement
CheckPoint Internals
of PostgreSQL
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Buffer Cache Basics in PostgreSQL
• The buffer cache component structures are:
• Buffers
– Each buffer may hold an image of one data block at any
one time
• Buffer headers
– Store metadata about contents of the buffers
– Act as cache management structures
• Buffer Pool
– A freelist of buffers is maintained, from which a buffer is
assigned to a new page request.
Buffer Cache Basics in PostgreSQL
• Hash Table
– Hash table is to get faster the shared buffer for a page if it is
already in shared buffer pool.
– Hash value is calculated as – Oid of tablespace, database,
relation; Filetype, blocknumber.
– Hash table has 16 partitions which will ensure that 16
concurrent access to hash table will be allowed.
Buffer Cache Basics in PostgreSQL
• Usage count in buffer header tracks the usage of buffer based on
which it is decided whether to flush particular buffer when no more
free buffers are available.
• Once freelist buffers are finished, backend runs the clock sweep
algorithm to find a suitable buffer for reuse.
• If that page is dirty, it is written out to disk. This can cause the
backend trying to allocate that buffer to block as it waits for that
write I/O to complete.
• Clock sweep also marks nextvictim buffer which will be used to
traverse the list next time backend wants a buffer for a new page
request.
• A buffer becomes dirty at a particular LSN (Log sequence number)
which is a point in WAL file.
• Dirty page images written by the background writer or checkpoint
makes sure that corresponding part of WAL file is written flushing
page.
Search For Blocks Already In Buffer Pool
Request For block with hash key 42
15
42
Request For block with hash key 156
53
93
Request For block with hash key 93
Request For block with hash key 183
Block not found in existing buffers.
Search for a free buffer for block in buffer freelist.
156
Search for buffer in Freelist Of Buffers
Buffer Pool
bid 1
uc 2
bid 2
uc 3
bid 3
uc 1
bid 4
uc 5
bid 5
uc 2
bid6 6
bid
ucuc1 0
bid
bid77
uc
uc10
FreeList
Get Buffer
No buffer in freelist; run clock sweep on buffer
pool
bid 6
uc 0
bid 7
uc 0
Clock Sweep for buffer in Buffer Pool
Buffer Pool
rcnt 1
uc 2
rcnt 0
uc 3
2
rcnt 1
0
uc 1
0
rcnt 0
uc 4
5
rcnt 2
uc 2
Check if victim buffer doesn’t have
ref count
next
If refcnt and usage
cnt is0;0 then
then move
return to
buff.
buffer
If ref count is 0, then if Usage Cnt
> 0 then decrement it and move to
next buffer.
rcnt 0
uc 0
1
Get Buffer
rcnt 1
0
uc 1
0
Buffer Cache I/O in PostgreSQL
• Buffer contents are flushed at following events
- When backend doesn’t find free buffers in freelist.
- Page writer at predefined interval
- Checkpoint
- due to DDL like Alter Table or Tablespace all buffers having
pages corresponding to that table will be flushed.
• For operations which needs many pages access but don’t reuse
previous page, a different buffer access strategy is used.
• It creates a ring of buffers and then only buffers from the ring will
be used. This limits the usage of buffers for heavy operations.
•
Operations for which this strategy is used are as follows:
Vacuum, Bulk Read (SeqScan), BulkWrite (insert into … select * ..)
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Checkpoint Definition
• A checkpoint is a point in the transaction log sequence at which all
data files have been updated to reflect the information in the log.
• At checkpoint time, all dirty data pages are flushed to disk and a
special point is noted to determine the point in the log from which it
should start the REDO operation during CRASH recovery.
• Checkpoint occurs at following events
•
When predefined log segment gets full
•
Checkpoint timeout interval
•
CHECKPOINT command
Checkpoint Definition
• A background checkpoint process monitors events which can cause
checkpoint and performs required actions.
• To avoid flooding the I/O system with a burst of page writes, page
writer will intermittently flush dirty pages and writing dirty buffers
during a checkpoint can spread over a period of time.
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Background Writer
• There is a separate server process called the background writer,
who does 2 activities :
- Page writing, the functionality is to issue writes of some
“dirty” shared buffers.
- Checkpoint, whose function is to write all “dirty” buffers and
other shared memory buffers (Commit Log buffers,
Subtrasaction pages).
It also remove old WAL files which are now not required.
• It will automatically dispatch a checkpoint after a certain amount of
time has elapsed since the last one
Background Writer
• It can be signaled to perform requested checkpoints as well.
• Backends signal the bgwriter when they fill WAL segments to
perform checkpoint.
• If the bgwriter exits unexpectedly, the postmaster treats that the
same as a backend crash: shared memory may be corrupted, so
remaining backends should be killed by SIGQUIT and then a
recovery cycle started.
Page Writing
• Page writing is to ensure that sessions seldom or never have to
wait for a write to occur, because the background writer will do it.
However, regular backends (sessions) are still empowered to issue
writes if the bgwriter fails to maintain enough clean shared buffers.
• In general due to Page writer there is a net overall increase in I/O
load, because a repeatedly-dirtied page might otherwise be written
only once per checkpoint interval, but the background writer might
write it several times in the same interval.
• ‘bgwriter_delay’ specifies the delay between activity rounds for the
background writer. In each round the writer issues writes for some
number of dirty buffers
• In each round, no more than ‘bgwriter_lru_maxpages’ buffers will
be written by the background writer.
Page Writing Contd..
• The number of dirty buffers written in each round is based on the
number of new buffers that have been needed by server processes
during recent rounds.
• It will also consider the number of reusable buffers from previous
bgwriter run which are accumulated based on extra pages it has
flushed than expected.
• It will start from nextvictim buffer, which will be useful for backends
as chances of getting them useful buffers is high.
• It will flush the blocks one by one. There is no concept of grouping
buffers for flushing.
• Even in cases where there's been little or no buffer allocation
activity, we want to make a small amount of progress through the
buffer cache so that as many reusable buffers as possible are
clean after an idle period.
Checkpoint Implementation
• Internally checkpoint is done at shutdown or at end of recovery.
• To avoid flooding the I/O system with a burst of page writes, writing
dirty buffers during a checkpoint is spread over a period of time.
• That period is controlled by checkpoint_completion_target, which is
given as a fraction of the checkpoint interval.
• With the default value of 0.5, PostgreSQL can be expected to
complete each checkpoint in about half the time before the next
checkpoint starts.
• Top level Checkpoint Design
• In the start of process, it marks the current WAL location as the
Redo Replay location incase of crash recovery.
Checkpoint Implementation Contd..
•
Before flushing data, we must wait for any transactions that are currently
in their commit critical sections.
- If an xact inserted its commit record into XLOG just before the
REDO point, then a crash restart from the REDO point would
not replay that record.
- so this checkpoint flushing better include the xact's update of
pg_clog.
•
Flush all the shared memory data to disk.
•
Write out all dirty buffers in the pool
- Loop over all buffers, and mark the ones that need to be written
with BM_CHECKPOINT_NEEDED.
- This allows us to write only those pages that were dirty when the
checkpoint began.
- Loop over all buffers again, and write the ones marked with
BM_CHECKPOINT_NEEDED. In this loop, we start at the clock
sweep point since we might as well dump soon-to-be-recycled
buffers first.
Checkpoint Implementation Contd..
• Each write (either by backend, pagewrite, checkpoint) will put a
request in Fsync request queue.
• After writing pages in checkpoint, it calls Fsync filewise so that all
blocks of that file will be written to disk together. This will reduce
random I/O.
• It writes Checkpoint Redo record which contains information like
Redo Replay start pos., nextXid, nextOid etc.
• Delete old log files (those no longer needed even for previous
checkpoint.
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Advantages/Disadvantages of PG Implementation
• Advantages
1. To reduce I/O throttle during checkpoint bgwriter intermittently
flushes pages; also during checkpoint operation it sleeps based on
checkpoint progress.
2. It limits the usage of buffers for heavy operations such as
Sequence Scan, Vacuum.
Advantages/Disadvantages of PG Implementation
• Disadvantages
1. Currently bgwriter never puts the buffer in freelist even if it can
identify that the buffer is reusable.
2. Once buffers are over in freelist, it has to always traverse serially the
buffer pool.
3. Bgwriter_delay is at fixed interval rather than activity based.
4. Currently there is no protection for pg_control file, only it is
controlled as the structure is currently 512 bytes which is one sector
write.
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Buffer Cache Basics In Oracle
• The buffer cache component structures are:
•
•
•
•
Buffers
– Each buffer may hold an image of one data block at any
one time
Buffer headers
– Store metadata about contents of the buffers
– Act as cache management structures
Buffer pools
– Collection of buffers used for the same purpose and
managed accordingly
Working set
– All or part of a buffer pool
– Assigned to a DBWn process
Buffer Management
•
•
•
•
•
•
•
Cached buffers managed by doubly linked lists:
REPL
– Buffers containing block images being used
REPL-AUX
– Buffers ready to be used for I/O or CR build
WRITE and CKPT-Q
– Dirty Buffers requiring I/O. Two different queues are
maintained for write buffers as checkpoint queue ensure
buffers will be written in order or lowest RBA.
WRITE-AUX
– Dirty Buffers with I/O in progress
Touch count is used to decide the initial insertion location in the
REPL chain
Buffers are moved from main to auxiliary list by DBWn processes
to avoid unnecessary scans
Hash Buckets
• Hash value of each block calculated from
– Data Block Address (DBA)
– Block Class
• Number of hash buckets dependent on number of buffers in cache
e.g.
# buffers
# hash buckets
500
6000
64
1024
• Each hash bucket contains
– Cache Buffers Chains latch
– Pointer to array of double linked lists
Hash Buckets
# hash chains
BH
BH
cache
buffers
chain
latch
BH
BH
Buffer Headers
• Each buffer header describes contents of one buffer
• All buffers accessed via buffer header
• Buffer header contains pointers to
– Buffer
– Cache Buffers Chains latch
• Buffer header includes double linked lists for
– Cache Buffers Chain list
– Replacement list
– Users list
– Waiters list
Replacement List
•
•
•
•
•
•
•
In Oracle 8.1.5 and above a mid-point insertion algorithm is used
Buffer cache has a hot end and a cold end
Buffers are inserted at mid-point
Mid-point is head of cold end
Starts at hot end - moves down cache
Maximum mid-point determined by _db_percent_hot_default
Default value is 50%
Head of
Hot End
Hot End
Head of
Cold End
Replacement List
Cold End
Touch Count
• Each buffer header maintains
– touch count
– timestamp
• Touch count represents number of 3 second intervals in which
buffer has been accessed since
– buffer last read into cache
– touch count last reset
• Each time buffer is accessed
– if timestamp more than 3 seconds ago; increment touch
count
– set timestamp to current time
Touch Count
• When buffer reaches tail of cold end
– If touch count >= 2 then
– buffer is moved to hot end
– Otherwise used as next free buffer
• Hot criteria determined by
–
_db_aging_hot_criteria
–
default value is 2 touches
Single-Block Reads
Head of
Hot End
92
71
42
34
92
71
72
34
92
45
72
34
52
42
45
33
72
87
11
71
52
42
45
33
72
11
66
71
52
42
45
33
11
49
66
71
52
42
45
11
0
3
0
3
4
0
2
4
2
4
1
2
1
4
0
2
1
0
1
2
Read Block 87
11
42
33
34
Get
Insert
Update
Move
Set
Gettouch
Insert
first
first
block
buffer
buffer
touch
buffer
available
available
count
71
42
at
atcount
contents
to
head
head
head
buffer
buffer
of
of cold
cold
from
end
for
of
on
from
end
hot
block
block
cold
cold
end42
71
42
34
end
end
to zero
STOP
Block
Number
Head of
Cold End
42
33
87
11
1
Touch
Count
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Checkpoint In Oracle
• The point of Oracle checkpoints is to synchronize all datafiles,
some datafiles or some objects to a point in time for consistency,
performance and recoverability purposes.
• The Database Writer Process
The Database Writer (DBWR) process is responsible for writing
data from the RAM buffer cache to the database datafiles on disk.
The default is to have one database writer process, but large
databases can have multiple DBWR processes.
• The Checkpoint Process
The checkpoint process is responsible for updating file headers in
the database datafiles. A checkpoint occurs when Oracle moves
new or updated blocks (called dirty blocks) from the RAM buffer
cache to the database datafiles.
Dbwriter (DBWn) Process
• If an Oracle user process searches the threshold limit of buffers
without finding a free buffer, the process stops searching the LRU
list and signals the DBW0 background process to write some of the
dirty buffers to disk.
• To allow buffer replacements to occur, it is necessary to write cold
dirty buffers.Such writes are referred to as aging writes and are
performed by DBWn processes.
• The DBWn processes work to write enough cold buffers to
guarantee a uniform supply of free buffers for replacement
purposes.
• Do forever
{
Scan lru list;
Scan checkpoint queue;
Accumulate batch of buffers to write;
Issue writes;
Wait for completion of writes;
}
Multiple Dbwriters
• A mechanism for
increasing write
throughput.
• Buffer cache is partitioned
between dbwriters by
working sets.
• Each DBWn process
scans its own assigned
working sets.
Set 0
Set 1
Set 2
DBW
0
Set 3
DBW
1
Writes to Datafiles
Why Multiple Dbwriters
• Reasons to consider multiple dbwriters include high wait times for:
- Free buffer waits (Cache not cleaned fast enough)
- Logfile switch: Checkpoint incomplete (checkpoint not
advanced out of previous log, blocking redo
generation)
• Also consider multiple dbwriters if a single dbwriter consumes 100%
of a CPU.
• It is useful when the transaction rates are high or when the buffer
cache size is so large that a single DBWn process cannot keep up
with the load.
• Free buffer waits due to slow writes to saturated disks will not go
away with multiple dbwriters.
• A good indication is time spent on “db file sequential read”
(foreground reads)
If this time is high then bottleneck is disk not the number of dbwriters.
Types Of Checkpoint
• Several types of checkpoint exist
• Full Checkpoint
• Thread Checkpoint
• File Checkpoint
• Object “Checkpoint”
• Parallel Query Checkpoint
• Incremental Checkpoint
• Log Switch Checkpoint
Incremental Checkpoint
• The objective was to make the algorithm scalable to very large
buffer caches, and to facilitate frequent checkpointing for fast crash
recovery.
• Writes the contents of “some” dirty buffers to the database from
CKPT-Q
• Checkpoint RBA updated in SGA
• Every 3 seconds CKPT calculates the checkpoint target RBA based
on:
▫ The most current RBA
▫ log_checkpoint_timeout
▫ log_checkpoint_interval
▫ fast_start_mttr_target
▫ fast_start_io_target
▫ 90% of the size of the smallest online redo log file
• LGWR flushes the redo buffer up to the target RBA
Incremental Checkpoint
• DBWR writes dirty buffers from CKPT-Q up to the target RBA
▫ Moves batches of buffers to WRITE-AUX
▫ Does the write
▫ Moves the buffers to REPL-AUX. They can be safely reused
• CKPT updates the control file with the checkpoint progress
• Since it is performed continuously, the value of the CKPT RBA will
be much closer to the tail of the log than the RBA of a “conventional”
checkpoint, thus limiting the amount of recovery needed.
Buffer Ckpt Queue
• Buffer Checkpoint Queue (BCQ)
– Ordered by RBA of first change (first-dirty or low RBA)
– Buffer is linked into BCQ when first dirtied
– Writes in BCQ order advance thread checkpoint
Buffers
b1
b2
b3
b4
Redo Thread
c0
(Thread Checkpoint)
c1
c2
c3
c4
Buffer Ckpt Queue
Each write in BCQ order advances the checkpoint
b2
b1
b3
b4
b2
b3
b4
(Write b1)
c0
c1
c2 c3
c4
c1
c2 c3
c4
(Write b2)
b4
b3
b4
(Write b3)
c3
c4
c2 c3
c4
Full Checkpoint
• Writes block images to the database for all dirty buffers from all
instances
• Caused by:
– Alter system checkpoint [global]
– Alter database begin backup
– Alter database close
– Shutdown
• Controlfile and Datafile headers are updated
– CHECKPOINT_CHANGE#
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Advantages & Disadvantages of Oracle
Implementation
•
Advantages
1.
It uses multiple lists to avoid unnecessary scanning of free
buffers.
It can scale to having many buffers in memory
It has checksum mechanism for data blocks to protect from disk
damage issues.
Flexibility of multiple writters to avoid bottlenecks and maximize
I/O usage
Disadvantages
If the block gets corrupted due to partial write, it has to use
backup copy and incremental logs to recover database.
2.
3.
4.
5.
6.
Buffer Cache Basics in PostgreSQL
Checkpoint Definition
Checkpoint Implementation in PostgreSQL
Advantages & Disadvantages of PostgreSQL Implementation
Buffer Cache Basics in Oracle
Checkpoint Implementation in Oracle
Advantages & Disadvantages of Oracle Implementation
Improvements in PostgreSQL
Improvements in PostgreSQL
1.
2.
3.
4.
5.
6.
Can improve buffer management so that it can scale to large
number of buffers and reduce buffer scanning.
Double-write using temp file for page corruption during
checkpoint.
Checksum in pages to ensure if page is corrupted in disk due to
disk torn or any other similar reason.
Bgwriter to write based on buffer usage, not only at fixed intervals.
Have the mechanism to recover pg_control file from disk
corruption.
Can have Incremental checkpoint strategy to improve recovery
time and make database scale to large number of buffers.
Thank You
Download