PostgreSQL High Performance
Cookbook
Mastering query optimization, database monitoring, and
performance-tuning for PostgreSQL
Chitij Chauhan
Dinesh Kumar
BIRMINGHAM - MUMBAI
PostgreSQL High Performance Cookbook
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: March 2017
Production reference: 1240317
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78528-433-5
www.packtpub.com
Credits
Authors: Chitij Chauhan, Dinesh Kumar
Reviewers: Baji Shaik, Feng Tan
Commissioning Editor: Dipika Gaonkar
Acquisition Editor: Nitin Dasan
Content Development Editor: Mayur Pawanikar
Technical Editor: Dinesh Pawar
Copy Editor: Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Aishwarya Gangawane
Production Coordinator: Shraddha Falebhai
About the Authors
Chitij Chauhan currently works as a senior database administrator at an IT-based MNC in
Chandigarh. He has over 10 years of work experience in the field of database and system
administration, with specialization in MySQL clustering, PostgreSQL, Greenplum, Informix,
DB2, SQL Server 2008, Sybase, and Oracle. He is a leading expert in the area of database
security, with expertise in database security products such as IBM InfoSphere Guardium,
Oracle Database Vault, and Imperva.
Dinesh Kumar is an enthusiastic open source developer and has written several open
source tools for PostgreSQL. He recently announced pgBucket, a brand new job scheduler
for PostgreSQL. He is also a frequent blogger at manojadinesh.blogspot.com, where he
talks more about PostgreSQL. He is currently working as a senior database engineer at
OpenSCG, building PostgreSQL cloud operations. He has more than 6 years of
experience as an Oracle and PostgreSQL database administrator and developer, and is
currently focusing on PostgreSQL.
Thanks to my loving parents, Sreenivasulu and Vanamma, who raised me in a small
village called Viruvuru, which is in the Nellore district of India. Thanks to my loving wife,
Manoja, who enlightens my life with her wonderful support. Also, thanks to my friend Baji
Shaik and coordinators, Mayur Pawanikar and Nitin Dasan, for their excellent support.
Finally, thanks to every PostgreSQL contributor, author, and blogger.
About the Reviewers
Baji Shaik is a database administrator and developer. He is a co-author of PostgreSQL
Development Essentials and has tech-reviewed Troubleshooting PostgreSQL by Packt
Publishing. He is currently working as a database consultant at OpenSCG. He has an
engineering degree in telecommunications, and started his career as a C# and Java
developer. Baji started working with databases in 2011, and over the years he has worked
with Oracle, PostgreSQL, and Greenplum. His background spans the length and breadth of
expertise and experience in SQL/NoSQL database technologies. He has good knowledge of
automation, orchestration, and DevOps in a cloud environment. He likes to watch movies,
read books, and write technical blogs. He also loves to spend time with family. Baji is a
certified PostgreSQL professional.
Feng Tan is from China. His nickname is Francs. He was a PostgreSQL DBA at SkyMobi
(NASDAQ: MOBI) for more than 5 years, where he was maintaining more than 100
PostgreSQL instances. He gave presentations at the China PostgreSQL conference on topics
such as Oracle VS PostgreSQL and PostgreSQL 9.4 new features.
Feng Tan likes to share PostgreSQL technology in his blog at http://francs3.blog.163.com/.
He is also one of the translators of PostgreSQL 9 Administration Cookbook Chinese
Edition.
Currently, he serves as the open source database administrator at China Mobile Group
Zhejiang Co. Ltd.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1785284339.
If you'd like to join our team of regular reviewers, you can e-mail us at
customerreviews@packtpub.com. We award our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving our
products!
Table of Contents
Preface
Chapter 1: Database Benchmarking
Introduction
CPU benchmarking
Getting ready
How to do it…
Phoronix
sysbench
How it works…
Phoronix
sysbench
Memory benchmarking
Getting ready
How to do it…
Phoronix
tmpfs
Write test
Read test
How it works…
Disk benchmarking
Getting ready
How to do it…
Phoronix
bonnie++
How it works…
bonnie++
Performing a seek rate test
Getting ready
How to do it…
How it works…
Working with the fsync commit rate
Getting ready
How to do it…
How it works…
Checking IOPS
Getting ready
How to do it…
Sequential mixed read and write
Random mixed read and write
How it works…
Storage sizing
Getting ready
How to do it…
How it works…
Discussing RAID levels
Getting ready
How to do it…
How it works…
RAID 0
RAID 1
RAID 5
RAID 6
RAID 10
Configuring pgbench
Getting ready
How to do it…
How it works…
Running read/write pgbench test cases
Getting ready
How to do it…
Read-only
Write-only
How it works…
Chapter 2: Server Configuration and Control
Introduction
Starting the server manually
Getting ready
How to do it…
How it works…
Stopping the server quickly
Getting ready
How to do it…
How it works…
Stopping the server in an emergency
How to do it…
How it works…
Reloading server configuration
Getting ready
How to do it…
How it works…
Restarting the database server quickly
How to do it…
How it works…
Tuning connection-related parameters
How to do it…
How it works…
Tuning query-related parameters
How to do it…
How it works…
Tuning logging-related parameters
How to do it…
How it works…
Chapter 3: Device Optimization
Introduction
Understanding memory units in PostgreSQL
Getting ready
How to do it…
shared_buffers
temp_buffers
work_mem
maintenance_work_mem
wal_buffers
max_stack_depth
effective_cache_size
How it works…
Handling Linux/Unix memory parameters
Getting ready
How to do it…
kernel.shmmax
kernel.shmall
kernel.shmmni
vm.swappiness
vm.overcommit_memory
vm.overcommit_ratio
vm.dirty_background_ratio
vm.dirty_ratio
How it works…
CPU scheduling parameters
Getting ready
How to do it…
How it works…
Identifying checkpoint overhead
Getting ready
How to do it…
How it works…
Analyzing buffer cache contents
Getting ready
How to do it…
How it works…
kernel.sched_autogroup_enabled
kernel.sched_min_granularity_ns
kernel.sched_latency_ns
kernel.sched_wakeup_granularity_ns
kernel.sched_migration_cost_ns
How it works…
Disk tuning parameters
Getting ready
How to do it…
CFQ
noop
Deadline
Chapter 4: Monitoring Server Performance
Introduction
Monitoring CPU usage
Getting ready
How to do it…
How it works…
Monitoring paging and swapping
Getting ready
How to do it…
How it works…
Tracking CPU consuming processes
Getting ready
How to do it…
How it works…
Monitoring CPU load
How to do it…
How it works…
Identifying CPU bottlenecks
How to do it…
How it works…
Identifying disk I/O bottlenecks
How to do it…
How it works…
Monitoring system load
How to do it…
How it works…
Tracking historical CPU usage
Getting ready
How to do it…
How it works…
There's more…
Tracking historical memory usage
Getting ready
How to do it…
How it works…
Monitoring disk space
How to do it…
How it works…
Monitoring network status
Getting ready
How to do it…
How it works…
Chapter 5: Connection Pooling and Database Partitioning
Introduction
Installing pgpool-II
Getting Ready
How to do it…
How it works…
Configuring pgpool and testing the setup
Getting ready
How to do it…
How it works…
There's more…
Installing PgBouncer
Getting ready
How to do it…
How it works…
Connection pooling using PgBouncer
Getting ready
How to do it…
How it works…
There's more…
Managing PgBouncer
Getting ready
How to do it…
How it works…
Implementing partitioning
Getting ready
How to do it…
How it works…
Managing partitions
Getting ready
How to do it…
How it works…
Installing PL/Proxy
How to do it…
How it works…
Partitioning with PL/Proxy
Getting ready
How to do it…
How it works…
Chapter 6: High Availability and Replication
Introduction
Setting up hot streaming replication
Getting ready
How to do it…
How it works…
Replication using Slony
Getting ready
How to do it…
How it works…
Replication using Londiste
Getting ready
How to do it…
How it works…
Replication using Bucardo
Getting ready
How to do it…
How it works…
Replication using DRBD
Getting ready
How to do it…
How it works…
Setting up a Postgres-XL cluster
Getting ready
How to do it…
Chapter 7: Working with Third-Party Replication Management Utilities
Introduction
Setting up Barman
Getting ready
How to do it…
How it works…
Backup and recovery using Barman
Getting ready
How to do it…
How it works…
Setting up OmniPITR
Getting ready
How to do it…
How it works…
WAL management with OmniPITR
Getting ready
How to do it…
How it works…
Setting up repmgr
How to do it…
How it works…
Using repmgr to create replica
How to do it…
How it works…
Setting up walctl
How to do it…
How it works…
Using walctl to create replica
How to do it…
How it works…
Chapter 8: Database Monitoring and Performance
Introduction
Checking active sessions
Getting ready
How to do it…
How it works…
There's more…
Finding out what the users are currently running
Getting ready
How to do it…
How it works…
Finding blocked sessions
Getting ready
How to do it…
How it works…
There's more…
Transactional locks
Table level locks
Prepared transaction locks
Usage of SKIP LOCKED
Dealing with deadlocks
Getting ready
How to do it…
Using FOR UPDATE
Advisory locks
Table access statistics
Getting ready
How to do it…
pg_stat_user_tables
pg_statio_user_tables
How it works…
Logging slow statements
Getting ready
How to do it…
How it works…
Determining disk usage
Getting ready
How to do it…
Column size
Relation size
Database size
Tablespace size
How it works…
Preventing page corruption
Getting ready
How to do it…
How it works…
There's more…
Routine reindexing
Getting ready
How to do it…
How it works…
There's more…
Generating planner statistics
Getting ready
How to do it…
How it works…
null_frac
avg_width
n_distinct
most_common_vals
most_common_freqs
histogram_bounds
correlation
most_common_elems
most_common_elem_freqs
Tuning with background writer statistics
Getting ready
How to do it…
How it works…
checkpoints_timed
checkpoints_req
checkpoint_write_time
checkpoint_sync_time
buffers_checkpoint
buffers_clean
max_written_clean
buffers_backend
buffers_backend_fsync
buffers_alloc
stats_reset
There's more…
Chapter 9: Vacuum Internals
Introduction
Dealing with bloating tables and indexes
Getting ready
How to do it…
How it works…
Vacuum and autovacuum
Getting ready
How to do it…
How it works…
xmin
xmax
Autovacuum
Freezing and transaction ID wraparound
Getting ready
How to do it…
How it works…
Fixing transaction ID wraparound
Preventing transaction ID wraparound
Monitoring vacuum progress
Getting ready
How to do it…
How it works…
Control bloat using transaction age
Getting ready
How to do it…
How it works…
Chapter 10: Data Migration from Other Databases to PostgreSQL and
Upgrading the PostgreSQL Cluster
Introduction
Using pg_dump to upgrade data
Getting ready
How to do it…
How it works…
Using the pg_upgrade utility for version upgrade
Getting ready
How to do it…
How it works…
Replicating data from other databases to PostgreSQL using
Goldengate
Getting ready
How to do it…
How it works…
There's more…
Chapter 11: Query Optimization
Introduction
Using sample data sets
Getting ready
How to do it…
How it works…
Timing overhead
Getting ready
How to do it…
How it works…
Studying hot and cold cache behavior
Getting ready
How to do it…
Cold cache
Hot cache
How it works…
Clearing the cache
Getting ready
How to do it…
How it works…
There's more…
Query plan node structure
Getting ready
How to do it…
How it works…
Generating an explain plan
Getting ready
How to do it…
How it works…
There's more…
Computing basic cost
Getting ready
How to do it…
How it works…
Sequential scan
Index scan
Phase I
Phase II
Running sequential scans
Getting ready
How to do it…
How it works…
Running bitmap heap and index scan
Getting ready
How to do it…
How it works…
Aggregate and hash aggregate
Getting ready
How to do it…
Aggregate
Hash aggregate
How it works…
Aggregate
State transition functions
Hash aggregates
Running CTE scan
Getting ready
How to do it…
How it works…
CTE scan
Recursive union
Nesting loops
Getting ready
How to do it…
How it works…
Phase I
Phase II
Working with hash and merge join
Getting ready
How to do it…
Hash join
Merge join
How it works…
Hash join
Phase I
Phase II
Merge join
Phase I
Phase II
Grouping
Getting ready
How to do it…
How it works…
Working with set operations
Getting ready
How to do it…
How it works…
INTERSECT
EXCEPT
UNION
Working on semi and anti joins
Getting ready
How to do it…
How it works…
Chapter 12: Database Indexing
Introduction
Measuring query and index block statistics
Getting ready
How to do it…
How it works…
idx_scan
idx_tup_read
idx_tup_fetch
idx_blks_read
idx_blks_hit
Index lookup
Getting ready
How to do it…
How it works…
Index scan
Bitmap heap scan
Index only scan
Comparing indexed scans and sequential scans
Getting ready
How to do it...
How it works…
Sequential scan
Index scan
Index only scan
Bitmap heap scan
There's more…
Clustering against an index
Getting ready
How to do it…
How it works…
There's more…
Concurrent indexes
Getting ready
How to do it…
How it works…
Combined indexes
Getting ready
How to do it…
How it works…
Partial indexes
Getting ready
How to do it…
How it works…
Finding unused indexes
Getting ready
How to do it…
How it works…
Forcing a query to use an index
Getting ready
How to do it…
How it works…
Detecting a missing index
Getting ready
How to do it…
How it works…
Index
Preface
PostgreSQL is one of the most powerful and easy-to-use database management systems. It
has strong support from the community and is being actively developed with a new release
every year. PostgreSQL supports the most advanced features included in SQL standards. It
also provides NoSQL capabilities and very rich data types and extensions. All of this makes
PostgreSQL a very attractive solution in software systems. If you run a database, you want
it to perform well and you want to be able to secure it. As the world's most advanced open
source database, PostgreSQL has unique built-in ways to achieve these goals. This book will
show you a multitude of ways to enhance your database's performance and give you
insights into measuring and optimizing a PostgreSQL database to achieve better
performance.
What this book covers
Chapter 1, Database Benchmarking, deals with benchmarking a few major system components,
such as CPU, disk, and IOPS, besides database benchmarking. In this
chapter, we will also discuss a benchmarking framework called phoronix, along with disk RAID
levels.
Chapter 2, Server Configuration and Control, deals with how to control the PostgreSQL
instance behavior with the help of a few major configuration parameter settings.
Chapter 3, Device Optimization, discusses CPU, disk, and memory-related parameters.
Besides this, we will be discussing memory components of PostgreSQL along with how to
analyze the buffer cache contents.
Chapter 4, Monitoring Server Performance, discusses various Unix/Linux-related operating
system utilities that can help the DBA in performance analysis and troubleshooting issues.
Chapter 5, Connection Pooling and Database Partitioning, covers connection pooling methods
such as pgpool and pgbouncer. Also, we will be discussing a few partitioning techniques
that PostgreSQL offers.
Chapter 6, High Availability and Replication, is about various high availability and
replication solutions, including some popular third-party replication tools such as Slony,
Londiste, and Bucardo. Also, we will be discussing how to set up the PostgreSQL XL
cluster.
Chapter 7, Working with Third-Party Replication Management Utilities, discusses different
third-party replication management tools, such as repmgr. Also, this chapter covers a
backup management tool called Barman, along with WAL management tools, such
as walctl.
Chapter 8, Database Monitoring and Performance, shows different aspects of how and what to
monitor in the PostgreSQL instance to achieve better performance. Also, in this chapter,
we will be discussing a few troubleshooting techniques that will help the DBA team.
Chapter 9, Vacuum Internals, is about MVCC and how to handle PostgreSQL's transaction
wraparound issues. Also, we will see how to control the bloat using snapshot threshold
settings.
Chapter 10, Data Migration from Other Databases to PostgreSQL and Upgrading PostgreSQL
Cluster, covers heterogeneous replication between Oracle and PostgreSQL using
Goldengate.
Chapter 11, Query Optimization, discusses the functionality of the PostgreSQL query planner.
Besides this, we will be discussing several query processing algorithms with examples.
Chapter 12, Database Indexing, covers the various index lookup techniques that PostgreSQL follows.
Besides this, we will be discussing various index management methods, such as how to find
unused or missing indexes in the database.
What you need for this book
In general, a modern Unix-compatible platform should be able to run PostgreSQL. To make
the most out of this book, you also require CentOS 7. The minimum hardware required to
install and run PostgreSQL is as follows:
1 GHz processor
1 GB of RAM
512 MB of HDD
Who this book is for
If you are a developer or administrator with limited PostgreSQL knowledge and want to
develop your skills with this great open source database, then this book is ideal for you.
Learning how to enhance database performance is always an exciting topic,
and this book will show you plenty of ways to do so.
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to
do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows.
Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software
or any preliminary settings required for the recipe.
How to do it...
This section contains the steps required to follow the recipe.
How it works...
This section usually consists of a detailed explanation of what happened in the previous
section.
There's more...
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "In Linux,
tmpfs is a temporary filesystem, which uses the RAM rather than the disk storage."
Any command-line input or output is written as follows:
$ phoronix-test-suite benchmark pts/memory
Phoronix Test Suite v6.8.0
Installed: pts/ramspeed-1.4.0
To Install: pts/stream-1.3.1
To Install: pts/cachebench-1.0.0
New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "Select the product pack
Fusion Middleware and the Linux x86-64 platform. Click on the Go button."
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book-what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention the
book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the book's
name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/PostgreSQL-High-Performance-Cookbook. We also have
other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/.
Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.
We appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us
at questions@packtpub.com, and we will do our best to address the problem.
1
Database Benchmarking
In this chapter, we will cover the following recipes:
CPU benchmarking
Memory benchmarking
Disk benchmarking
Performing a seek rate test
Working with the fsync commit rate
Checking IOPS
Storage sizing
Discussing RAID levels
Configuring pgbench
Running read/write pgbench tests
Introduction
PostgreSQL is renowned in the database management system world. With every
PostgreSQL release, it's gaining in popularity due to its advanced features and performance.
This cookbook is especially designed to give more information about most of the major
features in PostgreSQL, and also how to achieve good performance with the help of proper
hardware/software benchmarking tools. This cookbook is also designed to discuss, all the
high availability options we can achieve with PostgreSQL, and also give some details about
how to migrate your database from other commercial databases.
To benchmark the database server, we need to benchmark several hardware/software
components. In this chapter, we will discuss major tools that are especially designed to
benchmark a certain component.
I would like to say thanks to the Phoronix Test Suite team for allowing me to discuss their
benchmarking tool. Phoronix is an open source benchmarking framework, which, by
default, provides test cases for several hardware/software components, thanks to its
extensible architecture, where we can write our own test suite with a set of benchmarking
test cases. Phoronix also supports uploading your benchmarking results to
http://openbenchmarking.org/, which is a public/private benchmark results repository where we can
compare our benchmarking results with others.
Go to the following URL for installation instructions for Phoronix Test
Suite:
http://www.phoronix-test-suite.com/?k=downloads.
CPU benchmarking
In this recipe, let's discuss how to benchmark the CPU speed using various open source
benchmarking tools.
Getting ready
One of the ways to benchmark CPU power is by measuring the wall clock time for a
submitted task. The task can be something like calculating the factorial of a given number,
calculating the nth Fibonacci number, or some other CPU-intensive task.
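As a quick, hedged illustration of the wall clock approach (a minimal sketch rather than a real benchmark), we can time any purely CPU-bound task from the shell; here we sum the first 10 million integers in awk, and the real value reported by time is the wall clock measurement:
$ time awk 'BEGIN { s = 0; for (i = 1; i <= 10000000; i++) s += i; print s }'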
How to do it…
Let us discuss how to configure the phoronix and sysbench tools to benchmark the CPU:
Phoronix
Phoronix supports a set of CPU tests in a test suite called CPU. This test suite covers
multiple CPU-intensive tasks, which are mentioned at the following URL:
https://openbenchmarking.org/suite/pts/CPU.
[8]
Database Benchmarking
If you want to run this CPU test suite, then you need to execute the phoronix-test-suite
benchmark cpu command as the root user. We can also run a specific test by mentioning its
test name. For example, let's run a sample CPU benchmarking test as follows:
$ phoronix-test-suite benchmark pts/himeno
Phoronix Test Suite v6.8.0
To Install: pts/himeno-1.2.0
...
1 Test To Install
pts/himeno-1.2.0:
Test Installation 1 of 1
1 File Needed
Downloading: himenobmtxpa.tar.bz2
Started Run 2 @ 05:53:40
Started Run 3 @ 05:54:35 [Std. Dev: 1.66%]
Test Results:
1503.636072
1512.166077
1550.985494
Average: 1522.26 MFLOPS
Phoronix also provides a way to view the detailed test results as an HTML file. It also
supports the offline generation of PDF, JSON, CSV, and text format outputs. To open these
test results in the browser, we need to execute the following command:
$ phoronix-test-suite show-result <Test Name>
The following is a sample screenshot of the results of the preceding command:
[9]
Database Benchmarking
sysbench
The sysbench tool provides a CPU test, which calculates the prime numbers within a given
range and reports the elapsed CPU time. Let's execute the sysbench command as follows
to retrieve the CPU measurements:
[root@localhost ~]# sysbench --test=cpu --cpu-max-prime=10000 --num-threads=4 run
Doing CPU performance benchmark
Threads started!
Done.
Maximum prime number checked in CPU test: 10000
Test execution summary:
    total time:                          3.2531s
    total number of events:              10000
    total time taken by event execution: 13.0040
    per-request statistics:
         min:                            1.10ms
         avg:                            1.30ms
         max:                            8.60ms
         approx. 95 percentile:          1.43ms
Threads fairness:
    events (avg/stddev):           2500.0000/8.46
    execution time (avg/stddev):   3.2510/0.00
How it works…
The preceding results were collected from CentOS 7, which was running virtually on a
Windows 10 machine. The virtual machine has four processing units (CPU cores) from an Intel
Core i7-4510U of CPU family 6.
Phoronix
The URL http://openbenchmarking.org/ provides a detailed description of each test
along with its implementation; I would encourage you to read more information
about the himeno test case there.
sysbench
From the previous results, the system took 3.2531 seconds to check the prime numbers up to
10,000, with the help of four background threads.
Memory benchmarking
In this recipe, we will be discussing how to benchmark the memory speed using open
source tools.
Getting ready
As with the CPU test suite, phoronix supports another memory test suite, which covers
RAM benchmarking. Otherwise, we can use a dedicated memtest86 benchmarking tool,
which performs memory benchmarking during the server boot-up phase. Another neat trick
would be to create a tmpfs mount point in RAM and then create a tablespace on it in
PostgreSQL. Once we create the tablespace, we can then create in-memory tables, where we
can benchmark the table read/write operations. We can also use the dd command to
measure the memory read/write operations.
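As a rough, hedged sketch of the dd approach mentioned above (the file path and sizes are arbitrary choices), we can write and then read a 1 GB file under an existing tmpfs mount such as /dev/shm and let dd report the throughput:
$ dd if=/dev/zero of=/dev/shm/ddtest bs=1M count=1024    # memory write throughput
$ dd if=/dev/shm/ddtest of=/dev/null bs=1M               # memory read throughput
$ rm -f /dev/shm/ddtest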
How to do it…
Let us discuss how to install phoronix and how to configure the tmpfs mount point in
Linux:
Phoronix
Let's execute the following phoronix command, which will install the memory test suite and
perform the memory benchmarking. Once the benchmarking is completed, observe the
HTML report as described earlier:
$ phoronix-test-suite benchmark pts/memory
Phoronix Test Suite v6.8.0
Installed: pts/ramspeed-1.4.0
To Install: pts/stream-1.3.1
To Install: pts/cachebench-1.0.0
tmpfs
In Linux, tmpfs is a temporary filesystem, which uses the RAM rather than the disk
storage. Anything we store in tmpfs will be cleared once we restart the system:
Refer to the following URLs for more information about tmpfs:
https://en.wikipedia.org/wiki/Tmpfs and
https://www.jamescoyle.net/knowledge/1659-what-is-tmpfs.
Let's create a new mount point based on tmpfs using the following command:
# mkdir -p /memmount
# mount -t tmpfs -o size=1g tmpfs /memmount
# df -kh -t tmpfs
Filesystem  Size  Used  Avail  Use%  Mounted on
tmpfs       1.9G   96K   1.9G    1%  /dev/shm
tmpfs       1.9G  8.9M   1.9G    1%  /run
tmpfs       1.9G     0   1.9G    0%  /sys/fs/cgroup
tmpfs       1.0G     0   1.0G    0%  /memmount
Let's create a new folder in memmount and assign it to the tablespace.
# mkdir -p /memmount/memtabspace
# chown -R postgres:postgres /memmount/memtabspace/
postgres=# CREATE TABLESPACE memtbs LOCATION '/memmount/memtabspace';
CREATE TABLESPACE
postgres=# CREATE TABLE memtable(t INT) TABLESPACE memtbs;
CREATE TABLE
Write test
postgres=# INSERT INTO memtable VALUES(generate_series(1, 1000000));
INSERT 0 1000000
Time: 1372.763 ms
postgres=# SELECT pg_size_pretty(pg_relation_size('memtable'::regclass));
 pg_size_pretty
----------------
 35 MB
(1 row)
From the preceding results, inserting 1 million records took approximately 1.4 seconds,
writing 35 MB of table data (roughly 25 MB per second).
Read test
postgres=# SELECT COUNT(*) FROM memtable;
  count
---------
 1000000
(1 row)
Time: 87.333 ms
From the preceding results, reading the 1 million records took approximately 90
milliseconds, a reading speed of roughly 385 MB per second, which is pretty fast for the local
system configuration. The preceding read test was performed after clearing the system
cache and restarting the PostgreSQL instance, which avoids the system buffers.
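A minimal sketch of that cache-clearing step, run as root and assuming a systemd service named postgresql (the service name is an assumption; adjust it to your installation):
# systemctl stop postgresql
# sync && echo 3 > /proc/sys/vm/drop_caches    # flush, then drop the page cache, dentries, and inodes
# systemctl start postgresql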
How it works…
In the preceding tmpfs example, we created an in-memory table, so all the read/write system
calls PostgreSQL performs on the data act directly on memory rather than the disk, which
gives a major performance boost. Also, we need to remember to drop these in-memory
tablespaces and tables after testing, since these objects will physically vanish after a system
reboot.
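A hedged clean-up sketch for the objects created above, run as the postgres user, followed by unmounting the tmpfs mount point as root:
$ psql -c "DROP TABLE memtable;"
$ psql -c "DROP TABLESPACE memtbs;"
# umount /memmount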
Disk benchmarking
In this recipe, we will be discussing how to benchmark the disk speed using open source
tools.
Getting ready
The well-known command for performing disk I/O benchmarking is dd. We all use the dd
command to measure read/write operations by specifying the required block size, and we can
also measure direct I/O by skipping the system write buffers. Similarly, phoronix
supports a complete test suite for the disk, as it does for the CPU and memory, which performs different
storage-related tests. Another famous disk benchmarking tool is bonnie++, which provides
more flexibility in measuring the disk I/O.
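As a hedged example of the dd approach (the target path and sizes are arbitrary), the oflag=direct option skips the system write buffers so that dd reports the raw disk write speed:
$ dd if=/dev/zero of=/tmp/dd_disk_test bs=8k count=131072 oflag=direct    # 1 GB written directly in 8 KB blocks
$ rm -f /tmp/dd_disk_test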
How to do it…
Let us discuss how to run disk benchmarking using the phoronix and bonnie++
testing tools:
Phoronix
To run the complete disk test suite on the system, run the following command:
$ phoronix-test-suite benchmark pts/disk
Phoronix also supports a quick I/O test case, where you can perform an instant disk
performance test using the following command, which interactively collects the
input and then runs the test cases:
$ phoronix-test-suite benchmark pts/iozone
Phoronix Test Suite v6.8.0
Installed: pts/iozone-1.8.0
Disk Test Configuration
1: 4Kb
2: 64Kb
3: 1MB
4: Test All Options
Record Size: 1
1: 512MB
2: 2GB
3: 4GB
4: 8GB
5: Test All Options
File Size: 1
1: Write Performance
2: Read Performance
3: Test All Options
Disk Test: 3
bonnie++
bonnie++ is a filesystem- and disk-level benchmarking tool that can perform the same test
multiple times. You can install this tool using either yum or apt-get, or by building it
from the source code. Let's run a bulk I/O test case using the following arguments, where it
tries to create 8 GB files:
$ /usr/local/sbin/bonnie++ -D -d /tmp/ -s 8G -b
Writing with putc()...done
Writing intelligently...done
...
localhost.localdomain,8G,68996,106,14151,53,46772,15,95343,93,123633,16,201
.0,7,16,795,58,+++++,+++,733,46,757,57,+++++,+++,592,38
How it works…
Let us discuss how bonnie++ performs the benchmarking, and what tools
bonnie++ offers to help understand the benchmarking results:
bonnie++
In the preceding test case, we instructed bonnie++ to use only direct I/O
using the -D option. Also, we asked it to create 8 GB random files in the /tmp/ location to
measure the disk speed. As the final output from bonnie++, we get CSV values, which
we need to feed to the bon_csv2html command; this provides some detailed
information about the test results, as shown here:
$ echo "localhost.localdomain,8G,68996,106,14151,53,46772,15,95343,93,123633,16,201.0,7,16,795,58,+++++,+++,733,46,757,57,+++++,+++,592,38" | bon_csv2html > ~/Desktop/bonresults.html
bonnie++ performs three different tests for disk benchmarking: read, write, and
seek speed. We will discuss the seek rate in a later recipe. For better disk performance,
bonnie++ always recommends higher numbers in the /sec sections of the preceding table, and lower
%CPU values. Also, +++++ shows that the test was not measured accurately by bonnie++,
as the test was incomplete with the provided arguments.
To get complete results, we need to rerun the same test multiple times using the -n
option, so that bonnie++ gets enough time/resources to complete the job.
Performing a seek rate test
In this recipe, we will be discussing how to benchmark the disk seek rate speed using open
source tools.
Getting ready
A file can be read from the disk in two ways: sequentially or at random. Reading a file in
sequential order requires less effort than reading it in random order. In PostgreSQL, as in
other database systems, a relation file needs to be scanned in random order during index scans.
As per the index lookups, the relation file needs to fetch the data
randomly, moving its file pointer backward and forward, which incurs additional
mechanical overhead from spinning the disk on a normal HDD. On an SSD, this overhead is
lower, as it uses flash memory. This is one of the reasons why
random_page_cost is always defined higher than seq_page_cost in postgresql.conf. In
the previous bonnie++ example, random seeks were measured at 201.0 per second
and used 7% of the CPU.
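As a quick check of those planner settings (a minimal sketch; the values in the comments are PostgreSQL's defaults), we can inspect them from psql:
$ psql -c "SHOW seq_page_cost;"       # 1.0 by default
$ psql -c "SHOW random_page_cost;"    # 4.0 by default, reflecting the seek overhead on spinning disks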
How to do it…
We can use the same bonnie++ utility command to measure the random seek rate, or we can
also use another disk latency benchmarking tool called ioping:
# ioping -R /dev/sda3 -s 8k -w 30
--- /dev/sda3 (block device 65.8 GiB) ioping statistics ---
2.23 k requests completed in 29.2 s, 17.5 MiB read, 76 iops, 613.4 KiB/s
generated 2.24 k requests in 30.0 s, 17.5 MiB, 74 iops, 596.2 KiB/s
min/avg/max/mdev = 170.6 us / 13.0 ms / 73.5 ms / 5.76 ms
How it works…
Ioping is a disk latency benchmarking tool that produces output similar to the network
utility command ping. This tool provides both no-cache and cached disk benchmarking, as
bonnie++ does, and it also includes synchronous and asynchronous I/O latency benchmarking. You
can install this tool using yum or apt-get in the respective Linux distributions. The
preceding results were generated based on PostgreSQL's default block size of 8 KB,
and the test ran for 30 seconds. Ioping provides another useful feature called ping-pong mode for
read/write. This mode displays the instant read/write speed of the disk, as shown here:
$ ioping -G /tmp/ -D -s 8k
8 KiB >>> /tmp/ (xfs /dev/sda3): request=1 time=1.50 ms (warmup)
8 KiB <<< /tmp/ (xfs /dev/sda3): request=2 time=9.73 ms
8 KiB >>> /tmp/ (xfs /dev/sda3): request=3 time=2.00 ms
8 KiB <<< /tmp/ (xfs /dev/sda3): request=4 time=1.02 ms
8 KiB >>> /tmp/ (xfs /dev/sda3): request=5 time=1.95 ms
In the preceding example, we ran ioping in ping-pong mode (-G) and used direct I/O (-D)
with a block size of 8 KB. We can also run the same ping-pong mode in pure cache mode
using the -C option.
Working with the fsync commit rate
In this recipe, we will be discussing how to benchmark the fsync speed using open source
tools.
Getting ready
Fsync is a system call that flushes the data from system buffers into physical files. In
PostgreSQL, whenever a CHECKPOINT operation occurs, it internally initiates fsync calls to
flush all the modified system buffers into the respective files. Benchmarking fsync
measures the rate at which data can be transferred from memory to the disk.
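As a small, hedged illustration of that relationship, we can check the relevant settings and force a checkpoint from psql; CHECKPOINT flushes the dirty shared buffers, and the server then fsyncs the underlying data files:
$ psql -c "SHOW fsync;"              # should normally remain on
$ psql -c "SHOW wal_sync_method;"    # how WAL writes are synced to disk
$ psql -c "CHECKPOINT;"              # flush dirty shared buffers to the data files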
How to do it…
To perform fsync benchmarking, we can use a dedicated benchmark test called fs-mark
from Phoronix. This fs-mark test was built on a filesystem benchmarking tool called
fs_mark; alternatively, fio also supports several fsync test cases. We can run this fs-mark test
case using the following command:
$ phoronix-test-suite benchmark fs-mark
FS-Mark 3.3:
    pts/fs-mark-1.0.1
    Disk Test Configuration
        1: 1000 Files, 1MB Size
        2: 1000 Files, 1MB Size, No Sync/FSync
        3: 5000 Files, 1MB Size, 4 Threads
        4: 4000 Files, 32 Sub Dirs, 1MB Size
        5: Test All Options
    Test:
The preceding command failed to install while testing on the local
machine. Once I installed glibc-static via yum install, the test
went smoothly.
How it works…
Phoronix installs all the binaries on the local machine when we start benchmarking the
corresponding test. In the preceding command, we are benchmarking the fs-mark test, which installs
the tool at ~/.phoronix-test-suite/installed-tests/pts/fs-mark-1.0.1/fs_mark-3.3. Let's go to that
location and see what fsync tests it supports:
./fs_mark -help
Usage: fs_mark
  -S Sync Method (0:No Sync, 1:fsyncBeforeClose, 2:sync/1_fsync,
                  3:PostReverseFsync, 4:syncPostReverseFsync,
                  5:PostFsync, 6:syncPostFsync)
I would encourage you to read the readme file, which exists in the same location, for
detailed information about the sync methods. Let's run a simple fs_mark benchmark by
choosing one sync method, as shown here:
./fs_mark -w 8096 -S 1 -s 102400 -d /tmp/ -L 3 -n 500
#  ./fs_mark -w 8096 -S 1 -s 102400 -d /tmp/ -L 3 -n 500
#  Version 3.3, 1 thread(s) starting at Fri Dec 30 04:26:28 2016
#  Sync method: INBAND FSYNC: fsync() per file in write loop.
#  Directories: no subdirectories used
#  File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#  Files info: size 102400 bytes, written with an IO size of 8096 bytes per write
#  App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse%    Count     Size    Files/sec    App Overhead
    39      500   102400        156.4           17903
    39     1000   102400         78.9           22906
    39     1500   102400        116.2           24269
We ran the preceding test with files of size 102,400 bytes and a write block size of 8,096 bytes. The
number of files to create is 500, and the test is repeated three times, choosing sync
method 1, which fsyncs and closes each file after writing its content to disk.
Checking IOPS
In this recipe, we will be discussing how to benchmark the disk IOPS using open source
tools.
Getting ready
As mentioned previously, a disk can be read in either sequential or random order. To
measure the disk accurately, we need to perform more random read/write operations,
which put more stress on the disk. To calculate the IOPS (Input/Output Operations Per Second) of a
disk, we can use either the fio or bonnie++ tools, which perform sequential/random operations on
the disk. In this chapter, let's use the fio (Flexible I/O) tool to calculate the IOPS of the
disk.
How to do it…
Let's download the latest version of the fio module from http://brick.kernel.dk/snaps/,
and also install libaio-devel, which provides the libaio ioengine we will be using for the
IOPS tests. The ioengine defines how the fio module submits the I/O requests to the
kernel. There are multiple ioengines you can specify for the I/O requests, such as sync,
mmap, and so on; you can refer to the man page of fio for all the supported ioengines. After
downloading the fio module, follow the regular Linux source installation method:
configure, make, and make install.
Sequential mixed read and write
Let's run a sample sequential mixed read/write, as shown here:
$ ./fio --ioengine=libaio --direct=1 --name=test_seq_mix_rw --filename=test_seq --bs=8k --iodepth=32 --size=1G --readwrite=rw --rwmixread=50
test_seq_mix_rw: (g=0): rw=rw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=32
...
test_seq_mix_rw: (groupid=0, jobs=1): err= 0: pid=43596: Fri Dec 30 23:31:11 2016
  read : io=525088KB, bw=1948.1KB/s, iops=243, runt=269430msec
  ...
  bw (KB/s) : min=   15, max= 6183, per=100.00%, avg=2002.59, stdev=1253.68
  write: io=523488KB, bw=1942.1KB/s, iops=242, runt=269430msec
  ...
  bw (KB/s) : min=  192, max= 5888, per=100.00%, avg=2001.74, stdev=1246.19
...
Run status group 0 (all jobs):
  READ: io=525088KB, aggrb=1948KB/s, minb=1948KB/s, maxb=1948KB/s, mint=269430msec, maxt=269430msec
  WRITE: io=523488KB, aggrb=1942KB/s, minb=1942KB/s, maxb=1942KB/s, mint=269430msec, maxt=269430msec
Disk stats (read/write):
  sda: ios=65608/65423, merge=0/5, ticks=869519/853644, in_queue=1723445, util=99.85%
Random mixed read and write
Let's run a sample random mixed read/write, as shown here:
$ ./fio --ioengine=libaio --direct=1 --name=test_rand_mix_rw --filename=test_rand --bs=8k --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50
test_rand_mix_rw: (g=0): rw=randrw, bs=8K-8K/8K-8K/8K-8K, ioengine=libaio, iodepth=32
...
test_rand_mix_rw: (groupid=0, jobs=1): err= 0: pid=43893: Fri Dec 30 23:49:19 2016
  read : io=525088KB, bw=1018.9KB/s, iops=127, runt=515375msec
  ...
  bw (KB/s) : min=    8, max= 6720, per=100.00%, avg=1124.47, stdev=964.38
  write: io=523488KB, bw=1015.8KB/s, iops=126, runt=515375msec
  ...
  bw (KB/s) : min=    8, max= 6904, per=100.00%, avg=1125.46, stdev=975.04
...
Run status group 0 (all jobs):
  READ: io=525088KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=515375msec, maxt=515375msec
  WRITE: io=523488KB, aggrb=1015KB/s, minb=1015KB/s, maxb=1015KB/s, mint=515375msec, maxt=515375msec
Disk stats (read/write):
  sda: ios=65609/65456, merge=0/4, ticks=7382037/5520238, in_queue=12902772, util=100.00%
How it works…
We ran the preceding test cases against a 1 GB (--size) file without any cache (--direct),
doing 32 concurrent I/O requests (--iodepth) with a block size of 8 KB (--bs),
as 50% read and 50% write operations (--rwmixread). Comparing the results, the bw
(bandwidth) and IOPS values are much higher in the sequential test: the sequential IOPS
(read=243, write=242) are nearly double the random IOPS (read=127, write=126).
Fio also provides more information, such as I/O submission latency and completion latency,
along with CPU usage for the conducted test cases. I would encourage you to read more
about fio's features in its man pages.
Storage sizing
In this recipe, we will be discussing how to estimate disk growth using the pgbench tool.
Getting ready
One of the best practices for predicting database disk storage capacity is to load a set of
sample data into the application's database and simulate production-like actions
using pgbench over a long period. At regular intervals (say, every hour), collect the
database size using pg_database_size() or any native command that returns the disk
usage information. Once we have collected these samples for at least 24 hours, we can find
the average disk growth rate by calculating the average of the deltas between consecutive interval values.
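A minimal collection sketch, assuming the postgres database and a hypothetical log file path; schedule it hourly, and then diff consecutive values to obtain the growth deltas:
$ psql -At -c "SELECT now(), pg_database_size('postgres');" >> /tmp/db_size_history.log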
How to do it…
Prepare the SQL script as follows, which simulates the live application behavior in the
database:
Create connection;        --- Create/Use pool connection.
INSERT operation          --- Initial write operation.
SELECT pg_sleep(0.01);    --- Some application code runs here, and waiting for the next query.
UPDATE operation          --- Update other tables for the newly inserted records.
SELECT pg_sleep(0.1);     --- Updating other services which show the live graphs on the updated records.
DELETE operation          --- Delete or purge any unnecessary data.
SELECT pg_sleep(0.01);    --- Some application code overhead.
Let's run the following pgbench test case, with the preceding test file, for 24 hours:
$ pgbench -T 86400 -f <script location> -c <number of concurrent connections>
In parallel, let's schedule a job that collects the database size every hour using
the pg_database_size() function, and another job that runs VACUUM on the database every
10 minutes. This VACUUM job takes care of logically reclaiming the
dead tuples at the database level. However, on production servers, we would not deploy a
VACUUM job to run every 10 minutes, as the autovacuum process takes care of the
dead tuples. As this test is not for database performance benchmarking, we can also make
autovacuum more aggressive on the database side.
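If you do want to make autovacuum more aggressive for the duration of the test, a hedged sketch (the values are illustrative, not tuned recommendations) is:
$ psql -c "ALTER SYSTEM SET autovacuum_naptime = '10s';"
$ psql -c "ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;"
$ psql -c "SELECT pg_reload_conf();"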
How it works…
Once we find the average disk growth per day, we can predict the database growth for the
next one or two years. However, the database write rate also increases with business growth,
so we need to keep the database growth script deployed, or analyze disk storage
trends from a monitoring tool, to make a better prediction of the storage size.
Discussing RAID levels
In this recipe, we will be discussing various RAID levels and their unique usage.
Getting ready
In this recipe, we will be discussing several RAID levels, which we configure for database
requirements. RAID (Redundant Array of Independent Disks) has a dedicated
hardware controller to deal with multiple disks, including a separate processor along with a
battery-backed cache, where data can be flushed to disk properly when a power failure
occurs.
How to do it…
RAID levels can be differentiated by their configurations. RAID supports configuration
techniques such as striping, mirroring, and parity to improve disk storage performance
or high availability. The most popular RAID levels are zero through six, and each level provides
its own kind of disk storage capacity, read/write performance, and high availability. The
common RAID levels we configure for a DBMS are 0, 1, 5, 6, or 10 (1 and 0).
How it works…
Let us discuss how the most commonly used RAID levels work:
RAID 0
This configuration only focuses on read/write performance by striping the data across
multiple devices. With this configuration, we can allocate the complete disk storage for the
application's data. The major drawback of this configuration is the lack of high availability. In the
case of a single disk failure, the remaining disks become useless, as they are
missing the chunks from the failed disk. This is not a recommended RAID configuration for
real-time database systems, but it is a recommended configuration for storing non-critical
business data such as historical application logs, database logs, and so on.
RAID 1
This configuration focuses only on high availability rather than on performance, by
mirroring the data across two disk drives. That is, a single copy of the data is kept
on two disks. If one disk is corrupted, then we can still use the other one for read/write
operations. This is also not a recommended configuration for real-time database systems, as
it lacks write performance. Also, in this configuration, we utilize only 50% of
the disk space to store the actual data, and the rest keeps its duplicated copy for high
availability. This is a recommended configuration where the durability of data matters more
than write performance.
RAID 5
This configuration provides more storage and high availability on the disk by storing
parity blocks across the disks. Unlike RAID 1, it offers more disk space to keep the actual
data, as the parity blocks are spread among the disks. If one disk is corrupted, then
we can use the parity blocks from the other disks to rebuild the missing data. However, this is
also not a recommended configuration, since every read/write operation on the disk needs
to process the parity blocks to get the actual data out of it.
RAID 6
This configuration provides more redundancy than RAID 5 by storing two parity blocks
for each write operation. That is, even if two disks become corrupted, RAID 6 can
still recover the data from the parity blocks, unlike RAID 5. This configuration is also not
recommended for database systems, as its write performance is lower than that of the
previous RAID levels.
RAID 10
This configuration is the combination of RAID levels 0 and 1. That is, the data is
striped across multiple disks and replicated to another set of disks. It is the most
recommended RAID level for real-time business applications, where we achieve better
performance than with RAID 1, and higher availability than with RAID 0.
For more information about RAID levels, refer to the following URLs:
http://www.slashroot.in/raid-levels-raid0-raid10-raid5-raid6-complete-tutorial
https://en.wikipedia.org/wiki/Standard_RAID_levels
Configuring pgbench
In this recipe, we will be discussing how to configure the pgbench to perform various test
cases.
Getting ready
By default, PostgreSQL provides a tool, pgbench, which performs a default test suite based
on TPC-B, simulating the live database load on the servers. Using this tool, we can
estimate the tps (transactions per second) capacity of the server by conducting dedicated
read or read/write test cases. Before performing the pgbench test cases on the server, we
need to fine-tune the PostgreSQL parameters and make them ready to fully utilize the
server resources. Also, it's good practice to run pgbench from a remote machine, where the
network latency between the nodes is trivial.
How to do it…
As aforementioned, pgbench simulates a TPC-B-like workload on the servers by executing
three UPDATE statements, followed by SELECT and INSERT statements, on different predefined
pgbench tables. If we want to use those predefined tables, then we need to initialize
pgbench using the -i or --initialize option. Otherwise, we can write a
customized SQL script.
To get effective results from pgbench, we need to fine-tune the PostgreSQL server with the
following parameters:
shared_buffers: This is the amount of memory for database operations.
huge_pages: This improves the OS-level memory management.
work_mem: This is the amount of memory for each backend for data operations such as sort, join, and so on.
autovacuum_work_mem: This is the amount of memory for the postgres internal autovacuum process.
max_worker_processes: This is the maximum number of worker processes for the database.
max_parallel_workers_per_gather: This is the number of worker processes to consider for a gather node type.
max_connections: This is the number of database connections.
backend_flush_after: Do fsync after this many bytes have been flushed to disk by each user process.
checkpoint_completion_target: This is used to set the I/O usage ratio during the checkpoint.
archive_mode: This is used to determine whether the database should run in archive mode.
log_lock_waits: This is used to log useful information about concurrent database locks.
log_temp_files: This is used to log information about the database's temporary files.
random_page_cost: This is used to set the random page cost value for the index scans.
You can also find other parameters at the following URL, which are also
important before conducting any benchmarking on the database server:
https://www.postgresql.org/docs/9.6/static/runtime-config.html.
Another good practice to get good performance is to keep the transaction logs (pg_xlog) in
another mount point, and also have unique tablespaces for tables and indexes. While
performing the pgbench testing with predefined tables, we can specify these unique
tablespaces using the --index-tablespace and --tablespace options.
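For instance, assuming the tablespaces ts_data and ts_idx already exist (these names are
illustrative), the initialization step might look like this:
pgbench -i -s 50 --tablespace=ts_data --index-tablespace=ts_idx postgres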
How it works…
As we discussed earlier, pgbench is a TPC-B benchmarking tool for PostgreSQL, which
simulates the live transactions load on the database server by collecting the required metrics
such as tps, latency, and so on. Using pgbench, we can also increase the database size by
choosing the test scale factor while using the predefined tables. If you want to test multiple
concurrent connections to the database and want to use a pooling mechanism, then it's
good practice to configure pgbouncer/pgpool on the local database node to reuse the
connections.
For more features and options with the pgbench tool, visit
https://www.postgresql.org/docs/9.6/static/pgbench.html.
Running read/write pgbench test cases
In this recipe, we will be discussing how to perform various tests using the pgbench tool.
Getting ready
Using pgbench options, we can benchmark the database for read/write operations. Using
these measurements, we can estimate the disk read-write speed by including the system
buffers. To perform a read-only or write-only test, we can either go with the built-in pgbench
arguments, or create a custom SQL script with the required SELECT, INSERT, UPDATE, or
DELETE statements and execute it with the required number of concurrent connections.
How to do it…
Let us discuss read-only and write-only tests in brief:
Read-only
To perform read-only benchmarking with the pgbench predefined tables, we need to use the -S option. Otherwise, as we discussed earlier, we need to prepare a SQL file with the
required SELECT statements.
Write-only
To perform write-only benchmarking with the pgbench predefined tables, we need to use the -N or -b simple-update options. Otherwise, as we discussed earlier, we have to prepare a
SQL file with the required UPDATE, DELETE, and INSERT statements.
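As a hedged sketch, the following commands run a read-only test, a write-mostly test, and a
custom-script test against a database named postgres (the client count, duration, and the
custom.sql file name are illustrative):
pgbench -S -c 20 -T 120 postgres
pgbench -N -c 20 -T 120 postgres
pgbench -f custom.sql -c 20 -T 120 postgres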
How it works…
While running read-only test cases, it's good practice to measure the database cache hit
ratio, which defines the reduction in I/O usage. You can get the database hit ratio using the
following SQL command:
postgres=# SELECT TRUNC(((blks_hit)/(blks_read+blks_hit)::numeric)*100, 2)
hit_ratio FROM pg_stat_database WHERE datname = 'postgres';
 hit_ratio
-----------
     99.69
(1 row)
Also, if we enable track_io_timing in postgresql.conf, it will provide some
information about disk blocks read/write operations by each backend process. We can get
these disk I/O timing values from the pg_stat_database catalog view.
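For example, once track_io_timing is enabled, a query along the following lines reports the
cumulative block read/write timings (in milliseconds) per database:
postgres=# SELECT datname, blk_read_time, blk_write_time
           FROM pg_stat_database WHERE datname = 'postgres';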
Refer to the following URL for the various test suites that pgbench supports,
such as disk, CPU, memory, and so on:
https://wiki.postgresql.org/wiki/Pgbenchtesting.
2
Server Configuration and Control
In this chapter, we will cover the following:
Starting the database server manually
Stopping the server quickly
Stopping the server in an emergency
Reloading server configuration
Restarting the database server quickly
Tuning connection-related parameters
Tuning query-planning parameters
Tuning logging-related parameters
Introduction
This chapter basically deals with the server architecture and server control. Initially, we will
discuss the different options available to start and stop the PostgreSQL server. As an
administrator, one should be aware of the different methods available in order to use the
appropriate option in emergency situations or during maintenance operations. We then
move on to discussing different server parameters and the tuning aspects related to them.
Some important server parameters that help improve the overall server throughput and
performance are discussed here from an optimization perspective.
Starting the server manually
Normally, a PostgreSQL server will start automatically when the system boots up. If an
automatic start is not enabled for a server, we would need to start the server manually. This
may be required even for operational reasons.
Getting ready
Before we talk about how to start the database server, first we need to understand the
difference between server and service. The term server refers to the database server and its
processes, whereas the term service essentially indicates the operating system wrapper
through which the server gets called.
How to do it…
On a majority of platforms, that is, Linux and Unix distributions, we can start the server
using the pg_ctl command-line utility as shown here:
pg_ctl -D <location of data directory> start
Consider the following example:
pg_ctl -D /var/lib/pgsql/9.6/data start
The -D switch of the pg_ctl command indicates the data directory of the PostgreSQL
server. In the preceding command, the data directory is defined at the location
/var/lib/pgsql/9.6/data.
On Linux and Unix platforms such as Red Hat, the service can be started as mentioned here:
service <postgresql-version> start
For instance, to start the PostgreSQL server version 9.6 as a service, we can use the
following command:
service postgresql-9.6 start
How it works…
Using the start mode of the pg_ctl command, the PostgreSQL server is started in the
background and its background processes are initiated.
You can use the following command to check whether the PostgreSQL server background
processes have started:
ps aux | grep "postgres" | grep -v "grep"
postgres  1289  0.0  0.3  51676  7900 ?  S   11:31  0:01 /usr/lib/postgresql/9.6/bin/postgres -D /var/lib/postgresql/9.6/main -c config_file=/etc/postgresql/9.6/main/postgresql.conf
postgres  1303  0.0  0.0  21828  1148 ?  Ss  11:31  0:00 postgres: logger process
postgres  1305  0.0  0.0  51676  1568 ?  Ss  11:31  0:02 postgres: writer process
postgres  1306  0.0  0.0  51676  1324 ?  Ss  11:31  0:02 postgres: wal writer process
postgres  1307  0.0  0.1  52228  2452 ?  Ss  11:31  0:00 postgres: autovacuum launcher process
postgres  1308  0.0  0.0  22016  1432 ?  Ss  11:31  0:00 postgres: stats collector process
When we execute the pg_ctl command, it internally performs various validations such as
checking that the data directory that we mentioned is equal to the data_directory setting
in postgresql.conf. If the data_directory parameter points to some other directory
location, then PostgreSQL will use that directory location instead. After the validations
have been completed, the pg_ctl binary internally invokes another process called
postgres, which is the main process for the PostgreSQL cluster.
Stopping the server quickly
There are different modes available to stop the PostgreSQL server. Here we will talk about
the mode in which we can stop the server quickly.
Getting ready
The pg_ctl command is used in combination with the stop option in order to stop the
PostgreSQL server.
How to do it…
We can use the following command to stop the server quickly on Red Hat-based Linux
distributions and other Unix-based systems:
pg_ctl -D /var/lib/pgsql/9.6/data -m fast stop
Here, /var/lib/pgsql/9.6/data is the location of the data directory.
How it works…
The -m fast option must be used in order to shut down as quickly as possible. In case of a
normal shutdown, PostgreSQL will wait for all users to finish their transactions before
halting and on a busy system this can sometimes take a very long time.
While initiating a fast stop with the -m fast option of the pg_ctl command, all the users
will have their transactions aborted and all connections are disconnected. However, this is a
clean shutdown, because all the active transactions are aborted, followed by a system
checkpoint, before the server closes.
Also, while shutting down the database, it performs various checks, such as whether
PostgreSQL was started in single-user mode or whether the cluster is running in backup
mode, and notifies the user with the required messages.
Stopping the server in an emergency
In this recipe, we will show you the command that can be used to stop the server in an
emergency situation.
How to do it…
We can use the following command to stop the PostgreSQL server in an emergency:
pg_ctl -D /var/lib/pgsql/9.6/data stop -m immediate
Here, the data directory location is defined at /var/lib/pgsql/9.6/data.
How it works…
The moment the immediate stop mode is used with the pg_ctl command, all the users
have their transactions aborted and the existing connections are terminated. There is no
system checkpoint either and the database basically requires crash recovery at the time of
database restart.
In this shutdown mode, the PostgreSQL process will issue a direct SIGQUIT signal to each
of the child processes, including backend processes such as bgwriter, autovacuum,
and recovery processes. However, in smart shutdown mode, the PostgreSQL process will
wait until these processes are terminated and then shut down the postmaster process.
Reloading server configuration
In this recipe, we will talk about the command that can be used to reload the server
configuration and its parameters.
Getting ready
If any of the changed server parameters can come into effect after simply reloading the
server configuration files, such as postgresql.conf, then we need to reload the server
configuration using the reload option of the pg_ctl command.
How to do it…
We can use the following command to reload the server configuration on Red Hat-based
Linux and Unix distributions:
pg_ctl -D /var/lib/pgsql/9.6/data reload
You can also reload the configuration using the pg_reload_conf function while still being
connected to PostgreSQL. The usage of the pg_reload_conf function is as follows:
postgres=# select pg_reload_conf();
How it works…
The reload mode of the pg_ctl command is used to send the postgres process a SIGHUP
signal, which in turn causes it to reload its configuration files such as postgresql.conf
and pg_hba.conf. The benefit of this mode is that it allows the configuration changes to be
reloaded without requiring a complete restart of the PostgreSQL server to come into effect.
In this mode, PostgreSQL will send a SIGHUP signal to each child process, including the
utility processes such as autovacuum, bgwriter, stats collector, and so on, to get the new
settings from the latest configuration settings.
Restarting the database server quickly
Sometimes there are situations where you need a database bounce. This is most likely for
the database parameters that require a server restart to come into effect. This is different
from the reload option of the server configuration, which only reloads the configuration
files without requiring a server bounce.
How to do it…
We can use the following command to restart the database server:
pg_ctl -D /var/lib/pgsql/9.6/data restart
Here, /var/lib/pgsql/9.6/data is the location of the data directory of the PostgreSQL
server.
How it works…
Using the restart mode of the pg_ctl command first stops the running databases on the
PostgreSQL server and then starts the server. It is in effect a two-step process, where the
running server is first stopped and then started again. A database restart is needed in many
situations. It may be that some of the server parameters require a server restart for the
changes made to them to come into effect, or it can also be that some of the server
processes have hung and a restart is needed.
In this mode, by default, the PostgreSQL process will terminate using SIGINT, which is a
fast shutdown mode. However, we can also specify the restart mode using the -m
argument.
Tuning connection-related parameters
In this recipe, we will talk about tuning the connection-related server parameters.
How to do it…
The following are the connection-related parameters that usually require tuning:
listen_addresses
port
max_connections
superuser_reserved_connections
unix_socket_directory
unix_socket_permissions
These parameters can be set to appropriate values in the postgresql.conf file, and they
take effect at server start. The changes made to these parameters will only come
into effect once the PostgreSQL server is restarted, either using the restart mode of the
pg_ctl command or using the stop mode followed by the start mode of the
pg_ctl command. This can be done as follows:
pg_ctl -D $PGDATA restart
Here, $PGDATA refers to the environment variable that refers to the data directory location
of the PostgreSQL server.
How it works…
Here, we will talk about the connection-related parameters discussed in the preceding
section:
listen_addresses: The standard behavior of PostgreSQL is to respond to
connections from localhost. This default behavior usually needs alterations to
allow the PostgreSQL server to be accessed via the standard TCP/IP protocol
widely used in client-server communications. The de facto practice is to set this
value to * to allow the PostgreSQL server to listen to all incoming connections.
Otherwise, we need to mention a comma-separated list of IP addresses on which
PostgreSQL should listen. This gives additional security control over
reaching the database instance directly from any IP address.
port: This parameter defines the port on which the PostgreSQL server will listen
for incoming client connection requests. The default value of this parameter is set
to 5432. If there are multiple PostgreSQL instances running on the same server,
the value of the port parameter for every PostgreSQL instance needs to be altered
to avoid port conflicts.
max_connections: This parameter governs the maximum number of concurrent
client connections that the server can support. This is a very important setting
because other parameters, such as work_mem, depend on it: memory resources
can be allocated on a per-client basis, so the maximum number
of clients in effect suggests the maximum possible memory use. The default value
of this parameter is 100. If the requirement is to support thousands of client
connections, connection pooling software should be used to reduce the
connection overhead. On a Linux server with 8 GB of RAM, a value of 140 is a
good start. Increasing the value of this parameter will cause the PostgreSQL
server to request more shared memory from the operating system.
superuser_reserved_connections: This parameter governs the number of
connection slots reserved for superusers. The default value of this parameter is
set to 3. This parameter can only be set at server restart.
unix_socket_directory: This parameter defines the directory to be used for
the Unix socket connections to the server. The default location is /tmp, but it can
be changed at build time. This parameter comes into effect during server start.
This parameter is irrelevant on Windows, which does not have Unix-domain socket
connections.
unix_socket_permissions: This parameter defines the access permissions of
the Unix domain socket. The default value is 0777, which means anyone can
connect using socket connections. This parameter value can be set only at server
start.
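The following postgresql.conf fragment is only an illustrative sketch; the values shown are
examples and must be sized for your own workload:
listen_addresses = '*'
port = 5432
max_connections = 200
superuser_reserved_connections = 3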
Tuning query-related parameters
In this recipe, we will talk about the query planning related parameters and the associated
tuning aspects.
How to do it…
The following are the query planning related parameters that usually require tuning:
random_page_cost
seq_page_cost
effective_cache_size
work_mem
constraint_exclusion
These parameters can be set in the postgresql.conf configuration file.
How it works…
random_page_cost: This parameter is basically used to estimate the cost of a random page
fetch in abstract cost units. The default value of this parameter is 4.0. Random page cost
essentially represents the ratio between the cost of looking up a row via a sequential scan
and the cost of looking up a single row individually using random
access, that is, disk seeks. This factor influences the query planner's decision to use indexes
instead of a table scan while executing queries. Reducing this value relative to the
seq_page_cost parameter will cause the system to prefer index scans. If, however, this
value is increased, it would make index scans appear expensive from a cost perspective.
seq_page_cost: This parameter is used to estimate the cost of a sequential page fetch in
abstract cost units. The default value of this parameter is 1.0 and this may need to be
lowered for caching effects. For situations where the database is cached entirely in RAM,
this value needs to be lowered. Always set random_page_cost to be greater than or equal
to seq_page_cost.
effective_cache_size: This parameter setting is used to provide an estimate of how
much memory is available for disk caching by the operating system and within the database
itself after considering and accounting for what is being used by the OS itself and other
applications. This parameter does not allocate any memory, rather it serves as a guideline as
to how much memory you expect to be available in the RAM, as well as in the PostgreSQL
shared buffer. This helps the PostgreSQL query planner to figure out whether the query
plans it is considering for query execution would fit in the system memory or not. If this is
set too low, indexes may not be used for executing queries the way you would expect.
Let us assume that there is around 1.5 GB of system RAM on your Linux machine, the value
of shared_buffers is set to 32 MB, and effective_cache_size is set to 850 MB. Now, if
a running query needs around 700 MB of a dataset, PostgreSQL would estimate that all the
data required should be available in memory and would hence opt for an aggressive plan
from the context of optimization, involving heavier index usage and merge joins. But if the
effective cache size is set to around 300 MB, the query planner will opt for the sequential
scan. The rule of thumb is that if shared_buffers is set to a quarter of the total RAM,
setting effective_cache_size to one half of the available RAM would be a conservative
setting and setting it to around 75% of the total RAM would be an aggressive setting, but
still reasonable.
work_mem: This parameter defines the amount of memory to be used by internal sort
operations and hash tables before switching to temporary disk files. When a running query
needs to sort data, an estimate is provided by the database as to how much data is involved
and it will compare it to the work_mem parameter. If it is larger, it will write out all the data
and use a disk-based sort mechanism instead of reverting to in-memory sort, which would
be much slower than the memory-based sort. If there are a lot of complex sorts involved,
and if you have good memory available, then increasing the value of the work_mem
parameter would allow PostgreSQL to do in-memory sorts, which will be faster than
disk-based ones. The amount of memory specified by the work_mem parameter is applied to
each sort operation done by each database session, and complex queries can use multiple
working memory-based sort buffers. For instance, if the value of work_mem is set to 100 MB,
and if there are around 40 database sessions submitting queries, one could end up using
around 4 GB of real system memory. The catch is that one cannot predict the number of
sorts one client session may end up doing; work_mem is effectively a per-sort parameter
rather than a per-client one, which means the memory consumed through work_mem is
unbounded when a number of clients are doing a large number of sorts concurrently. To
arrive at an optimum setting for the work_mem parameter, first consider how much free
system memory is available after the allocation of shared_buffers, divide that by the
value of max_connections, and then take a fraction of that figure; half of it would be an
assertive value for the work_mem parameter. One needs to set this value explicitly, as the
default value is too low for in-memory sorts.
constraint_exclusion: If in PostgreSQL you are using partitioned tables that utilize
constraints, then enabling the constraint_exclusion parameter allows the query planner to
ignore partitions that cannot have the data being searched for when that can be proven. In
PostgreSQL versions earlier than 8.4, this value was set to off by default, which meant that
unless it was toggled on, constraint exclusion would not work as expected on partitioned
tables. Since version 8.4, the default has been partition, in which case the query planner will
examine constraints for inheritance child subtables and UNION ALL queries. If this value is
turned on, it will examine constraints on all tables and this will impose extra planning
overhead even for simple queries and thus there will be a performance overhead.
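The following postgresql.conf fragment is an illustrative sketch for a server with plenty of
RAM and reasonably fast storage; the figures are assumptions to be validated by
benchmarking, not recommendations:
seq_page_cost = 1.0
random_page_cost = 2.0
effective_cache_size = 12GB
work_mem = 32MB
constraint_exclusion = partition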
Tuning logging-related parameters
In this recipe, we will talk about tuning logging-related parameters.
How to do it…
The following are the logging-related parameters that usually require tuning:
log_line_prefix
log_statement
log_min_duration_statement
How it works…
log_line_prefix: The default value of this parameter is usually empty, which is not
desirable. A good way to put this in context would be to use the following format:
log_line_prefix='%t:%r:%u@%d:[%p]: '
Once this format is used, every line in the log will follow this format. The following is a
description of the entries:
%t: This is the timestamp
%u: This is the database username
%r: This is where the remote host connection is from
%d: This is where the database connection is to
%p: This is the process ID of the connection
Not all of these values will be required initially; however, at a later stage as you drill down
further, they will be useful. For instance, the significance of the process ID (%p) becomes apparent
when you are troubleshooting performance issues.
log_statement: This parameter defines which statements are logged. Statement logging is
a powerful technique to find performance issues. The options for the log_statement
values are as follows:
None: At this setting, no statement-level logging is captured.
DDL: When this setting is enabled, DDL statements such as CREATE, ALTER, and
DROP statements are captured. This is a handy setting to find out whether there
were any major changes introduced accidentally by developers or administrators
alike.
MOD: When this setting is enabled, all the statements that modify data will be
logged. From the point of view of auditing changes, this is a handy feature. This
setting is not useful if your workload consists mainly of SELECT statements run
for reporting purposes.
ALL: When this setting is enabled, all the statements are logged. This should
generally be avoided to counter the logging overhead on the server when all
statements are being logged.
log_min_duration_statement: This parameter setting causes the duration of each
statement to be logged if a given statement ran for at least a certain number of milliseconds.
It is basically like setting a threshold and if any statement exceeds the defined thresholds,
the duration of that statement is logged.
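For example, with an illustrative threshold of one second, the following settings log every
statement that runs for 1,000 ms or longer, using the prefix discussed above:
log_line_prefix = '%t:%r:%u@%d:[%p]: '
log_min_duration_statement = 1000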
3
Device Optimization
In this chapter, we will cover the following recipes:
Understanding memory units in PostgreSQL
Handling Linux/Unix memory parameters
CPU scheduling parameters
Device tuning parameters
Identifying checkpoint overhead
Analyzing buffer cache contents
Introduction
Tuning only PostgreSQL-related parameters will not be sufficient to achieve good
performance from any hardware. It is also required to tune specific device-related
parameters, which will greatly improve the performance of any database server. In this
chapter, we will be discussing various device configurations such as CPU, memory, and
disk. Most of the content in this chapter discusses Linux-related parameters; for other
operating systems, you may need to refer to the equivalent configuration settings.
Understanding memory units in PostgreSQL
In this recipe, we will be discussing the memory components of PostgreSQL instances.
Getting ready
PostgreSQL uses several memory components for each unique usage. That is, it uses
dedicated memory areas for transactions, sort/join operations, maintenance operations, and
so on. If a configured memory component doesn't fit the usage of the live application,
then we may hit a performance issue, where PostgreSQL has to perform more I/O.
How to do it…
Let us discuss how to tune the major PostgreSQL memory components:
shared_buffers
This is the major memory area that PostgreSQL uses for executing the transactions. On most
Linux operating systems, it is recommended to allocate at least 25% of RAM as shared
buffers, by leaving 75% of RAM to OS, OS cache, and for other PostgreSQL memory
components. PostgreSQL provides multiple ways to create shared memory. The
supported shared memory creation techniques are POSIX shared memory, System V shared
memory, memory-mapped files, and Windows shared memory; for dynamic shared memory
segments, one of these methods is specified as the value of the dynamic_shared_memory_type parameter.
temp_buffers
PostgreSQL utilizes this memory area for holding the temporary tables of each session,
which will be cleared when the connection is closed.
work_mem
This is the memory area that PostgreSQL tries to allocate for each session when the query
requires any joining, sorting, or hashing operations. We need to be a bit cautious while
tuning this parameter, as this memory will be allocated for each connection whenever it is
needed.
maintenance_work_mem
PostgreSQL utilizes this memory area for performing maintenance operations such as
VACUUM, ALTER TABLE, CREATE INDEX, and so on. Autovacuum worker processes also
utilize this memory area, if the autovacuum_work_mem parameter is set to -1.
wal_buffers
PostgreSQL utilizes this memory area for holding incoming transactions, which will be
flushed immediately to disk (pg_xlog files) on every commit operation. By default, this
parameter is configured to utilize 3% of the shared_buffers memory to hold incoming
transactions. This setting may not be sufficient for busy databases, where multiple
concurrent commits happen frequently.
max_stack_depth
PostgreSQL utilizes this memory area for function call/expression execution stacks. By
default, it's configured to use 2 MB, which we can increase up to the kernel's configured
stack size (ulimit -s).
effective_cache_size
This is a logical setting for PostgreSQL instances, which gives a hint about how much cache
(shared_buffers and OS cache) is available at this moment. Based on this cache setting,
the optimizer will generate a better plan for the SQL. It is recommended to set this
parameter to about 75% of RAM.
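As a rough sketch for a dedicated server with 16 GB of RAM, following the 25%/75%
guidance above (the figures are illustrative only):
shared_buffers = 4GB
work_mem = 16MB
maintenance_work_mem = 512MB
wal_buffers = 16MB
effective_cache_size = 12GB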
How it works…
In the preceding memory components, not all will get instantiated during PostgreSQL
startup. Only shared_buffers and wal_buffers will be initialized, and the remaining
memory components will be initialized whenever their presence is needed.
For more information about these parameters, refer to the following URL:
https://www.postgresql.org/docs/9.6/static/runtime-config-resource.html.
Handling Linux/Unix memory parameters
In this topic, let's discuss various kernel memory settings, and see the recommended
configuration values.
Getting ready
The kernel provides various memory settings based on its distribution. Using these
parameters, we can control the kernel behavior, which provides the necessary resources to
the applications. PostgreSQL is a database software application that needs to
communicate with the kernel through system calls to get its required resources.
How to do it…
Let us discuss how to tune a few major kernel memory parameters in Linux:
kernel.shmmax
This setting defines the maximum size of a single shared memory segment, which limits the
segment size a process can request. We need to specify the maximum allowed segment size in
bytes. It is recommended to set this value as half of the RAM size in the form of bytes.
kernel.shmall
This setting defines the limit on the available shared memory in the system. In general, we
need to set this parameter in the form of pages. While the PostgreSQL server is starting, it
will calculate the amount of shared memory that is required based on postgresql.conf
settings such as max_connections, autovacuum_max_workers, and so on. It is generally
recommended to set this value as half of the RAM size in the form of pages.
kernel.shmmni
This setting defines the limit of the number of segments, which the kernel needs to allow in
the system. The default value of 4096 is sufficient for a standalone PostgreSQL server. We
can also get the number of shared memory segments in the system, by using the ipcs Linux
utility command.
vm.swappiness
This setting defines how aggressively the kernel swaps out application memory and clears
the system cache. If we set this value to 100, then the kernel will be busy paging applications
out to the swap file, and it will also drop the cached data. Once the kernel drops the cache,
it is a major performance hit for the database server, and also a major performance problem
for the applications, which need to load their content back from swap into memory. The
recommended value for this parameter is any value less than 10. In general, the
kswapd Linux OS daemon process will start dumping pages into swap when little free
memory is available.
vm.overcommit_memory
This setting defines the overcommit behavior, which will try to control the system stability
by preventing OOM (Out Of Memory) issues. The suggested values for this parameter are
either 0 or 2 (recommended). The value 2 strictly controls memory allocation and will
allocate memory for the processes up to the swap size plus the percentage of RAM defined
by overcommit_ratio. The value 0 follows a heuristic approach, which can allocate more
address space (virtual memory) than the physical memory and does not guarantee system
stability.
vm.overcommit_ratio
This setting defines what percentage of physical memory the kernel may commit to the
processes. The suggested value for this parameter is 80. That is, 80% of the physical memory will be
considered to allocate the memory for the requested processes. This parameter plays a
major role in utilizing the memory to the full extent.
vm.dirty_background_ratio
This setting defines what percentage of the system memory can be dirtied before the
kernel's background writer threads start flushing it to disk. The suggested value for this
parameter is less than 5, based on RAM size. Once this threshold is reached, the kernel
starts pushing the modified content to disk storage in the background, without blocking the
processes that are generating the writes.
vm.dirty_ratio
This setting defines the hard limit on the percentage of system memory that can be dirtied
before writes are forced to be flushed to disk. The suggested value for this parameter is 10 to 15 percent based on RAM size, and
the configured RAID levels. When the system memory fills with dirty data, then the process
that generated the dirty data is responsible for flushing the contents to disk, by blocking
incoming I/O requests to the user memory space.
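As an illustrative /etc/sysctl.conf sketch for a host with, say, 16 GB of RAM (the byte and
page figures are assumptions that must be recalculated for your system; apply the changes
with sysctl -p):
kernel.shmmax = 8589934592
kernel.shmall = 2097152
vm.swappiness = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
vm.dirty_background_ratio = 3
vm.dirty_ratio = 10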
How it works…
These kernel parameters play a vital role for all the Linux applications, including the
database server. Based on these configuration settings, we can indirectly control the
PostgreSQL behavior, by controlling the operating system kernel.
For more information about kernel parameters, check out the following
URL: https://www.kernel.org/doc/Documentation/sysctl/.
CPU scheduling parameters
In this topic, let's discuss a few CPU scheduling parameters, which will fine tune the
scheduling policy.
Getting ready
In the previous recipe, we discussed memory-related parameters, which alone are not
sufficient for good performance. Much like memory, the kernel also supports various
CPU-related tuning parameters, which drive process scheduling.
How to do it…
Let us discuss a few major CPU scheduling parameters in Linux:
kernel.sched_autogroup_enabled
This parameter groups all the processes that belong to the same kernel session ID, which are
further processed as a single process scheduling entity rather than multiple entities. That is,
the CPU will indirectly spend a contiguous amount of time on a group of processes that
belong to the same session rather than switching among multiple processes. Using the
ps xao sid,pid,comm command, we can get the session ID of each process. It is
always recommended to set this value to 0 for a standalone PostgreSQL server, as each
process belongs to a unique session ID.
kernel.sched_min_granularity_ns
This parameter defines the minimum time slice of the process that it needs to spend in the
allocated CPU before its preemption. The recommended value for this parameter is
2,000,000 ns, and it can be tuned further by doing proper benchmarking with multiple
settings.
kernel.sched_latency_ns
This parameter defines the kind of maximum time slice for each process that runs in CPU
core. However, this process time slice will be calculated based on the load at that time in the
server; it also includes the process's nice values along with the
sched_min_granularity_ns parameter. The recommended value for this parameter is
10,000,000 ns, which gives better performance for a moderate number of processes. If the
database server is dealing with more processes, this setting may lead to more process
preemption and may reduce the performance of CPU-bound tasks.
kernel.sched_wakeup_granularity_ns
This parameter defines whether the current running process needs to be preempted from
the CPU, when it compares its vruntime (virtual run time of the process, that is, the actual
time the process spends in CPU) with the woken process vruntime. If the delta of vruntime
between these two processes is greater than this granularity setting, then it will cause the
current process to preempt from the CPU. The recommended value for this parameter
should always be less than half of the sched_latency_ns.
kernel.sched_migration_cost_ns
This parameter defines how long a task is considered cache-hot and therefore not ready for
CPU migration. That is, only after this amount of time has passed is a process eligible to
migrate to another, less busy CPU core. By reducing this value, the scheduler makes more
task migrations among the CPU, which will effectively utilize all the cores. But migrating a
task from one CPU to another requires additional efforts, which may reduce the
performance of the process. The recommended value for this parameter is 500,000.
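These values can be inspected and changed through sysctl, provided the kernel exposes
them (availability depends on the kernel version and build); the commands below simply
restate the recommendations above:
sysctl -w kernel.sched_autogroup_enabled=0
sysctl -w kernel.sched_min_granularity_ns=2000000
sysctl -w kernel.sched_latency_ns=10000000
sysctl -w kernel.sched_migration_cost_ns=500000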
How it works…
In the Linux kernel, each process is called a kernel scheduling entity (KSE), which will be
processed based on different scheduling policies. Using the preceding parameters, we can
control the process scheduling and migration of the process among CPUs, which will
improve the execution speed of a process. We can also process a set of processes as a single
KSE by using the sched_autogroup_enabled option, where a set of processes will get
more CPU resources than the others.
Disk tuning parameters
In this recipe, we will be discussing a few disk I/O related schedulers that are designed for a
specific need.
Getting ready
The Linux kernel provides various device-specific algorithms, which give more flexibility in
fine-tuning the hardware devices. By default, it ships several I/O scheduling algorithms,
each with its own unique usage, and it also provides a way to set a different I/O scheduling
policy for each disk device. The major disk I/O schedulers are CFQ (Completely Fair
Queuing), noop, and deadline.
How to do it…
Let us discuss the Linux disk scheduling algorithms:
CFQ
This is the default I/O scheduler that we get for the disk devices. This scheduler provides
I/O time slices, like the CPU scheduler. This algorithm also considers the I/O priorities by
reading the process's ionice values, and also allots scheduling classes such as real time,
best effort, and idle classes. An advantage of this scheduler is that it tries to analyze the
historical data of the I/O requests and predicts a process's near-future I/O requirements.
If it expects that a process will soon issue more I/O, it waits for that I/O rather than
processing the other waiting I/O processes. Sometimes this I/O prediction may go wrong,
which leads to bad
disk throughput. However, CFQ provides a tunable parameter, slice_idle, which
controls the amount of waiting time for the predicted I/O requests. It is always
recommended to set this value to 0 for the RAID disk devices, and any positive integer
setting for non-RAID devices.
noop
This I/O scheduler is pretty neat, as it doesn't include any restrictions such as time slices in
CFQ, and sends the requests to the disk in FIFO order. This I/O scheduler does less work
and gives better performance when the underlying disk devices are of a RAID or SSD type.
This is because devices such as RAID or SSD follow their own unique techniques while
reordering I/O requests.
Deadline
This I/O scheduler is widely used for all database servers, since it enforces the latency limits
over all I/O requests. By using this scheduler, we can achieve good performance for the
database server, which has more read operations than write operations. This algorithm
dispatches the I/O tasks as batches. A single batch contains a set of read/write operations
along with their expiry timings. We can tune the batch size and the expiry timings of read
and write requests using the following parameters:
fifo_batch: This defines the batch size of the read/write requests. Increasing this parameter can give better disk throughput at the cost of higher I/O latency.
read_expire: The deadline for a read request to get serviced. The default is 500 ms.
write_expire: The deadline for a write request to get serviced. The default is 5,000 ms.
writes_starved: This defines how many read batches need to be dispatched before a single write batch is dispatched.
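To check or change the scheduler of a given block device, the sysfs interface can be used as
follows (sda is an illustrative device name, and the change does not survive a reboot unless
made persistent through the boot loader or a udev rule):
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler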
How it works…
As we discussed previously, each I/O request will be processed based on a unique I/O
scheduler, which will try to rearrange the requests in a physical order to reduce the
mechanical overhead while reading the data from the disk.
Identifying checkpoint overhead
In this recipe, we will be discussing the checkpoint process, which may cause more I/O
usage while flushing the dirty buffers into the disk.
Getting ready
In PostgreSQL, the checkpoint is an activity that flushes all the dirty buffers from the shared
buffers into persistent storage, followed by updating the consistent transaction log
sequence number in the data and WAL files. That means this checkpoint guarantees that all
the information up to this checkpoint number is stored in a persistent disk storage.
PostgreSQL internally issues the checkpoint as per the checkpoint_timeout and
max_wal_size settings, which keep the data in a consistent storage as configured. In case
the configured shared_buffers are huge in size, then the probability of holding dirty
buffers will also be greater in size, which leads to more I/O usage for the next checkpoint
process. It is also not recommended to tune the checkpoint_timeout and max_wal_size
settings to be aggressive, as it will lead to frequent I/O load spikes on the server and may
cause system instability.
How to do it…
To identify the overhead of the checkpoint process, we need to monitor the disk usage by
using some native tools such as iostat, iotop, and so on, while the checkpoint process is
in progress. To identify the checkpoint process status, we can either query the
pg_stat_activity view as follows, or consider enabling the log_checkpoints parameter,
which logs an implicit checkpoint's begin status as checkpoint starting: xlog and an explicit
request as checkpoint starting: immediate force wait xlog:
SELECT * FROM pg_stat_activity WHERE wait_event = 'CheckpointLock';
How it works…
Once we find that the I/O usage is high enough when the checkpoint process is running,
then we have to consider tuning the checkpoint_completion_target parameter, which
will limit the transfer rate between the memory and disk. This parameter accepts real
numbers between 0 and 1, which will smooth the I/O transfer rate if the parameter value
gets close to 1. PostgreSQL version 9.6 introduced a new parameter called
checkpoint_flush_after, which forces the kernel to flush written buffers from its page
cache to disk on certain operating systems.
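A hedged postgresql.conf sketch that spreads checkpoint I/O over time might look as
follows; the figures are illustrative starting points, not recommendations:
checkpoint_timeout = 15min
max_wal_size = 4GB
checkpoint_completion_target = 0.9
log_checkpoints = on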
For more information about the checkpoint process and its behavior, refer to the following
URL: https://www.postgresql.org/docs/9.6/static/wal-configuration.html.
Analyzing buffer cache contents
In this recipe, we will be discussing how to analyze the content that resides in PostgreSQL
shared buffers.
Getting ready
To analyze the shared buffer contents, PostgreSQL provides an extension called
pg_buffercache, which we install using the CREATE EXTENSION command. This extension
reports buffer details such as how many buffers each relation holds in memory,
what buffers are dirty at this moment, and what is a specific buffer's usage count.
In PostgreSQL, shared buffers are not specific to any database as they are managed at
cluster level. Hence, while querying the pg_buffercache we may get relation details that
are from other databases too. Also, it is not recommended to query the pg_buffercache
too frequently as it needs to handle buffer management locks, to get the buffer's current
status.
How to do it…
Let's install the pg_buffercache extension and then query the view as follows:
postgres=# CREATE EXTENSION pg_buffercache;
CREATE EXTENSION
postgres=# SELECT * FROM pg_buffercache LIMIT 1;
-[ RECORD 1 ]----+------
bufferid         | 1
relfilenode      | 1262
reltablespace    | 1664
reldatabase      | 0
relforknumber    | 0
relblocknumber   | 0
isdirty          | f
usagecount       | 5
pinning_backends | 0
From the preceding output, the reldatabase is 0, which indicates that the bufferid 1
belongs to a table/view of the shared system catalogs, such as pg_database. Also, it
displays the usagecount of that buffer as 5, which will be helpful during the buffer cache
eviction policy.
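To see which relations in the current database occupy the most shared buffers, a query
along these lines (a sketch based on the extension's documented usage) can be run:
postgres=# SELECT c.relname, count(*) AS buffers
           FROM pg_buffercache b
           JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
           WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                       WHERE datname = current_database()))
           GROUP BY c.relname
           ORDER BY 2 DESC LIMIT 10;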
How it works…
PostgreSQL has a built-in buffer manager that maintains the shared buffers for the cluster.
Whenever a backend requests to read a block from the disk, then the buffer manager
allocates a block for it from the available shared buffer memory. If the shared buffers don't
have any available buffers, then the buffer manager needs to choose a certain number of
blocks to evict from the shared buffers for the newly requested blocks.
PostgreSQL's buffer manager follows an approach called clock sweep for the buffer eviction
policy. When a block is loaded from disk into the shared buffers, then its usage count will
begin with 1. Whenever this buffer is hit by other backend processes, its usage count is
increased, and its value is capped at 5 once the count exceeds 5. This usage
count value is reduced by 1 during the eviction policy. When the buffer manager needs to
evict the buffers for the incoming requests, then it scans all the buffers including dirty
buffers, and will reduce all the buffer usage counts by 1. After the end of the scan, if it finds
the buffers whose usage count is 0, then it treats those buffers as if they should be evicted
from the shared buffers. If the chosen buffer is a dirty buffer, then the backend process
request will be in waiting state, until that dirty buffer is flushed to the respective physical
file storage. From the preceding output, the usage count of the buffer is 5, which indicates
that it is not close to being evicted.
4
Monitoring Server Performance
In this chapter, we will cover the following:
Monitoring CPU usage
Monitoring paging and swapping
Tracking CPU consuming processes
Monitoring CPU load
Identifying CPU bottlenecks
Identifying disk I/O bottlenecks
Monitoring system load
Tracking historical CPU usage
Tracking historical memory usage
Monitoring disk space
Monitoring network status
Introduction
In order to be able to solve performance problems we should be able to effectively use
operating system utilities. We should be able to use the right operating system tools and
commands in order to identify performance problems that may be due to CPU, memory, or
disk I/O issues. A DBA's duties often overlap with certain system
administration-related functions and it is important for a DBA to be effective in using the
related operating system utilities in order to correctly identify where the underlying issue
on the server could be. In this chapter, we are going to discuss various Unix/Linux related
operating system utilities that can help the DBA in performance analysis and
troubleshooting issues.
Monitoring CPU usage
In this recipe, we are going to use the sar command to monitor CPU usage on the system.
Getting ready
The commands used in this recipe have been performed on a Sun Solaris machine. Hence
the command output may vary in different Unix and Linux-related systems.
How to do it…
We can use the sar command with the -u switch to monitor CPU utilization:
bash-3.2$ sar -u 10 8
SunOS usstlz-pinfsi09 5.10 Generic_150400-04 sun4u    08/06/2014

23:32:17    %usr    %sys    %wio   %idle
23:32:27      80      14       3       3
23:32:37      70      14      12       4
23:32:47      72      13      21       4
23:32:57      76      14       6       3
23:33:07      73      10      13       4
23:33:17      71       8      17       4
23:33:27      67       9      20       4
23:33:37      69      10      17       4

Average       73      11      13       4
In the preceding command with the -u switch, two values are passed as input. The first
value used here, that is 10, displays the number of seconds between sar readings and the
second value used here, that is 8, indicates the number of times you want sar to run.
How it works…
The sar command provides a quick snapshot of how much the CPU is bogged down or
utilized. The sar output reports the following four columns:
%usr: This indicates percentage of CPU running in user mode
%sys: This indicates percentage of CPU running in system mode
%wio: This indicates percentage of CPU running idle with a process waiting for
block I/O
%idle: This indicates percentage of CPU that is idle
Often a low percentage of idle time points to a CPU intensive job or an underpowered CPU.
You could use the ps command or prstat command in Solaris to find a CPU intensive job.
The following are general indicators of performance problems:
If you see an abnormally high value in the %usr column, this would mean that
applications are not being tuned properly or are over utilizing the CPU.
If you see a high value in the %sys column, it probably indicates a bottleneck that
could be due to swapping or paging and that needs to be investigated further.
Also, the %sys column reflects the amount of CPU used by system calls when
applications make several calls in the background.
Monitoring paging and swapping
In this recipe, we are going to use the sar and vmstat commands with options to monitor
paging and swapping operations.
Getting ready
It is necessary to monitor the amount of paging and swapping happening on the operating
system. Paging occurs when a part of a process gets transferred from physical memory to
disk, or is read back from disk into physical memory. Swapping occurs when an entire
process gets transferred from physical memory to disk, or is read back into physical
memory from disk. Depending on the system, either paging or swapping
could be an issue. If paging is occurring normally, and if you see a trend of heavy
swapping, then the issue could be related to insufficient memory or sometimes the issue
could be related to disk as well. If the system is heavily paging and not swapping, the issue
could be related to either CPU or memory.
How to do it…
We could use the vmstat and sar command with the following to monitor the paging and
swapping operations:
1. The vmstat command can be used with the -S switch to monitor swapping and
paging operations:
bash$ vmstat -S
 kthr       memory                 page                      disk        faults        cpu
 r  b w   swap      free      si so  pi   po  fr de sr s0 s2 s3 s4  in     sy     cs    us sy id
 6 14 0 453537392 170151696   0  0  2444 186 183  0  0  1  1  1  1  77696 687853 72596  13  4 83
In the preceding snippet, the columns si and so represent swap in and swap
out operations.
Similarly, pi and po represent page in and page out operations, respectively.
However, when the sar command is used with the required options, it
provides a more in-depth analysis of paging and swapping operations.
2. We can also use the sar command with the -p switch to report paging
operations:
bash-3.2$ sar -p 5 4
SunOS usmtnz-sinfsi17 5.10 Generic_150400-04 sun4u    08/08/2014

05:45:18   atch/s  pgin/s ppgin/s    pflt/s    vflt/s  slock/s
05:45:23  4391.18    0.80    2.20  12019.44  30956.92     0.60
05:45:28  2172.26    1.80    2.40   5417.76  15499.80     0.20
05:45:33  2765.60    0.20    0.20   9893.20  20556.60     0.00
05:45:38  2194.80    2.00    2.00   7494.80  19018.60     0.00

Average   2879.85    1.20    1.70   8703.00  21500.25     0.20
The preceding output reports the following columns:
atch/s: Page faults per second that are satisfied by reclaiming a page
currently in memory.
pgin/s: The number of times per second that the filesystem receives
page in requests.
ppgin/s: Pages paged in per second.
pflt/s: The number of page faults from protection errors.
vflt/s: Address translation page faults per second. This happens when
a valid page is not found in memory.
slock/s: Faults per second caused by software lock requests requiring
physical I/O.
Similarly, for swapping we can use the sar command with the -w
switch to indicate whether there are any performance problems with
swap.
3. Similarly, we can use the sar command with the -w switch to report swapping
activities and identify whether there are any swap-related issues:
bash-3.2$ sar -w 5 4
SunOS usmtnz-sinfsi17 5.10 Generic_150400-04 sun4u    08/08/2014

06:20:55  swpin/s bswin/s swpot/s bswot/s pswch/s
06:21:00     0.00     0.0    0.00     0.0   53143
06:21:05     0.00     0.0    0.00     0.0   60949
06:21:10     0.00     0.0    0.00     0.0   55149
06:21:15     0.00     0.0    0.00     0.0   64075

Average      0.00     0.0    0.00     0.0   58349
The preceding output reports the following columns:
swpin/s: This indicates the number of LWP transfers in memory per
second.
bswin/s: This indicates the number of blocks transferred for swap-ins
per second.
swpot/s: This reports the average number of processes that are
swapped out of memory per second.
bswot/s: This reports the number of blocks that are transferred for
swap-outs per second.
pswch/s: This indicates the number of kernel thread switches per
second.
How it works…
If the si and so columns of the vmstat -S output have non-zero values, then it serves as a
good indicator of a possible performance issue related to swapping. This must be further
investigated using the more detailed analysis provided by the sar command with the -p
and -w switch, respectively.
For paging the key is to look for an inordinate amount of page faults of any kind. This
would indicate a high degree of paging. The concern is not with paging, but with swapping
because as paging increases it would be followed by swapping. We can look at the values in
the atch/s, pflt/s, vflt/s, and slock/s columns of the sar -p command output to
review the amount of page faults of any type and see the paging statistics to observe
whether the paging activity remains steady or increases during a specific timeframe.
For the output of the sar -w command, the key column to observe is the swpot/s column.
This column indicates the average number of processes that are swapped out of memory
per second. If the value in this column is greater than 1, it is an indicator of memory
deficiency and to correct this you would have to increase memory.
Tracking CPU consuming processes
In this recipe, we are going to use the top command to find the processes that are using a lot
of CPU resources.
Getting ready
The top command is a Linux-based utility and it does not work in Unix-based systems. If
you are using Solaris then use the prstat command instead to find CPU intensive
processes.
How to do it…
The usage of the top command is shown in the following snippet:
bash-3.2$top
Cpu states: 0.0% idle, 82.0% user, 18.7% kernel, 0.8% wait, 0.5% swap
Memory: 795M real, 12M free, 318M swap, 1586M free swap
PID USERNAME PRI NICE SIZE RES STATE TIME WCPU CPU COMMAND
23624 postgres -25 2 208M 4980K cpu 1:20 22.47% 94.43% postgres
15811 root -15 4 2372K 716K sleep 22:19 0.61% 3.81% java
20435 admin 33 0 207M 2340K sleep 2:47 0.23% 1.14% postgres
20440 admin 33 0 93M 2300K sleep 2:28 0.23% 1.14% postgres
23698 root 33 0 2052K 1584K cpu 0:00 0.23% 0.95% top
23621 admin 27 2 5080K 3420K sleep 0:17 1.59% 0.38% postgres
23544 root 27 2 2288K 1500K sleep 0:01 0.06% 0.38% br2.1.adm
15855 root 21 4 6160K 2416K sleep 2:05 0.04% 0.19% proctool
897 root 34 0 8140K 1620K sleep 55:46 0.00% 0.00% Xsun
20855 admin -9 2 7856K 2748K sleep 7:14 0.67% 0.00% PSRUN
208534 admin -8 2 208M 4664K sleep 4:21 0.52% 0.00% postgres
755 admin 23 0 3844K 1756K sleep 2:56 0.00% 0.00% postgres
2788 root 28 0 1512K 736K sleep 1:03 0.00% 0.00% lpNet
18598 root 14 10 2232K 1136K sleep 0:56 0.00% 0.00% xlock
1 root 33 0 412K 100K sleep 0:55 0.00% 0.00% ini
The starting lines in the preceding output give general system information, whereas the rest
of the display is arranged in order of decreasing current CPU usage.
How it works…
The top command provides statistics on CPU activity. It displays a list of CPU intensive
tasks on the system and also provides an interface for manipulating processes.
In the preceding output, we can see that the top user is postgres with a process ID of 23624.
We can see that the CPU consumption of this process is 94.43%, which is way too high; it
needs to be investigated, or the corresponding operating system process needs to be killed
if it is causing performance issues on the system.
Monitoring CPU load
In this recipe, we are going to use the uptime command to monitor overall CPU load.
How to do it…
The uptime command tells us the following information:
Current system time
How long the system has been running
Number of currently logged on users in the system
System load average for the past 1, 5, and 15 minutes
The uptime command can be used as follows:
bash-3.2$ uptime
11:44pm up 20 day(s), 20 hr(s), 10 users, load average: 27.80, 30.46, 33.77
Here in the preceding output we can see the current system time, 11.44pm GMT, and the
system has been up and running for the last 20 days and 20 hrs without requiring a reboot.
The output also tells us there are around 10 concurrently logged on users in the system.
Finally, we also get the load average during the past 1, 5, and 15 minutes as 27.80, 30.46, and
33.77 respectively.
How it works…
The basic purpose of running the uptime command is to have a quick look at the current
CPU load on the system, which gives a quick snapshot of current system performance. System
load can be defined as the average number of processes that are either in a runnable or an
uninterruptible state. A process is in a runnable state when it is either using the CPU or
waiting for its turn to acquire the CPU. A process is in an uninterruptible state when it is
waiting for some I/O operation, such as a disk read, to complete. The load average is taken
over three time intervals: the 1, 5, and 15 minute periods. Load averages are not normalized
for the number of CPUs in the system; for a system with a single CPU, a load average of 1
means that the CPU was loaded all the time with no idle period, whereas for a system with
five CPUs, a load average of 1 indicates an idle time of 80% and a busy period of only 20%.
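To relate the load average to the number of CPUs, you can divide the 1-minute figure by the
CPU count. This is a minimal sketch, assuming getconf and bc are available and the uptime
line is in the format shown above:
# Number of online CPUs and the 1-minute load average.
CPUS=$(getconf _NPROCESSORS_ONLN)
LOAD1=$(uptime | sed 's/.*load average[s]*: //' | cut -d, -f1)
# A per-CPU value consistently above 1.0 means the run queue is saturated.
echo "scale=2; $LOAD1 / $CPUS" | bc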
Identifying CPU bottlenecks
In this recipe, we are going to use the mpstat command to identify CPU bottlenecks.
How to do it…
The mpstat command is used to report per processor statistics in a tabular format.
We are now going to show the usage of the mpstat command:
bash-3.2$ mpstat 1 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  672   0 2457   681   12  539   17   57  119    0  4303   18  10   0  73
  1   90   0 1551   368   22  344    6   37  104    0  3775   17   4   0  79
  2   68   0 1026   274   14  217    4   24   83    0  2393   11   3   0  86
  3   50   0  568   218    9  128    3   17   56    0  1319    7   2   0  92
  4   27   0  907   340   12  233    3   22   72    0  2034    9   2   0  88
  5   75   0 1777   426   25  370    5   33  111    0  4820   22   4   0  74
How it works…
In the preceding output of the mpstat command, each row of the table represents the
activity of one processor. The first table shows a summary of activity since boot time. The
column most relevant from a DBA's perspective is smtx, which indicates the number of times
the CPU failed to obtain a mutual exclusion lock (mutex). Mutex stalls waste CPU time and
degrade multiprocessor scaling.
A general rule of thumb is that if the values in the smtx column are greater than 200, it is a
symptom of a CPU bottleneck that needs to be investigated.
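To keep an eye on this without staring at the output, you can filter the smtx column. This is
a minimal sketch, assuming the Solaris mpstat layout shown above, where smtx is the tenth
field:
# Sample mpstat every 5 seconds and flag any CPU whose smtx count exceeds 200.
mpstat 5 | awk '$1 ~ /^[0-9]+$/ && $10 > 200 {print "CPU " $1 ": smtx = " $10}'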
Identifying disk I/O bottlenecks
In this recipe, we are going to use the iostat command to identify disk-related bottlenecks.
How to do it…
There are various options, that is, switches, available with the iostat command. The
following are the most important switches used with iostat:
1. -d: This switch reports the number of kilobytes transferred per second for
specific disks, the number of transfers per second, and the average service time in
milliseconds. The following is the usage of the iostat -d command:
bash-3.2$ iostat -d 5
          sd0            sd2            sd3            sd4
 Kps tps serv   Kps tps serv   Kps tps serv   Kps tps serv
   1   0   53    57   5  145    19   1   89     0   0   14
 140  14   16     0   0    0   785  31   21     0   0    0
   8   1   15     0   0    0   814  36   18     0   0    0
  11   1   82     0   0   26   818  36   19     0   0    0
   0   0    5     1   0   22   856  37   20     0   0    0
2. -D: This switch lists the reads per second, writes per second, and percentage of
disk utilization:
bash-3.2$ iostat -D 5 5
          sd0            sd2            sd3            sd4
 rps wps util   rps wps util   rps wps util   rps wps util
   0   0  0.3     4   0  6.2     1   1  1.8     0   0  0.0
   0   0  0.0     0  35 90.6   237   0 97.8     0   0  0.0
   0   0  0.0     0  34 84.7   218   0 98.2     0   0  0.0
   0   0  0.0     0  34 88.3   230   0 98.2     0   0  0.0
   0   2  4.4     0  37 91.3   225   0 97.7     0   0  0.0
3. -x: This switch will report extended disk statistics for all disks:
bash-3.2$ iostat -x
                           extended device statistics
device    r/s   w/s    kr/s    kw/s  wait  actv  svc_t  %w  %b
fd0       0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
sd0       0.0   0.0     0.4     0.4   0.0   0.0   49.5   0   0
sd2       0.0   0.0     0.0     0.0   0.0   0.0    0.0   0   0
sd3       0.0   4.6     0.0   257.6   0.0   0.1   26.4   0  12
sd4      69.3   3.6   996.9   180.5   0.0   7.6  102.4   0 100
nfs10     0.0   0.0     0.4     0.0   0.0   0.0   24.5   0   0
nfs14     0.0   0.0     0.0     0.0   0.0   0.0    6.3   0   0
nfs16     0.0   0.0     0.0     0.0   0.0   0.0    4.9   0   0
How it works…
The iostat command reports statistics about disk input and output operations and produces
measurements of throughput, utilization, queue lengths, transaction rates, and service time.
The first line of the iostat output covers everything since the system was booted, whereas
each subsequent line shows only the prior interval specified.
If we observe the preceding output of the iostat -d command in the How to do it section,
we can clearly see that disk drive sd3 is heavily overloaded. The values in the Kps column
(kilobytes transferred per second), the tps column (number of transfers per second), and the
serv column (average service time in milliseconds) for disk drive sd3 are consistently high
over the specified interval. This leads us to the conclusion that moving data from sd3 to
another drive might be a good idea if this output is representative of disk I/O on a consistent
basis, as it would reduce the load on disk drive sd3.
Also, if we observe the output of the iostat -D command in the How to do it section, we
can conclude that disk drive sd3 has high read activity, as indicated by the high values in
the rps column (reads per second). Similarly, we can see that disk drive sd2 has high write
activity, as indicated by the high values in the wps column (writes per second). Both disk
drives sd2 and sd3 are at a peak level of utilization, as can be seen from the high percentages
in the util column. High values in the util column are an indication of I/O problems and
should be investigated by the system administrator.
Similarly, if we have a look at the preceding output of the iostat -x command in the How
to do it section, we can easily come to the conclusion that disk drive sd4 is experiencing I/O
problems, as can be seen from the %b column, which indicates the percentage of time that
the disk is busy. For disk drive sd4, the disk utilization is at 100%, which in this case would
need a system administrator's immediate attention.
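A simple way to spot such disks continuously is to filter the %b column of iostat -x. This is
a minimal sketch, assuming the Solaris layout shown above, where %b is the last field of each
device line:
# Report any device that stays more than 90% busy in each 5-second sample.
iostat -x 5 | awk '$NF ~ /^[0-9]+$/ && $NF > 90 {print $1 " is " $NF "% busy"}'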
Monitoring system load
Many times, application users start complaining that database performance is slow, and as a
DBA you need to determine whether there are system resource bottlenecks on the PostgreSQL
server. Running the vmstat command can help us quickly locate and identify the bottlenecks
on the server.
How to do it…
The vmstat command is used to report real-time performance statistics about processes,
memory, paging, disk I/O, and CPU consumption. The following is the usage of the vmstat
command:
$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd  free  buff   cache   si   so    bi    bo   in   cs us sy id wa
14  0  52340 25272  3068 1662704    0    0    63    76    9   31 15  1 84  0
In the preceding output, the first line divides the columns on the second line into six
different categories, which are discussed in the following list:
The first category is the process (procs) and it contains the following columns:
r: This column indicates the total number of processes waiting for
runtime
b: This column reports the total number of processes in
uninterruptible sleep
The second category is memory and it contains the following columns:
swpd: This column indicates the total amount of virtual memory in
use
free: This column reports the amount of idle memory available for
use
buff: This column indicates the amount of memory used for
buffers
cache: This column indicates the amount of memory used as page
cache
The third category is swap and it contains the following columns:
si: This column reports the amount of memory swapped in from
disk
so: This column indicates the amount of memory swapped out
from disk
The fourth category is I/O (io) and it contains the following columns:
bi: This column indicates the blocks that are read in from a block
device
bo: This column reports the blocks that are written out to a block
device
The fifth category is system and this category contains the following columns:
in: This column reports the number of interrupts per second
cs: This column reports the number of context switches per second
The final category is the CPU and it contains the following columns:
us: This column reports the percentage of the time CPU ran user
level code
sy: This column reports the percentage of the time the CPU ran
system level code
id: This column reports the percentage of time the CPU was idle
wa: This column reports the percentage of time the CPU spent waiting for I/O to
complete
How it works…
The following are the general rules of thumb that are used while interpreting the vmstat
command output:
If the value in the wa column is high, it is an indication that the storage system is
probably overloaded and necessary action needs to be taken to address that issue
If the value in the b column is greater than zero consistently, then it is an
indication that the system does not have enough processing power to service the
currently running and scheduled jobs
If the values in the so and the si column are greater than zero when monitored
for a period of time, then it is an indication and a symptom of a memory
bottleneck
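These rules can be scripted; the following is a minimal sketch, assuming the 16-column Linux
vmstat layout shown above (newer vmstat versions append an extra st column, in which case
the field numbers need adjusting):
# Take ten 1-second samples and warn on blocked processes (column b) or
# high I/O wait (column wa).
vmstat 1 10 | awk 'NR > 2 {
    if ($2 > 0)   print "processes blocked on I/O:", $2;
    if ($16 > 20) print "high iowait:", $16 "%";
}'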
Tracking historical CPU usage
In this recipe, we are going to show how to use the sar command in combination with
various switches to analyze historical CPU load during some time in the past.
Getting ready
The commands used in this recipe have been performed on a CentOS Linux machine. The
command output may vary in other Linux and Unix-based operating systems.
How to do it…
The sar command when used with the -u switch is used to display CPU statistics. When
used this way, the sar command will report on the current day's activities.
If we are looking to analyze the CPU statistic over some time period in the past we would
need to use the -f switch in conjunction with the -u switch of the sar command. The -f
option is followed by the files that sar uses to report on statistics for different days of the
month, and these files are usually located in the /var/log/sa directory and usually have a
naming convention of sadd, where dd represents the numeric day of the month whose
values are in a range from 01 to 31.
The following snippet shows the usage of the sar command to view CPU statistics for the
eighth day of the month:
$ sar -u -f /var/log/sa/sa08
03:50:10 PM     CPU     %user     %nice   %system   %iowait     %idle
04:00:10 PM     all      0.42      0.00      0.24      0.00     96.41
04:10:10 PM     all      0.22      0.00      1.96      0.00     95.53
04:20:10 PM     all      0.22      0.00      1.22      0.01     99.55
04:30:10 PM     all      0.22      0.00      0.24      2.11     99.54
04:40:10 PM     all      0.24      0.00      0.23      0.00     92.54
Average:        all      0.19      0.00      0.19      0.07     99.55
How it works…
Generally, the rule of thumb is that if %idle is low, it is an indication that either the CPUs
are underpowered or the application load is high. Similarly, if we see sustained non-zero
values in the %iowait column, it is a reminder that the I/O subsystem could be a potential
bottleneck.
If we observe the preceding output of the sar command, we can see that the %idle time is
high, which indicates that the CPUs are probably not overburdened, and we don't see many
non-zero values in the %iowait column, which tells us that there is not much contention for
disk I/O either.
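To scan a whole day's file for busy intervals instead of reading it line by line, something like
the following minimal sketch can be used; it assumes the sar -u layout shown above, where
%idle is the last field:
# List any interval on the 8th where the CPUs were less than 20% idle.
sar -u -f /var/log/sa/sa08 | awk '$NF ~ /^[0-9.]+$/ && $NF < 20 {print}'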
There's more…
When the sysstat package is installed, a few cron jobs are scheduled to create files used by
the sar utility to report on historical server statistics. We can observe these cron jobs by
taking a look at the /etc/cron.d/sysstat file.
Tracking historical memory usage
In this recipe, we are going to see how to analyze the memory load for a previous day in the
week.
Getting ready
The commands used in this recipe have been performed on a CentOS Linux machine. The
command output may vary in other Linux and Unix-based operating systems.
How to do it…
When it comes to analyzing memory statistics, we need to check for both paging statistics as
well as swapping statistics.
We can use the sar command in conjunction with the -B switch to report on paging
statistics, along with the -f switch to report on statistics for different days of the month. As
mentioned in the previous recipe, the files that the sar command uses to report statistics for
different days of the month are located in the /var/log/sa directory and follow the sadd
naming convention, where dd represents the numeric day of the month, in the range 01 to 31.
For instance, to report on the paging statistics for the fifth of the month, we can use the sar
command as follows:
$ sar -B -f /var/log/sa/sa05
06:10:05 AM  pgpgin/s  pgpgout/s   fault/s  majflt/s
06:20:05 AM      0.02      18.17     19.37      0.00
06:30:05 AM      4.49      26.68     76.15      0.05
06:40:05 AM   4512.43     419.24    380.14      0.65
06:50:06 AM   4850.03    1055.79   4364.73      0.51
07:00:06 PM   4172.68    1096.96   6650.51      0.16
Similarly, to report on the swapping statistics we can use the sar command in conjunction
with the -W switch to report on swapping statistics, along with the -f switch to report on
statistics for different days of the month.
For instance, to report on the swapping statistics for the fifth day of the month we can use
the sar command as follows:
$ sar -W -f /var/log/sa/sa05
06:10:05 AM   pswpin/s  pswpout/s
06:20:05 AM       0.00       0.00
06:30:05 AM       0.02       0.00
06:40:05 AM       1.15       1.45
06:50:06 AM       0.94       2.99
07:00:06 PM       0.67       6.95
How it works…
In the preceding output of the sar -B -f /var/log/sa/sa05 command, we can clearly
see that around 6:40 a.m. there was a substantial increase in pages paged in from disk
(pgpgin/s), pages paged out to disk (pgpgout/s), and page faults per second (fault/s).
Similarly, when swapping statistics are reported with the sar -W -f /var/log/sa/sa05
command, we can see that swapping also started around 6:40 a.m., as shown by the values
in the pswpin/s and pswpout/s columns. If we see high values in the pswpin/s column
(pages swapped in per second) and the pswpout/s column (pages swapped out per second),
it is an indication that the current memory is inadequate and either needs to be increased, or
the memory allotted to certain application components needs to be resized.
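The same check can be run across several daily files at once; a minimal sketch, assuming the
/var/log/sa/sadd naming convention described earlier:
# Scan the first week's sar files for intervals with any swap-out activity;
# sustained non-zero pswpout/s values point to a memory shortage.
for f in /var/log/sa/sa0[1-7]; do
    echo "== $f =="
    sar -W -f "$f" | awk '$NF ~ /^[0-9.]+$/ && $NF > 0 {print}'
done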
Monitoring disk space
In this recipe, we are going to show the commands that are used to monitor disk space.
How to do it…
We can use the df command with various switches to monitor disk space. To make the
output more understandable, we often use the -h switch with the df command as follows:
bash-3.2$ df -h
Filesystem             size   used  avail capacity  Mounted on
                       132G    80G    50G      62%  /
/devices                 0K     0K     0K       0%  /devices
ctfs                     0K     0K     0K       0%  /system/contract
proc                     0K     0K     0K       0%  /proc
mnttab                   0K     0K     0K       0%  /etc/mnttab
swap                   418G   488K   418G       1%  /etc/svc/volatile
swap                   418G    38M   418G       1%  /tmp
swap                   418G   152K   418G       1%  /var/run
/dev/dsk/c20t60000970000192602156533030374242d0s0
                       236G   240M   234G       1%  /peterdata/cm_new
/dev/dsk/c20t60000970000192602156533032353441d0s0
                        30G    30M    29G       1%  /peterdata/native
/dev/dsk/c20t60000970000192602156533033313441d0s0
                       236G    60G   174G      26%  /peterdata/db_new
/dev/dsk/c20t60000970000195701036533032454646d0s0
                        30G   6.9G    22G      24%  /peterdata/native
/dev/dsk/c20t60000970000195701036533032444137d0s0
                       236G   224G    12G      95%  /peterdata/db
/dev/dsk/c20t60000970000192602156533032333232d0s2
                       709G   316G   386G      45%  /peterdata/cm
usmtnnas4106-epmnfs.emrsn.org:/peterb1ap_2156
                       276G   239G    36G      87%  /peterb1ap
usmtnnas4106-epmnfs.emrsn.org:/peterdata_data_2156
                        98G    53G    45G      54%  /peterdata/data
usmtnnas4106-epmnfs.emrsn.org:/peterdata_uc4appmgr
                       9.8G   3.6G   6.3G      37%  /peterdata/uc4/
How it works…
If we observe the preceding output, we can see that the mountpoint /peterdata/db is
nearing its full capacity as the mountpoint has reached a capacity of 95% and only another
12 GB of free disk space is available on the device. This is an indication that either the
administrator needs to clean up some old files on the existing mountpoint to release more
free space, or allocate more additional space to the given mount point before it reaches its
full capacity.
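This kind of check is easy to automate. The following minimal sketch assumes a GNU/Linux
df (the -P switch keeps each filesystem on a single line) and a 90% threshold:
# Report every filesystem that is more than 90% full.
df -hP | awk 'NR > 1 { sub("%", "", $5); if ($5 + 0 > 90) print $6 " is " $5 "% full" }'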
Monitoring network status
In this recipe, we are going to show how to monitor the status of network interfaces.
Getting ready
The commands used in this recipe have been performed on a CentOS Linux machine. The
command output may vary in other Linux and Unix-based operating systems.
How to do it…
We are going to use the netstat command with the -i switch to display the status of
network interfaces that are configured on the system. The following is the usage of the
netstat command:
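bash-3.2$ netstat -i
The output layout differs between platforms. The following is a minimal sketch, assuming
the Solaris-style netstat -i layout discussed in the next section (Ipkts in the fifth field,
Opkts in the seventh); it samples the counters twice and reports whether they are moving:
# Capture the interface name and its input/output packet counters, wait a
# while, capture them again, and compare the two samples.
netstat -i | awk 'NR > 1 {print $1, $5, $7}' > /tmp/net_before
sleep 30
netstat -i | awk 'NR > 1 {print $1, $5, $7}' > /tmp/net_after
diff /tmp/net_before /tmp/net_after > /dev/null && echo "packet counters did not move"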
How it works…
In the preceding output of the netstat -i command, we can determine the number of
packets a system transmits and receives on each network interface. The Ipkts column shows
the input packet count and the Opkts column shows the output packet count. If the input
packet count (the value in the Ipkts column) does not increase over a period of time, the
machine is not receiving network packets at all, which could possibly indicate a hardware
failure on the network interface. If the output packet count (the value in the Opkts column)
does not increase over a period of time, it could point towards problems caused by an
incorrect address entry in the hosts or the ethers database.
5
Connection Pooling and
Database Partitioning
In this chapter, we will cover the following:
Installing pgpool-II
Configuring pgpool-II and testing the setup
Installing PgBouncer
Connection pooling using PgBouncer
Managing PgBouncer
Implementing partitioning
Managing partitions
Installing PL/Proxy
Partitioning with PL/Proxy
Introduction
pgpool-II is basically a middleware solution that works as an interface between a
PostgreSQL server and a PostgreSQL client application. pgpool-II speaks PostgreSQL's
backend and frontend protocol and relays connections between the two. pgpool-II caches
incoming connections to the PostgreSQL servers and reuses them whenever a new connection
with the same properties comes in. This way it reduces connection overhead and improves
overall throughput.
pgpool-II in fact offers a lot more features than connection pooling. pgpool-II offers the
load balancing and replication modes along with the parallel query feature. However, since
this chapter is dedicated to connection pooling, we are going to focus only on the
connection pooling feature.
PgBouncer is also a lightweight connection pooler for PostgreSQL. Applications connect to
the PgBouncer port just as they would connect to a PostgreSQL database on its database
port. By using PgBouncer we can lower the connection overhead on the PostgreSQL servers.
PgBouncer provides connection pooling by reusing existing connections.
The difference between PgBouncer and pgpool-II is that PgBouncer is lightweight and
dedicated solely to connection pooling, whereas pgpool-II offers many more features, such as
replication, load balancing, and parallel queries, in addition to connection pooling.
Partitioning is defined as splitting up a large table into smaller chunks. PostgreSQL
supports basic table partitioning. Partitioning is entirely transparent to applications.
Partitioning has a lot of benefits, which are discussed in the following list:
Query performance can be improved significantly for certain types of queries.
Partitioning can lead to an improved update performance. Whenever queries
update a big chunk of a single partition, performance can be improved by
performing a sequential scan for that partition instead of using random reads and
writes, which are dispersed across the entire table.
Bulk deletes can be accomplished by simply removing one of the partitions.
Infrequently used data can be shipped to cheaper and slower media.
For this chapter, we will be referring to pgpool-II as pgpool for the purpose of simplicity.
Installing pgpool-II
In this recipe, we are going to install pgpool.
Getting ready
Installing pgpool from source requires GCC 2.9 or higher and GNU Make. Since pgpool
links with the libpq library, the libpq library and its development headers must be
installed prior to installing pgpool. Also the OpenSSL library and its development headers
must be present in order to enable OpenSSL support in pgpool.
How to do it…
To install pgpool in a Debian or Ubuntu-based distribution we can execute the following
command:
apt-get install pgpool2
On Red Hat, Fedora, CentOS, or any other RHEL-based Linux distributions, use the
following command:
yum install pgpool*
If you are building from source, then follow these steps:
1. Download the latest tarball of pgpool from the following website: http://www.pg
pool.net/mediawiki/index.php/Downloads.
2. The next step is to extract the pgpool tarball and enter the source directory:
tar -xzf pgpool-II-3.4.0.tar.gz
cd pgpool-II-3.4.0
3. The next step is to build, compile, and install pgpool software:
./configure --prefix=/usr --sysconfdir=/etc/pgpool/
make
make install
4. The next step is to create the location where pgpool can maintain activity logs:
mkdir /var/log/pgpool
chown postgres:postgres /var/log/pgpool
5. The next step is to create a directory where pgpool can store its service lock files:
mkdir /var/run/pgpool
chown postgres:postgres /var/run/pgpool
How it works…
If you are using an operating system specific package manager to install pgpool then the
respective configuration files and log files required by pgpool are automatically created.
However, if you are proceeding with a full source-based pgpool installation, there are some
additional steps required. You first need to run the configure script and then build and
compile pgpool. After pgpool is installed, you need to create the directories where pgpool
maintains its activity logs and service lock files. All of these steps have to be performed
manually, as shown in steps 3, 4, and 5 of the previous section.
Configuring pgpool and testing the setup
In this recipe, we are going to configure pgpool and show how to make connections using
pgpool.
Getting ready
Before running pgpool if you are downloading from the tarball source then the pgpool
software needs to be built and compiled. These steps are shown in the first recipe of this
chapter.
How to do it…
We are going to follow these steps to configure pgpool and run the setup:
1. After pgpool is installed as shown in the first recipe of this chapter, the next step
is to copy the configuration files from the sample directory with some default
settings, which will be later edited as per our requirements:
cd /etc/pgpool/
cp pgpool.conf.sample-stream pgpool.conf
cp pcp.conf.sample /etc/pcp.conf
2. The next step is to define a username and password in the pcp.conf file, which
is an authentication file for pgpool. Basically, to use PCP commands user
authentication is required. This mechanism is different from PostgreSQL's user
authentication. Passwords are stored in md5 hash format. To obtain the
md5 hash for a user we have to use the pg_md5 utility, as shown in the following
snippet. Once the md5 hash is generated it can then be used to store the md5
password in the pcp.conf file:
/usr/bin/pg_md5 postgres
e8a48653851e28c69d0506508fb27fc5
vi /etc/pcp.conf
postgres:e8a48653851e28c69d0506508fb27fc5
3. The next step is to configure the pgpool.conf configuration file for pgpool
settings to take effect:
listen_addresses = '*'
port = 9999
backend_hostname0 = 'localhost'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/9.6/data'
connection_cache = on
max_pool = 10
num_init_children = 20
4. Once the preceding parameters have been configured and saved in the
pgpool.conf file, the next step is to launch pgpool and start accepting
connections to the PostgreSQL cluster using pgpool:
pgpool -f /etc/pgpool/pgpool.conf -F /etc/pcp.conf
psql -p 9999 postgres postgres
How it works…
Let's now discuss some of the parameters that were configured in the earlier section:
listen_addresses: We configure listen_addresses to * because we want to
listen on all TCP connections and not specifically to a particular IP address.
port: This defines the pgpool port on which the system will listen when
accepting database connections.
backend_hostname0: This refers to the hostname of the first database in our
setup.
backend_port0: The TCP port of the system, that is, the system identified by the
value backend_hostname0 on which the database is hosted.
backend_weight0: The weight assigned to the node identified by the hostname
obtained from the backend_hostname0. Basically, in pgpool weights are
assigned to individual nodes. More requests will be dispatched to the node with a
higher weight value.
backend_data_directory0: This represents the data directory, that is,
PGDATA for the host identified by the backend_hostname0 value.
connection_cache: To enable the connection pool mode we need to turn on
connection_cache.
max_pool: This value determines the maximum size of the pool per child. The
number of connections from pgpool to the backend may reach a limit at
num_init_children*max_pool.
In our case we configured max_pool to 10 and num_init_children to 20. So in sum, the
total number of connections from pgpool to the backend may reach a limit of 200.
Remember that the max_pool * num_init_children value should always be less than
the max_connections parameter value.
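As a quick sanity check of that calculation, you can compare it with the backend's settings; a
minimal sketch, assuming psql can reach the backend directly on port 5432:
# max_connections on the backend must be able to hold
# num_init_children * max_pool (20 * 10 = 200 here) plus some headroom
# for superuser connections.
psql -p 5432 -At -c "SHOW max_connections;"
psql -p 5432 -At -c "SHOW superuser_reserved_connections;"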
Finally, once the parameters are configured it is time to launch pgpool and make
connections on the pgpool port 9999.
Please refer to the following web links for more details:
http://www.pgpool.net/mediawiki/index.php/Relationship_between_max_pool,_num_init_children,_and_max_connections and http://www.pgpool.net/docs/latest/en/html/.
There's more…
Here we are going to show the commands that can be used to start/stop pgpool.
pgpool can be started in two ways:
1. By starting the pgpool service at the command line as the root user:
service pgpool start
2. By firing the pgpool command on the terminal:
pgpool
Similarly, pgpool can be stopped in two ways:
1. By stopping the pgpool service at the command line as the root user:
service pgpool stop
2. By firing the pgpool command with the stop option:
pgpool stop
Starting and stopping pgpool is relatively simple as was seen in the preceding section.
However, pgpool comes with a lot of options and the following is the partial and most used
syntax for starting pgpool:
pgpool [-f config_file][-a hba_file][-F pcp_config_file]
These options are discussed in the following list:
-c: This switch clears the query cache
-f config_file: This option specifies the pgpool.conf configuration file and
pgpool obtains its configuration from this file while starting itself
-a hba_file: This option specifies the authentication file that is used when
starting pgpool
-F pcp_config_file: This option specifies the password file pcp.conf to be
used when starting pgpool
For full syntax of pgpool, refer to the following link:
http://www.pgpool.net/docs/latest/en/html/pgpool.html.
For stopping pgpool the same options that were used earlier while starting pgpool can be
used. However, along with these switches we can also specify the mode that needs to be
used while stopping pgpool.
There are two modes in which pgpool can be stopped:
Smart mode: This mode is selected with the -m smart option. In this mode, pgpool
waits for the clients to disconnect and then shuts down.
Fast mode: This mode is selected with the -m fast option. In this mode, pgpool
does not wait for clients to disconnect and shuts down immediately.
The complete syntax of the pgpool stop command is as follows:
pgpool [-f config_file][-F pcp_config_file] [-m {s[mart]|f[ast]}] stop
pgpool can also perform automatic failover and failback, which reduces database
downtime. For more information about this, please check out the following URL:
http://manojadinesh.blogspot.in/2013/12/pgpool-configuration-failback.html.
Installing PgBouncer
In this recipe, we are going to show the steps that are required to install PgBouncer.
Getting ready
We can either do a full source-based installation or use the operating system-specific
package manager to install PgBouncer.
How to do it…
On an Ubuntu/Debian based system, we need to execute the following command to install
PgBouncer:
apt-get install pgbouncer
On CentOS, Fedora, or Red Hat-based Linux distributions, we can execute the following
command:
yum install pgbouncer
If you are doing a full source-based installation, then the sequence of commands is as
follows:
1. Download the archive installation file from the given link:
http://pgfoundry.org/projects/pgbouncer
2. Extract the downloaded archive and enter the source directory:
tar -xzf pgbouncer-1.5.4.tar.gz
cd pgbouncer-1.5.4
3. The next step is to build and proceed with the software installation:
./configure --prefix=/usr
make && make install
4. After PgBouncer is installed, the next step is to create the directory where
PgBouncer can maintain activity logs:
mkdir /var/log/pgbouncer
chown postgres:postgres /var/log/pgbouncer
5. The next step is to create a directory where PgBouncer can keep its service lock
files:
mkdir /var/run/pgbouncer
chown postgres:postgres /var/run/pgbouncer
6. The final step is to create a configuration directory to hold a sample PgBouncer
configuration file that can be used later on to make parameter changes:
mkdir /etc/pgbouncer
cp etc/pgbouncer.ini /etc/pgbouncer
chown -R postgres:postgres /etc/pgbouncer
How it works…
If you are using an operating system specific package manager to install PgBouncer then the
respective configuration files and log files required by PgBouncer are automatically created.
However, if you are proceeding with a full source based PgBouncer installation then there
are some additional steps required. You will be required to create the directories where
PgBouncer can maintain activity logs and service lock files. You will also be required to
create the configuration directory where the configuration file for PgBouncer will be stored.
All these steps need to be performed manually as seen in steps 4, 5, and 6 respectively in the
preceding section.
Connection pooling using PgBouncer
Here in this recipe we are going to implement PgBouncer and benchmark the results for
database connections made to the database via PgBouncer against normal database
connections.
Getting ready
Before we configure and implement connection pooling, the PgBouncer utility must be
installed prior to performing the steps mentioned in this recipe. Installing PgBouncer is
covered in the previous recipe.
How to do it…
1. First we are going to tweak some of the configuration settings in the
pgbouncer.ini configuration file, as shown in the following snippet. The first
two entries, which belong in the [databases] section of the file, define the
databases that will be accessed through PgBouncer. Next, in the [pgbouncer]
section, we configure the listen_addr parameter to *, which means that
PgBouncer will listen on all IP addresses. Finally, we set the last two parameters:
auth_file, which is the location of the authentication file, and auth_type, which
indicates the type of authentication used. We use md5 as the authentication type,
which means that md5-hashed passwords from the authentication file are used
for authentication:
vi /etc/pgbouncer/pgbouncer.ini
postgres = host=localhost dbname=postgres
pgtest = host=localhost dbname=pgtest
listen_addr = *
auth_file = /etc/pgbouncer/userlist.txt
auth_type = md5
2. The next step is to create a user list containing the users who are allowed to
access databases through PgBouncer. Each entry in the user list is the username
followed by that user's password. In the following snippet the user is author,
and because we are using md5 as the authentication type, the password entry
must be the md5 hash of the user's password. Had we been using the plain
authentication type, the actual clear-text password would have been supplied in
the user list instead:
postgres=# CREATE role author LOGIN PASSWORD 'author' SUPERUSER;
CREATE ROLE
postgres=# select rolname, rolpassword from pg_authid where rolname='author';
 rolname |             rolpassword
---------+-------------------------------------
 author  | md5d50afb6ec7b2501164b80a0480596ded
(1 row)
The md5 password obtained can then be defined in the userlist file for the
corresponding user:
vi /etc/pgbouncer/userlist.txt
"author" "md5d50afb6ec7b2501164b80a0480596ded"
3. Once we have configured the pgbouncer.ini configuration file and created the
user list the next step is to start the pgbouncer service by using one of the
following approaches:
service pgbouncer start
Alternatively:
pgbouncer -d /etc/pgbouncer/pgbouncer.ini
4. Once the pgbouncer service is up and running, the next step is to make
connections to it. By default, the pgbouncer service runs on port 6432 so any
connections made to the pgbouncer service need to be made on port 6432:
psql -h localhost -p 6432 -d postgres -U author -W
5. Now that we have made connections using PgBouncer, the next logical step is to
find out if there are any performance improvements using PgBouncer. For this
purpose we are going to create a temporary database, the one that was initially
defined in the pgbouncer.ini file, and then we are going to insert records into it
and do benchmarking for connections made against this database:
createdb pgtest
pgbench -i -s 10 pgtest
6. The next step is to benchmark the results against the pgtest database:
bash-3.2$ pgbench -t 1000 -c 20 -S pgtest
starting vacuum...end.
transaction type: SELECT only
scaling factor: 10
number of clients: 20
number of transactions per client: 1000
number of transactions actually processed: 20000/20000
tps = 415.134776 (including connections establishing)
tps = 416.464707 (excluding connections establishing)
7. The final step is to benchmark the results against the pgtest database on
PgBouncer port 6432:
bash-3.2$ pgbench -h localhost -p 6432 -U author -t 1000 -c 20 -S pgtest
starting vacuum...end.
transaction type: SELECT only
scaling factor: 10
number of clients: 20
number of transactions per client: 1000
number of transactions actually processed: 20000/20000
tps = 2256.161662 (including connections establishing)
tps = 410.006262 (excluding connections establishing)
How it works…
Here we can see a couple of things. First, we have to configure the pgbouncer.ini
configuration file along with the user list that will be used for accessing databases via
PgBouncer. The effectiveness of PgBouncer can be seen in steps 6 and 7 of the preceding
section. Including connection establishment, the throughput rises to roughly 2,256
transactions per second when connecting through PgBouncer, whereas without PgBouncer it
is only about 415 transactions per second; the figures excluding connection establishment are
nearly identical in both runs, which shows that the gain comes from avoiding repeated
connection setup. In effect, when using PgBouncer the connection-inclusive throughput
increases roughly fivefold.
PgBouncer offers connection pooling in different modes: session mode, transaction mode,
and statement mode. Session pooling is PgBouncer's default and most commonly used
mode; a server connection is returned to the pool whenever the application closes its
connection, and the same server connection is then reused for other client sessions.
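The mode is selected with the pool_mode parameter; a minimal excerpt of the [pgbouncer]
section of pgbouncer.ini (session is the default, and the other recognised values are
transaction and statement):
[pgbouncer]
; valid values: session (default), transaction, statement
pool_mode = session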
There's more…
In the How to do it section we configured a couple of parameters in the pgbouncer.ini
configuration file. However, there are many more parameters that can be configured; if they
are not set, they take their default values. Please refer to the following link for more details
on PgBouncer parameters: https://pgbouncer.github.io/config.html.
These days it is more common to include PgBouncer modules in application business logic
to achieve better high availability. Refer to the following diagram, which shows how we
need to manage PgBouncer to do failover to the slave node when the master is not
available:
To achieve the preceding kind of setup in your business logic, we have to use a dedicated
tool called pgHA, which is implemented by the OpenSCG team.
Please find more information about pgHA at the following URL:
https://www.openscg.com/products/postgres-high-availability/.
Managing PgBouncer
PgBouncer provides an administrative console to view pool status and client connections. In
this recipe we are going to view information about PgBouncer server and client connections,
check the pool status, and obtain connection pooling statistics.
Getting ready
Before we issue any commands we first need to connect to the PgBouncer's administrative
console. For this purpose we need to set the admin_users parameter in the
pgbouncer.ini configuration file:
vi /etc/pgbouncer/pgbouncer.ini
admin_users = author
Once the preceding changes are saved in the pgbouncer.ini configuration file the
PgBouncer service needs to be restarted in order to ensure that the parameter changes come
into effect:
service pgbouncer restart
Once this is done we can then make connections to the PgBouncer administration console
with the following command:
psql -p 6432 -U author pgbouncer
How to do it…
With the help of the PgBouncer administration console we can get information regarding
the clients, servers, and the pool health:
1. To get information regarding the clients, issue the show clients command on
the PgBouncer admin interface:
2. To get information regarding server connections, issue the show
servers command on the PgBouncer administrative console:
3. Similarly, to get information regarding pool health you can issue the commands
show pools and show stats to get information regarding pool status and pool
statistics:
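At the PgBouncer administration console prompt, the commands are simply typed as follows
(a minimal sketch; the columns they return are discussed in the next section):
pgbouncer=# SHOW CLIENTS;
pgbouncer=# SHOW SERVERS;
pgbouncer=# SHOW POOLS;
pgbouncer=# SHOW STATS;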
How it works…
As we can see in the previous section, we can get information regarding the clients, servers,
and pool health and statistics. We will first begin with discussing clients, then servers, and
then finally pool health and statistics.
When you issue the show clients command on the PgBouncer administrative console,
PgBouncer provides us with a list of clients that have been either using or waiting for a
Postgresql connection. Some of the important columns that are displayed in the output of
the show clients command are discussed in the following list:
User: The value in this column displays the user that is connected to the
database.
Database: The value in this column displays the database name to which the
client may be connected.
State: The value in this column displays the session state of the currently
connected user. The client connection can either be in an active, used, waiting, or
idle state.
Connect_Time: The value in this column indicates the time at which PgBouncer
initiated the client connection to Postgresql.
Request Time: The value in this column shows the timestamp of the latest client
request.
Port: The value in this column indicates the port to which the client is connected.
The show servers command displays information about every server connection that is
being used to fulfill client requests. Its output contains columns similar to those discussed
for show clients. The main difference is the type column: if its value is S, the entry is a
server entry; if it is C, the entry is a client entry. Some of the other important columns for the
show servers output are displayed in the following list:
User: The value in this column displays the user that is connected to the
database
Database: The value in this column displays the name of the database to which
the connection is attached
State: The value in this column displays the state of the PgBouncer server
connection. The server state could be active, used, or idle
Connect_Time: The value in this column indicates the time at which the
connection was made
Request Time: The value in this column shows the timestamp when the last
request was issued
Port: The value in this column indicates the port number of the Postgresql server
The show pools command displays a row for every database for which PgBouncer acts as
a proxy. Some of the important columns in the show pools output are given in the
following list:
cl_active: The value in this column displays the number of clients that are
currently active and that are assigned server connections.
cl_waiting: The value in this column displays the number of clients waiting for
a server connection.
sl_active: The value in this column displays the number of server connections
that are assigned to PgBouncer clients.
sl_idle: The value in this column displays the number of idle server
connections or the ones that are not in use.
sl_used: The value in this column displays the number of used server
connections. In effect these connections are actually idle, but they have not been
yet marked by PgBouncer for reuse.
The show stats command displays the relevant connection pool statistics related to
PgBouncer and for the databases for which PgBouncer was acting as a proxy. Some of the
important columns in the show stats output are discussed here:
total_requests: The value in this column displays the total number of SQL
requests pooled by PgBouncer
total_received: The value in this column displays the total volume of network
traffic measured in bytes that has been received by PgBouncer
total_sent: The value in this column displays the total volume of network
traffic measured in bytes that has been sent by PgBouncer
total_query_time: The value in this column displays the amount of time in
microseconds that PgBouncer spent communicating with a client in this pool
Implementing partitioning
In this recipe we are going to cover table partitioning and show the steps that are needed to
partition a table.
Getting ready
Exposure to database design and normalization is needed.
How to do it…
There are a series of steps that need to be carried out to set up table partitioning. Here are
the steps:
1. The first step is to create a master table with all the fields. The master table is the
table that will be used as the base from which data is partitioned into other tables,
that is, partitions. An index on the master table is optional; however, since there
are performance benefits to having one, we create an index here:
CREATE TABLE country_log (
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
country_code char(2),
content text
);
CREATE INDEX country_code_idx ON country_log USING btree
(country_code);
2. The next step is to create child tables that are going to inherit from the master
table:
CREATE TABLE country_log_ru ( CHECK ( country_code = 'ru') )
INHERITS (country_log);
CREATE TABLE country_log_sa ( CHECK ( country_code = 'sa' ) )
INHERITS (country_log);
3. The next step is to create an index for each child table:
CREATE INDEX country_code_ru_idx ON country_log_ru
USING btree (country_code);
CREATE INDEX country_code_sa_idx ON country_log_sa
USING btree (country_code);
4. The next step is to create a trigger function with the help of which data will be
redirected to the appropriate partition table:
CREATE OR REPLACE FUNCTION country_insert_trig() RETURNS TRIGGER AS
$$
BEGIN
IF ( NEW.country_code = 'ru' ) THEN
INSERT INTO country_log_ru VALUES (NEW.*);
ELSIF ( NEW.country_code = 'sa' ) THEN
INSERT INTO country_log_sa VALUES (NEW.*);
ELSE
RAISE EXCEPTION 'Country unknown';
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
5. The next step is to create a trigger and attach the trigger function to the master
table:
CREATE TRIGGER country_insert
BEFORE INSERT ON country_log
FOR EACH ROW EXECUTE PROCEDURE country_insert_trig();
6. The next step is to insert data into the master table:
postgres=# INSERT INTO country_log (country_code, content)
VALUES ('ru', 'content-ru');
postgres=# INSERT INTO country_log (country_code, content)
VALUES ('sa', 'content-sa');
7. The final step is to select the data from both the master table and the child table to
confirm partitioning of data in child tables:
postgres=# SELECT * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:06.123189-08 | ru           | content-ru
 2014-11-30 12:10:14.22666-08  | sa           | content-sa
(2 rows)
postgres=# select * from country_log_ru;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:06.123189-08 | ru           | content-ru
(1 row)
postgres=# select * from country_log_sa;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:14.22666-08  | sa           | content-sa
(1 row)
How it works…
The following is the detailed explanation for the steps carried out in the previous section:
1. PostgreSQL basically supports partitioning via table inheritance. Hence
partitioning is set up in such a way that every child table inherits from the parent
table. It is for this purpose that we created two child tables, that is,
country_log_ru and country_log_sa in step 2 of the previous section. These
child tables inherit from the parent or the master table country_log by using the
keyword INHERITS against the master table country_log for the CREATE
TABLE DDL statement for both the child tables. This was the initial setup.
2. The next step in our scenario was to build partitioning in such a way that logs by
country are stored in a country-specific table. The case scenario that we used in
the previous section was to ensure that all logs for Russia go in the
country_log_ru table and all logs for South Africa go in the country_log_sa
table. To achieve this objective we defined a trigger function
country_insert_trig, which helps partition the data into a country-specific
table whenever an INSERT statement is triggered on the country_log master
table. The moment an INSERT statement is executed on the country_log master
table, the country_insert trigger fires, which in turn calls the
country_insert_trig() trigger function. The country_insert_trig()
trigger function checks the inserted records: if it finds records for Russia (checked
by the NEW.country_code = 'ru' condition ) in the country_log table then it
inserts the said record in the country_log_ru child table; and if the inserted
record in the country_log master table is for South Africa (NEW.country_code
= 'sa') then it logs the same record in the country_log_sa child table. This
way the trigger function partitions the data. The following is the section of code
in the country_insert_trig() trigger function that uses the logic defined in
the IF condition to partition the data into the child tables:
IF ( NEW.country_code = 'ru' ) THEN
INSERT INTO country_log_ru VALUES (NEW.*);
ELSIF ( NEW.country_code = 'sa' ) THEN
INSERT INTO country_log_sa VALUES (NEW.*);
ELSE
RAISE EXCEPTION 'Country unknown';
END IF;
3. Once the data has been partitioned into the child tables, the final step was to
verify the data by comparing the records from the child tables and the master
table, as shown in step 7 of the previous section.
There were initially two records inserted in the country_log master table. This can be
confirmed by running the SELECT query against the country_log table. In the following
snippet, we can see two log records in the country_log table, one for Russia identified by
the country code ru, and one for South Africa identified by the country code sa:
postgres=# SELECT * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:06.123189-08 | ru           | content-ru
 2014-11-30 12:10:14.22666-08  | sa           | content-sa
(2 rows)
The next step is to run the SELECT queries against the respective child tables,
country_log_ru and country_log_sa:
postgres=# select * from country_log_ru;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:06.123189-08 | ru           | content-ru
(1 row)
postgres=# select * from country_log_sa;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:14.22666-08  | sa           | content-sa
(1 row)
From the preceding output, we can see that there is only a record in each country_log_ru
and country_log_sa child table. In effect the country_insert_trig() trigger function
has partitioned the log data in country-specific tables. Entries for the country_code ru,
that is, for Russia go into the country_log_ru table, and entries for the country_code
sa, that is, South Africa go into the country_log_sa child table.
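To confirm that the planner skips the irrelevant child tables for country-specific queries, you
can inspect the plan with constraint exclusion enabled. This is a minimal sketch; the exact
plan text varies by PostgreSQL version and data volume, but only country_log and
country_log_ru should appear in it, since the other child table is ruled out by its CHECK
constraint:
postgres=# SET constraint_exclusion = partition;
SET
postgres=# EXPLAIN SELECT * FROM country_log WHERE country_code = 'ru';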
You may refer to the following links for more detailed explanations on
implementing partitioning:
https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning and http://www.postgresql.org/docs/9.6/static/ddl-partitioning.html.
Managing partitions
In this recipe, we are going to show how the partitioning scheme remains intact when an
existing partition is dropped or a new partition is added.
Getting ready
Refer to the first recipe, Implementing partitioning, before reading the steps outlined in this
recipe.
How to do it…
There are two scenarios here. One is what happens when an existing partition is deleted
and the other is what happens when a new partition is added. Let's discuss both cases:
1. In the first scenario we will drop an existing partition table. Here it is the
country_log_sa table that will be dropped:
Before dropping the country_log_sa child table, first see the records in the
country_log master table:
postgres=# SELECT * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:06.123189-08 | ru           | content-ru
 2014-11-30 12:10:14.22666-08  | sa           | content-sa
(2 rows)
The next step would be to drop the country_log_sa child table:
postgres=# drop table country_log_sa;
DROP TABLE
As a final step we would again check the data in the country_log master
table once the country_log_sa child table is dropped:
postgres=# select * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 14:41:40.742878-08 | ru           | content-ru
2. In the second scenario we are going to add a partition. Let's add a new partition,
country_log_default. The idea behind this partition is that records whose
country codes are not explicitly handled by the trigger function should go into a
default partition table:
Before we create the child table, let's see the existing records in the
country_log master table:
postgres=# select * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 14:41:40.742878-08 | ru           | content-ru
The next step is to re-create the country_log_sa child table, which was
dropped in the first scenario, and create an index on it:
postgres=# CREATE TABLE country_log_sa ( ) INHERITS (country_log);
CREATE TABLE
postgres=# CREATE INDEX country_code_sa_idx ON country_log_sa USING btree (country_code);
CREATE INDEX
Also, let's create a partition table called country_log_default to hold the
rest of the records:
postgres=# CREATE TABLE country_log_default ( ) INHERITS (country_log);
CREATE TABLE
Now let's modify our existing trigger function so that log records for countries
whose country codes are not explicitly handled go into this default table:
CREATE OR REPLACE FUNCTION country_insert_trig() RETURNS
TRIGGER AS $$
BEGIN
IF ( NEW.country_code = 'ru' ) THEN
INSERT INTO country_log_ru VALUES (NEW.*);
ELSIF ( NEW.country_code = 'sa' ) THEN
INSERT INTO country_log_sa VALUES (NEW.*);
ELSE
INSERT INTO country_log_default VALUES (NEW.*);
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Now let's insert records into the master table:
postgres=# INSERT INTO country_log (country_code, content) VALUES ('dk', 'content-dk');
INSERT 0 0
postgres=# INSERT INTO country_log (country_code, content) VALUES ('us', 'content-us');
INSERT 0 0
Let's now check the newly created records in the country_log master table
and see if these records have been partitioned into the
country_log_default child table:
postgres=# select * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 14:41:40.742878-08 | ru           | content-ru
 2014-11-30 15:10:28.921124-08 | dk           | content-dk
 2014-11-30 15:10:42.97714-08  | us           | content-us
postgres=# select * from country_log_default;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 15:10:28.921124-08 | dk           | content-dk
 2014-11-30 15:10:42.97714-08  | us           | content-us
(2 rows)
How it works…
Let's first discuss the first scenario, where we are dropping the child partition table,
country_log_sa. Here is the code snippet that was shown in the previous section:
postgres=# SELECT * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 12:10:06.123189-08 | ru           | content-ru
 2014-11-30 12:10:14.22666-08  | sa           | content-sa
(2 rows)
postgres=# drop table country_log_sa;
DROP TABLE
postgres=# select * from country_log;
          created_at           | country_code |  content
-------------------------------+--------------+------------
 2014-11-30 14:41:40.742878-08 | ru           | content-ru
If you look at the sequence of events in the preceding output, you can see that once the
country_log_sa child table is dropped, the rows stored in it also disappear from the
country_log master table's results. This technique really helps when a large number of
records needs to be pruned: dropping the corresponding child table removes them all at
once. This behaviour is automatic and does not require DBA intervention, so the partition
structure and its data are easy to manage when an existing partition is dropped.
Similarly, data can be easily managed when a new partition is added. In step 2 of the How to
do it section we created a new child table, country_log_default, which inherits from the
country_log master table, and modified the existing trigger function,
country_insert_trig(), to add a conditional insert for this new partition. From then on,
whenever an INSERT statement on the country_log master table matches no explicit
country condition, the record is routed into the country_log_default child table, as can
be seen in step 2 of the How to do it section.
To manage partitions automatically, PostgreSQL offers an extension called pg_partman,
which will create the required partition tables on the parent table automatically. Refer to the
following URL for more information about this extension:
https://github.com/keithf4/pg_partman.
Installing PL/Proxy
PL/Proxy is a database partitioning system that is implemented as a PL language. PL/Proxy
makes it straightforward to split large independent tables among multiple nodes in a way
that almost allows unbounded scalability. PL/Proxy scaling works on both read and write
workloads. The main idea is that the proxy function will be set up with the same signature
as the remote function to be called, so only the destination information needs to be specified
inside the proxy function body.
In this recipe, we are going to show the steps required to install PL/Proxy.
How to do it…
The following are the steps to install PL/Proxy:
1. Go to the following website and download the latest tarball of PL/Proxy:
http://pgfoundry.org/projects/plproxy/.
2. Once the latest version of PL/Proxy is downloaded, then the next step is to
unpack the TAR archive:
tar xvfz plproxy-2.5.tar.gz
3. Once the TAR archive has been unpacked, the next step is to enter the newly
created directory and start the compilation process:
cd plproxy-2.5
make && make install
How it works…
Installing PL/Proxy is an easy task. We download the source code from the website link
given in step 1 of the preceding section; at the time of writing, the latest version of PL/Proxy
is 2.5, so that is the tarball we download. Once it is downloaded, we compile and build it,
which completes the installation of PL/Proxy. You may also install PL/Proxy from binary
packages if prebuilt packages are available for your operating system.
Partitioning with PL/Proxy
In this recipe, we are going to cover horizontal partitioning with PL/Proxy.
Getting ready
PL/Proxy needs to be installed on the host machine. Refer to the previous recipe for more
details on installing PL/Proxy.
How to do it…
The following are the steps that need to be carried out for horizontal partitioning using
PL/Proxy:
1. Create three new databases: one proxy database named nodes, and two
partitioned databases, nodes_0000 and nodes_0001, respectively:
postgres=# create database nodes;
postgres=# create database nodes_0000;
postgres=# create database nodes_0001;
2. Once you have created these databases the next step is to create a PL/Proxy
extension:
psql -d nodes
nodes=# create extension plproxy;
3. The next step is to create the plproxy schema in the proxy database nodes:
nodes=# create schema plproxy;
4. The next step is to execute the file plproxy--2.5.0.sql against the proxy database nodes:
cd /usr/pgsql-9.6/share/extension
psql -f plproxy--2.5.0.sql nodes
CREATE FUNCTION
CREATE LANGUAGE
CREATE FUNCTION
CREATE FOREIGN DATA WRAPPER
5. The next step is to configure PL/Proxy using the configuration functions on the
proxy database nodes:
psql -d nodes
CREATE OR REPLACE FUNCTION
plproxy.get_cluster_version(cluster_name text)
RETURNS int AS $$
BEGIN
IF cluster_name = 'nodes' THEN
RETURN 1;
END IF;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION
plproxy.get_cluster_partitions(cluster_name
text) RETURNS SETOF text AS $$
BEGIN
IF cluster_name = 'nodes' THEN
RETURN NEXT 'host=127.0.0.1 dbname=nodes_0000';
RETURN NEXT 'host=127.0.0.1 dbname=nodes_0001';
RETURN;
END IF;
RAISE EXCEPTION 'no such cluster: %', cluster_name;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
CREATE OR REPLACE FUNCTION plproxy.get_cluster_config
(cluster_name text,
out key text, out val text)
RETURNS SETOF record AS $$
BEGIN
RETURN;
END;
$$ LANGUAGE plpgsql;
6. The next step is to log in to the partitioned databases and create the users table in
both of them:
psql -d nodes_0000
nodes_0000=# CREATE TABLE users (username text PRIMARY KEY);
psql -d nodes_0001
nodes_0001=# CREATE TABLE users (username text PRIMARY KEY);
7. The next step is to create the following mentioned function, insert_user(),
which will be used to insert usernames in the users table:
psql -d nodes_0000
CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS
text AS $$
BEGIN
PERFORM 1 FROM users WHERE username = i_username;
IF NOT FOUND THEN
INSERT INTO users (username) VALUES (i_username);
RETURN 'user created';
ELSE
RETURN 'user already exists';
END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
psql -d nodes_0001
CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS
text AS $$
BEGIN
PERFORM 1 FROM users WHERE username = i_username;
IF NOT FOUND THEN
INSERT INTO users (username) VALUES (i_username);
RETURN 'user created';
ELSE
RETURN 'user already exists';
END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
8. The next step is to create a proxy function called insert_user() on the proxy
database nodes:
psql -d nodes
CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS
TEXT AS $$
CLUSTER 'nodes'; RUN ON hashtext(i_username);
$$ LANGUAGE plproxy;
9. Check the pg_hba.conf file; for this example, you will need to set the authentication method to trust and then restart the PostgreSQL service.
10. The next step is to fill the partitions by executing the following query on the
proxy database nodes:
nodes=# SELECT insert_user('user_number_'||generate_series::text)
FROM generate_series(1,10000);
11. Once the data has been inserted we are going to verify the corresponding records
in the partitioned databases, nodes_0000 and nodes_0001:
nodes_0000=# select count(*) from users;
 count
-------
  5106
(1 row)

nodes_0001=# select count(*) from users;
 count
-------
  4894
(1 row)
How it works…
An explanation of the preceding steps and output is given here:
1. Initially we created three databases, one as a proxy database named nodes and
two other databases named nodes_0000 and nodes_0001 across which the data
will be partitioned.
2. The next step was to create the plproxy extension.
3. As can be seen in step 5 in the preceding section, we are configuring PL/Proxy
using the configuration functions on the proxy database nodes. The function
plproxy.get_cluster_partitions() is invoked when a query needs to be
forwarded to a remote database and is used by PL/Proxy to get the connection
string to use for each partition. We also use the
plproxy.get_cluster_version() function, which is called on each request and is used to determine whether a cached result from plproxy.get_cluster_partitions() can be reused. Finally, the plproxy.get_cluster_config() function enables us to configure the behavior of PL/Proxy.
4. Once we are done with defining the configuration functions on the proxy
database nodes, the next step was to create the table users in both the partitioned
databases across which the data will be partitioned.
5. In the next step, we created an insert_user() function that will be used to
insert usernames into the users table. The insert_user() function will be
defined on both the partitioned databases nodes_0000 and nodes_0001. This is
shown in step 7 of the preceding section.
6. In the next step, we created an insert_user() proxy function inside the proxy
database nodes. The proxy function will be used to send the INSERT to the
appropriate partition. This is shown in step 8 of the preceding section.
7. Finally, we filled the partitions with generated data by executing the insert_user() proxy function on the proxy database named nodes, as seen in step 10 of the preceding section.
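As a further illustration (not part of the original example), a read-side proxy function can be written in the same way; the get_user() name below is hypothetical, and it runs on the same partition that insert_user() chose because both use hashtext(i_username):
-- Hypothetical read-side proxy function, defined on the proxy database nodes
CREATE OR REPLACE FUNCTION get_user(i_username text) RETURNS SETOF text AS $$
    CLUSTER 'nodes';
    RUN ON hashtext(i_username);
    SELECT username FROM users WHERE username = i_username;
$$ LANGUAGE plproxy;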
For more detailed information on PL/Proxy, refer to the following web
link:
http://plproxy.projects.pgfoundry.org/doc/tutorial.html.
6
High Availability and Replication
In this chapter, we will cover the following:
Setting up hot streaming replication
Replication using Slony
Replication using Londiste
Replication using Bucardo
Replication using DRBD
Setting up a Postgres-XL Cluster
Introduction
Fault tolerance, 24x7 availability, and redundancy are among the most important requirements for any production database. It is for this purpose that a variety of high availability and replication solutions are available for PostgreSQL.
From a business perspective, it is important to ensure 24x7 data availability in the event of a disaster or a database crash caused by disk or hardware failure. In such situations, it becomes critical to have a duplicate copy of the data on a different server or database, so that seamless failover can be achieved even when the primary server or database is unavailable.
In this chapter, we will talk about various high availability and replication solutions, including some popular third-party replication tools such as Slony, Londiste, and Bucardo. We will also discuss block-level replication using DRBD, and finally, we will set up a Postgres-XL cluster, a horizontally scalable PostgreSQL cluster.
Setting up hot streaming replication
In this recipe, we are going to set up a master/slave streaming replication.
Getting ready
For this exercise, you need two Linux machines each with the latest version of PostgreSQL
9.6 installed. We will be using the following IP addresses for master and slave servers:
Master IP address: 192.168.0.4
Slave IP address: 192.168.0.5
How to do it…
The following steps show you how to set up master/slave streaming replication:
1. Set up password-less authentication between master and slave for the Postgres
user.
2. First, we are going to create a user ID on the master, which will be used by the
slave server to connect to the PostgreSQL database on the master server:
psql -c "CREATE USER repuser REPLICATION LOGIN ENCRYPTED
PASSWORD 'charlie';"
3. The next step is to allow the replication user created in the previous step to access the master PostgreSQL server. This is done by making the necessary changes in the pg_hba.conf file:
vi pg_hba.conf
host    replication    repuser    192.168.0.5/32    md5
4. In the next step, we are going to configure parameters in the postgresql.conf
file. These parameters are required to be set in order to get the streaming
replication working:
vi /var/lib/pgsql/9.6/data/postgresql.conf
listen_addresses = '*'
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 8
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/archive/%f && scp %p postgres@192.168.0.5:/var/lib/pgsql/archive/%f'
5. Once the parameter changes have been made in the postgresql.conf file in the
previous step, the next step is to restart the PostgreSQL server on the master
server in order to get the changes made in the previous step to come into effect:
pg_ctl -D /var/lib/pgsql/9.6/data restart
6. Before the slave can replicate the master, we need to give it the initial database to
build off. For this purpose, we will make a base backup by copying the primary
server's data directory to the standby:
psql -U postgres -h 192.168.0.4 -c "SELECT
pg_start_backup('label', true)"
rsync -a /var/lib/pgsql/9.6/data/
192.168.0.5:/var/lib/pgsql/9.6/data/
--exclude postmaster.pid
psql -U postgres -h 192.168.0.4 -c "SELECT pg_stop_backup()"
7. Once the data directory in the previous step is populated, the next step is to
configure the following parameters in the postgresql.conf file on the slave
server:
hot_standby = on
8. The next step is to copy the recovery.conf.sample in the $PGDATA location on
the slave server and then configure the following parameters:
cp /usr/pgsql-9.6/share/recovery.conf.sample
/var/lib/pgsql/9.6/data/recovery.conf
standby_mode = on
primary_conninfo = 'host=192.168.0.4 port=5432 user=repuser
password=charlie'
trigger_file = '/tmp/trigger.replication'
restore_command = 'cp /var/lib/pgsql/archive/%f "%p"'
9. The next step is to start the slave server:
service postgresql-9.6 start
10. Now that the preceding replication steps are complete, we will test the replication.
On the master server, log in and issue the following SQL commands:
psql -h 192.168.0.4 -d postgres -U postgres -W
postgres=# create database test;
postgres=# \c test;
test=# create table testtable ( testint int, testchar
varchar(40) );
CREATE TABLE
test=# insert into testtable values ( 1, 'What A Sight.' );
INSERT 0 1
11. On the slave server, we will now check if the newly created database and the
corresponding table in the previous step are replicated:
psql -h 192.168.0.5 -d test -U postgres -W
test=# select * from testtable;
 testint |   testchar
---------+----------------
       1 | What A Sight.
(1 row)
12. The wal_keep_segments parameter determines how many WAL files are retained in the master's pg_xlog directory to cover network delays. If you do not want to guess a value for this, you can instead create a replication slot, which ensures that the master does not remove WAL files from pg_xlog until they have been received by the standby; a sketch of this follows.
For more information, visit the following link: https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS.
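Here is a minimal sketch, assuming the slot name standby_slot and that max_replication_slots is greater than zero on the master:
-- On the master: create a physical replication slot (available from PostgreSQL 9.4)
SELECT * FROM pg_create_physical_replication_slot('standby_slot');
-- Check that the slot exists and how much WAL is being retained for it
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
On the slave, the slot is then referenced from recovery.conf:
primary_slot_name = 'standby_slot'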
How it works…
The following is an explanation of the steps performed in the preceding section:
In step 2 of the preceding section, we created a user called repuser, which the slave server will use to connect to the primary server. In step 3, we made the necessary changes in the pg_hba.conf file to allow the slave server to access the master server using the repuser user created in step 2.
We then made the necessary parameter changes on the master in step 4 of the
preceding section for configuring streaming replication. The following is a
description for these parameters:
listen_addresses: This parameter specifies the IP addresses on which PostgreSQL listens. A value of * indicates all available IP addresses.
wal_level: This parameter determines the level of WAL logging. Specify hot_standby for streaming replication.
wal_keep_segments: This parameter specifies the number of 16 MB WAL files to retain in the pg_xlog directory. As a rule of thumb, more files may be needed if the standby can fall behind, for example during heavy write activity or network delays.
archive_mode: Setting this parameter enables completed WAL segments to be sent to archive storage.
archive_command: This parameter is a shell command that is executed whenever a WAL segment is completed. In our case, we copy the file to the local archive directory and then use scp to send it across to the slave.
max_wal_senders: This parameter specifies the total number of
concurrent connections allowed from the slave servers.
Once the necessary configuration changes have been made on the master server,
we then restart the PostgreSQL server on the master in order to get the new
configuration changes to come into effect. This is done in step 5 of the preceding
section.
In step 6 of the preceding section, we were basically building the slave by
copying the primary's data directory to the slave.
Now, with the data directory available on the slave, the next step is to configure it. We make the necessary replication-related parameter change in the postgresql.conf file on the slave server. We set the following parameter on the slave:
hot_standby: This parameter determines whether we can connect and run queries while the server is in archive recovery or standby mode.
In the next step, we are configuring the recovery.conf file. This is required to
be set up so that the slave can start receiving logs from the master. The following
parameters are configured in the recovery.conf file on the slave:
standby_mode: This parameter, when enabled, causes PostgreSQL to work as a standby in a replication configuration.
primary_conninfo: This parameter specifies the connection
information used by the slave to connect to the master. For our
scenario, our master server is set as 192.168.0.4 on port 5432
and we are using the userid repuser with password charlie to
make a connection to the master. Remember that the repuser was
the userid that was created in the initial step of the preceding
section for this purpose, that is, connecting to the master from the
slave.
trigger_file: When the slave is configured as a standby, it continues to restore XLOG records from the master. The trigger_file parameter specifies the file whose presence triggers the standby to end recovery and take over as the primary server; a promotion sketch follows.
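The following is a minimal sketch of a manual promotion using the trigger file configured earlier; the path matches the recovery.conf example in the How to do it section:
# On the slave: creating the trigger file ends recovery and promotes the standby
touch /tmp/trigger.replication
# Verify that the server is no longer in recovery; this should now return 'f'
psql -U postgres -c "SELECT pg_is_in_recovery();"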
At this stage, the slave has now been fully configured and we then start the slave
server and then the replication process begins.
In steps 10 and 11 of the preceding section, we simply test our replication. We first create a database called test, log in to it, create a table named testtable, and insert some records into it. Our purpose is to see whether these changes are replicated to the slave. To test this, we log in to the test database on the slave and query the records from the testtable table, as seen in step 11 of the preceding section. The final result is that all the records changed/inserted on the primary are visible on the slave. This completes our streaming replication setup and configuration.
Replication using Slony
In this recipe, we are going to set up replication using Slony, which is a widely used replication engine. It replicates a desired set of tables from one database to another. This replication approach is based on event triggers created on the source tables, which log the data changes into Slony catalog tables. Using Slony, we can also set up cascading replication among multiple nodes.
Getting ready
The steps followed in this recipe are carried out on a CentOS version 6 machine.
We first need to install Slony.
The following are the steps needed to install Slony:
1. First, go to the following web link and download the given software: http://slony.info/downloads/2.2/source/
2. Once you have downloaded the software, the next step is to unzip the tarball and
then go to the newly created directory:
tar xvfj slony1-2.2.3.tar.bz2
cd slony1-2.2.3
3. In the next step, we are going to configure, compile, and build the software:
./configure --with-pgconfigdir=/usr/pgsql-9.6/bin/
make
make install
How to do it…
The following are the sequence of steps required to replicate data between two tables using
Slony replication:
1. First, start the PostgreSQL server if it is not already started:
pg_ctl -D $PGDATA start
2. In the next step, we will create two databases, test1 and test2, which will be used as the source and target databases:
createdb test1
createdb test2
3. In the next step, we will create the table t_test on the source database, test1,
and will insert some records into it:
psql -d test1
test1=# create table t_test (id numeric primary key, name
varchar);
test1=# insert into t_test values(1,'A'),(2,'B'), (3,'C');
4. We will now set up the target database by copying the table definitions from the
source database, test1:
pg_dump -s -p 5432 -h localhost test1 | psql -h localhost -p
5432 test2
5. We will now connect to the target database, test2, and verify that there is no
data in the tables of the test2 database:
psql -d test2
test2=# select * from t_test;
6. We will now set up a slonik script for master/slave, that is, the source/target
setup:
vi init_master.slonik
#! /bin/slonik
cluster name = mycluster;
node 1 admin conninfo = 'dbname=test1 host=localhost
port=5432 user=postgres password=postgres';
node 2 admin conninfo = 'dbname=test2 host=localhost
port=5432 user=postgres password=postgres';
init cluster ( id=1);
create set (id=1, origin=1);
set add table(set id=1, origin=1, id=1, fully qualified
name = 'public.t_test');
store node (id=2, event node = 1);
store path (server=1, client=2, conninfo='dbname=test1
host=localhost port=5432 user=postgres password=postgres');
store path (server=2, client=1, conninfo='dbname=test2
host=localhost port=5432 user=postgres password=postgres');
store listen (origin=1, provider = 1, receiver = 2);
store listen (origin=2, provider = 2, receiver = 1);
7. We will now create a slonik script for subscription on the slave, that is, the target:
vi init_slave.slonik
#! /bin/slonik
cluster name = mycluster;
node 1 admin conninfo = 'dbname=test1 host=localhost
port=5432 user=postgres password=postgres';
node 2 admin conninfo = 'dbname=test2 host=localhost
port=5432 user=postgres password=postgres';
subscribe set ( id = 1, provider = 1, receiver = 2, forward
= no);
8. We will now run the init_master.slonik script created in step 6 and will run
this on the master:
cd /usr/pgsql-9.6/bin
slonik init_master.slonik
9. We will now run the init_slave.slonik script created in step 7 and will run
this on the slave, that is, the target:
cd /usr/pgsql-9.6/bin
slonik init_slave.slonik
10. In the next step, we will start the master slon daemon:
nohup slon mycluster "dbname=test1 host=localhost port=5432
user=postgres
password=postgres" &
11. In the next step, we will start the slave slon daemon:
nohup slon mycluster "dbname=test2 host=localhost port=5432
user=postgres
password=postgres" &
12. In the next step, we will connect to the master, that is, the source
database, test1, and insert some records in the t_test table:
psql -d test1
test1=# insert into t_test values (5,'E');
13. We will now test for replication by logging to the slave, that is, the target
database, test2, and see if the inserted records in the t_test table in the
previous step are visible:
psql -d test2
test2=# select * from t_test;
 id | name
----+------
  1 | A
  2 | B
  3 | C
  5 | E
(4 rows)
How it works…
We will now discuss the steps followed in the preceding section:
In step 1, we started the PostgreSQL server if it was not already running.
In step 2, we created two databases, namely test1 and test2, which will serve
as our source (master) and target (slave) databases.
In step 3, of the preceding section, we logged into the source database, test1,
and created a table t_test and inserted some records into the table.
In step 4, of the preceding section, we set up the target database, test2, by
copying the table definitions present in the source database and loading them
into the target database, test2, by using the pg_dump utility.
In step 5, of the preceding section, we logged into the target database, test2, and
verified that there were no records present in the table t_test because in step 5
we only extracted the table definitions into the test2 database from the test1
database.
In step 6, we set up a slonik script for the master/slave replication setup. In the
init_master.slonik file, we first defined the cluster name as mycluster. We
then defined the nodes in the cluster. Each node has a number associated with a connection string, which contains the database connection information. A node entry is defined for both the source and target databases. The store path commands are necessary so that each node knows how to communicate with the others.
In step 7, we set up a slonik script for subscription of the slave, that is, the target
database, test2. Once again the script contains information such as cluster name
and node entries, which are designated a unique number related to the connect
string information. It also contains a subscriber set.
In step 8, of the preceding section, we ran the init_master.slonik on the
master. Similarly in step 9, we ran the init_slave.slonik on the slave.
In step 10, of the preceding section, we started the master slon daemon. In step
11 of the preceding section we started the slave slon daemon.
Steps 12 and 13 of the preceding section are used to test the replication. In step 12, we log in to the source database, test1, and insert some records into the t_test table. To check whether the newly inserted records have been replicated to the target database, test2, we log in to the test2 database in step 13; the result set of the query confirms that the records changed/inserted in the t_test table in the test1 database are successfully replicated to the target database, test2.
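As an additional check (not shown in the preceding steps), Slony records replication progress in its catalog schema, which is named after the cluster with a leading underscore; a minimal sketch, assuming the cluster name mycluster used above:
-- On the origin database test1: inspect replication lag per subscriber
-- The _mycluster schema is created automatically by the slonik scripts
SELECT st_received, st_lag_num_events, st_lag_time FROM _mycluster.sl_status;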
You may refer to the following link for more information regarding Slony
replication:
http://slony.info/documentation/tutorial.html.
Replication using Londiste
In this recipe, we are going to show how to replicate data using Londiste, which is based on a queue that is populated by a set of producers and consumed by a set of consumers. It is similar to Slony replication in that it also needs to create a set of event triggers on the source tables.
Getting ready
For this setup, we are using the same CentOS Linux host to replicate data between two databases. This can also be set up using two separate Linux machines running on VMware, VirtualBox, or any other virtualization software. It is assumed that PostgreSQL version 9.6 is installed. We have used CentOS version 6 as the Linux operating system for this exercise.
To set up Londiste replication on the Linux machine, follow these steps:
1. Go to the following link and download the latest version of skytools 3.2, that is,
tarball skytools-3.2.tar.gz: http://pgfoundry.org/projects/skytools/
2. Extract the tar ball as follows:
tar -xvf skytools-3.2.tar.gz
3. Go to the new location and build and compile the software:
cd skytools-3.2
./configure --prefix=/var/lib/pgsql/9.6/Sky --with-pgconfig=/usr/pgsql-9.6/bin/pg_config
make
make install
4. Also set the environment variable, PYTHONPATH, as follows. Alternatively, you
can also set it in the .bash_profile script as well:
export PYTHONPATH=/opt/PostgreSQL/9.6/Sky/lib64/python2.6/site-packages/
How to do it…
We are going to follow these steps for setting up replication between two different
databases using Londiste:
1. First create the two databases between which replication is to occur:
createdb node1
createdb node2
2. Populate the node1 database with data using the pgbench utility:
pgbench -i -s 2 -F 80 node1
3. Add any primary key and foreign keys on the tables in the node1 database that
are needed for replication. Create the following SQL file and add the following
lines to it:
vi /tmp/prepare_pgbenchdb_for_londiste.sql
-- add primary key to history table
ALTER TABLE pgbench_history ADD COLUMN hid SERIAL PRIMARY KEY;
-- add foreign keys
ALTER TABLE pgbench_tellers ADD CONSTRAINT
pgbench_tellers_branches_fk
FOREIGN KEY(bid)
REFERENCES pgbench_branches;
ALTER TABLE pgbench_accounts ADD CONSTRAINT
pgbench_accounts_branches_fk
FOREIGN KEY(bid) REFERENCES pgbench_branches;
ALTER TABLE pgbench_history ADD CONSTRAINT
pgbench_history_branches_fk
FOREIGN KEY(bid) REFERENCES pgbench_branches;
ALTER TABLE pgbench_history ADD CONSTRAINT
pgbench_history_tellers_fk
FOREIGN KEY(tid) REFERENCES pgbench_tellers;
ALTER TABLE pgbench_history ADD CONSTRAINT
pgbench_history_accounts_fk
FOREIGN KEY(aid) REFERENCES pgbench_accounts;
4. We will now run the SQL file created in the previous step against the node1 database:
psql node1 -f /tmp/prepare_pgbenchdb_for_londiste.sql
5. We will now populate the node2 database with table definitions from the tables
in the node1 database:
pg_dump -s -t 'pgbench*' node1 > /tmp/tables.sql
psql -f /tmp/tables.sql node2
6. Now start the process of replication. We will first create the configuration file, londiste3.ini, with the following parameters to set up the root node for the source database, node1:
vi londiste3.ini
[londiste3]
job_name = first_table
db = dbname=node1
queue_name = replication_queue
logfile = /home/postgres/log/londiste.log
pidfile = /home/postgres/pid/londiste.pid
7. In the next step, we are going to use the configuration file, londiste3.ini, created in the previous step to set up the root node for the database node1, as follows:
[postgres@localhost bin]$ ./londiste3 londiste3.ini create-root
node1
dbname=node1
2016-12-09 18:54:34,723 2335 WARNING No host= in public connect
string,
bad idea
2016-12-09 18:54:35,210 2335 INFO plpgsql is installed
2016-12-09 18:54:35,217 2335 INFO pgq is installed
2016-12-09 18:54:35,225 2335 INFO pgq.get_batch_cursor is
installed
2016-12-09 18:54:35,227 2335 INFO pgq_ext is installed
2016-12-09 18:54:35,228 2335 INFO pgq_node is installed
2016-12-09 18:54:35,230 2335 INFO londiste is installed
2016-12-09 18:54:35,232 2335 INFO londiste.global_add_table is
installed
2016-12-09 18:54:35,281 2335 INFO Initializing node
2016-12-09 18:54:35,285 2335 INFO Location registered
2016-12-09 18:54:35,447 2335 INFO Node "node1" initialized for
queue
"replication_queue" with type "root"
2016-12-09 18:54:35,465 2335 INFO Done
8. We will now run the worker daemon for the root node:
[postgres@localhost bin]$ ./londiste3 londiste3.ini worker
2016-12-09 18:55:17,008 2342 INFO Consumer uptodate = 1
9. In the next step, we will now create a configuration file, slave.ini, to create a
leaf node for the target database, node2:
vi slave.ini
[londiste3]
job_name = first_table_slave
db = dbname=node2
queue_name = replication_queue
logfile = /home/postgres/log/londiste_slave.log
pidfile = /home/postgres/pid/londiste_slave.pid
10. We will now initialize the node in the target database:
./londiste3 slave.ini create-leaf node2 dbname=node2 --provider=dbname=node1
2016-12-09 18:57:22,769 2408 WARNING No host= in public connect
string, bad idea
2016-12-09 18:57:22,778 2408 INFO plpgsql is installed
2016-12-09 18:57:22,778 2408 INFO Installing pgq
2016-12-09 18:57:22,778 2408 INFO
Reading from
/var/lib/pgsql/9.6/Sky/share/skytools3/pgq.sql
2016-12-09 18:57:23,211 2408 INFO pgq.get_batch_cursor is
installed
2016-12-09 18:57:23,212 2408 INFO Installing pgq_ext
2016-12-09 18:57:23,213 2408 INFO
Reading from
/var/lib/pgsql/9.3/Sky/share/skytools3/pgq_ext.sql
2016-12-09 18:57:23,454 2408 INFO Installing pgq_node
2016-12-09 18:57:23,455 2408 INFO
Reading from
/var/lib/pgsql/9.3/Sky/share/skytools3/pgq_node.sql
2016-12-09 18:57:23,729 2408 INFO Installing londiste
2016-12-09 18:57:23,730 2408 INFO
Reading from
/var/lib/pgsql/9.3/Sky/share/skytools3/londiste.sql
2016-12-09 18:57:24,391 2408 INFO londiste.global_add_table is
installed
2016-12-09 18:57:24,575 2408 INFO Initializing node
2016-12-09 18:57:24,705 2408 INFO Location registered
2016-12-09 18:57:24,715 2408 INFO Location registered
2016-12-09 18:57:24,744 2408 INFO Subscriber registered: node2
2016-12-09 18:57:24,748 2408 INFO Location registered
2016-12-09 18:57:24,750 2408 INFO Location registered
2016-12-09 18:57:24,757 2408 INFO Node "node2" initialized for
queue
"replication_queue" with type "leaf"
2016-12-09 18:57:24,761 2408 INFO Done
11. We will now launch the worker daemon for the target database, that is, node2:
[postgres@localhost bin]$ ./londiste3 slave.ini worker
2016-12-09 18:58:53,411 2423 INFO Consumer uptodate = 1
12. We will now create the configuration file, that is, pgqd.ini, for the ticker
daemon:
vi pgqd.ini
[pgqd]
logfile = /home/postgres/log/pgqd.log
pidfile = /home/postgres/pid/pgqd.pid
13. Using the configuration file created in the previous step, we will now launch the
ticker daemon:
[postgres@localhost bin]$ ./pgqd pgqd.ini
2016-12-09 19:05:56.843 2542 LOG Starting pgqd 3.2
2016-12-09 19:05:56.844 2542 LOG auto-detecting dbs ...
2016-12-09 19:05:57.257 2542 LOG node1: pgq version ok: 3.2
2016-12-09 19:05:58.130 2542 LOG node2: pgq version ok: 3.2
14. We will now add all the tables to replication on the root node:
[postgres@localhost bin]$ ./londiste3 londiste3.ini add-table --all
2016-12-09 19:07:26,064 2614 INFO Table added:
public.pgbench_accounts
2016-12-09 19:07:26,161 2614 INFO Table added:
public.pgbench_branches
2016-12-09 19:07:26,238 2614 INFO Table added:
public.pgbench_history
2016-12-09 19:07:26,287 2614 INFO Table added:
public.pgbench_tellers
15. Similarly, we will add all the tables to replication on the leaf node:
[postgres@localhost bin]$ ./londiste3 slave.ini add-table --all
16. We will now generate some traffic on the source database, node1:
pgbench -T 10 -c 5 node1
17. We will now use the compare utility available with the londiste3 command to
check that the tables in both the nodes, that is, both the source database, node1,
and destination database, node2, have the same amount of data:
[postgres@localhost bin]$ ./londiste3 slave.ini compare
2016-12-09 19:26:16,421 2982 INFO Checking if node1 can be used for copy
2016-12-09 19:26:16,424 2982 INFO Node node1 seems good source, using it
2016-12-09 19:26:16,425 2982 INFO public.pgbench_accounts: Using node node1 as provider
2016-12-09 19:26:16,441 2982 INFO Provider: node1 (root)
2016-12-09 19:26:16,446 2982 INFO Locking public.pgbench_accounts
2016-12-09 19:26:16,447 2982 INFO Syncing public.pgbench_accounts
2016-12-09 19:26:18,975 2982 INFO Counting public.pgbench_accounts
2016-12-09 19:26:19,401 2982 INFO srcdb: 200000 rows, checksum=167607238449
2016-12-09 19:26:19,706 2982 INFO dstdb: 200000 rows, checksum=167607238449
2016-12-09 19:26:19,715 2982 INFO Checking if node1 can be used for copy
2016-12-09 19:26:19,716 2982 INFO Node node1 seems good source, using it
2016-12-09 19:26:19,716 2982 INFO public.pgbench_branches: Using node node1 as provider
2016-12-09 19:26:19,730 2982 INFO Provider: node1 (root)
2016-12-09 19:26:19,734 2982 INFO Locking public.pgbench_branches
2016-12-09 19:26:19,734 2982 INFO Syncing public.pgbench_branches
2016-12-09 19:26:22,772 2982 INFO Counting public.pgbench_branches
2016-12-09 19:26:22,804 2982 INFO srcdb: 2 rows, checksum=-3078609798
2016-12-09 19:26:22,812 2982 INFO dstdb: 2 rows, checksum=-3078609798
2016-12-09 19:26:22,866 2982 INFO Checking if node1 can be used for copy
2016-12-09 19:26:22,877 2982 INFO Node node1 seems good source, using it
2016-12-09 19:26:22,878 2982 INFO public.pgbench_history: Using node node1 as provider
2016-12-09 19:26:22,919 2982 INFO Provider: node1 (root)
2016-12-09 19:26:22,931 2982 INFO Locking public.pgbench_history
2016-12-09 19:26:22,932 2982 INFO Syncing public.pgbench_history
2016-12-09 19:26:25,963 2982 INFO Counting public.pgbench_history
2016-12-09 19:26:26,008 2982 INFO srcdb: 715 rows, checksum=9467587272
2016-12-09 19:26:26,020 2982 INFO dstdb: 715 rows, checksum=9467587272
2016-12-09 19:26:26,056 2982 INFO Checking if node1 can be used for copy
2016-12-09 19:26:26,063 2982 INFO Node node1 seems good source, using it
2016-12-09 19:26:26,064 2982 INFO public.pgbench_tellers: Using node node1 as provider
2016-12-09 19:26:26,100 2982 INFO Provider: node1 (root)
2016-12-09 19:26:26,108 2982 INFO Locking public.pgbench_tellers
2016-12-09 19:26:26,109 2982 INFO Syncing public.pgbench_tellers
2016-12-09 19:26:29,144 2982 INFO Counting public.pgbench_tellers
2016-12-09 19:26:29,176 2982 INFO srcdb: 20 rows, checksum=4814381032
2016-12-09 19:26:29,182 2982 INFO dstdb: 20 rows, checksum=4814381032
How it works…
The following is an explanation regarding the steps done in the preceding section:
Initially in step 1 of the preceding section we created two databases, that is,
node1 and node2, which are to be used as source and target databases
respectively from a replication perspective.
In step 2 of the preceding section, we populated the node1 database using the pgbench utility. In step 3, we defined the required primary key and foreign key relationships on the different tables and put these DDL commands in an SQL file. In step 4, we executed these DDL commands against the node1 database, thereby enforcing the primary key and foreign key definitions on the tables of the pgbench schema in the node1 database.
In step 5 of the preceding section, we extracted the table definitions from the
tables in the pgbench schema in the node1 database and loaded these definitions
in the node2 database.
We will now discuss steps 6-8 of the preceding section. In step 6, we created the
configuration file, which is then used in step 7 to create the root node for the
source database, node1. In step 8, we launched the worker daemon for the root
node. With regard to the entries in the configuration file in step 6, we first define a job name so that the process can be easily identified. We then define a connect string with the information needed to connect to the source database, node1, and the name of the replication queue involved. Finally, we define the location of the log and PID files.
We will now discuss steps 9-11 of the preceding section. In step 9, we defined the
configuration file, which is then used in step 10 to create the leaf node for the
target database, that is, node2. In step 11, we then launched the worker daemon
for the leaf node. The entries in the configuration file in step 9 contain the job name, the connect string to connect to the target database, node2, the name of the replication queue involved, and the location of the log and PID files. The key part here, in step 10, is telling the slave (the target database) where to find the master or provider, that is, the source database, node1.
We will now talk about steps 12-13 of the preceding section. In step 12, we
defined the ticker configuration with the help of which we launched the ticker
process in step 13. Once the ticker has started successfully, we have all the
components and processes set up and needed for the replication; however, we
have not yet defined what the system needs to replicate.
In steps 14 and 15 of the preceding section, we added the tables to the replication set on both the source and target databases, node1 and node2, respectively.
Finally, we will talk about steps 16 and 17 of the preceding section. At this stage, we are testing the replication that was set up between the source database, node1, and the target database, node2. In step 16, we generated some traffic on the source database, node1, by running pgbench with five parallel database connections for 10 seconds. In step 17, we checked that the tables on both the source and target databases contain the same data. For this purpose, we used the compare command on the provider and subscriber nodes, which counts and checksums the rows on both sides. The partial output below shows that the data has been successfully replicated for all the tables that are part of the replication setup, as the row counts and checksums match on the source and target databases:
2016-12-09 19:26:18,975 2982 INFO Counting public.pgbench_accounts
2016-12-09 19:26:19,401 2982 INFO srcdb: 200000 rows, checksum=167607238449
2016-12-09 19:26:19,706 2982 INFO dstdb: 200000 rows, checksum=167607238449
2016-12-09 19:26:22,772 2982 INFO Counting public.pgbench_branches
2016-12-09 19:26:22,804 2982 INFO srcdb: 2 rows, checksum=-3078609798
2016-12-09 19:26:22,812 2982 INFO dstdb: 2 rows, checksum=-3078609798
2016-12-09 19:26:25,963 2982 INFO Counting public.pgbench_history
2016-12-09 19:26:26,008 2982 INFO srcdb: 715 rows, checksum=9467587272
2016-12-09 19:26:26,020 2982 INFO dstdb: 715 rows, checksum=9467587272
2016-12-09 19:26:29,144 2982 INFO Counting public.pgbench_tellers
2016-12-09 19:26:29,176 2982 INFO srcdb: 20 rows, checksum=4814381032
2016-12-09 19:26:29,182 2982 INFO dstdb: 20 rows, checksum=4814381032
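In addition to compare, the londiste3 tool also provides a status command that prints the node tree and the lag of each node; a minimal sketch based on the configuration files used above (output not shown here):
# View from the root node's configuration
./londiste3 londiste3.ini status
# The same view from the leaf node's configuration
./londiste3 slave.ini status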
You may refer to the following links for more information regarding Londiste replication: https://wiki.postgresql.org/wiki/Londiste_Tutorial_(Skytools_2) and http://manojadinesh.blogspot.in/2012/11/skytoolslondiste-replication.html.
Replication using Bucardo
In this recipe, we are going to set up replication between two databases using Bucardo, another replication tool, which can also replicate data among multiple master nodes. As with the other two tools, it relies on triggers on the source tables.
Getting ready
This exercise is carried out in a Red Hat Linux machine.
Install the EPEL package for your Red Hat platform from the following URL: https://fedoraproject.org/wiki/EPEL.
Then install these RPMs with the following yum command:
yum install perl-DBI perl-DBD-Pg perl-DBIx-Safe
If it is not already installed, download the Postgresql repository from the following web
link: http://yum.pgrpms.org/repopackages.php
After this, install the following package. This is required because Bucardo is written in Perl:
yum install postgresql96-plperl
[ 122 ]
High Availability and Replication
To install Bucardo, download it from the following web link: http://bucardo.org/wiki/Bucardo
Extract the tarball and go to the newly downloaded location and compile and build the
software:
tar xvfz Bucardo-5.2.0.tar.gz
cd Bucardo-5.2.0
perl Makefile.PL
make
make install
How to do it…
The following are the steps to configure replication between two databases using Bucardo:
1. The first step is to install Bucardo, that is, create the main Bucardo database
containing the information that the Bucardo daemon will need:
[postgres@localhost ~]$ bucardo install --batch --quiet
Creating superuser 'bucardo'
2. In the next step, we create source and target databases, that is, gamma1 and
gamma2 respectively, between which the replication needs to be set up:
[postgres@localhost ~]$ psql -qc 'create database gamma1'
[postgres@localhost ~]$ psql -d gamma1 -qc 'create table t1 (id serial primary key, email text)'
[postgres@localhost ~]$ psql -qc 'create database gamma2 template gamma1'
3. In the next step, we inform Bucardo about the databases that will be involved in
the replication:
[postgres@localhost ~]$ bucardo add db db1 dbname=gamma1
Added database "db1"
[postgres@localhost ~]$ bucardo add db db2 dbname=gamma2
Added database "db2"
4. In the next step, we create a herd and include those tables from the source
database that will be part of the replication setup:
[postgres@localhost ~]$ bucardo add herd myherd t1
Created relgroup "myherd"
Added the following tables or sequences:
public.t1 (DB: db1)
The following tables or sequences are now part of the relgroup
"myherd":
public.t1
5. In the next step, we create a source sync:
[postgres@localhost ~]$ bucardo add sync beta herd=myherd
dbs=db1:source
Added sync "beta"
Created a new dbgroup named "beta"
6. In the next step, we create a target sync:
[postgres@localhost ~]$ bucardo add sync charlie herd=myherd
dbs=db1:source,db2:target
Added sync "charlie"
Created a new dbgroup named "charlie"
7. At this stage, we have the replication procedure set up, so the next step is to start
the Bucardo service:
[postgres@localhost ~]$ bucardo start
Checking for existing processes
Removing file "pid/fullstopbucardo"
Starting Bucardo
8. The next step is to test our replication setup. For this purpose, we are going to
insert some records in the t1 table on the source database, gamma1:
psql -d gamma1
gamma1=# insert into t1 values (1,'wallsingh@gmail.com');
INSERT 0 1
gamma1=# insert into t1 values (2,'neha.verma@gmail.com');
INSERT 0 1
9. Now that we have inserted some records in the source database in the previous
step, we need to check if these changes have been replicated to the target
database, gamma2:
psql -d gamma2
gamma2=# select * from t1;
 id |        email
----+----------------------
  1 | wallsingh@gmail.com
  2 | neha.verma@gmail.com
(2 rows)
How it works…
The following is the description of the steps from the preceding section:
In step 1 of the preceding section, we first created the bucardo database, which contains the information the Bucardo daemon needs, and also created a superuser named bucardo.
In step 2, of the preceding section, we created our source and target databases for
replication, that is, gamma1, and gamma2, respectively. We also created the table t1
on the gamma1 database, which will be used for replication.
In step 3 of the preceding section, we told Bucardo about the source and target databases, gamma1 and gamma2 respectively, which will be involved in the replication.
In step 4, of the preceding section, we created a herd by the name of myherd and
included the table t1 from the source database, gamma1, which will be a part of
the replication setup. Any changes made on this table should be replicated from
the source to the target database.
In steps 5 and 6, of the preceding section, we basically created a source and a
target sync that will replicate the table t1 in the herd, myherd, and replicate it
from the source database db1, that is, gamma1 to the target database, db2, that is,
gamma2.
With the replication set up configured, we started the Bucardo service in step 7 of
the preceding section.
We tested the replication setup in steps 8 and 9, of the preceding section.
In step 8, we inserted some records into the t1 table in the gamma1 database, and in step 9, we logged in to the gamma2 database and checked whether the newly inserted records in the t1 table of the gamma1 database had been replicated to the gamma2 database. The result set of the SELECT query on the t1 table in the gamma2 database confirms that the records inserted into the gamma1 database have been successfully replicated to the gamma2 database.
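To monitor a running sync (not shown in the preceding steps), Bucardo provides a status subcommand; a minimal sketch using the sync name charlie defined in step 6:
# Overall status of all syncs known to the Bucardo daemon
bucardo status
# Detailed status of the source-to-target sync created in step 6
bucardo status charlie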
You may refer to the following links for more information regarding Bucardo replication: http://blog.pscs.co.uk/postgresql-replication-and-bucardo/ and http://blog.endpoint.com/2014/06/bucardo-5-multimaster-postgres-released.html.
Replication using DRBD
In this recipe, we are going to cover block-level replication for PostgreSQL using DRBD.
Getting ready
Two working Linux machines are required for this setup, along with the network interfaces and a cluster IP. These steps are carried out on CentOS version 6 machines. Having covered the PostgreSQL setup in previous chapters, it is assumed that the necessary packages and prerequisites are already installed.
We will be using the following setup in our hierarchy:
Node1.author.org uses the LAN IP address 10.0.0.181 and uses
172.16.0.1 for crossover
Node2.author.org uses the LAN IP address 10.0.0.182 and uses the IP
address 172.16.0.2 for crossover
dbip.author.org uses the cluster IP address 10.0.0.180
How to do it…
The following are the steps for block level replication using DRBD:
1. First, temporarily disable SELinux by setting SELINUX to disabled in the configuration file, and then save the file:
vi /etc/selinux/config
SELINUX=disabled
2. In the next step, change the hostname and gateway for both the nodes, that is,
the network interfaces:
vi /etc/sysconfig/network
For node 1:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=node1.author.org
GATEWAY=10.0.0.2
For node 2:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=node2.author.org
GATEWAY=10.0.0.2
3. In the next step, we need to configure the network interfaces for the first node, node1.
We first configure the LAN interface for node1:
vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
IPADDR=10.0.0.181
NETMASK=255.255.255.0
ONBOOT=yes
HWADDR=a2:4e:7f:64:61:24
We then configure the crossover/DRBD interface for node1:
vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
IPADDR=172.16.0.1
NETMASK=255.255.255.0
ONBOOT=yes
HWADDR=ee:df:ff:4a:5f:68
4. In step 4, we then configure the network interfaces for the second node, that is,
node2:
vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
IPADDR=10.0.0.182
NETMASK=255.255.255.0
ONBOOT=yes
HWADDR=22:42:b1:5a:42:6f
We then configure the crossover/DRBD interface for node2:
vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
IPADDR=172.16.0.2
NETMASK=255.255.255.0
ONBOOT=yes
HWADDR=6a:48:d2:70:26:5e
5. The next step is to configure DNS:
vi /etc/resolv.conf
search author.org
nameserver 10.0.0.2
Also configure the basic hostname resolution:
vi /etc/hosts
127.0.0.1    localhost.localdomain localhost
10.0.0.181   node1.author.org   node1
10.0.0.182   node2.author.org   node2
10.0.0.180   dbip.author.org    dbip
6. In the next step, we will check the network connectivity between the nodes. First
we will ping node 2 from node 1 first through the LAN interface and then
through the crossover IP:
[root@node1 ~]# ping -c 2 node2
PING node2 (10.0.0.182) 56(84) bytes of data.
64 bytes from node2 (10.0.0.182): icmp_seq=1 ttl=64 time=0.089 ms
64 bytes from node2 (10.0.0.182): icmp_seq=2 ttl=64 time=0.082 ms
--- node2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.082/0.085/0.089/0.009 ms
[root@node1 ~]# ping -c 2 172.16.0.2
PING 172.16.0.2 (172.16.0.2) 56(84) bytes of data.
64 bytes from 172.16.0.2: icmp_seq=1 ttl=64 time=0.083 ms
64 bytes from 172.16.0.2: icmp_seq=2 ttl=64 time=0.083 ms
--- 172.16.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.083/0.083/0.083/0.000 ms
Now we will ping node1 from node2 first via the LAN interface and then
through the crossover IP:
[root@node2 ~]# ping -c 2 node1
PING node1 (10.0.0.181) 56(84) bytes of data.
64 bytes from node1 (10.0.0.181): icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from node1 (10.0.0.181): icmp_seq=2 ttl=64 time=0.063 ms
--- node1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.063/0.065/0.068/0.008 ms
Now we will ping node1 through the crossover interface:
[root@node2 ~]# ping -c 2 172.16.0.1
PING 172.16.0.1 (172.16.0.1) 56(84) bytes of data.
64 bytes from 172.16.0.1: icmp_seq=1 ttl=64 time=1.36 ms
64 bytes from 172.16.0.1: icmp_seq=2 ttl=64 time=0.075 ms
--- 172.16.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.075/0.722/1.369/0.647 ms
7. After configuring network connectivity, we next set the initialization options and
set the run level to 3:
vi /etc/inittab
id:3:initdefault:
8. Install the necessary packages:
yum install -y drbd83 kmod-drbd83
9. In the next step, configure DRBD on both the nodes:
vi /etc/drbd.conf
global {
    usage-count no;
}
common {
    syncer { rate 100M; }
    protocol C;
}
resource postgres {
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
    }
    disk { on-io-error detach; }
    on node1.author.org {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   172.16.0.1:7791;
        meta-disk internal;
    }
    on node2.author.org {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   172.16.0.2:7791;
        meta-disk internal;
    }
}
10. Once the drbd.conf file is set up on both nodes, we then create the metadata on the postgres resource. Execute the following step on both nodes:
[root@node1 ~]# drbdadm create-md postgres
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
[root@node2 ~]# drbdadm create-md postgres
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
11. In the next step, we will bring up the resource. Execute the following step on both
the nodes:
drbdadm up postgres
12. In the next step, we perform the initial sync between the nodes. This step should be done on the primary node; we set node1 as the primary node:
drbdadm -- --overwrite-data-of-peer primary postgres
13. To monitor the progress of the sync and the status of the DRBD resource, look at
the /proc/drbd file:
[root@node1 ~]# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by
mockbuild@builder10.centos.org, 2016-10-04 14:04:09
0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:48128 nr:0 dw:0 dr:48128 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:8340188
[>....................] sync'ed: 0.6% (8144/8188)M delay_probe: 7
finish: 0:11:29 speed: 12,032 (12,032) K/sec
14. Once the sync process completes, we can take a look at the status of the postgres resource on both nodes:
[root@node1 ~]# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by
mockbuild@builder10.centos.org, 2016-10-04 14:04:09
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:8388316 nr:0 dw:0 dr:8388316 al:0 bm:512 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
[root@node2 ~]# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by
mockbuild@builder10.centos.org, 2016-10-04 14:04:09
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:8388316 dw:8388316 dr:0 al:0 bm:512 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
15. In the next step, we are going to initiate DRBD services on both the nodes. Issue
the following command:
/etc/init.d/drbd start
16. In order to set up PostgreSQL on top of DRBD, we need to format and mount the DRBD device and then initialize the data directory. Issue the following commands on node1:
mkfs.ext3 /dev/drbd0
mount -t ext3 /dev/drbd0 /var/lib/pgsql/9.6
chown postgres.postgres /var/lib/pgsql/9.6
Next we log in as the Postgres user on node1 and initialize the database:
su - postgres
initdb /var/lib/pgsql/9.6/data
exit
17. In the next step, we enable trusted authentication and we will configure the
necessary parameters for setting up PostgreSQL in the postgresql.conf file.
On node1, execute the following:
echo "host all all 10.0.0.181/32 trust" >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo "host all all 10.0.0.182/32 trust" >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo "host all all 10.0.0.180/32 trust" >> /var/lib/pgsql/9.6/data/pg_hba.conf
Then we configure the necessary parameters in the postgresql.conf file:
vi /var/lib/pgsql/9.6/data/postgresql.conf
listen_addresses = '*'
18. Once the preceding parameters have been changed in the postgresql.conf file, the next step is to start PostgreSQL. Execute the following command on node1:
service postgresql-9.6 start
19. We will then create an admin user to manage PostgreSQL. On node 1, execute the
following command:
su - postgres
createuser --superuser admin --pwprompt
20. In the next step, we create a database and populate it with data. On node1
execute the following steps and then access the database:
su - postgres
createdb test
pgbench -i test
psql -U admin -d test
test=# select * from pgbench_tellers;
 tid | bid | tbalance | filler
-----+-----+----------+--------
   1 |   1 |        0 |
   2 |   1 |        0 |
   3 |   1 |        0 |
   4 |   1 |        0 |
   5 |   1 |        0 |
   6 |   1 |        0 |
   7 |   1 |        0 |
   8 |   1 |        0 |
   9 |   1 |        0 |
  10 |   1 |        0 |
(10 rows)
21. In the next step, we will test the block-level replication and see if PostgreSQL will
work on node2.
On node 1, execute the following commands:
1. We will first stop postgresql on node1:
service postgresql-9.6 stop
2. Then we will unmount the DRBD device on node1:
umount /dev/drbd0
3. Now we will set up node1 as the secondary node:
drbdadm secondary postgres
4. Now we will configure node2 as the primary node:
drbdadm primary postgres
5. In the next step, mount the DRBD device:
mount -t ext3 /dev/drbd0 /var/lib/pgsql/9.6
6. Then we start the postgresql service on node2:
service postgresql-9.6 start
7. Now we will see if we are able to access the test database on node2:
psql -U admin -d test
test=# select * from pgbench_tellers;
 tid | bid | tbalance | filler
-----+-----+----------+--------
   1 |   1 |        0 |
   2 |   1 |        0 |
   3 |   1 |        0 |
   4 |   1 |        0 |
   5 |   1 |        0 |
   6 |   1 |        0 |
   7 |   1 |        0 |
   8 |   1 |        0 |
   9 |   1 |        0 |
  10 |   1 |        0 |
(10 rows)
How it works…
In the initial steps from steps 1 to 6, we configured the nodes, that is, node1 and node2, we
set up network connectivity, and also configured DNS. In step 6, we did the network
connectivity test between node1 and node2 on the LAN interface as well as on the
crossover interface. We received successful echo response messages after doing the ping
request tests. This shows that the network connectivity is successfully configured.
In step 9, we set up the drbd.conf file on both the nodes. Here is an extract from the
drbd.conf file:
global {
    usage-count no;
}
common {
    syncer { rate 100M; }
    protocol C;
}
resource postgres {
    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
    }
    disk { on-io-error detach; }
    on node1.author.org {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   172.16.0.1:7791;
        meta-disk internal;
    }
    on node2.author.org {
        device    /dev/drbd0;
        disk      /dev/sdb;
        address   172.16.0.2:7791;
        meta-disk internal;
    }
}
In the preceding configuration, we define a DRBD resource named postgres and configure a DRBD device, /dev/drbd0, which is set up on the two nodes, node1 and node2. This is what makes the block-level replication work. As we can see in step 12 of the preceding section, node1 is initially set up as the primary node while node2 serves as the secondary node at that stage. We then set up PostgreSQL on node1 from step 16 onwards. From step 21 onwards, we perform failover testing: we first stop PostgreSQL on node1, unmount the filesystem, and demote node1 to secondary; we then promote node2 to primary, mount the filesystem, and bring up the PostgreSQL server there. After that, we test for record visibility on node2. The test database created in step 20 of the preceding section is accessible on node2, and so are the tables of the pgbench schema queried in step 21. Thus, DRBD provides block-level replication, and if one of the nodes becomes unavailable, we can continue to run PostgreSQL on the secondary node, which takes over the role of the primary; a sketch of failing back to node1 follows.
Setting up a Postgres-XL cluster
In this recipe, we are going to set up a Postgres-XL cluster, which is a scalable
PostgreSQL cluster designed to address big data processing challenges.
Getting ready
Postgres-XL is a horizontally scalable open source cluster, which has a set of individual
dedicated components to scale PostgreSQL to multiple nodes. Postgres-XL (eXtensible Lattice)
gives us the flexibility to either distribute the data across the nodes or replicate the data
across the nodes.
How to do it…
Note the node-to-host mapping used in this recipe:
node1 -- ec2-54-164-174-117.compute-1.amazonaws.com -- 172-30-1-154
node2 -- ec2-54-152-235-194.compute-1.amazonaws.com -- 172-30-1-155
node3 -- ec2-54-165-7-151.compute-1.amazonaws.com -- 172-30-1-156
node4 -- ec2-54-152-210-130.compute-1.amazonaws.com -- 172-30-1-157
1. Create the PostgresXL user account and group:
sudo groupadd pgxl
sudo useradd -g pgxl pgxl
2. Install the RPMs on all four servers. Copy the following RPMs to each server and
make sure you are in the same directory as the packages before you run
the following yum command:
yum install postgres-xl92-libs-9.2-34.1.x86_64.rpm \
postgres-xl92-9.2-34.1.x86_64.rpm \
postgres-xl92-server-9.2-34.1.x86_64.rpm \
postgres-xl92-contrib-9.2-34.1.x86_64.rpm \
postgres-xl92-gtm-9.2-34.1.x86_64.rpm
3. Verify installation. Check if it is installed properly on each server:
[centos@ip-172-30-1-157 ~]$ cd /usr/postgres-xl-9.2/
[centos@ip-172-30-1-157 postgres-xl-9.2]$ ls -ltrh
total 12K
drwxr-xr-x. 2 pgxl pgxl 4.0K Dec 3 05:32 lib
drwxr-xr-x. 7 pgxl pgxl 4.0K Dec 3 05:32 share
drwxr-xr-x. 2 root root 4.0K Dec 3 05:32 bin
4. Set up the environment variables in bash. Make sure you add the following lines
to the .bash_profile of the pgxl user on each server:
export PATH=/usr/postgres-xl-9.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/postgres-xl-9.2/lib:$LD_LIBRARY_PATH
5. GTM and GTM-Proxy initialization (login as a pgxl user). Decide where to run
the GTM server. Let's choose the
ec2-54-164-174-117.compute-1.amazonaws.com server as the GTM.
Steps to start GTM:
mkdir data_gtm
initgtm -Z gtm -D /home/pgxl/data_gtm/
--- Modify the config:
[pgxl@ip-172-30-1-154 data_gtm]$ grep -v '^#' gtm.conf
nodename = 'gtm_node' # Specifies the node name.
# (changes requires restart)
listen_addresses = '*' # Listen addresses of this GTM.
# (changes requires restart)
port = 6666 # Port number of this GTM.
# (changes requires restart)
Start using:
[pgxl@ip-172-30-1-154 data_gtm]$ gtm -D /home/pgxl/data_gtm/ &
[1] 5938
[pgxl@ip-172-30-1-154 data_gtm]$
[pgxl@ip-172-30-1-154 data_gtm]$
[pgxl@ip-172-30-1-154 data_gtm]$ ps -ef|grep gtm
pgxl 5938 5909 0 06:23 pts/2 00:00:00 gtm -D
/home/pgxl/data_gtm/
pgxl 5940 5909 0 06:23 pts/2 00:00:00 grep gtm
[pgxl@ip-172-30-1-154 data_gtm]$
[pgxl@ip-172-30-1-154 data_gtm]$ netstat -apn |grep 6666
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:6666 0.0.0.0:* LISTEN 5938/gtm
tcp 0 0 :::6666 :::* LISTEN 5938/gtm
[pgxl@ip-172-30-1-154 data_gtm]$
[pgxl@ip-172-30-1-154 data_gtm]$
Steps to start gtm-proxy:
mkdir data_gtm_proxy
initgtm -Z gtm_proxy -D /home/pgxl/data_gtm_proxy
--- Modify config:
[pgxl@ip-172-30-1-154 ~]$ grep -v "^#" data_gtm_proxy/gtm_proxy.conf
nodename = 'gtm_proxy_node'  # Specifies the node name. (changes requires restart)
listen_addresses = '*'       # Listen addresses of this GTM. (changes requires restart)
port = 6667                  # Port number of this GTM.
gtm_host = 'localhost'       # Listen address of the active GTM. (changes requires restart)
gtm_port = 6666              # Port number of the active GTM. (changes requires restart)
Start the gtm_proxy:
gtm_proxy -D /home/pgxl/data_gtm_proxy/ &
[pgxl@ip-172-30-1-154 ~]$ ps -ef|grep gtm
pgxl 7026 6952 0 16:58 pts/0 00:00:00 gtm -D
/home/pgxl/data_gtm/
pgxl 7052 6952 0 17:05 pts/0 00:00:00 gtm_proxy -D
/home/pgxl/data_gtm_proxy/
pgxl 7059 6952 0 17:05 pts/0 00:00:00 grep gtm
6. Create and configure a data node on each of the servers.
On node1 (which is GTM):
mkdir data_node1
initdb -D /home/pgxl/data_node1/ --nodename datanode1
In the datanode1/postgresql.conf file:
listen_addresses='*'
port = 5432
gtm_host = '172.30.1.154'
gtm_port=6666
On node2:
mkdir data_node2
initdb -D /home/pgxl/data_node2/ --nodename datanode2
In the datanode2/postgresql.conf file:
listen_addresses='*'
port = 5432
gtm_host = '172.30.1.154'
gtm_port=6666
On node3:
mkdir data_node3
initdb -D /home/pgxl/data_node3/ --nodename datanode3
In the datanode3/postgresql.conf file:
listen_addresses='*'
port = 5432
gtm_host = '172.30.1.154'
gtm_port=6666
On node4:
mkdir data_node4
initdb -D /home/pgxl/data_node4/ --nodename datanode4
In the datanode4/postgresql.conf file:
listen_addresses='*'
port = 5432
gtm_host = '172.30.1.154'
gtm_port=6666
7. Create and configure coordinators on two nodes: in this case,
ec2-54-152-235-194.compute-1.amazonaws.com (node2) and
ec2-54-165-7-151.compute-1.amazonaws.com (node3).
On node2:
mkdir coord1
initdb -D /home/pgxl/coord1/ --nodename coord1
In coord1/postgresql.conf file:
listen_addresses='*'
port = 5433
gtm_host = '172.30.1.154'
gtm_port=6666
On node3:
mkdir coord2
initdb -D /home/pgxl/coord2/ --nodename coord2
In the coord2/postgresql.conf file:
listen_addresses='*'
port = 5433
gtm_host = '172.30.1.154'
gtm_port=6666
8. Configure pg_hba.conf. Add the following entries to the pg_hba.conf file of all
the data nodes and coordinators:
host    all    all    172.30.1.154/32    trust
host    all    all    172.30.1.155/32    trust
host    all    all    172.30.1.156/32    trust
host    all    all    172.30.1.157/32    trust
9. Start the data nodes and coordinators.
On node1:
postgres --datanode -D /home/pgxl/data_node1 &
On node2:
postgres --datanode -D /home/pgxl/data_node2 &
postgres --coordinator -D /home/pgxl/coord1 &
On node3:
postgres --datanode -D /home/pgxl/data_node3 &
postgres --coordinator -D /home/pgxl/coord2 &
On node4:
postgres --datanode -D /home/pgxl/data_node4 &
Example of processes, on the GTM and GTM proxy:
[pgxl@ip-172-30-1-154 ~]$ ps -ef|grep gtm
pgxl 7026 6952 0 16:58 pts/0 00:00:00 gtm -D /home/pgxl/data_gtm/
pgxl 7052 6952 0 17:05 pts/0 00:00:00 gtm_proxy -D /home/pgxl/data_gtm_proxy/
pgxl 7152 6952 0 17:36 pts/0 00:00:00 grep gtm
On a datanode-only node:
[pgxl@ip-172-30-1-154 ~]$ ps -ef|grep postgres
pgxl 7128 6952 0 17:26 pts/0 00:00:00 postgres --datanode -D /home/pgxl/data_node1
pgxl 7129 7128 0 17:26 ?     00:00:00 postgres: logger process
pgxl 7131 7128 0 17:26 ?     00:00:00 postgres: pooler process
pgxl 7132 7128 0 17:26 ?     00:00:00 postgres: checkpointer process
pgxl 7133 7128 0 17:26 ?     00:00:00 postgres: writer process
pgxl 7134 7128 0 17:26 ?     00:00:00 postgres: wal writer process
pgxl 7135 7128 0 17:26 ?     00:00:00 postgres: autovacuum launcher process
pgxl 7136 7128 0 17:26 ?     00:00:00 postgres: stats collector process
pgxl 7155 6952 0 17:37 pts/0 00:00:00 grep postgres
On a datanode and coordinator node:
[pgxl@ip-172-30-1-156 ~]$ ps -ef|grep postgres
pgxl 7151 6922 0 17:35 pts/0 00:00:00 postgres --datanode -D /home/pgxl/data_node3
pgxl 7152 7151 0 17:35 ?     00:00:00 postgres: logger process
pgxl 7154 7151 0 17:35 ?     00:00:00 postgres: pooler process
pgxl 7155 7151 0 17:35 ?     00:00:00 postgres: checkpointer process
pgxl 7156 7151 0 17:35 ?     00:00:00 postgres: writer process
pgxl 7157 7151 0 17:35 ?     00:00:00 postgres: wal writer process
pgxl 7158 7151 0 17:35 ?     00:00:00 postgres: autovacuum launcher process
pgxl 7159 7151 0 17:35 ?     00:00:00 postgres: stats collector process
pgxl 7161 6922 0 17:35 pts/0 00:00:00 postgres --coordinator -D /home/pgxl/coord2
pgxl 7162 7161 0 17:35 ?     00:00:00 postgres: logger process
pgxl 7164 7161 0 17:35 ?     00:00:00 postgres: pooler process
pgxl 7165 7161 0 17:35 ?     00:00:00 postgres: checkpointer process
pgxl 7166 7161 0 17:35 ?     00:00:00 postgres: writer process
pgxl 7167 7161 0 17:35 ?     00:00:00 postgres: wal writer process
pgxl 7168 7161 0 17:35 ?     00:00:00 postgres: autovacuum launcher process
pgxl 7169 7161 0 17:35 ?     00:00:00 postgres: stats collector process
pgxl 7171 6922 0 17:35 pts/0 00:00:00 grep postgres
10. Create the cluster node information on each datanode and coordinator:
CREATE NODE coord1
WITH (TYPE = 'coordinator', HOST = '172.30.1.155',
PORT = 5433);
CREATE NODE coord2
WITH (TYPE = 'coordinator', HOST = '172.30.1.156',
PORT = 5433);
CREATE NODE datanode1
WITH (TYPE = 'datanode', HOST = '172.30.1.154',
PORT = 5432);
CREATE NODE datanode2
WITH (TYPE = 'datanode', HOST = '172.30.1.155',
PORT = 5432, PRIMARY, PREFERRED);
CREATE NODE datanode3
WITH (TYPE = 'datanode', HOST = '172.30.1.156',
PORT = 5432);
CREATE NODE datanode4
WITH (TYPE = 'datanode', HOST = '172.30.1.157',
PORT = 5432);
11. Update the coordinators' information in pgxc_node.
On node2:
psql -d postgres -p 5433
postgres=# begin;
BEGIN
postgres=# update pgxc_node set node_port=5433 where node_name='coord1';
UPDATE 1
-- check whether the row is updated before you commit
postgres=# commit;
COMMIT
On node3:
psql -d postgres -p 5433
postgres=# begin;
BEGIN
postgres=# update pgxc_node set node_host='172.30.1.156' where node_name='coord2';
UPDATE 1
-- check whether the row is updated before you commit
postgres=# commit;
COMMIT
12. Alter the data nodes information in pgxc_node:
On node1:
psql -d postgres -p 5432
postgres=# begin;
BEGIN
postgres=# alter node datanode1 with (TYPE = 'datanode',
HOST = '172.30.1.154', PORT = 5432);
ALTER NODE
postgres=# select * from pg_catalog.pgxc_node;
 node_name | node_type | node_port |  node_host   | nodeis_primary | nodeis_preferred |   node_id
-----------+-----------+-----------+--------------+----------------+------------------+--------------
 coord1    | C         |      5433 | 172.30.1.155 | f              | f                |   1885696643
 coord2    | C         |      5433 | 172.30.1.156 | f              | f                |  -1197102633
 datanode2 | D         |      5432 | 172.30.1.155 | t              | t                |   -905831925
 datanode3 | D         |      5432 | 172.30.1.156 | f              | f                |  -1894792127
 datanode4 | D         |      5432 | 172.30.1.157 | f              | f                |  -1307323892
 datanode1 | D         |      5432 | 172.30.1.154 | f              | f                |    888802358
(6 rows)
postgres=# commit;
COMMIT
On node2:
psql -d postgres -p 5432
postgres=# begin;
BEGIN
postgres=# alter node datanode2 with (TYPE = 'datanode',
HOST = '172.30.1.155', PORT = 5432, PRIMARY, PREFERRED);
postgres=# select * from pgxc_node;
 node_name | node_type | node_port |  node_host   | nodeis_primary | nodeis_preferred |   node_id
-----------+-----------+-----------+--------------+----------------+------------------+--------------
 coord1    | C         |      5433 | 172.30.1.155 | f              | f                |   1885696643
 coord2    | C         |      5433 | 172.30.1.156 | f              | f                |  -1197102633
 datanode1 | D         |      5432 | 172.30.1.154 | f              | f                |    888802358
 datanode3 | D         |      5432 | 172.30.1.156 | f              | f                |  -1894792127
 datanode4 | D         |      5432 | 172.30.1.157 | f              | f                |  -1307323892
 datanode2 | D         |      5432 | 172.30.1.155 | t              | t                |   -905831925
(6 rows)
postgres=# commit;
COMMIT
On node3:
psql -d postgres -p 5432
postgres=# begin;
BEGIN
postgres=# alter node datanode3 with (TYPE = 'datanode',
HOST = '172.30.1.156', PORT = 5432);
postgres=# select * from pgxc_node;
 node_name | node_type | node_port |  node_host   | nodeis_primary | nodeis_preferred |   node_id
-----------+-----------+-----------+--------------+----------------+------------------+--------------
 coord1    | C         |      5433 | 172.30.1.155 | f              | f                |   1885696643
 coord2    | C         |      5433 | 172.30.1.156 | f              | f                |  -1197102633
 datanode1 | D         |      5432 | 172.30.1.154 | f              | f                |    888802358
 datanode2 | D         |      5432 | 172.30.1.155 | t              | t                |   -905831925
 datanode4 | D         |      5432 | 172.30.1.157 | f              | f                |  -1307323892
 datanode3 | D         |      5432 | 172.30.1.156 | f              | f                |  -1894792127
(6 rows)
postgres=# commit;
COMMIT
On node4:
psql -d postgres -p 5432
postgres=# begin;
BEGIN
postgres=#
postgres=# alter node datanode4 with (TYPE = 'datanode',
HOST = '172.30.1.157', PORT = 5432);
postgres=# select * from pgxc_node;
 node_name | node_type | node_port |  node_host   | nodeis_primary | nodeis_preferred |   node_id
-----------+-----------+-----------+--------------+----------------+------------------+--------------
 coord1    | C         |      5433 | 172.30.1.155 | f              | f                |   1885696643
 coord2    | C         |      5433 | 172.30.1.156 | f              | f                |  -1197102633
 datanode1 | D         |      5432 | 172.30.1.154 | f              | f                |    888802358
 datanode2 | D         |      5432 | 172.30.1.155 | t              | t                |   -905831925
 datanode3 | D         |      5432 | 172.30.1.156 | f              | f                |  -1894792127
 datanode4 | D         |      5432 | 172.30.1.157 | f              | f                |  -1307323892
(6 rows)
postgres=# commit;
COMMIT
13. For verification, log in to any of the coordinators on nodes 2 and 3. Distribute by
replication on the coordinator:
create table dist_repl(id int) distribute by replication to node
(datanode1, datanode2, datanode3, datanode4);
insert into dist_repl values(generate_series(1,8));
On each datanode:
select * from dist_repl;
Distribute by hash on the coordinator:
create table dist_hash(id int) distribute by hash(id) to
node(datanode1, datanode2, datanode3, datanode4);
insert into dist_hash values(generate_series(1,8));
On the datanode:
select * from dist_hash;
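To quickly confirm how the rows were spread, the following sketch can be run from one of the coordinators. It assumes that the Postgres-XL build in use exposes the xc_node_id system column on distributed tables, which reports the node that stores each row; treat it as an illustrative check rather than part of the official procedure:
-- Hash-distributed table: rows should be spread across the four data nodes
SELECT xc_node_id, count(*) AS rows_on_node
FROM dist_hash
GROUP BY xc_node_id
ORDER BY xc_node_id;
-- Replicated table: every data node should hold all eight rows
SELECT count(*) FROM dist_repl;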
7
Working with Third-Party Replication Management Utilities
In this chapter, we will cover the following:
Setting up Barman
Backup and recovery using Barman
Setting up OmniPITR
WAL management with OmniPITR
Setting up repmgr
Using repmgr to create replica
Setting up walctl
Using walctl to create replica
Introduction
One of the major responsibilities of a database administrator is to use third-party tools to
manage the database, as not everything can be done through scripting. Every tool is
designed for a specific purpose and can be utilized as per our requirements.
In this chapter, we are going to cover tools such as Barman and OmniPITR, which
are related to backup/recovery and WAL management; repmgr, for auto-failover
and for setting up replicas in an easy way; and walctl, to create replicas and manage
WAL files.
Setting up Barman
Barman, which stands for Backup and Recovery Manager, is a third-party tool developed
by a company known as 2ndQuadrant, and it serves as a complete backup management
system for PostgreSQL.
Barman can be used to perform the following functions:
It can be used to manage PostgreSQL server backups
It can be used as a tool to perform and restore backup images for PostgreSQL
databases
It can be used to perform incremental backups and to configure backup retention policies
Getting ready
The following are some of the prerequisites that are required before installing Barman:
The system on which Barman needs to be installed must be either a Linux or a
Unix system.
Python version 2.6 or higher is required.
The rsync version needs to be higher than 3.0.4. On recent
operating systems, this should be taken care of automatically.
For our requirement we will be using two servers. One will serve as a backup server and
the other one will serve as a primary server. The backup server is named pg-backup and
the primary server is named pg-primary.
How to do it…
The following are the series of steps that are required to be carried out to install and set up
Barman:
1. If you are running a Red Hat-based distribution you should use the following
command to install the Barman tool:
yum install barman
On Ubuntu-based systems you need to use the following command:
apt-get install barman
The next two steps talk about setting up SSH connectivity between the
backup and the primary server.
2. In this step we log into the backup server as the barman user and we will execute
the following commands to set up direct ssh access to the primary server:
ssh-keygen -t rsa -N ' '
ssh-copy-id postgres@pg-primary
3. Quite similar to the previous server, but in a vice versa manner we will now log
into the primary server and set up ssh access to the backup server:
ssh-keygen -t rsa -N ' '
ssh-copy-id barman@pg-backup
4. We will now configure the primary server's firewall file to allow connectivity
access to the backup server:
host all postgres pg-backup md5
5. In the next step, we will enable the following PostgreSQL configuration
parameters on the primary server:
archive_mode = on
archive_command = 'rsync -aq %p barman@pg-backup:primary/incoming/%f'
Make sure you set up passwordless authentication between the backup server
and the primary server.
6. In the next step on the backup server, we will enter the following line in the
.pgpass file for the barman user:
*:*:*:postgres:postgres-password
7. In the next step, we will restart the PostgreSQL server on the primary for changes
made to the server parameters in the previous step to come into effect:
pg_ctl -D /db/pgdata restart
8. In the next step, we are going to add the following lines at the end of the
barman.conf configuration file, which resides in the /etc or /etc/barman
directory:
[primary]
description = "Primary PostgreSQL Server"
conninfo = "host=pg-primary user=postgres"
ssh_command = "ssh postgres@pg-primary"
9. In the next step, we are going to issue the following command from the backup
server to see the primary server's barman configuration:
barman check primary
How it works…
The following is the explanation for steps carried out in the preceding section:
In the first step of the preceding section, we installed the Barman toolkit and
depending on whether the platform is a Red Hat distribution or an
Ubuntu/Debian Linux distro the appropriate commands are given for the
installation of the Barman tool for each platform.
As mentioned in steps 2 and 3, we basically configured ssh access to and from the
primary and the backup server. The barman user needs to fetch the PostgreSQL
files from the primary server. In addition to this the Postgres user needs to be
able to send files to the backup server using rsync. For this reason we set up
passwordless SSH connectivity between the two servers for seamless interaction.
In step 4, we basically modified the access control file pg_hba.conf of the
primary server to allow the Postgres user to connect to primary from the backup
server.
In step 5, we enabled archiving by setting archive_mode=on and then we set
archive_command in order to send the archived WAL files to the backup server
so that Barman can find these WAL files in an appropriate directory.
In step 6, we basically enter the password credentials of the postgres user in the
.pgpass file.
In step 7, we restart the PostgreSQL service on the primary server so that the
changes made in step 5 come into effect straightaway.
In step 8, we entered the configuration information for Barman. The first parameter,
description, is essentially a named label so that Barman knows this is the
primary server. The second line is the conninfo parameter, which specifies the
PostgreSQL connection settings for the primary server that may be used internally
by Barman's Python libraries. The third line is the ssh_command parameter,
which tells Barman how to access files on the primary server as the postgres user.
Step 9 is more like checking Barman configuration that has been set up in all the
previous steps of the preceding section. In essence, this step goes beyond just
checking the Barman configuration and server status. It in fact creates different
directories and tracking files that can be used to maintain PostgreSQL database
server backups.
Backup and recovery using Barman
Now that we have installed and configured Barman it is now time to use Barman for its
actual purpose, that is, taking backups with Barman and recovering PostgreSQL using the
backups taken with Barman. This section is divided into two areas, first taking the backup
and then testing its recovery.
Getting ready
All the steps in this recipe require Barman to be installed on the backup server. For
restoring data we are going to use a new server named pg-clone.
How to do it…
The first part is to take backups, which will be done here. The following steps are all carried
out on the backup server:
1. First, we create a Barman backup with the following command:
barman backup primary
2. In the next step, we examine the list of backups with the following command:
barman list-backup primary
3. In the third step, we check the metadata of the most recent backup:
barman show-backup primary latest
4. In the next step, we see all files that were backed up with the most recent Barman
backup:
barman list-files primary latest
In the next series of steps we are restoring data with Barman. Here we are going to
demonstrate how to restore data to the target server pg-clone:
1. The first step is to log into the backup server and enable SSH connectivity for the
target pg-clone server:
ssh-copy-id postgres@pg-clone
2. In the next step, we need to ensure that the data in the data directory of the target
server is deleted/empty because using Barman we are going to restore the data to
this directory:
rm -Rf /db/pgdata
3. In the next step, we are going to send the data to the target server from the
backup server and then we are going to recover the data:
barman recover --remote-ssh-command "ssh postgres@pg-clone"
primary latest /db/pgdata
4. In the final step, we start the PostgreSQL server on the target server having
recovered the data:
pg_ctl -D /db/pgdata start
How it works…
Here we are going to explain the steps carried out in the preceding section.
First, the backup steps 1 to 4:
In step 1, we tell Barman to take a backup with the backup command; the label
primary tells Barman to contact the primary server pg-primary over SSH, retrieve
all the database files, and save them in its repository.
In step 2, we list and examine the backups stored in our Barman repository.
In step 3, we examine the metadata of the backup that was made. The metadata
includes the series of WAL files produced during the backup and, among other
things, the start and stop times of the backup.
In step 4, we examine all the files that were saved as part of the barman backup
command.
Then, the restore steps:
In the first restore step, we configure SSH connectivity to the target server pg-clone
by sending the SSH key to the target server with the ssh-copy-id command.
In the second restore step, we remove any existing data in the data directory of the
target server, because we are going to restore the data directory from the backup
made with Barman.
In the third restore step, we use the barman recover command to restore the data.
We add the --remote-ssh-command option because recovery is normally done on
the local server; with --remote-ssh-command, we initiate the recovery on the target
server instead. We specify primary as the label and latest as the backup ID to
restore; in short, we restore the latest backup of the primary server onto the target.
All of these commands are issued from the terminal of the backup server, and the
restore eventually lands in the data directory of the target server.
In the final restore step, we start the PostgreSQL server on the target, having
completed a full recovery of the primary onto it.
Setting up OmniPITR
OmniPITR is a third-party tool that consists of a set of scripts that are used to maintain and
manage PostgreSQL database transaction log files, that is, WAL logs.
OmniPITR can also be used for initiating hot backups on master and slave PostgreSQL
servers.
In this recipe, we are going to set up and install OmniPITR.
Getting ready
In this recipe, we are going to show how to install OmniPITR on a machine. However, this
needs to be done on both the master and slave servers. For our exercise, the pg-primary
serves as the master server and the pg-clone server serves as the slave server.
How to do it…
The following are the series of steps that are required to configure and install OmniPITR:
1. On both the master and slave machines we can install the OmniPITR as follows:
$ git clone git://github.com/omniti-labs/omnipitr.git
2. In the next step, we run a sanity check to see if the omniPITR installation goes
fine on both the machines:
$ /opt/omnipitr/bin/sanity-check.sh
Checking:
- /opt/omnipitr/bin
- /opt/omnipitr/lib
5 programs, 9 libraries.
All checked, and looks ok.
How it works…
In step 1 of the preceding section, we used the git clone command to copy the existing
Git repository onto the local machines, that is, the master and the slave server. Once that is
done, in the next step we run a shell script named sanity-check.sh, which examines
various sources and produces a report. If everything is fine, the sanity-check.sh script
reports an OK message, as seen in step 2 of the preceding section.
WAL management with OmniPITR
Transaction log files, that is, WAL logs, play a significant role in many situations, be it
backups, point-in-time recovery, replication, or even crash recovery.
Here we are going to utilize the OmniPITR tool to effectively manage WAL log files for
archival and recovery purposes.
Getting ready
The prerequisites are that OmniPITR needs to be installed on all the participating servers.
Here we are assuming three servers, one is the primary server, the next one is the slave
server, and the other one is the backup server.
We also assume the following configuration on these servers.
On the master server, we should have or create the following directories that should be
writeable by the postgres user:
/home/postgres/omnipitr: Location of the data files stored by omnipitr
/home/postgres/omnipitr/log: Location of the omnipitr log
/home/postgres/omnipitr/state: Location of the omnipitr state files
/var/tmp/omnipitr: Location of temporary files created by omniPITR
On the slave server we assume the following configuration:
/home/postgres/wal_archive: Location where the master sends xlog
segments for replication
/home/postgres/omnipitr: Location of the data files stored by omnipitr
/home/postgres/omnipitr/log: Location of the omnipitr log
/home/postgres/omnipitr/state: Location of the omnipitr state files
/var/tmp/omnipitr: Location of temporary files created by omniPITR
On the backup server we assume the following configuration:
/var/backup/database: Directory for database backups
/var/backup/database/hot_backup: Directory for hot backup files
/var/backup/database/xlog: Directory for xlog segments
We are going to use direct rsync for transfers over servers for OmniPITR transfers.
We assume the following direct rsync paths:
rsync://slave/wal_archive: This refers to /home/postgres/wal_archive
on the slave server with write access to the master without a password
rsync://backup/database: This refers to /var/backups/database on the
backup server with write access to both the master and slave without a password
How to do it…
The following are the series of steps that need to be carried out on the master server for
archival of xlogs from the master:
1. In the first step, we are going to configure archiving as follows by setting the
following server parameter in the postgresql.conf file:
archive_mode = on
2. In the second step, we will set the archive_command parameter as follows:
archive_command = '/opt/omnipitr/bin/omnipitr-archive -l /home/postgres/omnipitr/log/omnipitr-^Y^m^d.log -s /home/postgres/omnipitr/state -dr gzip=rsync://slave/wal_archive/ -dr gzip=rsync://backup/database/xlog/ -db /var/tmp/omnipitr/dstbackup -t /var/tmp/omnipitr/ -v "%p"'
3. In the third step, we are going to set the archive_timeout parameter:
archive_timeout = 60
4. In the final step, we are going to restart the master:
pg_ctl -D /home/postgres/data restart
How it works…
Here we are going to discuss the steps mentioned in the preceding section:
In step 1, we enabled WAL archiving by setting archive_mode = on.
Once archiving is enabled in step 1 of the preceding section, we can then set the
archive_command parameter in step 2, which invokes the omnipitr-archive command
used to archive WAL logs and send them from the master server to the slave server.
In step 3, we set the archive_timeout parameter to 60 seconds to force the PostgreSQL
server to switch to a new WAL log every 60 seconds. Finally, in step 4, we restarted the
PostgreSQL server on the master so that the values set for the configuration parameters
come into effect.
We will now discuss the options used in archive_command:
-l /home/postgres/omnipitr/log/omnipitr-^Y^m^d.log: This option refers to
the log file path, which will be rotated automatically on every date change
-s /home/postgres/omnipitr/state: This option refers to the directory in
which omnipitr keeps state information
-dr gzip=rsync://slave/wal_archive/: This option sends the WAL
segments in compressed gzip format to the slave
-dr gzip=rsync://backup/database/xlog/: This option sends the
compressed WAL segments to the backup server
-db /var/tmp/omnipitr/dstbackup: Ideally this path should not exist, but if
it does, it will be used as an additional location for the purpose of the
omnipitr-backup-master program
-t /var/tmp/omnipitr/: This option specifies where to keep large temporary
files created by OmniPITR
Setting up repmgr
repmgr is an open source tool developed by 2ndQuadrant for the purpose of managing
replication and failover in a cluster of PostgreSQL servers. It builds on and enhances
PostgreSQL server's existing capabilities to create new clones, that is, replicas, and add
them to an existing cluster of PostgreSQL servers. In addition to this it is used to monitor
replication setups and perform administrative tasks such as failover or manual switchover
operations.
In this recipe, we are going to set up the repmgr tool.
How to do it…
There are two ways to install the repmgr tool:
1. Through operating system package manager tools such as yum or apt-get
install.
2. Through a source installation, by downloading the tar.gz from
https://github.com/2ndQuadrant/repmgr/releases, or by cloning the
GitHub repository via the git clone command from the repmgr GitHub page.
In this recipe, we will be using the standard package manager tools to install repmgr.
For this recipe we will need at least two servers. The primary PostgreSQL server will be
named pg-primary and the replica server is named pg-clone. /data is our default data
directory as defined by the $PGDATA environment variable:
1. On CentOS/Red Hat-based systems use the following command:
sudo yum install repmgr
2. On Debian/Ubuntu based distributions you should use the following command:
sudo apt-get install repmgr postgresql-9.4-repmgr
3. In the next step, we are going to copy the repmgr script from the /init directory
to /etc/init.d on each server.
4. If the supplied init script was copied, execute these commands as a root-capable
user:
sudo rm -f /etc/init.d/repmgrd
sudo chmod 755 /etc/init.d/repmgr
In the next series of steps we are going to configure the primary server.
5. Log in as the Postgres user and generate an RSA key pair and export the public
key to the replica:
ssh-keygen -t rsa -N ' '
ssh-copy-id postgres@pg-clone
6. In the next step, we need to modify postgresql.conf and set the following
server parameters:
wal_level = hot_standby
archive_mode = on
archive_command = 'exit 0'
wal_keep_segments = 5000
hot_standby = on
7. In the next step, we will modify the pg_hba.conf file and add the following
lines:
host    all            postgres    192.168.56.0/24    trust
host    replication    postgres    192.168.56.0/24    trust
8. In the next step, we will restart the PostgreSQL server so that the parameter
changes take effect:
pg_ctl -D /data restart
9. Execute this command to find the binary path to PostgreSQL tools:
pg_config --bindir
10. We will now create the configuration file for repmgr, named repmgr.conf, with
the following parameters:
cluster=pgnet
node=1
node_name=parent
conninfo='host=pg-primary dbname=postgres'
pg_bindir=[value from step 9]
11. We will now register the master node with the following command:
repmgr -f /etc/repmgr.conf master register
12. Finally, we will start the repmgr daemon:
sudo service repmgr start
How it works…
The following is a description for the preceding steps:
In steps 1 and 2, we install repmgr using yum on CentOS; if you are using an
Ubuntu system, follow step 2 instead and skip step 1.
In steps 3 and 4, we simply daemonize the service.
The remaining steps are meant only for the primary server.
In step 5, we generate an SSH key pair and export the public key to the replica so
that repmgr can copy files between the servers without a password.
In step 6, we set some PostgreSQL server parameters: wal_level to hot_standby,
archive_mode to on, and archive_command to exit 0, which simply exits with a
status of 0. This makes PostgreSQL believe that archiving is working correctly.
We also set wal_keep_segments to 5000, which preserves roughly 80 GB of WAL
files on the primary server (5,000 segments of 16 MB each). This is essentially a
requirement for repmgr: because we may need to configure multiple replicas, we
need this much WAL retained on the primary server.
In the next step, we edit the PostgreSQL authentication and access configuration
file, pg_hba.conf, to allow the postgres user to connect to any database. We allow
these connections to originate from anywhere within the 192.168.56.0 subnet.
In the next step, we create a single file named repmgr.conf in the /etc directory.
We name the repmgr cluster pgnet, note that this is our first node, and name our
node parent, as it is easy to remember. The connection information needs to match
our entry in pg_hba.conf. Finally, we set pg_bindir so that repmgr always knows
where to find certain PostgreSQL binaries. This setting is supposed to be optional,
but we ran into several problems when we tried to omit this entry; just keep it for
now.
As a final step we registered the primary node and completed the installation
process by creating various database objects. These steps are all performed by the
repmgr command, provided we specify the configuration file with -f and use
the master register parameter.
The repmgr system comes with a daemon that manages communication and controls
behavior between other repmgr nodes. Whenever we start repmgr it will run in the
background and await the arrival of new replicas/clones.
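As a quick sanity check after registering the master, we can look at the metadata repmgr keeps in the database. This is a hedged sketch: it assumes the repmgr version in use stores its node catalog in a repl_nodes table inside the repmgr_pgnet schema (the same schema whose repl_status view is queried in the next recipe):
-- Run on the primary; lists the nodes repmgr currently knows about
SELECT * FROM repmgr_pgnet.repl_nodes;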
Using repmgr to create replica
As mentioned earlier in this chapter, repmgr is a management suite to manage replication
setups and also to create replicas and add them to the existing PostgreSQL clusters. In this
recipe, we are going to create a new clone of the primary server.
This recipe builds on from the previous recipe. Here we are going to create a replica of the
primary server pg-primary and as we said we need two machines participating in this
configuration. Our replica server will be titled pg-clone and the /data location will be our
data directory.
How to do it…
All the following commands in this recipe are carried out in the replica server pg-clone.
Here are the complete steps:
1. First log in as the Postgres user and generate a RSA key-pair and send the
credentials to the primary server pg-primary:
ssh-keygen -t rsa -N ' '
ssh-copy-id postgres@pg-primary
2. In the next step, we are going to clone the data on the pg-primary server with the
following command:
repmgr -D /data standby clone pg-primary
3. We are going to start the cloned server with the following command:
pg_ctl -D /data start
4. Execute this command to find the binary path to PostgreSQL tools:
pg_config --bindir
5. We will now create the configuration file for repmgr named repmgr.conf with
the following contents:
cluster=pgnet
node=2
node_name=child1
conninfo='host=pg-clone dbname=postgres'
pg_bindir=[value from step 4]
6. In the next step, we are going to register the pg-clone with the primary server:
repmgr -f /etc/repmgr.conf standby register
7. In the next step, we are going to start the repgmr service:
sudo service repmgr start
8. In the final step, we connect to the Postgres database and view the standby node's
replication lag with the following query:
SELECT standby_node, standby_name, replication_lag FROM
repmgr_pgnet.repl_status;
How it works…
Much of the preliminary work was done earlier in the previous recipe. Here in the
preceding section all the steps are carried out on the replica server pg-clone.
In the first step, we were creating an SSH key and exporting it to the primary server.
Now that the SSH keys are established we are now cloning the primary server on the
replica with the repmgr command.
Once the cloning is completed we are ready to launch the replica server. When you start the
replica server it immediately connects to the primary server pg-primary and begins
replication. It can do this because repmgr knows all of the connection information necessary
to establish a streaming replication connection with pg-primary. During the cloning
process, it automatically created a recovery.conf file suitable to start directly in replication
mode.
In the next step, we configured repmgr to recognize the new replica. When we create
/etc/repmgr.conf, we need to use the same cluster name as we used on pg-primary. We
also tell repmgr that this is node 2, and it should be named child1. The conninfo value
should always reflect the connection string necessary for repmgr to connect to PostgreSQL
on the named node. As we did earlier, we set pg_bindir to avoid encountering possible
repmgr bugs.
With the configuration file in place, we can register the new clone in a similar way to how
we registered the primary. By calling the repmgr command with -f and the full path to the
configuration file, there are several operations we can invoke. For now, we will settle with
the standby register to tell repmgr that it should track pg-clone as a part of the pgnet
cluster.
Once we start the repmgrd daemon, all nodes are aware of each other and the current status
of each.
Setting up walctl
walctl is basically a WAL management system that either pushes or fetches WAL files from
a remote central server. It is a substitute for archive_command or restore_command in
handling WAL archival or recovery.
walctl also includes a utility to clone a primary server and create a replica.
How to do it…
For this recipe we are going to use three servers. The remote server that will handle archival
is called pg-arc. The primary server will be named pg-primary and the standby server
will be named pg-clone. Our assumption is that the data directory will be located at /data
location and the same can be defined in the $PGDATA environment variable.
Here are the steps for this recipe:
1. On the primary server and standby run the following commands:
git clone https://github.com/OptionsHouse/walctl
cd walctl
sudo make install
2. On the archival server pg-arc create the wal storage directory:
sudo mkdir -m 0600 /db/wal_archive
sudo chown postgres:postgres /db/wal_archive
3. On the primary server, we are going to create an RSA key pair and export it to the
archival and the standby server:
ssh-keygen -t rsa -N ' '
ssh-copy-id pg-arc
ssh-copy-id pg-clone
4. Similarly, on the standby server we are going to create an SSH key and export it
to the primary and archival server:
ssh-keygen -t rsa -N ' '
ssh-copy-id pg-arc
ssh-copy-id pg-primary
5. On the primary server, we are then going to create a user for walctl usage:
CREATE USER walctl WITH PASSWORD 'superb' SUPERUSER REPLICATION;
6. In the next step, we are going to modify the pg_hba.conf file on primary to
allow the connection from archival and standby servers:
host    replication    walctl    pg-clone    md5
host    replication    walctl    pg-arc      md5
7. On the pg-clone and pg-primary servers we are going to configure the
password authentication file with the following entries:
*:*:*:walctl:superb
8. In this step, we are going to create the configuration file for walctl, titled
walctl.conf, on the primary and standby server:
PGDATA=/data
ARC_HOST=pg-arc
ARC_PATH=/db/wal_archive
9. On the primary server, we need to execute the following command:
walctl_setup master
10. In the next step, we need to restart the database server on the primary server:
pg_ctl -D /data restart
How it works…
The following is the explanation of steps for the preceding section:
In step 1, we clone walctl from its GitHub repository and install it on the primary
and standby servers.
In step 2, we create the WAL storage directory that will hold the archived files on
the archival server pg-arc.
In step 3, we create the SSH keys on the primary server and export them to the
archival and standby servers.
In step 4, similar to step 3, we create the SSH keys on the standby and export them
to the primary and archival servers.
In step 5, we create a database user named walctl on the primary server. The
standby server will use this user to connect to the primary.
In step 6, we modify the access regulation file pg_hba.conf on the primary to allow
the standby and archival servers to connect to the primary using the walctl user
defined in the previous step.
In step 7, we create a password file, .pgpass, storing the username and password
credentials of the walctl user on both the primary and standby servers.
In step 8, we create the configuration file for walctl, where we define the data
directory, the name of the archive host server, and the WAL archive path on the
archive server pg-arc.
In step 9, we call the walctl_setup command on the primary server. The purpose
of this command is to prepare PostgreSQL for WAL integration. When it is called
with the master parameter, as we have done here, it modifies postgresql.conf so
that WAL files are compatible with archival and streaming replicas can connect. In
addition, it enables archive mode and sets archive_command to invoke a walctl
utility named walctl_push, which sends WAL files to the archive server.
Finally, in step 10, we restart the PostgreSQL server on the primary so that these
changes take effect.
Using walctl to create replica
In this recipe, we are going to use the walctl_clone utility to create a clone/replica.
How to do it…
For this recipe only two servers are needed. One is the primary server and the other is the
replica server. Our primary server will be named pg-primary and the replica will be
named pg-clone server.
We need to execute the following command on the replica server:
walctl_clone pg-primary walctl
How it works…
walctl_clone has two parameters. The first parameter is supposed to be the hostname of
the primary server and the second parameter should be the name of the database superuser
with which we are going to create a backup.
Here are the series of actions that are performed on behalf of walctl_clone:
It puts the primary server in the backup mode.
It retrieves all files from the database. It will copy only the changed files if all the
data files are residing in the data directory.
Once it has copied the respective files it will end the backup mode on the primary
server.
It will automatically create a recovery.conf file that will retrieve files from the
archive server pg-arc and connect as a streaming standby to the primary server
pg-primary.
Finally, it starts the PostgreSQL server on the replica.
8
Database Monitoring and Performance
In this chapter, we will cover the following recipes:
Checking active sessions
Finding out what the users are currently running
Finding blocked sessions
Dealing with deadlocks
Studying table access statistics
Routine reindexing
Logging slow statements
Determining disk usage
Preventing page corruption
Generating planner statistics
Tuning with background writer statistics
Introduction
Monitoring plays a crucial role in determining the capacity and the health of a database
server. Monitoring a database server at regular intervals gives us useful trend information
about how the server behaves for the running business. As the business grows, the collected
trend information helps define the future capacity of the database server.
Server capacity may not be directly proportional to database performance: even if we have
enough resources, a bug in the database engine's optimizer may drain them all. Likewise,
database performance may not be directly proportional to server capacity, since a kernel
bug might exhaust all the resources in no time. To identify such underlying problems,
monitoring trend information is very useful. In this chapter, however, we only cover what
to monitor in PostgreSQL to check its behavior on a database server.
PostgreSQL provides a good number of catalog tables, views, and extensions to monitor the
instance accurately. We are going to use these catalog views to monitor the server's live
metrics.
Checking active sessions
Using this recipe, we will get all the active sessions from a PostgreSQL instance, which are
utilizing the server resources.
Getting ready
Like other database management systems, PostgreSQL provides a set of catalog tables and
views that help us retrieve the current instance status. In this recipe, we will discuss a
catalog view that every DBA/developer queries very frequently.
How to do it…
1. Connect to your database using the psql client as a superuser and check the
track_activities parameter status:
postgres=# SHOW track_activities;
 track_activities
------------------
 on
(1 row)
2. Initiate the pgbench test as follows:
$ pgbench -i; pgbench -c 4 -T 100
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.17 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
starting vacuum...end.
3. Execute the following SQL query to get the list of active sessions along with the
session's attributes using the psql connection:
postgres=# SELECT
datname,
usename,
application_name,
now()-backend_start AS "Session duration",
pid
FROM
pg_stat_activity
WHERE
state='active';
 datname  | usename  | application_name | Session duration |  pid
----------+----------+------------------+------------------+------
 postgres | postgres | psql.bin         | 00:06:41.74497   | 2751
 postgres | postgres | pgbench          | 00:00:46.884915  | 2796
 postgres | postgres | pgbench          | 00:00:46.883277  | 2797
 postgres | postgres | pgbench          | 00:00:46.881618  | 2798
(4 rows)
How it works…
In PostgreSQL, we have an independent statistics collection process called the statistics
collector, which populates the dynamic information about the sessions. This process's
behavior depends on a set of track_* parameters, which tell the stats collector which
metrics it needs to collect from the running instance; enabling only the required tracking at
the instance level keeps the overhead on query execution low. To collect the preceding
metrics from the processes, we need a parameter called track_activities, which is on by
default. If we turn this parameter off, the statistics collector will not give us useful
information such as session status, command and query duration, and so on.
It is not recommended to turn this parameter off, as it gives more useful
monitoring information to the end users. If you do not want any users to
observe which queries are running on a database, you have to use the
following SQL statement to disable it:
ALTER DATABASE <database name> SET track_activities TO false;
Once we have disabled this parameter, we need to depend on the PostgreSQL
log messages to track useful information such as query duration. Besides, it
will reduce the query execution overhead.
When an application makes a connection to PostgreSQL, the postmaster process will fork a
new process for that connection, and keeps its properties information in memory. Using a
dynamic view called pg_stat_activity, we can read the connection properties
information directly from memory.
The pg_stat_activity is a dynamic statistics view, which was built based on a
PostgreSQL internal function, pg_stat_get_activity(), which collects the information
from memory.
Whenever we execute the preceding SQL query:
It will execute the underlying pg_stat_get_activity(NULL) function, which
will get all the sessions information
On the result, the actual query filters the data based on the predicate in the WHERE
clause, which will only give us the active sessions information
There's more…
The pg_stat_activity is a dynamic statistics view that shows all kinds of session states,
such as active, idle, idle in transaction, and so on. Besides, it also gives all the connection
properties, such as the client IP address, the application name in use, the connected
database, and so on.
PostgreSQL also provides a set of per-backend statistics functions, which produce
information similar to pg_stat_activity.
You can find more information about this view at
https://www.postgresql.org/docs/9.6/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW.
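As an illustration of those per-backend statistics access functions, the following sketch lists each backend's PID and its current activity; it assumes only the standard pg_stat_get_backend_* functions that ship with PostgreSQL:
-- List each backend's PID and the query it is currently running
SELECT pg_stat_get_backend_pid(backendid)      AS pid,
       pg_stat_get_backend_activity(backendid) AS query
FROM pg_stat_get_backend_idset() AS backendid;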
Finding out what the users are currently
running
Using this recipe, we will get all the current running SQL commands from the PostgreSQL
instance.
Getting ready
In the previous recipe, we discussed how to get all the active sessions information, and now
we are going to get all the SQL statements that the active sessions are executing.
How to do it…
1. Initiate pgbench as aforementioned.
2. Connect to the PostgreSQL database as either a superuser or database owner and
execute the following SQL statement:
$ psql -h localhost -U postgres
postgres=# SELECT
datname,
usename,
application_name,
now()-backend_start AS "Session duration",
pid,
query
FROM
pg_stat_activity
WHERE
state='active';
-[ RECORD 1 ]----+---------------------------------------------------------------------
datname          | postgres
usename          | postgres
application_name | pgbench
Session duration | 00:00:08.470704
pid              | 29323
query            | UPDATE pgbench_tellers SET tbalance = tbalance + 2331 WHERE tid = 9;
-[ RECORD 2 ]----+---------------------------------------------------------------------
datname          | postgres
usename          | postgres
application_name | pgbench
Session duration | 00:00:08.466858
pid              | 29324
query            | UPDATE pgbench_tellers SET tbalance = tbalance + -1438 WHERE tid = 7;
-[ RECORD 3 ]----+---------------------------------------------------------------------
datname          | postgres
usename          | postgres
application_name | psql
Session duration | 00:00:05.234634
pid              | 29580
query            | SELECT
                 |     datname,
                 |     usename,
                 |     application_name,
                 |     now()-backend_start AS "Session duration",
                 |     pid,
                 |     query
                 | FROM
                 |     pg_stat_activity
                 | WHERE
                 |     state='active';
How it works…
The aforementioned SQL command is similar to the one that collects the active session
information, except for an additional field called query from the pg_stat_activity
dynamic view. As the postgres database has the track_activities parameter enabled,
the statistics collector also tracks the currently running SQL commands.
The preceding results also include the monitoring SQL statement itself, as it was
running at that moment. If you want to exclude the query that retrieves the
current sessions' information, add another predicate to the WHERE clause:
AND pid != pg_backend_pid(), as shown below.
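For example, the same monitoring query with that extra predicate looks as follows; it returns only the other active sessions and filters out the monitoring query itself:
SELECT
    datname,
    usename,
    application_name,
    now()-backend_start AS "Session duration",
    pid,
    query
FROM
    pg_stat_activity
WHERE
    state='active'
    AND pid != pg_backend_pid();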
Finding blocked sessions
In this recipe, we will discuss how to identify the sessions whose queries are currently
blocked by other sessions.
Getting ready
In any database, sessions get blocked for multiple technical reasons, to avoid concurrent
access to the same resource. In PostgreSQL, a session can be blocked by an idle in
transaction session, by concurrent access to the same resource, or by prepared
transactions. In this recipe, we will discuss how to get all the blocked/waiting sessions
from the currently running instance.
How to do it…
1. Initiate pgbench as follows:
$ pgbench -i; pgbench -c 4 -T 100
2. Connect to your database using the psql client as a superuser and execute the
following SQL command:
$ psql -h localhost -U postgres
postgres=#
SELECT
datname,
usename,
application_name,
now()-backend_start AS "Session duration",
pid,
query
FROM
pg_stat_activity
WHERE
state='active'
AND wait_event IS NOT NULL;
-[ RECORD 1 ]----+------------------------------------------------------------------------
datname          | postgres
usename          | postgres
application_name | pgbench
Session duration | 00:01:37.019791
pid              | 17853
query            | UPDATE pgbench_tellers SET tbalance = tbalance + 3297 WHERE tid = 5;
-[ RECORD 2 ]----+------------------------------------------------------------------------
datname          | postgres
usename          | postgres
application_name | pgbench
Session duration | 00:01:37.016008
pid              | 17854
query            | UPDATE pgbench_branches SET bbalance = bbalance + 2742 WHERE bid = 1;
How it works…
The previously mentioned SQL command is similar to the earlier SQL queries, except for an
additional predicate in the WHERE clause: wait_event IS NOT NULL. The
pg_stat_activity view has the wait_event column, which differentiates the waiting and
non-waiting sessions. If the wait_event field is not NULL, then that particular session is
blocked, waiting for some concurrent resource. We can also get the type of event the session
is waiting on using the wait_event_type column of the pg_stat_activity view, as in the
query shown below.
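For example, a minimal query that shows both columns for the waiting sessions only could look like this (a sketch built on the same pg_stat_activity view):
SELECT
    pid,
    state,
    wait_event_type,
    wait_event,
    query
FROM
    pg_stat_activity
WHERE
    wait_event IS NOT NULL;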
There's more…
Once we find that a session is blocked, we need to identify why it got blocked and what
resources it is waiting for. Once we identify the underlying session or transaction that is
blocking it, we can take further action, such as killing the blocking session, cancelling the
blocking transaction, or waiting until the blocking session completes.
Transactional locks
Transactional locks are the implicit locks that get acquired when any DML statement gets
executed. To identify the blocking session information that holds the transactional locks, we
have to use the following query:
postgres=#
SELECT bl.pid                 AS blocked_pid,
       a.usename              AS blocked_user,
       ka.query               AS current_or_recent_statement_in_blocking_process,
       ka.state               AS state_of_blocking_process,
       now() - ka.query_start AS blocking_duration,
       kl.pid                 AS blocking_pid,
       ka.usename             AS blocking_user,
       a.query                AS blocked_statement,
       now() - a.query_start  AS blocked_duration
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a  ON a.pid = bl.pid
JOIN pg_catalog.pg_locks kl         ON kl.transactionid = bl.transactionid AND kl.pid != bl.pid
JOIN pg_catalog.pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.GRANTED;
-[ RECORD 1 ]-----------------------------------+------------------------------------------------------------------------
blocked_pid                                     | 8724
blocked_user                                    | postgres
current_or_recent_statement_in_blocking_process | UPDATE pgbench_branches SET bbalance = bbalance + -1632 WHERE bid = 1;
state_of_blocking_process                       | idle in transaction
blocking_duration                               | 00:00:00.003981
blocking_pid                                    | 8726
blocking_user                                   | postgres
blocked_statement                               | UPDATE pgbench_tellers SET tbalance = tbalance + 4600 WHERE tid = 3;
blocked_duration                                | 00:00:00.00222
From the preceding query results, we can confirm that the session with pid 8726 is blocking
the session with pid 8724, because it is idle in transaction. The most recent query executed
in session 8726 was UPDATE pgbench_branches SET bbalance = bbalance + -1632
WHERE bid = 1.
The previously mentioned SQL query is compatible with PostgreSQL 9.2 and
later. Refer to the following link if you need an SQL query that is
compatible with older versions: https://wiki.postgresql.org/wiki/Lock_Monitoring.
Whenever we find idle in transaction sessions blocking other transactions, we should check
the blocking duration using the preceding SQL statement. If the blocking duration is
significantly high, it is recommended to kill that process using the pg_terminate_backend()
function. However, it is normal to see these kinds of locks in a database as long as the
blocking_duration stays minimal.
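For instance, assuming pid 8726 from the preceding output is the blocking session, a
minimal sketch of cancelling or terminating it could look like this:
postgres=# SELECT pg_cancel_backend(8726);    -- cancels the current query only; has no
                                              -- effect on an idle in transaction session
postgres=# SELECT pg_terminate_backend(8726); -- closes the whole blocking session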
Table level locks
In other cases, a session can also be blocked by table-level locks, which are taken
explicitly, for example in EXCLUSIVE mode using the LOCK statement. To find information
about these table-level exclusive locks, we have to use the following SQL query.
For example,
In session 1:
postgres=# BEGIN;
BEGIN
postgres=# LOCK TABLE test IN EXCLUSIVE MODE;
LOCK TABLE
In session 2:
postgres=# INSERT INTO test VALUES (10);
Session 2 will be in a waiting state until we close the transaction that we opened in session
1. In Session 3 execute the following SQL to identify which session is blocked and which
one is blocking:
postgres=#
SELECT
act1.query as blocking_query,
act2.query as blocked_query,
l2.pid AS blocked_pid,
l1.pid AS blocking_pid,
l1.relation::regclass
FROM
pg_locks l1,
pg_locks l2,
pg_stat_activity act1,
pg_stat_activity act2
WHERE
l1.granted=true AND
l2.granted=false AND
l1.pid=act1.pid AND
l2.pid=act2.pid AND
l1.relation=l2.relation;
-[ RECORD 1 ]--+------------------------------
blocking_query | LOCK TABLE test IN EXCLUSIVE MODE;
blocked_query  | INSERT INTO test VALUES(10);
blocked_pid    | 8913
blocking_pid   | 10417
relation       | test
In general, developers will not use the preceding LOCK statements explicitly in their
application code unless there is a specific requirement. Whenever you observe these kinds
of locks on a database, it is good to consult the DBA before terminating the session, since
there is a chance that the DBA is performing some kind of maintenance on the table while
holding an exclusive lock.
Prepared transaction locks
In some other cases, a session can also be blocked due to two-phase commit (2PC)
prepared transactions.
For example, let's create a prepared transaction on a table as follows.
In session 1:
postgres=# BEGIN;
BEGIN
postgres=# INSERT INTO test VALUES (10);
INSERT 0 1
postgres=# PREPARE TRANSACTION 'testing';
PREPARE TRANSACTION
postgres=# END;
WARNING: there is no transaction in progress
COMMIT
postgres=# SELECT * FROM test;
 t
---
(0 rows)
In session 2:
postgres=# ALTER TABLE test ADD col2 INT;
The preceding DDL operation in session 2 will be in waiting state, until we either commit or
rollback the previous prepared transaction. If we run the previous SQL query, which shows
the waiting sessions information, then the ALTER statement will be in waiting state, as
shown here:
-[ RECORD 1 ]----+-------------------------------
datname          | postgres
usename          | postgres
application_name | psql
Session duration | 00:02:01.030429
pid              | 23223
query            | ALTER TABLE test ADD col2 INT;
wait_event       | relation
If we run the earlier SQL query, which finds the blocking queries, we will get zero rows. This
is because 2PC does not store the process ID of the session that initiated the prepared
transaction. To identify these blocking prepared transactions, we have to use a separate SQL
query, as follows:
postgres=# SELECT locktype, relation::regclass, virtualtransaction FROM
pg_locks WHERE locktype ='relation';
 locktype | relation | virtualtransaction
----------+----------+--------------------
 relation | pg_locks | 4/34
 relation | test     | -1/452455
(2 rows)
A prepared transaction holds a relation lock on all the tables that were used in that
prepared transaction. To identify those tables, we have to query pg_locks, as shown
previously. From the preceding results, we see that the table test has a relation lock, which
is associated with the transaction ID 452455. Now, let's query the pg_prepared_xacts view
using that transaction ID:
postgres=# SELECT * FROM pg_prepared_xacts WHERE transaction = '452455';
 transaction |   gid   |             prepared              |  owner   | database
-------------+---------+-----------------------------------+----------+----------
      452455 | testing | 2016-08-20 19:06:28.534232+05:30  | postgres | postgres
(1 row)
From the preceding results, we can confirm that transaction 452455 is a prepared
transaction, and its gid is testing. Once we know that it is a prepared transaction, we have
to either commit or roll back that transaction, which unlocks the blocked sessions. To
commit or roll back the preceding prepared transaction, we have to execute the
COMMIT PREPARED 'testing' or ROLLBACK PREPARED 'testing' statement.
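For example, a minimal sketch of discarding the prepared transaction from the preceding
output (gid testing); use COMMIT PREPARED instead if the prepared work should be kept:
postgres=# ROLLBACK PREPARED 'testing';
ROLLBACK PREPARED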
Usage of SKIP LOCKED
As of now, we have been collecting information about blocked and blocking queries, which
is useful when troubleshooting database performance issues. Now, let's consider a real use
case where multiple parallel workers process a single dataset. In this situation, each worker
should be smart enough to skip the rows that are already being processed by another
worker process. To achieve this kind of requirement, PostgreSQL has a feature called
SKIP LOCKED, which skips the rows that are already locked by other processes. That is, if
we run a query such as SELECT jobid FROM dataset FOR UPDATE SKIP LOCKED LIMIT 1
in parallel from multiple workers, then each worker process gets a distinct jobid to process.
Here, SKIP LOCKED skips the rows that have been locked by other sessions and returns only
the rows that are not locked. Let's see a demonstration of this feature, as shown here:
In session 1:
postgres=# BEGIN;
BEGIN
postgres=# SELECT jobid FROM dataset FOR UPDATE SKIP LOCKED LIMIT 1;
 jobid
-------
     1
(1 row)
In session 2:
postgres=# BEGIN;
BEGIN
postgres=# SELECT jobid FROM dataset FOR UPDATE SKIP LOCKED LIMIT 1;
 jobid
-------
     2
(1 row)
From the preceding example, each session got a unique jobid; session 2 did not wait at all,
because it skipped the tuple with jobid 1, which was locked by session 1.
Read more information about this feature at this URL:
http://blog.2ndquadrant.com/what-is-select-skip-locked-for-in-postgresql-9-5.
Dealing with deadlocks
Using this recipe, we will be troubleshooting the deadlocks in PostgreSQL.
Getting ready
In any database management system, deadlocks can occur due to concurrent resource
locking. It is the database engine's responsibility to detect deadlocks, and the application's
responsibility to prevent them. The PostgreSQL engine has the ability to detect deadlocks;
it also provides a few features that help developers prevent deadlocks in their application
code.
How to do it…
Let's produce a simple deadlock situation and we will see all the options that PostgreSQL
provides to troubleshoot them:
Have two different database sessions and execute the SQL statements as follows:
Session 1: BEGIN;
Session 2: BEGIN;
Session 1: UPDATE test SET t=1 WHERE t=1;
Session 2: UPDATE test SET t=2 WHERE t=2;
Session 1: UPDATE test SET t=2 WHERE t=2;
           -- Waiting for record 2, which is locked in session 2
Session 2: UPDATE test SET t=1 WHERE t=1;
           -- Waiting for record 1, which is locked in session 1
Session 1: ERROR: deadlock detected
           DETAIL: Process 10417 waits for ShareLock on transaction 452459;
                   blocked by process 8913.
                   Process 8913 waits for ShareLock on transaction 452458;
                   blocked by process 10417.
Session 2: END;
Session 1: ROLLBACK;
From the preceding example, session 1 got the deadlock error due to mutual locking
between sessions 1 and 2. The DETAIL message clearly says that process 8913 waits for a
transaction held by process 10417, while process 10417 waits for a transaction held by
process 8913.
As you can see from the example, deadlocks do not cause any data loss; they only cause a
transaction failure. To avoid these deadlock situations, we have to use the pre-locking
techniques mentioned in the following sections.
Using FOR UPDATE
This approach tries to avoid deadlock issues by pre-locking all the records that the session
is going to update. To pre-lock all the required tuples, we have to use the FOR UPDATE
clause in the SELECT statement. Let's see how to fix the preceding problem using this
approach:
Session 1: BEGIN;
Session 2: BEGIN;
Session 1: SELECT * FROM test WHERE t IN (1, 2) FOR UPDATE;
Session 2: SELECT * FROM test WHERE t IN (1, 2) FOR UPDATE;
           -- Waiting for session 1 to release the lock on records 1, 2
Session 1: UPDATE test SET t=1 WHERE t=1;
Session 1: UPDATE test SET t=2 WHERE t=2;
Session 1: END;
Session 2: UPDATE test SET t=2 WHERE t=2;
Session 2: UPDATE test SET t=1 WHERE t=1;
Session 2: END;
From the preceding example, the session 2 transaction waits until the session 1 transaction
completes. Here, we simply made the two transactions run in a serialized fashion, so they
do not conflict with each other. This approach cannot be used in SQL queries that perform
set operations; that is, FOR UPDATE is not allowed in SQL queries that use
UNION/INTERSECT/EXCEPT operations.
Advisory locks
PostgreSQL provides advisory locks, an application-level locking mechanism that we can
enforce from the application to coordinate concurrent data access. Let's see how to avoid
deadlocks using advisory locks:
Session 1: BEGIN;
Session 2: BEGIN;
Session 1: SELECT pg_advisory_lock(t) FROM test WHERE t IN (1,2);
Session 2: SELECT pg_advisory_lock(t) FROM test WHERE t IN (1,2);
           -- Waiting for session 1 to release the locks
Session 1: UPDATE test SET t=1 WHERE t=1;
Session 1: UPDATE test SET t=2 WHERE t=2;
Session 1: SELECT pg_advisory_unlock(t) FROM test WHERE t IN (1,2);
Session 1: END;
Session 2: UPDATE test SET t=2 WHERE t=2;
Session 2: UPDATE test SET t=1 WHERE t=1;
Session 2: SELECT pg_advisory_unlock(t) FROM test WHERE t IN (1,2);
Session 2: END;
In the preceding example, we serialized the transactions by using advisory locks. The only
disadvantage of advisory locks is that the application has to enforce the lock and unlock
behavior. Also, session-level advisory locks are not released even if the transaction fails or
rolls back. However, when a session is closed, all the advisory locks associated with that
session are released automatically.
Refer to the following URL for more information about advisory locks:
https://www.postgresql.org/docs/9.6/static/explicit-locking.html.
Table access statistics
In this recipe, we will be discussing how to study the table access statistics by using
PostgreSQL catalog views.
Getting ready
PostgreSQL's statistics collector process keeps track of all kinds of operations that are
performed on any table or index. These statistics are helpful in determining the
importance of an object to the business, and also in verifying whether the required
maintenance operations are running on it when needed.
How to do it…
Let's create a test table, which we will perform a few operations on:
postgres=# CREATE TABLE test (t INT);
CREATE TABLE
PostgreSQL provides a few catalog views, which we are going to query to analyze the
statistics that the statistics collector process collects in the background.
pg_stat_user_tables
This catalog view provides most of the transactional metrics as well as maintenance
operations metrics. Now, let's go and perform a few transactions and see how they reflect in
this view:
postgres=# INSERT INTO test VALUES(1);
INSERT 0 1
postgres=# UPDATE test SET t=10;
UPDATE 1
postgres=# DELETE FROM test;
DELETE 1
postgres=# SELECT relname, n_tup_ins "Total Inserts", n_tup_upd "Total
Updates", n_tup_del "Total deletes" FROM pg_stat_user_tables WHERE
relname='test';
-[ RECORD 1 ]-+-----
relname       | test
Total Inserts | 1
Total Updates | 1
Total deletes | 1
From the preceding results, it seems like our sample transactions are reflected properly in
pg_stat_user_tables. Now, let's query on this table and see the relation between
seq_scan and idx_scan:
postgres=# SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables WHERE relname='test';
-[ RECORD 1 ]-+--------
relname       | test
seq_scan      | 12
seq_tup_read  | 100004
idx_scan      | 4
idx_tup_fetch | 50003
From the preceding results, the table has gone through more sequential scans than index
scans. Also, the sequential scans read more tuples than the index scans did. Now, let's
perform a few updates on the test table and see the maintenance operation metrics:
postgres=# SELECT relname, last_autovacuum, last_autoanalyze,
autovacuum_count, autoanalyze_count FROM pg_stat_user_tables WHERE
relname='test';
-[ RECORD 1 ]-----+----------------------------------
relname           | test
last_autovacuum   | 2016-09-08 13:31:45.313314+05:30
last_autoanalyze  | 2016-09-08 13:31:45.451255+05:30
autovacuum_count  | 2
autoanalyze_count | 2
From the preceding results, it seems that the autovacuum process was invoked two times
after we performed a few update operations on the table. PostgreSQL 9.4 introduced a new
monitoring metric called n_mod_since_analyze, which shows the number of modifications
that have happened since the last analyze operation on a table. Now, let's update the whole
table and see how many tuples have changed since the most recent analyze operation:
postgres=# UPDATE test SET t=1000;
UPDATE 100000
postgres=# SELECT relname, n_mod_since_analyze FROM pg_stat_user_tables
WHERE relname='test';
-[ RECORD 1 ]-------+-------
relname             | test
n_mod_since_analyze | 100000
From the preceding results, the table has 100000 tuple modifications that have not yet been
analyzed by either a manual analyze or the autovacuum process. This
n_mod_since_analyze metric also gives an idea of whether the table has healthy
statistics for the optimizer or not. Once an autovacuum or a manual analyze completes on
this table, n_mod_since_analyze is reset to 0. This view also provides manual
vacuum/analyze metrics such as last_vacuum, last_analyze, and vacuum_count, which
are similar to the autovacuum metrics. Using this view, we can also get the metrics of live
tuples and dead tuples (unused storage) by using the n_live_tup and n_dead_tup fields.
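For example, a minimal sketch of checking the live and dead tuple counts of the test table
could look like this:
postgres=# SELECT relname, n_live_tup, n_dead_tup
           FROM pg_stat_user_tables WHERE relname = 'test';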
pg_statio_user_tables
This catalog view stores all the I/O-related metrics collected while scanning or updating
tables: that is, how many blocks a session read from PostgreSQL's cache and how many
blocks it read from the disk:
postgres=# SELECT relname, heap_blks_read "Disk blks", heap_blks_hit "Cache
blks" FROM pg_statio_user_tables WHERE relname = 'test';
-[ RECORD 1 ]---------
relname    | test
Disk blks  | 890
Cache blks | 1130524
From the preceding results, all the queries that were executed on the table test performed
890 disk block reads and 1130524 cache block hits. These metrics are helpful in defining
the cache hit and miss ratios of any specific table. This view also provides the cache hit
metrics of the table's indexes and its TOAST tables.
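As an illustration, a minimal sketch of computing the heap cache hit percentage for a single
table from these two counters could look like the following (the expression and its alias are
just an example, not part of the recipe):
postgres=# SELECT relname,
                  round(heap_blks_hit * 100.0 /
                        nullif(heap_blks_hit + heap_blks_read, 0), 2) AS cache_hit_pct
           FROM pg_statio_user_tables
           WHERE relname = 'test';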
How it works…
By default, the PostgreSQL instance enables the track_counts setting, which is a hint to
the statistics collector process. When this parameter is enabled, all the table and index
access related statistics are collected and stored in the pg_stat_tmp location, in files of the
form db_oid.stat or global.stat. Whenever we query the catalog views, they fetch the
information from these statistics files, which are frequently updated with the latest
statistical information by the statistics collector process. This statistics collector process
also depends on a few other parameters, such as track_functions (which collects
metrics about function calls) and track_io_timing (which collects information about I/O
operations).
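If needed, these collector-related settings can be changed without a restart; a minimal
sketch, assuming superuser privileges (note that track_io_timing can add some timing
overhead on certain platforms):
postgres=# ALTER SYSTEM SET track_io_timing TO 'on';
postgres=# ALTER SYSTEM SET track_functions TO 'all';
postgres=# SELECT pg_reload_conf();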
Please refer to the following PostgreSQL official documentation URL,
which gives much more information about the statistics collector
process: https://www.postgresql.org/docs/9.6/static/monitoring-stats.html.
Logging slow statements
In this recipe, we will be discussing how to track the slow running SQL statements.
Getting ready
In any database management system, checking the database logs is a regular activity for
developers/DBAs when finding the root cause of any database-related issue. PostgreSQL
provides various log settings, which control its logger process. By using proper logging
settings, we can control the amount of log messages and the format of the log content. In
many cases, having proper logging settings helps in debugging situations such as the ones
mentioned here.
Sometimes, it is not an easy job to figure out why a query is running very slowly, as there
could be many reasons behind the slowness of the execution. The reasons could be:
Concurrent locks on the same table
Volume of the data
Bad execution plan for the query and so on
As we discussed previously, if you have proper logging settings, PostgreSQL will log some
useful information such as query duration, process ID, client host details, number of temp
files it generated, and so on, which will help us in debugging the preceding scenarios.
How to do it…
Log in to the database as a superuser and then execute the following SQL statement:
postgres=# ALTER SYSTEM SET log_min_duration_statement to '5s';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
The preceding ALTER command changes the instance configuration to log all the SQL
queries where the execution duration was greater than or equal to 5 seconds.
It is always recommended to enable the following settings at the instance
level, which log additional query behavior, such as the lock waiting
duration and the amount of temporary files generated during query
execution:
ALTER SYSTEM SET log_lock_waits to 'on';
ALTER SYSTEM SET log_temp_files to '0';
We can also track the slow running queries from the pg_stat_activity dynamic view by
querying it as follows. This view will only give us the live information:
postgres=# SELECT now()-query_start Duration, query, pid FROM
pg_stat_activity WHERE pid!=pg_backend_pid();
    duration    |         query          | pid
----------------+------------------------+------
 00:00:25.19148 | SELECT pg_sleep(1000); | 2187
(1 row)
How it works…
As we discussed previously, the logger process is an independent PostgreSQL child
process, and its job is to receive the log content from each PostgreSQL backend and flush
it to log destinations such as stderr, syslog, csvlog, and eventlog.
When we execute the preceding ALTER statements, PostgreSQL writes all these settings into
a new file called postgresql.auto.conf. When we reload or restart the instance, these
parameters override the postgresql.conf parameters. To remove the entries from
postgresql.auto.conf, we have to use RESET in the ALTER SYSTEM command, either for
a single parameter or for all of them with RESET ALL.
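For example, a minimal sketch of removing these entries again, either one parameter at a
time or all at once, followed by a reload:
postgres=# ALTER SYSTEM RESET log_min_duration_statement;
postgres=# ALTER SYSTEM RESET ALL;
postgres=# SELECT pg_reload_conf();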
Refer to the following URL for more information about various logging
parameters: https://www.postgresql.org/docs/9.6/static/runtime-config-logging.html.
Determining disk usage
In this recipe, we will be discussing the PostgreSQL functionality that measures object size.
Getting ready
PostgreSQL follows a specific file storage layout, which stores data into multiple folders.
The common directories that we see in the PostgreSQL data directory are:
base: This folder contains the database subfolders and the subfolder will contain
actual relational files
global: This folder contains all the cluster level catalog information
pg_clog: This folder contains the committed transactions information, which
will be helpful during crash recovery
pg_xlog: This folder contains the ongoing transactions information, which will
be helpful during crash recovery
pg_tblspc: This folder contains the symbolic links to tablespaces
pg_log: This folder contains the database log files
PostgreSQL also has a few more additional directories, which also keep additional
information about the running instance.
Refer to this URL for more information about additional directories:
https://www.postgresql.org/docs/9.6/static/storage-file-layout.html.
To determine the whole PostgreSQL cluster size, including all the directories and
subdirectories, we have to use the native OS commands, such as du in Linux. And to
determine any specific object size, we have to use PostgreSQL's predefined SQL functions,
as mentioned in the following section.
How to do it…
Column size
PostgreSQL provides a utility function called pg_column_size(), which determines the
compressed disk usage of a single column value. The usage of this function is demonstrated
as follows:
postgres=# CREATE TABLE bigtable(c char(10240000));
CREATE TABLE
postgres=# INSERT INTO bigtable VALUES('a');
INSERT 0 1
postgres=# SELECT pg_column_size(c) FROM bigtable;
 pg_column_size
----------------
         117225
(1 row)
From the preceding result, the value a required 117225 bytes of storage in this column.
Relation size
PostgreSQL provides a utility function called pg_total_relation_size(), which
calculates the whole table size, including its indexes, TOAST table, and free space map
files:
postgres=# SELECT pg_size_pretty(pg_total_relation_size('test'));
 pg_size_pretty
----------------
 15 MB
(1 row)
In the preceding query, we used the pg_size_pretty() function, which converts the
number of bytes into a human-readable format. PostgreSQL also provides another utility
function called pg_relation_size(), which gives more flexibility in measuring the
relation size. This function takes a table/index/materialized view name as the first
argument, and a relation fork name as the second argument. The following are the relation
forks in PostgreSQL:
main: It determines the size of actual table data
fsm: It determines the size of free space map file of a relation
vm: It determines the size of visibility map file of a relation
init: It determines the size of the initialization file of an unlogged relation
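For example, a minimal sketch of measuring the individual forks of the test table could
look like this:
postgres=# SELECT pg_size_pretty(pg_relation_size('test', 'main')) AS main_fork,
                  pg_size_pretty(pg_relation_size('test', 'fsm'))  AS fsm_fork,
                  pg_size_pretty(pg_relation_size('test', 'vm'))   AS vm_fork;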
If you are using the psql client to connect to PostgreSQL, then we can also determine the
size of an object using the following meta commands:
\dt+ <Table name>
\di+ <Index name>
\dm+ <Materialized view name>
Database size
To determine the size of a database, we have to use the catalog function
pg_database_size() as a superuser or as a database owner as follows:
postgres=# SELECT pg_size_pretty(pg_database_size('postgres'));
 pg_size_pretty
----------------
 2860 MB
(1 row)
The psql client provides a meta command \l+ <Database name>, which also determines
the database size.
Tablespace size
To determine the size of a tablespace, we have to use the catalog function
pg_tablespace_size() as a superuser as follows:
postgres=# SELECT pg_size_pretty(pg_tablespace_size('test'));
 pg_size_pretty
----------------
 3568 kB
(1 row)
The psql client provides a meta command, \db+ <Tablespace name>, which also
determines the tablespace size.
How it works…
When we execute the preceding functions, PostgreSQL reads the sizes of the underlying
files on disk and calculates the disk usage of each object. For example, let's consider
pg_database_size('postgres'), which returns the database size as 2860 MB. Let's
calculate the same database's disk usage using native commands:
postgres=# SELECT oid FROM pg_database WHERE datname = 'postgres';
  oid
-------
 12641
(1 row)
$ cd ${PGDATA DIRECTORY}/base/12641/
$ du -h .
2.8G
From the preceding output, we got the same disk usage from the native and SQL
commands.
Preventing page corruption
Database corruption is a major concern, since in a worst-case scenario we may lose
valuable customer data. It is also a painful job to recover a corrupted disk or restore the
data from previous backups. Database corruption can happen due to the following reasons:
Bad hardware
Bug in database software
Bug in a kernel
Bad database configuration
Like other database management systems, PostgreSQL provides a few configuration
settings that help to detect page corruption early. If page corruption happens at the disk
level, PostgreSQL also has the ability to bypass the corrupted data and return the
uncorrupted data.
Getting ready
In this recipe, we will be discussing how to enable the checksums for each data block, and
how to retain the uncorrupted data when any corruption happens.
To enable the checksums for each data block, we have to initialize the PostgreSQL cluster
with the --data-checksums option using the initdb command. Enabling checksums
consumes some additional computational resources for every operation on data blocks.
Once the cluster is initialized with checksums enabled or disabled (the default), it is not
possible to change the setting later. So, we have to choose wisely, by doing proper
benchmarking with and without the checksum option. However, we always recommend
enabling checksums, which will identify serious data corruption at an early stage.
How to do it…
To enable the checksums, we have to use the following syntax while initializing the cluster:
$ initdb -D pgdata --data-checksums
The files belonging to this database system will be owned by user
"postgres".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".
Data page checksums are enabled.
creating directory pgdata ... ok
creating subdirectories ... ok
...
...
From the preceding initdb output, we can see that data page checksums are enabled at the
cluster level. We can also confirm this with pg_controldata on an existing cluster, as
follows:
$ pg_controldata pgdata | grep checksum
Data page checksum version:           1
From the preceding output, the checksum version is 1, which indicates that checksums are
enabled for this cluster. If the value is 0, then the cluster was not initialized with the
checksum option.
Now, let's see how to handle corrupted blocks. Once any table block is corrupted, by
default PostgreSQL will not allow us to query the table, and throws the following error
message:
postgres=# SELECT * FROM test;
ERROR: invalid page in block 0 of relation base/215681/429942
The preceding error message says that block 0 contains an invalid page, which is why the
query stopped instead of proceeding to scan the remaining blocks.
PostgreSQL has a session-level setting called zero_damaged_pages, which gives the query
process a hint to continue its scanning operation on all uncorrupted blocks by skipping the
corrupted ones. For example, let's consider the following case:
postgres=# SET client_min_messages TO WARNING;
SET
postgres=# SET zero_damaged_pages to on;
SET
postgres=# SELECT COUNT(*) FROM test;
WARNING: invalid page in block 0 of relation base/215681/429942; zeroing out page
 count
-------
   776
(1 row)
From the preceding example, we configured client_min_messages so as to receive
warning messages from the database server. Also, we enabled the zero_damaged_pages
setting, which skips all the damaged pages and returns only the undamaged pages. A data
block can become corrupted even when checksums are enabled at the cluster level.
Checksums do not prevent corruption; they only report it as early as possible. Let's see
how checksums react when corruption occurs:
postgres=# SELECT COUNT(*) FROM test;
WARNING: page verification failed, calculated checksum 47298 but expected
18241
ERROR: invalid page in block 0 of relation base/12411/16384
The preceding warning message says that checksum verification failed because the
expected checksum value was 18241 but the calculated value was 47298. To ignore these
checksum failures, we have to enable the ignore_checksum_failure parameter at the
session level, as shown here:
postgres=# SET ignore_checksum_failure TO true;
SET
postgres=# SET zero_damaged_pages TO on;
SET
postgres=# SELECT COUNT(*) FROM test;
WARNING: page verification failed, calculated checksum 47298 but expected
18241
WARNING: invalid page in block 24 of relation base/12411/16384; zeroing
out page
WARNING: page verification failed, calculated checksum 44553 but expected
56495
WARNING: page verification failed, calculated checksum 30647 but expected
42827
WARNING: invalid page in block 24 of relation base/12411/16384; zeroing
out page
 count
-------
  9481
(1 row)
Once we initialize the instance with checksums, then the checksums will
be calculated for the table's indexes and PostgreSQL catalog tables.
How it works…
Once we initialize the cluster with the checksums option, PostgreSQL calculates a checksum
for each page and writes it in the page header. As we discussed at the beginning, on each
operation the page goes through checksum validation, and it is returned if and only if the
validation is successful. Now, let's examine the checksum value of a page using the
pageinspect extension, as follows:
postgres=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
postgres=# SELECT * FROM page_header(get_raw_page('test', 0));
    lsn    | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+-------+-------+-------+---------+----------+---------+-----------
 0/1ACE940 |    -9900 |     0 |    28 |  8160 |    8192 |     8192 |       4 |         0
(1 row)
From the preceding results, we can see that the checksum of block 0 is -9900. This
checksum value will always be 0 for clusters that were not initialized with the checksums
option.
If the cluster is not initialized with the checksum option, then it is
recommended to run a cluster-level pg_dumpall at regular intervals,
which helps to identify page corruption at an early stage.
There's more…
As of now, we have only discussed user data corruption. However, corruption can also
occur at PostgreSQL's catalog level. Dealing with catalog corruption is a different story,
and it is not an easy job to recover from it. To deal with catalog corruption, we have a
utility tool called pg_catcheck, which identifies catalog corruption in the database.
Refer to the following URL for more information about the pg_catcheck
extension: https://github.com/EnterpriseDB/pg_catcheck/blob/master/README.md.
Like the zero_damaged_pages setting, PostgreSQL has the ignore_system_indexes
setting, which ignores the corrupted catalog indexes while we are dealing with corrupted
catalogs.
Routine reindexing
Like tables, an index also needs a special maintenance activity called reindex or rebuild.
The job of reindex is to build a fresh index, replacing the existing index pages.
Getting ready
When we perform DELETE/UPDATE operations on a table, PostgreSQL eventually removes
or updates the corresponding entries in the table's indexes. Once entries are removed from
the index pages, there is a chance of increased leaf page fragmentation in btree indexes,
which leads to more I/O while performing index scan operations. This is because in a btree
index all the index and row entries are at the leaf level, while the root and branch nodes
only help the index scan reach its index entries. In general, if the leaf page fragmentation of
an index is more than 30%, then it is recommended to do a REINDEX operation on it. For
non-btree indexes, it is not possible to identify the leaf page fragmentation in this way, as
each index type has its own implementation.
Like a table, an index can also get bloated due to multiple DELETE/UPDATE operations. To
reclaim this dead space at the index level, we have to execute the VACUUM command on the
corresponding table; otherwise, an autovacuum process takes care of it. If the index is too
bloated, then it is also recommended to do a REINDEX operation on it. To find the leaf
fragmentation and bloat of an index, we have to use an extension called pgstattuple.
How to do it…
Let's create the pgstattuple extension and examine a sample index as follows:
postgres=# CREATE EXTENSION pgstattuple ;
CREATE EXTENSION
postgres=# SELECT * FROM pgstatindex('test_idx');
-[ RECORD 1 ]------+---------
version            | 2
tree_level         | 2
index_size         | 22478848
root_block_no      | 412
internal_pages     | 10
leaf_pages         | 2733
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 90.06
leaf_fragmentation | 0
From the preceding results, the test_idx index is healthy, as its leaf fragmentation is 0,
and also it does not have any deleted pages in it. Now, let's see the performance of the
index scan by executing a sample query:
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE t BETWEEN 999 AND 9999;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Index Only Scan using test_idx on test  (cost=0.29..331.17 rows=9144
 width=4) (actual time=0.063..3.765 rows=9001 loops=1)
   Index Cond: ((t >= 999) AND (t <= 9999))
   Heap Fetches: 9001
 Planning time: 0.585 ms
 Execution time: 4.124 ms
(5 rows)

Time: 5.804 ms
Now, let's perform some update operations on the test table, and observe the index
fragmentation ratio:
postgres=# UPDATE test SET t=1000 WHERE t%3=0;
UPDATE 33333
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE t BETWEEN 999 AND 9999;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Index Only Scan using test_idx on test  (cost=0.29..448.85 rows=12178
 width=4) (actual time=0.095..21.316 rows=39333 loops=1)
   Index Cond: ((t >= 999) AND (t <= 9999))
   Heap Fetches: 42334
 Planning time: 0.084 ms
 Execution time: 22.682 ms
(5 rows)

Time: 30.249 ms
postgres=# SELECT * FROM pgstatindex('test_idx');
-[ RECORD 1 ]------+--------
version            | 2
tree_level         | 1
index_size         | 3219456
root_block_no      | 3
internal_pages     | 0
leaf_pages         | 392
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 62.88
leaf_fragmentation | 26.79
From the preceding stats, the same query now takes about 30 ms, since the update
increased the number of rows whose value is between 999 and 9999, and the corresponding
index is fragmented up to 26.79%.
Now, let's REINDEX the index and observe its fragmentation ratio along with the query
response time:
postgres=# REINDEX INDEX test_idx;
REINDEX
postgres=# SELECT * FROM pgstatindex('test_idx');
-[ RECORD 1 ]------+--------
version            | 2
tree_level         | 1
index_size         | 2252800
root_block_no      | 3
internal_pages     | 0
leaf_pages         | 274
empty_pages        | 0
deleted_pages      | 0
avg_leaf_density   | 89.33
leaf_fragmentation | 0
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE t BETWEEN 999 AND 9999;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Index Only Scan using test_idx on test  (cost=0.29..1222.95 rows=39333
 width=4) (actual time=0.035..8.646 rows=39333 loops=1)
   Index Cond: ((t >= 999) AND (t <= 9999))
   Heap Fetches: 0
 Planning time: 0.210 ms
 Execution time: 10.024 ms
(5 rows)

Time: 10.792 ms
From the preceding results, after performing the REINDEX operation on the index, the query
response time decreased to 10ms from 30ms, which is a sign of good query performance.
The pgstatindex() function only supports btree index types.
How it works…
When we initiate REINDEX, it acquires an exclusive lock on the index, blocks write
operations on the table, and recreates a brand new index. Until this operation is complete,
the affected incoming queries on the same table will be blocked. If the index is huge, then
the time needed for the index creation increases accordingly. This can cause a serious
application outage, as those queries will be in a waiting state. As an alternative to the
REINDEX operation, we can create a new index concurrently on the same columns, drop the
fragmented index concurrently, and then use the ALTER INDEX ... RENAME command to
rename the new index to the old index name. However, we will be discussing more about
this in further chapters.
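As an illustration, a minimal sketch of that concurrent rebuild, assuming test_idx is a
plain (non-constraint) index on test(t) and test_idx_new is just a temporary name
chosen for this example:
postgres=# CREATE INDEX CONCURRENTLY test_idx_new ON test (t);
postgres=# DROP INDEX CONCURRENTLY test_idx;
postgres=# ALTER INDEX test_idx_new RENAME TO test_idx;
Note that CREATE INDEX CONCURRENTLY cannot be run inside a transaction block.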
There's more…
We can also perform the REINDEX operation at the system, database, schema, or table level.
REINDEX SYSTEM recreates the system catalog indexes; REINDEX DATABASE regenerates all
the indexes of all tables in the database; REINDEX SCHEMA regenerates all the indexes of all
the tables in the given schema; and REINDEX TABLE regenerates all the indexes of the
specified table.
Refer to the following URL for more information about index
maintenance operations: https://wiki.postgresql.org/wiki/Index_Maintenance.
Generating planner statistics
In this recipe, we will be discussing how we can use PostgreSQL to generate statistics.
Getting ready
Database statistical information plays a crucial role in deciding the proper execution plan
for the given SQL statement. PostgreSQL provides a utility command called ANALYZE,
which collects statistics from tables and makes them available to the planner. PostgreSQL
also provides another utility background process called autovacuum, which does a similar
job to analyze. All these collected statistics will be stored into the PostgreSQL catalog tables.
How to do it…
Now, for demonstration let's create a test table and populate a few entries in it:
postgres=# CREATE TABLE test(t INT);
CREATE TABLE
postgres=# SELECT COUNT(*) FROM pg_stats WHERE tablename = 'test';
 count
-------
     0
(1 row)
postgres=# INSERT INTO test VALUES(generate_series(1, 1000));
INSERT 0 1000
postgres=# ANALYZE test;
ANALYZE
postgres=# SELECT COUNT(*) FROM pg_stats WHERE tablename = 'test';
 count
-------
     1
(1 row)
After inserting a few records and executing the ANALYZE command, we can see that
pg_stats now reflects the test table information. If we do not execute the ANALYZE
command on the test table manually, autoanalyze will update pg_stats with the test table
information when an autovacuum process runs on that table.
How it works…
The ANALYZE command scans the relation and populates each column's metrics in the
pg_statistic catalog table, including the following:
null_frac
The fraction of null values in the column. If the column does not have any null values,
then this value is set to 0.
avg_width
The average width, in bytes, of the column's values.
n_distinct
This value defines the number of distinct values that exist in the column. If this value is -1,
then the column does not have any duplicate values in it. This value will always be set to -1
for primary key columns.
most_common_vals
This is an array that stores the most common (most repeated) values in the column.
most_common_freqs
This is an array that stores the frequencies of the most common values.
histogram_bounds
This array stores boundary values that divide the column's values into groups of
approximately equal population.
correlation
This defines the correlation between the physical row ordering on disk and the logical
ordering of the column's values.
most_common_elems
This defines the most repeated elements of non-scalar data types, such as arrays and
vectors.
most_common_elem_freqs
This defines the frequencies of the most common elements of non-scalar data types.
When ANALYZE or autoanalyze runs on a table, it collects the preceding metrics and keeps
them in the pg_statistic catalog table. To control how many statistics are collected and
stored in these columns, we have to use the default_statistics_target parameter, or
we can also use the ALTER TABLE ... ALTER COLUMN ... SET STATISTICS syntax.
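For example, a minimal sketch of raising the per-column statistics target for the test table
and refreshing its statistics:
postgres=# ALTER TABLE test ALTER COLUMN t SET STATISTICS 500;
postgres=# ANALYZE test;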
Refer to the following URL for more information about statistical
information: https://www.postgresql.org/docs/9.5/static/view-pg-stats.html.
Tuning with background writer statistics
In this recipe, we will be discussing how PostgreSQL's background writer (bgwriter)
process plays an important role in fine tuning PostgreSQL memory and checkpoint related
information.
Getting ready
The bgwriter process is a mandatory background process in a PostgreSQL instance. Its
main responsibility is to flush buffers from memory to disk. Besides this, it keeps the
shared buffers ready by flushing the least recently used dirty buffers from memory to disk,
based on the bgwriter_lru_maxpages and bgwriter_lru_multiplier parameter settings.
In PostgreSQL, checkpoint activity is a heavy process, as it flushes many dirty buffers to
the physical disk, which causes significant I/O utilization.
When an explicit or implicit checkpoint is invoked, the bgwriter process does most of the
work. The pg_stat_bgwriter catalog view shows the cumulative bgwriter and checkpoint
statistics. If we store this statistical information into a separate table using a scheduled
process, then we can judge whether the configured memory and checkpoint parameters are
well tuned. Otherwise, we have to reset the stats before analyzing, let the instance run
under the required load, and then evaluate them, as shown next.
How to do it…
Let's reset the pg_stat_bgwriter statistics and run a normal pgbench test as follows:
postgres=# SELECT pg_stat_reset_shared('bgwriter');
 pg_stat_reset_shared
----------------------

(1 row)
$ cat /tmp/test.sql
INSERT INTO test VALUES(generate_series(1, 10000));
SELECT * FROM test ORDER BY t DESC LIMIT 1000;
DELETE FROM test WHERE t%7=0;
UPDATE test SET t=100;
$ bin/pgbench -c 90 -T 1000 -f /tmp/test.sql
While the preceding pgbench test is running, we may get many deadlocks due to
concurrent row update/delete operations. Ignore those errors and let the pgbench run
complete. Once the pgbench test completes, let's evaluate the pg_stat_bgwriter view
results, as follows:
postgres=# SELECT * FROM pg_stat_bgwriter;
-[ RECORD 1 ]---------+----------------------------------
checkpoints_timed     | 19
checkpoints_req       | 445
checkpoint_write_time | 551012
checkpoint_sync_time  | 48577
buffers_checkpoint    | 1974899
buffers_clean         | 150281
maxwritten_clean      | 1470
buffers_backend       | 1883063
buffers_backend_fsync | 0
buffers_alloc         | 1880980
stats_reset           | 2016-09-20 23:58:12.183039+05:30
[ 201 ]
Database Monitoring and Performance
How it works…
Now, let's evaluate each field from the preceding results.
checkpoints_timed
This defines the number of checkpoints that occurred due to checkpoint_timeout. From
the preceding statistics, an implicit checkpoint occurred 19 times. Now, let's check what the
current checkpoint_timeout setting is:
postgres=# SHOW checkpoint_timeout;
 checkpoint_timeout
--------------------
 1min
(1 row)
It is not recommended to set this value very low. This is because if a checkpoint is invoked
every minute, then it uses a large share of the I/O to sync the dirty buffers to disk, which
significantly decreases the response time of the running queries. To tune this parameter, we
have to consider the current shared_buffers setting and decide a proper interval for the
timed checkpoints. If the current instance has shared_buffers set to 10 GB and the amount
of dirty buffers generated in 1 minute is 5 GB, then every minute the checkpoint process
will flush 5 GB to disk, which takes considerable I/O resources. If we configure this
parameter properly, then we can expect the checkpoints_req value to be 0 or a minimal
value.
checkpoints_req
This defines the number of checkpoints that occurred due to the max_wal_size limit. That
is, if the size of the WAL generated since the last checkpoint exceeds max_wal_size, then
PostgreSQL executes an internal checkpoint. This is similar to the checkpoint_segments
parameter, which we had in previous versions of PostgreSQL.
From the preceding statistics, such implicit checkpoints occurred 445 times. Now, let's see
the max_wal_size setting:
postgres=# SHOW max_wal_size;
 max_wal_size
--------------
 192MB
(1 row)
As per this configuration setting, an implicit checkpoint is executed after roughly 192 MB
of WAL files have been generated. In some cases, even if we configure this value with a
high limit such as 10 GB, bulk INSERT/UPDATE/DELETE operations that generate tens of
gigabytes of WAL in less time than the checkpoint_timeout interval will trigger implicit
checkpoint operations multiple times, which automatically brings down the server
performance. If we could avoid these kinds of checkpoints during the bulk operations, then
the server would be more stable. If we tune this max_wal_size parameter properly, then we
can get the checkpoints_req value as either 0 or a minimal value.
checkpoint_write_time
This defines the amount of time, in milliseconds, it took to write the dirty buffers to disk
using the write() system call. From the preceding statistics, this value is 551012. That is,
all the timed and requested checkpoints took 551,012 milliseconds to write 1974899
(buffers_checkpoint) blocks to disk. Now, let's verify the block_size setting value:
postgres=# SHOW block_size;
 block_size
------------
 8192
(1 row)
So, to write 1974899 * 8192 = 16178372608 bytes to disk, the checkpoint operations needed
about 551 seconds. That is, on average, the write speed from shared buffers to disk is
roughly 28 MB per second. The amount of time required to write the buffers to disk is
directly proportional to the number of dirty blocks and the I/O wait.
checkpoint_sync_time
This defines the amount of time, in milliseconds, it took to sync the OS-cached data to disk
using the fsync() system call. From the preceding statistics, this value is 48577. That is, all
checkpoints together took about 48 seconds to flush the cached data to disk. Once a
checkpoint completes its write() system calls, it initiates the fsync() calls to flush the OS
cache to disk.
buffers_checkpoint
This defines the number of buffers that the checkpoint process flushes to disk. From the
preceding statistics, this value is 1974899, that is, 16178372608 bytes ~ 15 GB. Out of this
15 GB, let's estimate how much data was flushed due to checkpoints_timed (19) and how
much due to checkpoints_req (445). Proportionally, about 5% of 15 GB ~ 0.75 GB was
flushed to disk due to timed checkpoints, whereas about 95% of 15 GB ~ 14.25 GB was
flushed due to requested checkpoints.
As we discussed previously, we want the cluster to perform its checkpoints based on the
timeout rather than on the max_wal_size limit. From the preceding calculation, we can
confirm that most of the data is being flushed due to max_wal_size rather than
checkpoint_timeout, so we have to consider fine-tuning these two parameters.
buffers_clean
This defines the number of buffers that were cleaned by the bgwriter process by flushing
the least recently used dirty buffers. The number of blocks flushed to disk is limited by the
bgwriter_lru_maxpages and bgwriter_lru_multiplier settings. bgwriter_lru_maxpages
is the maximum number of pages the bgwriter process will write per round, whereas
bgwriter_lru_multiplier helps in calculating the number of pages to flush, based on the
average number of buffers recently needed by the backends.
From the preceding statistics, this value is 150281. That is, across its invocations, the
bgwriter process flushed 150281 * 8192 ~ 1231101952 bytes to disk. Now, let's see the
bgwriter_delay setting and evaluate this value:
postgres=# SHOW bgwriter_delay ;
 bgwriter_delay
----------------
 10s
(1 row)
As we configured the delay as 10 seconds, the bgwriter process wakes up at this interval
and, in total, flushed 1231101952 bytes ~ 1 GB of dirty buffers from memory to disk.
The standalone checkpoint process flushed about 15 GB of data, whereas in the
background, as a parallel task, the bgwriter process flushed only about 1 GB of dirty
buffers to disk. So, to better distribute the dirty buffers between the checkpoint and
bgwriter processes, we have to make the bgwriter-related settings more aggressive. That
is, we have to reduce the bgwriter_delay setting, and increase the limit on the maximum
number of pages to flush using the bgwriter_lru_maxpages and
bgwriter_lru_multiplier parameters. Now, let's check the current settings of these
parameters:
postgres=# SHOW bgwriter_lru_maxpages;
 bgwriter_lru_maxpages
-----------------------
 100
(1 row)

postgres=# SHOW bgwriter_lru_multiplier;
 bgwriter_lru_multiplier
-------------------------
 2
(1 row)
The reason to distribute the dirty buffers between the checkpoint and bgwriter processes
is to avoid the user's backend connections having to flush dirty buffers themselves. When a
backend connection needs dirty buffers, it has to flush those buffers to disk first, before
doing any operation on them, and in that case we may get a poor query response time. To
avoid this kind of situation, we rely on bgwriter to flush all the possible dirty buffers to
disk before any user connection needs them.
maxwritten_clean
This defines the number of times the bgwriter process stopped because it reached the
bgwriter_lru_maxpages limit while flushing dirty buffers to disk. From the preceding
statistics, this value is 1470. That is, the bgwriter process stopped 1470 times because it
reached its maximum page limit. This value helps in gauging the workload of the bgwriter
process. If we see that this count is increasing rapidly, then we have to consider fine-tuning
the background writer's LRU-related parameters. And if we see that this value stays
constant, then we have to check whether bgwriter is causing any I/O load spikes during its
interval; if it is, then we have to reduce the values of the background writer's LRU-related
parameters.
buffers_backend
This defines the number of buffers that were flushed to disk by individual backend
connections. That is, when a connection hits a dirty buffer it needs, that process has to sync
the dirty buffer to disk before doing any operations on it. This syncing operation may cause
a delay while processing the given query. From the preceding statistics, this value is
1883063. That is, 1883063 * 8192 ~ 15426052096 bytes, which is approximately 14 GB. As
per the pgbench test case, and given the checkpoint settings, the bgwriter delay, and its
LRU parameter settings, the pgbench user connection processes flushed about 14 GB of data
to disk. Now, let's verify the shared_buffers setting that we have on this instance:
postgres=# SHOW shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)
The test case we ran with this shared_buffers setting is an inefficient one. This is because
all the connections forked from pgbench were busy competing for the limited
shared_buffers, besides flushing the dirty buffers to disk. If we ran another pgbench test
case with a low number of connections and a low number of DML operations, then we
might get this buffers_backend value as either zero or minimal.
In general, it is recommended to set the shared_buffers value to about 25% of the RAM,
so that it can hold the working set of all the connections' transactions. Based on the
shared_buffers setting, we also need to tune the checkpoint- and bgwriter-related
parameters to reduce buffers_backend, which significantly improves the query response
time.
buffers_backend_fsync
This defines the number of times a backend connection had to execute its own fsync() call
while flushing buffers to disk. In general, when a user connection flushes buffers, it pushes
an fsync() request into the background writer's request queue. When this queue is full, it
becomes the user connection's responsibility to execute fsync() on the updated files itself.
buffers_alloc
This defines the number of buffers allocated in shared buffers to load data blocks from disk
when a cache miss happens. From the preceding statistics, this value is 1880980, which is
approximately equal to buffers_backend. As we discussed previously, due to the lack of
enough shared_buffers, the backend connections were competing for the resources by
flushing the dirty buffers to disk.
stats_reset
This defines the beginning timestamp for all of these cumulative statistics.
There's more…
The preceding pgbench test case was run after resetting the bgwriter statistics. It would be
more useful if we could store snapshots of these statistics in another table at a regular
interval. Once we have this snapshot information, we can identify the I/O usage and the
behavior of bgwriter and checkpoint by comparing the delta between these snapshots, as
discussed previously.
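A minimal sketch of such a snapshot table (the table name bgwriter_snapshot and the
capture interval are just examples chosen for illustration); the INSERT can then be
scheduled with cron or any job scheduler:
postgres=# CREATE TABLE bgwriter_snapshot AS
           SELECT now() AS captured_at, * FROM pg_stat_bgwriter;
postgres=# -- run periodically, for example every 15 minutes
postgres=# INSERT INTO bgwriter_snapshot
           SELECT now(), * FROM pg_stat_bgwriter;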
PostgreSQL provides a logging parameter called log_checkpoints, which will log
detailed information about the checkpoint into the log files as follows:
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 14168 buffers (86.5%); 0 transaction log
file(s) added, 0 removed, 12 recycled; write=1.669 s, sync=0.221 s,
total=34.881 s; sync files=6, longest=0.116 s, average=0.036 s;
distance=200615 kB, estimate=200615 kB
In the preceding log message, immediate force wait indicates that it is not a timed
checkpoint. The log message also contains some helpful information, such as the total time
it took to complete this checkpoint, how many buffers it flushed to disk, and so on, which
helps us to understand the I/O utilization when a checkpoint occurs.
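If log_checkpoints is not already enabled on the instance, a minimal sketch of switching it
on without a restart:
postgres=# ALTER SYSTEM SET log_checkpoints TO 'on';
postgres=# SELECT pg_reload_conf();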
9
Vacuum Internals
In this chapter, we will cover the following recipes:
Dealing with bloating tables and indexes
Vacuuming and autovacuuming
Freezing and transaction ID wraparound
Monitoring vacuum progress
Controlling bloat using transaction age
Introduction
PostgreSQL provides MVCC (Multi Version Concurrency Control) to achieve one of the
ACID properties, called isolation. Transactional isolation here means achieving the
maximum possible concurrency among concurrent transactions. PostgreSQL has
implemented MVCC to achieve this concurrency by creating a version of each tuple
whenever the tuple receives a modification. For example, let's say that a tuple received n
concurrent modifications. PostgreSQL will keep n versions of the same tuple, and it will
make the last committed modified tuple visible to further transactions. Having both
visible and non-visible tuples leads to more disk space utilization. However, PostgreSQL
provides a few ways to reuse the space of the non-visible tuples for further write operations.
In this chapter, we will be discussing what PostgreSQL offers to clean up the non-visible
tuples, and how these mechanisms work internally.
Dealing with bloating tables and indexes
In this recipe, we will be discussing how to deal with bloats using PostgreSQL's garbage
collector processes.
Getting ready
We all know that PostgreSQL's storage implementation is based on MVCC. As a result of
MVCC, PostgreSQL needs to reclaim the dead space/bloat from the physical storage, using
its garbage collector processes called vacuum and autovacuum. If we do not reclaim these
dead rows, then the table or index will keep growing until the disk space gets full. In a
worst-case scenario, a table with a single live row can still cause a disk space outage due to
bloat. We will discuss these garbage collector processes in more detail in the next recipe,
but for now let's find out which tables or indexes have more dead space.
How to do it…
To identify the bloat of an object, we have to use the pgstattuple extension, otherwise we
have to follow the approach that is mentioned
at: http://www.databasesoup.com/2014/10/new-table-bloat-query.html:
postgres=# CREATE EXTENSION pgstattuple;
CREATE EXTENSION
postgres=# CREATE TABLE test(t INT);
CREATE TABLE
postgres=# INSERT INTO test VALUES(generate_series(1, 100000));
INSERT 0 100000
postgres=# SELECT * FROM pgstattuple('test');
-[ RECORD 1 ]------+--------
table_len          | 3629056
tuple_count        | 100000
tuple_len          | 2800000
tuple_percent      | 77.16
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 16652
free_percent       | 0.46
From the preceding output from pgstattuple(), we do not have any dead tuples in the
test table, and we also do not have a huge amount of free space in it. Now, let's delete a few
rows from the table test, and see the result of pgstattuple():
postgres=# DELETE FROM test WHERE t%7=0;
DELETE 14285
postgres=# SELECT * FROM pgstattuple('test');
-[ RECORD 1 ]------+--------
table_len          | 3629056
tuple_count        | 85715
tuple_len          | 2400020
tuple_percent      | 66.13
dead_tuple_count   | 14285
dead_tuple_len     | 399980
dead_tuple_percent | 11.02
free_space         | 16652
free_percent       | 0.46
From the preceding results, pgstattuple() returns the number of dead tuples and the
percentage of dead tuples in the table test. To identify the bloat of an index, we have to use
another function called pgstatindex(), as follows:
postgres=# SELECT * FROM pgstatindex('test_idx');
-[ RECORD 1 ]------+----------
version            | 2
tree_level         | 2
index_size         | 105332736
root_block_no      | 412
internal_pages     | 40
leaf_pages         | 12804
empty_pages        | 0
deleted_pages      | 13
avg_leaf_density   | 9.84
leaf_fragmentation | 21.42
From the preceding results, the index has a deleted pages count of 13, which indicates index
bloat. Once we identify the set of tables or indexes that need to be vacuumed, we have to
initiate a manual VACUUM operation on them. We could also take advantage of the
autovacuum process, which automatically runs VACUUM on the required tables.
How it works…
The pgstattuple/pgstatindex functions require a complete table/index scan to calculate
accurate values. If the table or index is huge, the function takes a long time to execute and
generates significant I/O, so try to avoid executing these functions on any table during
production hours. There is no harm in using the other approach, which estimates the bloat
based on pg_catalog views.
In PostgreSQL 9.5, the pgstattuple extension introduced a new function called
pgstattuple_approx(), which is an alternative to pgstattuple() and returns approximate
bloat figures along with estimated live tuple information. This function does not perform a
full table scan like pgstattuple(), so it returns its results faster than the pgstattuple()
function.
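For example, the approximate variant can be run against the same test table; only the query
is shown here, since the returned figures are estimates that depend on the current state of
the visibility map:
postgres=# SELECT table_len, approx_tuple_count, dead_tuple_count, approx_free_space
FROM pgstattuple_approx('test');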
For more information about this function, refer to: https://www.postgresql.org/docs/9.6/static/pgstattuple.html.
Vacuum and autovacuum
In this recipe, we will be discussing the importance of vacuum and autovacuum in
achieving good PostgreSQL performance.
Getting ready
As mentioned previously, PostgreSQL is based on MVCC. As a net result, non-visible tuples
sit alongside visible tuples and occupy the underlying disk storage. These non-visible tuples
are of no further use, so if we can reclaim or reuse their disk storage, disk utilization
becomes more effective.
How to do it…
Let's experiment with the usage of VACUUM by creating a sample table and executing a few
SQL statements that generate non-visible tuples or dead tuples.
Connect to your database using psql as a super user and then execute the following
command:
$ psql -h localhost -U postgres
postgres=# CREATE EXTENSION pg_freespacemap;
CREATE EXTENSION
Now create a test table as follows:
postgres=# CREATE TABLE test(t INT);
CREATE TABLE
For demonstration of the VACUUM process, let's turn off autovacuum on this table as follows:
postgres=# ALTER TABLE test SET(autovacuum_enabled=off);
ALTER TABLE
Now let's insert 1,000 rows into the table, and observe how many pages it occupies in the
underlying disk storage:
postgres=# INSERT INTO test VALUES(generate_series(1, 1000));
INSERT 0 1000
postgres=# ANALYZE test;
ANALYZE
postgres=# SELECT relpages FROM pg_class WHERE relname='test';
 relpages
----------
        5
(1 row)
From the preceding SQL query, PostgreSQL used five relpages to store 1000 integer
records. Now, let's verify the free space available in the relation test using the
pg_freespace() function:
postgres=# SELECT sum(avail) freespace FROM pg_freespace('test');
 freespace
-----------
         0
(1 row)
Now, let's update all the 1,000 records and collect the relpages and freespace map
information again:
postgres=# UPDATE test SET t=1000;
UPDATE 1000
postgres=# ANALYZE test;
ANALYZE
postgres=# SELECT relpages FROM pg_class WHERE relname='test';
 relpages
----------
        9
(1 row)
postgres=# SELECT sum(avail) freespace FROM pg_freespace('test');
 freespace
-----------
         0
(1 row)
From the preceding results, the update of 1,000 tuples added four more pages. That is, the
table grew by roughly 80% on disk because we updated the whole table. We can also obtain
the number of non-visible or dead tuples of the table from the pg_stat_user_tables view:
postgres=# SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname='test';
 n_dead_tup
------------
       1000
(1 row)
Now let's execute VACUUM and then check the amount of free space available for the relation
test:
postgres=# VACUUM test;
VACUUM
postgres=# SELECT sum(avail) freespace FROM pg_freespace('test');
 freespace
-----------
     33248
(1 row)
From the preceding results, once we execute the VACUUM command, it makes some free
space available by logically cleaning the dead tuples. Now, let's insert an additional
1,000 records into the table, and then look at the relpages count and the free space information:
postgres=# INSERT INTO test VALUES(generate_series(1, 1000));
INSERT 0 1000
postgres=# ANALYZE test;
ANALYZE
postgres=# SELECT relpages FROM pg_class WHERE relname='test';
 relpages
----------
        9
(1 row)
postgres=# SELECT sum(avail) freespace FROM pg_freespace('test');
 freespace
-----------
      4320
(1 row)
From the preceding results, the relpages count did not increase even after inserting an
additional 1,000 records into the table, since we ran the VACUUM command before the insert
operation. We can also clearly see that the free space reclaimed by the previous VACUUM
command has been reused properly.
Now, let's repeat the same actions without running VACUUM before the insert operation, and
look at the disk usage:
postgres=# TRUNCATE test;
TRUNCATE TABLE
postgres=# ANALYZE test;
ANALYZE
postgres=# INSERT INTO test VALUES(generate_series(1, 1000));
INSERT 0 1000
postgres=# UPDATE test SET t=100;
UPDATE 1000
postgres=# INSERT INTO test VALUES(generate_series(1, 1000));
INSERT 0 1000
postgres=# ANALYZE test;
ANALYZE
postgres=# SELECT relpages FROM pg_class WHERE relname='test';
 relpages
----------
       14
(1 row)
postgres=# SELECT sum(avail) freespace FROM pg_freespace('test');
 freespace
-----------
         0
(1 row)
From the preceding results, we can see that an additional insert operation after updating the
table increased the table size by roughly 50%. This is because the table had no free space
map entries available, so PostgreSQL extended the table storage. In the previous example,
the table occupied nine pages after the insert operation completed, whereas in this case it
takes 14 pages.
From all the preceding experiments, we can confirm that running the VACUUM operation on a
table allows the storage of its dead tuples to be reused by further incoming transactions.
However, running the VACUUM command explicitly after every insert/update/delete operation
is not advisable, because VACUUM is a fairly heavy process that scans the tuples in a table and
increases I/O utilization. It is also not advisable to completely stop VACUUM on any table,
since dead tuples will then accumulate on disk, which leads to ineffective disk utilization.
Instead, we should run the VACUUM process periodically, such as once a day or once a week,
by scheduling the command in either crontab or pgBucket.
You can find more information about pgBucket at: https://bitbucket.org/dineshopenscg/pgbucket.
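As a sketch, a nightly database-wide vacuum could be scheduled from the postgres user's
crontab as shown here; the schedule, binary path, log file, and the -z (analyze) option are
only examples and should be adapted to your environment:
# m h dom mon dow command
0 2 * * * /usr/pgsql-9.6/bin/vacuumdb -z -d postgres >> /tmp/vacuumdb.log 2>&1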
Autovacuum background workers, introduced in PostgreSQL 8.1 and enabled by default
since 8.3, automatically run VACUUM in the background and clean the dead tuple storage from
the database. We can consider the autovacuum process a lightweight vacuum process that
keeps running in the background at the interval defined by the autovacuum_naptime
setting.
How it works…
When we execute the VACUUM command on any table, it requests from the postgres instance
the list of SQL queries that are running at that moment. Once the vacuum process receives
that list, it identifies the dead tuples that are no longer visible to any of those running SQL
queries. It uses the configured maintenance_work_mem memory area to hold the dead
tuple information while scanning the relation. Once the vacuum process has identified all
the dead tuples, it generates an FSM (Free Space Map) for that relation. This FSM is then
consulted to reuse the dead tuples' disk storage for further transactions.
Another responsibility of the vacuum process is to update the transaction status of each row
in the table by validating pg_clog and pg_subtrans. In PostgreSQL, every row header
contains xmin and xmax values. These values are associated with transaction statuses,
such as In progress, Committed, or Aborted, determined by referring to the status bits in
pg_clog.
xmin
This stores the transaction ID of the inserting transaction, that is, the transaction that created
this tuple version through an insert or update operation.
xmax
This stores the transaction ID of the deleting transaction, or of the updating transaction that
produced a newer version of the tuple. By default, the value of xmax is 0.
VACUUM is responsible not only for cleaning the dead tuples and updating the status flags,
but also for cleaning up tuples that were created by aborted transactions.
To initiate the vacuum operation at the database level, we just have to execute a plain VACUUM
command without any further arguments. By default, the VACUUM command runs on the
entire database by vacuuming its tables one by one. PostgreSQL also provides a utility
command called vacuumdb, which executes the same VACUUM command by connecting
to the given data source. From PostgreSQL 9.5, the vacuumdb utility supports parallel
vacuum jobs, which execute vacuum operations concurrently on multiple relations rather
than in a sequential manner.
For example, if we want to run four vacuum processes concurrently, we have to execute the
vacuumdb command as follows:
$ vacuumdb -j 4
vacuumdb: vacuuming database "postgres"
Autovacuum
Autovacuum is a utility background worker that invokes vacuum automatically for each
database every autovacuum_naptime. We can also configure how many background
worker processes we want by using the autovacuum_max_workers setting. Once an
autovacuum worker process starts, it tries to allocate the maintenance_work_mem
memory area to do its vacuum operation. So it is advisable to revalidate the
maintenance_work_mem and autovacuum_max_workers parameters while fine-tuning the
PostgreSQL settings. We can also turn off the autovacuum job by setting
autovacuum=off in the postgresql.conf file, which needs an instance reload.
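For example, the current settings can be inspected with SHOW, and autovacuum_max_workers
can be raised with ALTER SYSTEM; the value 6 below is only an example, and this particular
parameter takes effect only after a server restart, whereas autovacuum and
autovacuum_naptime only need a reload:
postgres=# SHOW autovacuum_naptime;
postgres=# SHOW autovacuum_max_workers;
postgres=# ALTER SYSTEM SET autovacuum_max_workers = 6;
ALTER SYSTEM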
Even if we have turned off the autovacuum process at the cluster level, autovacuum will still
be invoked in one special case, namely when it needs to prevent a transaction ID wraparound
issue at the database level. We will discuss transaction ID wraparound in a later recipe; for
now, let us see how the autovacuum process works.
When autovacuum is invoked for a database, it scans the tables and generates the FSM for
each of them, just like the vacuum process. An additional responsibility of the autovacuum
process is that it also updates the PostgreSQL catalog tables with the latest information. That
is, internally autovacuum runs an ANALYZE operation on each table and updates the
statistics as well. By default, the autovacuum process is not configured to be aggressive in
reclaiming dead tuples: autovacuum only vacuums a table when a certain percentage of that
table consists of dead tuples. PostgreSQL also provides table-level, database-level, and
cluster-level autovacuum settings, which help us drive autovacuum as needed.
In general, on every autovacuum iteration, the list of tables that need to be vacuumed is
collected from the pg_class catalog table, using the configured autovacuum settings. If a
table's number of dead tuples crosses the threshold given by the following calculation, then
autovacuum triggers on that table and updates its activity information in the
pg_stat_user_tables catalog:
autovacuum_vacuum_scale_factor * total number of tuples +
autovacuum_vacuum_threshold
We can get the total number of tuples of any relation from the reltuples field of
pg_class, and we can get the number of dead tuples of any relation from the n_dead_tup
field of pg_stat_user_tables.
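As a sketch, the following query applies that formula using the cluster-wide settings to list
tables by their dead tuple counts against the computed threshold; it ignores any per-table
autovacuum overrides:
postgres=# SELECT s.relname, s.n_dead_tup,
round(current_setting('autovacuum_vacuum_scale_factor')::numeric * c.reltuples::numeric
      + current_setting('autovacuum_vacuum_threshold')::numeric) AS av_threshold
FROM pg_stat_user_tables s JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > 0
ORDER BY s.n_dead_tup DESC;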
Freezing and transaction ID wraparound
In this recipe, we will be discussing a few other aspects of the VACUUM process.
Getting ready
As we discussed in the previous recipe, each row in PostgreSQL contains xmin and xmax
values in its header, which define the transaction status. For each implicit/explicit
transaction, PostgreSQL allots a number to that transaction as a transaction ID. A
transaction ID is a number, so it must have boundaries, that is, maximum and minimum
values it can take, since we cannot generate an infinite quantity of numbers.
PostgreSQL chooses a 4-byte integer for these transaction IDs. That is, the maximum
number of transaction IDs we can generate with 4 bytes is 2^32 ~ 4294967296, which is
about 4 billion. However, PostgreSQL is capable of handling an unlimited number of
transactions with the 4-byte integer, by rotating the transaction IDs from 1 to 2^31 ~
2147483648. That is, once PostgreSQL reaches transaction ID 2147483648, it starts allotting
transaction IDs from 1 again for further incoming transactions. In PostgreSQL terminology,
this rotation of the transaction IDs is called transaction ID wraparound.
In PostgreSQL, tuple visibility is designed as follows: a transaction can only see tuples
committed by older transactions, as demonstrated here.
Session 1:

postgres=# BEGIN;
BEGIN
postgres=# SELECT txid_current();
 txid_current
--------------
       452516
(1 row)
postgres=# INSERT INTO test VALUES(1);
INSERT 0 1
postgres=# END;
COMMIT

Session 2:

postgres=# BEGIN;
BEGIN
postgres=# SELECT txid_current();
 txid_current
--------------
       452517
(1 row)
postgres=# SELECT xmin, xmax, t FROM test;
  xmin  | xmax | t
--------+------+---
 452516 |    0 | 1
(1 row)
postgres=# END;
COMMIT
In the preceding example, Session 2's transaction ID, 452517, is greater than Session 1's
transaction ID, 452516, which has already committed. That is why Session 2 is able to see the
inserted tuple, along with the xmin and xmax values of the inserted record.
In some special cases, PostgreSQL also allows a transaction to see committed tuples of a
future transaction, by referring to the commit log files in pg_clog, as follows:
Session 1:

postgres=# BEGIN;
BEGIN
postgres=# SELECT txid_current();
 txid_current
--------------
       452519
(1 row)

Session 2:

postgres=# BEGIN;
BEGIN
postgres=# SELECT txid_current();
 txid_current
--------------
       452520
(1 row)
postgres=# INSERT INTO test VALUES(1);
INSERT 0 1
postgres=# END;
COMMIT

Session 1:

postgres=# SELECT xmin, xmax, t FROM test;
  xmin  | xmax | t
--------+------+---
 452520 |    0 | 1
(1 row)
postgres=# END;
COMMIT
pg_clog stores the status of transactions, which is also helpful during PostgreSQL instance
crash recovery. pg_clog keeps only the minimum essential transaction information, rather
than storing the status of all transactions from 1 to 2^31.
Now, let us consider a special case that occurs when the transaction ID wraps around. Let us
assume the current transaction ID is 1, which was allotted as the next transaction ID after we
reached the transaction limit of 2^31. As demonstrated earlier, the current transaction can
only see information committed by earlier transaction IDs. As we are at transaction ID 1, it is
only able to see its previous transactions' information and not the future transactions.
Since all the transaction IDs from 2 to 2^31 are future IDs for the current transaction, and
pg_clog will not have any status for those transactions, the current transaction ID 1 is not
able to see any rows in the database, which leads to database unavailability. In PostgreSQL,
we call this special case the transaction ID wraparound problem, and to prevent this issue,
PostgreSQL provides an approach called freezing transaction IDs.
How to do it…
Vacuum provides another operation called freeze. When we execute vacuum with the freeze
option, it removes dead tuples and also freezes the transaction IDs. This freezing is what
prevents transaction ID wraparound.
The following command needs to be executed to freeze the transaction IDs:
postgres=# VACUUM FREEZE test;
VACUUM
The autovacuum process is also smart enough to perform this freeze operation, even if we
have turned autovacuum off at the instance level.
How it works…
Freezing a transaction ID means marking it as logically older than transaction ID 1, that is,
older than every normal transaction ID. As we discussed previously, when we reach
transaction ID 1 as a result of wraparound, we cannot see the committed tuples of
transactions 2 to 2^31, since all of those transaction IDs are greater than ID 1. If we could
make all of those IDs logically less than ID 1, then PostgreSQL would be able to display all
of that committed information again. This is exactly what a FREEZE operation performs
against the table: it converts all the previous transaction ID values so that they are treated as
logically less than transaction ID 1.
Once we execute the preceding VACUUM FREEZE command, PostgreSQL sets
vacuum_freeze_table_age to 0 and then proceeds with the actual freezing operation. To
do this, VACUUM needs to visit each row in each page and freeze its transaction IDs. If a
transaction performed bulk operations such as insert/update/delete, then vacuum freeze
needs to convert all of those row entries, which increases the I/O load. If we do not run
VACUUM FREEZE ourselves, autovacuum performs the same freezing operation to avoid the
wraparound problem once the database's transaction ID age reaches
autovacuum_freeze_max_age. The age of a transaction ID is simply the number of
transactions that have been performed in a table or database since the last FREEZE was
applied (or since the beginning, if none was). Whenever the database transaction age reaches
autovacuum_freeze_max_age, PostgreSQL launches the autovacuum process immediately
to perform the freeze operation on the whole database.
As the VACUUM FREEZE process is a bit expensive, we have to fine-tune this operation using
the following parameters:
vacuum_freeze_min_age: A tuple's transaction ID is frozen only once its age
reaches this value (50 million transactions by default). That is, after a tuple's
inserting transaction commits, the tuple becomes eligible for freezing only once 50
million further transactions have occurred.
vacuum_freeze_table_age: When the transaction ID age of a table reaches this
value (150 million by default), vacuum performs a full-table FREEZE operation,
which scans each and every row in the table. When we run VACUUM FREEZE
explicitly, it likewise scans every row in the table to freeze the transaction IDs and
also reclaims the dead tuples. In general, a plain VACUUM operation visits only the
blocks that contain dead rows, so as to avoid high I/O usage during the vacuum.
autovacuum_freeze_max_age: When the transaction ID age of a database reaches
this value (200 million by default), an autovacuum process starts forcefully and
performs the FREEZE operation on all tables of that database. That is, autovacuum
performs the FREEZE operation approximately 10 times before we reach the
transaction ID limit of 2^31.
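To see how close individual tables are to these limits, the age() function can be applied to
relfrozenxid; a minimal sketch is shown here:
postgres=# SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC LIMIT 10;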
Fixing transaction ID wraparound
Transaction ID wraparound is a serious concern, since PostgreSQL has to be shut down,
which leads to a database outage. PostgreSQL throws the following kind of warning message
when transaction ID wraparound is about to happen:
WARNING: database "<database name>" must be vacuumed within XXX
transactions.
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that
database.
If we see this kind of warning message in the log file, then, as the hint line says, we need to
perform a VACUUM in that database. If we do not notice these warning messages in the log
file, we may hit the transaction wraparound limit and get the following error message in the
log file:
ERROR: database is not accepting commands to avoid wraparound data loss in
database "<database name>"
HINT: Stop the postmaster and use a standalone backend to vacuum that
database.
The preceding hint line suggests that we have to use a standalone backend, such as the one
shown here, to perform the vacuum:
$ bin/postgres --single -D <Data Directory Location> <database name>
PostgreSQL stand-alone backend 9.5.4
backend> VACUUM;
Once the vacuum operation is completed, use Ctrl + D to close this standalone connection,
and then start the instance as usual.
Preventing transaction ID wraparound
To prevent the transaction ID wraparound problem, it is always advisable to keep the
autovacuum process turned on at the cluster level. The autovacuum process plays a vital role
in Postgres: it updates the statistics, performs VACUUM, and does the freezing operations
online whenever required. Also, running a periodic VACUUM ANALYZE operation at suitable
times will help keep the instance healthy.
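In addition, monitoring the transaction ID age of each database shows how far we are from a
forced shutdown; a simple query for this is:
postgres=# SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;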
Monitoring vacuum progress
In this recipe, we will be discussing the usage of the pg_stat_progress_vacuum catalog
view, which gives some metrics about the ongoing vacuum process.
Getting ready
PostgreSQL 9.6 introduces a new catalog view that provides metrics about ongoing vacuum
processes, along with the different action phases that vacuum/autovacuum performs
internally on a table. This view currently does not track the metrics of VACUUM FULL
operations, which might be added in future versions.
How to do it…
Let us run VACUUM on a sample big table:
benchmarksql=# VACUUM bigtable;
In another terminal, let us query the view and put it in \watch mode, as shown here:
benchmarksql=# SELECT * FROM pg_stat_progress_vacuum;
(0 rows)
benchmarksql=# \watch 1
-[ RECORD 1 ]------+--------------
pid                | 4785
datid              | 16405
datname            | benchmarksql
relid              | 16762
phase              | scanning heap
heap_blks_total    | 412830
heap_blks_scanned  | 393532
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 11184810
num_dead_tuples    | 1264
How it works…
The catalog view pg_stat_progress_vacuum is built on an internal
function, pg_stat_get_progress_info, which currently only takes VACUUM as an
argument. The view provides statistical information about each vacuum/autovacuum
backend process; it might be extended with more commands in future versions. From the
preceding output, we can identify which phase the current vacuum is in at this moment.
Vacuum performs a series of phases on a table in a sequential order, which helps us identify
which phase is taking more time during the vacuum.
You can find more details about vacuum phases and their counter values
at: https://www.postgresql.org/docs/9.6/static/progress-reporting.html.
Controlling bloat using transaction age
In this recipe, we will be discussing how to control the generation of dead tuples using the
snapshot threshold setting.
Getting ready
In versions prior to PostgreSQL 9.6, an old version of a tuple remained visible to a snapshot
until the transaction holding that snapshot completed. Once a tuple version was no longer
visible to any of the active transactions, it could be removed logically by the autovacuum
process. Also, we could not limit the age of a transaction snapshot as we can in some other
database management systems. If we can restrict the age of a snapshot, then we can prevent
the accumulation of multiple versions of the tuples by throwing a snapshot too old error.
This means that if a transaction holds an old snapshot and tries to read tuples that have since
been modified, that transaction is not allowed to progress further. In PostgreSQL 9.6, we can
achieve this by configuring the old_snapshot_threshold parameter.
How to do it…
To demonstrate this feature, let us execute the following query and then restart the
PostgreSQL instance:
postgres=# ALTER SYSTEM SET old_snapshot_threshold TO '1min';
ALTER SYSTEM
Let us execute transactions in two different sessions, as shown here:
Session 1:

postgres=# BEGIN WORK;
BEGIN
postgres=# SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET
postgres=# SELECT * FROM test;
 t
----
 10
(1 row)

Session 2:

postgres=# UPDATE test SET t=10;
UPDATE 1

Session 1:

postgres=# SELECT pg_sleep(100);
 pg_sleep
----------

(1 row)
postgres=# UPDATE test SET t=10;
ERROR: snapshot too old
How it works…
In PostgreSQL 9.6, whenever a transaction starts, it records the time its snapshot was taken.
For better understanding, let us call this snapshot time strx. Every statement that executes in
the transaction keeps updating the most recent statement time; let us call this time ltrx.
Whenever a statement in the transaction needs to visit tuples that have already been
modified by other transactions, it compares ltrx with strx. If the delta between ltrx and strx
is beyond old_snapshot_threshold, it throws the snapshot too old error.
At the moment, this new setting does not affect tables that have hash
indexes, or the catalog tables.
10
Data Migration from Other Databases to PostgreSQL and Upgrading the PostgreSQL Cluster
In this chapter, we will cover the following recipes:
Using pg_dump to upgrade data
Using the pg_upgrade utility for version upgrade
Replicating data from other databases to PostgreSQL using Goldengate
Introduction
During their career, a database administrator will often be required to perform major
version upgrades of the PostgreSQL server. Over a period of time, new features get added to
PostgreSQL, and this results in a major version release. To use the features of the new
version, the existing PostgreSQL setup needs to be upgraded to it. Database upgrades
require proper planning, careful execution, and planned downtime. PostgreSQL offers two
major ways to do a version upgrade:
With the help of the pg_dump utility
With the help of the pg_upgrade script
We will also cover the Oracle Goldengate tool in this chapter. Goldengate is a piece of
heterogeneous replication software that can be used to replicate data between different
databases. In this chapter, we are going to cover heterogeneous replication between Oracle
and PostgreSQL.
Using pg_dump to upgrade data
In this recipe, we are going to upgrade the PostgreSQL cluster from version 9.5 to 9.6. We
will utilize the pg_dump utility for this purpose.
Getting ready
The only prerequisites here are that an existing PostgreSQL cluster must be set up and
running. The required version here is PostgreSQL Version 9.6. These steps are carried out
on a 64 bit CentOS machine.
How to do it…
Here are the series of steps that need to be carried out for upgrading PostgreSQL from
version 9.5 to 9.6 using pg_dump:
1. Back up your database using pg_dumpall:
pg_dumpall > db.backup
2. The next step would be to shut down the current PostgreSQL server:
pg_ctl -D /var/lib/pgsql/9.5/data stop
3. Rename the old PostgreSQL installation directory:
mv /var/lib/pgsql /var/lib/pgsql.old
4. Install the new version of PostgreSQL, which is PostgreSQL Version 9.6, by using
either the yum command or the PostgreSQL binaries from: https://www.bigsql.org/postgresql/installers.jsp.
5. The next step would be to initialize the PostgreSQL Version 9.6 server:
/usr/pgsql-9.6/bin/initdb -D /var/lib/pgsql/9.6/data
6. Once the database cluster is initialized, the next step is to restore the
configuration settings, such as pg_hba.conf and postgresql.conf, from 9.5
into 9.6.
7. Then start the PostgreSQL 9.6 database server:
pg_ctl -D /var/lib/pgsql/9.6/data start
8. Finally, restore your data from the backup that was created in step 1:
/usr/pgsql-9.6/bin/psql -d postgres -f db.backup
9. We can either remove the old version data directory or continue working
alongside both the server versions.
10. If we choose to remove the old version as mentioned in step 9, we can then
remove the respective old version packages by using the yum remove command,
or the uninstaller file that BigSQL distribution creates in the local system.
How it works…
Here, initially we take a dump of all the databases in the existing PostgreSQL 9.5
cluster. We then initiate a clean shutdown of the current PostgreSQL server and rename the
existing PostgreSQL installation directory to avoid any conflicts with the new version of
PostgreSQL, that is, the version 9.6 being installed. Once the respective packages of the new
version are installed, we proceed with initializing a database directory for PostgreSQL 9.6
server. To ensure that the desired configuration settings come into effect, we will need to
copy the configuration files from the old version's data directory to the new version's data
directory, and then start the new version PostgreSQL server service using the configuration
settings that were defined for the existing environment in the old server. Once the
PostgreSQL server version 9.6 has been started, we can make connections to the databases
on this server. Eventually, we restore all the tables and databases from the old PostgreSQL
server 9.5 to the new PostgreSQL server version 9.6 using the backup that was made in step
1 of the previous section.
You may refer to the link given here for a more detailed explanation on
upgrading a PostgreSQL cluster using pg_dump: http://www.postgresql.org/docs/9.6/static/upgrading.html.
Using the pg_upgrade utility for version
upgrade
Here in this recipe, we are going to talk about upgrading a PostgreSQL cluster using
pg_upgrade. We will cover the upgrading of the PostgreSQL version from 9.5 to 9.6.
Getting ready
The only prerequisite here is that an existing PostgreSQL cluster must be up and running.
The target version here is PostgreSQL Version 9.6. These steps are carried out on a 64-bit
CentOS machine.
How to do it…
Here are the steps to upgrade a PostgreSQL machine from version 9.5 to version 9.6 using
the pg_upgrade utility:
1. Take a full backup of the data directory using a filesystem dump, or use
pg_dumpall to back up the data:
cd /opt/pgsql/9.5/
tar -cvf data.tar data
2. The next step would be to install the new version of PostgreSQL 9.6, as
mentioned previously.
3. Now that the new version of PostgreSQL is installed, we make the following
configuration changes and then initialize the data directory for the new
PostgreSQL Version 9.6 database:
view /etc/init.d/postgresql-9.6
PGDATA=/opt/pgsql/9.6/data
PGLOG=/opt/pgsql/9.6/pgstartup.log
PGPORT=5433
/etc/init.d/postgresql-9.6 initdb
4. Once the data directory has been initialized for the new PostgreSQL Version 9.6,
the next step is to stop the existing PostgreSQL Version 9.5 server and then run
the pg_upgrade utility to check whether the clusters are compatible or not:
service postgresql-9.5 stop
cd /usr/pgsql-9.6/bin
./pg_upgrade -v -b /usr/pgsql-9.5/bin/ -B /usr/pgsql-9.6/bin/ -d /opt/pgsql/9.5/data/ -D /opt/pgsql/9.6/data/ --check
5. Once the preceding command reports that the clusters are compatible, we need to
run the same command without the --check option, which actually performs the
upgrade operation.
6. Once the upgrade completes, it will generate two files,
analyze_new_cluster.sh and delete_old_cluster.sh. These files are
basically used to generate optimizer statistics and delete the old PostgreSQL
cluster version's data files.
7. After the upgrade, we need to copy the configuration settings present in the old
version's configuration files to the new setup.
8. Next, we start the PostgreSQL server version 9.6 service:
service postgresql-9.6 start
9. Run the analyze_new_cluster.sh shell script that was generated at the end of
step 5. This script is used to collect minimal optimizer statistics in order to get a
working and usable PostgreSQL system:
./opt/pgsql/analyze_new_cluster.sh
10. Now remove the old PostgreSQL directory by running this script:
./delete_old_cluster.sh
How it works…
After installing the packages for PostgreSQL server version 9.6, we change the location of
the data directory in the startup script, as shown in step 3. After this is done, we initialize the
data directory for the new PostgreSQL server. The difference between the steps here and the
previous recipe is that here we change the location of the data directory, the log path, and
the port number of the new PostgreSQL server version, whereas in the earlier recipe we
renamed the existing PostgreSQL server version's data directory. Once the data directory is
initialized, we stop the current PostgreSQL server and then launch the pg_upgrade script to
upgrade the existing setup to the new version. The pg_upgrade script requires the paths of
the old and new data directories and binaries. Once the upgrade completes, it generates two
shell scripts, analyze_new_cluster.sh and delete_old_cluster.sh, to generate statistics
and delete the old version's PostgreSQL directory. To preserve the existing configuration, we
need to move the pg_hba.conf and postgresql.conf files from the old version's data
directory to the new version's data directory, as shown in step 7 of the preceding section.
Then we can start the upgraded PostgreSQL server. Once the server is upgraded, we can
proceed to generate statistics via the analyze_new_cluster.sh script and remove the old
version directory via the delete_old_cluster.sh script, as shown in steps 9 and 10,
respectively.
For more inputs, refer to http://www.postgresql.org/docs/9.6/static/pgupgrade.html.
Replicating data from other databases to
PostgreSQL using Goldengate
In this recipe, we are going to cover heterogeneous replication using the Oracle Goldengate
software. We are going to migrate table data from Oracle to PostgreSQL.
Getting ready
Since this recipe talks about replicating data from Oracle to PostgreSQL, it is important to
cover the Oracle installation. Also, since Goldengate is the primary tool used here, we will
cover the Goldengate installation for both Oracle and PostgreSQL.
To install the Oracle 11g software on the Linux platform, you may refer to either of the
following web links: http://oracle-base.com/articles/11g/oracle-db-11gr2-installation-on-oracle-linux-5.php and http://dbaora.com/install-oracle-11g-release2-11-2-on-centos-linux-7/.
To install Goldengate for Oracle database, refer to: http://docs.oracle.com/cd/E35209_01/doc.1121/e35957.pdf.
Here are the high-level installation steps, given for the ease of the reader. Please refer to the
aforementioned web link for more detailed information:
1. Log in to edelivery.oracle.com.
2. Select the product pack Fusion Middleware and the Linux x86-64 platform. Click
on the Go button.
3. Choose the option with the description that says Oracle GoldenGate on
Oracle v11.2.1 Media Pack for Linux x86-64 and it will open a web
link. Then download the file with the name Oracle GoldenGate V11.2.1.0.3
for Oracle 11g on Linux x86-64. Download the respective zip file.
4. As a next step, extract the downloaded file and change the directory to the new
location. Then launch the Goldengate command-line interface using the ggsci
command. Before launching the Goldengate command-line interface, set the
Goldengate installation directory and library path in the PATH and
LD_LIBRARY_PATH environment variables, respectively.
How to do it…
Given here is the complete sequence of steps required to migrate table data/changes from
Oracle to PostgreSQL using Goldengate:
1. Connect as the superuser system by using the operating system authentication on
the machine that hosts the Oracle database, using the sqlplus utility:
sqlplus / as sysdba
2. Once logged in to the Oracle database, make the following parameter changes:
SQL> alter system set
log_archive_dest_1='LOCATION=/home/abcd/oracle/oradat/arch';
3. To make the aforementioned parameter changes come into effect, shut down and
restart Oracle database:
SQL> shutdown immediate
SQL> startup mount
4. Configure archiving on the Oracle database to ensure that changes made by
transactions are captured and logged in the archive log files:
SQL> alter database archivelog;
SQL> alter database open;
5. Enable minimum supplemental logging:
SQL> alter database add supplemental log data;
SQL> alter database force logging;
SQL> SELECT force_logging, supplemental_log_data_min FROM
v$database;
FOR SUPPLEME
--- --------
YES YES
6. Add the Goldengate directory path to the PATH and LD_LIBRARY_PATH
environment variables:
export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$HOME/ggs:$PATH
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$HOME/ggs/lib
7. Launch the Goldengate command-line interface for Oracle:
./ggsci
8. The next step would be to create various subdirectories for Goldengate, such as
directories for report files, database definitions, and so on:
GGSCI> create subdirs
Creating subdirectories under current directory /home/abcd/oracle/ggs
Parameter files             /home/abcd/oracle/ggs/dirprm: already exists
Report files                /home/abcd/oracle/ggs/dirrpt: created
Checkpoint files            /home/abcd/oracle/ggs/dirchk: created
Process status files        /home/abcd/oracle/ggs/dirpcs: created
SQL script files            /home/abcd/oracle/ggs/dirsql: created
Database definitions files  /home/abcd/oracle/ggs/dirdef: created
Extract data files          /home/abcd/oracle/ggs/dirdat: created
Temporary files             /home/abcd/oracle/ggs/dirtmp: created
Stdout files                /home/abcd/oracle/ggs/dirout: created
9. Create a parameter file for the manager that contains its port number. Here we
use port 7809:
GGSCI > edit param mgr
GGSCI > view param mgr
PORT 7809
10. The next step would be to exit from the manager, start the manager, and then
verify that it is running:
GGSCI > start mgr
GGSCI > info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING

GGSCI > info mgr
Manager is running (IP port 7809).
11. Log in to the server hosting the PostgreSQL server and write the Goldengate
configuration steps there. First, add the Goldengate directory to the
LD_LIBRARY_PATH and PATH environment variables:
export LD_LIBRARY_PATH=/usr/pgsql/lib:/usr/pgsql/ggs/lib
export PATH=/usr/pgsql/bin:/usr/pgsql/ggs:$PATH
12. Goldengate uses an ODBC connection to connect to the Postgres database. The
next step is to create the ODBC file. The ODBC driver is shipped along with the
installation on Linux/Unix; you only have to create the configuration file. If the
ODBC driver is not available, you may download the respective PostgreSQL
driver from http://www.uptimemadeeasy.com/linux/install-postgresql-odbc-driver-on-linux/:
view odbc.ini
[ODBC Data Sources]
GG_Postgres=DataDirect 6.1 PostgreSQL Wire Protocol
[ODBC]
IANAAppCodePage=106
InstallDir=/usr/pgsql/ggs
[GG_Postgres]
Driver=/usr/pgsql/ggs/lib/GGpsql25.so
Description=DataDirect 6.1 PostgreSQL Wire Protocol
Database=test
HostName=dbtest
PortNumber=5432
LogonID=nkumar
Password=nkumar
13. Export the ODBCINI environment variable, which should point to the odbc.ini
file that we created in the previous step:
export ODBCINI=/usr/pgsql/ggs/odbc.ini
14. Now that we have the ODBC setup completed, we start with the Goldengate
setup for PostgreSQL. We will first launch the Goldengate command-line
interpreter for PostgreSQL:
./ggsci
15. We will now create various subdirectories for Goldengate report, definition files,
and so on:
GGSCI > create subdirs
Creating subdirectories under current directory /usr/pgsql/ggs
Parameter files             /usr/pgsql/ggs/dirprm: already exists
Report files                /usr/pgsql/ggs/dirrpt: created
Checkpoint files            /usr/pgsql/ggs/dirchk: created
Process status files        /usr/pgsql/ggs/dirpcs: created
SQL script files            /usr/pgsql/ggs/dirsql: created
Database definitions files  /usr/pgsql/ggs/dirdef: created
Extract data files          /usr/pgsql/ggs/dirdat: created
Temporary files             /usr/pgsql/ggs/dirtmp: created
Stdout files                /usr/pgsql/ggs/dirout: created
16. The next step would be to create the manager parameter file with port number
and then start the manager:
GGSCI > edit param mgr
GGSCI > view param mgr
PORT 7809
17. Once we've created the parameter file, we can start the manager and check its
status:
GGSCI > start mgr
Manager started.
GGSCI > info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING

GGSCI > info mgr
Manager is running (IP port 7809).
18. We will now create a table both in Oracle and PostgreSQL and replicate the data
between the two. Log in to the Oracle database and create the table:
sqlplus nkumar
SQL> create table abcd(col1 number,col2 varchar2(50));
Table created.
SQL> alter table abcd add primary key(col1);
Table altered.
19. Log in to the PostgreSQL database and create a similar table:
psql -U nkumar -d test -h dbtest
test=> create table "public"."abcd" ( "col1" integer NOT NULL,
"col2" varchar(20),CONSTRAINT "PK_Col111" PRIMARY KEY
("col1"));
20. Log in to the Oracle database using the Goldengate command-line interface, list
the tables, and capture and check their data types:
GGSCI > dblogin userid nkumar, password nkumar
Successfully logged into database.
GGSCI > list tables *
NKUMAR.ABCD
Found 1 tables matching list criteria:
GGSCI > capture tabledef nkumar.abcd
Table definitions for NKUMAR.ABCD:
COL1    NUMBER NOT NULL PK
COL2    VARCHAR (50)
21. In the next step, we will check our ODBC connection to the PostgreSQL database
and use the Goldengate command line interface (CLI) to list the tables and
capture the table definitions:
GGSCI > dblogin sourcedb gg_postgres userid nkumar
Password:
2014-11-04 17:56:35  INFO  OGG-03036  Database character set identified as UTF-8. Locale: en_US.
2014-11-04 17:56:35  INFO  OGG-03037  Session character set identified as UTF-8.
Successfully logged into database.

GGSCI > list tables *
public.abcd
Found 1 table matching list criteria

GGSCI > capture tabledef "public"."abcd"
Table definitions for public.abcd:
col1    NUMBER (10) NOT NULL PK
col2    VARCHAR (20)
22. Now we will start the Goldengate extract process on the Oracle database. First we
will create the extract process that captures the changes for the abcd table in the
Oracle database and copy these changes directly to the PostgreSQL machine.
Every process needs the configuration file, so we will create one for the extract
process:
GGSCI > edit param epos
The parameters created are shown here when viewing the parameter file:
GGSCI > view param epos
EXTRACT epos
SETENV (NLS_LANG="AMERICAN_AMERICA.ZHS16GBK")
SETENV
(ORACLE_HOME="/home/abcd/oracle/product/11.2.0/dbhome_1")
SETENV (ORACLE_SID="orapd")
USERID nkumar, PASSWORD nkumar
RMTHOST dbtest, MGRPORT 7809
RMTTRAIL /usr/pgsql/ggs/dirdat/ep
TABLE nkumar.abcd;
23. The extract process is called epos and it connects as user nkumar using the
password nkumar to the Oracle database. Changes made on the Oracle table
abcd will be extracted and this information will be put in a trail file in the
PostgreSQL machine. Now that the parameter file has been created, we can add
the extract process and start it:
GGSCI > add extract epos, tranlog, begin now
EXTRACT added.
GGSCI > add exttrail /usr/pgsql/ggs/dirdat/ep, extract epos,
megabytes 5
EXTTRAIL added.
GGSCI > start epos
Sending START request to MANAGER ...
EXTRACT EPOS starting
GGSCI > info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     EPOS        00:00:00      00:00:00

GGSCI > info extract epos
24. Since we are replicating the data in the heterogeneous environment (that is, data
replication is happening from Oracle to PostgreSQL), the process doing the
loading in the PostgreSQL would need to provide more details about the data in
the extract file. This is done by creating a definition file using the defgen utility:
GGSCI > view param defgen
DEFSFILE /home/abcd/oracle/ggs/dirdef/ABCD.def
USERID nkumar, password nkumar
TABLE NKUMAR.ABCD;
25. We can now exit from the Goldengate CLI, call the defgen utility on the
command line to create the definition file, and add the reference to the defgen
parameter file:
./defgen paramfile ./dirprm/defgen.prm
Definitions generated for 1 table in
/home/abcd/oracle/ggs/dirdef/ABCD.def
26. The next step would be to copy the defgen file to the machine where the
PostgreSQL database is hosted:
scp dirdef/ABCD.def postgres@dbtest:/usr/pgsql/ggs/dirdef
27. Start the PostgreSQL replicat process. We are going to set up the parameter file
for it and include the definition file that was copied from the server hosting the
Oracle database to the server hosting PostgreSQL:
GGSCI > edit param rpos
The parameters created when viewing the parameter file are shown here.
GGSCI > view param rpos
REPLICAT rpos
SOURCEDEFS /usr/pgsql/ggs/dirdef/ABCD.def
SETENV ( PGCLIENTENCODING = "UTF8" )
SETENV (ODBCINI="/usr/pgsql/ggs/odbc.ini" )
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
TARGETDB GG_Postgres, USERID nkumar, PASSWORD nkumar
DISCARDFILE /usr/pgsql/ggs/dirrpt/diskg.dsc, purge
MAP NKUMAR.ABCD, TARGET public.abcd, COLMAP
(COL1=col1,COL2=col2);
28. In the next step, we create the replicat process, start it, and verify that it is
running:
GGSCI > add replicat rpos, NODBCHECKPOINT, exttrail
/usr/pgsql/ggs/dirdat/ep
REPLICAT added.
GGSCI > start rpos
Sending START request to MANAGER ...
REPLICAT RPOS starting
GGSCI > info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     RPOS        00:00:00      00:00:00

GGSCI > info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     RPOS        00:00:00      00:00:02

GGSCI > view report rpos
29. Now that the extract and replicat processes have been set up on Oracle and
PostgreSQL Goldengate interfaces, the next step is to test the configuration. We
first begin by logging in to the Oracle database and inserting records into the
abcd table:
sqlplus nkumar
SQL> insert into abcd values(101,'Neeraj Kumar');
1 row created.
SQL> commit;
Commit complete.
SQL> select * from abcd;
      COL1 COL2
---------- ------------------------------------
       101 Neeraj Kumar
30. Now we will check if the corresponding changes/new records inserted into the
abcd table in the Oracle database are visible in the corresponding abcd table in
the PostgreSQL database:
psql -U nkumar -d test
test=> select * from abcd;
 col1 |     col2
------+--------------
  101 | Neeraj Kumar
(1 row)
This setup completes the heterogeneous testing scenario for migrating data/changes from
Oracle to PostgreSQL.
How it works…
We are going to discuss the steps of the preceding section in chunks:
1. We will first talk about steps 1 to 6. Initially, we make a superuser connection in
Oracle with the sysdba privilege and make certain configuration changes. We
first enable a destination for holding the archived logs; that is, logs that contain
information about transactional changes are kept here in the location specified by
the log_archive_dest_1 initialization parameter as seen in step 2 of the
previous section. We then shut down the database in order to ensure that the
changes made in step 2 come into effect. Once the database is restarted, we
configure archiving in the database and enable supplemental logging, as seen in
step 4 and 5 of the preceding section. In step 6, we configure and include the
Goldengate directory path and library path in the PATH and LD_LIBRARY_PATH
environment variables.
2. We will now talk about steps 7 and 8 of the preceding section. After Goldengate
is installed on the server hosting Oracle database, we launch the Goldengate CLI
and then create various Goldengate subdirectories for holding parameter files,
checkpoint files, database definition files, extract data files, and so on.
3. Now we will talk about steps 9 and 10. The Goldengate manager performs a
number of functions, such as starting the Goldengate processes, trail log file
management, and reporting. The manager process needs to be configured on
both the source and target systems, and the configuration is carried out with the
help of the parameter file, as shown in step 9. We configure the PORT parameter
to define the port on which the manager runs. Once the parameter file for the
manager is set up on the source machine, we start the manager and verify that it
is running. This is shown in step 10 of the preceding section.
4. We will now talk about steps 11 to 15 of the preceding section. Once Goldengate
is installed on the machine where PostgreSQL server is hosted, we add the
Goldengate directory and library path to the PATH and LD_LIBRARY_PATH
environment variables. Goldengate basically uses an ODBC connection to connect
to the PostgreSQL database. For this purpose, we set up an ODBC configuration
file called odbc.ini, which contains connection information to connect to
PostgreSQL server. This is shown in step 12. In the next step, we export the
ODBCINI environment variable and include the path of the configuration file.
Then from step 14 onwards, we launch the Goldengate command-line interface
for PostgreSQL; and we create various subdirectories for holding parameter files,
database definition files, and so on.
5. We will now talk about steps 16 and 17. Similar to what was performed in steps 9
and 10 on the source system for the manager process in Goldengate for Oracle,
we configure the parameter file for the manager process in Goldengate for the
target system, that is, PostgreSQL. Then we start the manager and verify it, as
shown in step 17. The only parameter configured in step 16 is the PORT
parameter, which identifies the port on which the manager will listen.
6. We will now talk about steps 18 and 19 in the preceding section. Here we create
two tables with the same name and the same structure: one table in the Oracle
database and one in PostgreSQL. The tables are created in this manner because
the idea of this exercise is that any data/changes that happen on the table created
in Oracle will be replicated/propagated to PostgreSQL. This is the heterogeneous
replication concept.
7. Here we will talk about steps 20 and 21. Basically, in step 20, we log in to the
Oracle database using the Goldengate interface and capture the table definition
for the table that was created in step 18 of the preceding section. Similarly, in step
21, we check the ODBC connectivity to the PostgreSQL database from the
Goldengate CLI, and once the connection is made, we capture the table
definitions for the table created in step 19.
8. Here we are going to talk about steps 22 and 23 from the preceding section. In
step 22, we create a parameter file for the extract process on the machine hosting
Oracle database, since it is used as the source. The extract process happens on the
source database. The extract process parameter file contains information
regarding the Oracle environment, the target remote host, the manager port, the
trail file, and the table for which the changes need to be captured. In step 23, we
start the extraction process on the source Oracle database and we add the trail
file. The extract process will extract any changes made to the Oracle table abcd
and will put this information on the trail file that resides on the machine hosting
PostgreSQL server.
9. Here we are going to talk about steps 24 and 25 from the preceding section; as the
replication is happening in a heterogeneous environment (that is, from Oracle to
PostgreSQL, in this scenario), it is important to get as many details as possible
about the data in the extract file to make things clear for the process loading the
data into the PostgreSQL database. For this to happen, we need to create a
definition file that will be created on the Goldengate interface of the Oracle
database and will then be shipped to the machine hosting PostgreSQL server. In
step 24, we are basically creating a parameter file for the defgen utility. In step
25, we call the defgen utility to create the definition file, and we also add a
reference to the parameter file created in step 24 of the preceding section.
10. In step 26 of the preceding section, we copy the definition file created in step 25
from the Oracle machine to the machine that hosts PostgreSQL server.
11. Here we are going to talk about steps 27 and 28. We start the replicat process.
Replicat basically reads the changes from the trail file and distributes them to the
PostgreSQL database. In step 27, we basically configure the parameter file for the
replicat process. In step 28, we start the replicat process, add the trail file to the
replicat process so that it can read changes from the trail file, and dump those
changes to the PostgreSQL database.
12. Here we are going to talk about steps 29 and 30. Basically, we are going to test
our configuration here. In step 29, we log in to the Oracle database, insert a
record in the abcd table, and save the changes. Now, with the Goldengate extract
and replicat processes running, the newly inserted record in the Oracle table
should be replicated to the corresponding table in the PostgreSQL database. We
confirm this by logging in to the PostgreSQL database and then selecting the
records from the abcd table in step 30. We can see in step 30 that the records
inserted in the Oracle table in step 29 are visible in the PostgreSQL database table
abcd. This confirms the successful implementation of the heterogeneous
replication from Oracle to PostgreSQL.
There's more…
Oracle also provides heterogeneous database connections to remote database instances.
Configuring these kinds of database links from Oracle to PostgreSQL will also help us
migrate data instantly by using a trigger-based solution, as demonstrated at: http://manojadinesh.blogspot.in/2014/12/heterogeneous-database-sync.html.
Also, there are a few open source tools, such as SymmetricDS, which offer replication to
multiple sources from different foreign databases. For more information about the tool,
refer to: https://www.symmetricds.org/doc/3.8/html/user-guide.html.
11
Query Optimization
In this chapter, we will cover the following recipes:
Using sample data sets
Timing overhead
Studying hot and cold cache behavior
Clearing the cache
Querying plan node structure
Generating an explain plan
Computing basic costs
Running sequential scans
Running bitmap heap and index scans
Aggregate and hash aggregate
Grouping
Working with set operations
Running a CTE scan
Nesting loops
Working with merge and hash join
Working on semi and anti joins
Introduction
When an end user submits an SQL query to a database, the database engine first parses it
and validates the syntax and semantics of the given query. Once the query passes the
parsing stage, it enters the optimizer, which is responsible for generating the execution
plan of the submitted query by reading the table statistics. The execution plan of the same
query can vary across executions, since the plan the optimizer generates depends on the
amount of data the query is about to process. In this chapter, we will discuss various
aspects of the optimizer by walking through a few query processing algorithms.
Using sample data sets
Several sample data sets are available for PostgreSQL, and most of them model real-world
business cases.
Find the list of sample data sets that are available for PostgreSQL at this
URL: https://wiki.postgresql.org/wiki/Sample_Databases.
Benchmarking tools such as pgbench and BenchmarkSQL initialize a sample data set,
generate a realistic database load on it, and run the benchmark. In this chapter, to
demonstrate a few aspects of the optimizer, let's choose BenchmarkSQL's catalog tables,
which will generate enough data for this chapter's demonstrations.
Getting ready
BenchmarkSQL is an open source implementation of the popular TPC/C OLTP database
benchmark, which supports Firebird, Oracle, and PostgreSQL and captures detailed
benchmark results in CSV files, which can later be turned into an HTML report.
BenchmarkSQL also captures the OS metrics, such as CPU and memory usages.
How to do it…
Let's download and build the BenchmarkSQL.
Download the source from the following URL: https://bitbucket.org/openscg/benchmarksql.
Let's unzip the downloaded source file, and build the BenchmarkSQL as follows:
1. Create the benchmarksql database and user as follows:
$ psql postgres
postgres=# CREATE USER benchmarksql WITH
ENCRYPTED PASSWORD 'benchmarksql';
postgres=# CREATE DATABASE benchmarksql OWNER benchmarksql;
2. Run the ant command to build the BenchmarkSQL:
$ ant
Buildfile: /Users/dineshkumar/Downloads/benchmarksql-5.0/build.xml

init:

compile:
    [javac] Compiling 11 source files to /Users/dineshkumar/Downloads/benchmarksql-5.0/build

dist:
      [jar] Building jar: /Users/dineshkumar/Downloads/benchmarksql-5.0/dist/BenchmarkSQL-5.0.jar

BUILD SUCCESSFUL
Total time: 2 seconds
3. Let's create the benchmark configuration file using the
run/sample.postgresql.properties file as follows:
$ cp run/sample.postgresql.properties run/postgres_benchmark.prop
4. Let's update run/postgres_benchmark.prop with the local BenchmarkSQL
database JDBC URI as follows:
$ cat postgres_benchmark.prop
db=postgres
driver=org.postgresql.Driver
conn=jdbc:postgresql://localhost:5432/benchmarksql
user=benchmarksql
password=benchmarksql
...
...
5. Let's build the BenchmarkSQL database using the following command. The
following database build command will take some time to generate and load
sample data into the database:
$ ./runDatabaseBuild.sh postgres_benchmark.prop
# ------------------------------------------------------------
# Loading SQL file ./sql.common/tableCreates.sql
# ------------------------------------------------------------
...
Worker 000: Loading ITEM
Worker 001: Loading Warehouse 1
Worker 002: Loading Warehouse 2
Worker 003: Loading Warehouse 3
Worker 000: Loading ITEM done
Worker 000: Loading Warehouse 4
...
# ------------------------------------------------------------
# Loading SQL file ./sql.postgres/buildFinish.sql
# ------------------------------------------------------------
-- ----
-- Extra commands to run after the tables are created, loaded,
-- indexes built and extra's created.
-- PostgreSQL version.
-- ----
vacuum analyze;
6. Let's check the BenchmarkSQL database after it builds with sample data:
postgres=# SELECT pg_size_pretty(pg_database_size('benchmarksql'));
 pg_size_pretty
----------------
 1150 MB
(1 row)
How it works…
BenchmarkSQL is a brand new benchmarking tool for databases such as PostgreSQL and
Oracle. Using this tool, we can simulate the TPC-C environment. That is, this tool simulates
an e-ordering environment by creating sample warehouses, customers, and order delivery
status data.
Refer to the following URL for more information about TPC-C: http://www.tpc.org/tpcc/.
The runDatabaseBuild.sh command generates the sample dataset based on the given
properties file. The given property file postgres_benchmark.prop has all the default
options such as warehouses=10, loadworkers=3, and terminals=10. Based on these
settings, BenchmarkSQL generates 1.1 GB data. Please refer to the
sample.postgresql.properties file for the BenchmarkSQL supported parameters and
their usage.
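For reference, the relevant settings in a properties file look roughly like the following;
the key names are taken from sample.postgresql.properties, but treat the exact names and
values as an assumption that may differ between BenchmarkSQL versions:

    warehouses=10
    loadWorkers=3
    terminals=10

Increasing warehouses grows the generated data set roughly linearly, so it is the main
knob for producing a larger test database.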
Timing overhead
In this recipe, we will be discussing overhead of system timing calls.
Getting ready
In general, operating systems maintain a set of clock sources that provide a monotonically
increasing time value. When we execute any time-related command, such as date, the
operating system reads the date and time from the underlying hardware clock and returns
the output in a human-readable format. A clock source is expected to stay consistent
across all system time calls; in no case should it return a time value that lies in the
past. That is, the clock time should be a monotonically increasing value. If the clock
source loses its consistency due to a bad hardware timekeeper, then we also lose system
stability.
While running the EXPLAIN ANALYZE command, PostgreSQL needs to make multiple system
time calls, which track the start and end execution time of each node type. If the system
is not consistent in keeping its time value, then EXPLAIN ANALYZE may take considerably
more time than the actual query execution.
How to do it…
PostgreSQL provides a contributed module called pg_test_timing, which makes multiple
system time calls and observes the system clock behavior. If the system clock goes
backwards, it reports the Detected clock going backwards in time error
message.
1. Let's run the sample pg_test_timing command, as shown in the following
snippet, to calculate the timing overhead in the system:
$ pg_test_timing -d 3
Testing timing overhead for 3 seconds.
Per loop time including overhead: 52.83 nsec
Histogram of timing durations:
  < usec   % of total      count
       1     94.78551   53828766
       2      5.21020    2958875
       4      0.00084        475
       8      0.00077        436
      16      0.00229       1300
      32      0.00028        157
      64      0.00007         41
     128      0.00001          7
     256      0.00002         13
     512      0.00000          1
    1024      0.00001          4
    2048      0.00000          1
    4096      0.00001          4
If the system clock source is consistent, then more than 90% of the individual timing calls
should take less than 1 microsecond, as in the preceding output.
How it works…
The pg_test_timing module is specifically designed to identify the timing overhead of the
clock source. It makes several system clock calls and computes the delta between the
previous value and the current value. In effect, it simulates the system calls that EXPLAIN
ANALYZE issues while instrumenting a query.
On Unix systems, EXPLAIN ANALYZE calculates the timings using the
gettimeofday() function call, whereas on Windows systems it uses the
QueryPerformanceCounter() function. These system calls are performed for each
node type while the plan is being executed.
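On Linux, the kernel clock source in use has a large effect on this overhead; the tsc
clock source is generally the cheapest to read. A quick way to check which clock source
the kernel is using (the paths below are standard on Linux; the output will vary per
machine):

    $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource

If the current clock source is something slow such as hpet or acpi_pm, the per-loop
overhead reported by pg_test_timing will typically be much higher.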
Refer to the following URL for more information about the timing
overhead calculations on the system:
https://www.postgresql.org/docs/9.6/static/pgtesttiming.html.
Studying hot and cold cache behavior
Database cache plays a vital role in the query performance. In this recipe, we will be
discussing the hot and cold cache impact on the submitted SQL.
Getting ready
Cold cache is the behavior where the submitted SQL does not find its required data in
memory, which forces the process to read the data from disk into memory. Hot cache is the
opposite: if the SQL's required data is found in memory, it is a hot cache hit, which
avoids the disk read operations and also improves the query response time. Query
performance is directly proportional to the amount of data we have in memory; that is, the
fewer I/O operations we perform, the better the performance.
How to do it…
PostgreSQL provides a few extensions such as pg_prewarm, pg_buffercache, which will
help us to deal with cache. So, let's create these two extensions as follows:
benchmarksql=# CREATE EXTENSION pg_buffercache ;
CREATE EXTENSION
benchmarksql=# CREATE EXTENSION pg_prewarm ;
CREATE EXTENSION
Cold cache
1. Let's check if any BenchmarkSQL tables are cached in memory or not by using
the following SQL statement:
benchmarksql=# SELECT COUNT(*) FROM pg_buffercache WHERE
relfilenode::regclass::text ~ 'bmsql_';
 count
-------
     0
(1 row)
2. It seems like no BenchmarkSQL tables are cached in memory. Let's also validate
the heap_blks_read, heap_blks_hit from pg_statio_user_tables by
taking bmsql_customer as a sample table:
benchmarksql=# SELECT heap_blks_read, heap_blks_hit FROM
pg_statio_user_tables WHERE relname = 'bmsql_customer';
 heap_blks_read | heap_blks_hit
----------------+---------------
              0 |             0
(1 row)
3. Let's run an EXPLAIN ANALYZE along with the BUFFERS option on the
bmsql_customer table and validate the results:
benchmarksql=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM
bmsql_customer ;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28211.00 rows=300000 width=557) (actual time=39.086..3211.754 rows=300000 loops=1)
   Buffers: shared read=25211
 Planning time: 94.650 ms
 Execution time: 3232.508 ms
(4 rows)
4. From the preceding execution plan, we can see that all blocks are read from
disk and not a single buffer is hit in the cache. This causes the query to take
3232.508 ms. Also, let's validate heap_blks_read, heap_blks_hit from the
pg_statio_user_tables catalog view again:
benchmarksql=# SELECT heap_blks_read, heap_blks_hit FROM
pg_statio_user_tables WHERE relname = 'bmsql_customer';
 heap_blks_read | heap_blks_hit
----------------+---------------
          25211 |             0
(1 row)
The catalog view also shows the number of heap blocks read from disk and the number of
blocks found in the buffers. In this case, no blocks were read from memory, which is why
the heap_blks_hit value is 0.
This is a demonstration of cold cache behavior. With a cold cache, a slow response from
the query is the expected behavior.
Hot cache
Once PostgreSQL reads any table data from disk, it keeps those buffers in the memory
area called shared buffers. If we execute the same query again, it tries to avoid the
disk I/O by looking for the previously read buffers in memory. Let's check how many
buffers there are in shared buffers for the bmsql_customer table using the following SQL
statement:
benchmarksql=# SELECT COUNT(*) FROM pg_buffercache WHERE
relfilenode::regclass::text = 'bmsql_customer';
 count
-------
    32
(1 row)
It seems that PostgreSQL kept 32 blocks in shared buffers from the previous query
execution. Let's run the same EXPLAIN ANALYZE command to see how many buffers it is
going to read from disk and how many it is going to hit from the cache:
benchmarksql=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM bmsql_customer;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28211.00 rows=300000 width=557) (actual time=0.034..92.899 rows=300000 loops=1)
   Buffers: shared hit=32 read=25179
 Planning time: 0.066 ms
 Execution time: 111.866 ms
(4 rows)
From the preceding results, we read 32 blocks from memory and the remaining blocks
from disk or the OS cache. As a few blocks came from the cache, the query performance
also improved significantly. These 32 blocks in shared buffers will eventually be
replaced by other tables' data according to the clock sweep algorithm.
By using the pg_prewarm extension, we can also load a whole table into either shared
buffers or the OS cache. Let's load the whole table into shared buffers, as shown in the
following snippet, and see the query performance:
benchmarksql=# SELECT * FROM pg_prewarm('bmsql_customer', 'buffer');
 pg_prewarm
------------
      25211
(1 row)
benchmarksql=# SELECT COUNT(*) FROM pg_buffercache WHERE
relfilenode::regclass::text = 'bmsql_customer';
 count
-------
 16319
(1 row)
Using the pg_prewarm function, we asked PostgreSQL to load the whole table into shared
buffers; that is, all 25,211 blocks. But the pg_buffercache view shows only 16,319 blocks.
The bmsql_customer table size is 25,211 * 8,192 (block size) ~ 196 MB, and the amount of
the table actually loaded into shared buffers is 16,319 * 8,192 ~ 127 MB. That is,
approximately 65% of the table is loaded into shared buffers. Let's look at the actual
shared_buffers setting value as follows:
benchmarksql=# SHOW shared_buffers ;
 shared_buffers
----------------
 128MB
(1 row)
It seems that we do not have enough shared buffers to hold 196 MB of data, which is why
pg_prewarm was only able to keep 127 MB of the relation. Let's run the same explain
query and see the query performance:
benchmarksql=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM bmsql_customer ;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28211.00 rows=300000 width=557) (actual time=0.022..81.067 rows=300000 loops=1)
   Buffers: shared hit=16318 read=8893
 Planning time: 0.065 ms
 Execution time: 100.300 ms
(4 rows)
From the preceding explain output, we hit roughly 65% of the table data in memory, and the
rest of the data is read from disk or the OS cache. As most of the data is in the cache, we
got a better query response time.
How it works…
As previously mentioned, whenever a query execution needs to fetch data, it first looks
for it in shared buffers. If the required data is found there, it is a cache hit;
otherwise, it is a cache miss. Also, shared buffers do not keep all of the data
permanently, as there is a cache eviction policy based on the clock sweep algorithm. Once
a page is evicted from shared buffers, it may still be present in the OS cache, from which
queries can fetch it more cheaply than from disk when it is not found in shared buffers.
PostgreSQL provides another extension called pgfincore, which provides information about
the OS cache.
Refer to the following URL that has excellent information about pgfincore:
http://fibrevillage.com/database/128-postgresql-database-buffer-cache-and-os-cache.
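As a minimal sketch, assuming the pgfincore extension is installed on the server, its main
function can be called with a relation name to report how much of that relation currently
sits in the OS page cache:

    benchmarksql=# CREATE EXTENSION pgfincore;
    CREATE EXTENSION
    benchmarksql=# SELECT * FROM pgfincore('bmsql_customer');

The columns reporting OS-level residency (such as the number of relation pages found in
the OS cache) complement what pg_buffercache reports for shared buffers.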
Clearing the cache
In this recipe, we will be discussing how to clean the cache from databases at operating
system level.
Getting ready
In PostgreSQL, we do not have any predefined functionality to clear the cache from
memory. To clear the database-level cache, we need to shut down the whole instance, and to
clear the operating system cache, we need to use operating system utility commands.
Do not run any of the following operations on a production server, as it
will cause a production outage.
How to do it…
1. Let's load some sample data into the database cache by using the pg_prewarm
function:
benchmarksql=# SELECT pg_prewarm('bmsql_customer', 'buffer');
 pg_prewarm
------------
      25211
(1 row)
2. Let's validate whether we are able to hit the buffers, which are loaded using the
pg_prewarm function:
benchmarksql=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM
bmsql_customer;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28211.00 rows=300000 width=557) (actual time=0.013..65.337 rows=300000 loops=1)
   Buffers: shared hit=16101 read=9110
 Planning time: 204.910 ms
 Execution time: 83.123 ms
(4 rows)
3. As we have cached the bmsql_customer table data in buffers, we got the query
response in about 83 milliseconds. If the data were not cached, the query would
have taken longer, which we will now try to reproduce by flushing both the
shared buffers and the OS cache. As previously mentioned, we do not have any
predefined utility to clear shared buffers other than an instance shutdown. So
let's shut down the PostgreSQL cluster to clear the database cache:
$ pg_ctl -D data stop -mf
waiting for server to shut down...... done
server stopped
4. To clear the operating system cache, we have to use OS utility commands as
follows:
# sync
# echo 3 >/proc/sys/vm/drop_caches
5. As we cleared both the database and operating system cache, it is time to start the
PostgreSQL instance to observe the cold cache behavior:
$ pg_ctl -D data start
server starting
6. Let's execute the preceding SQL command and see the query response time and
cache hit ratio:
benchmarksql=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM
bmsql_customer;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28211.00 rows=300000 width=557) (actual time=29.491..1451.221 rows=300000 loops=1)
   Buffers: shared read=25211
 Planning time: 3.430 ms
 Execution time: 1488.079 ms
(4 rows)
As expected with a cold cache, the executed SQL statement read all of its required blocks
from disk rather than from buffers. As it read from disk, the query response time also
increased compared with the previous hot cache behavior.
How it works…
When we shut down the PostgreSQL instance, the postmaster process takes responsibility
for closing all the PostgreSQL background processes and releasing the allocated shared
buffers memory. The sync command performs the flush, writing all modified (dirty) pages
from the OS page cache to the respective data files, and the echo 3
> /proc/sys/vm/drop_caches command drops the unused, unmodified buffers from the kernel
cache without touching the modified ones. That is the reason we run the sync command
before drop_caches, so that together they clear both the modified and unmodified cached
buffers.
There's more…
If a production maintenance activity requires a system restart, then once PostgreSQL is
restarted and available to the application servers, we may hit performance problems for a
couple of hours due to cold cache behavior. To avoid these kinds of performance issues, it
is recommended to load all the hot tables' data into the shared buffers or the operating
system cache using the pg_prewarm function during the maintenance window.
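A minimal sketch of such a warm-up step, assuming bmsql_customer and bmsql_item are the
hot tables for this workload (pick the tables that matter for yours):

    benchmarksql=# SELECT c.relname,
                          pg_prewarm(c.oid::regclass, 'buffer') AS blocks_loaded
                   FROM pg_class c
                   WHERE c.relname IN ('bmsql_customer', 'bmsql_item');

Running this right after the restart, before application traffic resumes, moves the
cold-cache penalty into the maintenance window instead of the first hours of production
load.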
Query plan node structure
In this recipe, we will be discussing the explain plan tree structure.
Getting ready
PostgreSQL generates a set of plans before choosing an optimal plan that it is going to
execute, based on the collected statistics about the relations. That is, the plan we are going
to get when we use the EXPLAIN command along with the SQL statement is the optimal
plan for that query. We also used EXPLAIN statements extensively on previous topics, and
now we are going to understand the plan structure, and the significance of each value in it.
How to do it…
Let's run a basic EXPLAIN query and evaluate it as follows:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE c_id=0;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------------
 Index Scan using bmsql_customer_pkey on bmsql_customer  (cost=0.42..7092.08 rows=98 width=557)
   Index Cond: (c_id = 0)
(2 rows)
In the preceding plan, we have only one node type, Index Scan, and the scan on the given
table is performed using its primary key index.
The structure of the plan is a tree of node types, and the leaf nodes always deal with
accessing the data from the relations using different access methods. That is, by using
sequential, index, or bitmap index scans, PostgreSQL fetches the raw data from disk pages
and feeds it to parent nodes such as nested loops, merge joins, hash joins, and
so on. To demonstrate this, let's extend the preceding SQL query by adding a join condition
as follows:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer, bmsql_warehouse
WHERE c_id=0 AND c_w_id=w_id ORDER BY c_since;
                                         QUERY PLAN
----------------------------------------------------------------------------------------------
 Sort  (cost=6328.89..6329.14 rows=98 width=637)
   Sort Key: bmsql_customer.c_since
   ->  Nested Loop  (cost=0.42..6325.65 rows=98 width=637)
         ->  Seq Scan on bmsql_warehouse  (cost=0.00..1.10 rows=10 width=80)
         ->  Index Scan using bmsql_customer_pkey on bmsql_customer  (cost=0.42..632.36 rows=10 width=557)
               Index Cond: ((c_w_id = bmsql_warehouse.w_id) AND (c_id = 0))
(6 rows)
In the preceding example, the leaf node types are Index Scan and Seq Scan, and their
parent node type is Nested Loop, which is itself a child of the Sort node. That is, the
raw data fetched by the different scans feeds the Nested Loop, whose output feeds the Sort
node, which returns the sorted output.
How it works…
As we discussed earlier, the PostgreSQL optimizer generates a set of plans for the
submitted SQL and chooses the optimal plan among them for the final execution. That is,
from all the plans generated by the optimizer, it picks the one with the minimal cost
value at the topmost node.
In the EXPLAIN text output format, each row defines a specific node type along with its
estimated cost values. In some cases, a node type is also associated with labels such as
Filter, Index Cond, and so on. From the preceding example, the Index Scan node has
the cost values 0.42..632.36. Here, 0.42 is the estimated cost to fetch the first record
from the relation, and 632.36 is the estimated cost to fetch the final record. In the same
plan, rows=10 is the estimated number of rows we will be getting from the relation using
the index condition c_w_id = bmsql_warehouse.w_id AND c_id = 0, and width=80 is the
estimated average number of bytes returned per row from the relation bmsql_warehouse. All
these numbers are abstract cost estimates that PostgreSQL computes from the statistics. To
see the actual timings, we have to use the ANALYZE option along with the EXPLAIN command
as follows:
benchmarksql=# EXPLAIN ANALYZE SELECT * FROM bmsql_customer,
bmsql_warehouse WHERE c_id=0 AND c_w_id=w_id ORDER BY c_since;
                                         QUERY PLAN
----------------------------------------------------------------------------------------------
 Sort  (cost=6328.89..6329.14 rows=98 width=637) (actual time=34.858..34.858 rows=0 loops=1)
   Sort Key: bmsql_customer.c_since
   Sort Method: quicksort  Memory: 25kB
   ->  Nested Loop  (cost=0.42..6325.65 rows=98 width=637) (actual time=34.853..34.853 rows=0 loops=1)
         ->  Seq Scan on bmsql_warehouse  (cost=0.00..1.10 rows=10 width=80) (actual time=0.002..0.015 rows=10 loops=1)
         ->  Index Scan using bmsql_customer_pkey on bmsql_customer  (cost=0.42..632.36 rows=10 width=557) (actual time=3.471..3.471 rows=0 loops=10)
               Index Cond: ((c_w_id = bmsql_warehouse.w_id) AND (c_id = 0))
 Planning time: 0.292 ms
 Execution time: 34.917 ms
(9 rows)
From the preceding results, we can observe that each node type now carries additional
information along with the cost values. Here, actual time shows the time it took to
process the records in milliseconds, along with the actual number of rows processed.
It is always recommended to run the EXPLAIN ANALYZE operation on any
DML statement inside a transaction. This is because ANALYZE actually runs the
given SQL statement, which may affect rows through the
DELETE/INSERT/UPDATE operations. If you do not want these changes to
affect the database, you can ROLLBACK the transaction after getting the
execution plan for the statement, as shown in the sketch below.
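A minimal sketch of that pattern (the UPDATE here is a harmless, hypothetical example
against the BenchmarkSQL schema):

    benchmarksql=# BEGIN;
    BEGIN
    benchmarksql=# EXPLAIN ANALYZE UPDATE bmsql_customer SET c_city = c_city WHERE c_id = 1;
    ...
    benchmarksql=# ROLLBACK;
    ROLLBACK

The UPDATE really executes inside the transaction, so the plan shows actual row counts and
timings, but the ROLLBACK discards its effects.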
Generating an explain plan
In this recipe, we will be discussing various supported formats of the EXPLAIN command.
Getting ready
The EXPLAIN command provides multiple formats of the plan output, which gives more
flexibility in parsing the explain plan. In PostgreSQL 9.6, EXPLAIN supports the following
formats:
TEXT
XML
YAML
JSON
How to do it…
1. Let's generate the XML formatted explain plan as follows:
benchmarksql=# EXPLAIN (FORMAT XML) SELECT * FROM bmsql_customer;
                        QUERY PLAN
----------------------------------------------------------
 <explain xmlns="http://www.postgresql.org/2009/explain">+
   <Query>                                               +
     <Plan>                                              +
       <Node-Type>Seq Scan</Node-Type>                   +
       <Relation-Name>bmsql_customer</Relation-Name>     +
       <Alias>bmsql_customer</Alias>                     +
       <Startup-Cost>0.00</Startup-Cost>                 +
       <Total-Cost>28218.00</Total-Cost>                 +
       <Plan-Rows>300000</Plan-Rows>                     +
       <Plan-Width>559</Plan-Width>                      +
     </Plan>                                             +
   </Query>                                              +
 </explain>
(1 row)
2. Let's generate the YAML formatted explain plan as follows:
benchmarksql=# EXPLAIN(FORMAT YAML) SELECT * FROM bmsql_customer;
             QUERY PLAN
------------------------------------
 - Plan:                            +
     Node Type: "Seq Scan"          +
     Relation Name: "bmsql_customer"+
     Alias: "bmsql_customer"        +
     Startup Cost: 0.00             +
     Total Cost: 28218.00           +
     Plan Rows: 300000              +
     Plan Width: 559
(1 row)
3. Let's generate the JSON formatted explain plan as follows:
benchmarksql=# EXPLAIN(FORMAT JSON) SELECT * FROM bmsql_customer;
               QUERY PLAN
------------------------------------------
 [                                        +
   {                                      +
     "Plan": {                            +
       "Node Type": "Seq Scan",           +
       "Relation Name": "bmsql_customer", +
       "Alias": "bmsql_customer",         +
       "Startup Cost": 0.00,              +
       "Total Cost": 28218.00,            +
       "Plan Rows": 300000,               +
       "Plan Width": 559                  +
     }                                    +
   }                                      +
 ]
(1 row)
How it works…
All the previously mentioned formats give us a way to parse the explain plan using
standard XML, YAML, or JSON libraries. By default, EXPLAIN only produces a plain-text tree
structure in which each line describes a specific node type.
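For example, a quick way to pull a single field out of the JSON format from the shell,
assuming the jq utility is installed (a sketch, not part of PostgreSQL itself):

    $ psql -At -d benchmarksql \
        -c 'EXPLAIN (FORMAT JSON) SELECT * FROM bmsql_customer' \
        | jq -r '.[0].Plan["Node Type"]'

The -A and -t psql flags strip the column decoration so that only the JSON document is
piped into jq, which then extracts the top node's type.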
There's more…
We can also generate and log the plan for each query automatically by using the
auto_explain module. This extension logs the query along with its explain plan in the
log files, which tells us what the plan was whenever a query took a long time to complete
its execution. This extension also supports the various output formats we discussed
previously.
Refer to the following URL for more information about the auto_explain
extension: https://www.postgresql.org/docs/9.6/static/auto-explain.html.
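A minimal configuration sketch (the parameter values are illustrative;
shared_preload_libraries requires a server restart to take effect):

    # postgresql.conf
    shared_preload_libraries = 'auto_explain'
    auto_explain.log_min_duration = '250ms'   # log plans of statements slower than this
    auto_explain.log_analyze = on             # include actual times (adds some overhead)
    auto_explain.log_format = 'json'          # text, xml, json, or yaml

With these settings, any statement running longer than 250 ms gets its plan written to the
server log in JSON format.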
Computing basic cost
In this recipe, we will be discussing how the optimizer chooses the optimal plan from the
set of plans that it generates.
Getting ready
PostgreSQL generates a set of plans based on the optimizer attributes such as
enable_seqscan, enable_indexscan, and so on. Among these generated sets of plans,
PostgreSQL will only choose the plan that has the minimum cost value when compared
with the other plans.
How to do it…
1. Let's generate a simple plan, as shown in the following snippet, and evaluate it:
benchmarksql=# SHOW seq_page_cost;
 seq_page_cost
---------------
 1
(1 row)

benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer;
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28218.00 rows=300000 width=559)
(1 row)
2. The numbers we are seeing in the preceding plan are all abstract cost values;
that is, the assumed cost of processing a sequential scan over the whole
table. Let's increase the optimizer cost setting seq_page_cost to 2 and look at the
same plan:
benchmarksql=# SET seq_page_cost TO 2;
SET
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer;
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..53436.00 rows=300000 width=559)
(1 row)
3. Let's extend the same example to the index scan and evaluate the cost values for
it, as shown in the following snippet. To fetch required tuples from the table
using an index, PostgreSQL needs to perform two different scans. One scan
fetches the matching records from the index, and another scan fetches the
required tuples from the table. The index scan cost calculation will include both
cost calculations:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE c_id = 1;
                                        QUERY PLAN
-------------------------------------------------------------------------------------------------
 Index Scan using bmsql_customer_pkey on bmsql_customer  (cost=0.42..7096.24 rows=98 width=557)
   Index Cond: (c_id = 1)
(2 rows)
How it works…
Using the preceding sequential and index scan examples, let us discuss how the optimizer
computes these cost values and which settings it considers during cost estimation.
Sequential scan
1. From the previous sequential scan query, the optimizer returned the cost value as
53436. Let us discuss how the optimizer calculates this value as follows:
benchmarksql=# SELECT reltuples, relpages FROM pg_class WHERE
relname='bmsql_customer';
 reltuples | relpages
-----------+----------
    300000 |    25218
(1 row)
2. The cost value 53436.00 was calculated from the number of relpages together
with the cpu_tuple_cost and seq_page_cost values. The default
cpu_tuple_cost value is 0.01, which is the estimated abstract cost of
processing each tuple from the relation pages. The sequential scan cost is
calculated as relpages*seq_page_cost + cpu_tuple_cost*reltuples, and
the evaluation for it is as follows:
benchmarksql=# SELECT (25218*2 + 0.01*300000) seq_scan_cost;
 seq_scan_cost
---------------
      53436.00
(1 row)
3. Let's add a predicate and evaluate the cost value to the same SQL query:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE
c_city = 'San Mateo';
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..54186.00 rows=9 width=559)
   Filter: ((c_city)::text = 'San Mateo'::text)
(2 rows)
4. From the preceding plan, the cost 54186.00 was calculated as
relpages*seq_page_cost + cpu_tuple_cost*reltuples +
cpu_operator_cost*reltuples. The default value of the
cpu_operator_cost parameter is 0.0025, which is the estimated abstract cost of
comparing a single tuple with the string San Mateo; a quick arithmetic check is
shown after this step. The following is a pictorial representation of this
sequential scan cost estimation:
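As a sanity check of the preceding formula, the same arithmetic can be evaluated directly
(a sketch using the default cpu_tuple_cost of 0.01 and cpu_operator_cost of 0.0025, with
seq_page_cost still set to 2 as above):

    benchmarksql=# SELECT (25218*2 + 0.01*300000 + 0.0025*300000) AS filtered_seq_scan_cost;
    -- 50436 + 3000 + 750 = 54186, matching the plan's total cost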
Index scan
The index scan cost calculation is not as straightforward as the sequential scan cost
calculation. As we've discussed previously, an index scan cost calculation has two phases.
Phase I
In this phase, the optimizer only calculates the cost required to process the index pages.
This is done using internal cost estimator functions registered for each index access
method; for each index type, there is an individual cost estimator. Let's find out which
index cost estimators are available by querying the pg_am catalog:
benchmarksql=# SELECT amname, amcostestimate FROM pg_am;
 amname | amcostestimate
--------+------------------
 btree  | btcostestimate
 hash   | hashcostestimate
 gist   | gistcostestimate
 gin    | gincostestimate
 spgist | spgcostestimate
 brin   | brincostestimate
(6 rows)
For the preceding index scan example, the btcostestimate method will be used to find
the cost to read the index pages. This method will consider random_page_cost,
cpu_index_tuple_cost, the height of the index tree, and the arbitrary cost of each WHERE
clause condition in estimating the index tuple scan cost.
Phase II
In this phase, the optimizer verifies a few aspects before costing the fetch of data from
the actual table. That is, it checks whether the physical table order is similar to the
index order. If the order is not similar, then PostgreSQL has to fetch the pages in random
order, which automatically increases the cost value. If the order is similar (a CLUSTERed
table), then PostgreSQL can fetch the pages mostly sequentially, which costs less. If the
table is partially ordered, then PostgreSQL chooses an average of the preceding two cost
values. If the scan is an Index Only Scan, different cost adjustments are applied for this
node type, since in Index Only Scans PostgreSQL does not need to visit the actual table
pages. To calculate the cost for this phase, the optimizer needs the table relpages, index
relpages, and effective_cache_size values.
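The notion of the table order being similar to the index order is what the planner's
statistics call correlation. A quick way to inspect it (a sketch; the values are only
available after ANALYZE has run, and will vary with your data):

    benchmarksql=# SELECT attname, correlation
                   FROM pg_stats
                   WHERE tablename = 'bmsql_customer';

A correlation close to 1 or -1 means the heap is stored nearly in that column's order, so
an index scan on it needs mostly sequential page fetches; a value near 0 means mostly
random fetches.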
Running sequential scans
In this recipe, we will be discussing sequential scans.
Getting ready
A sequential scan is a mechanism in which PostgreSQL reads every tuple from the relation.
The best example of a sequential scan is reading an entire table without any predicate.
Sequential scans are preferred over index scans when a query reads most of the data from
the table, as this avoids the index lookup overhead.
Reading pages in sequential order takes less effort than reading pages in random order.
This is because, when reading a file sequentially, we do not need to move the file pointer
to any specific location. However, during an index scan, PostgreSQL needs to read random
pages from the file as dictated by the index results; that is, the file read pointer jumps
between multiple pages. This is the reason why the default value of the cost parameter
seq_page_cost (1) is lower than the default random_page_cost value (4).
How to do it…
1. Let's run a query that reads the complete table as follows:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer;
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28218.00 rows=300000 width=559)
(1 row)
2. As expected, the preceding query performed a full-table sequential scan.
Let's add a predicate to the query and see the execution plan:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE
c_city = 'San Mateo';
                               QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_customer  (cost=4.49..40.09 rows=9 width=559)
   Recheck Cond: ((c_city)::text = 'San Mateo'::text)
   ->  Bitmap Index Scan on city_idx  (cost=0.00..4.49 rows=9 width=0)
         Index Cond: ((c_city)::text = 'San Mateo'::text)
(4 rows)
3. As we are only fetching the San Mateo city records from the bmsql_customer
table, the optimizer decided to go with an index scan rather than a sequential
scan. Let's force this query to take a sequential scan over an index scan, by
increasing the random_page_cost value to some higher number:
benchmarksql=# SET random_page_cost TO 10000;
SET
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE c_city = 'San Mateo';
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on bmsql_customer  (cost=0.00..28961.00 rows=9 width=557)
   Filter: ((c_city)::text = 'San Mateo'::text)
(2 rows)
From the preceding plan, the cost of the sequential scan is 28961, while the cost of the
earlier index-based plan is 40.09. That is why the optimizer originally chose the index
scan over the sequential scan. The sequential scan plan above is also what we would get if
there were no index on the c_city column.
How it works…
Once the plan is generated for the submitted SQL query, Postgres starts executing the plan
node types. A sequential scan fetches each heap tuple from the relation, checks it against
the given predicates in the SQL query, and counts the number of tuples it reads. Once the
sequential scan has finished fetching all the matching records from the table, Postgres
keeps the visited pages in shared buffers and records the number of tuples it processed in
the statistics catalogs. A simple pictorial representation of a sequential scan is as
follows:
Running bitmap heap and index scan
In this recipe, we will be discussing bitmap heap scans and index scans.
Getting ready
PostgreSQL does not support creating bitmap indexes on tables. However, it can build an
in-memory bitmap while scanning an index, which is then used during the table scan.
PostgreSQL does not build a bitmap for every index scan; it only does so when the number
of rows the query is expected to fetch is high enough. This bitmap is unique to each query
execution, and its scope ends with the query execution.
A bitmap heap scan is always the parent node type of a bitmap index scan; it takes the
bitmap as input, sorts the referenced pages into physical table page order, and then
fetches the tuples from the relation.
How to do it…
1. For demonstration purposes, let's consider the following example, which
generates the bitmap heap scan:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE
c_city = 'San Mateo';
                               QUERY PLAN
--------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_customer  (cost=4.49..40.09 rows=9 width=559)
   Recheck Cond: ((c_city)::text = 'San Mateo'::text)
   ->  Bitmap Index Scan on city_idx  (cost=0.00..4.49 rows=9 width=0)
         Index Cond: ((c_city)::text = 'San Mateo'::text)
(4 rows)
2. From the preceding plan, the optimizer expects to fetch 9 rows for the city
San Mateo from the index. Let's add the ANALYZE option to the preceding
statement and see how many rows we actually get in the final result:
benchmarksql=# EXPLAIN ANALYZE SELECT * FROM bmsql_customer
WHERE c_city = 'San Mateo';
                                     QUERY PLAN
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_customer  (cost=4.49..40.09 rows=9 width=559) (actual time=4.243..4.243 rows=0 loops=1)
   Recheck Cond: ((c_city)::text = 'San Mateo'::text)
   ->  Bitmap Index Scan on city_idx  (cost=0.00..4.49 rows=9 width=0) (actual time=4.239..4.239 rows=0 loops=1)
         Index Cond: ((c_city)::text = 'San Mateo'::text)
 Planning time: 0.092 ms
 Execution time: 4.278 ms
(6 rows)
3. From the preceding results, the bmsql_customer table does not actually contain
any rows where the city is San Mateo. The estimated number of rows is 9, but
the actual number of rows is 0. If PostgreSQL does not have enough statistical
information, we can expect these kinds of estimation mismatches. Let's increase
default_statistics_target to its maximum and run the same query:
benchmarksql=# SET default_statistics_target TO 10000;
SET
benchmarksql=# ANALYZE bmsql_customer ;
ANALYZE
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE
c_city = 'San Mateo';
                                   QUERY PLAN
------------------------------------------------------------------------------------
 Index Scan using city_idx on bmsql_customer  (cost=0.42..28.53 rows=6 width=558)
   Index Cond: ((c_city)::text = 'San Mateo'::text)
(2 rows)
4. From the preceding demonstration, after increasing the statistics target, the
optimizer chooses the Index Scan over the Bitmap Index Scan node type.
This is because there are no longer enough estimated tuples to make building the
bitmap worthwhile.
It is not recommended to set the default_statistics_target value to
its maximum, as the analyze operation will then take much more time to
generate the statistics for each table. It will also increase the time needed to
generate the execution plan of a query; a per-column setting, as sketched
below, is usually the better tool.
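A less intrusive sketch, raising the statistics target only for the column whose estimates
are off, instead of changing the global default:

    benchmarksql=# ALTER TABLE bmsql_customer ALTER COLUMN c_city SET STATISTICS 1000;
    ALTER TABLE
    benchmarksql=# ANALYZE bmsql_customer;
    ANALYZE

Only the c_city histogram grows, so planning time and ANALYZE time for the rest of the
database stay unaffected.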
How it works…
For demonstration purposes, let's add another predicate to the preceding SQL query, and
see the explain plan:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE c_city = 'San
Mateo' or c_city='San Francisco';
                                         QUERY PLAN
----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_customer  (cost=8.94..56.34 rows=12 width=558)
   Recheck Cond: (((c_city)::text = 'San Mateo'::text) OR ((c_city)::text = 'San Francisco'::text))
   ->  BitmapOr  (cost=8.94..8.94 rows=12 width=0)
         ->  Bitmap Index Scan on city_idx  (cost=0.00..4.47 rows=6 width=0)
               Index Cond: ((c_city)::text = 'San Mateo'::text)
         ->  Bitmap Index Scan on city_idx  (cost=0.00..4.47 rows=6 width=0)
               Index Cond: ((c_city)::text = 'San Francisco'::text)
(7 rows)
In the preceding example, we are fetching all the customers whose city is either
San Mateo or San Francisco. As per the preceding plan, the optimizer has built a
bitmap for each city from the city_idx index, and these two bitmaps are merged using the
BitmapOr operation, whose result is forwarded to the bitmap heap scan. Once the combined
bitmap is ready, the bitmap heap scan sorts it into physical page order and then fetches
the records from the relation sequentially. In the preceding example, the optimizer
created a separate bitmap for each city; we can reduce this number by using the IN
operator, since the bitmap heap scan cost grows with the number of bitmaps that have to be
built and combined. Let's observe the following query, which retrieves the same results as
the previous example:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_customer WHERE c_city IN
('San Mateo','San Francisco');
                                     QUERY PLAN
----------------------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_customer  (cost=8.94..56.30 rows=12 width=558)
   Recheck Cond: ((c_city)::text = ANY ('{"San Mateo","San Francisco"}'::text[]))
   ->  Bitmap Index Scan on city_idx  (cost=0.00..8.94 rows=12 width=0)
         Index Cond: ((c_city)::text = ANY ('{"San Mateo","San Francisco"}'::text[]))
(4 rows)
The following is a sample pictorial representation of a bitmap heap scan:
As illustrated in the preceding figure, the index will be scanned to generate a bitmap index
in buffers. Later, the generated bitmap index will be used to fetch the required tuples from
the table.
Aggregate and hash aggregate
In this recipe, we will be discussing aggregate and hash aggregate mechanisms in
PostgreSQL.
Getting ready
Aggregate is a node type that only evaluates the aggregate operators. Some of the aggregate
operators are SUM, MIN, MAX, and so on.
Hash aggregate is a node type that requires an aggregate operator, and a group key column.
In general, we see this node type being utilized during the GROUP BY, DISTINCT, or set
operations.
How to do it…
Aggregate
1. To demonstrate the aggregate behavior, let's query the benchmarksql database as follows:
benchmarksql=# EXPLAIN SELECT max(i_price) FROM bmsql_item;
                               QUERY PLAN
-------------------------------------------------------------------------
 Aggregate  (cost=2549.00..2549.01 rows=1 width=6)
   ->  Seq Scan on bmsql_item  (cost=0.00..2299.00 rows=100000 width=6)
(2 rows)
2. From the preceding plan, as expected, we got the Aggregate node type, fed by a
full-table sequential scan. PostgreSQL will also choose different node types for
the aggregate operators when it can produce the result in less time. Let's create
a descending-order index on the i_price column and see the execution plan for
the same query:
benchmarksql=# CREATE INDEX pric_idx ON bmsql_item(i_price
DESC);
CREATE INDEX
benchmarksql=# EXPLAIN SELECT max(i_price) FROM bmsql_item;
                                        QUERY PLAN
--------------------------------------------------------------------------------------------
 Result  (cost=0.32..0.33 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.29..0.32 rows=1 width=6)
           ->  Index Only Scan using pric_idx on bmsql_item  (cost=0.29..2854.29 rows=100000 width=6)
                 Index Cond: (i_price IS NOT NULL)
(5 rows)
3. As we created the index on the i_price column in descending order, the index's
first entry will always hold the maximum value and its last entry the minimum
value. This is why the preceding query uses an Index Only Scan: it only needs
to fetch a single tuple from the index. Let's change the same query to fetch the
minimum price item and evaluate the execution plan:
benchmarksql=# EXPLAIN SELECT min(i_price) FROM bmsql_item;
                                        QUERY PLAN
--------------------------------------------------------------------------------------------
 Result  (cost=0.32..0.33 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.29..0.32 rows=1 width=6)
           ->  Index Only Scan Backward using pric_idx on bmsql_item  (cost=0.29..2854.29 rows=100000 width=6)
                 Index Cond: (i_price IS NOT NULL)
(5 rows)
4. From the preceding plan, the index scan is performed backwards to fetch the
minimum value, since the index was created in descending order. If we run both
the max and min operations in a single query, we get both the forward and
backward index scans in a single execution plan, as sketched below.
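A quick way to see this for yourself (the exact cost numbers depend on your data, so the
output is omitted here):

    benchmarksql=# EXPLAIN SELECT min(i_price), max(i_price) FROM bmsql_item;

The resulting plan should contain two InitPlans, one driven by a forward Index Only Scan
on pric_idx and the other by a backward one.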
Hash aggregate
1. To demonstrate the hash aggregate behavior, let's query benchmarksql as
follows:
benchmarksql=# EXPLAIN SELECT COUNT(*), c_zip FROM
bmsql_customer GROUP BY c_zip;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 HashAggregate  (cost=29718.00..29817.39 rows=9939 width=10)
   Group Key: c_zip
   ->  Seq Scan on bmsql_customer  (cost=0.00..28218.00 rows=300000 width=10)
(3 rows)
2. In the preceding example, we used a GROUP BY operation in the SQL, which
makes the optimizer choose one of the aggregate strategies. In the preceding
plan, the optimizer chose the HashAggregate node type to evaluate the data
set. PostgreSQL can also prefer another approach, called GroupAggregate,
which requires the input data set to be sorted. Let's observe the following SQL
statements:
benchmarksql=# CREATE INDEX zip_idx ON bmsql_customer(c_zip);
CREATE INDEX
benchmarksql=# EXPLAIN SELECT COUNT(*), c_zip FROM
bmsql_customer GROUP BY c_zip;
                                     QUERY PLAN
--------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..10731.97 rows=9955 width=10)
   Group Key: c_zip
   ->  Index Only Scan using zip_idx on bmsql_customer  (cost=0.42..9132.42 rows=300000 width=10)
(3 rows)
As we created an index on the c_zip column, PostgreSQL prefers the group aggregate over
the hash aggregate, since it only needs an index scan. With a hash aggregate, the
optimizer needs a full table scan, and memory is also required to build the hash table.
How it works…
Aggregate
In PostgreSQL, aggregates are implemented based on multiple state transition functions.
We can fetch these functions for each aggregate operator by querying the pg_aggregate
catalog table. As we used MAX aggregate functions in the previous example, let's get the
internal transition state functions for this operator:
postgres=# SELECT * FROM pg_aggregate WHERE aggfnoid::text ~ 'max' LIMIT 1;
-[ RECORD 1 ]----+----------------
aggfnoid         | pg_catalog.max
aggkind          | n
aggnumdirectargs | 0
aggtransfn       | int8larger
aggfinalfn       |
aggmtransfn      |
aggminvtransfn   |
aggmfinalfn      |
aggfinalextra    | f
aggmfinalextra   | f
aggsortop        | 413
aggtranstype     | 20
aggtransspace    | 0
aggmtranstype    | 0
aggmtransspace   | 0
agginitval       |
aggminitval      |
From the preceding output, the transition function for the MAX aggregate is int8larger,
and there is no final state transition function for this aggregate.
State transition functions
To perform any aggregate operation, a set of state transition functions is required. Let's
discuss the MAX aggregate function's behavior. To find the maximum value in a column,
PostgreSQL needs to fetch each tuple and compare it with the previous maximum value; if
the recently fetched tuple value is bigger than the previous one, the maximum is replaced
with the latest value. This process repeats until there are no more tuples to fetch.
During this iteration, the maximum value state keeps changing according to the latest
maximum value, which is why these functions are called state transition functions. Each
aggregate function has its own set of state transition functions.
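To make this concrete, the same int8larger transition function seen in the pg_aggregate
output can be wired into a hypothetical user-defined aggregate (a sketch; my_max is not
part of PostgreSQL):

    benchmarksql=# CREATE AGGREGATE my_max(int8) (
                       SFUNC = int8larger,   -- state transition function
                       STYPE = int8          -- type of the running state value
                   );
    CREATE AGGREGATE
    benchmarksql=# SELECT my_max(x) FROM generate_series(1, 5) AS t(x);   -- returns 5

Each row is folded into the state by int8larger, and because no FINALFUNC is given, the
final state value itself is returned, exactly like the built-in MAX.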
Some aggregate functions need another function, called the final state function, to
complete the aggregate operation. The best example of this kind of aggregate is AVG.
Let's get the AVG transition functions from the pg_aggregate table:
postgres=# SELECT * FROM pg_aggregate WHERE aggfnoid::text ~ 'avg' LIMIT 1;
-[ RECORD 1 ]----+-------------------
aggfnoid         | pg_catalog.avg
aggkind          | n
aggnumdirectargs | 0
aggtransfn       | int8_avg_accum
aggfinalfn       | numeric_poly_avg
aggmtransfn      | int8_avg_accum
aggminvtransfn   | int8_avg_accum_inv
aggmfinalfn      | numeric_poly_avg
aggfinalextra    | f
aggmfinalextra   | f
aggsortop        | 0
aggtranstype     | 2281
aggtransspace    | 48
aggmtranstype    | 2281
aggmtransspace   | 48
agginitval       |
aggminitval      |
From the preceding results, the final state function is the internal numeric_poly_avg
function. It uses the result of the state transition function int8_avg_accum, which
accumulates the sum of all tuple values and the count of tuples processed. The
numeric_poly_avg function then uses this sum and count to compute the final average
value as sum divided by count.
These transition functions also serve the moving aggregates. Moving aggregates are
nothing but aggregate operators used with window functions. An example of a moving
aggregate is as follows:
benchmarksql=# EXPLAIN SELECT c_zip, COUNT(*) OVER(PARTITION BY
c_zip) FROM bmsql_customer;
                                  QUERY PLAN
--------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.42..13632.42 rows=300000 width=10)
   ->  Index Only Scan using zip_idx on bmsql_customer  (cost=0.42..9132.42 rows=300000 width=10)
(2 rows)
The MAX aggregate function does not need a final state function, as it does not need to
track the number of tuples it has processed. That is why aggregate functions such as MAX,
MIN, SUM, COUNT, and so on have an empty aggfinalfn column value.
Refer to the following documentation for more information about
aggregate behavior: https://www.postgresql.org/docs/9.6/static/sql-createaggregate.html.
Hash aggregates
During a hash aggregate, PostgreSQL builds a hash table keyed on the given group key
columns. Once the hash table is prepared, the state transition functions are executed for
each bucket, and the aggregate result of every bucket is returned. To build this hash
table, PostgreSQL uses the work_mem memory area; if it estimates that there is not enough
memory to keep the hash table, the optimizer prefers the group aggregate node type
instead.
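A quick way to observe this trade-off is to shrink work_mem to its minimum and re-run the
earlier GROUP BY query (a sketch; the exact plan chosen still depends on the available
indexes and statistics):

    benchmarksql=# SET work_mem = '64kB';
    SET
    benchmarksql=# EXPLAIN SELECT COUNT(*), c_zip FROM bmsql_customer GROUP BY c_zip;

With so little memory for the hash table, the planner should switch away from
HashAggregate, typically to a GroupAggregate fed by sorted input. Remember to RESET
work_mem afterwards.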
Running CTE scan
In this recipe, we will be discussing CTE (Common Table Expression) scans.
Getting ready
CTEs are named inline queries, defined using the WITH clause, which can be referenced
multiple times in the same query. We can also implement recursive and DML queries using
CTEs. CTEs improve query readability compared with inline subqueries. CTEs also provide
the RECURSIVE option, which takes their usage to another level: using recursive CTE
statements, we can build hierarchical SQL queries where the SQL execution refers to its
own results for further processing.
How to do it…
1. Let's get all the costly products from the sample dataset as follows:
benchmarksql=# EXPLAIN WITH costly_products AS (
SELECT i_name FROM bmsql_item WHERE i_price BETWEEN 10 AND 100
) SELECT * FROM costly_products;
                                  QUERY PLAN
------------------------------------------------------------------------------
 CTE Scan on costly_products  (cost=2799.00..4619.68 rows=91034 width=40)
   CTE costly_products
     ->  Seq Scan on bmsql_item  (cost=0.00..2799.00 rows=91034 width=26)
           Filter: ((i_price >= '10'::numeric) AND (i_price <= '100'::numeric))
(4 rows)
2. The preceding sample query does not actually require a CTE; however, for
demonstration purposes we've divided the SQL query into two sections, where
one section gets all the data from the base table and the other just scans and
returns the records it received from the previous node type. We can also create
multiple CTEs in the same SQL statement as comma-separated expressions, and
use them like regular tables in joins, as shown in the following example:
postgres=# WITH costly_products AS (
SELECT i_name FROM bmsql_item WHERE i_price BETWEEN 10 AND 100
),
not_costly_products AS (
SELECT i_name FROM bmsql_item WHERE i_price BETWEEN 1 AND 9)
SELECT COUNT(*), 'costly products' FROM costly_products
UNION
SELECT COUNT(*), 'not costly products' FROM not_costly_products;
 count |      ?column?
-------+---------------------
  8018 | not costly products
 90986 | costly products
(2 rows)
3. By using CTEs, we can implement recursive queries that keep processing the data
recursively until the given condition fails. Let's look at the following sample,
which produces recursive output:
postgres=# WITH RECURSIVE recur(i) AS (SELECT 1 UNION SELECT
i+1 FROM recur WHERE i<5) SELECT * FROM recur;
 i
---
 1
 2
 3
 4
 5
(5 rows)
From the preceding example, we just returned the numbers from 1 to 5 without using the
PostgreSQL generate_series function.
How it works…
Let us discuss how CTE scan and recursive union works in PostgreSQL:
CTE scan
We can think of CTEs as in-memory temporary tables that are created from the given
expression. Once this in-memory table is created, it can be scanned multiple times, as per
its usage in the SQL statement. To create this in-memory table, PostgreSQL uses the
work_mem area, and the scope of the table ends with the query execution.
As we discussed previously, CTEs improve query readability, but at the same time they
have their own limitations. At the moment, CTEs act as optimizer fences; that is, they
cannot be rewritten and flattened like inline subqueries. Let's see the following example,
where the optimizer fails to rewrite the SQL:
optimizer fails to rewrite the SQL:
benchmarksql=# EXPLAIN ANALYZE WITH costly_products AS (
SELECT i_name,i_price FROM bmsql_item
) SELECT * FROM costly_products WHERE i_price BETWEEN 10 AND 100 ;
                                      QUERY PLAN
-----------------------------------------------------------------------------------------------
 CTE Scan on costly_products  (cost=2299.00..4799.00 rows=500 width=40) (actual time=0.026..174.884 rows=90991 loops=1)
   Filter: ((i_price >= '10'::numeric) AND (i_price <= '100'::numeric))
   Rows Removed by Filter: 9009
   CTE costly_products
     ->  Seq Scan on bmsql_item  (cost=0.00..2299.00 rows=100000 width=26) (actual time=0.011..37.172 rows=100000 loops=1)
 Planning time: 0.160 ms
 Execution time: 199.404 ms
(7 rows)
Time: 200.127 ms
From the preceding plan, the given predicate is applied on the costly_products CTE. That
is, the planner first populates the whole bmsql_item table into memory and only then
applies the given predicate. Now let's observe the plan of the equivalent SQL written with
an inline subquery, which produces the same results:
benchmarksql=# EXPLAIN ANALYZE SELECT * FROM (SELECT i_name,
i_price FROM bmsql_item) as costly_products WHERE i_price BETWEEN
10 AND 100 ;
                                      QUERY PLAN
-----------------------------------------------------------------------------------------------
 Seq Scan on bmsql_item  (cost=0.00..2799.00 rows=91177 width=26) (actual time=0.011..79.378 rows=90991 loops=1)
   Filter: ((i_price >= '10'::numeric) AND (i_price <= '100'::numeric))
   Rows Removed by Filter: 9009
 Planning time: 0.340 ms
 Execution time: 85.402 ms
(5 rows)
Time: 86.286 ms
From the preceding plan, the optimizer rewrote the query by pushing the outer query's
predicate down into the inline subquery. That is, the rows are filtered during the tuple
fetch operation itself, which is why we get a better query response time for the inline
subquery.
Recursive union
This is the node type the optimizer chooses when dealing with recursive CTE queries. The
general shape of a recursive query is as follows:
WITH RECURSIVE recur AS (
    Initial values for the recursion (the non-recursive term)
    UNION
    Recursive statement, which runs until its condition no longer matches
)
SELECT * FROM recur;
To process this recursive union operation, this node type needs two temporary tables:
an intermediate table and a work table. The work table is initially assigned the values
of the non-recursive term. Using this work table, the node type generates an intermediate
table and appends the intermediate table results to the work table again. This process
keeps executing until the intermediate table result is empty. Once the intermediate table
has no entries to append to the work table, this node type returns the work table as the
final result. Let's see the following example, which demonstrates the workflow of
recursive queries:
WITH RECURSIVE recur(i) AS (
SELECT 1
UNION
SELECT i+1 FROM recur WHERE i<5) SELECT * FROM recur;
Loop count   Work Table   Intermediate Table
1            1            1 < 5 condition true, i = i + 1 => i = 2; appended to work table
2            1 2          2 < 5 condition true, i = i + 1 => i = 3; appended to work table
3            1 2 3        3 < 5 condition true, i = i + 1 => i = 4; appended to work table
4            1 2 3 4      4 < 5 condition true, i = i + 1 => i = 5; appended to work table
5            1 2 3 4 5    5 < 5 condition fails
Let's observe the following explain plan for the same query:
benchmarksql=# EXPLAIN ANALYZE WITH RECURSIVE recur(i) AS (SELECT 1
UNION SELECT i+1 FROM recur WHERE i<5) SELECT * FROM recur;
QUERY PLAN
--------------------------------------------------------------------------
 CTE Scan on recur (cost=2.95..3.57 rows=31 width=4) (actual
time=0.010..0.037 rows=5 loops=1)
CTE recur
-> Recursive Union (cost=0.00..2.95 rows=31 width=4) (actual
time=0.006..0.031 rows=5 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual
time=0.001..0.002 rows=1 loops=1)
-> WorkTable Scan on recur recur_1 (cost=0.00..0.23 rows=3
width=4) (actual time=0.002..0.003 rows=1 loops=5)
Filter: (i < 5)
Rows Removed by Filter: 0
Planning time: 0.108 ms
Execution time: 0.088 ms
(9 rows)
Time: 0.736 ms
From the preceding plan, the WorkTable Scan node type was executed with a loop count of
five (loops=5), as shown in the preceding snippet. Like plain CTEs, recursive queries
also have limitations: they do not allow us to use aggregate functions in the recursive
term, as sketched below.
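As an illustration of that limitation (a hypothetical sketch, not part of the benchmark
runs), wrapping the recursive term in an aggregate is rejected outright:
postgres=# WITH RECURSIVE recur(i) AS (
    SELECT 1
    UNION
    SELECT max(i) + 1 FROM recur WHERE i < 5)
SELECT * FROM recur;
PostgreSQL raises an error here, to the effect that aggregate functions are not allowed
in a recursive query's recursive term.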
Nesting loops
In this recipe, we will be discussing the nesting loops mechanism in PostgreSQL.
Getting ready
Nesting loops is one of the table joining mechanisms, in which PostgreSQL joins two
different datasets based on a join condition. The name itself describes the mechanism:
a loop inside another loop. The outer loop holds one dataset and compares each of its
tuples with the dataset held by the inner loop. That is, if the outer loop has N tuples
and the inner loop has M tuples, then the nested loop performs N * M comparisons to
produce the output.
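If the planner does not pick a nested loop on your own data while you follow along, a
quick way to experiment (a sketch using only standard planner settings, to be reset
afterwards) is to disable the other join methods for the session:
benchmarksql=# SET enable_hashjoin TO off;
benchmarksql=# SET enable_mergejoin TO off;
benchmarksql=# EXPLAIN SELECT w_name, c_first FROM bmsql_warehouse,
               bmsql_customer WHERE w_id = c_w_id;
benchmarksql=# RESET enable_hashjoin;
benchmarksql=# RESET enable_mergejoin;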
How to do it…
1. For demonstrating the nesting loops, let's run a join query at benchmarksql to
retrieve the list of warehouse names along with the customer name that got the
product from that warehouse:
benchmarksql=# EXPLAIN SELECT w_name, c_first FROM bmsql_warehouse,
bmsql_customer WHERE w_id=c_w_id;
QUERY PLAN
------------------------------------------------------------------------
 Nested Loop (cost=0.42..13743.32 rows=300000 width=21)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10 rows=10
width=12)
-> Index Only Scan using bmsql_customer_idx1 on bmsql_customer
(cost=0.42..1074.22 rows=30000 width=17)
Index Cond: (c_w_id = bmsql_warehouse.w_id)
(4 rows)
2. It seems that the optimizer chooses the nested loops mechanism to join these two
tables based on the available statistics of these two relations. From the preceding
explain plan, the Seq Scan node type is an outer loop and the Index Only Scan is
an inner loop. That is, the inner loop Index Only Scan will be executed for each
tuple that it read from the Seq Scan node type. From the preceding plan, the
estimated number of tuples from Seq Scan is 10. So the Index Only Scan will be
executed 10 times. To confirm these loop counts, let's add the ANALYZE operation
along with EXPLAIN as follows:
benchmarksql=# EXPLAIN ANALYZE SELECT w_name,c_first FROM
bmsql_warehouse, bmsql_customer WHERE w_id=c_w_id;
QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop (cost=0.42..13743.32 rows=300000 width=21) (actual
time=0.039..165.479 rows=300000 loops=1)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10 rows=10 width=12)
(actual time=0.004..0.016 rows=10 loops=1)
-> Index Only Scan using bmsql_customer_idx1 on bmsql_customer
(cost=0.42..1074.22 rows=30000 width=17) (actual
time=0.016..10.889 rows=30000 loops=10)
Index Cond: (c_w_id = bmsql_warehouse.w_id)
Heap Fetches: 0
Planning time: 0.374 ms
Execution time: 182.885 ms
(7 rows)
From the preceding results, as expected, the inner loop Index Only Scan is executed 10
times, as the number of tuples in the outer loop is 10. Also, Heap Fetches: 0 shows how
many tuples the Index Only Scan had to fetch from the relation itself because the table's
visibility map did not carry enough information; here it never had to touch the heap.
How it works…
To understand the nested loop behavior, let's divide the work it does into two phases.
Phase I
Let's call this phase the outer loop phase. In the explain plan, this node type appears
as the first child of the Nested Loop node, and in the preceding plan it is the Seq Scan.
In this phase, it reads the bmsql_warehouse table using a sequential scan and fetches
10 rows.
Phase II
Let's call this phase the inner loop. In the explain plan, this node type appears
immediately after the outer loop node, and in the preceding plan it is the Index Only
Scan. In this phase, it takes rows one by one from the outer section, compares them
against the bmsql_customer_idx1 index, and produces the matching rows.
The following is a pictorial representation of these phases:
Now you might ask why the planner chooses bmsql_warehouse as the outer loop, and what
causes the bmsql_customer table to be chosen as the inner loop. If you assume the reason
is that bmsql_warehouse has fewer rows than bmsql_customer, that assumption is not
correct. To see why, let's observe the following example:
benchmarksql=# EXPLAIN SELECT * FROM test t1, test2 t2 WHERE t1.t=t2.t;
QUERY PLAN
------------------------------------------------------------------
 Nested Loop (cost=0.14..18455.00 rows=109 width=8)
-> Seq Scan on test t1 (cost=0.00..1443.00 rows=100000 width=4)
-> Index Only Scan using test2_idx on test2 t2 (cost=0.14..0.16
rows=1 width=4)
Index Cond: (t = t1.t)
(4 rows)
From the preceding example, the planner chose the table with the higher estimated number
of rows as the outer loop dataset, and the table with the lower estimated number of rows
as the inner loop. The real reason for choosing the test2 table as the inner loop is the
index on that table: the PostgreSQL planner prefers to run the inner loop with an index
scan whenever possible, since this gives faster data access than repeated sequential
scans. That is also why, in the earlier example, the planner chose bmsql_warehouse as the
outer loop dataset and bmsql_customer, with its bmsql_customer_idx1 index, as the inner
loop.
Working with hash and merge join
In this recipe, we will be discussing merge and hash join mechanisms in PostgreSQL.
Getting ready
Merge join is another approach to performing a join between two datasets. The PostgreSQL
optimizer will generally choose this join method for equi-joins or for union operations.
To perform this join, the two join key columns must be sorted first, and then the join
condition is evaluated. The optimizer prefers this node type when joining huge tables.
Hash join is another joining approach. In general, it is fast as long as the server has
enough memory resources. To perform this join, PostgreSQL does not need any sorted input.
Instead, it takes one table's data to build a hash table, which it then compares against
the other table's tuples. The optimizer generally chooses this join method for equi-joins,
and prefers this node type when joining medium-sized tables.
How to do it…
Let us generate simple hash and merge join plans, as demonstrated in the following
examples:
Hash join
To demonstrate the hash join behavior, let's query the benchmarksql database to retrieve
all the customer IDs along with the amount paid for all the product deliveries:
benchmarksql=# EXPLAIN SELECT SUM(h_amount), c_id FROM bmsql_history,
bmsql_customer WHERE h_c_id = c_id GROUP BY c_id;
QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate (cost=561095.19..561132.69 rows=3000 width=9)
Group Key: bmsql_customer.c_id
-> Hash Join (cost=11453.42..411194.93 rows=29980051 width=9)
Hash Cond: (bmsql_customer.c_id = bmsql_history.h_c_id)
-> Index Only Scan using bmsql_customer_pkey on
bmsql_customer (cost=0.42..9132.42 rows=300000 width=4)
-> Hash (cost=6238.00..6238.00 rows=300000 width=9)
-> Seq Scan on bmsql_history (cost=0.00..6238.00
rows=300000 width=9)
(7 rows)
From the preceding example, the optimizer chose the hash join to fetch the matching
records from the bmsql_history and bmsql_customer tables. To build the hash table in
memory, the optimizer needs to scan the whole table as per the join condition. Once the
hash table is built, each row from the bmsql_customer table will probe the hash table with
the join condition. Hash join is usually preferred by the optimizer when we join two
tables where one is small and the other is large. Also, to build the hash table, the
optimizer prefers the smaller table or the lower-cost node type result, since that
reduces memory usage and the number of hash buckets.
Merge join
To demonstrate the merge join behavior, let's query the benchmarksql database to retrieve
the warehouse name for each district, as follows:
benchmarksql=# EXPLAIN SELECT w_name, d_name FROM bmsql_district,
bmsql_warehouse WHERE w_id = d_w_id;
QUERY PLAN
------------------------------------------------------------------
 Merge Join (cost=7.59..9.14 rows=100 width=17)
Merge Cond: (bmsql_warehouse.w_id = bmsql_district.d_w_id)
-> Sort (cost=1.27..1.29 rows=10 width=12)
Sort Key: bmsql_warehouse.w_id
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10 rows=10
width=12)
-> Sort (cost=6.32..6.57 rows=100 width=13)
Sort Key: bmsql_district.d_w_id
-> Seq Scan on bmsql_district (cost=0.00..3.00 rows=100
width=13)
(8 rows)
For the preceding query, the optimizer would actually choose a hash join plan, since the
dataset is small. To demonstrate the merge join behavior, I have forced the planner to
choose a merge join by disabling the optimizer setting enable_hashjoin. In general, merge
join comes into the picture while joining huge datasets.
As we discussed, merge join needs sorted datasets as input to produce the matching
results. In the preceding plan output, the planner therefore performed a sort on both
join keys and fed both sorted results to the merge join node type. The exact commands
used to force this plan are sketched below.
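The commands used to coax out the preceding plan are roughly the following sketch
(session-scoped settings only, reset once you are done):
benchmarksql=# SET enable_hashjoin TO off;
benchmarksql=# EXPLAIN SELECT w_name, d_name FROM bmsql_district,
               bmsql_warehouse WHERE w_id = d_w_id;
benchmarksql=# RESET enable_hashjoin;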
How it works…
Let us see how hash and merge joins work in PostgreSQL:
Hash join
Let's divide the hash join's internal implementation into two phases. In phase I, it
chooses the candidate table to build the hash table, and in phase II, it probes the hash
table with the other table's rows.
Phase I
In this phase, PostgreSQL identifies the candidate table on which to build the hash
table. In the preceding example, the optimizer chooses the bmsql_history table as the
candidate. This is because that table has fewer relpages than bmsql_customer, which is
why the Seq Scan on bmsql_history has the lower estimated cost. Let's get the number of
relpages of these two tables from the pg_class catalog as follows:
benchmarksql=# SELECT relpages, relname FROM pg_class WHERE relname IN
('bmsql_history', 'bmsql_customer');
 relpages |    relname
----------+----------------
     3238 | bmsql_history
    25218 | bmsql_customer
(2 rows)
Let's increase seq_page_cost to a higher value and see how the optimizer then chooses the
candidate table for the hash:
benchmarksql=# SET seq_page_cost TO 10;
SET
benchmarksql=# EXPLAIN SELECT SUM(h_amount), c_id FROM bmsql_history,
bmsql_customer WHERE h_c_id = c_id GROUP BY c_id;
QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate (cost=641453.19..641490.69 rows=3000 width=9)
Group Key: bmsql_customer.c_id
-> Hash Join (cost=24602.42..491552.93 rows=29980051 width=9)
Hash Cond: (bmsql_history.h_c_id = bmsql_customer.c_id)
-> Seq Scan on bmsql_history (cost=0.00..35380.00
rows=300000 width=9)
-> Hash (cost=9132.42..9132.42 rows=300000 width=4)
-> Index Only Scan using bmsql_customer_pkey on
bmsql_customer (cost=0.42..9132.42 rows=300000 width=4)
(7 rows)
From the preceding example, the Index Only Scan on the bmsql_customer table now has a
lower cost than the Seq Scan on the bmsql_history table. That's why the optimizer chose
bmsql_customer as the candidate table to build the hash. To hold this hash table,
PostgreSQL uses the configured work_mem memory area. If the hash table does not fit into
work_mem, it processes the hash probes in multiple batches, creating temporary files for
each batch, as sketched below.
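To watch this batching on the same query, a sketch is to shrink work_mem and run the join
under EXPLAIN ANALYZE; the Hash node then reports its Buckets, Batches and Memory Usage
figures (the exact wording varies by version), and a Batches value greater than 1 means
the hash table spilled into temporary files:
benchmarksql=# SET work_mem TO '64kB';
benchmarksql=# EXPLAIN ANALYZE SELECT SUM(h_amount), c_id FROM bmsql_history,
               bmsql_customer WHERE h_c_id = c_id GROUP BY c_id;
benchmarksql=# RESET work_mem;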
Phase II
In this phase, PostgreSQL reads the tuples from the other table using the selected node
type and generates a hash value for each using a hash function. If the generated hash
value matches one of the hash bucket entries, that tuple is treated as a matching record.
Once it has probed all the tuples, it returns all the matching records. The following is
a simple pictorial representation of the hash join:
Merge join
Let's divide this merge join operation into two phases. In phase I, PostgreSQL does sort
operations on both join key values. In phase II, PostgreSQL compares the two sorted results
in parallel.
Phase I
In this phase, PostgreSQL will utilize the work_mem memory area to sort the two join keys
one after the other. If any join key column has any sorted index, then the optimizer prefers
to choose the index scan over a sequential scan and sort.
Phase II
Once Postgres has the two sorted results, it scans them side by side, as shown in the
following diagram. Pointer 1 only moves to its next value when its value does not match
Pointer 2. When the Pointer 1 value equals the Pointer 2 value, a matching tuple is
produced, and Pointer 2 keeps moving to its next value as long as it still matches
Pointer 1. This process repeats until Pointer 1 reaches the end, returning all the
matching records. In this way, the two sorted results are processed together in a single
pass. The following is a simple pictorial representation of merge join:
Merge join always produces sorted results, whereas hash join does not provide sorted
output. This is one major advantage of a merge join plan over a hash join plan.
Grouping
In this recipe, we will be discussing the optimizer node types that are chosen for
GROUP BY operations.
Getting ready
As we discussed group and aggregate operations in the previous recipe, grouping is
performed based on the group key list. The PostgreSQL optimizer chooses a hash aggregate
when it finds enough memory; if not, a group aggregate is the option. Unlike a hash
aggregate, the group aggregate operation needs the data to be sorted. If the group
columns already have a sorted index, then a group aggregate may be chosen over the hash
aggregate to reduce memory usage.
How to do it…
1. To demonstrate group aggregate, let's run the query in the benchmarksql
database to get the count of customers, grouped by their city:
benchmarksql=# EXPLAIN SELECT COUNT(*), c_city FROM
bmsql_customer GROUP BY c_city;
QUERY PLAN
--------------------------------------------------------------------
 GroupAggregate (cost=0.42..11570.20 rows=33378 width=16)
Group Key: c_city
-> Index Only Scan using city_idx on bmsql_customer
(cost=0.42..9736.42 rows=300000 width=16)
(3 rows)
2. From the preceding plan, the optimizer chooses the GroupAggregate as the
c_city column is already indexed. Let's run the same query, by dropping the
index:
benchmarksql=# DROP INDEX city_idx ;
DROP INDEX
benchmarksql=# EXPLAIN SELECT COUNT(*), c_city FROM
bmsql_customer GROUP BY c_city;
QUERY PLAN
------------------------------------------------------------------
 HashAggregate (cost=29718.00..30053.29 rows=33529 width=15)
Group Key: c_city
-> Seq Scan on bmsql_customer (cost=0.00..28218.00 rows=300000
width=15)
(3 rows)
3. As we expected, after dropping the index, the optimizer chooses the
HashAggregate node type, which will utilize the work_mem area to perform the
hash aggregate operations. Let's run the same query, by reducing the work_mem
setting as follows:
benchmarksql=# SET work_mem TO '64kB';
SET
benchmarksql=# EXPLAIN SELECT COUNT(*), c_city FROM
bmsql_customer GROUP BY c_city;
QUERY PLAN
--------------------------------------------------------------------
 GroupAggregate (cost=70892.40..73477.69 rows=33529 width=15)
Group Key: c_city
-> Sort (cost=70892.40..71642.40 rows=300000 width=15)
Sort Key: c_city
-> Seq Scan on bmsql_customer (cost=0.00..28218.00
rows=300000 width=15)
(5 rows)
From the preceding plan, the optimizer chooses GroupAggregate as it does not find enough
memory to build the hash table. If the sort feeding GroupAggregate does not fit into
work_mem either, it uses the external disk to perform the sort operation, as the sketch
below shows.
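A sketch to confirm where that sort happens: while work_mem is still at 64kB, EXPLAIN
ANALYZE shows whether the sort stayed in memory (Sort Method: quicksort, with a Memory
figure) or spilled to disk (an external sort method with a Disk figure); reset the
setting afterwards:
benchmarksql=# EXPLAIN ANALYZE SELECT COUNT(*), c_city FROM
               bmsql_customer GROUP BY c_city;
benchmarksql=# RESET work_mem;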
How it works…
PostgreSQL divides the whole dataset into multiple groups either by sorting (possibly
choosing an ordered index on the group column) or by building a hash table for the given
group column list. Once the complete dataset is divided into groups, the given aggregate
function is invoked for each group and executes its state transition functions, as we
discussed in the previous recipe.
Working with set operations
In this recipe, we will be discussing various PostgreSQL set operations.
Getting ready
PostgreSQL provides various set operations, which deal with multiple independent
datasets. The supported set operators are UNION/ALL, INTERSECT/ALL, and EXCEPT/ALL. In
general, we use set operations in SQL when we need to combine or compare independent
datasets. To process them, PostgreSQL evaluates each dataset operation independently and
then applies the given set operation to the final datasets.
How to do it…
1. To demonstrate these set operations, let's query the benchmarksql database to get all
the customer IDs that have not placed any order:
benchmarksql=# EXPLAIN SELECT c_id FROM bmsql_customer
EXCEPT
SELECT h_c_id FROM bmsql_history;
QUERY PLAN
------------------------------------------------------------------------
 HashSetOp Except (cost=0.42..22870.42 rows=3000 width=4)
-> Append (cost=0.42..21370.42 rows=600000 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.42..12132.42
rows=300000 width=4)
-> Index Only Scan using bmsql_customer_pkey on
bmsql_customer (cost=0.42..9132.42 rows=300000
width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..9238.00
rows=300000 width=4)
-> Seq Scan on bmsql_history (cost=0.00..6238.00
rows=300000 width=4)
(6 rows)
2. To demonstrate the INTERSECT behavior, let's run the query to select all
customers who have placed any number of orders:
benchmarksql=# EXPLAIN SELECT c_id FROM bmsql_customer
INTERSECT SELECT h_c_id FROM bmsql_history;
QUERY PLAN
------------------------------------------------------------------------
 HashSetOp Intersect (cost=0.42..22870.42 rows=3000 width=4)
-> Append (cost=0.42..21370.42 rows=600000 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.42..12132.42
rows=300000 width=4)
-> Index Only Scan using bmsql_customer_pkey on
bmsql_customer (cost=0.42..9132.42 rows=300000
width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..9238.00
rows=300000 width=4)
-> Seq Scan on bmsql_history (cost=0.00..6238.00
rows=300000 width=4)
(6 rows)
3. From the preceding two examples, the queries took the HashSetOp mechanism to
perform the set operations. PostgreSQL also has another set operation mechanism
called SetOp, which will be preferred by the optimizer, when it does not find
enough memory for the hash table. Let's run the same query by reducing the
work_mem area as follows:
benchmarksql=# set work_mem to '64kB';
SET
benchmarksql=# EXPLAIN SELECT c_id FROM bmsql_customer
INTERSECT SELECT h_c_id FROM bmsql_history;
                              QUERY PLAN
---------------------------------------------------------------------------
 SetOp Intersect (cost=103566.23..106566.23 rows=3000 width=4)
-> Sort (cost=103566.23..105066.23 rows=600000 width=4)
Sort Key: "*SELECT* 1".c_id
-> Append (cost=0.42..21370.42 rows=600000 width=4)
-> Subquery Scan on "*SELECT* 1"
(cost=0.42..12132.42 rows=300000 width=4)
-> Index Only Scan using bmsql_customer_pkey
on bmsql_customer (cost=0.42..9132.42
rows=300000 width=4)
-> Subquery Scan on "*SELECT* 2"
(cost=0.00..9238.00 rows=300000 width=4)
-> Seq Scan on bmsql_history
(cost=0.00..6238.00 rows=300000 width=4)
(8 rows)
From the preceding plan, as the optimizer does not find enough work memory to build the
hash table, it chooses the SetOp mechanism, which requires the sorted data as an input.
How it works…
In PostgreSQL, set operations are implemented with two strategies: set operations using
sort, and set operations using hash. One of the two is chosen based on the set operation
and the dataset size. Let's see a few demonstrations of each set operator as follows.
To demonstrate the set operation behavior, let's pick one small table (bmsql_warehouse)
and one that is somewhat bigger (bmsql_district).
INTERSECT
This operator returns the tuples common to two individual datasets. To process them,
PostgreSQL always chooses the smaller table, or the minimal cost node type (index scan,
bitmap heap scan, and so on), to build the hash table or the sorted data. Let's see the
following query plans, where the optimizer always chooses the small table:
benchmarksql=# EXPLAIN SELECT w_state FROM bmsql_warehouse INTERSECT SELECT
d_state FROM bmsql_district ;
QUERY PLAN
----------------------------------------------------------------------
 HashSetOp Intersect (cost=0.00..5.48 rows=10 width=3)
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..1.20 rows=10 width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10 rows=10 width=3)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..4.00 rows=100 width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00 rows=100 width=3)
(6 rows)
From the preceding plan, the first child node, Seq Scan on bmsql_warehouse, is chosen to
build the hash table. Even if we reverse the datasets, the optimizer does not change the
plan:
benchmarksql=# EXPLAIN SELECT d_state FROM bmsql_district
INTERSECT
SELECT w_state FROM bmsql_warehouse;
QUERY PLAN
----------------------------------------------------------------------
 HashSetOp Intersect (cost=0.00..5.48 rows=10 width=3)
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..1.20 rows=10
width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10
rows=10 width=3)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..4.00 rows=100
width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00
rows=100 width=3)
(6 rows)
After observing the preceding two plans, you might ask where in the explain plan it is
actually stated that the hash table is built from the bmsql_warehouse table, since all we
see is an Append node as the parent of the Seq Scan on bmsql_warehouse. Currently, the
planner internally chooses the node used to build the hash by comparing node cost values,
and the choice is not spelled out in the plan. However, this information does become
visible when the optimizer chooses the sort-based set operation mechanism. Let's run the
following query, whose plan clearly shows the key chosen for the sort operation:
benchmarksql=# SET enable_hashagg TO off;
SET
benchmarksql=# EXPLAIN SELECT d_state FROM bmsql_district
INTERSECT
SELECT w_state FROM bmsql_warehouse;
QUERY PLAN
------------------------------------------------------------------------
 SetOp Intersect (cost=8.93..9.48 rows=10 width=3)
-> Sort (cost=8.93..9.20 rows=110 width=3)
Sort Key: "*SELECT* 2".w_state
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..1.20
rows=10 width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10
rows=10 width=3)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..4.00
rows=100 width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00
rows=100 width=3)
(8 rows)
From the preceding plan, you can see that the planner chose the bmsql_warehouse data
("*SELECT* 2".w_state) as the sort key for the operation.
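Before moving on, put the session setting changed above back to its default, so that the
remaining examples show the planner's normal choices again:
benchmarksql=# RESET enable_hashagg;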
EXCEPT
Unlike INTERSECT, this operator always tries to build the hash table on the first table.
Even if that table is huge, it will still build the hash on it. Let's see the following
example:
benchmarksql=# EXPLAIN SELECT d_state FROM bmsql_district
EXCEPT
SELECT w_state FROM bmsql_warehouse;
QUERY PLAN
----------------------------------------------------------------------
 HashSetOp Except (cost=0.00..5.48 rows=96 width=3)
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..4.00 rows=100
width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00
rows=100 width=3)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..1.20 rows=10
width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10
rows=10 width=3)
(6 rows)
From the preceding plan, as expected, the hash table is built on the bmsql_district
table, as it appears on the left side of the EXCEPT operator. Let's swap the relations
and see the execution plan:
benchmarksql=# EXPLAIN SELECT w_state FROM bmsql_warehouse
EXCEPT
SELECT d_state FROM bmsql_district ;
QUERY PLAN
----------------------------------------------------------------------
 HashSetOp Except (cost=0.00..5.48 rows=10 width=3)
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..1.20 rows=10
width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10
rows=10 width=3)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..4.00 rows=100
width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00
rows=100 width=3)
(6 rows)
This time, the plan also changed as the optimizer chose the bmsql_warehouse table to
build the hash.
UNION
For this set operation, PostgreSQL will not use the HashSetOp/SetOp node type, as this can
easily be performed by using the Unique or HashAggregate node types. For
demonstration, let's see the following explain plans:
benchmarksql=# EXPLAIN SELECT w_state FROM bmsql_warehouse
UNION
SELECT d_state FROM bmsql_district;
QUERY PLAN
--------------------------------------------------------------------
 HashAggregate (cost=5.48..6.58 rows=110 width=3)
Group Key: bmsql_warehouse.w_state
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10 rows=10
width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00 rows=100
width=3)
(5 rows)
Let's disable the hash aggregate, and see the plan it chooses for this union operation:
benchmarksql=# SET enable_hashagg TO off;
SET
benchmarksql=# EXPLAIN SELECT w_state FROM bmsql_warehouse
UNION
SELECT d_state FROM bmsql_district;
QUERY PLAN
----------------------------------------------------------------------
 Unique (cost=8.93..9.48 rows=110 width=3)
-> Sort (cost=8.93..9.20 rows=110 width=3)
Sort Key: bmsql_warehouse.w_state
-> Append (cost=0.00..5.20 rows=110 width=3)
-> Seq Scan on bmsql_warehouse (cost=0.00..1.10
rows=10 width=3)
-> Seq Scan on bmsql_district (cost=0.00..3.00
rows=100 width=3)
(6 rows)
Working on semi and anti joins
In this recipe, we will be discussing the semi and anti join variants of the joining
methods.
Getting ready
Semi and anti joins are variants of the joining methods such as hash, merge, and nested
loop, which the optimizer prefers to use for EXISTS/IN or NOT EXISTS/NOT IN operators.
A semi join returns a single row for each record of the first table that has matching
records in the other table. That is, if the second table has multiple matching entries
for a record in the first table, it still returns only one copy of that record from the
first table, whereas a normal join would return multiple copies.
An anti join returns rows only when no matching records are found in the second table. It
is the opposite of the semi join, since it returns records from the first table when
there is no match in the second table.
How to do it…
Let's run a query in the benchmarksql database to get the list of items that are in stock:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE EXISTS (SELECT
true FROM bmsql_stock WHERE i_id=s_i_id);
QUERY PLAN
------------------------------------------------------------------------
 Hash Semi Join (cost=42387.40..55508.40 rows=100000 width=73)
Hash Cond: (bmsql_item.i_id = bmsql_stock.s_i_id)
-> Seq Scan on bmsql_item (cost=0.00..2299.00 rows=100000
width=73)
-> Hash (cost=25980.41..25980.41 rows=999999 width=4)
-> Index Only Scan using bmsql_stock_pkey on bmsql_stock
(cost=0.42..25980.41 rows=999999 width=4)
(5 rows)
From the preceding plan, the optimizer chooses the Hash Semi Join, as it returns only the
matching rows from the bmsql_item table. The same plan is used if we implement the query
with the IN operator, as sketched below.
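A quick way to verify that claim on this dataset is to run the IN form under EXPLAIN and
compare the plans; the expectation is the same Hash Semi Join shape:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item
               WHERE i_id IN (SELECT s_i_id FROM bmsql_stock);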
Let's change the same query to fetch the items that are not in stock at this moment using the
NOT EXISTS operator and see the execution plan:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE NOT EXISTS
(SELECT true FROM bmsql_stock WHERE i_id=s_i_id);
QUERY PLAN
------------------------------------------------------------------------
 Hash Anti Join (cost=38480.40..44154.40 rows=1 width=73)
Hash Cond: (bmsql_item.i_id = bmsql_stock.s_i_id)
-> Seq Scan on bmsql_item (cost=0.00..2299.00 rows=100000
width=73)
-> Hash (cost=25980.41..25980.41 rows=999999 width=4)
-> Index Only Scan using bmsql_stock_pkey on bmsql_stock
(cost=0.42..25980.41 rows=999999 width=4)
(5 rows)
As expected, the optimizer chooses Anti Join, as it fetches the non-matching records from
the first table. Note that rewriting the query with NOT IN will not necessarily give the
same plan: NOT IN has different NULL semantics and is typically planned with a subplan
rather than an anti join.
How it works…
Semi and anti joins are like hints to the joining approach on how to behave when matching
tuples are found.
In the preceding example, the optimizer chooses a hash semi join. Unlike a plain hash
join, which always chooses the lower-cost node type to build the hash, the hash semi join
here builds the hash table on the sub-select table, which is bmsql_stock. Once the hash
table is prepared, it probes the hash with the first table's tuples and returns the
matching records. The anti join works in a similar way, except that it returns only the
non-matching tuples from the first table.
The optimizer may also choose a merge or nested loop join together with the semi or anti
join behavior. Let's see the following example, where the optimizer chooses a nested loop
semi join:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE i_id = 10 AND
EXISTS (SELECT true FROM bmsql_stock WHERE i_id=s_i_id);
QUERY PLAN
----------------------------------------------------------------------
 Nested Loop Semi Join (cost=0.72..18488.85 rows=1 width=73)
-> Index Scan using bmsql_item_pkey on bmsql_item (cost=0.29..8.31
rows=1 width=73)
Index Cond: (i_id = 10)
-> Index Only Scan using bmsql_stock_pkey on bmsql_stock
(cost=0.42..18480.52 rows=10 width=4)
Index Cond: (s_i_id = 10)
(5 rows)
Here, the nested loop semi join works differently from a plain nested loop join: it does
not perform the full cross set of comparisons. Once an outer tuple matches an inner
tuple, it moves on to the next outer tuple. The anti join works in a similar way: as soon
as a matching tuple is found for an outer tuple, it skips all the remaining comparisons
against the inner table. With an anti join, an outer tuple enters the result set only
when it matches no inner tuple; a semi join is the opposite, as an outer tuple enters the
result set only when it matches at least one inner tuple.
12
Database Indexing
In this chapter, we will cover the following recipes:
Measuring query and index block statistics
Index lookups
Comparing indexed scans and sequential scans
Clustering against an index
Concurrent indexes
Combined indexes
Partial indexes
Finding unused indexes
Forcing a query to use an index
Detecting missing indexes
Introduction
Indexes play a crucial role in a database: they improve query response time, at the cost
of some overhead for each write operation on a table. Creating or dropping an index is an
important responsibility for a DBA, who needs to make the choice by looking at the read
and write ratio of the database tables. A wise move to increase database performance
needs some investigation into table growth along with its access metrics. In some cases,
an index may improve query performance instantly but turn out not to be useful in the
future. In this chapter, we are going to learn how to measure table growth along with its
access metrics using catalog tables, using the BenchmarkSQL database as a sample dataset.
Measuring query and index block statistics
In this recipe, we will be discussing how to measure the index statistics, using various
catalog views.
Getting ready
PostgreSQL offers a few catalog views and extensions that are enough to study index usage
statistics. The catalog views are pg_stat_user_indexes and pg_statio_user_indexes; these
give the index usage statistics. The pgstattuple extension provides insight into the
physical details of an index by reading its files, as sketched below.
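A minimal sketch of the pgstattuple route, assuming the contrib extension is available on
the server, is to create the extension and ask for the B-tree statistics of the index
used in this recipe:
benchmarksql=# CREATE EXTENSION IF NOT EXISTS pgstattuple;
benchmarksql=# SELECT * FROM pgstatindex('pric_idx');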
How to do it…
1. Let's get a sample non-primary key index to measure its statistics, as follows:
benchmarksql=# SELECT indexrelid::regclass FROM pg_index
               WHERE indisprimary IS FALSE AND
               indrelid::regclass::text = 'bmsql_item' LIMIT 1;
 indexrelid
------------
 pric_idx
(1 row)
2. Let's reset the statistics using the pg_stat_reset function, as follows:
benchmarksql=# SELECT pg_stat_reset();
 pg_stat_reset
---------------

(1 row)
3. Let's run the following sample query on bmsql_item, which uses the index
pric_idx to fetch the heap tuples from the table:
benchmarksql=# EXPLAIN ANALYZE SELECT COUNT(*)
FROM bmsql_item WHERE i_price BETWEEN 1 AND 10;
                               QUERY PLAN
------------------------------------------------------------------------------
 Aggregate (cost=1649.57..1649.58 rows=1 width=8) (actual
 time=1647.810..1647.810 rows=1 loops=1)
   -> Bitmap Heap Scan on bmsql_item (cost=192.74..1627.02
      rows=9019 width=0) (actual time=327.813..1646.526 rows=9021 loops=1)
        Recheck Cond: ((i_price >= '1'::numeric) AND (i_price <= '10'::numeric))
        Heap Blocks: exact=1299
        -> Bitmap Index Scan on pric_idx (cost=0.00..190.48 rows=9019
           width=0) (actual time=288.113..288.113 rows=9021 loops=1)
             Index Cond: ((i_price >= '1'::numeric) AND (i_price <= '10'::numeric))
 Planning time: 399.779 ms
 Execution time: 1756.573 ms
(8 rows)
4. Let's run the following query, which returns the index usage of the preceding
SQL statement:
benchmarksql=# SELECT idx_scan, idx_tup_read,
idx_tup_fetch,
idx_blks_read, idx_blks_hit
FROM pg_stat_user_indexes psui, pg_statio_user_indexes psiui
WHERE psui.relid=psiui.relid AND
psui.indexrelid=psiui.indexrelid AND
psui.indexrelid='pric_idx'::regclass;
-[ RECORD 1 ]-+------
idx_scan      | 2
idx_tup_read  | 9022
idx_tup_fetch | 1
idx_blks_read | 28
idx_blks_hit  | 2
How it works…
From the preceding example, we can see that the optimizer preferred to use an index scan
to fetch the tuples from the bmsql_item table. Let's get the pric_idx index details by
using the pg_class catalog table, as follows:
benchmarksql=# SELECT relpages, reltuples FROM pg_class WHERE relname =
'pric_idx';
-[ RECORD 1 ]-------
relpages  | 276
reltuples | 100000
From the preceding output, the pric_idx index has 100000 tuples in 276 pages. Each page
is 8 KB (the default block size), so the total size of the index is 276 * 8 * 1024 =
2260992 bytes, that is, approximately 2.2 MB. We can also get the size of an index using
the \di+ PostgreSQL meta command, as follows:
benchmarksql=# \di+ pric_idx
List of relations
-[ RECORD 1 ]-----------
Schema      | public
Name        | pric_idx
Type        | index
Owner       | postgres
Table       | bmsql_item
Size        | 2208 kB
Description |
idx_scan
This defines the number of times the index was scanned to fetch results. According to the
preceding metrics, the query took two index scans (idx_scan) to fetch the tuples. Now you
might ask, "In the preceding plan I only see one index scan node type; where is the other
one?" The answer is that the optimizer performed one additional index scan for its
selectivity estimation. That is, the preceding SQL statement has a range (inequality)
predicate. If the optimizer sees such a predicate in the SQL query, it tries to calculate
the selectivity of the index against its histogram boundaries, which we can find for each
column in the pg_stats view; a hedged peek at them is sketched below. This index
selectivity ratio determines whether the optimizer should choose an index scan over a
sequential scan. To demonstrate this, let's play around with index selectivity.
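Before doing so, a quick look at where those boundaries live (standard pg_stats columns,
nothing specific to this benchmark):
benchmarksql=# SELECT histogram_bounds FROM pg_stats
               WHERE tablename = 'bmsql_item' AND attname = 'i_price';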
Let's get the maximum and minimum values from i_price:
benchmarksql=# SELECT max(i_price), min(i_price) FROM bmsql_item;
  max   | min
--------+------
 100.00 | 1.00
(1 row)
Let's reset the statistics, using the pg_stat_reset() function:
benchmarksql=# SELECT pg_stat_reset();
 pg_stat_reset
---------------

(1 row)
Let's run the same SQL statement, with the minimum and maximum values of the i_price
column:
benchmarksql=# EXPLAIN ANALYZE SELECT COUNT(*) FROM bmsql_item WHERE
i_price BETWEEN 1 AND 100;
QUERY PLAN
------------------------------------------------------------------------
 Aggregate (cost=3049.00..3049.01 rows=1 width=8) (actual
time=41.861..41.861 rows=1 loops=1)
-> Seq Scan on bmsql_item (cost=0.00..2799.00 rows=100000 width=0)
(actual time=0.011..34.318 rows=100000 loops=1)
Filter: ((i_price >= '1'::numeric) AND (i_price <= '100'::numeric))
Planning time: 0.164 ms
Execution time: 41.885 ms
(5 rows)
From the preceding plan, we do not see any index scan node types; the optimizer decided
to go with a sequential scan after consulting the index selectivity. If the given range
would fetch the whole table anyway, the optimizer chooses a sequential scan to avoid the
index lookup overhead. We can confirm this by checking the number of index scans the
optimizer performed for the preceding SQL statement, using the following query:
benchmarksql=# SELECT idx_scan
FROM
pg_stat_user_indexes psui, pg_statio_user_indexes psiui
WHERE psui.relid=psiui.relid AND psui.indexrelid=psiui.indexrelid AND
psui.indexrelid = 'pric_idx'::regclass;
-[ RECORD 1 ]-+---
idx_scan      | 2
From the preceding results, we can see that the optimizer performed two index scans to
find the index selectivity and then chose the sequential scan as the better plan, to execute
the given SQL statement. To confirm this index selectivity, the optimizer does not read the
whole index; rather, it only fetches a few index blocks.
idx_tup_read
This defines the number of tuples read from the index. In the preceding example, the
total number of tuples read is 9022, which includes the tuples read during the index
selectivity estimation.
idx_tup_fetch
This defines the number of tuples fetched from the table by a plain index scan. In the
preceding plan, the optimizer chose the bitmap index scan, which is not a regular index
scan method, so the value of idx_tup_fetch is only 1, a tuple fetched during the
optimizer's selectivity estimation.
idx_blks_read
This defines the number of blocks read from the disk. From the preceding results, the value
of idx_blks_read is 28, and each block is 8 KB in size. This includes the number of blocks
read from an index, during the index selectivity.
idx_blks_hit
This defines the number of blocks read from the cache. From the preceding results, the
value of idx_blks_hit is 2, and each block is 8 KB in size. This includes the blocks hit
in the cache during the index selectivity estimation.
Index lookup
In this recipe, we will be discussing various index scans, and how they try to fetch matching
tuples from tables.
Getting ready
As we all know, an index is a physical object that can help a query retrieve table data
much faster than a sequential scan. In an explain plan, we frequently see index scan,
bitmap heap scan, or index only scan node types, which work on a particular index.
Whenever an index is scanned, the immediately following action is to return the matching
tuples from the index, build a bitmap in memory, or fetch the tuples from the base table.
How to do it…
To demonstrate the index scan, let's run the following simple query and see how many
tuples it is reading from the base table:
benchmarksql=# SELECT pg_stat_reset();
 pg_stat_reset
---------------

(1 row)
benchmarksql=# SET enable_bitmapscan TO off;
SET
benchmarksql=# EXPLAIN ANALYZE SELECT * FROM bmsql_item WHERE i_price = 1;
QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using pric_idx1 on bmsql_item (cost=0.29..32.41 rows=7
width=72) (actual time=0.019..0.026 rows=7 loops=1)
Index Cond: (i_price = '1'::numeric)
Planning time: 0.650 ms
Execution time: 0.049 ms
(4 rows)
In the preceding SQL statements, we have forced the optimizer's choice by disabling
enable_bitmapscan. Now let's see how many tuples the index scan read from
the bmsql_item table, using the following SQL statement:
benchmarksql=# SELECT idx_scan, idx_tup_fetch FROM pg_stat_user_tables
WHERE relname = 'bmsql_item';
 idx_scan | idx_tup_fetch
----------+---------------
        1 |             7
(1 row)
To demonstrate the bitmap scan behavior, let's enable enable_bitmapscan and then rerun
the same statement to see how many tuples it read from the index to build the bitmap, and
how many tuples it read from the table using that bitmap index:
benchmarksql=# EXPLAIN ANALYZE SELECT * FROM bmsql_item WHERE i_price = 1;
                               QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_item (cost=4.35..30.89 rows=7 width=72) (actual
 time=18.373..48.416 rows=7 loops=1)
   Recheck Cond: (i_price = '1'::numeric)
   Heap Blocks: exact=7
   -> Bitmap Index Scan on pric_idx1 (cost=0.00..4.35 rows=7 width=0)
      (actual time=0.328..0.328 rows=7 loops=1)
        Index Cond: (i_price = '1'::numeric)
 Planning time: 125.660 ms
 Execution time: 48.515 ms
(7 rows)
Get the number of scans and tuples read from the index by using the following SQL
statement:
benchmarksql=# SELECT idx_scan, idx_tup_fetch, idx_tup_read
               FROM pg_stat_user_indexes
               WHERE indexrelname = 'pric_idx1';
-[ RECORD 1 ]-+--
idx_scan      | 1
idx_tup_fetch | 0
idx_tup_read  | 7
Get the number of tuples that the bitmap index read from the table using the previous SQL,
which we executed on pg_stat_user_tables:
-[ RECORD 1 ]-+--
idx_scan      | 1
idx_tup_fetch | 7
To demonstrate the index only scan behavior, let's modify the SQL statement to only fetch
the records from the indexed column by disabling enable_bitmapscan:
benchmarksql=# EXPLAIN ANALYZE SELECT i_price FROM bmsql_item WHERE i_price
= 1;
QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using pric_idx1 on bmsql_item (cost=0.29..32.41 rows=7
width=6) (actual time=0.018..0.025 rows=7 loops=1)
Index Cond: (i_price = '1'::numeric)
Heap Fetches: 7
Planning time: 0.598 ms
Execution time: 0.047 ms
(5 rows)
Let's run the previous SQL statements that use pg_stat_user_indexes:
-[ RECORD 1 ]-+--
idx_scan      | 1
idx_tup_fetch | 7
idx_tup_read  | 7
From the preceding plan, the index only scan performs Heap Fetches: 7, meaning it fetched
the actual tuples from the base table to confirm their visibility information. Once the
index only scan has the visibility information, it returns the tuples from the index.
How it works…
Let us discuss the various index scan techniques:
Index scan
In general, an index scan fetches the matching tuples from the index and then immediately
hits the table to fetch the heap tuples. An index does not maintain any visibility
information, whereas a heap does. This visibility information helps the index scan to return
only the committed live rows, rather than returning the uncommitted dirty rows.
Bitmap heap scan
A bitmap heap scan builds the bitmap index with the matching tuples, and then tries to
fetch the heap tuples from the table. One advantage that a bitmap heap scan has over an
index scan is a sequential order of reading the table pages, rather than doing random reads
due to the index scan. We will be discussing this in more detail in further chapters.
Index only scan
An index only scan is the mechanism by which PostgreSQL tries to fetch all the required
data from the index itself rather than fetching it from the table. The index only scan
utilizes the table's visibility map file to avoid the heap scan. In the preceding
demonstration, the index only scan performed seven heap fetches, as it did not find
usable visibility map information for the bmsql_item relation; in that case, the index
only scan behaves like a normal index scan. Let's consider the following test case, where
the index only scan fetches the tuples from the index alone:
benchmarksql=# VACUUM bmsql_item;
VACUUM
benchmarksql=# EXPLAIN ANALYZE SELECT i_price FROM bmsql_item WHERE i_price
= 1;
                               QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using pric_idx1 on bmsql_item (cost=0.29..4.42 rows=7
 width=6) (actual time=0.017..0.019 rows=7 loops=1)
   Index Cond: (i_price = '1'::numeric)
   Heap Fetches: 0
 Planning time: 0.674 ms
 Execution time: 0.040 ms
(5 rows)
In the preceding example, we ran the VACUUM command on the table, which updated the
visibility map that feeds the index only scan operation. In the plan, we can also observe
Heap Fetches: 0, since the scan relied entirely on the visibility map. A hedged way to
inspect that map is sketched below.
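If you want to confirm how much of the table is now marked all-visible, one option is the
pg_visibility contrib extension shipped with 9.6 (a sketch, assuming the extension can be
installed on the server):
benchmarksql=# CREATE EXTENSION IF NOT EXISTS pg_visibility;
benchmarksql=# SELECT * FROM pg_visibility_map_summary('bmsql_item');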
Please refer to the following URL for more information on index only scans:
https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html.
Comparing indexed scans and sequential
scans
In this recipe, let's compare the index and sequential scan behaviors using the
inotifywait utility command, which will print a message when the mentioned event
occurs on the given files.
Getting ready
inotify-tools is a package that we can install using either the apt-get or yum command on
the corresponding Linux distribution. It is built on the inotify kernel API, which
provides an audit-like mechanism over the filesystem. To compare the index and sequential
scan behavior, let's watch the relation and index physical files while executing the SQL
queries.
How to do it...
Let's get the locations of the index and relation physical files using the
pg_relation_filepath function, as follows:
benchmarksql=# SELECT pg_relation_filepath('pric_idx');
 pg_relation_filepath
----------------------
 base/12439/16545
(1 row)
benchmarksql=# SELECT pg_relation_filepath('bmsql_item');
 pg_relation_filepath
----------------------
 base/12439/16417
(1 row)
Let's add these two physical files to the inotifywait command, as shown here, and
observe the output after each SQL statement execution:
$ inotifywait -m -e access base/12439/16417 base/12439/16545
Setting up watches.
Watches established.
Let's run the SQL query, which does the sequential scan over the table, as follows:
SELECT * FROM bmsql_item;
Output of inotifywait
base/12439/16545 ACCESS
base/12439/16417 ACCESS
Let's run the SQL query, which does the index scan over the table, as follows:
SELECT * FROM bmsql_item WHERE i_price = 1;
Output of inotifywait
base/12439/16545 ACCESS
base/12439/16417 ACCESS
Let's run the SQL query, which does the index only scan, as follows:
SELECT i_price FROM bmsql_item WHERE i_price = 1;
Output of inotifywait
base/12439/16545 ACCESS
base/12439/16417 ACCESS
Let's run the same preceding SQL after running the VACUUM operation on the bmsql_item
table, which prefers index only scans:
Output of inotifywait
base/12439/16545 ACCESS
Let's enable enable_bitmapscan, and then run the following SQL statement:
SELECT * FROM bmsql_item WHERE i_price = 1;
Output of inotifywait
base/12439/16545 ACCESS
base/12439/16417 ACCESS
While performing the preceding test cases, we needed to restart the Postgres instance
multiple times. This is because, if the relation's data is already cached, Postgres
returns the required tuples from memory and the underlying files are never accessed.
How it works…
Sequential scan
From looking at the preceding demonstration results, you might wonder why the query
accessed the index file to perform the full table scan. As aforementioned, the optimizer tries
to generate multiple plans for the submitted SQL statement, even though the SQL does not
have any predicate clause in it. In the preceding sequential scan, the optimizer tried to read
each index that was created on the bmsql_item table. Let's run the same SQL statement by
adding all the indexes of that table to the inotifywait command, as follows:
$ inotifywait -m -e access base/12439/16417 base/12439/16545
base/12439/16457 base/12439/24576
Setting up watches.
Watches established.
base/12439/16457 ACCESS
base/12439/16545 ACCESS
base/12439/24576 ACCESS
base/12439/16417 ACCESS
Index scan
From the preceding results, as expected, whenever the index scan is performed, the table
file is accessed immediately afterwards as well.
Index only scan
From the preceding results, before performing VACUUM on the table, the inotify output is
similar to the index scan's output. Once VACUUM is performed, inotify only shows an
access to the index file, which is the expected behavior.
Bitmap heap scan
From the preceding results, as expected, the index and heap file is scanned for the bitmap
heap scan node type.
There's more…
The inotifywait and inotifywatch commands provide multiple audit levels for a given file
or directory. That is, whenever file events such as remove, update, access, or move
occur, these utility commands print useful audit information. I would encourage you to go
through the details of these commands on their man pages.
Clustering against an index
In this recipe, we will be discussing a table reordering strategy based on an index, which
will improve the query performance for a certain time.
Getting ready
In PostgreSQL, we have a utility command called CLUSTER that behaves somewhat like VACUUM
FULL: it reorganizes the table while holding an access exclusive lock on it. The CLUSTER
command creates a new physical table with its pages aligned in the order of the given
index. One advantage of doing this is that it reduces the heap lookup overhead during
subsequent index scans, since the table itself is ordered. Another benefit is that all
dead tuples are removed from the table and its indexes.
Running the CLUSTER command during business hours is not recommended, as it acquires an
access exclusive lock that blocks incoming queries on the table. Also, the CLUSTER
operation requires additional disk space for the newly created table and indexes; the old
table and its associated indexes are dropped once the operation completes successfully.
How to do it…
For demonstration, let's do the cluster operation on the bmsql_item table using the
pric_idx index:
benchmarksql=# CLUSTER bmsql_item USING pric_idx;
CLUSTER
Let's perform the ANALYZE operation on the table after the CLUSTER operation:
benchmarksql=# ANALYZE bmsql_item;
ANALYZE
How it works…
As mentioned before, the CLUSTER operation acquires an exclusive lock on the table, as it
needs to rebuild it. After clustering is complete, we need to perform an ANALYZE
operation to update the statistics about the new table ordering. Once the table is
ordered, the disk reads required by subsequent index scans on that column are much
faster, and we get better performance.
There's more…
In PostgreSQL, the CLUSTER operation does not maintain the table ordering as new
transactions come in, so we need to execute the same operation periodically to keep
getting the benefit; a rough way to decide when is sketched below.
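A rough, hedged way to judge when to repeat it is to watch the planner's correlation
statistic for the clustered column; a value close to 1 (or -1) means the heap is still
well ordered on i_price, while values drifting toward 0 suggest the ordering has decayed:
benchmarksql=# SELECT attname, correlation FROM pg_stats
               WHERE tablename = 'bmsql_item' AND attname = 'i_price';
Remember to run ANALYZE first, since correlation is only as fresh as the latest
statistics.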
Please refer to the following URL for more information about CLUSTER behavior:
https://www.postgresql.org/docs/9.6/static/sql-cluster.html.
Concurrent indexes
In this recipe, we will be discussing how to create an online index that will not block the
base table for the incoming transactions.
Getting ready
In general, while creating or dropping an index, the table is locked against further
write operations. Until the index operation is complete, the whole table is blocked for
writes, which can amount to an outage for the application. To avoid this table lock
problem, PostgreSQL provides the CONCURRENTLY option, which avoids this blocking
behavior. The index is kept in an invalid state until its creation completes
successfully. Building an index concurrently takes more time, as the operation has to
accommodate the incoming write operations.
How to do it…
Let's create a regular index on the bmsql_item table, and see how many full table scans it
performed to build this index:
benchmarksql=# CREATE INDEX data_idx ON bmsql_item(i_data);
CREATE INDEX
benchmarksql=# SELECT seq_scan,seq_tup_read FROM pg_stat_user_tables WHERE
relname = 'bmsql_item';
 seq_scan | seq_tup_read
----------+--------------
        1 |       100000
(1 row)
Let's drop and re-create the same index using the CONCURRENTLY option, and see the
number of full table scans it performed:
benchmarksql=# CREATE INDEX CONCURRENTLY data_idx ON bmsql_item(i_data);
CREATE INDEX
benchmarksql=# SELECT seq_scan,seq_tup_read FROM pg_stat_user_tables WHERE
relname = 'bmsql_item';
 seq_scan | seq_tup_read
----------+--------------
        2 |       200000
(1 row)
How it works…
While creating a regular index, only one full table scan is needed, because the table is
locked against write operations; while building a concurrent index, two full table scans
are needed, since the table is not locked for writes. If the table is huge, we can expect
the concurrent index build to use more CPU and I/O than a regular index creation. The two
table scans are performed under two different transaction snapshots, which allows the
index to be built concurrently by waiting out any ongoing write operations on the table.
During the two table scans, if any error occurs then it won't mark the index as valid, and
these invalid indexes won't be used for any SQL queries. However, these invalid indexes
don't hide from the incoming write operations, which leads to an unnecessary I/O. To repair
these invalid indexes to be valid indexes, we have to either reindex the index, or drop the
index concurrently and then recreate it again. Use the following query to get the list of
invalid indexes:
SELECT COUNT(*) FROM pg_index WHERE indisvalid IS FALSE;
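To see which indexes are affected, rather than just how many, a small variation of the same
catalog query lists them by name (a minimal sketch using the standard pg_index catalog and
the regclass cast):
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;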
Combined indexes
In this recipe, we will be discussing how PostgreSQL combines multiple indexes to produce
desired results.
Getting ready
When a query has a predicate over multiple columns and each of those columns has its own
index, the optimizer can combine the results of the individual indexes to fetch the desired rows.
How to do it…
For demonstration, let's query the database as follows:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE i_data = 'Item Name'
OR i_price = 3;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on bmsql_item  (cost=8.82..60.67 rows=14 width=71)
   Recheck Cond: (((i_data)::text = 'Item Name'::text) OR (i_price = '3'::numeric))
   ->  BitmapOr  (cost=8.82..8.82 rows=14 width=0)
         ->  Bitmap Index Scan on data_idx  (cost=0.00..4.45 rows=4 width=0)
               Index Cond: ((i_data)::text = 'Item Name'::text)
         ->  Bitmap Index Scan on price_idx  (cost=0.00..4.37 rows=10 width=0)
               Index Cond: (i_price = '3'::numeric)
(7 rows)
How it works…
In the preceding example, the optimizer performed two bitmap index scans and combined
their results with a BitmapOr node, whose output was then fed into a bitmap heap scan to
fetch the actual rows.
Refer to the following URL for more information about combined indexes:
https://www.postgresql.org/docs/9.6/static/indexes-bitmap-scans.html.
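For a predicate that combines the same columns with AND rather than OR, the planner can
likewise combine the two indexes through a BitmapAnd node; a quick way to check this (the
plan output is omitted here, as it depends on the collected statistics) is:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE i_data = 'Item Name'
AND i_price = 3;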
Partial indexes
In this recipe, we will be discussing how to create an index over only the required subset of a table's data.
Getting ready
Using partial indexes, we can reduce the size of an index by adding a predicate to the index
definition. That is, only the rows that match the predicate are indexed, instead of all of
them. A partial index is used only when the planner can prove that the index predicate is
implied by the predicate of the submitted SQL statement.
How to do it…
For example, let's say our application frequently queries bmsql_item to list all the items
priced between $5 and $10; that predicate is then a good candidate for a partial index,
created as follows:
benchmarksql=# CREATE INDEX CONCURRENTLY part_idx ON bmsql_item(i_price)
WHERE i_price BETWEEN 5 AND 10;
CREATE INDEX
Let's query the table to return the items priced between 5 and 10:
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE i_price BETWEEN 5 AND
10;
                                    QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using part_idx on bmsql_item  (cost=0.28..206.97 rows=4979 width=72)
(1 row)
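Since the main benefit of a partial index is its smaller footprint, it can also be worth
comparing its on-disk size with that of a full index on the same column (a minimal check;
pric_idx is assumed here to be the full index on i_price):
benchmarksql=# SELECT pg_size_pretty(pg_relation_size('part_idx')) AS partial_size,
pg_size_pretty(pg_relation_size('pric_idx')) AS full_size;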
How it works…
As mentioned earlier, when the submitted SQL predicate is covered by the partial index
predicate, that index can be chosen. The same applies to parameterized SQL: whether the
partial index is picked depends on the submitted parameter values. Consider the following
example. The first execution, params(4, 6), asks for a range that extends below the indexed
range of 5 to 10, so the partial index cannot be used and the planner falls back to a
sequential scan; the second execution, params(5, 6), falls entirely within the index
predicate and therefore uses part_idx:
benchmarksql=# PREPARE params(int, int) AS SELECT * FROM bmsql_item WHERE
i_price BETWEEN $1 AND $2;
PREPARE
benchmarksql=# EXPLAIN EXECUTE params(4, 6);
                              QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on bmsql_item  (cost=0.00..2800.00 rows=2035 width=72)
   Filter: ((i_price >= '4'::numeric) AND (i_price <= '6'::numeric))
(2 rows)
benchmarksql=# EXPLAIN EXECUTE params(5, 6);
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using part_idx on bmsql_item  (cost=0.28..50.87 rows=1005 width=72)
   Index Cond: (i_price <= '6'::numeric)
(2 rows)
Finding unused indexes
In this recipe, we will be discussing how to find indexes that have never been used since
their creation and that are therefore causing unnecessary I/O.
Getting ready
In a database, unused indexes cause unnecessary I/O, because every write operation on the
table must also be written into each of its indexes. To find these unused indexes, we have to
rely on the number of scans each index has performed so far. If the index scan count has
stayed at zero since the index was created, and the index is not a primary key or unique
index, we can treat it as unused.
To get the number of index scans, we have to depend on PostgreSQL's
statistics counters. However, these counters can be reset to zero using
pg_stat_reset(), so it is wise to check when the statistics were last
reset, using the stats_reset column of the pg_stat_database view, as
shown below.
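A minimal check of the last statistics reset for the current database (datname and
stats_reset are standard columns of the pg_stat_database view):
benchmarksql=# SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();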
How to do it…
Let's run the following query to get the list of unused indexes from the database:
benchmarksql=# SELECT indrelid::regclass tab_name, pi.indexrelid::regclass unused_index
FROM pg_index pi, pg_stat_user_indexes psui
WHERE pi.indexrelid = psui.indexrelid AND NOT indisunique AND idx_scan = 0;
    tab_name    |    unused_index
----------------+---------------------
 bmsql_customer | bmsql_customer_idx1
 bmsql_district | d_state_idx
 bmsql_item     | pric_idx
 bmsql_customer | zip_idx
 bmsql_item     | pric_idx1
 bmsql_item     | data_idx
(6 rows)
Rather than dropping these unused indexes immediately, let's first mark them all as
invalid, observe the application behavior for a while, and only then drop them using the
CONCURRENTLY option:
UPDATE pg_index SET indisvalid = false
WHERE indexrelid::regclass::text IN ( < Unused indexes > );
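Once we are satisfied that an index is genuinely unused, it can be removed without blocking
writers; a minimal sketch for one of the indexes from the preceding listing (pric_idx1 is
used here purely as an example):
benchmarksql=# DROP INDEX CONCURRENTLY pric_idx1;
DROP INDEX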
How it works…
As mentioned earlier, once an index is marked as invalid, it will not be used for any
queries; even if a table is clustered on that index, the planner will ignore it and fall
back to other scan methods such as a sequential scan. If, after marking the unused indexes
as invalid, we do not observe any application performance issues, that is a good sign that
we can drop those indexes permanently.
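If a query regression does show up during the observation period, the index can be brought
back into use by rebuilding it, since a successful REINDEX marks the index as valid again
(a minimal sketch, reusing pric_idx1 from the earlier example):
benchmarksql=# REINDEX INDEX pric_idx1;
REINDEX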
Forcing a query to use an index
In this recipe, we will be discussing how to force a query to pick an index.
Getting ready
As we discussed in the previous chapters, the optimizer generates a set of plans based on
the statistics it has collected, and the plan with the lowest estimated cost is chosen as
the final execution plan for the query. Forcing a specific index onto a SQL query is not
possible in the current release of PostgreSQL; however, you can guide the planner to prefer
an index scan over bitmap and sequential scans by disabling the corresponding session-level
optimizer parameters, or by lowering the random_page_cost setting so that it is close to
the value of the seq_page_cost parameter.
How to do it…
Let's write a sample SQL query for which the planner prefers a sequential scan, as follows:
benchmarksql=# EXPLAIN ANALYZE SELECT COUNT(*) FROM bmsql_item WHERE
i_price BETWEEN 10 AND 80;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2976.67..2976.68 rows=1 width=8) (actual time=47.900..47.900 rows=1 loops=1)
   ->  Seq Scan on bmsql_item  (cost=0.00..2800.00 rows=70669 width=0) (actual time=3.341..40.537 rows=70947 loops=1)
         Filter: ((i_price >= '10'::numeric) AND (i_price <= '80'::numeric))
         Rows Removed by Filter: 29053
 Planning time: 0.168 ms
 Execution time: 47.936 ms
(6 rows)
Let's nudge the planner toward the index scan for the same query by reducing the
random_page_cost value so that it is close to 1 (the default seq_page_cost):
benchmarksql=# SET random_page_cost TO 0.9;
SET
benchmarksql=# EXPLAIN ANALYZE SELECT COUNT(*) FROM bmsql_item WHERE
i_price BETWEEN 10 AND 80;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2685.65..2685.66 rows=1 width=8) (actual time=34.134..34.134 rows=1 loops=1)
   ->  Index Only Scan using pric_idx1 on bmsql_item  (cost=0.29..2508.97 rows=70669 width=0) (actual time=0.054..27.760 rows=70947 loops=1)
         Index Cond: ((i_price >= '10'::numeric) AND (i_price <= '80'::numeric))
         Heap Fetches: 70947
 Planning time: 0.187 ms
 Execution time: 34.176 ms
(6 rows)
We will also get the same behavior if we turn off the enable_seqscan parameter.
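A minimal sketch of that alternative (the plan output is omitted, but it should again show
the index only scan); note that enable_seqscan = off does not forbid sequential scans
outright, it only makes the planner treat them as extremely expensive:
benchmarksql=# RESET random_page_cost;
RESET
benchmarksql=# SET enable_seqscan TO off;
SET
benchmarksql=# EXPLAIN ANALYZE SELECT COUNT(*) FROM bmsql_item WHERE
i_price BETWEEN 10 AND 80;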
How it works…
In the preceding example, because we set random_page_cost to 0.9, the optimizer estimated
the total cost of the index only scan plan as 2685.66, which is lower than the 2976.68
estimated for the sequential scan plan, so the index only scan was chosen.
Detecting a missing index
In this recipe, we will be discussing how to identify the tables that need to be indexed.
Getting ready
Finding missing indexes in a database is a tricky task. To identify candidate tables, we can
use the sequential and index scan counters from the statistics views. However, a high number
of sequential scans on a table does not by itself confirm that the table needs an index; to
confirm it, we have to analyze the queries executed against that table, for example by using
hypothetical indexes.
In general, it is recommended that you create indexes on foreign key columns, as this allows
the planner to use an index when joining parent and child tables, and it also speeds up key
validation between them. It is likewise recommended that you create the corresponding
indexes on child tables when they are created by inheriting from a parent table, since
indexes are not inherited automatically.
How to do it…
Let's query the database for tables where the difference between seq_scan and idx_scan is
greater than a certain threshold, as shown here:
benchmarksql=# SELECT relname, seq_scan-idx_scan delta FROM
pg_stat_user_tables WHERE seq_scan-idx_scan > 0;
  relname   | delta
------------+-------
 bmsql_item |     8
(1 row)
Now let's gather the queries that our business logic runs against the bmsql_item table, and
from those identify the candidate columns for index creation.
How it works…
Before creating any index on the table, it is good to check whether the new index would
actually be used by the relevant SQL statements. We can get this confirmation by creating
hypothetical indexes with the HypoPG extension, as sketched below.
You can find more information about this extension at the following URL:
https://github.com/dalibo/hypopg.
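A minimal sketch of the workflow, assuming the hypopg extension has been installed in the
benchmarksql database; hypopg_create_index() registers a hypothetical index that exists
only for planning purposes, so a plain EXPLAIN (without ANALYZE) shows whether the planner
would use it:
benchmarksql=# CREATE EXTENSION hypopg;
CREATE EXTENSION
benchmarksql=# SELECT * FROM hypopg_create_index('CREATE INDEX ON bmsql_item(i_data)');
benchmarksql=# EXPLAIN SELECT * FROM bmsql_item WHERE i_data = 'Item Name';
benchmarksql=# SELECT hypopg_reset();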
Index
A
active sessions
checking 167, 168, 169
aggregate
about 274
demonstrating 275
state transition functions 278
working 277
anti join
about 301
performing 302
working 302
auto_explain extension
URL 264
autovacuum 211, 216
B
background writer statistics
buffers_alloc 207
buffers_backend 206
buffers_backend_fsync 206
buffers_checkpoint 204
buffers_clean 204, 205
checkpoint_sync_time 203
checkpoint_write_time 203
checkpoints_req 202
checkpoints_timed 202
max_written_clean 205
stats_reset 207
tuning with 200, 201
Barman
setting up 147, 149
used, for backup 152
used, for recovery 150, 152
basic cost
computing 264
index scan calculation 267
phase I 268
phase II 268
sequential scan 266
working 265
BenchmarkSQL
about 247
building 248
download link 247
working 249
bgwriter process 200
bitmap heap scans
about 271, 312
generating 271
performing 271
working 273, 274
bitmap index scans
about 271
working 273
bloating tables
dealing with 209
URL 209
blocked sessions
finding 171, 173
prepared transaction locks 176
SKIP LOCKED usage 178
table level locks 175, 176
transactional locks 173, 174
bonnie++
disk speed, benchmarking 14, 15
Bucardo
URL 123
URL, for replication 126
used, for replication 122
buffer cache contents
analyzing 51, 53
C
cache
clearing 256, 258, 259
checkpoint
overhead, identifying 50
URL 51
checksums
enabling 191
clustering
against index 316
reference link 317
cold cache
about 252, 253
working 256
combined indexes
about 319
reference link 320
Completely Fair Queuing (CFQ) 48
concurrent 318
concurrent indexes
creating 317
connection pooling
configuring, with PgBouncer 81
implementing, with PgBouncer 81
connection-related parameters
listen_addresses 35
max_connections 36
port 36
superuser_reserved_connections 36
tuning 35
unix_socket_directory 36
unix_socket_permissions 36
CPU bottlenecks
identifying 61, 62
CPU consuming processes
tracking 59
CPU load
monitoring 60
CPU scheduling parameters
about 46
kernel.sched_autogroup_enabled 47
kernel.sched_latency_ns 47
kernel.sched_migration_cost_ns 48
kernel.sched_min_granularity_ns 47
kernel.sched_wakeup_granularity_ns 47
CPU speed
benchmarking 8
benchmarking, reference link 8
benchmarking, with Phoronix 8, 10
benchmarking, with sysbench 10
CPU usage
monitoring 55, 56
CREATE AGGREGATE
URL 279
CTE (Common Table Expression) scans
about 280
performing 280
recursive union 283
working 281
current running SQL commands
obtaining 170, 171
D
data replicating
from Oracle to PostgreSQL, Goldengate used
232, 234, 236, 237, 238, 242, 244, 245
database corruption 190
database
monitoring 166
performance 167
server, restarting 34
dead tuples
generation, controlling 224
deadlocks
troubleshooting 179, 180
troubleshooting, advisory locks used 181
troubleshooting, FOR UPDATE used 180
disk growth
estimating, with pgbench tool 22
estimating, with pgbench tool 21
disk I/O bottlenecks
identifying 62, 63
disk I/O schedulers
about 48
CFQ 49
deadline 49
noop 49
disk IOPS
benchmarking, with open source tools 19, 21
random mixed read/write, executing 20
sequential mixed read/write, executing 19
disk seek rate speed
benchmarking 15
disk space
monitoring 69
disk speed
benchmarking 13
benchmarking, with bonnie++ 14
benchmarking, with Phoronix 14
disk usage
column size, determining 187
database size, determining 189
determining 186, 189
relation size, determining 188
tablespace size, determining 189
DRBD
used, for replication 126, 135
E
EPEL package
URL 122
explain plan
about 264
generating 262
JSON formatted explain plan, generating 263
working 263
XML formatted explain plan, generating 262
YAML formatted explain plan, generating 263
F
Flexible I/O (fio)
about 19
URL 19
Free Space Map (FSM) 215
freezing operation
wrapping up 217, 220
fsync speed
benchmarking, with open source tools 17, 18
G
Goldengate command line interface (CLI) 238
Goldengate, for Oracle database
reference link 233
Goldengate
about 228
used, for replicating data from Oracle to
PostgreSQL 232, 234, 236, 237, 238, 243,
244, 245
group aggregate
about 292
demonstrating 293
working 294
H
hash aggregate
about 274
demonstrating 276
working 279
hash join internal implementation
about 289
phase I 290
Phase II 291
hash join
about 288
behaviour, demonstrating 288
historical CPU usage
tracking 66
historical memory usage
tracking 68, 69
hot cache
about 252, 254
working 256
HypoPG extension
reference link 325
I
index block statistics
idx_blks_hit 309
idx_blks_read 309
idx_scan 307
idx_tup_fetch 309
idx_tup_read 308
measuring 305
index maintenance operations
URL 197
index only scan
about 312
reference link 313
index scans
about 309, 315
bitmap heap scan 312, 316
index only scan 312, 315
techniques 312
versus sequential scans 313
indexes
clustering against 316
dealing with 209
using, query forced to 323
Input/Output Per Second (IOPS) 19
demonstrating 289
merge join
about 287, 291
phase I 291
phase II 291
missing index
detecting 324
Multi Version Concurrency Control (MVCC) 208
K
nested loop behavior
phase I 286
phase II 286
nesting loops
about 284
demonstrating 285
working 286
network interfaces
status, monitoring 71
kernel parameters
URL 46
kernel scheduling entity (KSE) 48
L
Linux/Unix memory parameters, handling
about 43
kernel.shmall 44
kernel.shmmax 44
kernel.shmmni 44
vm.dirty_background_ratio 45
vm.dirty_ratio 46
vm.overcommit_memory 45
vm.overcommit_ratio 45
vm.swappiness 45
log_checkpoints 207
logging parameters
URL 186
logging-related parameters
tuning 39
Londiste
references 121
used, for replication 113, 120
M
master/slave streaming replication
setting up 104
URL 106
memory speed
benchmarking 11
benchmarking, with Phoronix 11
benchmarking, with tmpfs 3, 11
merge join behavior
N
O
OmniPITR
setting up 153, 154
used, for WAL management 154, 156
Out Of Memory (OOM) 45
P
page corruption
preventing 190, 191, 193
paging 56, 57
partial indexes 320
pg_catcheck extension
URL 193
pg_dump utility
used, for upgrading data 228, 229
pg_partman
URL 96
pg_stat_activity 169
pg_test_timing
URL 251
pg_upgrade utility
used, for version upgrade 230, 231, 232
pg_upgrade
URL 232
pgbench
about 247
configuring 25, 26
read-only benchmarking 27
reference link 28
URL 26
used, for estimating disk growth 21, 22
used, for executing test cases 27
write only benchmarking 27
PgBouncer
about 73
installing 79, 80, 81
managing 85
reference link 80
URL, for parameters 84
used, for configuring connection pooling 81
used, for implementing connection pooling 81
pgBucket
URL 215
pgHA
URL 84
Pgpool-II, stopping
fast mode 79
smart mode 79
Pgpool-II
about 72, 73
configuring 75, 76
installing 73, 75
references 77, 78, 79
setup, testing 75, 76
URL 74
pgstattuple() function
URL 211
Phoronix
CPU speed, benchmarking 9, 10
disk speed, benchmarking 14
memory speed, benchmarking 11
reference link 8, 10
references 8
PL/Proxy
installing 96, 97
partitioning, with 97, 100, 101, 102
URL 97, 102
planner statistics
avg_width 199
correlation 200
generating 198, 199
histogram_bounds 199
most_common_elem_freqs 200
most_common_elems 200
most_common_freqs 199
most_common_vals 199
n_distinct 199
null_frac 199
working 199
Postgres-XL cluster
setting up 136
PostgreSQL cluster
upgrading, pg_dump utility used 228, 229
upgrading, pg_upgrade utility used 230, 231,
232
URL 229
PostgreSQL data directory
base 187
global 187
pg_clog 187
pg_log 187
pg_tblsp 187
pg_xlog 187
URL 187
PostgreSQL server, memory units
effective_cache_size 43
maintenance_work_mem 42
max_stack_depth 43
shared_buffers 42
temp_buffers 42
URL 43
wal_buffers 43
work_mem 42
PostgreSQL server, parameters
archive_mode 26
autovacuum_work_mem 25
backend_flush_after 26
checkpoint_completion_target 26
huge_pages 25
log_lock_waits 26
log_temp_files 26
max_connections 26
max_parallel_workers_per_gather 26
max_worker_processes 25
random_page_cost 26
shared_buffers 25
work_mem 25
PostgreSQL server
memory units 41
starting, manually 30
stopping 31, 32
stopping, in emergency 32
URL 26, 228
postgresql.conf file, parameters
archive_command 107
archive_mode 107
listen_addresses 107
max_wal_senders 107
wal_keep_segments 107
wal_level 107
prepared transaction locks 176
Q
query optimization
performing 247
query plan node structure
about 259, 260
working 260
query-related parameters
tuning 36, 37, 38, 39
query
forcing, to use index 323
measuring 305
R
recovery.conf file, parameters
primary_conninfo 108
standby_mode 108
trigger_file 108
Redundant Array of Interdependent Disks (RAID)
about 23
levels 22
RAID 0 23
RAID 1 24
RAID 10 24
RAID 5 24
RAID 6 24
references 24
replica
creating, with repmgr 160, 162
creating, with Walctl 165
repmgr
setting up 157
URL 157
used, for creating replica 160, 162
repository packages
URL 122
routine reindexing
about 194
performing 195, 197
working 197
S
sample data sets
using 247, 249
semi join
about 301
performing 301
working 302
sequential scans
about 269, 315
running 268, 269
versus index scans 313
working 270
server configuration
reloading 33
set operations
about 294
demonstrating 295, 296
EXCEPT 299
INTERSECT 297, 298
UNION 300
working with 294, 296
shared buffers 254
SKIP LOCKED usage 178
skytools 3.2
URL 114
Slony
URL 109, 113
used, for replication 109
slow statements
logging 185
working 186
statistical information
URL 200
statistics collector 168
swapping 56, 57
sysbench
CPU speed, benchmarking 10
system load
monitoring 64, 66
T
table access statistics
pg_stat_user_tables 182, 183, 184
pg_statio_user_tables 184
studying 181, 182
working 184
table level locks 175
table partitioning
implementing 88, 92
managing 93, 94, 96
references 92
timing overhead, system time calls
calculating 250, 251
tmpfs
memory speed, benchmarking 3, 11
test, reading 13
test, writing 12
URL 3, 11
TPC-C
reference link 249
transaction ID
ID wraparound, fixing 222
ID wraparound, preventing 223
wrapping up 217, 220
transactional locks 173, 174
troubleshooting
deadlocks 179
U
unused indexes
searching 322
V
VACUUM FREEZE process, parameters
autovacuum_freeze_max_age 222
vacuum_freeze_min_age 222
vacuum_freeze_table_age 222
vacuum
about 211
progress, monitoring 223
URL, for progress reporting 224
xmax 216
xmin 216
W
WAL management
with OmniPITR 154, 156
Walctl
setting up 162
used, for creating replica 165