database tuning slides - NYU Computer Science Department

Database Tuning
Principles, Experiments and
Troubleshooting Techniques
Dennis Shasha (shasha@cs.nyu.edu)
Philippe Bonnet
(bonnet.p@gmail.com)
1
Availability of Materials
1. PowerPoint presentation is available on
my web site (and Philippe Bonnet's). Just
type our names into Google.
2. Experiments available from the Denmark site
maintained by Philippe (address in a few
slides).
3. Book with the same title available from
Morgan Kaufmann.
2
Database Tuning
Database Tuning is the activity of making a
database application run more quickly.
“More quickly” usually means higher
throughput, though it may mean lower
response time for time-critical applications.
3
Application
Programmer
(e.g., business analyst,
Data architect)
Application
Sophisticated
Application
Programmer
Query Processor
(e.g., SAP admin)
Indexes
Storage Subsystem
Concurrency Control
Recovery
DBA,
Tuner
Operating System
Hardware
[Processor(s), Disk(s), Memory]
4
Outline of Tutorial
1. Basic Principles
2. Tuning the Guts
3. Indexes
4. Relational Systems
5. Application Interface
6. Ecommerce Applications
7. Data Warehouse Applications
8. Distributed Applications
9. Troubleshooting
5
Goal of the Tutorial
• To show:
– Tuning principles that port from one system to
the other and to new technologies
– Experimental results to show the effect of these
tuning principles.
– Troubleshooting techniques for chasing down
performance problems.
6
Tuning Principles Leitmotifs
• Think globally, fix locally (does it matter?)
• Partitioning breaks bottlenecks (temporal
and spatial)
• Start-up costs are high; running costs are
low (disk transfer, cursors)
• Be prepared for trade-offs (indexes and
inserts)
7
Experiments -- why and where
• Simple experiments to illustrate the
performance impact of tuning principles.
• Philippe Bonnet has lots of code on his
course's site:
https://learnit.itu.dk/mod/workshop/view.php?id=43059
8
Experimental DBMS and
Hardware
• Results presented here were obtained with old
systems. Conclusions mostly hold:
– SQL Server 7, SQL Server 2000, Oracle 8i,
Oracle 9i, DB2 UDB 7.1
– Three configurations:
1. Dual Xeon (550MHz, 512Kb), 1Gb RAM, internal RAID
controller from Adaptec (80Mb), 2 Ultra 160 channels, 4x18Gb
drives (10000RPM), Windows 2000.
2. Dual Pentium II (450MHz, 512Kb), 512Mb RAM, 3x18Gb
drives (10000RPM), Windows 2000.
3. Pentium III (1GHz, 256Kb), 1Gb RAM, Adaptec 39160 with 2
channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.
9
Tuning the Guts
• Concurrency Control
– How to minimize lock contention?
• Recovery
– How to manage the writes to the log (to dumps)?
• OS
– How to optimize buffer size, process scheduling, …
• Hardware
– How to allocate CPU, RAM and disk subsystem
resources?
10
Isolation
• Correctness vs. Performance
– Number of locks held by each transaction
– Kind of locks
– Length of time a transaction holds locks
11
Isolation Levels
• Read Uncommitted (no lost update)
– Exclusive locks for write operations are held for the duration of the
transaction
– No locks for reads
• Read Committed (no dirty retrieval)
– Shared locks are released as soon as the read operation terminates.
• Repeatable Read (no unrepeatable reads for read/write)
– Two-phase locking
• Serializable (read/write/insert/delete model)
– Table locking or index locking to avoid phantoms
12
Snapshot isolation
• Each transaction executes
against the version of the data
items that was committed when
the transaction started:
– No locks for read
– Costs space (old copy of data
must be kept)
• Almost serializable level. Example:
– T1: x := y
– T2: y := x
– Initially x = 3 and y = 17
– Serial execution:
x,y = 17 or x,y = 3
– Snapshot isolation:
x = 17, y = 3 if both transactions
start at the same time.
13
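The write-skew example above can be sketched in a few lines of Python (a toy in-memory model of snapshot isolation, not any particular DBMS; the model assumes both concurrent transactions read from the same committed snapshot and that disjoint write sets both commit):

```python
# Toy model of snapshot isolation: each transaction reads the committed
# state as of its start and installs its writes at commit.
# T1: x := y and T2: y := x, starting from x=3, y=17.

def run_serially(db):
    db["x"] = db["y"]                 # T1 first: x := y
    db["y"] = db["x"]                 # then T2: y := x
    return db

def run_concurrently(db):
    snap_t1 = dict(db)                # both transactions start from the
    snap_t2 = dict(db)                # same committed snapshot
    t1_writes = {"x": snap_t1["y"]}   # T1: x := y
    t2_writes = {"y": snap_t2["x"]}   # T2: y := x
    db.update(t1_writes)              # disjoint write sets: both commit
    db.update(t2_writes)
    return db

print(run_serially({"x": 3, "y": 17}))      # {'x': 17, 'y': 17}
print(run_concurrently({"x": 3, "y": 17}))  # {'x': 17, 'y': 3} -- no serial
                                            # execution produces this state
```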
Value of Serializability -- Data
Settings:
accounts( number, branchnum, balance);
create clustered index c on accounts(number);
– 100000 rows
– Cold buffer; same buffer size on all systems.
– Row level locking
– Isolation level (SERIALIZABLE or READ COMMITTED)
– SQL Server 7, DB2 v7.1 and Oracle 8i on Windows 2000
– Dual Xeon (550MHz, 512Kb), 1Gb RAM, internal RAID
controller from Adaptec (80Mb), 4x18Gb drives (10000RPM),
Windows 2000.
14
Value of Serializability -- transactions
Concurrent Transactions:
– T1: summation query [1 thread]
select sum(balance) from accounts;
– T2: swap balance between two account numbers (in
order of scan to avoid deadlocks) [N threads]
valX := select balance from accounts where number=X;
valY := select balance from accounts where number=Y;
update accounts set balance=valX where number=Y;
update accounts set balance=valY where number=X;
15
Value of Serializability -- results
[Charts: ratio of correct answers vs. number of concurrent update
threads (0 to 10), for SQL Server and Oracle, read committed vs.
serializable]
• With SQL Server and DB2 the scan returns
incorrect answers if the read committed
isolation level is used (default setting).
• With Oracle correct answers are returned
(snapshot isolation).
16
Cost of Serializability
[Charts: throughput (trans/sec) vs. number of concurrent update
threads (0 to 10), for SQL Server and Oracle, read committed vs.
serializable]
Because the update conflicts with the scan,
correct answers are obtained at the cost of
decreased concurrency and thus decreased
throughput.
17
Locking Implementation
Database Item      Lock set L              Wait set W
x                  LO1(T1,S), LO3(T3,S)    LW2(T2,X)
y                  LO2(T2,X)               LW4(T4,S), LW5(T5,X)

Transaction ID     Locks
T1                 LO1
T2                 LO2, LW2
T3                 LO3
T4                 LW4
T5                 LW5

@ Dennis Shasha and Philippe Bonnet, 2013
18
Latches and Locks
• Locks are used for concurrency control
– Requests for locks are queued
• Priority queue
– Lock data structure
• Locking mode (S, X), lock granularity, transaction id
• Lock table
• Latches are used for mutual exclusion
– A latch request immediately succeeds or fails
• Active wait (spinning) on latches on multiprocessors
– Single location in memory
• Test-and-set for latch manipulation
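The latch behavior described above can be sketched in Python (an illustrative spinlock; the non-blocking `Lock.acquire` stands in for an atomic test-and-set instruction, and the class name is our own):

```python
import threading

class Latch:
    """Latch as a spinlock: no wait queue, callers actively spin."""
    def __init__(self):
        self._flag = threading.Lock()
    def acquire(self):
        # Spin until the test-and-set succeeds (active wait).
        while not self._flag.acquire(blocking=False):
            pass
    def release(self):
        self._flag.release()

counter = 0
latch = Latch()

def work():
    global counter
    for _ in range(10000):
        latch.acquire()
        counter += 1      # short critical section, as latches require
        latch.release()

threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000 -- no lost updates
```

Spinning pays off only when the critical section is very short; a lock with a wait queue would be wasteful here.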
Phantom Problem
Table R:
E#    Name     age
01    Smith    35
02    Jones    28

Snapshot isolation not in effect.
T1: insert into R values (03, Smythe, 42)
T2: select max(age) from R where Name like 'Sm%';
    select max(age) from R where Name like 'Sm%'
Any serializable execution returns the same max value twice
(either 35 if T2 is executed before T1, or 42 if T1 is executed
before T2). Consider this schedule:
– T2: gets locks on existing rows, reads all rows, computes max (35)
– T1: insert (03, Smythe, 42), commit
– T2: reads all rows again, computes max (42), releases locks, commits
This schedule returns two different max values! Two-phase locking
with row locks does not prevent concurrent insertions, as the locks
only protect existing rows.
Solution to Phantom Problem
[Figure: the set of all tuples that can be inserted in R, the set of
tuples currently in R, and the tuples that satisfy predicate P]
• Solution #1:
– Table locking (mode X)
– No insertion is allowed in the table
– Problem: too coarse if a predicate is
used in transactions
• Solution #2:
– Predicate locking – avoid inserting
tuples that satisfy a given predicate
• E.g., 30 < age < 40
– Problem: very complex to
implement
• Solution #3:
– Next-key locking (NS)
– See index tuning
Locking Overhead -- data
Settings:
accounts( number, branchnum, balance);
create clustered index c on accounts(number);
– 100000 rows
– Cold buffer
– SQL Server 7, DB2 v7.1 and Oracle 8i on Windows 2000
– No lock escalation on Oracle; parameter set so that there is no
lock escalation on DB2; no control on SQL Server.
– Dual Xeon (550MHz, 512Kb), 1Gb RAM, internal RAID
controller from Adaptec (80Mb), 4x18Gb drives (10000RPM),
Windows 2000.
22
Locking Overhead -- transactions
No Concurrent Transactions:
– Update [10 000 updates]
update accounts set balance = Val;
– Insert [10 000 transactions], e.g. typical one:
insert into accounts values(664366,72255,2296.12);
23
Locking Overhead
[Chart: throughput ratio (row locking/table locking) for update and
insert workloads on DB2, SQL Server and Oracle]
• Row locking is barely more expensive than table
locking because recovery overhead is higher than
row locking overhead.
– Exception: updates on DB2, where table locking is
distinctly less expensive than row locking.
24
Logical Bottleneck: Sequential
Key generation
• Consider an application in which one needs
a sequential number to act as a key in a
table, e.g. invoice numbers for bills.
• Ad hoc approach: a separate table holding
the last invoice number. Fetch and update
that number on each insert transaction.
• Counter approach: use facility such as
Sequence (Oracle)/Identity(MSSQL).
25
Counter Facility -- data
Settings:
accounts( number, branchnum, balance);
create clustered index c on accounts(number);
counter ( nextkey );
insert into counter values (1);
– default isolation level: READ COMMITTED; Empty tables
– Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID
controller from Adaptec (80Mb), 4x18Gb drives (10000RPM),
Windows 2000.
26
Counter Facility -- transactions
No Concurrent Transactions:
– System [100 000 inserts, N threads]
• SQL Server 7 (uses Identity column)
insert into accounts values (94496,2789);
• Oracle 8i
insert into accounts values (seq.nextval,94496,2789);
– Ad-hoc [100 000 inserts, N threads]
begin transaction
NextKey:=select nextkey from counter;
update counter set nextkey = NextKey+1;
insert into accounts values(NextKey,?,?);
commit transaction
27
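The two approaches can be sketched on SQLite (illustrative schema and table names; SQLite's AUTOINCREMENT plays the role of Identity/Sequence, and the ad-hoc counter table follows the transaction above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table accounts_sys(number integer primary key autoincrement,"
            " branchnum, balance)")
con.execute("create table accounts_adhoc(number, branchnum, balance)")
con.execute("create table counter(nextkey)")
con.execute("insert into counter values (1)")

# System approach: the DBMS assigns the key itself, outside the
# user transaction, so inserters do not contend on a counter row.
con.execute("insert into accounts_sys(branchnum, balance) values (94496, 2789)")

# Ad-hoc approach: fetch and update the counter inside each transaction,
# which serializes all concurrent inserters on the counter row.
nextkey = con.execute("select nextkey from counter").fetchone()[0]
con.execute("update counter set nextkey = nextkey + 1")
con.execute("insert into accounts_adhoc values (?, 94496, 2789)", (nextkey,))
con.commit()

print(con.execute("select number from accounts_sys").fetchone()[0])    # 1
print(con.execute("select number from accounts_adhoc").fetchone()[0])  # 1
```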
Avoid Bottlenecks: Counters
[Charts: throughput (statements/sec) vs. number of concurrent
insertion threads (0 to 50), for SQL Server and Oracle, system vs.
ad-hoc counter]
• A system-generated counter (system) is much
better than a counter managed as an
attribute value within a table (ad hoc).
The counter is a separate transaction.
• The Oracle counter can become a bottleneck
if every update is logged to disk, but
caching many counter numbers is possible.
• Counters may miss ids.
28
Semantics-Altering Chopping
You call up an airline and want to
reserve a seat. You talk to the
agent, find a seat and then reserve
it.
Should all of this be a single
transaction?
31
Semantics-Altering Chopping II
Probably not. You don’t want to
hold locks through human
interaction.
Transaction redesign: read is its own
transaction. Get seat is another.
Consequences?
32
Atomicity and Durability
[State diagram: BEGIN TRANS → ACTIVE (running, waiting);
COMMIT → COMMITTED; ROLLBACK → ABORTED]
• Every transaction either
commits or aborts. It
cannot change its mind.
• Even in the face of
failures:
– Effects of committed
transactions should be
permanent;
– Effects of aborted
transactions should leave no
trace.
33
[Figure: recovery architecture. In unstable storage, the database
buffer holds pages Pi, Pj and the log buffer holds records lri, lrj.
Log records are written to the log on stable storage before commit;
modified pages are written to the data files after commit; recovery
reads the log to restore the data.]
34
Physical Logging (pure)
• Update records contain before and after images
– Roll-back: install before image
– Roll-forward: install after image
• Pros:
– Idempotent. If a crash occurs during recovery, then
recovery simply restarts, as phase 2 (rollforward) and phase 3
(rollback) rely on operations that are idempotent.
• Cons:
– A single SQL statement might touch many pages and
thus generate many update records
– The before and after images are large
Logical Logging (kdb/other main memory systems)
• Update records contain a logical operation and its
inverse instead of before and after images
– E.g., <op: insert t in T, inv: delete t from T>
• Pro: compact
• Cons:
– Not idempotent.
• Solution: include a LastLSN in each database page. During phase 2
of recovery, an operation is rolled forward iff its LSN is higher than
the LastLSN of the page.
– Not atomic
• What if a logical operation actually involves several pages, e.g., a
data page and an index page? And possibly several index pages?
• Solution: physiological logging
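The LastLSN mechanism can be sketched as a toy Python model (names are illustrative): a logical log record is replayed during phase 2 of recovery only if its LSN exceeds the page's LastLSN, which makes a repeated replay harmless even though the operation itself is not idempotent.

```python
class Page:
    def __init__(self):
        self.rows = []        # page contents
        self.last_lsn = 0     # LSN of the last operation applied

def roll_forward(page, log):
    """Phase 2 of recovery: replay logical operations against the page."""
    for lsn, op, arg in log:           # log record: (LSN, operation, argument)
        if lsn > page.last_lsn:        # skip operations already on the page
            assert op == "insert"      # toy model: inserts only
            page.rows.append(arg)      # 'insert' is NOT idempotent by itself
            page.last_lsn = lsn

page = Page()
log = [(1, "insert", "a"), (2, "insert", "b")]
roll_forward(page, log)
roll_forward(page, log)   # a crash during recovery restarts the replay
print(page.rows)          # ['a', 'b'] -- the second replay changed nothing
```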
Physiological Logging
• Physiological logging
– Physical across pages
– Logical within a page
• Combines the benefits of logical logging and avoids
the problem of atomicity, as logical mini-operations
are bound to a single page
– Logical operations split into mini-operations on each
page
– A log record is created for each mini-operation
– Mini-operations are not idempotent, thus page LSNs
have to be used in phase 2 of recovery
Logging in SQL Server, DB2
Physiological logging. Log entries contain:
- an LSN
- before and after images or a logical log record
[Figure: log caches (free, current, flush) feed a flush queue that is
written to the log with synchronous I/O; waiting processes block on
commit. The db writer and lazywriter write database buffer pages
(free Pi, free Pj) to the data files with asynchronous I/O.]
Logging in Oracle after 10g
Physiological logging.
[Figure: transactions build private redo and in-memory undo, each
protected by an in-memory undo latch and a redo allocation latch;
redo log records (redo+undo) are copied from a free list into the
redo log buffer under the (public) redo allocation latch and redo
copy latches. LGWR (log writer) writes the buffer to the redo log
files (#1, #2); DBWR (database writer) writes database buffer pages
Pi, Pj to the data and undo files.]
Log IO -- data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT );
– READ COMMITTED isolation level
– Empty table
– Dual Xeon (550MHz, 512Kb), 1Gb RAM, internal RAID controller
from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
40
Log IO -- transactions
No Concurrent Transactions:
Insertions [300 000 inserts, 10 threads], e.g.,
insert into lineitem values
(1, 7760, 401, 1, 17, 28351.92, 0.04, 0.02, 'N', 'O',
'1996-03-13', '1996-02-12', '1996-03-22',
'DELIVER IN PERSON', 'TRUCK', 'blithely regular ideas caj');
41
Group Commits
• For small transactions, log records might have to be
flushed to disk before a log page is filled up
• When many small transactions commit, many
IOs are executed to flush half-empty pages
• Group commits delay transaction commit
until a log page is filled up
– Many transactions are committed together
• Pro: avoids too many round trips to disk, i.e.,
improves throughput
• Con: increases mean response time
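A back-of-envelope sketch of the IO saving (the numbers are illustrative: one log record per transaction, 25 records per log page):

```python
RECORDS = 1000            # one log record per small transaction (assumption)
PAGE_CAPACITY = 25        # log records per log page (assumption)

# Without group commit: every commit forces a flush of a (half-empty) page.
flushes_per_txn_commit = RECORDS

# With group commit: only full pages are flushed, so transactions whose
# records share a page share a single IO.
flushes_group_commit = RECORDS // PAGE_CAPACITY

print(flushes_per_txn_commit, flushes_group_commit)  # 1000 40
```

The 25x reduction in log IOs is what buys throughput; the price is that an individual transaction may wait for its page to fill before it is durable.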
Group Commits
[Chart: throughput (tuples/sec) for group commit sizes 1 and 25]
• DB2 UDB v7.1 on Windows 2000
• Log records of many transactions are written together
– Increases throughput by reducing the number of writes
– at the cost of increased minimum response time.
43
Put Log on a Separate Disk
• Improve log writer performance
– HDD: sequential IOs not disturbed by random
IOs
– SSD: minimal garbage collection/wear leveling
• Isolate data and log failures
Put the Log on a Separate Disk
[Chart: throughput (tuples/sec), log on same disk vs. log on separate
disk, with and without controller cache]
• DB2 UDB v7.1 on Windows 2000
• 5% performance improvement if the log is
located on a different disk
• Controller cache hides the negative impact
– mid-range server, with Adaptec RAID controller
(80Mb RAM) and 2x18Gb disk drives.
45
Tuning Database Writes
• Dirty data is written to disk
– When the number of dirty pages is greater than
a given parameter (Oracle 8)
– When the number of dirty pages crosses a given
threshold (less than 3% of free pages in the
database buffer for SQL Server 7)
– When the log is full, a checkpoint is forced.
This can have a significant impact on
performance.
46
Tune Checkpoint Intervals
[Chart: throughput ratio for 0 checkpoints vs. 4 checkpoints]
• Oracle 8i on Windows 2000
• A checkpoint (partial flush of dirty pages to disk)
occurs at regular intervals or when the log is full:
– Impacts the performance of on-line processing
+ Reduces the size of the log
+ Reduces time to recover from a crash
47
Database Buffer Size
[Figure: database processes access pages through the database buffer
in RAM; log and data files on disk; an oversized buffer causes paging]
• Buffer too small, then hit ratio too small
hit ratio = (logical acc. - physical acc.) / (logical acc.)
• Buffer too large, paging
• Recommended strategy: monitor hit ratio and increase
buffer size until hit ratio flattens out. If there is still
paging, then buy memory.
48
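The hit ratio formula above, as a small function (the access counts are illustrative):

```python
def hit_ratio(logical_accesses, physical_accesses):
    """Fraction of logical accesses served from the buffer, not disk."""
    return (logical_accesses - physical_accesses) / logical_accesses

# 100000 page requests, of which 8000 had to go to disk:
print(hit_ratio(100000, 8000))   # 0.92
```

Following the recommended strategy, one would grow the buffer until this value stops improving, then check the OS for paging.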
Buffer Size -- data
Settings:
employees(ssnum, name, lat, long, hundreds1, hundreds2);
clustered index c on employees(lat); (unused)
– 10 distinct values of lat and long, 100 distinct values of hundreds1
and hundreds2
– 20000000 rows (630 Mb);
– Warm buffer
– Dual Xeon (550MHz, 512Kb), 1Gb RAM, internal RAID
controller from Adaptec (80Mb), 4x18Gb drives (10000 RPM),
Windows 2000.
49
Buffer Size -- queries
Queries:
– Scan Query
select sum(long) from employees;
– Multipoint query
select * from employees where lat = ?;
50
Database Buffer Size
[Charts: throughput (queries/sec) vs. buffer size (0 to 1000 Mb) for
the scan query and the multipoint query]
• SQL Server 7 on Windows 2000
• Scan query:
– LRU (least recently used) does
badly when the table spills to disk,
as Stonebraker observed 20
years ago.
• Multipoint query:
– Throughput increases with
buffer size until all data is
accessed from RAM.
51
Scan Performance -- data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT );
– 600 000 rows
– Lineitem tuples are ~160 bytes long
– Cold buffer
– Dual Xeon (550MHz, 512Kb), 1Gb RAM, internal RAID
controller from Adaptec (80Mb), 4x18Gb drives (10000RPM),
Windows 2000.
52
Scan Performance -- queries
Queries:
select avg(l_discount) from lineitem;
53
Prefetching
[Chart: scan throughput (trans/sec) vs. prefetching size (32Kb, 64Kb,
128Kb, 256Kb)]
• DB2 UDB v7.1 on Windows 2000
• Throughput increases up to a certain point
when the prefetching size increases.
54
Usage Factor
[Chart: scan throughput (trans/sec) vs. usage factor (70% to 100%)]
• DB2 UDB v7.1 on Windows 2000
• Usage factor is the percentage of the page
used by tuples and auxiliary data structures
(the rest is reserved for future inserts)
• Scan throughput increases with usage factor.
55
Large Reads
• External algorithms for sorting/hashing
manipulate a working set which is larger than
RAM (or larger than the buffer space allocated
for sorting/hashing)
– The sort or hash is performed in multiple passes
– In each pass, data is read from disk,
hashed/sorted/merged in memory and then written
to secondary storage
– Pass N+1 can only start when pass N is done
writing data back to secondary storage
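The multi-pass structure can be sketched as a two-pass external sort in Python (run size and data are illustrative; pass 2, the n-way merge, starts only after pass 1 has written every sorted run to storage):

```python
import heapq, os, random, tempfile

def external_sort(values, run_size):
    # Pass 1: cut the input into runs that "fit in memory", sort each
    # run, and write it to secondary storage.
    run_files = []
    for i in range(0, len(values), run_size):
        run = sorted(values[i:i + run_size])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, run)))
        f.close()
        run_files.append(f.name)
    # Pass 2: n-way merge of the sorted runs; cannot start until pass 1
    # has finished writing all runs.
    runs = [[int(x) for x in open(name)] for name in run_files]
    merged = list(heapq.merge(*runs))
    for name in run_files:
        os.unlink(name)
    return merged

data = random.sample(range(1000), 100)
assert external_sort(data, run_size=10) == sorted(data)
```

Each pass reads and writes the whole working set, which is why large sequential reads (prefetching, large usage factors) matter so much here.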
Chop Large Update Transactions
• Consider an update-intensive batch transaction
(concurrent access is not an issue). It can be
broken up into short transactions (mini-batches):
+ Does not overfill the log buffers
+ Does not overfill the log files
Example: a transaction that updates, in sorted order, all accounts that had
activity on them in a given day. Break it up into mini-batches, each of which
accesses 10,000 accounts and then updates a global counter.
Note: DB2 has a parameter limiting the portion of the log used by a single
transaction (max_log).
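The mini-batch pattern can be sketched on SQLite (schema and batch size are illustrative; the slide's example uses 10,000 accounts per batch, with a global counter recording progress so the job can resume after a failure):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table accounts(number integer primary key, balance real)")
con.executemany("insert into accounts values (?, 100.0)",
                [(i,) for i in range(1000)])
con.execute("create table progress(last_number integer)")   # global counter
con.execute("insert into progress values (-1)")
con.commit()

batch = 250   # illustrative; the slide uses 10,000 accounts per mini-batch
while True:
    last = con.execute("select last_number from progress").fetchone()[0]
    rows = con.execute("select number from accounts where number > ? "
                       "order by number limit ?", (last, batch)).fetchall()
    if not rows:
        break
    hi = rows[-1][0]
    con.execute("update accounts set balance = balance * 1.01 "
                "where number > ? and number <= ?", (last, hi))
    con.execute("update progress set last_number = ?", (hi,))
    con.commit()   # short transaction: bounded log usage, resumable

print(con.execute(
    "select count(*) from accounts where balance > 100").fetchone()[0])  # 1000
```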
Tuning the Storage Subsystem
58
Outline
• Storage Subsystem Components
– Moore’s law and consequences
– Magnetic disk performance
• From SCSI to SAN, NAS and Beyond
– Storage virtualization
• Tuning the Storage Subsystem
– RAID levels
– RAID controller cache
59
Exponential Growth
Moore’s law
– Every 18 months:
• New processing = sum of all existing processing
• New storage = sum of all existing storage
– 2x / 18 months ~ 100x / 10 years
http://www.intel.com/research/silicon/moorespaper.pdf
60
Consequences of “Moore’s law”
• Over the last decade:
– 10x better access time
– 10x more bandwidth
– 100x more capacity
– 4000x lower media price
– Scan takes 10x longer (3 min vs 45 min)
– Data on disk is accessed 25x less often (on average)
61
Memory Hierarchy
Level                   Access Time    Price $/Mb
Processor cache         1 ns           100
RAM/flash               x10            10
Disks                   x10^6          0.2
Tapes / Optical Disks   x10^10         0.2 (nearline)
63
Magnetic Disks
[Figure: disk anatomy – platters with tracks on a spindle, read/write
heads on disk arms moved by an actuator, controller and disk interface]
• 1956: IBM RAMAC, first disk drive:
5 Mb – 0.002 Mb/in2, 35000 $/year, 9 Kb/sec
• 1980: Seagate, first 5.25'' disk drive:
5 Mb – 1.96 Mb/in2, 625 Kb/sec
• 1999: IBM Microdrive, first 1'' disk drive:
340 Mb, 6.1 MB/sec
64
Magnetic Disks
• Access Time (2001)
– Controller overhead (0.2 ms)
– Seek time (4 to 9 ms)
– Rotational delay (2 to 6 ms)
– Read/write time (10 to 500 KB/ms)
• Disk Interface
– IDE (16 bits, Ultra DMA – 25 MHz)
– SCSI: width (narrow 8 bits vs. wide 16 bits) –
frequency (Ultra3 – 80 MHz).
http://www.pcguide.com/ref/hdd/
65
Solid State Drive (SSD)
[Figure: SSD architecture. Read/write requests in the logical address
space are mapped to read/program/erase operations in the physical
address space by the flash translation layer, which performs
scheduling & mapping, garbage collection and wear leveling. The flash
memory array is organized as chips attached to channels.]
[Figure: example on a disk with 1 channel and 4 chips. Four parallel
reads are channel bound: the page transfers share the single channel.
Four parallel writes are chip bound: the page programs proceed in
parallel on each chip.]
Performance Contract
• Disk (spinning media)
– The block device abstraction hides a lot of complexity while
providing a simple performance contract:
• Sequential IOs are orders of magnitude faster than random IOs
• Contiguity in the logical space favors sequential IOs
• Flash (solid state drive)
– No intrinsic performance contract
– A few invariants:
• No need to avoid random IOs
• Applications should avoid writes smaller than a flash page
• Applications should fill up the device IO queues (but not overflow
them) so that the SSD can leverage its internal parallelism
RAID Levels
• RAID 0: striping (no redundancy)
• RAID 1: mirroring (2 disks)
• RAID 5: parity checking
– Read: stripes read from multiple disks (in parallel)
– Write: 2 reads + 2 writes
• RAID 10: striping and mirroring
• Software vs. Hardware RAID:
– Software RAID: run on the server’s CPU
– Hardware RAID: run on the RAID controller’s CPU
68
Why 4 read/writes when updating
a single stripe using RAID 5?
• Read old data stripe; read parity stripe (2
reads)
• XOR old data stripe with replacing one.
• Take result of XOR and XOR with parity
stripe.
• Write new data stripe and new parity stripe
(2 writes).
69
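The four IOs can be checked with XOR arithmetic on toy byte-string "stripes" (values are illustrative): the new parity is old parity XOR old data XOR new data, so only the old data stripe and the old parity stripe need to be read.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x07\x70"
parity = xor(xor(d0, d1), d2)               # parity = d0 XOR d1 XOR d2

# Update stripe d1 without reading d0 or d2:
new_d1 = b"\xff\xff"
new_parity = xor(xor(parity, d1), new_d1)   # 2 reads: old data, old parity
d1 = new_d1                                 # 2 writes: new data, new parity

# The new parity still reconstructs the updated stripe if its disk fails:
print(xor(xor(new_parity, d0), d2) == d1)   # True
```

XORing out the old data image and XORing in the new one is algebraically the same as recomputing parity over all stripes, which is why RAID 5 never has to read the untouched disks.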
RAID Levels -- data
Settings:
accounts( number, branchnum, balance);
create clustered index c on accounts(number);
– 100000 rows
– Cold Buffer
– Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal
RAID controller from Adaptec (80Mb), 4x18Gb drives
(10000RPM), Windows 2000.
70
RAID Levels -- transactions
No Concurrent Transactions:
– Read Intensive:
select avg(balance) from accounts;
– Write Intensive, e.g. typical insert:
insert into accounts values (690466,6840,2272.76);
Writes are uniformly distributed.
71
RAID Levels
[Charts: throughput (tuples/sec) for the read-intensive and
write-intensive workloads on SoftRAID5, RAID5, RAID0, RAID10, RAID1
and a single disk]
• SQL Server 7 on Windows 2000 (SoftRAID means
striping/parity at host)
• Read-intensive:
– Using multiple disks
(RAID0, RAID10, RAID5)
increases throughput
significantly.
• Write-intensive:
– Without cache, RAID 5
suffers. With cache, it is ok.
72
RAID Levels
• Log File
– RAID 1 is appropriate
• Fault tolerance with high write throughput. Writes are
synchronous and sequential. No benefits in striping.
• Temporary Files
– RAID 0 is appropriate.
• No fault tolerance. High throughput.
• Data and Index Files
– RAID 5 is best suited for read intensive apps or if the
RAID controller cache is effective enough.
– RAID 10 is best suited for write intensive apps.
73
Controller Prefetching no,
Write-back yes.
• Read-ahead:
– Prefetching at the disk controller level.
– No information on access pattern.
– Better to let database management system do it.
• Write-back vs. write through:
– Write back: transfer terminated as soon as data is
written to cache.
• Batteries to guarantee write back in case of power failure
– Write through: transfer terminated as soon as data is
written to disk.
74
SCSI Controller Cache -- data
Settings:
employees(ssnum, name, lat, long, hundreds1,
hundreds2);
create clustered index c on
employees(hundreds2);
– Employees table partitioned over two disks; Log on a separate
disk; same controller (same channel).
– 200 000 rows per table
– Database buffer size limited to 400 Mb.
– Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID
controller from Adaptec (80Mb), 4x18Gb drives (10000RPM),
Windows 2000.
75
SCSI (not disk) Controller Cache
-- transactions
No Concurrent Transactions:
update employees set lat = long, long = lat
where hundreds2 = ?;
– cache friendly: update of 20,000 rows (~90Mb)
– cache unfriendly: update of 200,000 rows (~900Mb)
76
SCSI Controller Cache
[Chart: throughput (tuples/sec) with and without controller cache,
for cache-friendly (90Mb) and cache-unfriendly (900Mb) updates;
2 disks, 80Mb cache]
• SQL Server 7 on Windows 2000.
• Adaptec ServerRaid controller:
– 80 Mb RAM
– Write-back mode
• Updates
• Controller cache increases
throughput whether the
operation is cache friendly or not.
– Efficient replacement policy!
77
Dealing with Multi-Core
[Figure: two sockets with two cores each; each core (CPU) has an L1
cache, each socket a shared L2 cache, connected via the system bus to
RAM and to IO/NIC interrupts]
Cache locality is king:
• Processor affinity
• Interrupt affinity
• Spinning vs. blocking
LOOK UP: SW for shared multi-core, Interrupts and IRQ tuning
Row Store Page Layout

structure record_id {
    integer page_id;
    integer row_id;
}

procedure RECORD_ID_TO_BYTES(record_id rid) returns bytes {
    pid = rid.page_id;
    p = PAGE_ID_TO_PAGE(pid);
    byte byte_array[PAGE_SIZE_IN_BYTES];
    byte_array = p.contents;
    ...
}

Record Structure
procedure COLUMN_ID_TO_BYTES(...) returns bytes
Storing Large Attributes
Columnstore Ids
• Explicit IDs
– Expand size on disk
– Expand size when transferring data to RAM
• Virtual IDs
– Offset as virtual ID
– Trades simple arithmetic for space
• I.e., CPU time for IO time
– Assumes fixed width attributes
• Challenge when using compression
Page Layout
[Figure (source: IEEE): row store (N-ary Storage Model – NSM),
Decomposed Storage Model (DSM), and PAX (Partition Attributes Across)]
PAX Model
• Invented by A. Ailamaki in early 2000s
• IO pattern of NSM
• Great for cache utilization
– columns packed together in cache lines
Partitioning
• There is parallelism (i) across servers, and (ii) within a server
both at the CPU level and throughout the IO stack.
• To leverage this parallelism
– Rely on multiple instances/multiple partitions per instance
• A single database is split across several instances. Different partitions can
be allocated to different CPUs (partition servers) / Disks (partition).
• Problem#1: How to control overall resource usage across
instances/partitions?
– Control the number/priority of threads spawned by a DBMS
instance
• Problem#2: How to manage priorities?
• Problem#3: How to map threads onto the available cores
– Fix: processor/interrupt affinity
@ Dennis Shasha and Philippe Bonnet,
2013
Instance Caging
• Allocating a number of CPUs (cores) or a percentage
of the available IO bandwidth to a given DBMS
instance
• Two policies:
– Partitioning: the total
number of CPUs is
partitioned across all
instances
– Over-provisioning:
more than the total
number of CPUs is
allocated to all instances
[Figure: partitioning assigns instances A–D core counts that sum to
the machine's core count; over-provisioning assigns more cores in
total than are available]
LOOK UP: Instance Caging
Number of DBMS Threads
• Given the DBMS process architecture
– How many threads should be defined for
• Query agents (max per query, max per instance)
– Multiprogramming level (see tuning the writes and index tuning)
• Log flusher
– See tuning the writes
• Page cleaners
– See tuning the writes
• Prefetcher
– See index tuning
• Deadlock detection
– See lock tuning
• Fix the number of DBMS threads based on the number of cores
available at HW/VM level
• Partitioning vs. Over-provisioning
• Provisioning for monitoring, back-up, expensive stored procedures/UDF
Priorities
• Mainframe OSs have long allowed configuring thread
priority as well as IO priority. Now it
is possible to set IO priorities on Linux as well:
– Threads associated to synchronous IOs (writes to the
log, page cleaning under memory pressure, query agent
reads) should have higher priority than threads
associated to asynchronous IOs (prefetching, page
cleaning with no memory pressure) – see tuning the
writes and index tuning
– Synchronous IOs should have higher priority than
asynchronous IOs.
LOOK UP: Getting Priorities Straight, Linux IO priorities
The Priority Inversion Problem
Three transactions T1, T2, T3, in priority order (high to low):
– T3 obtains a lock on x and is preempted
– T1 blocks on the x lock, so is descheduled
– T2 does not access x and runs for a long time
Net effect: T1 waits for T2.
Solutions:
– No thread priority
– Priority inheritance
Processor/Interrupt Affinity
• Mapping of thread context or interrupt to a
given core
– Allows cache line sharing between application
threads or between application thread and interrupt
(or even RAM sharing in NUMA)
– Avoid dispatch of all IO interrupts to core 0 (which
then dispatches software interrupts to the other
cores)
– Should be combined with VM options
– Specially important in NUMA context
– Affinity policy set at OS level or DBMS level?
LOOK UP: Linux CPU tuning
Processor/Interrupt Affinity
• IOs should complete on the core that issued
them
– I/O affinity in SQL server
– Log writers distributed across all NUMA nodes
• Locking of a shared data structure across cores,
and specially across NUMA nodes
– Avoid multiprogramming for query agents that
modify data
– Query agents should be on the same NUMA node
• DBMS have pre-set NUMA affinity policies
LOOK UP: Oracle and NUMA, SQL Server and NUMA
Transferring Large Files (with TCP)
• With the advent of compute clouds, it is
often necessary to transfer large files over
the network when loading a database.
• To speed up the transfer of large files:
– Increase the size of the TCP buffers
– Increase the socket buffer size (Linux)
– Set up TCP large windows (and timestamps)
– Rely on selective acks
LOOK UP: TCP Tuning
Throwing hardware at a problem
• More Memory
– Increase buffer size without increasing paging
• More Disks
– Log on separate disk
– Mirror frequently read file
– Partition large files
• More Processors
– Off-load non-database applications onto other CPUs
– Off-load data mining applications to old database copy
– Increase throughput to shared data
• Shared memory or shared disk architecture
Virtual Storage Performance
Host: Ubuntu 12.04
Core i5
CPU
750 @
2.67GHz
4 cores
Intel 710
(100GN, 2.5in SATA 3Gb/s, 25nm, MLC)
170 MB/s sequential write
85 usec latency write
38500 iops Random reads (full range; 32 iodepth)
75 usec latency reads
noop scheduler
VM: VirtualBox 4.2 (nice 5)
all accelerations enabled
4 CPUs
8 GB VDI disk (fixed)
SATA Controller
Guest: Ubuntu 12.04
noop scheduler
[randReads]
ioengine=libaio
iodepth=32
rw=randread
bs=4k,4k
direct=1
numjobs=1
size=50m
directory=/tmp
Experiments with flexible I/O tester (fio):
- sequential writes (seqWrites.fio)
- random reads (randReads.fio)
[seqWrites]
ioengine=libaio
iodepth=32
rw=write
bs=4k,4k
direct=1
numjobs=1
size=50m
directory=/tmp
Virtual Storage
seqWrites
Default page size is 4k, default iodepth is 3
Performance on Host
Performance on Guest
Virtual Storage
seqWrites
Page size is 4k, iodepth is 32
Virtual Storage - randReads
Page size is 4k*
* experiments with 32k show negligible difference
Index Tuning
• Index issues
– Indexes may be better or worse than scans
– Multi-table joins that run on for hours, because
the wrong indexes are defined
– Concurrency control bottlenecks
– Indexes that are maintained and never used
98
Index
An index is a data structure that supports
efficient access to data
[Diagram: a condition on an attribute value goes through the index (search key), which selects the matching records from the set of records]
99
Search Keys
• A (search) key is a sequence of attributes.
create index i1 on accounts(branchnum, balance);
• Types of keys
– Sequential: the value of the key is monotonic
with the insertion order (e.g., counter or
timestamp)
– Non sequential: the value of the key is
unrelated to the insertion order (e.g., social
security number)
100
Data Structures
• Most index data structures can be viewed as trees.
• In general, the root of this tree will always be in
main memory, while the leaves will be located on
disk.
– The performance of a data structure depends on the
number of nodes in the average path from the root to
the leaf.
– Data structures with a high fan-out (maximum number of children of an internal node) are thus preferred.
101
B+-Tree Performance
• Tree levels
– Tree Fanout
• Size of key
• Page utilization
• Tree maintenance
– Online
• On inserts
• On deletes
– Offline
• Tree locking
• Tree root in main memory
102
B+-Tree
A B+-Tree is a balanced tree whose nodes
contain a sequence of key-pointer pairs.
Depth
Fan-out
B+-Tree
• Nodes contain a bounded number of key-pointer pairs determined by b (branching factor)
– Internal nodes: ceiling(b/ 2) <= size(node) <= b
size(node) = number of key-pointer pairs
– Root node:
• Root is the only node in the tree: 1 <= size(node) <= b
• Internal nodes exist: 2 <= size(node) <= b
– Leaves (no pointers):
• floor(b/2) <= number(keys) <= b-1
• Insertion, deletion algorithms keep the tree
balanced, and maintain these constraints on the
size of each node
– Nodes might then be split or merged, possibly the
depth of the tree is increased.
B+-Tree Performance #1
• Memory / Disk
– Root is always in memory
– What is the portion of the index actually in memory?
• Impacts the number of IOs
• Worst-case: an I/O per level in the B+-tree!
• Fan-out and tree level
– Both are interdependent
– They depend on the branching factor
– In practice, index nodes are mapped onto index pages of fixed size
– Branching factor then depends on key size (pointers of fixed size) and
page utilization
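The interplay between key size, fan-out, and tree levels can be sketched as a back-of-the-envelope calculation. Page size, pointer size, and utilization below are hypothetical values chosen for illustration:

```python
import math

def btree_shape(num_records, page_size=8192, key_bytes=8, pointer_bytes=8,
                utilization=0.69):
    """Estimate the fan-out and number of levels of a B+-tree.

    The branching factor is how many key-pointer pairs fit in a page,
    scaled by typical page utilization (~69% after random inserts).
    """
    fanout = int(page_size // (key_bytes + pointer_bytes) * utilization)
    leaves = math.ceil(num_records / fanout)
    levels, nodes = 1, leaves          # start at the leaf level
    while nodes > 1:                   # walk up until a single root remains
        nodes = math.ceil(nodes / fanout)
        levels += 1
    return fanout, levels

small_key = btree_shape(100_000_000, key_bytes=8)
large_key = btree_shape(100_000_000, key_bytes=256)
```

With an 8 KB page and 100 million records, shrinking the key from 256 bytes to 8 bytes raises the fan-out from about 21 to about 353 and cuts the tree from 7 levels to 4 — each saved level is potentially one less I/O per lookup.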
B+-Tree Performance
• Key length influences fanout
– Choose small key when creating an index
– Key compression
• Prefix compression (Oracle 8, MySQL): only store that part of
the key that is needed to distinguish it from its neighbors: Smi,
Smo, Smy for Smith, Smoot, Smythe.
• Front compression (Oracle 5): adjacent keys have their front
portion factored out: Smi, (2)o, (2)y. There are problems with
this approach:
– Processor overhead for maintenance
– Locking Smoot requires locking Smith too.
106
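A sketch of prefix compression as described above: store only the shortest prefix that distinguishes each key from its neighbors in sorted order (the Smith/Smoot/Smythe example). The function and its representation are illustrative, not any particular DBMS's implementation:

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def prefix_compress(sorted_keys):
    """Keep one character more than the longest common prefix with
    either neighbor -- enough to keep the keys distinguishable."""
    out = []
    for i, key in enumerate(sorted_keys):
        need = 0
        if i > 0:
            need = max(need, lcp(key, sorted_keys[i - 1]))
        if i + 1 < len(sorted_keys):
            need = max(need, lcp(key, sorted_keys[i + 1]))
        out.append(key[:need + 1])
    return out

compressed = prefix_compress(["Smith", "Smoot", "Smythe"])  # Smi, Smo, Smy
```

Shorter stored keys mean more key-pointer pairs per page, hence a higher fan-out.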
Hash Index
• A hash index stores key-value pairs based
on a pseudo-randomizing function called a
hash function.
Hashed key
key
2341
Hash
function
0
1
n
values
R1 R5
R3 R6 R9
R14 R17 R21
R25
The length of
these chains impacts
performance
107
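A toy illustration of how chain length hurts a hash index when the key has few distinct values. The bucket count and data are made up; real hash indexes manage pages and overflow chains, but the effect is the same:

```python
from collections import defaultdict

def build_hash_index(rows, key, buckets=8):
    """Toy hash index: bucket number = hash(key value) mod buckets.

    With few distinct key values, rows pile up into long chains and a
    lookup degrades toward scanning a chain instead of probing a bucket.
    """
    index = defaultdict(list)
    for row in rows:
        index[hash(row[key]) % buckets].append(row)
    return index

# hundreds2 takes only 3 distinct values over 300 rows...
rows = [{"ssnum": i, "hundreds2": i % 3} for i in range(300)]
idx = build_hash_index(rows, "hundreds2")
# ...so every chain holds 100 entries that a lookup must walk.
chain_lengths = sorted(len(v) for v in idx.values())
```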
Clustered / Non clustered index
• Clustered index
(primary index)
– A clustered index on
attribute X co-locates
records whose X values are
near to one another.
• Non-clustered index
(secondary index)
– A non clustered index does
not constrain table
organization.
– There might be several nonclustered indexes per table.
108
Dense / Sparse Index
• Sparse index
– Pointers are associated to pages
• Dense index
– Pointers are associated to records
– Non-clustered indexes are dense
109
DBMS Implementation
• Oracle 11g
– B-tree vs. Hash vs. Bitmap
– Index-organized table vs. heap
• Non-clustered index can be defined on both
– Reverse key indexes
– Key compression
– Invisible index (not visible to optimizer – allows
for experimentation)
– Function-based indexes
• CREATE INDEX idx ON table_1 (a + b * (c - 1), a, b);
See: Oracle 11g indexes
DBMS Implementation
• DB2 10
– B+-tree (hash index only for DB2 for z/OS)
– Non-cluster vs. cluster
– Key compression
– Indexes on expression
• SQL Server 2012
– B+-tree (spatial indexes built on top of B+-trees)
– Columnstore index for OLAP queries
– Non-cluster vs. cluster
– Non-key columns included in index (for coverage)
– Indexes on simple expressions (filtered indexes)
See: DB2 types of indexes, SQLServer 2012 indexes
Types of Queries
• Point Query
SELECT balance
FROM accounts
WHERE number = 1023;
• Multipoint Query
SELECT balance
FROM accounts
WHERE branchnum = 100;
• Range Query
SELECT number
FROM accounts
WHERE balance > 10000 and balance <= 20000;
• Prefix Match Query
SELECT *
FROM employees
WHERE name LIKE 'J%';
113
More Types of Queries
• Extremal Query
SELECT *
FROM accounts
WHERE balance = (select max(balance) from accounts);
• Grouping Query
SELECT branchnum, avg(balance)
FROM accounts
GROUP BY branchnum;
• Ordering Query
SELECT *
FROM accounts
ORDER BY balance;
• Join Query
SELECT distinct branch.address
FROM accounts, branch
WHERE accounts.branchnum = branch.number
and accounts.balance > 10000;
114
Index Tuning -- data
Settings:
employees(ssnum, name, lat, long, hundreds1,
hundreds2);
clustered index c on employees(hundreds1)
with fillfactor = 100;
nonclustered index nc on employees (hundreds2);
index nc3 on employees (ssnum, name, hundreds2);
index nc4 on employees (lat, ssnum, name);
– 1000000 rows ; Cold buffer
– Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec
(80Mb), 4x18Gb drives (10000RPM), Windows 2000.
115
Index Tuning -- operations
Operations:
– Update:
update employees set name = ‘XXX’ where ssnum = ?;
– Insert:
insert into employees values
(1003505,'polo94064',97.48,84.03,4700.55,3987.2);
– Multipoint query:
select * from employees where hundreds1= ?;
select * from employees where hundreds2= ?;
– Covered query:
select ssnum, name, lat from employees;
– Range Query:
select * from employees where long between ? and ?;
– Point Query:
select * from employees where ssnum = ?
116
Clustered Index
[Chart: multipoint-query throughput ratio for clustered index, nonclustered index, and no index, on SQLServer, Oracle, and DB2]
• Multipoint query that
returns 100 records
out of 1000000.
• Cold buffer
• Clustered index is
twice as fast as nonclustered index and
orders of magnitude
faster than a scan.
117
Index “Face Lifts”
[Chart: SQLServer throughput (queries/sec) vs. % increase in table size, with and without maintenance]
• Index is created with
fillfactor = 100.
• Insertions cause page splits
and extra I/O for each query
• Maintenance consists of dropping and recreating the index
• With maintenance
performance is constant
while performance degrades
significantly if no
maintenance is performed.
118
Index Maintenance
[Chart: Oracle throughput (queries/sec) vs. % increase in table size, no maintenance]
• In Oracle, clustered indexes are approximated by an index defined on a clustered table
• No automatic physical
reorganization
• Index defined with pctfree = 0
• Overflow pages cause
performance degradation
119
Covering Index - defined
• Select name from employee where department = 'marketing'
• Good covering index would be on
(department, name)
• Index on (name, department) less useful.
• Index on department alone moderately
useful.
120
Covering Index - impact
[Chart: SQLServer throughput (queries/sec) for covering, covering - not ordered, non clustering, and clustering indexes]
• Covering index performs
better than clustering
index when first attributes
of index are in the where
clause and last attributes
in the select.
• When attributes are not in
order then performance is
much worse.
121
Scan Can Sometimes Win
[Chart: throughput (queries/sec) of scan vs. non-clustering index, as the % of selected records grows from 0 to 25]
• IBM DB2 v7.1 on
Windows 2000
• Range Query
• If a query retrieves 10% of
the records or more,
scanning is often better
than using a nonclustering non-covering
index. Crossover > 10%
when records are large or
table is fragmented on
disk – scan cost increases.
122
Index on Small Tables
[Chart: throughput (updates/sec), no index vs. index]
• Small table: 100 records,
i.e., a few pages.
• Two concurrent processes
perform updates (each
process works for 10ms
before it commits)
• No index: the table is
scanned for each update.
No concurrent updates.
• A clustered index allows the system to take advantage of row locking.
123
Bitmap vs. Hash vs. B+-Tree
Settings:
employees(ssnum, name, lat, long, hundreds1,
hundreds2);
create cluster c_hundreds (hundreds2 number(8)) PCTFREE 0;
create cluster c_ssnum(ssnum integer) PCTFREE 0 size 60;
create cluster c_hundreds(hundreds2 number(8)) PCTFREE 0
HASHKEYS 1000 size 600;
create cluster c_ssnum(ssnum integer) PCTFREE 0 HASHKEYS
1000000 SIZE 60;
create bitmap index b on employees (hundreds2);
create bitmap index b2 on employees (ssnum);
– 1000000 rows ; Cold buffer
– Dual Xeon (550MHz,512Kb), 1Gb RAM, Internal RAID controller from Adaptec (80Mb), 4x18Gb drives (10000RPM), Windows 2000.
124
Multipoint query: B-Tree, Hash, Bitmap
[Chart: multipoint-query throughput (queries/sec) for B-Tree, hash index, and bitmap index]
• There is an overflow chain
in a hash index because
hundreds2 has few values
• In a clustered B-Tree
index records are on
contiguous pages.
• Bitmap is proportional to
size of table and nonclustered for record
access.
125
B-Tree, Hash, Bitmap
[Chart: range-query throughput (queries/sec) for B-Tree, hash index, and bitmap index]
• Hash indexes don't help when evaluating range queries
[Chart: point-query throughput (queries/sec) for B-Tree and hash index]
• Hash index outperforms B-tree on point queries
126
Tuning Relational Systems
• Schema Tuning
– Denormalization, Vertical Partitioning
• Query Tuning
– Query rewriting
– Materialized views
127
Denormalizing -- data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY,
L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE ,
L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS ,
L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT ,
L_SHIPMODE , L_COMMENT );
region( R_REGIONKEY, R_NAME, R_COMMENT );
nation( N_NATIONKEY, N_NAME, N_REGIONKEY, N_COMMENT );
supplier( S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY,
S_PHONE,
S_ACCTBAL, S_COMMENT);
– 600000 rows in lineitem, 25 nations, 5 regions, 500 suppliers
128
Denormalizing -- transactions
lineitemdenormalized ( L_ORDERKEY, L_PARTKEY ,
L_SUPPKEY, L_LINENUMBER, L_QUANTITY,
L_EXTENDEDPRICE ,
L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS ,
L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT ,
L_SHIPMODE , L_COMMENT, L_REGIONNAME);
– 600000 rows in lineitemdenormalized
– Cold Buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM,
3x18Gb drives (10000RPM), Windows 2000.
129
Queries on Normalized vs.
Denormalized Schemas
Queries:
select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY,
L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS,
L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE,
L_COMMENT, R_NAME
from LINEITEM, REGION, SUPPLIER, NATION
where
L_SUPPKEY = S_SUPPKEY
and S_NATIONKEY = N_NATIONKEY
and N_REGIONKEY = R_REGIONKEY
and R_NAME = 'EUROPE';
select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY,
L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS,
L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE,
L_COMMENT, L_REGIONNAME
from LINEITEMDENORMALIZED
where L_REGIONNAME = 'EUROPE';
130
Denormalization
[Chart: throughput (queries/sec), normalized vs. denormalized]
• TPC-H schema
• Query: find all lineitems
whose supplier is in
Europe.
• With a normalized schema
this query is a 4-way join.
• If we denormalize
lineitem and add the name
of the region for each
lineitem (foreign key
denormalization)
throughput improves 30%
131
Vertical Partitioning
• Consider account(id, balance, homeaddress)
• When might it be a good idea to do a
“vertical partitioning” into
account1(id,balance) and
account2(id,homeaddress)?
• Join vs. size.
132
Vertical Partitioning
• Which design is better
depends on the query
pattern:
– The application that sends a
monthly statement is the
principal user of the address
of the owner of an account
– The balance is updated or
examined several times a
day.
• The second schema might
be better because the
relation (account_ID,
balance) can be made
smaller:
– More account_ID, balance
pairs fit in memory, thus
increasing the hit ratio
– A scan performs better
because there are fewer
pages.
133
Tuning Normalization
• A single normalized relation XYZ is better than
two normalized relations XY and XZ if the single
relation design allows queries to access X, Y and
Z together without requiring a join.
• The two-relation design is better iff:
– User accesses tend to partition between the two sets Y and Z most of the time
– Attributes Y or Z have large values
134
Vertical Partitioning
and Scan
• R (X,Y,Z)
– X is an integer
– YZ are large strings
[Chart: scan throughput (queries/sec) for no partitioning (XYZ, XY) vs. vertical partitioning (XYZ, XY)]
• Scan Query
• Vertical partitioning
exhibits poor performance
when all attributes are
accessed.
• Vertical partitioning provides a speed-up if only two of the attributes are accessed.
135
Vertical Partitioning
and Point Queries
• R (X,Y,Z)
– X is an integer
– YZ are large strings
[Chart: throughput (queries/sec) vs. % of accesses that only concern XY (0 to 100), with and without vertical partitioning]
• A mix of point queries
access either XYZ or XY.
• Vertical partitioning gives
a performance advantage
if the proportion of queries
accessing only XY is
greater than 20%.
• The join is not expensive
compared to a simple
look-up.
136
Queries
Settings:
employee(ssnum, name, dept, salary, numfriends);
student(ssnum, name, course, grade);
techdept(dept, manager, location);
clustered index i1 on employee (ssnum);
nonclustered index i2 on employee (name);
nonclustered index i3 on employee (dept);
clustered index i4 on student (ssnum);
nonclustered index i5 on student (name);
clustered index i6 on techdept (dept);
– 100000 rows in employee, 100000 students, 10 departments; Cold buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
137
Queries – View on Join
View Techlocation:
create view techlocation as
select ssnum, techdept.dept, location
from employee, techdept
where employee.dept = techdept.dept;
Queries:
– Original:
select dept from techlocation where ssnum = ?;
– Rewritten:
select dept from employee where ssnum = ?;
138
Query Rewriting - Views
[Chart: throughput improvement percent of the rewritten query over the view, on SQLServer 2000, Oracle 8i, and DB2 V7.1]
• All systems expand the
selection on a view into a
join
• The difference between a
plain selection and a join
(on a primary key-foreign
key) followed by a
projection is greater on
SQL Server than on
Oracle and DB2 v7.1.
139
Queries – Correlated Subqueries
Queries:
– Original:
select ssnum
from employee e1
where salary =
(select max(salary)
from employee e2
where e2.dept = e1.dept);
– Rewritten:
select max(salary) as bigsalary, dept
into TEMP
from employee group by dept;
select ssnum
from employee, TEMP
where salary = bigsalary
and employee.dept = temp.dept;
140
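The rewrite above can be checked on a toy employee table with SQLite (table contents are hypothetical, and a temp table stands in for the slide's SQL Server `SELECT ... INTO TEMP`): both forms must return the same employees.

```python
import sqlite3

# Hypothetical employee rows, just to check the two queries agree.
con = sqlite3.connect(":memory:")
con.execute("create table employee (ssnum int, dept text, salary int)")
con.executemany("insert into employee values (?, ?, ?)",
                [(1, "cs", 90), (2, "cs", 120), (3, "math", 80),
                 (4, "math", 80), (5, "cs", 120)])

# Original: correlated subquery, re-evaluated per outer row.
correlated = con.execute("""
    select ssnum from employee e1
    where salary = (select max(salary) from employee e2
                    where e2.dept = e1.dept)
    order by ssnum""").fetchall()

# Rewritten: materialize the per-department maxima once, then join.
con.execute("""create temp table bigdept as
               select max(salary) as bigsalary, dept
               from employee group by dept""")
rewritten = con.execute("""
    select ssnum from employee, bigdept
    where salary = bigsalary and employee.dept = bigdept.dept
    order by ssnum""").fetchall()
```

The rewritten form computes each department's maximum once instead of once per employee, which is where the throughput gain in the experiment comes from.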
Query Rewriting –
Correlated Subqueries
[Chart: throughput improvement percent from rewriting the correlated subquery, on SQLServer 2000, Oracle 8i, and DB2 V7.1; two systems improve by more than 1000% and 10000%]
• SQL Server 2000 does a
good job at handling the
correlated subqueries (a
hash join is used as
opposed to a nested loop
between query blocks)
– The techniques implemented in SQL Server 2000 are described in "Orthogonal Optimization of Subqueries and Aggregates" by C. Galindo-Legaria and M. Joshi, SIGMOD 2001.
141
Eliminate unneeded DISTINCTs
• Query: Find employees who work in the
information systems department. There should be
no duplicates.
SELECT distinct ssnum
FROM employee
WHERE dept = ‘information systems’
• DISTINCT is unnecessary, since ssnum is a key of employee, so it is certainly a key of any subset of employee.
142
Eliminate unneeded DISTINCTs
• Query: Find social security numbers of
employees in the technical departments.
There should be no duplicates.
SELECT DISTINCT ssnum
FROM employee, tech
WHERE employee.dept = tech.dept
• Is DISTINCT needed?
143
Distinct Unnecessary Here Too
• Since dept is a key of the tech table, each
employee record will join with at most one
record in tech.
• Because ssnum is a key for employee,
distinct is unnecessary.
144
Reaching
• The relationship among DISTINCT, keys and
joins can be generalized:
– Call a table T privileged if the fields returned by the
select contain a key of T
– Let R be an unprivileged table. Suppose that R is joined
on equality by its key field to some other table S, then
we say R reaches S.
– Now, define reaches to be transitive. So, if R1 reaches
R2 and R2 reaches R3 then say that R1 reaches R3.
145
Reaches: Main Theorem
• There will be no duplicates among the
records returned by a selection, even in the
absence of DISTINCT if one of the two
following conditions hold:
– Every table mentioned in the FROM clause is
privileged
– Every unprivileged table reaches at least one
privileged table.
146
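The two conditions of the theorem can be sketched as a small checker. The representation (single-attribute keys, equijoins as 4-tuples) is an assumption made for this sketch, not a general query analyzer:

```python
def distinct_unnecessary(select_fields, keys, equijoins):
    """Check the Reaches theorem's conditions.

    select_fields: set of (table, attribute) pairs in the SELECT list.
    keys: table -> set of single-attribute keys.
    equijoins: (table1, attr1, table2, attr2) equality conditions.
    """
    tables = set(keys)
    # A table is privileged if the SELECT list contains one of its keys.
    privileged = {t for t in tables
                  if any((t, a) in select_fields for a in keys[t])}
    # R reaches S if R is joined on equality by its key field to S.
    edges = {t: set() for t in tables}
    for t1, a1, t2, a2 in equijoins:
        if a1 in keys[t1]:
            edges[t1].add(t2)
        if a2 in keys[t2]:
            edges[t2].add(t1)
    def reaches_privileged(t):
        # "Reaches" is transitive: follow edges until a privileged
        # table is found or the search is exhausted.
        seen, stack = {t}, [t]
        while stack:
            for s in edges[stack.pop()]:
                if s in privileged:
                    return True
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
        return False
    return all(t in privileged or reaches_privileged(t) for t in tables)

keys = {"employee": {"ssnum"}, "tech": {"dept"}}
# Join on tech's key dept: tech reaches the privileged table employee.
no_distinct_needed = distinct_unnecessary(
    {("employee", "ssnum")}, keys,
    [("employee", "dept", "tech", "dept")])
# Join on manager, which is a key of neither table: DISTINCT is needed.
distinct_needed = not distinct_unnecessary(
    {("employee", "ssnum")}, keys,
    [("employee", "manager", "tech", "manager")])
```

The two calls reproduce the dept-join example (DISTINCT removable) and Example 1 below (DISTINCT required).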
Reaches: Proof Sketch
• If every relation is privileged then there are no
duplicates
– The keys of those relations are in the from clause
• Suppose some relation T is not privileged but
reaches at least one privileged one, say R. Then
the qualifications linking T with R ensure that
each distinct combination of privileged records is
joined with at most one record of T.
147
Reaches: Example 1
SELECT ssnum
FROM employee, tech
WHERE employee.manager = tech.manager
• The same employee record may match several
tech records (because manager is not a key of
tech), so the ssnum of that employee record may
appear several times.
• Tech does not reach the privileged relation
employee.
148
Reaches: Example 2
SELECT ssnum, tech.dept
FROM employee, tech
WHERE employee.manager = tech.manager
• Each repetition of a given ssnum value would be accompanied by a new tech.dept, since tech.dept is a key of tech
• Both relations are privileged.
149
Reaches: Example 3
SELECT student.ssnum
FROM student, employee, tech
WHERE student.name = employee.name
AND employee.dept = tech.dept;
• Student is privileged
• Employee does not reach student (name is not a
key of employee)
• DISTINCT is needed to avoid duplicates.
150
Aggregate Maintenance -- data
Settings:
orders( ordernum, itemnum, quantity, storeid, vendorid );
create clustered index i_order on orders(itemnum);
store( storeid, name );
item(itemnum, price);
create clustered index i_item on item(itemnum);
vendorOutstanding( vendorid, amount);
storeOutstanding( storeid, amount);
– 1000000 orders, 10000 stores, 400000 items; Cold buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
151
Aggregate Maintenance -triggers
Triggers for Aggregate Maintenance
create trigger updateVendorOutstanding on orders for insert as
update vendorOutstanding
set amount =
(select vendorOutstanding.amount+sum(inserted.quantity*item.price)
from inserted,item
where inserted.itemnum = item.itemnum
)
where vendorid = (select vendorid from inserted) ;
create trigger updateStoreOutstanding on orders for insert as
update storeOutstanding
set amount =
(select storeOutstanding.amount+sum(inserted.quantity*item.price)
from inserted,item
where inserted.itemnum = item.itemnum
)
where storeid = (select storeid from inserted)
152
Aggregate Maintenance -transactions
Concurrent Transactions:
– Insertions
insert into orders values
(1000350,7825,562,'xxxxxx6944','vendor4');
– Queries (first without, then with redundant tables)
select orders.vendorid, sum(orders.quantity*item.price)
from orders, item
where orders.itemnum = item.itemnum
group by orders.vendorid;
vs. select * from vendorOutstanding;
select store.storeid, sum(orders.quantity*item.price)
from orders, item, store
where orders.itemnum = item.itemnum
and orders.storeid = store.storeid
group by store.storeid;
vs. select * from storeOutstanding;
153
Aggregate Maintenance
[Chart: pct. of gain with aggregate maintenance: insert -62.2, vendor total 31900, store total 21900]
insert
vendor total
store total
• SQLServer 2000 on
Windows 2000
• Using triggers for
view maintenance
• If queries frequent or
important, then
aggregate maintenance
is good.
154
Superlinearity -- data
Settings:
sales( id, itemid, customerid, storeid, amount, quantity);
item (itemid);
customer (customerid);
store (storeid);
A sale is successful if all foreign keys are present.
successfulsales(id, itemid, customerid, storeid, amount,
quantity);
unsuccessfulsales(id, itemid, customerid, storeid, amount,
quantity);
tempsales( id, itemid, customerid, storeid,
amount,quantity);
155
Superlinearity -- indexes
Settings (non-clustering, dense indexes):
index s1 on item(itemid);
index s2 on customer(customerid);
index s3 on store(storeid);
index succ on successfulsales(id);
– 1000000 sales, 400000 customers, 40000 items, 1000 stores
– Cold buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
156
Superlinearity -- queries
Queries:
– Insert/delete
insert into successfulsales
select sales.id, sales.itemid, sales.customerid,
sales.storeid, sales.amount, sales.quantity
from sales, item, customer, store
where sales.itemid = item.itemid
and sales.customerid = customer.customerid
and sales.storeid = store.storeid;
insert into unsuccessfulsales
select * from sales;
go
delete from unsuccessfulsales
where id in (select id from successfulsales)
157
Superlinearity -- batch queries
Queries:
– Small batches
DECLARE @Nlow INT;
DECLARE @Nhigh INT;
DECLARE @INCR INT;
set @INCR = 100000
set @NLow = 0
set @Nhigh = @INCR
WHILE (@NLow <= 500000)
BEGIN
insert into tempsales
select * from sales
where id between @NLow and @Nhigh
set @Nlow = @Nlow + @INCR
set @Nhigh = @Nhigh + @INCR
delete from tempsales
where id in (select id from successfulsales);
insert into unsuccessfulsales
select * from tempsales;
delete from tempsales;
END
158
Superlinearity -- outer join
Queries:
– outerjoin
insert into successfulsales
select sales.id, item.itemid, customer.customerid, store.storeid,
sales.amount, sales.quantity
from
((sales left outer join item on sales.itemid = item.itemid)
left outer join customer on sales.customerid = customer.customerid)
left outer join store on sales.storeid = store.storeid;
insert into unsuccessfulsales
select *
from successfulsales
where itemid is null
or customerid is null
or storeid is null;
go
delete from successfulsales
where itemid is null
or customerid is null
or storeid is null
159
Circumventing Superlinearity
[Chart: response time (sec) on small and large inputs for insert/delete no index, insert/delete indexed, small batches, and outer join]
• SQL Server 2000
• Outer join achieves the
best response time.
• Small batches do not
help because overhead
of crossing the
application interface is
higher than the benefit
of joining with smaller
tables.
160
Tuning the Application Interface
• 4GL
– Power++, Visual Basic
• Programming language +
Call Level Interface
– ODBC: Open DataBase Connectivity
– JDBC: Java based API
– OCI (C++/Oracle), CLI (C++/DB2), Perl/DBI
• In the following experiments, the client program is
located on the database server site. Overhead is
due to crossing the application interface.
161
Looping can hurt -- data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY,
L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE ,
L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS ,
L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT ,
L_SHIPMODE , L_COMMENT );
– 600 000 rows; warm buffer.
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
162
Looping can hurt -- queries
• Queries:
– No loop:
sqlStmt = “select * from lineitem where l_partkey < 200;”
odbc->prepareStmt(sqlStmt);
odbc->execPrepared(sqlStmt);
– Loop:
sqlStmt = “select * from lineitem where l_partkey = ?;”
odbc->prepareStmt(sqlStmt);
for (int i=1; i<200; i++)
{
odbc->bindParameter(1, SQL_INTEGER, i);
odbc->execPrepared(sqlStmt);
}
163
Looping can Hurt
[Chart: throughput (records/sec), loop vs. no loop]
• SQL Server 2000 on
Windows 2000
• Crossing the application
interface has a significant
impact on performance.
• Why would a programmer use a loop instead of relying on set-oriented operations: object orientation?
164
Cursors are Death -- data
Settings:
employees(ssnum, name, lat, long, hundreds1,
hundreds2);
– 100000 rows ; Cold buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
165
Cursors are Death -- queries
Queries:
– No cursor
select * from employees;
– Cursor
DECLARE d_cursor CURSOR FOR select * from employees;
OPEN d_cursor
FETCH NEXT from d_cursor
while (@@FETCH_STATUS = 0)
BEGIN
FETCH NEXT from d_cursor
END
CLOSE d_cursor
DEALLOCATE d_cursor
go
166
Cursors are Death
[Chart: throughput (records/sec), cursor vs. plain SQL]
• SQL Server 2000 on
Windows 2000
• Response time is a few
seconds with a SQL
query and more than
an hour iterating over
a cursor.
167
Retrieve Needed Columns Only data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY,
L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE ,
L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS ,
L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT ,
L_SHIPMODE , L_COMMENT );
create index i_nc_lineitem on lineitem (l_orderkey,
l_partkey, l_suppkey, l_shipdate, l_commitdate);
– 600 000 rows; warm buffer.
– Lineitem records are ~ 10 bytes long
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
168
Retrieve Needed Columns Only queries
Queries:
– All
Select * from lineitem;
– Covered subset
Select l_orderkey, l_partkey, l_suppkey, l_shipdate,
l_commitdate from lineitem;
169
Retrieve Needed Columns Only
• Avoid transferring unnecessary data
• May enable use of a covering index.
• In the experiment the subset contains ¼ of the attributes.
[Chart: throughput (queries/msec) for all columns vs. covered subset, with no index and with index]
Experiment performed on
Oracle8iEE on Windows 2000.
– Reducing the amount of
data that crosses the
application interface yields
significant performance
improvement.
170
Bulk Loading Data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY,
L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE ,
L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS ,
L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT ,
L_SHIPMODE , L_COMMENT );
– Initially the table is empty; 600 000 rows to be inserted (138Mb)
– Table sits on one disk. No constraint or index is defined.
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
171
Bulk Loading Queries
Oracle 8i
sqlldr directpath=true control=load_lineitem.ctl data=E:\Data\lineitem.tbl
load data
infile "lineitem.tbl"
into table LINEITEM append
fields terminated by '|'
(
L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS,
L_SHIPDATE DATE "YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD",
L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT, L_SHIPMODE,
L_COMMENT
)
172
Direct Path
[Chart: throughput (rec/sec) for conventional bulk load, direct path load, and inserts; inserts achieve 65 rec/sec]
Experiment performed on
Oracle8iEE on Windows 2000.
• Direct path loading bypasses the query engine and the storage manager. It is orders of magnitude faster than conventional bulk load (commit every 100 records) and inserts (commit for each record).
173
Batch Size
[Chart: throughput (records/sec) vs. batch size, from 0 to 600000 records]
Experiment performed on
SQL Server 2000
on Windows 2000.
• Throughput increases steadily when the batch size increases to 100000 records. Throughput remains constant afterwards.
• Trade-off between performance and the amount of data that has to be reloaded in case of a problem.
174
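The batch-size trade-off can be sketched with SQLite (schema and row counts are made up): committing once per batch amortizes commit overhead, at the price of more work to redo if the load fails part-way through.

```python
import sqlite3

def load_in_batches(rows, batch_size):
    """Insert rows into a toy lineitem table, committing once per batch."""
    con = sqlite3.connect(":memory:")
    con.execute("create table lineitem (orderkey int, quantity int)")
    commits = 0
    for start in range(0, len(rows), batch_size):
        con.executemany("insert into lineitem values (?, ?)",
                        rows[start:start + batch_size])
        con.commit()  # a failure here loses at most one batch of work
        commits += 1
    count = con.execute("select count(*) from lineitem").fetchone()[0]
    con.close()
    return count, commits

rows = [(i, i % 50) for i in range(10_000)]
# Larger batches -> fewer commits (less overhead), but more to reload
# if the load fails part-way through.
small = load_in_batches(rows, 100)     # 100 commits
large = load_in_batches(rows, 5_000)   # 2 commits
```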
Tuning
E-Commerce Applications
Database-backed web-sites:
– Online shops
– Shop comparison portals
– MS TerraServer
175
E-commerce Application
Architecture
[Diagram: Clients → Web caches → Web servers → DB caches → Application servers → Database server]
176
E-commerce Application
Workload
• Touristic searching (frequent, cached)
– Access the top few pages. Pages may be personalized.
Data may be out-of-date.
• Category searching (frequent, partly cached and
need for timeliness guarantees)
– Down some hierarchy, e.g., men’s clothing.
• Keyword searching (frequent, uncached, need for
timeliness guarantees)
• Shopping cart interactions (rare, but transactional)
• Electronic purchasing (rare, but transactional)
177
Design Issues
• Need to keep historic information
– Electronic payment acknowledgements get lost.
• Preparation for variable load
– Regular patterns of web site accesses during the day,
and within a week.
• Possibility of disconnections
– State information transmitted to the client (cookies)
• Special consideration for low bandwidth
• Schema evolution
– Representing e-commerce data as attribute-value pairs
(IBM Websphere)
178
Caching
• Web cache:
– Static web pages
– Caching fragments of dynamically created web pages
• Database cache (Oracle9iAS, TimesTen’s
FrontTier)
– Materialized views to represent cached data.
– Queries are executed either using the database cache or
the database server. Updates are propagated to keep the
cache(s) consistent.
– Note to vendors: It would be good to have queries
distributed between cache and server.
179
Ecommerce -- setting
Settings:
shoppingcart( shopperid, itemid, price, qty);
– 500000 rows; warm buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
180
Ecommerce -- transactions
Concurrent Transactions:
– Mix
insert into shoppingcart values (107999,914,870,214);
update shoppingcart set Qty = 10 where shopperid =
95047 and itemid = 88636;
delete from shoppingcart where shopperid = 86123 and
itemid = 8321;
select shopperid, itemid, qty, price from shoppingcart
where shopperid = ?;
– Queries Only
select shopperid, itemid, qty, price from shoppingcart
where shopperid = ?;
181
Connection Pooling (no refusals)
[Chart: response time vs. number of client threads (20 to 140), connection pooling vs. simple connections]
Experiment performed on Oracle8i on Windows 2000
• Each thread establishes a
connection and performs 5
insert statements.
• If a connection cannot be
established the thread waits 15
secs before trying again.
• The number of connections is limited to 60 on the database server.
• Using connection pooling, the
requests are queued and
serviced when possible. There
are no refused connections.
182
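A minimal sketch of the pooling idea: a fixed set of connections held in a queue, so requests beyond the limit wait instead of being refused. The connection factory and the fake "work" are stand-ins, not a real database driver:

```python
import queue
import threading

class ConnectionPool:
    """Hand out at most `size` connections; excess callers block."""
    def __init__(self, factory, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(factory())   # pre-create the connections
    def acquire(self):
        return self._q.get()         # blocks when the pool is exhausted
    def release(self, conn):
        self._q.put(conn)            # wakes one waiting caller

# Demo: a 60-connection "server" serving 120 concurrent clients.
pool = ConnectionPool(factory=object, size=60)
served = []

def client(i):
    conn = pool.acquire()
    try:
        served.append(i)             # pretend to run 5 insert statements
    finally:
        pool.release(conn)           # always hand the connection back

threads = [threading.Thread(target=client, args=(i,)) for i in range(120)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 120 requests are eventually served; none is refused.
```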
Indexing
• Using a clustered index on shopperid in the shopping cart provides:
– Query speed-up
– Update/Deletion speed-up
[Chart: throughput (queries/sec), no index vs. clustered index, for the Mix and Queries Only workloads]
Experiment performed on SQL Server 2000 on Windows 2000
183
Capacity Planning
[Diagram: customers move from Entry (S1) to Search (S2) with probability
0.4; Search (S2) feeds back into itself with probability 0.5 and leads
to Checkout (S3) with probability 0.1.]
• Arrival Rate
– A1 is given as an assumption
– A2 = (0.4 A1) + (0.5 A2)
– A3 = 0.1 A2
• Service Time (S)
– S1, S2, S3 are measured
• Utilization
– U = A x S
• Response Time
– R = U/(A(1-U)) = S/(1-U)
(assuming Poisson arrivals)
Getting the demand assumptions right
is what makes capacity planning hard
184
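The arithmetic on this slide can be checked in a few lines. Only the formulas come from the slide; the entry arrival rate A1 and the service time used below are hypothetical assumptions.

```python
# Capacity-planning formulas from the slide: U = A x S, R = S/(1-U).
def utilization(A, S):
    return A * S

def response_time(A, S):
    U = utilization(A, S)
    return S / (1 - U)              # assumes Poisson arrivals and U < 1

A1 = 4.0                            # entry arrivals/sec (assumption)
A2 = (0.4 * A1) / (1 - 0.5)         # solving A2 = 0.4*A1 + 0.5*A2 for A2
A3 = 0.1 * A2

S2 = 0.1                            # service time at Search (assumption)
print(round(A2, 2), round(A3, 2))          # 3.2 0.32
print(round(response_time(A2, S2), 3))     # 0.147
```

Note that the self-loop on Search makes A2 larger than 0.4 A1: the recursive equation has to be solved before utilization can be computed.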
How to Handle Multiple Servers
• Suppose one has n servers for some task
that requires S time for a single server to
perform.
• The perfect parallelism model is that it is as
if one has a single server that is n times as
fast.
• However, this overstates the advantage of
parallelism, because even if there were no
waiting, single tasks require S time.
185
Rough Estimate for Multiple Servers
• There are two components to response time:
waiting time + service time.
• In the parallel setting, the service time is
still S.
• The waiting time however can be well
estimated by a server that is n times as fast.
186
Approximating waiting time for n
parallel servers.
• Recall: R = U/(A(1-U)) = S/(1-U)
• On an n-times faster server, service time is
divided by n, so the single processor
utilization U is also divided by n. So we
would get: Rideal = (S/n)/(1 – (U/n)).
• Then Rideal = serviceideal + waitideal.
• So waitideal = Rideal – S/n
• Our assumption: waitideal ~ wait for n
processors.
187
Approximating response time for
n parallel servers
• Waiting time for n parallel processors ~
(S/n)/(1 – (U/n)) – S/n =
(S/n) ( 1/(1-(U/n)) – 1) =
(S/(n(1 – U/n)))(U/n) =
(S/(n – U))(U/n)
• So, response time for n parallel processors
is the above waiting time + S.
188
Example
• A = 8 per second.
• S = 0.1 second.
• U = 0.8.
• Single server response time = S/(1-U) = 0.1/0.2 =
0.5 seconds.
• If we have 2 servers, then we estimate waiting
time to be (0.1/(2-0.8))(0.4) = 0.04/1.2 = 0.033. So
the response time is 0.133.
• For a 2-times faster server, S = 0.05, U = 0.4, so
response time is 0.05/0.6 = 0.0833.
189
Example -- continued
• A = 8 per second.
• S = 0.1 second.
• U = 0.8.
• If we have 4 servers, then we estimate waiting
time to be (S/(n – U))(U/n) = (0.1/3.2) * (0.8/4) =
0.02/3.2 = 0.00625. So response time is 0.10625.
190
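The two worked examples can be reproduced directly from the approximation; the function below just restates the formula from the preceding slides.

```python
# Multi-server approximation from the preceding slides:
# wait_n ~ (S / (n - U)) * (U / n), and response_n ~ wait_n + S
# (the service itself still takes S, not S/n).
def response_n(A, S, n):
    U = A * S                        # single-server utilization
    wait = (S / (n - U)) * (U / n)
    return wait + S

A, S = 8.0, 0.1                      # the slides' numbers: U = 0.8
print(round(response_n(A, S, 1), 3))   # 0.5   (matches S/(1-U))
print(round(response_n(A, S, 2), 3))   # 0.133
print(round(response_n(A, S, 4), 3))   # 0.106
```

Going from 1 to 2 servers cuts response time from 0.5 to about 0.133 seconds; adding 2 more servers helps much less, because the irreducible service time S dominates once waiting is small.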
Datawarehouse Tuning
• Aggregate (strategic) targeting:
– Aggregates flow up from a wide selection of
data, and then
– Targeted decisions flow down
• Examples:
– Riding the wave of clothing fads
– Tracking delays for frequent-flyer customers
191
Data Warehouse
Workload
• Broad
– Aggregate queries over ranges of values, e.g., find
the total sales by region and quarter.
• Deep
– Queries that require precise individualized
information, e.g., which frequent flyers have been
delayed several times in the last month?
• Dynamic (vs. Static)
– Queries that require up-to-date information, e.g.
which nodes have the highest traffic now?
192
Tuning Knobs
• Indexes
• Materialized views
• Approximation
193
Bitmaps -- data
Settings:
lineitem ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE,
L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT );
create bitmap index b_lin_2 on lineitem(l_returnflag);
create bitmap index b_lin_3 on lineitem(l_linestatus);
create bitmap index b_lin_4 on lineitem(l_linenumber);
– 100000 rows ; cold buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
194
Bitmaps -- queries
Queries:
– 1 attribute
select count(*) from lineitem where l_returnflag =
'N';
– 2 attributes
select count(*) from lineitem where l_returnflag = 'N'
and l_linenumber > 3;
– 3 attributes
select count(*) from lineitem where l_returnflag =
'N' and l_linenumber > 3 and l_linestatus = 'F';
195
Bitmaps
[Graph: throughput (queries/sec, 0-12) for the 1-, 2-, and 3-attribute
queries, comparing a linear scan with bitmap indexes. l_returnflag takes
the values A, N, R; l_linestatus takes the values O, F.]
• Order of magnitude improvement compared to scan, because summing up
the 1s is very fast.
• Bitmaps are best suited for multiple conditions on several
attributes, each having a low selectivity.
196
Vertical Partitioning Revisited
In the same settings where bit vectors are
useful – lots of attributes, unpredictable
combinations queried – vertical partitioning
is also attractive.
If a table has 200 attributes, there is no
reason to retrieve 200-field rows if only three
attributes are needed.
Sybase IQ, KDB, Vertica all use column-oriented
tables.
197
Multidimensional Indexes -- data
Settings:
create table spatial_facts
( a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7
int, a8 int, a9 int, a10 int, geom_a3_a7
mdsys.sdo_geometry );
create index r_spatialfacts on
spatial_facts(geom_a3_a7) indextype is
mdsys.spatial_index;
create bitmap index b2_spatialfacts on
spatial_facts(a3,a7);
– 500000 rows ; cold buffer
– Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives
(10000RPM), Windows 2000.
198
Multidimensional Indexes -- queries
Queries:
– Point Queries
select count(*) from fact where a3 = 694014 and a7 = 928878;
select count(*) from spatial_facts where
SDO_RELATE(geom_a3_a7, MDSYS.SDO_GEOMETRY(2001, NULL,
MDSYS.SDO_POINT_TYPE(694014,928878, NULL), NULL, NULL),
'mask=equal querytype=WINDOW') = 'TRUE';
– Range Queries
select count(*) from spatial_facts where
SDO_RELATE(geom_a3_a7, mdsys.sdo_geometry(2003,NULL,NULL,
mdsys.sdo_elem_info_array(1,1003,3),mdsys.sdo_ordinate_array
(10,800000,1000000,1000000)), 'mask=inside
querytype=WINDOW') = 'TRUE';
select count(*) from spatial_facts where a3 > 10 and a3 <
1000000 and a7 > 800000 and a7 < 1000000;
199
Multidimensional Indexes
[Graph: response time (sec, 0-14) for the point query and the range
query, comparing a bitmap on two attributes with an r-tree.]
• Oracle 8i on Windows 2000
• Spatial Extension:
– 2-dimensional data
– Spatial functions used in the query
• R-tree does not perform well because of the overhead of the spatial
extension.
200
Multidimensional Indexes
R-Tree
SELECT STATEMENT
 SORT AGGREGATE
  TABLE ACCESS BY INDEX ROWID SPATIAL_FACTS
   DOMAIN INDEX R_SPATIALFACTS
Bitmaps
SELECT STATEMENT
 SORT AGGREGATE
  BITMAP CONVERSION COUNT
   BITMAP AND
    BITMAP INDEX SINGLE VALUE B_FACT7
    BITMAP INDEX SINGLE VALUE B_FACT3
201
Materialized Views -- data
Settings:
orders( ordernum, itemnum, quantity, storeid, vendor );
create clustered index i_order on orders(itemnum);
store( storeid, name );
item(itemnum, price);
create clustered index i_item on item(itemnum);
– 1000000 orders, 10000 stores, 400000 items; Cold buffer
– Oracle 9i
– Pentium III (1 GHz, 256 Kb), 1Gb RAM, Adapter 39160 with 2
channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.
202
Materialized Views -- data
Settings:
create materialized view vendorOutstanding
build immediate
refresh complete
enable query rewrite
as
select orders.vendor, sum(orders.quantity*item.price)
from orders,item
where orders.itemnum = item.itemnum
group by orders.vendor;
203
Materialized Views -transactions
Concurrent Transactions:
– Insertions
insert into orders values
(1000350,7825,562,'xxxxxx6944','vendor4');
– Queries
select orders.vendor, sum(orders.quantity*item.price)
from orders,item
where orders.itemnum = item.itemnum
group by orders.vendor;
select * from vendorOutstanding;
204
Materialized Views
• Graphs (Oracle9i on Linux; total sale by vendor is materialized):
– Response time (sec, 0-14) of the aggregate query with and without
the materialized view.
– Throughput (statements/sec, 0-1600) with a materialized view (fast
on commit), a materialized view (complete on demand), and no
materialized view.
• Trade-off between query speed-up and view maintenance:
– The impact of incremental maintenance on performance is significant.
– Rebuild maintenance achieves a good throughput.
– A static data warehouse offers a good trade-off.
205
Materialized View
Maintenance
• Problem when large number of views to maintain.
• The order in which views are maintained is important:
– A view can be computed from an existing view instead of being
recomputed from the base relations (total per region can be
computed from total per nation).
• Let the views and base tables be nodes v_i.
• Let there be an edge from v_1 to v_2 if it is possible to compute
the view v_2 from v_1. Associate the cost of computing v_2 from v_1
with this edge.
• Compute all pairs shortest path where the start nodes are the set of
base tables.
• The result is an acyclic graph A.
• Take a topological sort of A and let that be the order of view
construction.
206
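The ordering procedure above can be sketched on a toy graph. The view names and edge costs below are hypothetical; the point is that shortest paths from the base tables pick the cheapest derivation of each view, and refreshing views in increasing distance order refreshes every view before any view derived from it.

```python
import heapq

# Edge cost = cost of computing the target view from the source node
# (hypothetical numbers for illustration).
edges = {
    "detail":           {"total_per_nation": 10},
    "total_per_nation": {"total_per_region": 2},
}
base_tables = ["detail"]

def maintenance_order(edges, bases):
    dist = {b: 0 for b in bases}
    heap = [(0, b) for b in bases]
    while heap:                                  # Dijkstra from all base tables
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for succ, cost in edges.get(node, {}).items():
            if d + cost < dist.get(succ, float("inf")):
                dist[succ] = d + cost
                heapq.heappush(heap, (d + cost, succ))
    views = [n for n in dist if n not in bases]
    # Increasing shortest-path distance is a valid refresh order on the
    # shortest-path tree: every view's chosen parent is strictly closer.
    return sorted(views, key=lambda n: dist[n])

print(maintenance_order(edges, base_tables))
# ['total_per_nation', 'total_per_region']
```

Here total per region (cost 10 + 2 via total per nation) is refreshed after total per nation, matching the slide's example.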
Materialized View Example
• Detail(storeid, item, qty, price, date)
• MaterializedView1(storeid, category, qty, month)
• MaterializedView2(city, category, qty, month)
• MaterializedView3(storeid, category, qty, year)
207
Approximations -- data
Settings:
– TPC-H schema
– Approximations
insert into approxlineitem
select top 6000 *
from lineitem
where l_linenumber = 4;
insert into approxorders
select O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS,
O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK,
O_SHIPPRIORITY, O_COMMENT
from orders, approxlineitem
where o_orderkey = l_orderkey;
208
Approximations -- queries
insert into approxsupplier
select distinct S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY,
       S_PHONE, S_ACCTBAL, S_COMMENT
from approxlineitem, supplier
where s_suppkey = l_suppkey;

insert into approxpart
select distinct P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE,
       P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT
from approxlineitem, part
where p_partkey = l_partkey;

insert into approxpartsupp
select distinct PS_PARTKEY, PS_SUPPKEY, PS_AVAILQTY,
       PS_SUPPLYCOST, PS_COMMENT
from partsupp, approxpart, approxsupplier
where ps_partkey = p_partkey and ps_suppkey = s_suppkey;

insert into approxcustomer
select distinct C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY,
       C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT
from customer, approxorders
where o_custkey = c_custkey;

insert into approxregion select * from region;
insert into approxnation select * from nation;
209
Approximations -- more queries
Queries:
– Single table query on lineitem
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as
sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where datediff(day, l_shipdate, '1998-12-01') <= '120'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
210
Approximations -- still more
Queries:
– 6-way join
select n_name, avg(l_extendedprice * (1 - l_discount)) as
revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
and o_orderdate >= '1993-01-01'
and datediff(year, o_orderdate,'1993-01-01') < 1
group by n_name
order by revenue desc;
211
Approximation accuracy
[Graph 1: TPC-H query Q1 on lineitem -- approximated aggregate values
as % of the base aggregate value (-40 to +60) for the 8 aggregate
values returned, with a 1% and a 10% sample.]
[Graph 2: TPC-H query Q5 (6-way join) -- approximated aggregate values
as % of the base aggregate value (-40 to +60) for the 5 groups returned
in the query, with a 1% and a 10% sample.]
• Good approximation for query Q1 on lineitem.
• The aggregated values obtained on a query with a 6-way join are
significantly different from the actual values -- for some applications
they may still be good enough.
212
Approximation Speedup
[Graph: response time (sec, 0-4) for TPC-H queries Q1 and Q5 against
the base relations, a 1% sample, and a 10% sample.]
• Aqua approximation on the TPC-H schema
– 1% and 10% lineitem sample propagated.
• The query speed-up obtained with approximated relations is
significant.
213
Good Way to Find a Random
Sample
• If you know how big your data is (i.e. it is not
streaming), then the best way is to include each
record with some probability p drawn
independently and randomly.
• If you don’t know how big your data is, hash each
record to a value in some range. Choose the lowest
k.
• Can use a priority queue having k elements. As
each new element is inserted, perform deletemax.
214
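The min-k scheme for data of unknown size can be sketched as follows. The hash function and k are arbitrary choices for illustration; a max-heap of size k keeps the k records with the smallest hash values, and each insertion beyond k performs the deletemax mentioned above.

```python
import hashlib
import heapq

# Keep the k records whose hash values are smallest (a deterministic,
# uniform-looking sample of a stream of unknown length).
def min_k_sample(stream, k):
    heap = []                                  # max-heap via negated hashes
    for record in stream:
        h = int(hashlib.md5(str(record).encode()).hexdigest(), 16)
        if len(heap) < k:
            heapq.heappush(heap, (-h, record))
        elif h < -heap[0][0]:                  # beats the current max hash
            heapq.heapreplace(heap, (-h, record))   # the deletemax step
    return [record for _, record in heap]

sample = min_k_sample(range(100000), 100)
print(len(sample))                             # 100
```

Because the hash is deterministic, running the sampler twice over overlapping streams selects consistent records, which is also what makes min-k hashes composable across partitions.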
Approximate Counts of Every
Item
• Count-min idea.
• Imagine that you had a perfect hash function H
(i.e. no collisions). Then, you could simply
increment location H(x) every time x appeared.
• You don’t. So use several hash functions H1, H2,
… Hj.
• When you see x, increment locs H1(x), H2(x), …
• To estimate the count of x, look at min of those
locs.
215
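A minimal count-min sketch along the lines just described; the width, depth, and hashing scheme below are arbitrary choices. Collisions can only inflate cells, so the minimum over the rows never underestimates a true count.

```python
import hashlib

class CountMin:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, x):
        for row in range(self.depth):          # one hash function per row
            h = hashlib.md5(f"{row}:{x}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, x):
        for row, col in self._cells(x):        # increment H1(x), H2(x), ...
            self.table[row][col] += 1

    def count(self, x):
        # The min over the rows is the least-collided estimate.
        return min(self.table[row][col] for row, col in self._cells(x))

cm = CountMin()
for _ in range(42):
    cm.add("item-x")
print(cm.count("item-x"))                      # 42
```

With a single distinct item there are no collisions, so the estimate is exact; with many items the estimate is an upper bound whose error shrinks as width grows.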
Keep Track of k Most Frequent
Items
• Take the first k items.
• Each time you see an item x, if x is among the top
k, increment the count of x.
• Otherwise, replace the item y having the lowest
count among the top k by x and increment the
count (formerly for y) and ascribe that count to x.
• Works (approximately) if items are presented as a
random permutation. Remarkably robust.
216
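The scheme above (essentially the Space-Saving algorithm) can be sketched in a few lines; the stream and k below are made up. Note how an evicted item's count is inherited, so reported counts can overestimate.

```python
# Keep k counters; a newcomer evicts the lowest-count item and inherits
# its count plus one, as described on the slide.
def top_k(stream, k):
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k:
            counts[x] = 1
        else:
            y = min(counts, key=counts.get)    # item with the lowest count
            counts[x] = counts.pop(y) + 1      # x inherits y's count + 1
    return counts

stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + ["d"] * 40
print(top_k(stream, 2))
# {'a': 50, 'd': 75} -- d inherited evicted counts, so 75 overestimates its true 40
```

The heavy hitters survive (a and d are indeed the two most frequent items), which is why the method works well when items arrive in roughly random order.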
Maintaining “typical” clusters for
streaming data
• Find a sample using the sampling technique from
before.
• Form a cluster off-line.
• As points come in, associate them with nearest
centroid.
• Let the initial fraction with centroid i be p(i).
• If that fraction changes a lot, then maybe the
distribution has changed.
217
Approximating Count Distinct
• Suppose you have one or more columns with
duplicates. You want to know how many distinct
values there are.
• Imagine that you could hash into (0,1). (Simulate
this by a very large integer range).
• Hashing will put duplicates in the same location
and, if done right, will distribute values uniformly.
• Draw the dots of hash values between 0 and 1.
218
Approximating Count Distinct
• Suppose you hash every item and take the
minimum 1000 hash values.
• Suppose that highest hash value of the 1000
is f (between 0 and 1).
• Intuitively, if f is small then there are many
distinct values.
• What is a good approximation of the
number of distinct values?
219
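One standard answer, sketched below (the data and k are made up): if the largest of the k smallest hash values is f, the hashes are spread roughly uniformly, so the number of distinct values is about k/f; the code uses the slightly less biased (k-1)/f.

```python
import hashlib

# k-minimum-values estimate of count distinct. Duplicates hash to the
# same point, so they collapse in the set comprehension below.
def estimate_distinct(values, k=1000):
    hashes = sorted({int(hashlib.md5(str(v).encode()).hexdigest(), 16) / 2**128
                     for v in values})         # hash into (0,1)
    f = hashes[k - 1]                          # largest of the k smallest
    return (k - 1) / f

data = [i % 50000 for i in range(200000)]      # 50000 distinct values, each 4 times
print(round(estimate_distinct(data)))          # close to 50000
```

With k = 1000 the relative error is on the order of 1/sqrt(k), i.e. a few percent, regardless of how many duplicates the column contains.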
Tuning Distributed Applications
• Queries across multiple databases
– Federated Datawarehouse
– IBM’s DataJoiner now integrated at DB2 v7.2
• The source should perform as much work as
possible and return as little data as possible for
processing at the federated server
• Processing data across multiple databases
220
A puzzle
• Two databases X and Y
– X records inventory data to be used for restocking
– Y contains delivery data about shipments
• Improve the speed of shipping by sharing data
– Certain data of X should be postprocessed on Y shortly
after it enters X
– You want to avoid losing data from X and you want to
avoid double-processing data on Y, even in the face of
failures.
221
Two-Phase Commit
[Diagram: commit messages flow between X, the source (coordinator &
participant), and Y, the destination (participant).]
+ Commits are coordinated between the source and the destination
- If one participant fails then blocking can occur
- Not all db systems support a prepare-to-commit interface
222
Replication Server
[Diagram: updates flow from X (source) to Y (destination).]
+ Destination within a few seconds of being up-to-date
+ Decision support queries can be asked on destination db
- Administrator is needed when network connection breaks!
223
Staging Tables
[Diagram: updates flow from X (source) through a staging table to Y
(destination).]
+ No specific mechanism is necessary at source or destination
- Coordination of transactions on X and Y
224
Issues in Solution
• A single thread can invoke a transaction on
X, Y or both, but the thread may fail in the
middle. A new thread will regain state by
looking at the database.
• We want to delete data from the staging
tables when we are finished.
• The tuples in the staging tables will
represent an operation as well as data.
225
States
• A tuple (operation plus data) will start in
state unprocessed, then state processed, and
then deleted.
• The same transaction that processes the
tuple on the destination site also changes its
state. This is important so each tuple’s
operation is done exactly once.
226
Staging Tables
[Diagram, steps 1 and 2: transaction M1 reads unprocessed tuples from
source table I on database X; transaction M2 writes them to staging
table S on database Y, still marked unprocessed.]
227
Staging Tables
[Diagram, steps 3 and 4: transaction M3 updates the destination site,
marking a tuple in staging table S on database Y as processed;
transaction M4 then deletes the tuple from the source and afterwards
from the destination (one transaction updates the source site, a
second updates the destination site).]
228
What to Look For
• Transactions are atomic but threads,
remember, are not. So a replacement thread
will look to the database for information.
• In which order should the final step do its
transactions? Should it delete the
unprocessed tuple from X as the first
transaction or as the second?
229
Main-memory Databases
Database Tuning – Spring 2015
Philippe Bonnet – phbo@itu.dk
Agenda
1. Memory hierarchy
2. Cache-optimized data structures
3. Multi-version Concurrency Control
4. System Examples
a) New Architectures
b) SQL Server 2014 (Hekaton)
Memory Hierarchy
Source: http://www.3dincites.com/2015/01/iedm-2014-3d-short-course-highlights-3d-m
Main-memory Databases
• Memory has finite capacity: How to deal
with space limitation?
• Memory is volatile: How to deal with
persistence?
• Memory is fast: How can a DBMS take
advantage?
Dealing with capacity
On Azure, G-series
instances go up to 32
cores, 448 GiB of RAM,
and 6,596 GB of local
SSD storage
Dealing with Capacity
1. Many databases whose size is 10s or 100s
GB fit in RAM on a single host
a) The storage engine of SQL Server 2014
makes it possible to split a database into a
collection of tables that are memory-optimized
and a collection of tables that are
disk-optimized (data resides on disk and is
manipulated through the traditional buffer
pool).
2. For databases that are larger than RAM,
Dealing with Persistence
1. Check-in/Check-out
– Memory optimized tables are loaded in memory
when the database management system starts
up
– They are stored on disk when the database
management system shuts down
2. Swap in/Swap out
– Tables are swapped in when data gets hot and
swapped out when data gets cold
Dealing with Persistence
Defining the database state:
• 2nd storage
– Traditional log-based approach
– Write-ahead logging protocol (remember that?)
• RAM and 2nd storage on a single host (SQL Server)
– Traditional log-based approach
– No need for write-ahead logging (see log tuning)
• RAM on multiple hosts (VoltDB)
– Recovery is needed when a host fails
– Checkpointing to save consistent states
– Command log to replay history and recover failed host (see log
tuning)
Dealing with Persistence
• Restrict (greatly) the modifications that are
allowed on memory-optimized tables:
– No insertions/deletions
– No updates
– No new index
– No concurrent modifications, only short
transactions
Leveraging RAM
• Using actual pointers (that refer to memory
addresses), rather than logical IDs and
offsets within buffer pool pages
• Variable-size data structures, rather than
fixed-size pages
– Using an index structure to represent a
collection of tuples
• SQL Server 2014 relies on hash-index
Leveraging RAM
• Traditional focus on RAM/Disk transfers
• With main-memory databases, the focus is
on cache/RAM transfers
1. Maximize cache utilization once data is in
cache
• Column-based representation (see storage tuning)
• Cache conscious indexes (see index tuning)
• Staged query processing
2. Avoid concurrent accesses on cache lines
• Introduce optimistic concurrency control
Avoiding Concurrent Accesses
on Cache Lines
Three approaches:
1. Enforce serial transaction execution
(VoltDB)
– No need for concurrency control
2. Explicit conflict declaration/discovery
when transactions start
3. Conflict discovery when transactions
commit
a) Optimistic concurrency control
Troubleshooting Techniques
(*)
• Extraction and Analysis of Performance
Indicators
– Consumer-producer chain framework
– Tools
• Query plan monitors
• Performance monitors
• Event monitors
(*) From Alberto Lerner’s chapter
243
A Consumer-Producer Chain of
a DBMS’s Resources
[Diagram: sql commands enter at the high-level consumers (PARSER,
OPTIMIZER, EXECUTION SUBSYSTEM), flow through the intermediate
resources/consumers (DISK SUBSYSTEM, CACHE MANAGER, LOCKING SUBSYSTEM,
LOGGING SUBSYSTEM), and reach the primary resources (MEMORY, CPU,
DISK/CONTROLLER, NETWORK). Probing spots for indicators sit between
the levels.]
244
Recurrent Patterns of Problems
Effects are not always felt first where the cause is!
• An overloading high-level
consumer
• A poorly parameterized
subsystem
• An overloaded primary
resource
245
A Systematic Approach to Monitoring
Extract indicators to answer the following questions
• Question 1: Are critical
queries being served in
the most efficient manner?
• Question 2: Are
subsystems making
optimal use of resources?
• Question 3: Are there
enough primary resources
available?
246
Investigating High Level Consumers
• Answer question 1:
“Are critical queries being served in
the most efficient manner?”
1. Identify the critical queries
2. Analyze their access plans
3. Profile their execution
247
Identifying Critical Queries
Critical queries are usually those that:
• Take a long time
• Are frequently executed
Often, a user complaint will tip us off.
248
Event Monitors to Identify
Critical Queries
• If no user complains...
• Capture usage measurements at specific
events (e.g., end of each query) and then sort
by usage
• Less overhead than other types of tools
because indicators are usually a by-product of
events monitored
• Typical measures include CPU used, IO
used, locks obtained etc.
249
DBMS Components
Query Processor
Parser
Compiler
Execution Engine
Storage Subsystem
Indexes
Concurrency Control
Recovery
Buffer Manager
@ Dennis Shasha and Philippe Bonnet,
2013
DB2 9.7 Process Architecture
Exercise 1:
Is intra-query parallelism possible with this process model?
In other words, can a query be executed in parallel within the
same instance (or partition)?
Source: http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.perf.doc/doc/00003525.gif
DB2 10.1 Process Architecture
No need to know
much more about
the process
abstractions. We will
cover much more on
the memory
abstractions (log
tuning), and on the
communication
abstractions (tuning
the application
interface, tuning across
instances).
Source: http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.perf.doc%2Fdoc%2Fc0008930.html
MySQL Architecture
Source: http://docs.oracle.com/cd/E19957-01/mysql-refman-5.5/storage-engines.html
Experimental Framework
1. System
– Application + DBMS + OS + HW
– Parameters (fixed/factors)
2. Metrics
– Throughput / Response Time
– DBMS performance indicators
– OS performance indicators
3. Workload
– Actual users (production),
replay trace or synthetic workload
(e.g., TPC benchmark)
4. Experiments
– What factor to vary?
Exercise 2:
Is throughput always the inverse of response time?
Exercise 3:
Define an experiment to measure the write throughput of the
file system on your laptop
Example
DBMS
Statement number: 1
select C_NAME, N_NAME
from DBA.CUSTOMER join DBA.NATION on C_NATIONKEY = N_NATIONKEY
where C_ACCTBAL > 0
Number of rows retrieved is: 136308
Number of rows sent to output is: 0
Elapsed Time is: 76.349 seconds
…
Buffer pool data logical reads    = 272618
Buffer pool data physical reads   = 131425
Buffer pool data writes           = 0
Buffer pool index logical reads   = 273173
Buffer pool index physical reads  = 552
Buffer pool index writes          = 0
Total buffer pool read time (ms)  = 71352
Total buffer pool write time (ms) = 0
…
Summary of Results
==================
Statement #   Elapsed Time (s)   Agent CPU Time (s)   Rows Fetched   Rows Printed
1             76.349             6.670                136308         0
OS
Windows: Performance monitor
Linux: iostat, vmstat
More on this in tuning t
An example Event Monitor
• CPU indicators
sorted by Oracle’s
Trace Data Viewer
• Similar tools: DB2’s
Event Monitor and
MSSQL’s Server
Profiler
256
An example Plan Explainer
• Access plan according
to MSSQL’s Query
Analyzer
• Similar tools: DB2’s
Visual Explain and
Oracle’s SQL Analyze
Tool
257
Finding Strangeness in Access Plans
What to pay attention to in a plan
• Access paths for each table
• Sorts or intermediary results (join,
group by, distinct, order by)
• Order of operations
• Algorithms used in the operators
258
To Index or not to index?
select c_name, n_name from CUSTOMER join NATION
on c_nationkey=n_nationkey where c_acctbal > 0
Which plan performs best?
(nation_pk is an non-clustered index over n_nationkey,
and similarly for acctbal_ix over c_acctbal)
259
Non-clustering indexes can be trouble
For a low selectivity predicate, each access to
the index generates a random access to the table
– possibly duplicate! It ends up that the number
of pages read from the table is greater than its
size, i.e., a table scan is way better

                        Table Scan       Index Scan
CPU time                5 sec            76 sec
data logical reads      143,075 pages    272,618 pages
data physical reads     6,777 pages      131,425 pages
index logical reads     136,319 pages    273,173 pages
index physical reads    7 pages          552 pages
260
An example
Performance Monitor (query level)
• Buffer and CPU
consumption for a query
according to DB2’s
Benchmark tool
• Similar tools:
MSSQL’s SET
STATISTICS switch
and Oracle’s SQL
Analyze Tool
Statement number: 1
select C_NAME, N_NAME
from DBA.CUSTOMER join DBA.NATION on C_NATIONKEY = N_NATIONKEY
where C_ACCTBAL > 0
Number of rows retrieved is: 136308
Number of rows sent to output is: 0
Elapsed Time is: 76.349 seconds
…
Buffer pool data logical reads    = 272618
Buffer pool data physical reads   = 131425
Buffer pool data writes           = 0
Buffer pool index logical reads   = 273173
Buffer pool index physical reads  = 552
Buffer pool index writes          = 0
Total buffer pool read time (ms)  = 71352
Total buffer pool write time (ms) = 0
…
Summary of Results
==================
Statement #   Elapsed Time (s)   Agent CPU Time (s)   Rows Fetched   Rows Printed
1             76.349             6.670                136308         0
261
An example
Performance Monitor (system level)
• An IO indicator’s
consumption evolution
(qualitative and
quantitative) according
to DB2’s System
Monitor
• Similar tools: Window’s
Performance Monitor
and Oracle’s
Performance Manager
262
Investigating High Level Consumers:
Summary
[Flowchart: find critical queries; if none are found, investigate the
lower levels; otherwise answer Q1 over them; if there is
overconsumption, tune the problematic queries.]
263
Investigating Primary Resources
• Answer question 3:
“Are there enough primary resources
available for a DBMS to consume?”
• Primary resources are: CPU, disk &
controllers, memory, and network
• Analyze specific OS-level indicators to
discover bottlenecks.
• A system-level Performance Monitor is the
right tool here
264
CPU Consumption Indicators
at the OS Level
[Graph: CPU % of utilization over time, showing total usage and system
usage, with reference lines at 70% and 100%.]
• Sustained utilization over 70% should trigger the alert.
• System utilization shouldn’t be more than 40%.
• DBMS (on a non-dedicated machine) should be getting a decent time
share.
265
Disk Performance Indicators
at the OS Level
[Diagram: new requests enter a wait queue before being serviced;
indicators are average queue size and disk transfers/second.]
• Average queue size should be close to zero.
• Wait times should also be close to zero.
• Idle disk with pending requests? Check controller contention.
• Also, transfers should be balanced among disks/controllers.
266
Memory Consumption Indicators
at the OS Level
[Diagram: virtual memory = real memory + pagefile.]
• Page faults/time should be close to zero. If paging happens, at
least not DB cache pages.
• % of pagefile in use will tell you how much memory is “needed.”
267
Investigating Intermediate
Resources/Consumers
• Answer question 2:
“Are subsystems making optimal use of
resources?”
• Main subsystems: Cache Manager, Disk
subsystem, Lock subsystem, and Log/Recovery
subsystem
• Similarly to Q3, extract and analyze relevant
Performance Indicators
268
Cache Manager Performance Indicators
[Diagram: a table scan issues readpage() calls to the Cache Manager,
which holds data pages and free page slots, picks victims according to
its strategy, and issues page reads/writes against disk.]
• If a page is not in the cache, readpage (logical) generates an
actual IO (physical). The fraction of readpages that did not generate
physical IO should be 90% or more (hit ratio).
• Pages are regularly saved to disk to make free space. The number of
free slots should always be > 0.
269
Disk Manager Performance Indicators
[Diagram: simplified storage hierarchy -- rows, page, extent, file,
disk.]
• Displaced rows (moved to other pages) should be kept under 5% of
rows.
• Free space fragmentation: pages with little space should not be in
the free list.
• Data fragmentation: ideally files that store DB objects (table,
index) should be in one or few (<5) contiguous extents.
• File position: should balance workload evenly among all disks.
270
Lock Manager Performance Indicators
[Diagram: a lock request enters the lock list (object, lock type,
TXN ID) or waits on the pending list.]
• Lock wait time for a transaction should be a small fraction of the
whole transaction time.
• Number of lock waits should be a small fraction of the number of
locks on the lock list.
• Deadlocks and timeouts should seldom happen (no more than 1% of the
transactions).
271
Investigating Intermediate and Primary
Resources: Summary
[Flowchart: answer Q3; if there are problems at the OS level, tune the
low-level resources; otherwise answer Q2; if there are problematic
subsystems, tune them; otherwise investigate the upper level.]
272
Troubleshooting Techniques
• Monitoring a DBMS’s performance should
be based on queries and resources.
– The consumption chain helps distinguish
problems’ causes from their symptoms
– Existing tools help extracting relevant
performance indicators
273
Recall Tuning Principles
• Think globally, fix locally (troubleshoot to see
what matters)
• Partitioning breaks bottlenecks (find parallelism in
processors, controllers, caches, and disks)
• Start-up costs are high; running costs are low
(batch size, cursors)
• Be prepared for trade-offs (unless you can rethink
the queries)
274