Making the most of Solid State Disk in Oracle 11g

Guy Harrison
Director, R&D Melbourne
Email: guy.harrison@quest.com
Twitter: @guyharrison
Web: http://www.guyharrison.net
©2011 Quest Software, Inc. All rights reserved.
Introductions
Agenda
• Brief History of Magnetic Disk
• Solid State Disk (SSD) technologies
• SSD internals
• Oracle DB flash cache architecture
• Performance comparisons
• Recommendations and Suggestions
A brief history of disk
5MB HDD circa 1956
28MB HDD - 1961
1800 RPM
The more that things change....
Moore’s law
• Transistor density doubles every 18 months
• Exponential growth is observed in most electronic components:
• CPU clock speeds
• RAM
• Hard Disk Drive storage density
• But not in mechanical components
• Service time (seek latency) – limited by actuator arm speed and disk circumference
• Throughput (rotational latency) – limited by speed of rotation, circumference and data density
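To put the doubling rule in numbers, the short Python sketch below compounds a quantity that doubles every 18 months. The function name and the eight-year span (chosen to match the 2001-2009 window of the disk trends slide) are illustrative assumptions, not part of the original deck.

```python
def growth_factor(years, doubling_months=18):
    """Multiplier after `years` for a quantity that doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

# Eight years of doubling every 18 months.
print(f"{growth_factor(8):.1f}x over 8 years")
```

Mechanical seek and rotation speeds have no comparable compounding, which is the gap the following slides quantify.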
Disk trends 2001-2009
Percentage change:
• IO Rate: +260
• Disk Capacity: +1,635
• IO/Capacity: -630
• CPU: +1,013
• IO/CPU: -390
Solid State Disk
SSD to the rescue?
Seek time (us):
• SSD DDR-RAM: 15
• SSD PCI flash: 25
• SSD SATA Flash: 80
• Magnetic Disk: 4,000
Power consumption
[Chart: power draw in watts (logarithmic scale) of Flash SSD versus SATA HDD during start up, seek and idle]
Economics of SSD
• FusionIO PCI SLC SSD: $53.44/GB, $0.06/IOP
• FusionIO PCI MLC Duo SSD: $24.92/GB, $0.06/IOP
• Intel SLC SATA SSD: $21.88/GB, $0.05/IOP
• Intel MLC SATA SSD: $6.88/GB, $0.05/IOP
• Seagate SAS HDD: $1.00/GB, $1.53/IOP
• Seagate SATA HDD: $0.09/GB, $2.38/IOP
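The trade-off between $/GB and $/IOP can be made concrete with a small Python sketch. The prices are the chart's data labels as read from the slide (the pairing of the two HDD rows is an interpretation of the extracted labels). The two workloads, the function names and the "pay for whichever requirement costs more" rule are illustrative assumptions.

```python
# (dollars per GB, dollars per IOP) as read from the slide's chart;
# the pairing of values for the two HDD rows is an interpretation.
DEVICES = {
    "FusionIO PCI SLC SSD": (53.44, 0.06),
    "FusionIO PCI MLC Duo SSD": (24.92, 0.06),
    "Intel SLC SATA SSD": (21.88, 0.05),
    "Intel MLC SATA SSD": (6.88, 0.05),
    "Seagate SAS HDD": (1.00, 1.53),
    "Seagate SATA HDD": (0.09, 2.38),
}

def cost_to_provision(device, capacity_gb, iops):
    """Spend needed to satisfy both the capacity and the IO-rate requirement."""
    per_gb, per_iop = DEVICES[device]
    return max(capacity_gb * per_gb, iops * per_iop)

def cheapest(capacity_gb, iops):
    """Device with the lowest provisioning cost for this workload."""
    return min(DEVICES, key=lambda d: cost_to_provision(d, capacity_gb, iops))

# Hypothetical workloads: one IO-bound, one capacity-bound.
print(cheapest(capacity_gb=500, iops=20_000))
print(cheapest(capacity_gb=50_000, iops=1_000))
```

Under these assumptions the IO-bound workload lands on an SSD and the capacity-bound one on SATA disk, which is the motivation for tiering.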
Tiered storage management
Tiers, from highest $/GB and lowest $/IOP down to lowest $/GB and highest $/IOP:
• Main Memory
• DDR SSD
• Flash SSD
• Fast Disk (SAS, RAID 0+1)
• Slow Disk (SATA, RAID 5)
• Tape, Flat Files, Hadoop
Storage Tiering
Storage Tiering For Dummies®, Oracle Special Edition, Wiley 2011
SSD technology and internals
Flavours of Flash SSD
DDR RAM Drive
SATA flash drive
PCI flash drive
SSD storage Server
PCI SSD vs SATA SSD
PCI vs SATA
• SATA was designed for traditional disk drives with high latencies
• PCI is designed for high speed devices
• PCI SSD latency is roughly one third that of SATA SSD
Flash SSD Technology
Storage Hierarchy:
• Cell: One (SLC) or Two (MLC) bits
• Page: Typically 4K
• Block: Typically 128-512K
Writes:
• Read and first write require single page IO
• Overwriting a page requires an erase & overwrite of the block
Write endurance:
• 100,000 erase cycles for SLC before failure
• 5,000 – 10,000 erase cycles for MLC
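The page and block sizes above determine how many pages are caught up in a single block erase. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and the 256K case is included because it is the block size used on the performance slide.

```python
def pages_per_block(block_kb, page_kb=4):
    """Number of pages that share one erase block."""
    return block_kb // page_kb

for block_kb in (128, 256, 512):
    print(f"{block_kb}K block = {pages_per_block(block_kb)} pages of 4K")
```

Overwriting one 4K page in place therefore involves an erase that covers tens of neighbouring pages, which is why updates cost so much more than first writes.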
Flash SSD performance
Operation time (microseconds):
• Read (4K page seek): 25
• First insert (4K page write): 250
• Update (256K block erase): 2000
Flash Disk write degradation
All blocks empty:
• Write time = 250 us
25% part full:
• Write time = (¾ * 250 us + ¼ * 2000 us) = 687.5 us
75% part full:
• Write time = (¼ * 250 us + ¾ * 2000 us) = 1562.5 us
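The weighted average above generalises to any fill level. The Python sketch below reproduces the slide's figures to within rounding, using the 250 us page write and 2000 us block erase timings from the Flash SSD performance slide. The function and constant names are illustrative.

```python
PAGE_WRITE_US = 250     # write to an empty 4K page
BLOCK_ERASE_US = 2000   # update that forces a block erase

def expected_write_us(fraction_needing_erase):
    """Average write time when this fraction of writes hits a used block."""
    fraction_empty = 1.0 - fraction_needing_erase
    return (fraction_empty * PAGE_WRITE_US
            + fraction_needing_erase * BLOCK_ERASE_US)

for fraction in (0.0, 0.25, 0.75):
    print(f"{fraction:.0%} part full: {expected_write_us(fraction):.1f} us")
```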
Data Insert
[Diagram: the SSD controller handling an insert, with a free block pool and a used block pool; pages are marked empty, valid or invalid]
Data Update
[Diagram: the SSD controller handling an update, with the same pools and page states]
Garbage Collection
[Diagram: the SSD controller performing garbage collection, with the same pools and page states]
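The three diagrams can be approximated with a toy model in Python. This is a sketch of the general out-of-place update idea only, not any vendor's controller. The class, its tiny two-block geometry and the simplistic garbage collector, which erases a block and writes its valid pages straight back, are all assumptions made for illustration.

```python
EMPTY, VALID, INVALID = "empty", "valid", "invalid"

class ToyFlash:
    """Toy model of page states; not how any real controller is implemented."""

    def __init__(self, blocks=2, pages_per_block=2):
        self.blocks = [[EMPTY] * pages_per_block for _ in range(blocks)]
        self.location = {}   # logical key -> (block index, page index)
        self.erases = 0

    def _empty_slot(self):
        for b, block in enumerate(self.blocks):
            for p, state in enumerate(block):
                if state == EMPTY:
                    return b, p
        return None

    def write(self, key):
        """Insert a new key, or update an existing one out of place."""
        if key in self.location:
            b, p = self.location.pop(key)
            self.blocks[b][p] = INVALID      # old copy becomes garbage
        slot = self._empty_slot()
        if slot is None:                     # no empty page: reclaim space first
            self.garbage_collect()
            slot = self._empty_slot()
        if slot is None:
            raise RuntimeError("no reclaimable pages left")
        b, p = slot
        self.blocks[b][p] = VALID
        self.location[key] = slot

    def garbage_collect(self):
        """Erase every block holding invalid pages, rewriting its valid pages."""
        for b, block in enumerate(self.blocks):
            if INVALID not in block:
                continue
            live = [k for k, (kb, _) in self.location.items() if kb == b]
            self.blocks[b] = [EMPTY] * len(block)   # the expensive block erase
            self.erases += 1
            for p, k in enumerate(live):
                self.blocks[b][p] = VALID
                self.location[k] = (b, p)

flash = ToyFlash()
for key in "abcd":          # four inserts fill the device without any erase
    flash.write(key)
print(flash.erases)
flash.write("a")            # an update on a full device forces a block erase
print(flash.erases, flash.blocks)
```

The inserts complete without an erase, while the update on a full device has to wait for one. That is the write degradation described on the earlier slide.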
11g DB flash Cache
Oracle DB flash cache
• Introduced in 11gR2 for OEL and Solaris only
• Secondary cache maintained by the DBWR, but only when idle cycles permit
• Architecture is tolerant of poor flash write performance
Buffer cache and Free buffer waits
[Diagram: the Oracle process reads from and writes to the buffer cache, and reads blocks from the database files on disk; DBWR writes dirty blocks to disk; free buffer waits arise at the buffer cache]
Free buffer waits often occur when reads are much faster than writes.
Flash Cache
[Diagram: the Oracle process reads from and writes to the buffer cache, and reads blocks from the flash cache or from the database files on disk; DBWR writes clean blocks to the flash cache (time permitting) and dirty blocks to disk]
DB Flash cache architecture is designed to accelerate buffered reads
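The read path in the diagram can be sketched as a two-tier cache in Python. This is a toy model under stated assumptions, not Oracle's algorithm: both tiers use plain LRU, every block aged out of the buffer cache is copied to flash (the real DBWR does so only when time permits), and the class and method names are invented for the example.

```python
from collections import OrderedDict

class TwoTierCache:
    """Toy buffer cache backed by a flash cache; block ids only, no data."""

    def __init__(self, buffer_slots, flash_slots):
        self.buffer_slots = buffer_slots
        self.flash_slots = flash_slots
        self.buffer = OrderedDict()   # block id -> None, least recent first
        self.flash = OrderedDict()
        self.reads = {"buffer": 0, "flash": 0, "disk": 0}

    def read(self, block_id):
        """Return which tier satisfied the read."""
        if block_id in self.buffer:
            self.buffer.move_to_end(block_id)
            self.reads["buffer"] += 1
            return "buffer"
        source = "flash" if block_id in self.flash else "disk"
        self.reads[source] += 1
        self._load(block_id)
        return source

    def _load(self, block_id):
        """Bring a block into the buffer cache, ageing the oldest out to flash."""
        self.buffer[block_id] = None
        if len(self.buffer) > self.buffer_slots:
            victim, _ = self.buffer.popitem(last=False)
            self.flash[victim] = None
            self.flash.move_to_end(victim)
            if len(self.flash) > self.flash_slots:
                self.flash.popitem(last=False)

cache = TwoTierCache(buffer_slots=2, flash_slots=4)
print([cache.read(block) for block in (1, 2, 3, 1, 1)])
```

Block 1 is aged out to flash when block 3 arrives, so its next read is served from flash rather than from disk.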
Configuration
• Create a filesystem on the flash device
• Set DB_FLASH_CACHE_FILE and DB_FLASH_CACHE_SIZE
• Consider FILESYSTEMIO_OPTIONS=SETALL
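As a sketch of what these settings might look like in practice, the Python helper below only builds the ALTER SYSTEM statement text for the three parameters named above. The file path and cache size are hypothetical placeholders, the helper name is invented, and SCOPE = SPFILE is used on the assumption that the changes are applied at the next instance restart.

```python
def flash_cache_statements(cache_file, cache_size):
    """Build ALTER SYSTEM statements for the parameters named on the slide."""
    return [
        f"ALTER SYSTEM SET db_flash_cache_file = '{cache_file}' SCOPE = SPFILE;",
        f"ALTER SYSTEM SET db_flash_cache_size = {cache_size} SCOPE = SPFILE;",
        "ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE = SPFILE;",
    ]

# Hypothetical mount point and size; substitute values for the actual device.
for statement in flash_cache_statements("/mnt/flash/flash_cache.dat", "16G"):
    print(statement)
```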
Flash KEEP pool
• You can prioritise blocks for important objects using the FLASH_CACHE storage clause (KEEP, NONE or DEFAULT).
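The slide's own example did not survive extraction. As a hedged stand-in, the Python sketch below builds the kind of statement the clause appears in. The table name is hypothetical and the helper is invented for illustration; only the STORAGE (FLASH_CACHE ...) clause itself is Oracle syntax.

```python
FLASH_CACHE_MODES = ("KEEP", "NONE", "DEFAULT")

def flash_cache_ddl(table_name, mode="KEEP"):
    """Build an ALTER TABLE statement that sets the FLASH_CACHE storage attribute."""
    mode = mode.upper()
    if mode not in FLASH_CACHE_MODES:
        raise ValueError(f"unknown FLASH_CACHE mode: {mode}")
    return f"ALTER TABLE {table_name} STORAGE (FLASH_CACHE {mode});"

print(flash_cache_ddl("sales_hot"))   # hypothetical table name
```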
Oracle Db flash cache statistics
http://guyharrison.squarespace.com/storage/flash_insert_stats.sql
Flash Cache Efficiency
http://guyharrison.squarespace.com/storage/flash_time_savings.sql
Flash cache Contents
http://guyharrison.squarespace.com/storage/flashContents.sql
Performance tests
Test systems
• Low end system:
• Dell Optiplex dual-core 4GB RAM
• 2x Seagate 7200RPM Barracuda SATA HDD
• Intel X25-E SLC SATA SSD
• Higher end system:
• Dell R510 2x quad core, 32GB RAM
• 4x 300GB 15K RPM, 6Gbps Dell SAS HDD
• 1x FusionIO ioDrive SLC PCI SSD
Performance: indexed reads (X25-E)
Elapsed time (s), split in the chart into CPU, db file IO, flash cache IO and Other:
• Flash tablespace: 48.17
• Flash cache: 143.27
• No Flash: 529.7
Performance: Read/Write (X25-E)
Elapsed time (s), split in the chart into CPU, db file IO, write complete, free buffer, flash cache IO and Other:
• Flash tablespace: 200
• Flash Cache: 1,693
• No Flash: 3,289
Random reads – FusionIO
Elapsed time (s), split in the chart into CPU, DB File IO, Flash cache IO and Other:
• Table on SSD: 121
• SAS disk, flash cache: 583
• SAS disk, no flash cache: 2,211
Updates – Fusion IO
Elapsed time (s), split in the chart into DB CPU, db file IO, log file IO, flash cache, free buffer waits and Other:
• Table on SSD: 529
• SAS disk, flash cache: 1,934
• SAS disk, no flash cache: 6,219
Full table scan – FusionIO
Elapsed time (s), split in the chart into CPU, DB File IO, Flash Cache IO and Other:
• Table on SSD: 72
• SAS disk, flash cache: 398
• SAS disk, no flash cache: 418
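Turning the FusionIO elapsed times into speedup ratios makes the contrast between random reads and full scans easier to see. The Python sketch below uses the elapsed times reported on the two slides; the dictionary and function names are illustrative.

```python
# Elapsed seconds from the FusionIO charts:
# (table on SSD, SAS disk with flash cache, SAS disk with no flash cache)
RESULTS = {
    "random reads": (121, 583, 2211),
    "full table scan": (72, 398, 418),
}

def speedups(on_ssd, with_cache, baseline):
    """Speedup over the no-flash baseline for the SSD table and the flash cache."""
    return baseline / on_ssd, baseline / with_cache

for test, times in RESULTS.items():
    ssd, cache = speedups(*times)
    print(f"{test}: table on SSD {ssd:.1f}x, flash cache {cache:.2f}x")
```

The flash cache helps random reads by a wide margin but barely moves the full table scan, whereas placing the table on SSD helps both.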
Sorting – what we expect
[Chart: expected time against PGA memory available (MB), split into Table/Index IO, CPU Time and Temp Segment IO, across the multi-pass disk sort, single pass disk sort and memory sort regions]
Disk Sorts – temporary tablespace
[Chart: elapsed time (s) against sort area size for a SAS based and an SSD based temporary tablespace, across the multi-pass and single pass disk sort regions]
Redo performance – Fusion IO
Elapsed time (s), split in the chart into CPU and Log IO:
• Flash based redo log: 291.93
• SAS based redo log: 292.39
Concurrent redo workload (x10)
Elapsed time (s) by component, in the chart's legend order (CPU / Other / Log File IO):
• Flash based redo log: 1,637 / 331 / 1,681
• SAS based redo log: 1,605 / 397 / 1,944
Buffer Cache bottlenecks
• Flash cache architecture avoids ‘free buffer waits’ due to flash IO, but write complete waits can still occur on hot blocks.
• Free buffer waits are still likely against the database files, due to high physical read rates created by the flash cache
Write degradation
• In theory, high sustained write IO can lead to SSD degradation when garbage collection (GC) fails to cope with the block erase/update cycle
• In practice, this is rarely noticeable from Oracle:
• Oracle write IO is largely asynchronous (DBWR)
• Almost all write activity has at least an equal amount of read activity
• Garbage collection and wear levelling algorithms are sophisticated in decent SSD drives
Fusion IO direct cache
[Diagram: two stacks, each presented to Oracle as File System / Raw Devices / ASM]
• Regular block device on ioMemory VSL: temp tablespace, hot segments, hot partitions, DB flash cache
• Caching block device using directCache on ioMemory VSL, in front of a LUN: read-intensive, potentially massive tablespaces (limited to the size of the SSD)
Fusion IO direct cache – Table scans
Elapsed time (s), split in the chart into CPU, IO and Other:
• No cache, 1st scan: 147
• No cache, 2nd scan: 147
• direct cache on 1st scan: 147
• direct cache on 2nd scan: 36
Exadata
Exadata flash storage
• 4x96GB PCI Flash drives on each storage server
• Flash can be configured as:
• Exadata Smart Flash Cache (ESFC)
• Solid State Disk available to ASM disk groups
• ESFC is not the same as the DB flash cache:
• Maintained by cellsrv, not DBWR
• DOES support full table scans
• DOES NOT support smart scans, unless CELL_FLASH_CACHE=KEEP
• Statistics accessed via the cellcli program
• Considerations for cache vs. SSD are similar
Exadata: Flash grid disk vs ESFC
Elapsed seconds (IO and CPU) for a 100M row table, 200,000 random PK lookups, 1M possible keys:
• SSD disks (no flash cache): 119
• SAS disk with flash cache: 429
• SAS disk no flash cache: 1,240
Summary
Recommendations
• Don’t wait for SSD to become as cheap as HDD
• Magnetic HDD will always be cheaper per GB, SSD cheaper per IO
• Consider a mixed or tiered storage strategy
• Using DB flash cache, selective SSD tablespaces or partitions
• Use SSD where your IO bottleneck is greatest and SSD advantage is significant
• DB flash cache offers an easy way to leverage SSD for OLTP workloads, but has few advantages for OLAP or Data Warehouse
How to use SSD
• Database flash cache
• If your bottleneck is single block (indexed) reads and you are on OEL or Solaris 11gR2
• Flash tablespace
• Optimize reads/writes against “hot” segments or partitions
• Flash temp tablespace
• If multi-pass disk sorts or hash joins are your bottleneck
• FusionIO direct cache
• If you want to optimize both scans and index reads OR you are not on OEL/Solaris 11gR2
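The guidance above amounts to a small decision table, sketched here in Python. The bottleneck labels and the function name are invented for the example, and the mapping is only a restatement of the slide, not a sizing tool.

```python
def suggest_ssd_use(bottleneck, on_oel_or_solaris_11gr2=True):
    """Map a bottleneck (hypothetical labels) to the option the slide pairs it with."""
    if bottleneck == "single block reads":
        if on_oel_or_solaris_11gr2:
            return "Database flash cache"
        return "FusionIO direct cache"
    if bottleneck == "hot segments":
        return "Flash tablespace"
    if bottleneck == "disk sorts or hash joins":
        return "Flash temp tablespace"
    if bottleneck == "scans and index reads":
        return "FusionIO direct cache"
    raise ValueError(f"no suggestion for bottleneck: {bottleneck}")

print(suggest_ssd_use("single block reads"))
print(suggest_ssd_use("single block reads", on_oel_or_solaris_11gr2=False))
```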
References
• Latest version of this presentation:
http://www.slideshare.net/gharriso/ssd-and-the-db-flash-cache
• Quest whitepaper:
• http://www.quest.com/documents/landing.aspx?id=15423
• Guy’s SSD guide
• http://guyharrison.squarespace.com/ssdguide/