Microsoft SQL Server I/O Internals

advertisement
MICROSOFT SQL SERVER
DATABASE ENGINE I/O
by Bob Dorr, Microsoft SQL Server Principle Escalation Engineer, 1994 – Present
Built: Jan 2008
Areas Covered

















Write Ahead Logging (WAL) Protocol
Synchronous vs Asynchronous I/O
Scatter / Gather I/O
Sector alignment, Block Alignment
Latching and a page: A read walk-through
SQL Server I/O Sizes
Data cache maintenance
PAE and AWE
Read Ahead
User Mode and Kernel Mode (SYSTRAP)
Sparse Files and Copy On Write (COW) Pages
Locked Pages
Scribbler(s) and Bit flips
Page Protection and Constant Pages
Checksum vs Torn
Stale Read
Stalled I/O
WAL Protocol





Write Ahead Logging
ACID (Durability Property)
Log records secured before data
Hardened / Stable Media
Log contains parity bit
• Commit
• Rollback
• Trigger Snapshot
Synchronous vs Asynchronous I/O


Sync: Wait for Completion
Async: Post and Continue




Overlapped
Event
Completion Port
SQL Server

98% Async Usage



Overlapped and HasOverlappedIoCompleted
Network Layers Use Completion Port
Backup/Restore Use Sync – Sequential Patterns
• dm_io_pending_io_requests
• Overlapped Structure
• Async Processing ~= CPU
• Package vs Phone
Scatter / Gather I/O

Memory

Consolidates or Distributes
APIs
ReadFileScatter
 WriteFileGather


Scatter
Gather


Increases Efficiency
Used by SQL I/O Paths
Used by Windows Page File
Disk
• Old Design: 6.x Sorting
• AWE Availability
• WriteMultiple
• # of 8K Pages
• Forward and Backward
• Buffer Pool Ramp-up
Sector Alignment
Block Alignment





Sector: Log Writes
Block: Performance
Avoid Crossovers
DiskPart/DiskPar Utilities
Discuss with your Vendor
Alignment: http://support.microsoft.com/kb/929491
To verify that an existing partition is aligned, divide the size of the stripe unit by the starting offset of
the RAID disk group. Use the following syntax: ((Partition offset) * (Disk sector size)) / (Stripe unit size)
Example of alignment calculations in bytes for a 256-KB stripe unit size:
(63 * 512) / 262144 = 0.123046875
(64 * 512) / 262144 = 0.125
(128 * 512) / 262144 = 0.25
(256 * 512) / 262144 = 0.5
(512 * 512) / 262144 = 1
These examples shows that the partition is not aligned correctly for a 256-KB stripe unit size until the
partition is created by using an offset of 512 sectors (512 bytes per sector).
• Double Touch
• Rewrites
• Defragment
• 4K Sectors
Latch
Memory (Data Pages)



Multiple Readers (SH)
One Writer (EX)
Protects In-Memory Data Page



BUF Array


Latch = Physical Protection
Lock = Logical Protection
User Mode
UMS/SQLOS Aware
Optimized FIFO Ordering
BUF
Status
Latch
Database*
PageId
Hash *
…
• Flushed & Rollback
• Latch Timeout
• Sub-latch
Reading A Page





0:000> uf ZwWriteFile
mov r10,rcx
mov eax,5
Syscall  Kernel Transition
ret



Get Free Buffer for Read
Acquire Exclusive (EX) Latch
Is already in-memory/hashed?
Add Entry to Page Hash
Post and Record Asynchronous Read
… Continue Processing ….
Check Status (Scheduler Switch)
Complete: Validate I/O and Release Latch
kernel transition – Stuck I/O?
ntdll!ZwWriteFile+0xa
kernel32!WriteFile+0xf6
sqlservr!DiskWriteAsync+0xee
…
• Page Audits
• Read retry
• Stalled I/O Warnings
• Error raised at Acquire
• Shared (SH) waiters
• PAGE_IO* vs PAGE* Latch
• Writing A Page
Myth: Single Worker Per File
Truth: Each Worker Issues I/O
Vol
#1
dbTest.MDF
Vol
#2
dbTest.NDF
dbTest.LDF
Serial Plan
select * from dbTest.dbo.tblTest
insert into dbTest.dbo.tblTest
Create Database
Workers Assigned by Volume ID
Primary = dbTest.MDF
Secondary = dbTest.NDF
Log = dbTest.LDF
Parallel Plan
select * from dbTest.dbo.tblTest
insert into dbTest.dbo.tblTest
Data Cache Maintenance

Memory Pressure: LazyWriter



Per NUMA Node
Time Of Last Access (TLA)
Recovery Interval: Checkpoint





Queue
I/O Targets
.LDF Usage Triggers
Alternate Triggers (Backup, Restore, …)
Scatter/Gather Usage (WriteMultiple)
• Checkpoint Assignments
• By Ordinal Sweep
• Stalled I/O – LW #0
• I/O Queue Depth > 2
PAE and AWE

Physical Address Extensions






/PAE in Boot.ini
Boots Kernel with 36 bit addressing
Physical Memory > 4GB
Virtual Address Unchanged (/2gb or /3GB)
Automatic for Hot Add Memory Computers
Address Windows Extension



Windows APIs (AllocateUserPhysicalPages)
Physical Memory Allocations
Un/Mapped in or out of Virtual Address Range
•32 Bit Address = 4294967295 (0xFFFFFFFF) 4GB
•Interlocked Instruction
lock xadd
dword ptr [ecx],eax
•36 Bit Address = 68719476735 (0xFFFFFFFFF) 64GB
•Multiple Instructions
• Data Pages-Only
• Locked Pages
• Windows Paging
•Windows 2000 Bugs
Read Ahead





128 Pages Standard SKU
1024 Pages Enterprise SKU
Uses ReadFileScatter
Plan Based Decisions
Power of Asynchronous I/O
• Read Over Write
• Ramp-up
Sparse Files – Copy On Write

Usage
 Online
DBCC
 Snapshot Databases


Buffer Pool: PrepareToDirty
File Control Block (FCB) Chaining
• Sparse Allocation
• FCB Tracking
• Windows Limits
• New Page Allocations
Advanced Protection


What is a Scribbler?
Data Page Audits






None
Torn Bits
Checksum
Log Block Checksum
Constant Page
Backup with Checksum
• DBCC Page Audit
• Stale Read Check
• SQLIOSim
REFERENCES
Overview
SQL Server Always On
http://www.microsoft.com/sql/alwayson
 SQL Server I/O Basics Chapter 1
http://www.microsoft.com/technet/prodtechnol
/sql/2000/maintain/sqlIObasics.mspx
 SQL Server I/O Basics Chapter 2
http://www.microsoft.com/technet/prodtechnol
/sql/2005/iobasics.mspx

Fundamentals and Requirements
KB230785 - SQL Server 7.0, SQL Server
2000 and SQL Server 2005 logging and data
storage algorithms extend data reliability
 KB917047 - Microsoft SQL Server I/O
subsystem requirements for the tempdb
database
 KB231347 - SQL Server databases not
supported on compressed volumes (except
2005 read only files)

Subsystems







KB917043 - Key factors to consider when evaluating thirdparty file cache systems with SQL Server
KB234656- Using disk drive caching with SQL Server
KB46091- Using hard disk controller caching with SQL
Server
KB86903 - Description of caching disk controls in SQL
Server
KB304261- Description of support for network database
files in SQL Server
KB910716 (in progress) - Support for third-party Remote
Mirroring solutions used with SQL Server 2000 and 2005
KB833770 - Support for SQL Server 2000 on iSCSI
technology components (applies to SQL Server 2005)
Design and Configuration








White paper - Physical Database Layout and Design
KB298402 - Understanding How to Set the SQL Server I/O
Affinity Option
KB78363 - When Dirty Cache Pages are Flushed to Disk
White paper - Database Mirroring in SQL Server 2005
White paper - Database Mirroring Best Practices and
Performance Considerations
KB910378 - Scalable shared database are supported by
SQL Server 2005
MSDN article - Read-Only Filegroups
KB156932 - Asynchronous Disk I/O Appears as
Synchronous on Windows NT, Windows 2000, and Windows
XP
Diagnostics







KB826433 - Additional SQL Server Diagnostics Added to
Detect Unreported I/O Problems
KB897284 - SQL Server 2000 SP4 diagnostics help detect
stalled and stuck I/O operations (applies to SQL
Server 2005)
KB828339 - Error message 823 may indicate hardware
problems or system problems in SQL Server
KB167711 - Understanding Bufwait and Writelog Timeout
Messages
KB815436 - Use Trace Flag 3505 to Control SQL Server
Checkpoint Behavior
KB906121 - Checkpoint resumes behavior that it exhibited
before you installed SQL Server 2000 SP3 when you
enable trace flag 828
WebCast- Data Recovery in SQL Server 2005
Certification Policy
KB913945- Microsoft does not certify that
third-party products will work with Microsoft
SQL Server
 KB841696 - Overview of the Microsoft thirdparty storage software solutions support policy
 KB231619 - How to use the SQLIOStress utility
to stress a disk subsystem such as SQL Server

Utilities
Download - SQLIO Disk Subsystem Benchmark
Tool
 Download - SQLIOStress utility to stress disk
subsystem (applies to SQL Server 7.0, 2000,
and 2005 - replaced with SQLIOSim and SQL
Server 2008 installed in BINN)

Blog Content
















SQL Server Urban Legends Discussed
http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-urban-legends-discussed.aspx
How It Works: SQL Server Checkpoint (FlushCache) Outstanding I/O Target
http://blogs.msdn.com/psssql/archive/2008/04/11/how-it-works-sql-server-checkpoint-flushcache-outstanding-i-o-target.aspx
How It Works: SQL Server Page Allocations
http://blogs.msdn.com/psssql/archive/2008/04/08/how-it-works-sql-server-page-allocations.aspx
How It Works: Shapshot Database (Replica) Dirty Page Copy Behavior (NewPage)
http://blogs.msdn.com/psssql/archive/2008/03/24/how-it-works-shapshot-database-replica-dirty-page-copy-behavior-newpage.aspx
How It Works: SQL Server 2005 I/O Affinity and NUMA Don't Always Mix
http://blogs.msdn.com/psssql/archive/2008/03/18/how-it-works-sql-server-2005-i-o-affinity-and-numa-don-t-always-mix.aspx
How It Works: Debugging SQL Server Stalled or Stuck I/O Problems - Root Cause
http://blogs.msdn.com/psssql/archive/2008/03/03/how-it-works-debugging-sql-server-stalled-or-stuck-i-o-problems-root-cause.aspx
How It Works: SQL Server 2005 Database Snapshots (Replica)
http://blogs.msdn.com/psssql/archive/2008/02/07/how-it-works-sql-server-2005-database-snapshots-replica.aspx
How It Works: File Stream the Before and After Image of a File
http://blogs.msdn.com/psssql/archive/2008/01/15/how-it-works-file-stream-the-before-and-after-image-of-a-file.aspx
Using SQLIOSim to Diagnose SQL Server Reported Checksum (Error 824/823) Failures
http://blogs.msdn.com/psssql/archive/2008/12/19/using-sqliosim-to-diagnose-sql-server-reported-checksum-error-824-823-failures.aspx
How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem
http://support.microsoft.com/kb/231619
Should I run SQLIOSim? - An e-mail follow-up from SQL PASS 2008
http://blogs.msdn.com/psssql/archive/2008/11/24/should-i-run-sqliosim-an-e-mail-follow-up-from-sql-pass-2008.aspx
What do I need to know about SQL Server database engine I/O?
http://blogs.msdn.com/psssql/archive/2006/11/27/what-do-i-need-to-know-about-sql-server-database-engine-i-o.aspx
SQLIOSim is "NOT" an I/O Performance Tuning Tool
http://blogs.msdn.com/psssql/archive/2008/04/05/sqliosim-is-not-an-i-o-performance-tuning-tool.aspx
How It Works: SQLIOSim - Running Average, Target Duration, Discarded Buffers ...
http://blogs.msdn.com/psssql/archive/2008/11/12/how-it-works-sqliosim-running-average-target-duration-discarded-buffers.aspx
How It Works: SQLIOSim [Audit Users] and .INI Control File Sections with User Count Options
http://blogs.msdn.com/psssql/archive/2008/08/19/how-it-works-sqliosim-audit-users-and-ini-control-file-sections-with-user-count-options.aspx
Understanding SQLIOSIM Output
http://sqlblog.com/blogs/kevin_kline/archive/2007/06/28/understanding-sqliosim-output.aspx
Additional Learning Resources

Inside SQL Server 7.0 and Inside SQL Server 2000
Written by Kalen Delaney – her husband is Paul Randle who wrote the core dbcc checks for SQL 7.0,
2000 and 2005

The Guru’s Guide to SQL Server Architecture and Internals – ISBN 0-201-70047-6
Written by Ken after he joined Microsoft SQL Server Support
Many chapters reviewed by developers and folks like myself

SQL Server 2005 Practical Troubleshooting ISBN 0-321-44774-3 – Ken
Henderson
Authors of this book were key developers or support team members
Cesar – QP developer and leader of the QP RedZone with Keithelm and Jackli
Sameert – Developer of UMS and SQLOS Scheduler
Santeriv – Developer of the lock manager
Slavao – Developer of the SOS memory managers and engine architect
Wei Xiao – Engine developer
Bart Duncan – long time SQL EE and now developer of the Microsoft Data Warehouse –
performance focused
Bob Ward – SQL Server Support Senior EE

Advanced Windows Debugging – ISBN 0-321-37446
Written by Microsoft developers – excellent resource

Applications for Windows – Jeffrey Richter
Great details about Windows basics
Download