MICROSOFT SQL SERVER DATABASE ENGINE I/O
by Bob Dorr, Microsoft SQL Server Principal Escalation Engineer, 1994 – Present
Built: Jan 2008

Areas Covered
Write Ahead Logging (WAL) Protocol
Synchronous vs Asynchronous I/O
Scatter / Gather I/O
Sector alignment, Block Alignment
Latching and a page: A read walk-through
SQL Server I/O Sizes
Data cache maintenance
PAE and AWE
Read Ahead
User Mode and Kernel Mode (SYSTRAP)
Sparse Files and Copy On Write (COW) Pages
Locked Pages
Scribbler(s) and Bit flips
Page Protection and Constant Pages
Checksum vs Torn
Stale Read
Stalled I/O

WAL Protocol
Write Ahead Logging
ACID (Durability Property)
Log records secured before data
Hardened / Stable Media
Log contains parity bit
• Commit
• Rollback
• Trigger Snapshot

Synchronous vs Asynchronous I/O
Sync: Wait for Completion
Async: Post and Continue
Overlapped Event Completion Port
SQL Server 98% Async Usage
Overlapped and HasOverlappedIoCompleted
Network Layers Use Completion Port
Backup/Restore Use Sync – Sequential Patterns
• dm_io_pending_io_requests
• Overlapped Structure
• Async Processing ~= CPU
• Package vs Phone

Scatter / Gather I/O
Memory Consolidates or Distributes
APIs: ReadFileScatter, WriteFileGather
Scatter Gather Increases Efficiency
Used by SQL I/O Paths
Used by Windows Page File
Disk
• Old Design: 6.x Sorting
• AWE Availability
• WriteMultiple
• # of 8K Pages
• Forward and Backward
• Buffer Pool Ramp-up

Sector Alignment, Block Alignment
Sector: Log Writes
Block: Performance
Avoid Crossovers
DiskPart/DiskPar Utilities
Discuss with your Vendor
Alignment: http://support.microsoft.com/kb/929491
To verify that an existing partition is aligned, divide the starting offset of the RAID disk group, in bytes, by the size of the stripe unit.
Use the following syntax:
((Partition offset) * (Disk sector size)) / (Stripe unit size)
Example of alignment calculations in bytes for a 256-KB stripe unit size:
(63 * 512) / 262144 = 0.123046875
(64 * 512) / 262144 = 0.125
(128 * 512) / 262144 = 0.25
(256 * 512) / 262144 = 0.5
(512 * 512) / 262144 = 1
These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector).
• Double Touch
• Rewrites
• Defragment
• 4K Sectors

Latch
Memory (Data Pages)
Multiple Readers (SH)
One Writer (EX)
Protects In-Memory Data Page
BUF Array
Latch = Physical Protection
Lock = Logical Protection
User Mode
UMS/SQLOS Aware
Optimized
FIFO Ordering
BUF: Status, Latch, Database*, PageId, Hash*, …
• Flushed & Rollback
• Latch Timeout
• Sub-latch

Reading A Page
0:000> uf ZwWriteFile
    mov r10,rcx
    mov eax,5
    syscall        (Kernel Transition)
    ret
Get Free Buffer for Read
Acquire Exclusive (EX) Latch
Is already in-memory/hashed?
Add Entry to Page Hash
Post and Record Asynchronous Read
… Continue Processing …
Check Status (Scheduler Switch)
Complete: Validate I/O and Release Latch
kernel transition – Stuck I/O?
    ntdll!ZwWriteFile+0xa
    kernel32!WriteFile+0xf6
    sqlservr!DiskWriteAsync+0xee
    …
• Page Audits
• Read retry
• Stalled I/O Warnings
• Error raised at Acquire
• Shared (SH) waiters
• PAGE_IO* vs PAGE* Latch
• Writing A Page

Myth: Single Worker Per File
Truth: Each Worker Issues I/O
Vol #1: dbTest.MDF
Vol #2: dbTest.NDF, dbTest.LDF
Serial Plan
    select * from dbTest.dbo.tblTest
    insert into dbTest.dbo.tblTest
Create Database: Workers Assigned by Volume ID
    Primary = dbTest.MDF
    Secondary = dbTest.NDF
    Log = dbTest.LDF
Parallel Plan
    select * from dbTest.dbo.tblTest
    insert into dbTest.dbo.tblTest

Data Cache Maintenance
Memory Pressure: LazyWriter
Per NUMA Node
Time Of Last Access (TLA)
Recovery Interval: Checkpoint
Queue I/O Targets
.LDF Usage Triggers
Alternate Triggers (Backup, Restore, …)
Scatter/Gather Usage (WriteMultiple)
• Checkpoint Assignments
• By Ordinal Sweep
• Stalled I/O – LW #0
• I/O Queue Depth > 2

PAE and AWE
Physical Address Extension (PAE)
/PAE in Boot.ini
Boots Kernel with 36-bit addressing
Physical Memory > 4GB
Virtual Address Unchanged (/2GB or /3GB)
Automatic for Hot Add Memory Computers
Address Windowing Extensions (AWE)
Windows APIs (AllocateUserPhysicalPages)
Physical Memory Allocations
Un/Mapped in or out of Virtual Address Range
• 32 Bit Address = 4294967295 (0xFFFFFFFF) 4GB
• Interlocked Instruction: lock xadd dword ptr [ecx],eax
• 36 Bit Address = 68719476735 (0xFFFFFFFFF) 64GB
• Multiple Instructions
• Data Pages-Only
• Locked Pages
• Windows Paging
• Windows 2000 Bugs

Read Ahead
128 Pages Standard SKU
1024 Pages Enterprise SKU
Uses ReadFileScatter
Plan Based Decisions
Power of Asynchronous I/O
• Read Over Write
• Ramp-up

Sparse Files – Copy On Write
Usage: Online DBCC, Snapshot Databases
Buffer Pool: PrepareToDirty
File Control Block (FCB) Chaining
• Sparse Allocation
• FCB Tracking
• Windows Limits
• New Page Allocations

Advanced Protection
What is a Scribbler?
Data Page Audits: None, Torn Bits, Checksum
Log Block Checksum
Constant Page
Backup with Checksum
• DBCC Page Audit
• Stale Read Check
• SQLIOSim

REFERENCES

Overview
SQL Server Always On
http://www.microsoft.com/sql/alwayson
SQL Server I/O Basics Chapter 1
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx
SQL Server I/O Basics Chapter 2
http://www.microsoft.com/technet/prodtechnol/sql/2005/iobasics.mspx

Fundamentals and Requirements
KB230785 - SQL Server 7.0, SQL Server 2000 and SQL Server 2005 logging and data storage algorithms extend data reliability
KB917047 - Microsoft SQL Server I/O subsystem requirements for the tempdb database
KB231347 - SQL Server databases not supported on compressed volumes (except 2005 read only files)

Subsystems
KB917043 - Key factors to consider when evaluating third-party file cache systems with SQL Server
KB234656 - Using disk drive caching with SQL Server
KB46091 - Using hard disk controller caching with SQL Server
KB86903 - Description of caching disk controllers in SQL Server
KB304261 - Description of support for network database files in SQL Server
KB910716 (in progress) - Support for third-party Remote Mirroring solutions used with SQL Server 2000 and 2005
KB833770 - Support for SQL Server 2000 on iSCSI technology components (applies to SQL Server 2005)

Design and Configuration
White paper - Physical Database Layout and Design
KB298402 - Understanding How to Set the SQL Server I/O Affinity Option
KB78363 - When Dirty Cache Pages are Flushed to Disk
White paper - Database Mirroring in SQL Server 2005
White paper - Database Mirroring Best Practices and Performance Considerations
KB910378 - Scalable shared databases are supported by SQL Server 2005
MSDN article - Read-Only Filegroups
KB156932 - Asynchronous Disk I/O Appears as Synchronous on Windows NT, Windows 2000, and Windows XP

Diagnostics
KB826433 - Additional SQL Server Diagnostics Added to Detect Unreported I/O Problems
KB897284 - SQL Server 2000 SP4 diagnostics help detect stalled and stuck I/O operations (applies to SQL Server 2005)
KB828339 - Error message 823 may indicate hardware problems or system problems in SQL Server
KB167711 - Understanding Bufwait and Writelog Timeout Messages
KB815436 - Use Trace Flag 3505 to Control SQL Server Checkpoint Behavior
KB906121 - Checkpoint resumes behavior that it exhibited before you installed SQL Server 2000 SP3 when you enable trace flag 828
WebCast - Data Recovery in SQL Server 2005

Certification Policy
KB913945 - Microsoft does not certify that third-party products will work with Microsoft SQL Server
KB841696 - Overview of the Microsoft third-party storage software solutions support policy
KB231619 - How to use the SQLIOStress utility to stress a disk subsystem such as SQL Server

Utilities
Download - SQLIO Disk Subsystem Benchmark Tool
Download - SQLIOStress utility to stress disk subsystem (applies to SQL Server 7.0, 2000, and 2005; replaced with SQLIOSim, which SQL Server 2008 installs in BINN)

Blog Content
SQL Server Urban Legends Discussed
http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-urban-legends-discussed.aspx
How It Works: SQL Server Checkpoint (FlushCache) Outstanding I/O Target
http://blogs.msdn.com/psssql/archive/2008/04/11/how-it-works-sql-server-checkpoint-flushcache-outstanding-i-o-target.aspx
How It Works: SQL Server Page Allocations
http://blogs.msdn.com/psssql/archive/2008/04/08/how-it-works-sql-server-page-allocations.aspx
How It Works: Shapshot Database (Replica) Dirty Page Copy Behavior (NewPage)
http://blogs.msdn.com/psssql/archive/2008/03/24/how-it-works-shapshot-database-replica-dirty-page-copy-behavior-newpage.aspx
How It Works: SQL Server 2005 I/O Affinity and NUMA Don't Always Mix
http://blogs.msdn.com/psssql/archive/2008/03/18/how-it-works-sql-server-2005-i-o-affinity-and-numa-don-t-always-mix.aspx
How It Works: Debugging SQL Server Stalled or Stuck I/O Problems - Root Cause
http://blogs.msdn.com/psssql/archive/2008/03/03/how-it-works-debugging-sql-server-stalled-or-stuck-i-o-problems-root-cause.aspx
How It Works: SQL Server 2005 Database Snapshots (Replica)
http://blogs.msdn.com/psssql/archive/2008/02/07/how-it-works-sql-server-2005-database-snapshots-replica.aspx
How It Works: File Stream the Before and After Image of a File
http://blogs.msdn.com/psssql/archive/2008/01/15/how-it-works-file-stream-the-before-and-after-image-of-a-file.aspx
Using SQLIOSim to Diagnose SQL Server Reported Checksum (Error 824/823) Failures
http://blogs.msdn.com/psssql/archive/2008/12/19/using-sqliosim-to-diagnose-sql-server-reported-checksum-error-824-823-failures.aspx
How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem
http://support.microsoft.com/kb/231619
Should I run SQLIOSim? - An e-mail follow-up from SQL PASS 2008
http://blogs.msdn.com/psssql/archive/2008/11/24/should-i-run-sqliosim-an-e-mail-follow-up-from-sql-pass-2008.aspx
What do I need to know about SQL Server database engine I/O?
http://blogs.msdn.com/psssql/archive/2006/11/27/what-do-i-need-to-know-about-sql-server-database-engine-i-o.aspx
SQLIOSim is "NOT" an I/O Performance Tuning Tool
http://blogs.msdn.com/psssql/archive/2008/04/05/sqliosim-is-not-an-i-o-performance-tuning-tool.aspx
How It Works: SQLIOSim - Running Average, Target Duration, Discarded Buffers ...
http://blogs.msdn.com/psssql/archive/2008/11/12/how-it-works-sqliosim-running-average-target-duration-discarded-buffers.aspx
How It Works: SQLIOSim [Audit Users] and .INI Control File Sections with User Count Options
http://blogs.msdn.com/psssql/archive/2008/08/19/how-it-works-sqliosim-audit-users-and-ini-control-file-sections-with-user-count-options.aspx
Understanding SQLIOSIM Output
http://sqlblog.com/blogs/kevin_kline/archive/2007/06/28/understanding-sqliosim-output.aspx

Additional Learning Resources
Inside SQL Server 7.0 and Inside SQL Server 2000
    Written by Kalen Delaney
    Paul Randal wrote the core DBCC checks for SQL 7.0, 2000 and 2005
The Guru’s Guide to SQL Server Architecture and Internals – ISBN 0-201-70047-6
    Written by Ken Henderson after he joined Microsoft SQL Server Support
    Many chapters reviewed by developers and folks like myself
SQL Server 2005 Practical Troubleshooting – ISBN 0-321-44774-3 – Ken Henderson
    Authors of this book were key developers or support team members
    Cesar – QP developer and leader of the QP RedZone with Keithelm and Jackli
    Sameert – Developer of UMS and SQLOS Scheduler
    Santeriv – Developer of the lock manager
    Slavao – Developer of the SOS memory managers and engine architect
    Wei Xiao – Engine developer
    Bart Duncan – long time SQL EE and now developer of the Microsoft Data Warehouse – performance focused
    Bob Ward – SQL Server Support Senior EE
Advanced Windows Debugging – ISBN 0-321-37446
    Written by Microsoft developers – excellent resource
Applications for Windows – Jeffrey Richter
    Great details about Windows basics
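
Worked example: partition alignment check
The alignment arithmetic quoted from KB929491 on the Sector Alignment slide can be reproduced with a few lines of Python. This is a minimal sketch, not a Microsoft tool: the function names alignment_ratio and is_aligned are made up for illustration, and treating "aligned" as "the ratio is a whole number" is an assumption drawn from the slide's examples, where only the 512-sector offset produces the result 1.

```python
from fractions import Fraction


def alignment_ratio(partition_offset_sectors, sector_size_bytes, stripe_unit_bytes):
    """Return ((partition offset) * (sector size)) / (stripe unit size) exactly.

    Hypothetical helper: Fraction avoids floating-point rounding so the
    whole-number test below is exact.
    """
    return Fraction(partition_offset_sectors * sector_size_bytes, stripe_unit_bytes)


def is_aligned(partition_offset_sectors, sector_size_bytes=512,
               stripe_unit_bytes=256 * 1024):
    """Assumption: the partition is aligned when the ratio is a whole number."""
    ratio = alignment_ratio(partition_offset_sectors, sector_size_bytes,
                            stripe_unit_bytes)
    return ratio.denominator == 1


if __name__ == "__main__":
    # Same offsets as the slide: 512-byte sectors, 256-KB stripe unit.
    for offset in (63, 64, 128, 256, 512):
        ratio = alignment_ratio(offset, 512, 262144)
        print(f"({offset} * 512) / 262144 = {float(ratio)}  aligned={is_aligned(offset)}")
```

Run against your own volume by substituting the partition offset, sector size, and stripe unit size reported by your storage vendor; the defaults here only mirror the slide's 256-KB example.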