Customer Stories From the Front Line

Bob Ward, Principal Architect Escalation Engineer, Microsoft
Adam Saxton, Sr. Escalation Engineer, Microsoft
May 2nd – 4th, Nottingham

The day…
• 9:00-10:30am: Engine Stories
• 10:30-10:50am: Break
• 10:50am-1:00pmish: Connectivity Stories
• 1:00pmish-2:00pm: Lunch
• 2:00pm-3:45pm: Strange tales of scribble, leak, and fail
• 3:45-4:00pm: Break
• 4:00pm-4:45pm: The Data Surgery Story
• 4:45pm-?: Q&A

Why are you here today?
• Learn from the pain of others
• Walk away with practical information
• Plenty of demos
• Learn “how stuff works” (and how CSS does their job)
• Tweet that you asked Bob a question he cannot answer (@bobwardms)
• We may go deep in some places

What kind of stories will I tell?
• VLF Fragmentation
• Memory Leak
• Always On
• Data Surgery
• Memory Scribbler
• Connectivity
• Scheduler Non-Yield

VLF Fragmentation Strikes Again
Tim Chapman, PFE – East Region
• Database Mirroring principal behind syncing to secondary on SQL 2008 SP2
• Non-yielding error in the ERRORLOG with stack dump
• 2 problems that appear to be unrelated

  2012-08-01 17:37:00.500 Server Using 'dbghelp.dll' version '4.0.5'
  2012-08-01 17:37:00.530 Server ***Unable to get thread context for spid 0
  2012-08-01 17:37:00.530 Server * *******************************************************************************
  2012-08-01 17:37:00.530 Server *
  2012-08-01 17:37:00.530 Server * BEGIN STACK DUMP:
  2012-08-01 17:37:00.530 Server * 08/01/12 17:37:00 spid 10728
  2012-08-01 17:37:00.530 Server *
  2012-08-01 17:37:00.530 Server * Non-yielding Scheduler
  2012-08-01 17:37:00.530 Server *
  2012-08-01 17:37:00.530 Server * *******************************************************************************
  2012-08-01 17:37:00.530 Server Stack Signature for the dump is 0x000000000000028D
  2012-08-01 17:37:05.750 Server External dump process return code 0x20000001. External dump process returned no errors.

  Callout on the dump: Means “no session” but is “background”

Inside VLF Fragmentation
• This looks fairly harmless
• Symptoms are typically really long recovery or restore even if there is nothing to recover
• What if I have to scan this list? VLF1, VLF2, … VLF<n>, … VLF 900,000
• Traversing used to scan the entire list each time
• Fixes in all versions to scan from “where you are”
• You still may need to go through a large list
• Cheaper to find everything in a small # of VLFs
• Finding a specific LSN is really bad because there is no index

Back to our Story
DB Mirror Principal lagging is typically caused by
• Network latency
• Secondary log harden slow (high safety)
But… the stack dump pointed to something interesting
• What workers are not waiting?
• One is related to finding something in the log and Database Mirroring
We suspect a VLF problem
• DBCC LOGINFO shows 970,000 Virtual Log Files
• This is never good. I’ve found anything over 10,000 can be impactful (especially without fixes).
Large #VLFs cause the DBM background thread to spend CPU cycles searching for log blocks but not yielding on the scheduler (60+ secs)

Demo

The Solution
• Break the Mirror
• Reduce VLFs to a reasonable number
  • Truncate the log
    • BACKUP LOG to keep the log backup chain. Takes longer and may be needed more than once (wait_type: BACKUP_IO)
    • Change to SIMPLE recovery is quicker but breaks the log backup chain
  • Then… DBCC SHRINKFILE using TRUNCATEONLY
  • What if I’m in the middle of recovery?
    • Let it finish
    • Kill it and rebuild the log (WARNING: POSSIBLE INCONSISTENCY)
• Resize the log to desired size with larger autogrow increment
• Reinitialize the mirror

Prevention
• Apply latest fixes (included in SQL Server 2012)
• Autogrow is usually the cause
• Create log to needed size
• Choose larger but not “too large” autogrow increment
• Don’t rely on autogrow (but keep for emergency)
• System Center Advisor detects this problem

System Center Advisor – systemcenteradvisor.com
• FREE Cloud Service for Configuration Monitoring
• Based on CSS Knowledge
• Integrated into SCOM 2012 SP1
• Windows Server, SQL, SharePoint, Exchange and Lync

It looks like tempdb to me
Bob Ward, Texas
• Customer benchmark cannot scale on SQL Server 2008R2 RTM
• Larger number of users run into massive blocking
• Looks like a tempdb latch allocation contention problem

Tempdb latch contention
• Users waiting on PAGELATCH_UP for 2:<n>:<alloc page>
• Customer has implemented normal solutions for tempdb latch contention
  • Expanded number of tempdb files
  • Using –T1118
• Allocation contention not system tables
• Using prepared ad-hoc queries referencing temporary tables

Yielding problem: Bob asked to investigate the case
• DMV does show latch contention on tempdb
• Who is the owner of the latch and what are they doing?
• Owner is RUNNABLE for extended period of time
  • What is RUNNABLE? “Ready to run” but someone else running on scheduler
• So a query is not yielding, holding up the owner of a tempdb latch
• OK, so who is RUNNING?
• Every scheduler has someone RUNNING a SELECT
• Tempdb is not the problem

So if it is not tempdb, what is it?
• We know a SELECT is doing this
• What is the query?
• In some cases we can use sql_handle and find a prepared ad-hoc query accessing tempdb
• But… in other cases, the sql_handle and/or plan_handle are NULL. How can this be?
• The only time this occurs is when we are either compiling or finding our query in cache
• These are prepared and compiles/sec is very low
• So it must be cache lookup
• Let’s look at the procedure cache

Now we find the problem
• dm_exec_cached_plans shows a high number of entries for the same bucketid
  • This means long hash chains to find a cached plan
• This part of the code doesn’t yield well
• Each ad-hoc plan for each user is unique but hashing as though they were the “same query”
• Hashing problem easily reproducible but no performance impact
• Performance impact only seen when hash chain is in the tens of thousands
• Which explains why it is only an issue when the benchmark tries to scale up

The problem seen visually
[Diagram: “A bad hash chain” shows Bucket 1, Bucket 2 and Bucket 3, each with a chain Entry1, Entry2, … Entry 10000, … Entry 100000, and the “Plan to find” far down a chain. “A better hash chain” shows Bucket 1 through Bucket 100000, each holding only Entry1 and Entry2, with “Plan found here” near the head of a chain.]

Demo

The Solution
• Stored procedures won’t hit this
• Customer would have to rewrite a ton
• We seek a fix with SQL Development
  • Build a repro
  • File a bug and request a fix
  • Hotfix coded and delivered
• The fix will ensure smoother hash to avoid long chains
• Benchmark exceeds more than ever before
• Everyone happy

Cannot Generate SSPI Context
Adam Saxton, Texas
• Users cannot connect to SQL Server with the error below
• Started happening after the weekend
• Doesn’t matter who tries to connect or which application they use

The Misplaced SPN: What is an “SSPI Context?”
• Security Support Provider Interface (SSPI) is the API used by SQL Server to authenticate users for “Windows Authentication”
• SSPI Providers provide the authentication implementation such as NTLM, Kerberos, and “Negotiate” (see sys.dm_exec_connections)
• For Negotiate, Kerberos is used by default for remote connections and NTLM used as “fallback” (or local)
• Kerberos requires the correct
  configuration of a Service Principal Name (SPN)
• “double-hop” scenarios require delegation

Service Principal Name (SPN)
• Uniquely identifies an instance of a service
• Required for Kerberos
• Used to request a service ticket
• Can have a port number for TCP
• Bound to an Active Directory User or Computer Account
• Should be only associated with one account which should be the Service “Log on As” account
• “Built-in” account = computer account

SQL Service SPN
• Svc account must have AD perms
• SQL Server startup = register SPN
• SQL Server shutdown = delete SPN
• Format: class / host : port

  Protocol      Instance           SPN
  TCP           Default Instance   MSSQLSvc/passsql.pass.local:1433
  TCP           Named Instance     MSSQLSvc/passsql.pass.local:56772
  Named Pipes   Default Instance   MSSQLSvc/passsql.pass.local
  Named Pipes   Named Instance     MSSQLSvc/passsql.pass.local:myinstance

• Blog: What SPN do I use and how does it get there?

It should always work but…
• Missing SPN
  • SPN not bound to any AD account
• Duplicate SPN
  • SPN bound to more than one AD account
• Misplaced SPN
  • SPN bound to “wrong” AD account, not to the current “Log on As” account

How can this happen?
• Missing SPN
  • SQL Service Account does not have permissions to add SPN to AD
  • “Fallback” to NTLM
• Duplicate SPN
  • SQL shutdown unexpectedly so SPN not removed
  • SQL Service account changed and has permissions to add new SPN
• Misplaced SPN
  • SQL shutdown unexpectedly so SPN not removed
  • SQL Service Account changed but does not have permissions to add new SPN

Back to our Case…
• There is only one cause for failure: Misplaced SPN
• What is misplaced?
  • SPN is associated with computer account but current “Log on As” is a domain account
  • There is no SPN for the domain account
• What do we do?
  • Remove the SPN from computer account
  • Register SPN for the domain account
• Connection now works
  • I’ve sometimes seen app restart required and ‘klist purge’

Demo

The Misplaced SPN: resources
• Kerberos Event Logging
• System Center Advisor: http://www.systemcenteradvisor.com
• “You may experience connectivity issues to SQL Server if SPNs are misconfigured”: http://support.microsoft.com/kb/2443457

Connection Timeout
Adam Saxton, Texas
• Login Timeout Expired error message at client
• Can occur intermittently and/or for periods of time
• May or may not be specific to Windows Authentication
• Major cause is network issues

Cause: AD Communication Issues
• Slow response from Active Directory
• Could be network, could be AD issues, could be a DC problem
• SQL Authentication logins work
• dm_os_wait_stats / PREEMPTIVE_OS_LOOKUPACCOUNTSID
Login flow (the Login Timeout clock runs across all steps):
  1. TDS Login
  2. Receive TDS packet
  3. Engine creates SQLOS Task
  4. Find available worker on scheduler
  5. Validate Windows Logon
  6. Assign SPID

Cause: Worker Pool Starvation
• No available workers to service connection task
• dm_os_wait_stats / THREADPOOL
• Dedicated Admin Connection to see this live
• Look for other waits and often a long blocking chain
• DO NOT assume you need more worker threads
• Applies to any task
Login flow (the Login Timeout clock runs across all steps):
  1. TDS Login
  2. Receive TDS packet
  3. Engine creates SQLOS Task
  4. Find available worker on scheduler
  5. If none, set THREADPOOL wait type
  6. Next available worker runs task

The Solution
• This problem means you connected but there was a delay in processing the login
  • Perhaps you need a longer login timeout?
• Check if a SQL Server problem
  • Can you connect locally?
  • Look for potential blocking causing THREADPOOL
  • Dedicated Admin Connection may be needed
  • Don’t forget about LOGON TRIGGER
• Look for domain issues
  • PREEMPTIVE_OS_LOOKUPACCOUNTSID signature
• Could be network infrastructure
  • Probably will need network tracing to determine

Demo

Error 18056
Tejas Shah, India
• Periodically not able to connect
• ERRORLOG has multiple of these below
• Lasts for 5-10 minutes and “resolves itself” without having to restart SQL

  spid52 Error: 18056, Severity: 20, State: 29.
  spid52 The client was unable to reuse a session with SPID 52, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

What is a “redo” login?
[Diagram: No Connection Pooling: CONNECT and DISCONNECT go straight to SQL Server (SPID <n>). Connection Pooling: CONNECT and DISCONNECT go to the Connection Pool; a later CONNECT sends nothing to the server; the next BATCH/RPC carries the reset bit in the TDS header, so the server resets the connection, performs the “redo” login, then runs the batch/rpc.]

18056 vs 18456

  2013-04-26 16:32:39.63 spid56 Error: 18056, Severity: 20, State: 46.
  2013-04-26 16:32:39.63 spid56 The client was unable to reuse a session with SPID 56, which had been reset for connection pooling. The failure ID is 46. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
  2013-04-26 16:32:39.66 Logon Error: 18456, Severity: 14, State: 38.
  2013-04-26 16:32:39.66 Logon Login failed for user <account>. Reason: Failed to open the explicitly specified database 'mydb'.
  Callouts: Msg 18456; database in connection string cannot be accessed
  [CLIENT: <local machine>]
  Callout: Redo login

18056 States
• This blog post has the states for both 18056 and 18456
• 40 - LoginSessDb_UseDbImplicit
  • The default database for the login cannot be accessed
• 46 - RedoLoginSessDb_UseDbExplicit
  • Database in connection string cannot be accessed
• 29 - RedoLoginException
  • Some “exception” encountered during “redo login”
  • dm_os_ring_buffers (RING_BUFFER_EXCEPTION) can provide more details
• LOGON TRIGGER: State = 1 (Msg 17892 for login)

What about state 29?
• Any “exception” that occurs during “redo” that is not one of the other states
• Typically there is some performance problem on the server
  • High CPU
  • Low Memory
  • IO
  • Lock timeout
  • Blocking/Scheduling
• Query cancelled or times out (Attention received from client)
• dm_os_ring_buffers of type RING_BUFFER_EXCEPTION has details

Demo

The problem seen visually
[Diagram: the Connection Pooling flow again (CONNECT/DISCONNECT to the Connection Pool, nothing sent to the server on CONNECT, BATCH/RPC with the reset bit in the TDS header, reset the connection, redo, run batch/rpc), now with an Attention (query timeout or explicit cancel) arriving during the “redo”.]
• If Server receives ATTN during “redo”, 18056 state 29 in ERRORLOG

The Solution
• We removed 18056 state 29 for ATTN scenario from ERRORLOG
  • CU fix for 2008/2008R2
  • SQL Server 2012
• If still happening it must be some other “error”
  • Look in ERRORLOG
  • sys.dm_os_ring_buffers
• Could condition for ATTN still be happening?
  • Performance issues on server
  • Application is cancelling query (or timeout)
  • sys.dm_os_ring_buffers

Memory Scribbler
Suresh Kandoth, Texas
• Different customers calling over a long period of time with different errors
• Page Header Corruption with same signature for all cases
• All systems running same hardware [brand and model]
Agenda
• Walk through the timeline
• Understanding some errors (832, 5242/5243, Bugcheck, etc)
• How do we recover?
• Understanding the resolution

November & December 2011
• Error 832 reported by constant page sniffer
• First 4 bytes of page header damaged with random hex
  • Perhaps not so random?

  2011-12-19 09:46:09.60 spid5s Error: 832, Severity: 24, State: 1.
  2011-12-19 09:46:09.60 spid5s A page that should have been constant has changed (expected checksum: 2247c294, actual checksum: ba52fd57, database 5, file 'Data.mdf', page (1:248037)). This usually indicates a memory failure or other hardware or OS corruption.

  00000003`437a0000  2b 31 87 7f 00 82 01 00-51 8e 12 00 01 00 1c 00
  00000003`437a0010  53 8e 12 00 01 00 17 00-05 a0 d2 0a f6 00 00 00
  00000003`437a0020  52 8e 12 00 01 00 00 00-fe 71 02 00 6e 33 dc 23
  00000003`437a0030  4d 00 00 00 00 00 00 00-00 00 00 00 78 65 1e 48

Checksum
[Diagram: Page 1:248037. The page header holds m_flagBits=0x200 (0x200 = HAS_CHECKSUM) and m_tornBits=0x2247c294 (the expected checksum). The actual checksum computed for the page is 0xba52fd57. The body shows data rows and the slot array after the 96-byte header. The Error 832 message above is repeated on the slide.]
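
The expected-versus-actual comparison behind Error 832 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SQL Server's implementation: CRC32 stands in for the real page checksum algorithm, the header fields are passed in separately instead of being parsed from the page, and the names `page_checksum` and `is_constant_page_intact` are hypothetical. Only the 0x200 flag value, the 8 KB page size and the 4-byte signature come from the slides.

```python
import zlib

PAGE_SIZE = 8192        # 8 KB database page
HAS_CHECKSUM = 0x200    # m_flagBits value shown on the slide


def page_checksum(page: bytes) -> int:
    """Stand-in checksum (CRC32). NOT the algorithm SQL Server uses."""
    return zlib.crc32(page) & 0xFFFFFFFF


def is_constant_page_intact(page: bytes, flag_bits: int, stored_checksum: int) -> bool:
    """Return False when a clean page no longer matches its stored checksum."""
    if not flag_bits & HAS_CHECKSUM:
        return True  # no checksum recorded, nothing to validate against
    return page_checksum(page) == stored_checksum


# A clean page and the checksum recorded for it (the role m_tornBits plays).
page = bytearray(b"\x01" * PAGE_SIZE)
flag_bits = HAS_CHECKSUM
stored = page_checksum(bytes(page))
ok_before = is_constant_page_intact(bytes(page), flag_bits, stored)

# Something outside the engine overwrites the first 4 bytes of the page.
page[0:4] = bytes.fromhex("2b31877f")
ok_after = is_constant_page_intact(bytes(page), flag_bits, stored)

print(ok_before, ok_after)
```

The point of the sketch is that the engine never changed the page itself, so any mismatch on a clean page means something else wrote to that memory.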

The Page
[Diagram: page bytes, with 2b 31 87 7f shown written over 6e 00 13 4f at the start of the page. Candidate scribblers: thread, another process, driver, hw.]

Constant Page Sniffer: Error 832
Lazy Writer Background Task
• Every Second – Housekeeping
• Sweeps over 16 Buffers (Pages)
• Finds “Clean” Buffer with Checksum enabled
• Validates Checksum
• 832 thrown if failure is detected
• XEvent - constant_page_corruption_detected

Constant Page Sniffer: Trace Flag 831
• More Aggressive
• Validates Checksum before page modified (dirty), for example when a user issues an Update statement
• Validates Checksum when page is discarded (free buffer) by the Lazy Writer Background Task
• 832 thrown if failure is detected

Demo: Error 832

January 2012
• Two more cases with different symptoms

  2011-12-26 19:25:32.40 spid52 ***Stack Dump being sent to E:\SQLDump0077.txt
  2011-12-26 19:25:32.40 spid52 * *******************************************************************************
  2011-12-26 19:25:32.40 spid52 *
  2011-12-26 19:25:32.40 spid52 * BEGIN STACK DUMP:
  2011-12-26 19:25:32.40 spid52 * 12/26/11 19:25:32 spid 9376
  2011-12-26 19:25:32.40 spid52 *
  2011-12-26 19:25:32.40 spid52 * Location: e:\sql10_main_t\sql\ntdbms\storeng\dfs\access\pagecompression.inl:1659
  2011-12-26 19:25:32.40 spid52 * Expression: uncomprSize <= uncomprDataSize
  2011-12-26 19:25:32.40 spid52 * SPID: 52
  2011-12-26 19:25:32.40 spid52 * Process ID: 7276

  Error: 5243, Severity: 22, State: 8. An inconsistency was detected during an internal operation. Please contact technical support.
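
Looking back at the Constant Page Sniffer slides, the housekeeping sweep can be sketched as follows. This is a simplified model under assumptions, not the engine's code: the `Buffer` class, the round-robin cursor, and the CRC32 stand-in checksum are all hypothetical. Only the "16 buffers per sweep" and "clean buffer with checksum enabled" rules come from the slides.

```python
from __future__ import annotations

import zlib
from dataclasses import dataclass

SWEEP_SIZE = 16  # buffers examined per sweep, per the slide


@dataclass
class Buffer:
    page_id: int
    data: bytes
    stored_checksum: int
    has_checksum: bool = True
    dirty: bool = False


def checksum(data: bytes) -> int:
    # Stand-in checksum; SQL Server's real page checksum algorithm differs.
    return zlib.crc32(data) & 0xFFFFFFFF


def sweep(pool: list[Buffer], start: int) -> tuple[list[int], int]:
    """Check one batch of buffers; return (damaged page ids, next start index)."""
    damaged = []
    for offset in range(min(SWEEP_SIZE, len(pool))):
        buf = pool[(start + offset) % len(pool)]
        # Only clean pages with a checksum are expected to be constant.
        if buf.dirty or not buf.has_checksum:
            continue
        if checksum(buf.data) != buf.stored_checksum:
            damaged.append(buf.page_id)  # the engine would raise error 832 here
    return damaged, (start + SWEEP_SIZE) % len(pool)


# Build a small pool of 40 clean buffers with matching checksums.
pool = [Buffer(i, bytes([i]) * 64, checksum(bytes([i]) * 64)) for i in range(40)]

pool[3].dirty = True                     # legitimately modified, so it is skipped
pool[3].data = b"changed by an UPDATE"
pool[20].data = bytes.fromhex("2b31877f") + pool[20].data[4:]  # scribbled while clean

found, cursor = [], 0
for _ in range(3):                       # three housekeeping ticks cover all 40 buffers
    hits, cursor = sweep(pool, cursor)
    found.extend(hits)

print(found)
```

Because each tick only touches a small batch, detection is cheap but delayed: the scribbled page is reported on whichever tick the sweep reaches it, which is why Trace Flag 831 adds checks at the moment a page is dirtied or discarded.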

Error 5242/5243
[Diagram: Page 1:248037 with page header (m_flagBits=0x200, m_tornBits=0x2247c294), data rows and slot array; the bytes 2b31877f have landed in a data row.]
• Row structure is malformed
• DBCC CHECKDB is your next step
• If we are unable to determine Database or page affected, raise 5243

Demo: Error 5242/5243

February 2012
• More reports of similar problems on same model of machine
• Also received reports external to SQL [blue screen / bug checks / other applications like svchost crashing]

  *******************************************************************************
  *                            Bugcheck Analysis                                *
  *******************************************************************************
  MEMORY_MANAGEMENT (1a)
  # Any other values for parameter 1 must be individually examined.
  Arguments:
  Arg1: 0000000000041201, The subtype of the bugcheck.
  Arg2: fffff68000002000
  Arg3: 000000007f87312b
  Arg4: fffffa80656a60b0
  BUGCHECK_STR: 0x1a_41201
  PROCESS_NAME: conhost.exe
  CURRENT_IRQL: 0
  LAST_CONTROL_TRANSFER: from fffff800018df2de to fffff80001882c40
  rax=fffffa80656a60b0 rbx=000000007f87312b rcx=000000000000001a

  Callout: 2b31877f

What we knew?
• All systems were running on the same hardware
• Corrupted section was overwritten with the same set of 4 bytes
  • [2b 31 87 7f]
• First 4 bytes of an 8KB SQL database page boundary
• First 4 bytes of a 4KB OS page boundary

March 2012
• Microsoft & hardware manufacturer joint investigation
• Identified damage always happens on same physical page in memory
• Safe workaround provided to customers
  • bcdedit /set badmemorylist 0x4013 (List of Page Frame Numbers)

[Diagram: Process Virtual Memory pages (Page 0, Page 1, Page 2, Page 3, … Page n) mapped onto Physical Memory page frames 0x0000 … 0xFFFF. The virtual page containing 6e 00 13 4f maps to physical page 0x4013, where 2b 31 87 7f is written.]

April 2012 – The Solution
• BIOS update released to address the issue.
• Three models of servers identified
• BIOS Release Date: 4/9/2012, Version: 2.7.0
• Check DELL site. There may be newer BIOS updates

Resources
• SQL Server I/O Basics White Paper – Chapter 2
  • http://technet.microsoft.com/en-us/library/cc917726.aspx
• How to troubleshoot Msg 832 (constant page has changed) in SQL Server
  • http://support.microsoft.com/kb/2015759/EN-US

The classic memory leak story
Wafa Jaffal, Texas
• Overall server performance degrades over time on SQL Server to the point of failure
• Server runs fine for a few days but then ERRORLOG starts to show Msg 701
• Restart of SQL Server works but problem comes back in a few days

  2012-05-20 13:13:24.80 spid146 Failed Virtual Allocate Bytes: FAIL_VIRTUAL_RESERVE 65536
  2012-05-20 13:34:39.17 spid111 Error: 701, Severity: 17, State: 89.
  2012-05-20 13:34:39.17 spid111 There is insufficient system memory in resource pool 'default'…
  2012-05-20 13:34:39.17 spid58 Error: 701, Severity: 17, State: 123.
  2012-05-20 13:34:39.17 spid58 There is insufficient system memory in resource pool 'default'…

• This sounds like a memory leak?
• Understanding how to troubleshoot 701 can be difficult
• Is this a SQL Server bug or something “I did”?

SQL Memory Leak: Where do you even start?
• Look at memory statistics in the ERRORLOG
• What do the MEMORYCLERKS say?
• What about perfmon?
• Memory problems outside of SQL Server?
• What about those magic DMVs I’m always hearing about?

We dig deeper into SQL Memory
MEMORYCLERK_SQLGENERAL
• “General” doesn’t sound too helpful
• But it does rule out some things
  • Not data pages, proc cache, compile, …
• SQLGENERAL is a “general purpose” category of “miscellaneous” type memory
• Memory objects use this clerk
Our choices
• Could be a bug
• Could be an application “doing something”
• How can you “do something” to cause this?
• Either way we could capture a trace of queries to look for patterns
Who is the largest memory object user?
• Need DMVs to find that out such as sys.dm_os_memory_objects
• sys.dm_os_memory_allocations can provide detailed audit

Demo

Can SQL Server 2012 help?
• A new memory manager
• New Memory Extended Events
• Use the Event Pairing target

Did we find the solution?

Why did it fail and fail to failover?
Curt Mathews, North Carolina
• Application cannot connect to SQL Server on primary and secondary. Using AlwaysOn on SQL Server 2012
• At 11:55am SQL Server on Primary has a major “problem”
• Secondary replica configured but failover didn’t work.
• What caused failure and why didn’t failover work?

  2012-04-19 11:53:57.44 spid43s AlwaysOn Availability Groups connection with secondary database terminated for primary database <db> on the availability replica with Replica ID: {59f9f2aa-a90e-4835-8cb2-701d9f7da0ed}. This is an informational message only. No user action is required.
  Callout: Msg 35267… really?
  2012-04-19 11:54:05.57 spid9s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Data\<db>_log.LDF] in database [<db>] (11). The OS file handle is 0x00000000000019BC. The offset of the latest long I/O is: 0x00000038f67200

AlwaysOn Failover: First let’s talk about what failed and why?
• ERRORLOG shows connection problems to secondary
• ERRORLOG shows I/O delays close to app problem time
• But what caused the failure?
• System Event Log shows an unexpected machine restart
• Because we use Windows Clustering this should trigger an automatic failover
• So we know what the failure “was”
• We don’t know what caused it but it appears “system” related
• We still don’t know why automatic failover didn’t work

Let’s talk SQL Server 2012 failover detection
[Diagram: Prior to SQL Server 2012: Windows Service Manager, Cluster Failover Resource DLL, SQL Server. SQL Server 2012 (Failover Policy Levels): Windows Service Manager, Cluster Failover Resource DLL, SQL Server.]
• The “Engine” knows if it’s healthy

Granularity: Flexible Failover Policy Levels

  Level         Condition                                                Description
  1             Failover or restart on server down                       SQL Service not running or lease expired
  2             Failover or restart on server unresponsive               #1 + sp_server_diagnostics not returning within HealthTimeOut
  3 (default)   Failover or restart on critical server errors            #2 + “system” error
  4             Failover or restart on moderate server errors            #3 + “resource” error
  5             Failover or restart on any qualified failure conditions  #4 + “query processing” error

sp_server_diagnostics
[Diagram: sp_server_diagnostics is invoked via T-SQL against SQLSERVR.EXE with repeat interval = 30% of health timeout; a second invocation with repeat = 5 min feeds the sp_server_diagnostics_component_result event.]

sp_server_diagnostics result set

  Component          Description
  system             Spinlocks, latches, AVs, non-yield, stack dumps
  resource           Memory resources
  query_processing   Worker threads, tasks, wait types, CPU intensive sessions, and blocking tasks
  io_subsystem       Long Pending I/O, I/O stalls
  events             Ring buffer information

• Each component has error and warning conditions but also provides detailed information
• Error conditions are for failover detection
• Warnings provide information before heading to an error state
• io_subsystem and events are informational only

That’s all nice, but why didn’t failover work?
• SYNCHRONIZED = HEALTHY
• SYNCHRONIZING = PARTIALLY_HEALTHY
• NOT SYNCHRONIZING = NOT_HEALTHY
[Diagram: Primary and Secondary under Windows Failover Clustering Services. Callouts: “We PULL Log blocks”, “Get me synced”, “We are now in sync”, “FAILOVER Allowed”, “FAILOVER Not Allowed”.]
• No data loss but downtime

How do we keep it all in sync?
• SYNCHRONIZED = HEALTHY
• SYNCHRONIZING = PARTIALLY_HEALTHY
• NOT SYNCHRONIZING = NOT_HEALTHY
[Diagram: Primary and Secondary, “We PULL Log blocks”. wait_type callouts on the transactions: WRITELOG and HADR_SYNC_COMMIT; WRITELOG; HADR_SYNCHRONIZING_THROTTLE.]

Demo

What if Primary didn’t come back online?
• You can manually failover to a secondary
• You might not lose data? Look at secondary DMVs
• If primary now comes back online, it will become a secondary

Our conclusions for the customer
• Primary lost ability to connect to secondary
• Primary experienced unexpected restart of computer
• Secondary NOT_SYNCHRONIZING so automatic failover cannot proceed (possible data loss)
• Primary came back online
• Secondary synced back up
• Normal AlwaysOn synchronization resumed
• Our logging and tracing helped tell the story the “first time”

Our recommendations to the customer
• Investigate possible system issues with primary
• Alerts when NOT_SYNCHRONIZING occurs. Primary is still up but auto failover in jeopardy
• Look at changing failover policy level
• Create scripts to monitor connectivity between machines

Patient enters the emergency room
Robert Dorr, Texas
• Recovery fails for customer db. Database SUSPECT and not accessible
• Customer attempts to restore backup do not work.
• Database is a SharePoint content database so potential loss is company documents
• Customer using DPM to backup SQL Databases
• This means “backups” are MDF and LDF files
• They find one good backup from 3 months back using T-SQL BACKUP
• We can’t find the actual current database that went SUSPECT.
  Overwritten by RESTORE attempts
• We can’t even access using EMERGENCY. File header page damaged
• But is there any good data in the database? Even if there, how do we get it? Should we even try?

Data Recovery: The Problem Visually
[Timeline: Good Backup 3 months ago (Old); No Backups (HOLE); from 1 month ago to Current Date, All Backups Damaged.]
Desired Outcome
• 1st: Recover Current Date
• 2nd: Merge Current with Old

What if your database is damaged?
• RESTORE from backup
• CHECKDB repair
• Try to access using EMERGENCY mode
• Start working on your resume

Patient damage may be from previous injury
• Engineer scans current MDF file and finds a lot of pages have 0s
• Zip file comparison
  • “Old” good backup zips from 145Gb to 80Gb
  • Current MDF file zips from 145Gb to 6Gb
• Sounds to us like a known DPM problem
• Don’t know the original problem that caused SUSPECT DB
• So… a call is put into the surgeons

How do you “scan” an MDF file?
• Read 8Kb from File
• Map 96 bytes of this into a C++ class based on DBCC PAGE
• Analyze the class for its values
[Diagram: MDF File containing a bad page header and runs of pages filled with 0000000000000000. Callout: “Not exactly the same”.]

The Data Recovery Surgery Problem
• Are there any good pages?
• How do we get good pages out of the database?
• How do we put them into a good database?
• How do we access them once we put them there?
• Wait… the major injury that has the patient at risk is SharePoint docs
• Which means LOB pages
• This complicates the surgery plan
• We want to use the power of SQL Engine as much as possible

How does SQL Server scan data?
[Diagram: Heap: sysallocunits (allocunitid, firstpg, firstiam) points via firstiam to an IAM Page (IAM bitmap), which locates the Data Pages; each Data Page carries the allocunitid (1001 in the example); IAMs scanned “on disk order”. Clustered Index: sysallocunits (allocunitid, firstpg, firstiam) points via firstpg to the first Data Page, which links to the next Data Page and so on. This is a linked list. allocunitid still matters.]

How do we retrieve LOB data?
• If “text in row”, LOB is in the data row
[Diagram: Data Page holds the Data Row with a TEXTPTR (PageID, SlotID) pointing to the LOB Root Page (Root Links for Row 1, Root Links for Row 2), which in turn points to LOB Pages holding LOB Data.]

The Surgeons come up with a plan
• 700 data pages from “doc” table
• Find 300 other pages from any other table
• “Old” good backup restored
  • “tempdoc” db
• “Bad” MDF file
  • 1000 data pages from “doc” table
  • 10000 LOB pages from “doc” table
• These are in valid cl index with same allocunitid
[Diagram: firstpg 1:1000 starts a page chain (1:1000, 1:1001, 1:1002, 1:1003, 1:1004, …) built from Data Pages from Doc Table and Data Pages from Other Table (1:100, 1:101, 1:200, 1:500, 1:800, …); LOB pages from “doc” table in bad MDF (1:2000, 1:2001, … 1:5000, …).]
• Now we have a valid table to work with

What is the status of the patient now?
• We believe some data rows point to valid LOB chains
  • These are valid documents to recover
• We believe some data rows do not point to valid LOB chains
  • These are “lost” documents
• We believe some valid LOB chains exist but don’t have a data row
  • These are valid documents to recover but are “orphans”

Why didn’t we use the tlog?
• Turns out the LDF files are all intact
• Could we somehow reverse engineer changes from each of the LDF files captured over time?
• Problems with transaction log reverse engineering (months?)
  • We have no tool to do this
  • If not using replication, updates are only diffs from pages

The 2nd surgery
• Copy out from “temp doc” db into clean db
• Merge in 3 month old backup
• What about LOB “orphans”?
  • Find orphans and export them into files
  • We had to write a special program (Heiko Hatzfeld, PFE)
  • BlobDumper utility to copy out all new “docs” into files (export program)

We ran into complications
• Text Pages – SQL Engine does not check allocation state like SQL 6.x does
• Select with (INDEX=0) forces clustered index chain only scan - FirstPg
• -T651 Disable ghost activity to avoid issues (AVs and hangs)
• -T652 Disable read ahead to ignore IAM usage, page chain only
• OPTION (MAXDOP 1) to avoid IAM usage
• Read only DB to avoid update stats and other allocations

What happened to the patient?
• ~16,000 new/changed documents recovered (1st surgery) in db and files
• Customer can import in 3 month old documents along with this (Total documents recovered ~120,000)
• We also recovered ~1000 “orphaned” docs with program (2nd surgery)
• Customer builds temporary SharePoint site
• Imports documents
• Users go to temporary site to copy out important documents for their needs

Why we got a bit lucky though
• One MDF file
• Schema was same from 3 months ago
• We had a clustered index structure to use
• No partitions
• No compressed pages
• No encryption used
• Clustered index not rebuilt (allocunitid same)

Surgery really can be expensive
• Took the two of us 400 total hours over 3 weeks
• We must keep data privacy in mind at all times
• 7 computers; 3 labs; 3 countries all of the time
• Long hours and weekends
• We clearly can’t scale this
• Bottom line is that we don’t have a staff of surgeons just waiting around
• The knowledge to do this is not prevalent because we rarely need it

bobward@microsoft.com, @bobwardms
asaxton@microsoft.com, @awsaxton
http://blogs.msdn.com/psssql
Download link: http://sdrv.ms/YeMtET