Customer Stories from the Front Line

Bob Ward
Principal Architect Escalation Engineer
Microsoft
Adam Saxton
Sr. Escalation Engineer
Microsoft
May 2nd – 4th, Nottingham
The day…
9:00-10:30am • Engine Stories
10:30-10:50am • Break
10:50am-1:00pmish • Connectivity Stories
1:00pmish-2:00pm • Lunch
2:00pm-3:45pm • Strange tales of scribble, leak, and fail
3:45-4:00pm • Break
4:00pm-4:45pm • The Data Surgery Story
4:45pm-? • Q&A
Why are you here today?
• Learn from the pain of others
• Walk away with practical information
• Plenty of demos
• Learn “how stuff works” (and how CSS does their job)
• Tweet you asked Bob a question he cannot answer: @bobwardms
• We may go deep in some places
What kind of stories will I tell?
• VLF Fragmentation
• Memory Leak
• Always On
• Data Surgery
• Memory Scribbler
• Connectivity
• Scheduler Non-Yield
VLF Fragmentation Strikes Again
Tim Chapman, PFE – East Region
• Database Mirroring principal falling behind syncing to secondary on SQL 2008 SP2
• Non-yielding error in the ERRORLOG with stack dump
• 2 problems that appear to be unrelated
2012-08-01 17:37:00.500 Server Using 'dbghelp.dll' version '4.0.5'
2012-08-01 17:37:00.530 Server ***Unable to get thread context for spid 0
2012-08-01 17:37:00.530 Server * *******************************************************************************
2012-08-01 17:37:00.530 Server *
2012-08-01 17:37:00.530 Server * BEGIN STACK DUMP:
2012-08-01 17:37:00.530 Server * 08/01/12 17:37:00 spid 10728
2012-08-01 17:37:00.530 Server *
2012-08-01 17:37:00.530 Server * Non-yielding Scheduler
2012-08-01 17:37:00.530 Server *
2012-08-01 17:37:00.530 Server * *******************************************************************************
2012-08-01 17:37:00.530 Server Stack Signature for the dump is 0x000000000000028D
2012-08-01 17:37:05.750 Server External dump process return code 0x20000001.
External dump process returned no errors.
Callout: Means “no session” but is “background”
Inside VLF Fragmentation
VLF1 → VLF2 → … → VLF<n> → … → VLF 900,000
This looks fairly harmless. What if I have to scan this list?
• Symptoms are typically really long recovery or restore even if there is nothing to recover
• Traversing used to be scan entire list each time
• Fixes in all versions to scan from “where you are”
• You still may need to go through a large list
• Cheaper to find everything in small # of VLFs
• Finding specific LSN really bad because no index
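To make the cost of the list scan concrete, here is a minimal sketch. It is a toy model, not SQL Server's log manager: the VLF list is an ordered Python list of hypothetical (first_lsn, last_lsn) ranges, and the function names are made up for illustration. It counts how many VLFs get visited when every lookup starts at the head of the list versus resuming from “where you are”.

```python
from typing import List, Sequence, Tuple

Vlf = Tuple[int, int]  # hypothetical (first_lsn, last_lsn) range covered by one VLF


def build_vlfs(count: int, lsns_per_vlf: int) -> List[Vlf]:
    """Build a contiguous, ordered VLF list for the illustration."""
    return [(i * lsns_per_vlf, (i + 1) * lsns_per_vlf - 1) for i in range(count)]


def find_vlf(vlfs: Sequence[Vlf], lsn: int, start: int = 0) -> Tuple[int, int]:
    """Walk the list from index `start` until the VLF holding `lsn` is found.

    Returns (index, number_of_vlfs_visited). There is no index on LSN in this
    model, so walking the list is the only option.
    """
    visited = 0
    for i in range(start, len(vlfs)):
        visited += 1
        first, last = vlfs[i]
        if first <= lsn <= last:
            return i, visited
    raise ValueError("LSN not found")


def total_visits(vlfs: Sequence[Vlf], lsns: Sequence[int], resume: bool) -> int:
    """Total VLFs visited while looking up ascending LSNs one after another."""
    position, total = 0, 0
    for lsn in lsns:
        position, visited = find_vlf(vlfs, lsn, position if resume else 0)
        total += visited
    return total
```

With 1,000 VLFs and three ascending lookups, restarting from the head visits more VLFs than resuming, and even the resumed scan still walks most of the list. That is the “you still may need to go through a large list” point.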
Back to our Story
DB Mirror Principal lagging is typically caused by
• Network latency
• Secondary log harden slow (high safety)
But….the stack dump pointed to something interesting
• What workers are not waiting?
• One is related to finding something in the log and Database Mirroring
We suspect a VLF problem
• DBCC LOGINFO shows 970,000 Virtual Log Files
• This is never good. I’ve found anything over 10,000 can be impactful (especially without fixes).
Large #VLFs cause DBM background thread to spend CPU cycles searching
for log blocks but not yielding on scheduler (60+secs)
Demo
The Solution
Break the Mirror
Reduce VLFs to reasonable number
• Truncate the log
• BACKUP LOG to keep log backup chain. Takes longer and may be needed more than once
• Change to SIMPLE recovery is quicker but breaks log backup chain
• Then….DBCC SHRINKFILE using TRUNCATEONLY
• What if I’m in the middle of recovery?
• Let it finish
• Kill it and rebuild the log (WARNING: POSSIBLE INCONSISTENCY)
Resize the log to desired size with larger autogrow increment
Reinitialize the mirror
Callout: BACKUP_IO wait_type
Prevention
• Apply latest fixes (included in SQL Server 2012)
• Autogrow is usually the cause
• Create log to needed size
• Choose larger but not “too large” autogrow increment
• Don’t rely on autogrow (but keep for emergency)
• System Center Advisor detects this problem
System Center Advisor – systemcenteradvisor.com
• FREE Cloud Service for Configuration Monitoring
• Based on CSS Knowledge
• Integrated into SCOM 2012 SP1
• Windows Server, SQL, SharePoint, Exchange and Lync
It looks like tempdb to me
Bob Ward, Texas
• Customer benchmark cannot scale on SQL Server 2008R2 RTM
• Larger number of users run into massive blocking
• Looks like a tempdb latch allocation contention problem
Tempdb latch contention
• Users waiting on PAGELATCH_UP for 2:<n>:<alloc page>
Customer has implemented normal solutions for tempdb latch contention
• Expanded number of tempdb files
• Using –T1118
• Allocation contention not system tables
Using prepared ad-hoc queries referencing temporary tables
Bob asked to investigate the case
Yielding problem
• DMV does show latch contention on tempdb
• Who is the owner of the latch and what are they doing?
• Owner is RUNNABLE for extended period of time
• What is RUNNABLE? “Ready to run” but someone else running on scheduler
• OK, so who is RUNNING?
• Every scheduler has someone RUNNING a SELECT
• So a query is not yielding holding up the owner of a tempdb latch
• Tempdb is not the problem
So if it is not tempdb, what is it?
We know a SELECT is doing this
What is the query?
In some cases we can use sql_handle and find a prepared ad-hoc query accessing tempdb
But….in other cases, the sql_handle and/or plan_handle are NULL. How can this be?
The only time this occurs is either we are compiling or finding our query in cache
These are prepared and compiles/sec very low
So must be cache lookup
Let’s look at the procedure cache
Now we find the problem
dm_exec_cached_plans shows high number of entries for same bucket_id
• This means long hash chains to find a cached plan
This part of the code doesn’t yield well
Each ad-hoc plan for each user unique but hashing as though they were “same query”
Hashing problem easily reproducible but no performance impact
Performance impact only seen when hash chain in tens of thousands
Which explains only an issue when benchmark tries to scale up
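The effect of a poor hash on chain length can be sketched in a few lines. Everything here is an assumption for illustration: the bucket count, both hash functions, and the query text are invented, and neither function is the one the plan cache uses. The weak hash looks only at the leading characters, so queries that differ only near the end all land in one bucket.

```python
from collections import Counter
from typing import Callable, Iterable

BUCKETS = 1024  # hypothetical bucket count for the illustration


def weak_hash(query_text: str) -> int:
    """Hypothetical weak hash: only the first 16 characters contribute."""
    return sum(ord(c) for c in query_text[:16]) % BUCKETS


def better_hash(query_text: str) -> int:
    """Hypothetical better hash: every character contributes (polynomial hash)."""
    h = 0
    for c in query_text:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h % BUCKETS


def longest_chain(queries: Iterable[str], hash_fn: Callable[[str], int]) -> int:
    """Length of the longest bucket chain produced by hash_fn."""
    counts = Counter(hash_fn(q) for q in queries)
    return max(counts.values())
```

Feeding both functions 10,000 queries of the form `SELECT * FROM #t WHERE user_id = <n>` puts every entry into a single chain under `weak_hash`, while `better_hash` spreads them across buckets. A lookup under the weak hash has to walk a chain that grows with the number of users, which fits the problem only appearing when the benchmark scales up.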
The problem seen visually
Diagram, a bad hash chain: Buckets 1, 2, and 3 each chain Entry1, Entry2, …, Entry 10000, …, Entry 100000, and the plan to find sits far down a chain.
Diagram, a better hash chain: Buckets 1 through 100000 each hold only a few entries (Entry1, Entry2), so the plan is found within a short chain.
Demo
The Solution
• Stored procedures won’t hit this
• Customer would have to rewrite a ton
• We seek a fix with SQL Development
  • Build a repro
  • File a bug and request a fix
  • Hotfix coded and delivered
• The fix will ensure smoother hash to avoid long chains
• Benchmark exceeds more than ever before
• Everyone happy
Cannot Generate SSPI Context
Adam Saxton, Texas
• Users cannot connect to SQL Server with the error below
• Started happening after the weekend
• Doesn’t matter who tries to connect or which application they use
The Misplaced
SPN
What is an “SSPI Context?”
Security Support Provider Interface (SSPI) is the API used by SQL Server to authenticate users for “Windows Authentication”
SSPI Providers provide the authentication implementation such as NTLM, Kerberos, and “Negotiate”
For Negotiate, Kerberos is used by default for remote connections and NTLM used as “fallback” (or local)
Kerberos requires the correct configuration of a Service Principal Name (SPN)
Callout: sys.dm_exec_connections
Callout: “double-hop” scenarios require delegation
Service Principal Name (SPN)
Uniquely identifies an instance of a service
• Required for Kerberos
• Used to request a service ticket
• Can have a port number for TCP
• Bound to an Active Directory User or Computer Account
• Should be only associated with one account which should be Service “Log on As” account
Callout: “Built-in” account = computer account
SQL Service SPN
Format: class/host:port
TCP
• Default Instance: MSSQLSvc/passsql.pass.local:1433
• Named Instance: MSSQLSvc/passsql.pass.local:56772
Named Pipes
• Default Instance: MSSQLSvc/passsql.pass.local
• Named Instance: MSSQLSvc/passsql.pass.local:myinstance
SQL Server startup = register SPN
SQL Server shutdown = delete SPN
Svc account must have AD perms
Blog: What SPN do I use and how does it get there?
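The SPN shapes on this slide can be reproduced with a small helper. This is only a string-formatting sketch; `build_sql_spn` is a made-up name, and it does not register anything in Active Directory.

```python
from typing import Optional


def build_sql_spn(host_fqdn: str,
                  port: Optional[int] = None,
                  instance: Optional[str] = None) -> str:
    """Format an MSSQLSvc SPN as class/host, class/host:port, or class/host:instance.

    Pass `port` for a TCP SPN, `instance` for a named-instance Named Pipes SPN,
    or neither for a default-instance Named Pipes SPN.
    """
    if port is not None and instance is not None:
        raise ValueError("give a port or an instance name, not both")
    spn = f"MSSQLSvc/{host_fqdn}"
    if port is not None:
        return f"{spn}:{port}"
    if instance is not None:
        return f"{spn}:{instance}"
    return spn
```

For example, `build_sql_spn("passsql.pass.local", port=1433)` gives the TCP default-instance SPN shown on the slide.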
It should always work but…..
Missing SPN
• SPN not bound to any AD account
Duplicate SPN
• SPN bound to more than one AD account
Misplaced SPN
• SPN bound to “wrong” AD account
• Not to the current “Log on As” account
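The three failure shapes reduce to a simple decision on which accounts hold the SPN. The sketch below assumes you already have the set of AD accounts an SPN is bound to (how you obtain it is out of scope here); the function name, return strings, and account names in the examples are hypothetical.

```python
from typing import Set


def classify_spn(bound_accounts: Set[str], logon_as_account: str) -> str:
    """Classify an SPN by the accounts it is bound to.

    bound_accounts: AD accounts that currently hold the SPN.
    logon_as_account: the SQL Server service "Log on As" account.
    """
    if not bound_accounts:
        return "missing"      # not bound to any AD account
    if len(bound_accounts) > 1:
        return "duplicate"    # bound to more than one AD account
    if logon_as_account not in bound_accounts:
        return "misplaced"    # bound to the "wrong" AD account
    return "ok"
```

An SPN held only by the computer account while the service runs under a domain account comes back as "misplaced".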
How can this happen?
Missing SPN
• SQL Service Account does not have permissions to add SPN to AD
Duplicate SPN
• SQL shutdown unexpectedly so SPN not removed
• SQL Service account changed and has permissions to add new SPN
Misplaced SPN
• SQL shutdown unexpectedly so SPN not removed
• SQL Service Account changed but does not have permissions to add new SPN
Callout: “Fallback” to NTLM
Back to our Case….
There is only one cause for failure
• Misplaced SPN
What is misplaced?
• SPN is associated with computer account but current “Log on As” is a domain account
• There is no SPN for the domain account
What do we do?
• Remove the SPN from computer account
• Register SPN for the domain account
Connection now works
• I’ve sometimes seen app restart required and ‘klist purge’
Demo
Kerberos Event Logging
System Center Advisor
http://www.systemcenteradvisor.com
You may experience connectivity issues to SQL Server if SPNs are misconfigured
http://support.microsoft.com/kb/2443457
Connection Timeout
Adam Saxton, Texas
• Login Timeout Expired error message at client.
• Can occur intermittently and/or for periods of time
• May or may not be specific to Windows Authentication
• Major Cause is Network issues
Cause: AD Communication Issues
• Slow response from Active Directory
• Could be network, could be AD issues, could be a DC problem
• SQL Authentication logins work
• dm_os_wait_stats / PREEMPTIVE_OS_LOOKUPACCOUNTSID
TDS Login: Receive TDS packet → Engine creates SQLOS Task → Find available worker on scheduler → Validate Windows Logon → Assign SPID
Callout: Login Timeout
Cause: Worker Pool Starvation
• No available workers to service connection task
• dm_os_wait_stats / THREADPOOL
• Dedicated Admin Connection to see this live
• Look for other waits and often a long blocking chain
Callout: DO NOT assume you need more worker threads
Callout: Applies to any task
TDS Login: Receive TDS packet → Engine creates SQLOS Task → Find available worker on scheduler → If none, set THREADPOOL wait type → Next available worker runs task
Callout: Login Timeout
The Solution
This problem means you connected but delay in processing login
• Perhaps you need a longer login timeout?
Check if a SQL Server problem
• Can you connect locally?
• Look for potential blocking causing THREADPOOL
• Dedicated Admin Connection may be needed
• Don’t forget about LOGON TRIGGER
Look for domain issues
• PREEMPTIVE_OS_LOOKUPACCOUNTSID signature
Could be network infrastructure
• Probably will need network tracing to determine
Demo
Connection Timeout
Error 18056
Tejas Shah, India
• Periodically not able to connect
• ERRORLOG has multiple of these below
• Lasts for 5-10 minutes and “resolves itself” without having to restart SQL
spid52 Error: 18056, Severity: 20, State: 29.
spid52 The client was unable to reuse a session with SPID 52, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
What is a “redo” login?
Diagram, No Connection Pooling: CONNECT and DISCONNECT go straight to SQL Server (SPID <n>).
Diagram, Connection Pooling: CONNECT and DISCONNECT go to the Connection Pool. A pooled CONNECT sends nothing to the server. The next BATCH/RPC query carries a reset bit in the TDS header; SQL Server (SPID <n>) resets the connection (the “redo”) and then runs the batch/rpc.
18056 vs 18456
Redo login:
2013-04-26 16:32:39.63 spid56 Error: 18056, Severity: 20, State: 46.
2013-04-26 16:32:39.63 spid56 The client was unable to reuse a session with SPID 56, which had been reset for connection pooling. The failure ID is 46. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
Login (Msg 18456):
2013-04-26 16:32:39.66 Logon Error: 18456, Severity: 14, State: 38.
2013-04-26 16:32:39.66 Logon Login failed for user <account>. Reason: Failed to open the explicitly specified database 'mydb'. [CLIENT: <local machine>]
Callout: database in connection string cannot be accessed
18056 States
This blog post has the states for both 18056 and 18456
40 - LoginSessDb_UseDbImplicit
• The default database for the login cannot be accessed
46 - RedoLoginSessDb_UseDbExplicit
• Database in connection string cannot be accessed
29 – RedoLoginException
• Some “exception” encountered during “redo login”
• dm_os_ring_buffers (RING_BUFFER_EXCEPTION) can provide more details
Callout: LOGON TRIGGER State = 1 (Msg 17892 for login)
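When an ERRORLOG holds many of these, a small parser helps sort them by state. The sketch below covers only the three states listed on this slide; the function name and regular expression are my own, and any other state is reported as "unknown" rather than guessed.

```python
import re
from typing import Optional, Tuple

# Only the states named on the slide; all other states are deliberately left out.
STATE_NAMES = {
    29: "RedoLoginException",
    40: "LoginSessDb_UseDbImplicit",
    46: "RedoLoginSessDb_UseDbExplicit",
}

ERROR_RE = re.compile(r"Error:\s*(\d+),\s*Severity:\s*(\d+),\s*State:\s*(\d+)\.")


def describe_18056(errorlog_line: str) -> Optional[Tuple[int, str]]:
    """Return (state, state_name) for an 18056 line, or None for any other line."""
    match = ERROR_RE.search(errorlog_line)
    if match is None:
        return None
    error, _severity, state = (int(g) for g in match.groups())
    if error != 18056:
        return None
    return state, STATE_NAMES.get(state, "unknown")
```

Running it over the two log lines from the “18056 vs 18456” slide picks out the redo-login failure (state 46) and ignores the 18456 login failure.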
What about state 29?
Any “exception” that occurs during “redo” that is not one of the other states
Typically there is some performance problem on the server
• Query cancelled or times out (Attention received from client)
• High CPU
• Low Memory
• IO
• Lock timeout
• Blocking/Scheduling
dm_os_ring_buffers of type RING_BUFFER_EXCEPTION has details
Demo
Error 18056
The problem seen visually
Diagram, Connection Pooling: CONNECT and DISCONNECT go to the Connection Pool. A pooled CONNECT sends nothing to the server. The next BATCH/RPC query carries a reset bit in the TDS header; SQL Server (SPID <n>) resets the connection (the “redo”) and then runs the batch/rpc. An Attention (query timeout or explicit cancel) arrives during the “redo”.
If Server receives ATTN during “redo”, 18056 state 29 in ERRORLOG
The Solution
We removed 18056 state 29 for ATTN scenario from ERRORLOG
• CU fix for 2008/2008R2
• SQL Server 2012
If still happening it must be some other “error”
• Look in ERRORLOG
• sys.dm_os_ring_buffers
Could condition for ATTN still be happening?
• Performance issues on server
• Application is cancelling query (or timeout)
• sys.dm_os_ring_buffers
Memory Scribbler
Suresh Kandoth, Texas
• Different Customers calling over a long period of time with different errors
• Page Header Corruption with same signature for all cases
• All systems running same hardware [brand and model]
What we will cover
• Walk through the timeline
• Understanding some errors (832, 5242/5243, Bugcheck, etc)
• How do we recover?
• Understanding the resolution
November & December 2011
• Error 832 reported by constant page sniffer
• First 4 bytes of page header damaged with random hex
Callout: Perhaps not so random?
2011-12-19 09:46:09.60 spid5s Error: 832, Severity: 24, State: 1.
2011-12-19 09:46:09.60 spid5s A page that should have been constant has changed (expected checksum: 2247c294, actual checksum: ba52fd57, database 5, file 'Data.mdf', page (1:248037)). This usually indicates a memory failure or other hardware or OS corruption.
00000003`437a0000  2b 31 87 7f 00 82 01 00-51 8e 12 00 01 00 1c 00
00000003`437a0010  53 8e 12 00 01 00 17 00-05 a0 d2 0a f6 00 dc 1e
00000003`437a0020  52 8e 12 00 01 00 00 00-fe 71 02 00 6e 33 00 00
00000003`437a0030  4d 00 00 00 00 00 00 00-00 00 00 00 78 65 23 48
Checksum
Page 1:248037, Page Header
• m_flagBits=0x200 (0x200 = HAS_CHECKSUM)
• m_tornBits=0x2247c294 (Expected Checksum)
• Actual Checksum: 0xba52fd57
The Page
Diagram: the page bytes in memory, with the first 4 bytes overwritten by 2b 31 87 7f.
Who could have done it: thread, another process, driver, hw
Constant Page Sniffer
Lazy Writer Background Task
• Every Second – Housekeeping
• Sweeps over 16 Buffers (Pages)
• Finds “Clean” Buffer with Checksum enabled
• Validates Checksum
• 832 thrown if failure is detected
• XEvent - constant_page_corruption_detected
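The sweep logic can be sketched as follows. This is an illustration of the idea only: `zlib.crc32` stands in for the real page checksum, which is a different algorithm, and the `Buffer` class and function names are hypothetical. A clean page's bytes should never change, so a checksum mismatch on a clean buffer means something else wrote to that memory.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class Buffer:
    page_id: str
    data: bytearray
    dirty: bool
    has_checksum: bool
    expected_checksum: int


def make_clean_buffer(page_id: str, payload: bytes) -> Buffer:
    """Create a clean buffer whose checksum matches its current contents."""
    return Buffer(page_id, bytearray(payload), False, True, zlib.crc32(payload))


def sweep(buffers: List[Buffer], start: int, count: int = 16) -> List[str]:
    """Validate up to `count` buffers from `start`; return page ids that changed.

    Dirty buffers and buffers without a checksum are skipped, because only a
    clean, checksummed page is expected to stay constant.
    """
    damaged = []
    for buf in buffers[start:start + count]:
        if buf.dirty or not buf.has_checksum:
            continue
        if zlib.crc32(bytes(buf.data)) != buf.expected_checksum:
            damaged.append(buf.page_id)  # the real engine raises error 832 here
    return damaged
```

Overwriting the first 4 bytes of one clean buffer and sweeping again reports that page, which mirrors how the scribble in this story was first noticed.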
Constant Page Sniffer
Trace Flag 831
Diagram: a user issues an Update statement alongside the Lazy Writer Background Task; either path can raise Error 832.
• More Aggressive
• Validates Checksum before page modified (dirty)
• Validates Checksum when page is discarded (free buffer)
• 832 thrown if failure is detected
Demo
Error 832
January 2012
• Two more cases with different symptoms
2011-12-26 19:25:32.40 spid52 ***Stack Dump being sent to E:\SQLDump0077.txt
2011-12-26 19:25:32.40 spid52 * *******************************************************************************
2011-12-26 19:25:32.40 spid52 *
2011-12-26 19:25:32.40 spid52 * BEGIN STACK DUMP:
2011-12-26 19:25:32.40 spid52 * 12/26/11 19:25:32 spid 9376
2011-12-26 19:25:32.40 spid52 *
2011-12-26 19:25:32.40 spid52 * Location: e:\sql10_main_t\sql\ntdbms\storeng\dfs\access\pagecompression.inl:1659
2011-12-26 19:25:32.40 spid52 * Expression: uncomprSize <= uncomprDataSize
2011-12-26 19:25:32.40 spid52 * SPID: 52
2011-12-26 19:25:32.40 spid52 * Process ID: 7276
Error: 5243, Severity: 22, State: 8.
An inconsistency was detected during an internal operation. Please contact technical support.
Error 5242/5243
Diagram: Page 1:248037 with Page Header (m_flagBits=0x200, m_tornBits=0x2247c294), the bytes 2b31877f, and a Data Row that raises Error 5242/5243.
• Row structure is malformed
• DBCC CHECKDB is your next step
• If we are unable to determine Database or page affected, raise 5243
Demo
Error 5242/5243
February 2012
• More reports of similar problems on same model of machine
• Also received reports external to SQL - [blue screen/ bug checks/ other applications like svchost crashing]
*******************************************************************************
*                              Bugcheck Analysis                              *
*******************************************************************************
MEMORY_MANAGEMENT (1a)
# Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 0000000000041201, The subtype of the bugcheck.
Arg2: fffff68000002000
Arg3: 000000007f87312b
Arg4: fffffa80656a60b0
BUGCHECK_STR: 0x1a_41201
PROCESS_NAME: conhost.exe
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from fffff800018df2de to fffff80001882c40
rax=fffffa80656a60b0 rbx=000000007f87312b rcx=000000000000001a
Callout: 2b31877f
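One way to read the 2b31877f callout next to the bugcheck arguments: the 4 bytes 2b 31 87 7f, read from memory as a little-endian 32-bit value, are 0x7f87312b, which is the value in Arg3 and rbx. The check below is plain byte-order arithmetic; the function name is made up, and the link to the callout is my reading of the slide rather than something it states.

```python
def dword_from_memory_bytes(hex_bytes: str) -> int:
    """Interpret bytes in memory order as a little-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")
```

`format(dword_from_memory_bytes("2b 31 87 7f"), "016x")` produces the same digits as Arg3 in the dump, so the same signature shows up outside SQL Server as well.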
What we knew?
All systems were running on the same hardware
Corrupted section was overwritten with the same set of 4 bytes
• [2b 31 87 7f]
First 4 bytes of an 8KB SQL database page boundary
First 4 bytes of a 4KB OS page boundary
March 2012
• Microsoft & hardware manufacturer joint investigation
• Identified damage always happens on same physical page in memory
• Safe workaround provided to customers
• bcdedit /set badmemorylist 0x4013
Callout: List of Page Frame Numbers
Physical Memory Page
Diagram: Process Virtual Memory pages (Page 0, Page 1, Page 2, Page 3, …, Page n) map to Physical Memory page frames 0x0000 through 0xFFFF. Frame 0x4013 is highlighted, with the bytes 6e 00 13 4f and 2b 31 87 7f shown at that frame.
April 2012 – The Solution
• BIOS update released to address the issue.
• Three models of servers identified
BIOS Release Date: 4/9/2012
Version: 2.7.0
Check DELL site. There may be newer BIOS updates
Resources
• SQL Server I/O Basics White Paper – Chapter 2
• http://technet.microsoft.com/en-us/library/cc917726.aspx
• How to troubleshoot Msg 832 (constant page has changed) in SQL Server
• http://support.microsoft.com/kb/2015759/EN-US
The classic memory leak story
Wafa Jaffal, Texas
• Overall Server Performance Degrades over time on SQL Server to the point of failure
• Server runs fine for a few days but then ERRORLOG starts to show Msg 701
• Restart of SQL Server works but problem comes back in a few days
2012-05-20 13:13:24.80 spid146 Failed Virtual Allocate Bytes: FAIL_VIRTUAL_RESERVE 65536
2012-05-20 13:34:39.17 spid111 Error: 701, Severity: 17, State: 89.
2012-05-20 13:34:39.17 spid111 There is insufficient system memory in resource pool 'default'…
2012-05-20 13:34:39.17 spid58 Error: 701, Severity: 17, State: 123.
2012-05-20 13:34:39.17 spid58 There is insufficient system memory in resource pool 'default'…
• This sounds like a memory leak?
• Understanding how to troubleshoot 701 can be difficult
• Is this a SQL Server bug or something “I did”?
SQL Memory
Leak
Where do you even start?
• Look at memory statistics in the ERRORLOG
• What do the MEMORYCLERKS say?
• What about perfmon?
• Memory problems outside of SQL Server?
• What about those magic DMVs I’m always hearing about
We dig deeper into SQL Memory
MEMORYCLERK_SQLGENERAL
• “General” doesn’t sound too helpful
But it does narrow out some things
• Not data pages, proc cache, compile, …
• SQLGENERAL is a “general purpose” category of “miscellaneous” type memory
• Memory objects use this clerk
Our choices
• Could be a bug
• Could be an application “doing something”
• How can you “do something” to cause this?
• Either way we could capture a trace of queries to look for patterns
Who is the largest memory object user?
• Need DMVs to find that out such as sys.dm_os_memory_objects
• sys.dm_os_memory_allocations can provide detailed audit
Demo
Can SQL Server 2012 help?
• A new memory manager
• New Memory Extended Events
• Use the Event Pairing target
Did we find the solution?
Why did it fail and fail to failover?
Curt Mathews, North Carolina
• Application cannot connect to SQL Server on primary and secondary. Using AlwaysOn on SQL Server 2012
• At 11:55am SQL Server on Primary has a major “problem”
• Secondary replica configured but failover didn’t work. What caused failure and why didn’t failover work?
2012-04-19 11:53:57.44 spid43s AlwaysOn Availability Groups connection with secondary database terminated for primary database <db> on the availability replica with Replica ID: {59f9f2aa-a90e-4835-8cb2-701d9f7da0ed}. This is an informational message only. No user action is required.
Callout: Msg 35267 really?
.
.
.
2012-04-19 11:54:05.57 spid9s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [L:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Data\<db>_log.LDF] in database [<db>] (11). The OS file handle is 0x00000000000019BC. The offset of the latest long I/O is: 0x00000038f67200
AlwaysOn Failover
First let’s talk about what failed and why?
• ERRORLOG shows connection problems to secondary
• ERRORLOG shows I/O delays close to app problem time
• But what caused the failure?
• System Event Log shows an unexpected machine restart
• Because we use Windows Clustering this should trigger an automatic failover
So we know what the failure “was”
We don’t know what caused it but it appears “system” related
We still don’t know why automatic failover didn’t work
Let’s talk SQL Server 2012 failover detection
Diagram, prior to SQL Server 2012: Windows Service Manager, Cluster Failover Resource DLL, SQL Server
Diagram, SQL Server 2012 (Failover Policy Levels): Windows Service Manager, Cluster Failover Resource DLL, SQL Server
The “Engine” knows if it’s healthy
Flexible Failover Policy Levels
Callout: Granularity
Level | Condition | Description
1 | Failover or restart on server down | SQL Service not running or lease expired
2 | Failover or restart on server unresponsive | #1 + sp_server_diagnostics not returning within HealthTimeOut
3 (default) | Failover or restart on critical server errors | #2 + “system” error
4 | Failover or restart on moderate server errors | #3 + “resource” error
5 | Failover or restart on any qualified failure conditions | #4 + “query processing” error
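The cumulative nature of the levels (“#1 +”, “#2 +”, …) can be written as a small decision function. This is a sketch of the table above only, not the cluster resource DLL's logic; the function name, parameter names, and component keys are assumptions.

```python
from typing import Iterable

# Lowest policy level at which an error in each component triggers failover,
# taken from the table above. Components not listed never trigger it here.
COMPONENT_LEVEL = {"system": 3, "resource": 4, "query_processing": 5}


def should_failover(policy_level: int,
                    service_running: bool = True,
                    lease_valid: bool = True,
                    diagnostics_timed_out: bool = False,
                    component_errors: Iterable[str] = ()) -> bool:
    """Decide whether the given conditions qualify for failover or restart."""
    if not 1 <= policy_level <= 5:
        raise ValueError("policy_level must be between 1 and 5")
    # Level 1 and above: server down.
    if not service_running or not lease_valid:
        return True
    # Level 2 and above: server unresponsive.
    if policy_level >= 2 and diagnostics_timed_out:
        return True
    # Levels 3 to 5: component errors, each level adding one component.
    return any(
        name in COMPONENT_LEVEL and COMPONENT_LEVEL[name] <= policy_level
        for name in component_errors
    )
```

At the default level 3, a “resource” error alone does not qualify, while a “system” error does. Raising the level to 4 makes the same “resource” error a failover condition.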
sp_server_diagnostics
Diagram: sp_server_diagnostics (T-SQL) runs against SQLSERVR.EXE on a repeat interval.
• repeat = 30% of health timeout
• repeat = 5min
• sp_server_diagnostics_component_result Event
sp_server_diagnostics result set
Component | Description
system | Spinlocks, latches, AVs, non-yield, stack dumps
resource | Memory resources
query_processing | worker threads, tasks, wait types, CPU intensive sessions, and blocking tasks
io_subsystem | Long Pending I/O, I/O stalls
events | Ring buffer information
Each component has error and warning conditions but also provides detailed information
Error conditions are for failover detection
Warnings provide information before headed to error state
io_subsystem and events are informational only
That’s all nice, but why didn’t failover work?
• SYNCHRONIZED = HEALTHY: FAILOVER Allowed
• SYNCHRONIZING = PARTIALLY_HEALTHY
• NOT SYNCHRONIZING = NOT_HEALTHY: FAILOVER Not Allowed
Diagram: Primary and Secondary under Windows Failover Clustering Services. Secondary says “Get me synced”, “We PULL Log blocks”, then “We are now in sync”.
No data loss but downtime
How do we keep it all in sync?
• SYNCHRONIZED = HEALTHY
• SYNCHRONIZING = PARTIALLY_HEALTHY
• NOT SYNCHRONIZING = NOT_HEALTHY
Diagram: Transactions on the Primary, with the Secondary pulling log blocks (“We PULL Log blocks”).
wait_type seen by transactions
• WRITELOG and HADR_SYNC_COMMIT
• WRITELOG
• HADR_SYNCHRONIZING_THROTTLE
Demo
What if Primary didn’t come back online?
• You can manually failover to a secondary
• You might not lose data?
• Look at secondary DMVs
• If primary now comes back online, it will become a secondary
Our conclusions for the customer
• Primary lost ability to connect to secondary
• Primary experienced unexpected restart of computer
• Secondary NOT_SYNCHRONIZING so automatic failover cannot proceed (possible data loss)
• Primary came back online
• Secondary synced back up
• Normal AlwaysOn synchronization resumed
• Our logging and tracing helped tell the story the “first time”
Our recommendations to the customer
• Investigate possible system issues with primary
• Alerts when NOT_SYNCHRONIZING occurs. Primary is still up but auto failover in jeopardy
• Look at changing failover policy level
• Create scripts to monitor connectivity between machines
Patient enters the emergency room
Robert Dorr, Texas
• Recovery fails for customer db. Database SUSPECT and not accessible
• Customer attempts to restore backup do not work.
• Database is a SharePoint content database so potential loss is company documents
• Customer using DPM to backup SQL Databases
• This means “backups” are MDF and LDF files
• They find one good backup from 3 months back using T-SQL BACKUP
• We can’t find the actual current database that went SUSPECT. Overwritten by RESTORE attempts
• We can’t even access using EMERGENCY. File header page damaged
• But is there any good data in the database?
• Even if there, how do we get it?
• Should we even try?
Data Recovery
The Problem Visually
Timeline: Good Backup, 3 months ago (Old) → No Backups (HOLE) → 1 month ago, All Backups Damaged → Current Date
Desired Outcome
• 1st: Recover Current Date
• 2nd: Merge Current with Old
What if your database is damaged?
• RESTORE from backup
• CHECKDB repair
• Try to access using EMERGENCY mode
• Start working on your resume
Patient damage may be from previous injury
• Engineer scans current MDF file and finds a lot of pages have 0s
• Zip file comparison
  • “Old” good backup zips from 145GB to 80GB
  • Current MDF file zips from 145GB to 6GB
• Sounds to us like a known DPM problem
• Don’t know the original problem that caused SUSPECT DB
So… a call is put into the surgeons
How do you “scan” an MDF file?
• Read 8KB from File
• Map 96 bytes of this into a C++ class based on DBCC PAGE
• Analyze the class for its values
Callout: Not exactly the same
Diagram: MDF File with a bad page header that is all zeros (0000000000000000)
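The read loop itself is simple. The sketch below is in Python rather than the C++ the engineers used, and it does not decode any header fields: it only reads 8KB pages and reports whether the 96-byte header is entirely zero, which is the damage signature described here. The function name is hypothetical.

```python
import io
from typing import BinaryIO, Iterator, Tuple

PAGE_SIZE = 8192    # one database page
HEADER_SIZE = 96    # page header at the start of each page


def scan_pages(stream: BinaryIO) -> Iterator[Tuple[int, bool]]:
    """Yield (page_number, header_is_all_zero) for each full page in the stream.

    A trailing partial page is ignored.
    """
    page_number = 0
    while True:
        page = stream.read(PAGE_SIZE)
        if len(page) < PAGE_SIZE:
            break
        header = page[:HEADER_SIZE]
        yield page_number, not any(header)
        page_number += 1
```

Against a real file this would be called as `scan_pages(open(path, "rb"))`. Counting the pages flagged `True` gives a rough measure of how much of the file was zeroed, in the same spirit as the zip comparison.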
The Data Recovery Surgery Problem
Are there any good pages?
How do we get good pages out of the database?
How do we put them into a good database?
How do we access them once we put them there?
Wait…the major injury that has patient at risk are SharePoint docs
Which means LOB pages
This complicates the surgery plan
We want to use the power of SQL Engine as much as possible
How does SQL Server scan data?
Diagram, Heap: sysallocunits (allocunitid, firstpg, firstiam) → IAM Page (IAM bitmap) → Data Pages, each carrying the allocunitid. IAMs scanned “on disk order”.
Diagram, Clustered Index: sysallocunits (allocunitid, firstpg, firstiam) → Data Page → Data Page → … → Data Page. This is a linked list. allocunitid still matters.
How do we retrieve LOB data?
If “text in row” LOB is in the data row
Diagram: Data Page → Data Row → TEXTPTR (PageID, SlotID) → LOB Root Page (Root Links for Row 1, Root Links for Row 2) → LOB Pages holding LOB Data
The Surgeons come up with a plan
“old” good backup restored as “tempdoc” db
• 700 data pages from “doc” table
• Find 300 other pages from any other table
“bad” MDF file
• 1000 data pages from “doc” table
• 10000 LOB pages from “doc” table
These are in valid cl index with same allocunitid
Diagram: a firstpg page chain built from Data Pages from Doc Table and Data Pages from Other Table (page IDs shown: 1:1000, 1:1001, 1:1002, 1:1003, 1:1004, 1:100, 1:101, 1:200, 1:500, 1:800, …), plus LOB pages from “doc” table in bad MDF (1:2000, 1:2001, 1:5000, …)
Now we have a valid table to work with
What is the status of the patient now?
• We believe some data rows point to valid LOB chains: these are valid documents to recover
• We believe some data rows do not point to valid LOB chains: these are “lost” documents
• We believe some valid LOB chains exist but don’t have a data row: these are valid documents to recover but are “orphans”
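The three groups are just set relationships between data rows and LOB chains. The sketch below assumes a simplified model where each data row id maps to the LOB root it points at and the valid chains are known by root id; the names and identifiers are hypothetical, and finding the valid chains in the first place is the hard part the surgeons had to solve.

```python
from typing import Dict, Set, Tuple


def classify_documents(row_to_lob_root: Dict[str, str],
                       valid_lob_roots: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split rows and LOB chains into (recoverable_rows, lost_rows, orphan_chains).

    recoverable_rows: data rows that point to a valid LOB chain
    lost_rows: data rows whose LOB chain is not valid
    orphan_chains: valid LOB chains that no data row points to
    """
    recoverable = {row for row, root in row_to_lob_root.items()
                   if root in valid_lob_roots}
    lost = set(row_to_lob_root) - recoverable
    orphans = valid_lob_roots - set(row_to_lob_root.values())
    return recoverable, lost, orphans
```

Orphans are still recoverable content, but with no data row there is nothing to say which document they belong to, so they need separate handling.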
Why didn’t we use the tlog?
• Turns out the LDF files are all intact
• Could we somehow reverse engineer changes from each of the LDF files captured over time?
Callout: months?
Problems with transaction log reverse engineering
• We have no tool to do this
• If not using replication, updates are only diffs from pages
The 2nd surgery
• Copy out from “temp doc” db into clean db
• Merge in 3 month old backup
• What about LOB “orphans”?
• Find orphans and export them into files
• We had to write a special program to copy out all new “docs” into files (export program)
BlobDumper utility: Heiko Hatzfeld, PFE
We ran into complications
Text Pages – SQL Engine does not check allocation state like SQL 6.x does
Select with (INDEX=0) forces clustered index chain only scan - FirstPg
-T651 Disable ghost activity to avoid issues, AVs, and hangs
-T652 Disable read ahead to ignore IAM usage, page chain only
OPTION (MAXDOP 1) to avoid IAM usage
Read only DB to avoid update stats and other allocations
What happened to the patient?
~16,000 new/changed documents recovered (1st surgery) in db and files
Customer can import in 3 month old documents along with this
(Total documents recovered ~120,000)
We also recovered ~1000 “orphaned” docs with program (2nd surgery)
Customer builds temporary SharePoint site
Imports documents
Users go to temporary site to copy out important documents for their needs
Why we got a bit lucky though
• One MDF file
• Schema was same from 3 months ago
• We had a clustered index structure to use
• No partitions
• No compressed pages
• No encryption used
• Clustered index not rebuilt (allocunitid same)
Surgery really can be expensive
Took the two of us 400 total hours over 3 weeks
We must keep data privacy in mind at all times
7 computers; 3 labs; 3 countries all of the time
Long hours and weekends
We clearly can’t scale this
Bottom line is that we don’t have a staff of surgeons just waiting around
The knowledge to do this is not prevalent because we rarely need it
bobward@microsoft.com
asaxton@microsoft.com
@bobwardms
@awsaxton
http://blogs.msdn.com/psssql
Download link:
http://sdrv.ms/YeMtET