Windows Crash Dump Analysis
Daniel Pearson
David Solomon Expert Seminars

Daniel Pearson
• Started working with Windows NT 3.51
• Three years at Digital Equipment Corporation
    • Supporting Intel and Alpha systems running Windows NT
• Seven years at Microsoft
    • Senior Escalation Lead in Windows base team
    • Worked in the Mobile Internet sustained engineering team
• Instructor for David Solomon, co-author of the Windows Internals book series

Agenda
• Causes of Windows crashes
• What happens during a crash
• Configuring Windows crash options
• Writing a crash dump
• Automated and manual crash analysis
• Using Driver Verifier to detect errors
• Attaching a kernel debugger
* Portions of this session are based on material developed by Mark Russinovich and David Solomon

Why Analyze a Crash?
• When Windows Error Reporting has no solution or when it blames “a device driver”

Why Does Windows Crash?
• A device driver or part of the operating system incurs an unhandled exception
• A device driver or part of the operating system explicitly crashes the system due to an unrecoverable condition
• A page fault occurs at an interrupt request level of dispatch or higher
• A hardware condition such as a nonmaskable interrupt or faulty memory, disk, etc.

Causes of Windows Crashes
Percentage of Top 500 Crashes for Windows Vista with Service Pack 1 [1]
[Pie chart. Legend: Third-party device drivers, Microsoft code, Crash too corrupt for analysis, Hardware errors. Slice labels: 70%, 13%, 11%, 6%.]
1. Microsoft Corporation. 2008. Online Crash Analysis research performed in September of 2008.

What Happens During a Crash
• When a condition is detected that requires a crash, the kernel API KeBugCheckEx is called
• KeBugCheckEx accepts a bugcheck code that indicates the reason for the crash and four parameters that supply additional information

    VOID KeBugCheckEx(
        IN ULONG BugCheckCode,
        IN ULONG_PTR BugCheckParameter1,
        IN ULONG_PTR BugCheckParameter2,
        IN ULONG_PTR BugCheckParameter3,
        IN ULONG_PTR BugCheckParameter4
    );

Inside of KeBugCheckEx
• KeBugCheckEx performs several functions
    • Disables interrupts
    • Notifies other CPUs to halt execution
    • Notifies registered drivers
    • Writes crash dump information to disk*
    • Restarts the system*
* Only if the system is configured to do so

The Windows Stop Screen
[Screenshot of the stop screen with five numbered callouts, 1 through 5]

Bugcheck Codes
• Shared by many components and drivers
• The Windows Driver Kit currently documents over 250 unique bugcheck codes

Memory Dump Types
• Small memory dump
    • Records the smallest set of useful information
• Kernel memory dump*
    • Records only kernel memory, which speeds up the process of writing a crash dump
• Complete memory dump*
    • Records the entire contents of system memory
* If either a Kernel or Complete memory dump is selected, the system will also create a minidump and store it in the %SystemRoot%\minidump directory

Configuring Debugging Information Options

Writing a Crash Dump
• Crash dump information is written to the paging file on the boot volume or to a dedicated dump file if specified
    • Creating a new file on the system during a crash is too risky
• How does the system know it’s safe?
    • The boot volume paging file’s on-disk mapping is obtained when the system starts
    • Critical crash components are checksummed
    • When a crash occurs, if the checksum doesn’t match, a memory dump is not written

Why Would You Not Get a Dump?
• Problems with page file configuration
    • The paging file on the boot volume is too small or one does not exist
    • The system crashed before the paging file was initialized
• Critical crash components are corrupted
• Windows didn’t crash!
    • The system spontaneously restarted
    • The system is hung

Analyzing a Crash Dump
• The Microsoft kernel debuggers can be used to open and analyze a crash dump
    • kd, a command-line tool, and WinDbg, a GUI tool
    • Available as part of the Debugging Tools for Windows
      http://www.microsoft.com/whdc/devtools/debugging/default.mspx
• Configure the debugger to point to symbols
    srv*C:\SYMBOLS*http://msdl.microsoft.com/download/symbols

Automated Analysis
• When you open a crash dump with WinDbg or kd, the debugger performs basic crash analysis*
    • Displays stop code and parameter information
    • Takes a guess at the offending driver
• The analysis is the result of the automated execution of the !analyze debugger command
• !analyze uses the bugcheck parameters and a set of heuristics to determine what component is the likely cause of the crash
* Set the environment variable DBGENG_NO_BUGCHECK_ANALYSIS=1 to disable

Automated Analysis Using !analyze

Memory Corruption
• Occurs when a driver writes past the end (an overrun) or before the beginning (an underrun) of its memory allocation
• Usually detected when overwritten data is referenced by the kernel or another driver
• There can be a long delay between corruption and detection

Viewing the Effects of Memory Corruption

Crash Transformation
• For crashes that are difficult to analyze
    • The “victim” crashed the system, not the culprit
    • The debugger points to ntoskrnl.exe, win32k.sys or other Windows components
    • You get many different crash dumps all pointing at different causes
• Your goal isn’t to analyze difficult crashes … It’s to try to make an “unanalyzable” crash into one that can be easily analyzed

Driver Verifier
• Useful for identifying code defects in drivers
• Performs more
thorough checks on the system and device drivers as well as simulating failures
• Support is built into the operating system
• The requirements for the Windows logo program state that a driver must not fail while running under Driver Verifier

Using Driver Verifier to Catch a Buffer Overrun

Manual Analysis
• Sometimes !analyze isn’t enough
    • It might not tell you anything useful
    • You want to know in more detail what was happening at the time of the crash
• Several useful commands and techniques
    • Verify the time of the crash, .time
        • A short uptime value can mean frequent problems
    • Check the stack on each CPU; stacks are read from the bottom to the top
        • !cpuinfo will display a list of all the CPUs
        • Use ~<n>s (for example, ~1s) to switch to a different CPU for investigation
        • k to display the stack

Manual Analysis
• Several useful commands and techniques
    • Look at memory usage, !vm
        • Make sure memory pools are not depleted or contain errors
        • Use !poolused to identify large users
    • Check the currently running thread, !thread
        • May or may not be related to the crash
        • Check pending I/O requests using !irp
    • List all processes on the system, !process 0 0
        • Make sure you understand what was running at the time
    • List loaded drivers, lm t n
        • Make sure all the drivers are recognizable and up to date

Manual Analysis of a Crash Dump

Attaching a Kernel Debugger
• Required for debugging initialization failures and crashes where no dump file is created
• Requires that the system be started with the debugger enabled
• Support for null-modem, IEEE 1394, and USB 2.0 cables, as well as virtual machines and over the network in Windows 7
• Limited support for local kernel debugging

Attaching a Kernel Debugger to a Live System

Hung Systems
• Sometimes systems become unresponsive
    • Keyboard and mouse frozen
• Two types of hangs
    • Instant lockup
        • Kernel synchronization deadlock
        • Infinite loop at a high IRQL or a very high priority thread
    • Slowly grinding to a halt
        • Resource depletion
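
The commands from the Manual Analysis slides can be batched into a single non-interactive kd run. The Python sketch below only builds the command line; it does not launch the debugger. It assumes kd's -z (dump file), -y (symbol path), and -c (initial command) options, reuses the symbol path from the Analyzing a Crash Dump slide, and adds !analyze -v (verbose output) in front of the slide commands. The names build_kd_command, SYMBOL_PATH, and TRIAGE_COMMANDS are hypothetical and chosen for illustration.

```python
from typing import List, Sequence

# Symbol path quoted from the "Analyzing a Crash Dump" slide.
SYMBOL_PATH = r"srv*C:\SYMBOLS*http://msdl.microsoft.com/download/symbols"

# Commands named on the "Manual Analysis" slides, preceded by !analyze -v.
# The selection and ordering are an assumption, not a prescribed procedure.
TRIAGE_COMMANDS = (
    "!analyze -v",
    ".time",
    "!cpuinfo",
    "k",
    "!vm",
    "!thread",
    "!process 0 0",
    "lm t n",
)


def build_kd_command(
    dump_path: str,
    commands: Sequence[str] = TRIAGE_COMMANDS,
    symbol_path: str = SYMBOL_PATH,
    kd_exe: str = "kd",
) -> List[str]:
    """Return an argv list that opens a dump in kd, runs the commands, and quits.

    Debugger commands are joined with ';' and followed by 'q' so that kd
    exits once the last command has finished.
    """
    script = "; ".join(list(commands) + ["q"])
    return [kd_exe, "-z", dump_path, "-y", symbol_path, "-c", script]
```

The returned list can be passed to subprocess.run on a machine that has the Debugging Tools for Windows installed, for example build_kd_command(r"C:\dumps\MEMORY.DMP"), where the dump path is a placeholder. It is worth confirming on a real dump that each extension command behaves as expected when chained with ';'.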

Initiating a Manual Crash
• Using the keyboard
    • Requires a PS/2 keyboard + registry key
    • HKLM\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScroll
• Using an NMI button
    • Requires specialized hardware + registry key
    • HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\NMICrashDump
• Using the debugger
    • Break in and execute the .crash command

Debugging a Hung System

Additional Information
• Windows Internals 5th edition
• Debugging Tools for Windows documentation
• Mark Russinovich’s Blog
    • http://blogs.technet.com/markrussinovich
• Advanced Windows Debugging Blog
    • http://blogs.msdn.com/ntdebugging
• Crash Dump Analysis and Debugging Portal
    • http://www.dumpanalysis.org

Additional Information
• David Solomon Expert Seminars offers training on Windows Internals as public and private workshops and as public webinars via the Internet
• Currently scheduled upcoming classes
    • Public workshop in London, April 12th – April 16th
    • Public webinar, April 26th & April 28th
    • Public workshop in New York, May 3rd – May 7th
    • Public workshop in San Francisco, November 8th – November 12th
• Visit http://www.solsem.com for further course descriptions and up-to-date information