Watchdog Timers Jeffrey Schwentner EEL6897, Fall 2007 Software Reliability Embedded systems must be able to cope with both hardware and software anomalies to be truly robust. In many cases, embedded devices operate in total isolation and are not accessible to an operator. Manually resetting a device in this scenario when its software “hangs” is not possible. In extreme cases, this can result in damaged hardware or loss of life and incur significant cost impact. The Clementine In 1994, a deep space probe, the Clementine, was launched to make observations of the moon and a large asteroid (1620 Geographos). After months of operation, a software exception caused a control thruster to fire for 11 minutes, which depleted most of the remaining fuel and caused the probe to rotate at 80 RPM. Control was eventually regained, but it was too late to successfully complete the mission. Watchdog Timers While it is not possible to cope with all hardware and software anomalies, the developer can employ the use of watchdog timers to help mitigate the risks. A watchdog timer is a hardware timing device that triggers a system reset, or similar operation, after a designated amount of time has elapsed. A watchdog timer can be either a stand-alone hardware component or built into the processor itself. To avoid a reset, an application must periodically reset the watchdog timer before this interval elapses. This is also known as “kicking the dog”. External Watchdogs External watchdog timers are integrated circuits that physically assert the reset pin of the processor. The Processor must assert an output pin in some fashion to reset the timing mechanism of the watchdog. This type of watchdog is generally considered the most appropriate because of the complete independence of the watchdog from the processor. Some external watchdogs feature a “windowed” reset. ◦ Enforces timing constraints for a proper watchdog reset. ◦ Minimizes likelihood of errant software resetting the watchdog. External Watchdog Schematic Windowed Watchdog Operation Maxim MAX6323 Internal Watchdogs Many processors and microcontrollers have built-in watchdog circuitry available to the programmer. This typically consists of a memory-mapped counter that triggers a non-maskable interrupt (NMI), or reset, when the counter reaches a predefined value. Instead of issuing a reset via an I/O pin assertion, an internal counter of reset to an initial value. Watchdog configuration is controlled user software. Watchdog may even be used as a general purpose timer in some cases. Internal Watchdog Considerations Internal watchdogs are not as “safe” as watchdog circuits external to the processor. ◦ Watchdogs that issue a NMI instead of a reset may not properly reinitialize the system. ◦ Watchdog control registers may be inadvertently overwritten by runaway code, disabling the watchdog all together. ◦ Reset is limited to the processor itself (no outside peripherals). To circumvent these issues, most built-in watchdogs have extra safety-steps designed to prohibit errant code from interfering with the operation of the watchdog timer. On-chip solutions have a significant cost and space advantage over their external counterparts. MSP430 Watchdog Texas Instruments MSP430 family of microcontrollers has a built-in 16-bit watchdog timer featuring: ◦ Configurable clock source and prescaler ◦ Two interrupt options (Reset or NMI) ◦ Isolated watchdog counter Access to the watchdog counter requires a unique binary code, or password. ◦ The code must be written to the password register prior to resetting watchdog timer. ◦ An invalid password attempt causes a key violation interrupt. MSP430 Watchdog Timer Block Diagram Design Considerations The effectiveness of the watchdog is a function of how it is used within the application software. Simply issuing a watchdog reset in every iteration of the program loop may be insufficient. Take a more proactive approach. ◦ Periodically assess the state and health of the system. Only issue a reset if all processes are deemed normal. ◦ Employ a state-based approach when resetting the watchdog timer. ◦ Should a watchdog failure occur, provide an indication and/or capture debugging information. System Health Assessment As the size and complexity of software increases, so does the likelihood of introducing code that may be detrimental to the system. Software may not be the only cause of system invalidation. A spike in the power supply, for example, may corrupt data in memory, or even system registers (program counter, stack pointer, etc). Check for things like stack overflows and validate memory wherever possible. System Health Assessment If the state of the system is compromised, let the watchdog timer perform the reset. This is a better approach than an application “pseudo-reset”. Watchdog timers, themselves, can also adversely affect the system. ◦ Setting a watchdog interval too short will generate a premature reset. ◦ If a critical section of code takes 80 milliseconds to complete, do not set the watchdog interval for 60 milliseconds. State-based Watchdog To guarantee that the software executes as intended, incorporate a simple state machine. ◦ This involves adjusting a state variable at the beginning of a program iteration. ◦ Prior to resetting the watchdog timer at the end of the program iteration, verify that the state is “correct”. Prevents random code from wandering into the main loop and kicking the dog. Enforces a constraint on program sequence. State-based Watchdog Example void watchdog_state_advance(void) { g_usWatchdogState += 0x1111; } void watchdog_state_validate(void) { g_usWatchdogStatePrev += 0x1111; } if(g_usWatchdogState != g_usWatchdogStatePrev) { // State is invalid, allow watchdog to reset. SLEEP(); } else { // Reset the watchdog timer. Note: Repeated calls to the WDT_RESET(); validate function will cause } a watchdog reset. Debugging Information If software detects a fault condition, log the error information prior to allowing the watchdog to reset the system. ◦ Allows the cause of the failure to be addressed. ◦ A report of the error should be attempted when the system resets (part of initialization perhaps). In addition to reporting errors after reset, it is a good idea to indicate that the device has been reset. ◦ If the software was unable to catch the error, it will still attempt to notify of the reset event. ◦ Systems that appear “sluggish” may actually be experience frequent watchdog resets. Single-threaded Implementation Single-threaded implementations should reset the watchdog timer in the main software loop. To determine the proper watchdog timeout duration, the programmer must determine the amount of time it takes to execute the code, using worst case scenarios. ◦ Many systems do not require “tight” timing. ◦ In these cases, setting the timeout to a very large safe value may be acceptable, just to provide a protection against deadlocks. Prior to resetting the watchdog, verify that the state of the system is valid, and system health is normal. Single-threaded Example main(void) { hwinit(); for (;;) { watchdog_state_advance(); read_sensors(); control_motor(); display_status(); } } if(system_check() == S_OK) { // Kick the dog. watchdog_state_validate() } else { flash_led(); report_error(E_FAIL); } Multi-threaded Implementation The same concepts used in a single-threaded design are also applicable for multi-threaded implementations. Avoid creating a thread that simply resets the watchdog timer at regular intervals. ◦ Other threads could fail, and the watchdog thread would keep kicking the dog. Generate a set of flags or data from each thread that can be validated in a “monitoring” thread. ◦ The monitoring thread should reset the watchdog at regular intervals only if the data produced by the other threads is acceptable. Multi-threaded Monitoring Monitoring Task System Tasks Multi-threaded Frequency An important criteria that can be used to validate the health of the system is the execution frequency of the worker threads. This can be accomplished by incorporating a simple counter that is incremented on each iteration of a worker thread. These counters are then monitored and compared with threshold values from the monitoring thread. ◦ If the execution frequency of the monitoring task is significantly greater, the monitoring task can perform a thresholds ◦ This allows the software to validate timing constraints. Multi-threaded Example: System Threads thread_read_sensor(void) { for (;;) { read_sensors(); thread_sensor_cnt++; sleep(50); } } thread_control_motor(void) { for (;;) { control_motor(); thread_motor_cnt++; sleep(100); } } thread_display_status(void) { for (;;) { display_status (); thread_display_cnt++; sleep(125); } } Note: Each thread maintains a unique execution counter. Multi-threaded Example: Monitoring Thread main(void) { hwinit(); launch_threads(); for (;;) { watchdog_state_advance(); else { flash_led(); report_error(E_FAIL); } // Sleep monitoring task for 1 sec. sleep(1000); } if(system_check() == S_OK && thread_sensor_cnt > 18 && thread_sensor_cnt < 22 && thread_motor_cnt > 8 && thread_motor_cnt < 12 && thread_display_cnt > 6 && thread_display_cnt < 10) { // Kick the dog. watchdog_state_validate() // Reset counters. thread_sensor_cnt = 0; thread_motor_cnt = 0; thread_display_cnt = 0; } } Note: The relative frequency of each system thread is checked here. A small window is applied to each. Mars Pathfinder In July of 1997, a priority inversion occurred on the Mars Pathfinder mission, after the craft had landed on the Martian surface. A high priority communications task was forced to wait on a mutex held by a lower priority “science” task. The timing of the software was compromised, and a system reset issued by its watchdog timer brought the system back to normal operating conditions. On Earth, scientists were able to identify the problem and upload new code to fix the problem. Thus, the rest of the $265 million dollar mission could be completed successfully. Conclusion Watchdog timers can add a great deal of reliability to embedded systems if used properly. To do so requires a good overall approach. Resetting the watchdog timer must be part of the overall design. Verify the operation integrity of the system, and use this as a criteria for resetting the watchdog timer. In addition to validating that the software “does the right thing”, verify that it does so in the time expected. Assume the software will experience a hardware malfunction or software fault. Add enough debugging information to help debug situation. Questions ? References [1] Barr, M. (2001). Introduction to Watchdog Timers, http://www.netrino.com/Publications/Glossary/WatchdogTimer.php [2] Barr, M. (2002). Introduction to Priority Inversion, http://www.netrino.com/Publications/Glossary/PriorityInversion.php [3] Gansel, J. (2004, January). Great Watchdogs, http://darwin.bio.uci.edu/~sustain/bio65/Titlpage.htm [4] Murphy, N. Watchdog Timers, Embedded Systems Programming, http://www.embedded.com/2000/0011/0011feat4.htm [5] Maxim Integrated Products, Inc. (2005, December). Supervisory Circuits with Windowed (Min/Max) Watchdog and Manual Reset, http://datasheets.maxim-ic.com/en/ds/MAX6323-MAX6324.pdf [6] Texas Instruments, Inc. (2006). MSP430x1xx Family User’s Guide, http://focus.ti.com/lit/ug/slau049f/slau049f.pdf