Watchdog Timers

advertisement
Watchdog Timers
Jeffrey Schwentner
EEL6897, Fall 2007
Software Reliability

Embedded systems must be able to cope with both
hardware and software anomalies to be truly robust.

In many cases, embedded devices operate in total
isolation and are not accessible to an operator.

Manually resetting a device in this scenario when its
software “hangs” is not possible.

In extreme cases, this can result in damaged hardware
or loss of life and incur significant cost impact.
The Clementine

In 1994, a deep space probe, the Clementine, was
launched to make observations of the moon and a large
asteroid (1620 Geographos).

After months of operation, a software exception
caused a control thruster to fire for 11 minutes, which
depleted most of the remaining fuel and caused the
probe to rotate at 80 RPM.

Control was eventually regained, but it was too late to
successfully complete the mission.
Watchdog Timers

While it is not possible to cope with all hardware and
software anomalies, the developer can employ the use
of watchdog timers to help mitigate the risks.

A watchdog timer is a hardware timing device that
triggers a system reset, or similar operation, after a
designated amount of time has elapsed.

A watchdog timer can be either a stand-alone hardware
component or built into the processor itself.

To avoid a reset, an application must periodically reset
the watchdog timer before this interval elapses. This is
also known as “kicking the dog”.
External Watchdogs

External watchdog timers are integrated circuits that
physically assert the reset pin of the processor.

The Processor must assert an output pin in some
fashion to reset the timing mechanism of the watchdog.

This type of watchdog is generally considered the most
appropriate because of the complete independence of
the watchdog from the processor.

Some external watchdogs feature a “windowed” reset.
◦ Enforces timing constraints for a proper watchdog reset.
◦ Minimizes likelihood of errant software resetting the watchdog.
External Watchdog Schematic
Windowed Watchdog Operation
Maxim MAX6323
Internal Watchdogs

Many processors and microcontrollers have built-in
watchdog circuitry available to the programmer.

This typically consists of a memory-mapped counter
that triggers a non-maskable interrupt (NMI), or reset,
when the counter reaches a predefined value.

Instead of issuing a reset via an I/O pin assertion, an
internal counter of reset to an initial value.

Watchdog configuration is controlled user software.

Watchdog may even be used as a general purpose timer
in some cases.
Internal Watchdog Considerations

Internal watchdogs are not as “safe” as watchdog
circuits external to the processor.
◦ Watchdogs that issue a NMI instead of a reset may not properly
reinitialize the system.
◦ Watchdog control registers may be inadvertently overwritten by
runaway code, disabling the watchdog all together.
◦ Reset is limited to the processor itself (no outside peripherals).

To circumvent these issues, most built-in watchdogs
have extra safety-steps designed to prohibit errant code
from interfering with the operation of the watchdog
timer.

On-chip solutions have a significant cost and space
advantage over their external counterparts.
MSP430 Watchdog

Texas Instruments MSP430 family of microcontrollers
has a built-in 16-bit watchdog timer featuring:
◦ Configurable clock source and prescaler
◦ Two interrupt options (Reset or NMI)
◦ Isolated watchdog counter

Access to the watchdog counter requires a unique
binary code, or password.
◦ The code must be written to the password register
prior to resetting watchdog timer.
◦ An invalid password attempt causes a key violation
interrupt.
MSP430 Watchdog Timer Block Diagram
Design Considerations

The effectiveness of the watchdog is a function of how
it is used within the application software.

Simply issuing a watchdog reset in every iteration of the
program loop may be insufficient.

Take a more proactive approach.
◦ Periodically assess the state and health of the system. Only issue
a reset if all processes are deemed normal.
◦ Employ a state-based approach when resetting the watchdog
timer.
◦ Should a watchdog failure occur, provide an indication and/or
capture debugging information.
System Health Assessment

As the size and complexity of software increases, so
does the likelihood of introducing code that may be
detrimental to the system.

Software may not be the only cause of system
invalidation. A spike in the power supply, for example,
may corrupt data in memory, or even system registers
(program counter, stack pointer, etc).

Check for things like stack overflows and validate
memory wherever possible.
System Health Assessment

If the state of the system is compromised, let the
watchdog timer perform the reset. This is a better
approach than an application “pseudo-reset”.

Watchdog timers, themselves, can also adversely affect
the system.
◦ Setting a watchdog interval too short will generate a
premature reset.
◦ If a critical section of code takes 80 milliseconds to
complete, do not set the watchdog interval for 60
milliseconds.
State-based Watchdog
To guarantee that the software executes as intended,
incorporate a simple state machine.
◦ This involves adjusting a state variable at the
beginning of a program iteration.
◦ Prior to resetting the watchdog timer at the end of
the program iteration, verify that the state is
“correct”.
 Prevents random code from wandering into the main
loop and kicking the dog.
 Enforces a constraint on program sequence.

State-based Watchdog Example
void watchdog_state_advance(void)
{
g_usWatchdogState += 0x1111;
}
void watchdog_state_validate(void)
{
g_usWatchdogStatePrev += 0x1111;
}
if(g_usWatchdogState != g_usWatchdogStatePrev)
{
// State is invalid, allow watchdog to reset.
SLEEP();
}
else
{
// Reset the watchdog timer.
Note: Repeated calls to the
WDT_RESET();
validate function will cause
}
a watchdog reset.
Debugging Information

If software detects a fault condition, log the error
information prior to allowing the watchdog to reset the
system.
◦ Allows the cause of the failure to be addressed.
◦ A report of the error should be attempted when the system
resets (part of initialization perhaps).

In addition to reporting errors after reset, it is a good
idea to indicate that the device has been reset.
◦ If the software was unable to catch the error, it will still attempt
to notify of the reset event.
◦ Systems that appear “sluggish” may actually be experience
frequent watchdog resets.
Single-threaded Implementation

Single-threaded implementations should reset the
watchdog timer in the main software loop.

To determine the proper watchdog timeout duration,
the programmer must determine the amount of time it
takes to execute the code, using worst case scenarios.
◦ Many systems do not require “tight” timing.
◦ In these cases, setting the timeout to a very large safe
value may be acceptable, just to provide a protection
against deadlocks.

Prior to resetting the watchdog, verify that the state of
the system is valid, and system health is normal.
Single-threaded Example
main(void)
{
hwinit();
for (;;)
{
watchdog_state_advance();
read_sensors();
control_motor();
display_status();
}
}
if(system_check() == S_OK)
{
// Kick the dog.
watchdog_state_validate()
}
else
{
flash_led();
report_error(E_FAIL);
}
Multi-threaded Implementation

The same concepts used in a single-threaded design are
also applicable for multi-threaded implementations.

Avoid creating a thread that simply resets the watchdog
timer at regular intervals.
◦ Other threads could fail, and the watchdog thread would keep
kicking the dog.

Generate a set of flags or data from each thread that
can be validated in a “monitoring” thread.
◦ The monitoring thread should reset the watchdog at regular
intervals only if the data produced by the other threads is
acceptable.
Multi-threaded Monitoring
Monitoring Task
System Tasks
Multi-threaded Frequency



An important criteria that can be used to validate the
health of the system is the execution frequency of the
worker threads.
This can be accomplished by incorporating a simple
counter that is incremented on each iteration of a
worker thread.
These counters are then monitored and compared with
threshold values from the monitoring thread.
◦ If the execution frequency of the monitoring task is significantly
greater, the monitoring task can perform a thresholds
◦ This allows the software to validate timing constraints.
Multi-threaded Example: System Threads
thread_read_sensor(void)
{
for (;;)
{
read_sensors();
thread_sensor_cnt++;
sleep(50);
}
}
thread_control_motor(void)
{
for (;;)
{
control_motor();
thread_motor_cnt++;
sleep(100);
}
}
thread_display_status(void)
{
for (;;)
{
display_status ();
thread_display_cnt++;
sleep(125);
}
}
Note: Each thread
maintains a unique
execution counter.
Multi-threaded Example: Monitoring Thread
main(void)
{
hwinit();
launch_threads();
for (;;)
{
watchdog_state_advance();
else
{
flash_led();
report_error(E_FAIL);
}
// Sleep monitoring task for 1 sec.
sleep(1000);
}
if(system_check() == S_OK &&
thread_sensor_cnt > 18 &&
thread_sensor_cnt < 22 &&
thread_motor_cnt > 8 &&
thread_motor_cnt < 12 &&
thread_display_cnt > 6 &&
thread_display_cnt < 10)
{
// Kick the dog.
watchdog_state_validate()
// Reset counters.
thread_sensor_cnt = 0;
thread_motor_cnt = 0;
thread_display_cnt = 0;
}
}
Note: The relative
frequency of each system
thread is checked here. A
small window is applied to
each.
Mars Pathfinder

In July of 1997, a priority inversion occurred on the
Mars Pathfinder mission, after the craft had landed on
the Martian surface.

A high priority communications task was forced to wait
on a mutex held by a lower priority “science” task.

The timing of the software was compromised, and a
system reset issued by its watchdog timer brought the
system back to normal operating conditions.

On Earth, scientists were able to identify the problem
and upload new code to fix the problem.

Thus, the rest of the $265 million dollar mission could
be completed successfully.
Conclusion

Watchdog timers can add a great deal of reliability to
embedded systems if used properly.

To do so requires a good overall approach. Resetting
the watchdog timer must be part of the overall design.

Verify the operation integrity of the system, and use this
as a criteria for resetting the watchdog timer.

In addition to validating that the software “does the
right thing”, verify that it does so in the time expected.

Assume the software will experience a hardware
malfunction or software fault. Add enough debugging
information to help debug situation.
Questions ?
References
[1] Barr, M. (2001). Introduction to Watchdog Timers,
http://www.netrino.com/Publications/Glossary/WatchdogTimer.php
[2] Barr, M. (2002). Introduction to Priority Inversion,
http://www.netrino.com/Publications/Glossary/PriorityInversion.php
[3] Gansel, J. (2004, January). Great Watchdogs,
http://darwin.bio.uci.edu/~sustain/bio65/Titlpage.htm
[4] Murphy, N. Watchdog Timers, Embedded Systems Programming,
http://www.embedded.com/2000/0011/0011feat4.htm
[5] Maxim Integrated Products, Inc. (2005, December). Supervisory
Circuits with Windowed (Min/Max) Watchdog and Manual Reset,
http://datasheets.maxim-ic.com/en/ds/MAX6323-MAX6324.pdf
[6] Texas Instruments, Inc. (2006). MSP430x1xx Family User’s Guide,
http://focus.ti.com/lit/ug/slau049f/slau049f.pdf
Download