Tuxgraphics.org ethernet host watchdog
Unfortunately, the software of routers and servers occasionally
fails. In many cases a reboot remedies the problem for a
while.
The host watchdog sends a ping (ICMP echo request) to a host at
regular intervals and waits for the reply. It can also just
monitor the pings that come from a host. This can even be used
to add some application-level checks using scripts.
After 10 intervals without a received ping or ping reply the
watchdog resets the host. The watchdog then goes into a
"passive" state to avoid rebooting during startup of the host
(e.g. interrupting a filesystem check). In this passive state it
will not issue a second reset even if the monitored host
appears not to respond to pings or not to send pings. Once
the host answers, the watchdog goes back to the active state.
In this state it will reset the host again if it stops
responding.
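The reset logic described above can be summarised as a small state
machine. The following shell sketch only models that description; it
is not the watchdog firmware, and the names wd_tick, wd_state,
wd_missed and wd_resets are made up for illustration. It assumes that
the watchdog starts in the passive state after power-up and that the
missed-ping counter keeps counting while passive.

```sh
#!/bin/sh
# Model of the watchdog states described in the text (NOT the firmware).
# All names below are hypothetical.
MAX_MISSED=10      # intervals without ping/reply before a reset
wd_state=passive   # assumption: host not seen yet after power-up
wd_missed=0        # consecutive intervals without a sign of life
wd_resets=0        # how often a reset was initiated

# wd_tick <seen>: call once per interval. <seen> is 1 if a ping or a
# ping reply from the host arrived during that interval, otherwise 0.
wd_tick() {
    if [ "$1" -eq 1 ]; then
        wd_missed=0
        wd_state=active      # host answered: arm the watchdog (again)
        return 0
    fi
    wd_missed=$((wd_missed + 1))
    if [ "$wd_state" = active ] && [ "$wd_missed" -ge "$MAX_MISSED" ]; then
        wd_resets=$((wd_resets + 1))   # the real device resets the host here
        wd_state=passive               # no second reset until the host answers
    fi
}
```

Calling wd_tick 0 ten times in a row while active triggers exactly one
reset; further calls with 0 change nothing until wd_tick 1 is seen.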
Configuring the watchdog's own IP address
The default settings for the watchdog are:
password: secret
IP address: 10.0.0.27
You can either change this in the code before you compile it or
you can change it at run-time.
To change it at run-time you must first connect to the
watchdog. You can use a straight Ethernet cable and connect the
watchdog directly to your PC. Reconfigure the network interface
of your PC to be a host in the same network (e.g. 10.0.0.1).
You can also connect the watchdog somewhere in the local LAN and
add a route on your PC. Under Linux that would be done with the
command:
route add -host 10.0.0.27 dev eth0
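On Linux systems that ship the iproute2 "ip" tool instead of "route",
both setups could look roughly like the sketch below. The interface
name eth0 and the /24 prefix are assumptions, and run,
pc_join_network and pc_add_host_route are hypothetical helper names.
Because the commands need root, the sketch only prints them unless
APPLY=1 is set.

```sh
#!/bin/sh
# Sketch only: eth0 and the /24 prefix are assumptions, adjust as needed.
# Commands are printed by default; set APPLY=1 (as root) to execute them.
run() {
    if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# Direct cable: give the PC an address in the watchdog's network.
pc_join_network() {
    run ip addr add 10.0.0.1/24 dev "${1:-eth0}"
}

# Watchdog somewhere in the local LAN: add a host route instead
# (iproute2 counterpart of "route add -host 10.0.0.27 dev eth0").
pc_add_host_route() {
    run ip route add 10.0.0.27/32 dev "${1:-eth0}"
}
```

Calling pc_add_host_route without arguments prints the command for
eth0; pass another interface name as the first argument if needed.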
Once you are connected, open http://10.0.0.27/, go to the config
menu and enter "secret" as "old pw". Change "new pw" and "own
IP" as needed.
The main configuration page of the watchdog
The status line:
Status: OK or number of missing pings [reset cnt: how often a
reset was initiated (this value goes back to zero on power-down
of the watchdog), stopped|active|passive]
The state "stopped" is shown if the watchdog was stopped via the
"actions" menu. Active means the watchdog is ready to reset the
host if needed. Passive means the host has not been reachable
since the last reset (or after a power-down of the watchdog).
Monitored IP is the IP address of the host to watch. Pings from
this host are counted as "host alive" and if the box "Send
ping" is ticked then pings are also sent to this IP address. It
must be an address in the local LAN and cannot be behind a
gateway.
The main page updates itself (HTTP refresh) after 60 sec. An
immediate refresh can be done by clicking on "refresh page".
Note that requests for web pages take precedence over pings.
Clicking "refresh page" too often can actually cause the sending
of pings to fail.
Actions page
On the actions page you can trigger an immediate reset of the
system with the "reboot host now" button or stop the watchdog
with the "stop watchdog now" button. Stopping the watchdog is
useful when performing maintenance on the monitored system. In
the stopped state the watchdog will not reboot the monitored
host. To start the watchdog again after it was stopped, just go
back to the actions page, where the button now says "start
watchdog now".
Config page
The config page is used to change the watchdog's own password
and IP address settings.
Factory default
The watchdog can be reset to compile-time default settings in
the event that you forgot which IP address or password you
configured. To do this, power down the watchdog and connect
pin PD6 of the microcontroller to GND while powering it up
again. The watchdog will then be reachable at http://10.0.0.27/
as described above. Note that you have to actually go to the
config page and submit the changes; otherwise it will fall back
(after power-down) to the settings the watchdog had previously
stored.
Inserting a temporary wire bridge to reset to factory defaults.
Thoughts on reliability and DoS attacks
The watchdog is only useful if a failure of the monitored host
really causes it to not respond to ping. You can add some more
sophisticated checks by only pinging the watchdog from the host
(the "Send ping" box not checked). This way you can write a
script which does some additional checks on the host:
#!/bin/sh
while true; do
    # put your additional checks here (example check webserver is responding):
    if w3m -dump_head http://localhost | grep Content-Type > /dev/null; then
        ping -c 1 -q -w 2 10.0.0.27
    fi
    # end of additional checks
    sleep 25
done
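If more than one check is needed, the script above could be
restructured so that the watchdog is only pinged when every check
succeeds. The following is a sketch under that assumption; run_checks
is a hypothetical helper name, and the disk check in the usage comment
is only an illustration.

```sh
#!/bin/sh
# run_checks CMD...: each argument is one shell command line.
# Returns 0 only if every command succeeds; stops at the first failure.
run_checks() {
    for check in "$@"; do
        sh -c "$check" > /dev/null 2>&1 || return 1
    done
    return 0
}

# Possible use inside the loop of the script above:
#   if run_checks 'w3m -dump_head http://localhost | grep Content-Type' \
#                 'test -w /tmp'; then
#       ping -c 1 -q -w 2 10.0.0.27
#   fi
```

With this structure a single failing check is enough to withhold the
ping, so the watchdog resets the host after 10 missed intervals.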
A big problem for servers on the internet are DoS attacks, where
usually virus-infected Windows PCs are used to attack a server
by overloading it with requests. In such a case the host might
not respond to the watchdog either. The chances of this
happening are somewhat reduced because the watchdog only acts
after 10 response failures in a row. So make sure that your
hosts do not lock up completely when they are attacked. A second
target could be the watchdog itself. The best protection is to
not allow any external traffic towards the watchdog. This can
e.g. be done by only using private IP addresses between host and
watchdog or by using a firewall.
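As an illustration of the firewall option, a rule along the following
lines could be used. This is only a sketch: it assumes a Linux router
with iptables sitting between the internet and the LAN of the
watchdog, and protect_watchdog is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch only, to be run as root on a Linux router that sits between
# the internet and the LAN with the watchdog (an assumption).
# protect_watchdog is a hypothetical helper; 10.0.0.27 is the
# watchdog's default address from this page.
protect_watchdog() {
    wd_ip=${1:-10.0.0.27}
    # Drop every packet this router would forward towards the watchdog.
    # Hosts on the same LAN segment still reach it directly.
    iptables -A FORWARD -d "$wd_ip" -j DROP
}
```

The monitored host is not affected by this rule as long as it is in
the same LAN as the watchdog, which the watchdog requires anyway.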
© tuxgraphics.org, K. Socher