ethernet host watchdog, version 3.X


Tuxgraphics ethernet host watchdog, version 3.x

Router and server software unfortunately fails from time to time. In many cases a reboot remedies the problem for a while.

This host watchdog can significantly improve the availability of your network and servers by fixing the problem automatically, before your customers start to call.

How it works

The host watchdog

- sends a ping (ICMP echo request) to a host at regular intervals and waits for the reply

or

- expects to receive pings (ICMP echo requests) from the monitored host.

By sending the pings from the monitored host to the watchdog, you can add application-level checks on the host using scripts. Application-level checks are more complex checks that go beyond pure IP-level availability. They are optional. The simplest configuration is to ping from the watchdog to the host.

The watchdog resets the host after 6 intervals with no received ping or no ping reply. It then goes into a "passive" state to avoid rebooting the host again during startup (e.g. interrupting a file system check). In this passive state it will not issue a second reset, even if the monitored host still does not respond to pings or send pings. Once the host answers, the watchdog goes back to the active state, in which it will reset the host if it stops responding again.
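The reset logic described above can be pictured as a small state machine. The following is only a sketch in shell to illustrate the behaviour; the real firmware is not a shell script, and the names (watchdog_tick, MAX_MISSED, missed, resets) are hypothetical. It also assumes that missed pings keep being counted while the watchdog is passive, which the text does not state.

```shell
#!/bin/sh
# Sketch of the watchdog reset logic. All names are hypothetical.

MAX_MISSED=6      # intervals without ping before a reset is issued
state=passive     # passive until the host has answered at least once
missed=0          # consecutive intervals without ping (assumed to keep counting)
resets=0          # how often a reset was initiated

# Call once per ping interval.
# Argument: 1 if a ping or ping reply was seen in this interval, else 0.
watchdog_tick() {
    if [ "$1" -eq 1 ]; then
        missed=0
        state=active          # host answered: the watchdog is armed
        return 0
    fi
    missed=$((missed + 1))
    if [ "$state" = active ] && [ "$missed" -ge "$MAX_MISSED" ]; then
        resets=$((resets + 1))
        state=passive         # no second reset until the host answers again
    fi
}

# Simulation: the host answers once, is silent for 8 intervals, then returns.
for seen in 1 0 0 0 0 0 0 0 0 1; do
    watchdog_tick "$seen"
    echo "seen=$seen missed=$missed state=$state resets=$resets"
done
# last line printed: seen=1 missed=0 state=active resets=1
```

In the simulation only one reset is issued although the host stays silent for 8 intervals, because the watchdog is passive after the first reset.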

The main configuration page of the watchdog

The status line:
Status: OK or number of missing pings [reset cnt: how often a reset was initiated, state: stopped|active|passive]

The reset count goes back to zero when the watchdog is powered down.

The state "stopped" is shown if the watchdog was stopped via the "actions" menu. "Active" means the watchdog is ready to reset the host if needed. "Passive" means the host has not been reachable since the last reset (or since the watchdog was powered up).

Monitored IP is the IP address of the host to watch. Pings from this host are counted as "host alive", and if the box "Send ping" is ticked then pings are also sent to this IP address. It must be an address in the local LAN and cannot be behind a gateway.

The ping interval is the time between pings sent out, or the time within which a ping must be received. If you ping the watchdog externally, the time between pings you send should be less than the ping interval configured at the watchdog. The value range for the ping interval is 2 to 250 sec.
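As a rough aid for choosing these values, the arithmetic can be sketched in shell. The helper names (reset_delay, timing_ok) are hypothetical and not part of the watchdog. The sketch assumes that a host which stops responding is reset after about 6 ping intervals, as described earlier.

```shell
#!/bin/sh
# Hypothetical helpers for choosing timing values; not part of the watchdog.

MAX_MISSED=6   # reset after 6 intervals without ping

# Print the approximate time in seconds until a dead host is reset.
# Argument: ping interval in seconds.
reset_delay() {
    echo $(( $1 * MAX_MISSED ))
}

# Succeed if the ping interval is within 2..250 sec and the period at
# which the host sends its pings is shorter than the ping interval.
# Arguments: ping interval, sending period (both in seconds).
timing_ok() {
    interval=$1
    period=$2
    [ "$interval" -ge 2 ] && [ "$interval" -le 250 ] && [ "$period" -lt "$interval" ]
}

reset_delay 10                        # prints 60
timing_ok 10 8 && echo "ok"           # prints ok
timing_ok 10 12 || echo "too slow"    # prints too slow
```

Note that for a script that pings in a loop, the sending period is the sleep time plus the time the checks and the ping itself take.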

Choosing a GW IP

The gateway IP should be set to 0.0.0.0 if the monitored host is on the same LAN as the watchdog (0.0.0.0 means don't use the GW). In this case the pings will be sent directly from the watchdog to the monitored host.

If you want to ping a host that is behind a gateway router (e.g. a host on the internet), set the gateway IP to the address of your router.
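The choice between 0.0.0.0 and the router address can be sketched as a subnet comparison. This is only an illustration: the function names are hypothetical, the addresses are examples, and the netmask is an assumption, since it is something you know about your own LAN rather than a setting described here.

```shell
#!/bin/sh
# Hypothetical helper to decide which GW IP to enter; not part of the watchdog.

# Convert a dotted IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if two addresses are in the same subnet.
# Arguments: address 1, address 2, netmask (assumed, e.g. 255.255.255.0).
same_lan() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Print the GW IP to configure.
# Arguments: watchdog IP, monitored IP, netmask, router IP.
choose_gw() {
    if same_lan "$1" "$2" "$3"; then
        echo 0.0.0.0     # same LAN: ping directly, no gateway
    else
        echo "$4"        # host is behind the router
    fi
}

choose_gw 10.0.0.27 10.0.0.5 255.255.255.0 10.0.0.1     # prints 0.0.0.0
choose_gw 10.0.0.27 192.0.2.10 255.255.255.0 10.0.0.1   # prints 10.0.0.1
```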

Actions page

On the actions page you can trigger an immediate reset of the system with the "reboot host now" button or stop the watchdog with the "stop watchdog now" button. It is recommended to stop the watchdog when performing maintenance on the monitored system. In the stopped state the watchdog will not reboot the monitored host.

To start the watchdog again after it was stopped, go back to the actions page; the button will now say "start watchdog now".

The actions page allows you to perform immediate manual actions.

Configuring the watchdog's own IP address

Version 2.X allowed the device's IP address to be changed remotely over the internet. This feature was removed in version 3.X for security reasons. You must now have physical access to the watchdog to be able to change its IP.

If you bought the board with pre-loaded software then you can change the IP by setting a jumper on the board.

If you compiled and loaded the software yourself then you can change the device's own IP in the source code and re-program the board.

Adding application level checks (checking if the host really works)

You can add more sophisticated checks by only pinging the watchdog from the host (with the "Send ping" box not checked). This way you can write a script that performs additional checks on the host and makes sure that the application layer (e.g. the web server) is really up and working:

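#!/bin/sh
# 10.0.0.27 is the IP address of the watchdog in this example.
while true; do

# put your additional checks here (example: check that the web server is responding):
if w3m -dump_head http://localhost | grep Content-Type > /dev/null; then
    # all checks passed: tell the watchdog that the host is alive
    ping -c 1 -q -w 2 10.0.0.27
fi
# end of additional checks

# the time between pings must stay below the ping interval set at the watchdog
sleep 8
done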

Thoughts on reliability and DOS attacks

DOS attacks are a problem for servers on the internet: typically virus-infected Windows PCs are used to attack a server by overloading it with requests. In such a case the host might not respond to the watchdog. The risk of an unwanted reset is somewhat reduced because the watchdog only triggers after 6 response failures in a row. If you have a host that might get temporarily overloaded, consider using longer ping intervals (e.g. 60 sec). You can also enforce a bandwidth limit at the router facing the internet to make sure that your hosts do not lock up completely when they are attacked.

A second target could be the watchdog itself. The best protection is to not allow any external traffic towards the watchdog. This can be done, for example, by using only private IP addresses between host and watchdog or by using a firewall.

External connections

A relay to control the reset button of the monitored host, or to interrupt the power supply of the monitored host, can be connected to pin PD7. The tuxgraphics ethernet board already has a transistor and fly-back diode on board to support a relay. All you need is an external 6V relay.

An LED can be connected to pin PB1. It turns on as soon as the first missed ping is detected and goes off when pings resume. This LED is optional.

Monitoring a network link

The host watchdog is designed to monitor an IP host (server), but you can also use it to supervise transport equipment. WIFI routers are often used to provide a wireless network link to a remote site or to provide local IP network coverage. Due to firmware quality problems those routers may stop working. Rebooting the router will remedy the problem for a while. To monitor the WIFI network you can use two watchdogs and a WIFI bridge. The watchdogs are configured to ping each other across the WIFI connection.


     WIFI-Router  . . . . . . . . .  WIFI-Bridge
         |                               |
         |                               |
      watchdog-1                     watchdog-2
      plugged in at                  Will reset bridge
      the router.
      Will reset router.

After a failure of the WIFI network both watchdogs will trigger. This might reset the WIFI-Bridge unnecessarily, but the setup ensures that we also recover from a WIFI-Bridge failure.
