Tuxgraphics ethernet host watchdog, version 3.x
The software of routers and servers unfortunately fails from time
to time. In many cases a reboot remedies the problem for a
while.
This host watchdog can improve the availability of your network and servers
significantly and fix the problem automatically before your customers
start to call.
How it works
The host watchdog
- sends a ping (ICMP echo request) to a host at regular intervals and waits
for the reply
or
- expects to receive pings (ICMP echo requests) from the monitored host.
Sending the pings from the monitored host to the watchdog
lets you add application level checks on the host
using scripts. Application level checks are more complex checks that go beyond pure
IP level availability. They are optional. The simplest configuration
is to ping from the watchdog to the host.
The watchdog resets the host after 6 consecutive intervals with no
received ping or no ping reply. It then goes into a
"passive" state to avoid rebooting the host again during startup
(e.g. interrupting a file system check). In this passive state it
will not issue a second reset even if the monitored host
appears not to respond to pings or not to send pings. Once
the host answers, the watchdog goes back to the active state. In this
state it will reset the host if it fails to respond
again.
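The active/passive behaviour described above can be sketched as a small simulation. This is only an illustration of the logic, not the watchdog firmware: the function name simulate_watchdog and the 1/0 encoding of "ping seen in this interval" are made up for the example. The limit of 6 missed intervals and the passive state after power-up come from the text.

```shell
#!/bin/sh
# Illustrative model of the reset logic, not the real firmware.
# Each argument is one ping interval: 1 = ping or reply seen, 0 = nothing seen.
MAX_MISSED=6   # from the text: reset after 6 missed intervals

simulate_watchdog() {
    state=passive   # after power-up the host has not been seen yet
    missed=0
    resets=0
    for seen in "$@"; do
        if [ "$seen" -eq 1 ]; then
            # Any sign of life clears the counter and arms the watchdog.
            missed=0
            state=active
        elif [ "$state" = active ]; then
            missed=$((missed + 1))
            if [ "$missed" -ge "$MAX_MISSED" ]; then
                # Reset the host once, then wait passively for it to return.
                resets=$((resets + 1))
                missed=0
                state=passive
            fi
        fi
        # In the passive state missed intervals are ignored: no second reset.
    done
    echo "state=$state resets=$resets"
}

# One answer, then 13 silent intervals: exactly one reset is issued.
simulate_watchdog 1 0 0 0 0 0 0 0 0 0 0 0 0 0   # prints state=passive resets=1
```

Note that the 7 silent intervals after the reset do not cause a second reset. Only a new answer from the host arms the watchdog again.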
The main configuration page of the watchdog
The status line:
Status: OK or number of missing pings [reset cnt: how
often a reset was initiated (this value goes back to zero when
the watchdog is powered down), state:
stopped|active|passive ]
The state stopped is shown if the watchdog was stopped via the
"actions" menu. Active means the watchdog is ready to reset the
host if needed. Passive means the host has not been reachable
yet since the last reset (or since power down of the watchdog).
Monitored IP is the IP address of the host to watch. Pings from
this host are counted as "host alive" and if the box "Send
ping" is ticked then pings are also sent to this IP address. It
must be an address in the local LAN and cannot be behind a
gateway.
The ping interval is the time between pings sent out or
the time within which a ping must be received. If you ping the watchdog
externally then the time between your pings should be less than the ping interval
configured at the watchdog. The value range for the ping interval
is 2 to 250 sec.
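As a rough guide, the time a dead host stays down before the reset is about 6 times the ping interval. The helper below is a minimal sketch of that arithmetic. The function name seconds_until_reset is made up for this example, and the exact timing of the real device may differ slightly.

```shell
#!/bin/sh
# Approximate time from host failure to reset: 6 missed intervals.
# The range check mirrors the allowed ping interval of 2 to 250 sec.
seconds_until_reset() {
    interval=$1
    if [ "$interval" -lt 2 ] || [ "$interval" -gt 250 ]; then
        echo "ping interval must be between 2 and 250 seconds" >&2
        return 1
    fi
    echo $((interval * 6))
}

seconds_until_reset 10    # prints 60
seconds_until_reset 250   # prints 1500
```

A short interval gives fast recovery, a long one makes the watchdog more tolerant of a host that is only temporarily busy.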
Choosing a GW IP
The gateway IP should be set to 0.0.0.0 if the monitored host is
on the same LAN as the watchdog (0.0.0.0 means don't use the GW). In this case the pings will be
sent directly from the watchdog to the monitored host.
If you want to ping a host that is behind a gateway router (e.g. a host
on the internet) then you should use the IP address of your
router as the gateway IP.
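If you are unsure whether the monitored host is on the same LAN, you can compare the network parts of the two addresses. The following is a minimal sketch, assuming you know the netmask of your LAN (the netmask is not something you configure at the watchdog). The function names ip_to_int and suggest_gw and all addresses are made up for this example.

```shell
#!/bin/sh
# Convert a dotted IPv4 address to an integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    set -f          # no filename expansion while splitting
    IFS=.
    set -- $1       # split a.b.c.d into $1 $2 $3 $4
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# usage: suggest_gw WATCHDOG_IP HOST_IP NETMASK ROUTER_IP
# Prints 0.0.0.0 if both addresses share the same network, else the router IP.
suggest_gw() {
    w=$(ip_to_int "$1")
    h=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    if [ $((w & m)) -eq $((h & m)) ]; then
        echo 0.0.0.0
    else
        echo "$4"
    fi
}

suggest_gw 10.0.0.27 10.0.0.5 255.255.255.0 10.0.0.1        # prints 0.0.0.0
suggest_gw 10.0.0.27 93.184.216.34 255.255.255.0 10.0.0.1   # prints 10.0.0.1
```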
Actions page
On the actions page you can trigger an immediate reset of the system
with the "reboot host now" button or stop the watchdog with the
"stop watchdog now" button. It is recommended to stop the watchdog
when performing maintenance on the monitored system. In the
stopped state the watchdog will not reboot the monitored host.
To start the watchdog again after it was stopped, just go back
to the actions page, which will now say "start watchdog now".
The actions page allows you to perform immediate manual actions.
Configuring the watchdog's own IP address
Version 2.X allowed the device's IP address to be changed remotely over
the internet. This feature was removed in version 3.X for security
reasons. You must now have physical access to the watchdog
to be able to change the IP.
If you bought the board with pre-loaded software then you can
change the IP by setting a jumper on the board.
If you compiled and loaded
the software yourself then you can change the device's own IP
in the source code and re-program the board.
Adding application level checks (checking if the host really works)
You can add more
sophisticated checks by only pinging the watchdog from the host
(the "Send ping" box not checked). This way you can write a
script which does some additional checks on the host and makes
sure that the application layer (e.g. web server) is really up and working:
#!/bin/sh
while true; do
    # put your additional checks here (example: check that the webserver is responding):
    if w3m -dump_head http://localhost | grep Content-Type > /dev/null; then
        # all checks passed: send one ping to the watchdog (here 10.0.0.27)
        ping -c 1 -q -w 2 10.0.0.27
    fi
    # end of additional checks
    sleep 8
done
Thoughts on reliability and DoS attacks
A problem for servers on the internet is DoS attacks, where
usually virus-infected Windows PCs are used to attack a server
by overloading it with requests. In such a case the host might not
be responsive to the watchdog. The chance of an unwanted reset
is somewhat reduced because the watchdog will only trigger
after 6 response failures in a row. If you have a host that
might get temporarily overloaded then consider using longer ping intervals
(e.g. 60 sec). You can also enforce a bandwidth limit at the router facing the
internet to make sure that your hosts
do not lock up completely when they are attacked. A second target
could be the watchdog itself. The best protection is to not
allow any external traffic towards the watchdog. This can e.g.
be done by only using private IP addresses between host and
watchdog or by using a firewall.
External connections
A relay to control the reset button of the monitored host or
to interrupt the power supply of the monitored host can be connected
to pin PD7. The tuxgraphics ethernet board already has a transistor
and fly-back diode on board to support a relay. All you need is an external 6V relay.
An LED can be connected to pin PB1. It will turn on as soon as
the first missed ping is detected and go off when pings resume.
This LED is optional.
Monitoring a network link
The host watchdog is designed to monitor an IP host (server) but
you can also use it to supervise transport equipment. WIFI routers
are often used to provide a wireless network link to a remote
site or to provide local IP network coverage. Due to firmware
quality problems those routers may stop working. Rebooting the
router will remedy the problem for a while. To monitor the WIFI network
you can use two watchdogs and a WIFI bridge. The watchdogs are configured
to ping each other across the WIFI connection.
WIFI-Router . . . . . . . . . WIFI-Bridge
     |                             |
     |                             |
watchdog-1                    watchdog-2
plugged in at                 Will reset bridge
the router.
Will reset router.
After a failure of the WIFI network both watchdogs will trigger.
This might reset the WIFI-Bridge unnecessarily, but the setup
ensures that we also recover from a WIFI-Bridge failure.
© tuxgraphics.org, K. Socher