You use the SSH daemon on all of your servers for secure remote administration, so you want to set up Nagios to monitor SSH and alert you if it becomes unavailable. You also want to be able to add new servers for monitoring easily.
Start by setting it up for one server. You'll create a command definition, a host definition, and a service definition by editing commands.cfg, hosts.cfg, and services.cfg. Then, you'll be able to add new servers simply by creating new host definitions, and adding the server names to the service definition.
The default commands.cfg does not contain a command definition for SSH, so add this to commands.cfg:
# 'check_ssh' command definition
define command{
    command_name    check_ssh
    command_line    $USER1$/check_ssh -H $HOSTADDRESS$
    }
Next, add a host definition to hosts.cfg, using your own hostname and IP address:
# SSH servers
define host{
    use                    generic-host
    host_name              server1
    alias                  backup server
    address                192.168.1.25
    check_command          check-host-alive
    max_check_attempts     10
    check_period           24x7
    notification_interval  120
    notification_period    24x7
    notification_options   d,r
    contact_groups         admins
    }
Add your new server to an existing group, or create a new group for it, as this example shows:
define hostgroup{
    hostgroup_name    misc_servers
    alias             Servers
    members           server1
    }
Now, define the SSH service in services.cfg:
# Define a service to monitor SSH
define service{
    use                    generic-service
    host_name              server1
    service_description    SSH
    is_volatile            0
    check_period           24x7
    max_check_attempts     4
    normal_check_interval  5
    retry_check_interval   1
    contact_groups         admins
    notification_options   w,u,c,r
    notification_interval  960
    notification_period    24x7
    check_command          check_ssh
    }
Run the syntax checker, then restart Nagios:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# /etc/init.d/nagios restart
Refresh the Nagios web interface, and you'll see the new entry's status listed as PENDING. In a few minutes, Nagios will run the new service check, and PENDING will be replaced by status information. If you don't want to wait, go to Service Detail → SSH → Reschedule Next Service Check, and run it immediately.
If you are using ports other than port 22, use the -p option to specify the correct port.
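As a sketch of how the -p option might be wired in, the fragment below assumes an SSH daemon listening on port 2222. The command name check_ssh_port and the port number are made up for illustration; $ARG1$ is the standard Nagios macro that receives the first value after the ! in a service's check_command.

```cfg
# Hypothetical command: check SSH on a port passed in as $ARG1$
define command{
    command_name    check_ssh_port
    command_line    $USER1$/check_ssh -p $ARG1$ -H $HOSTADDRESS$
    }

# In the service definition, pass the port after the exclamation point:
#     check_command    check_ssh_port!2222
```

Written this way, the command definition stays generic, and the explicit port number lives in the service definition.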
You can use this recipe as a copy-and-paste template for most services.
Look in /usr/lib/nagios/libexec to view your available plug-ins. Run [plugin-name] --help to see the available options.
Host and service definitions have several required fields; see "Template-Based Object Configuration" (http://localhost/nagios/docs/xodtemplate.html) in your local Nagios documentation for details.
The check_ssh command demonstrates the most basic Nagios command definition. All Nagios command definitions must have a command_name and a command_line. The command_name can be anything you want. The command_line must be the name of a plug-in, followed by options.
$USER1$ is a special macro defined in resource.cfg; this is a shortcut for the path to the plug-in. You may have up to 32 $USERx$ macros. Nagios automatically expands the macro before it runs the command. $USERx$ macros can also store passwords and usernames. This is a nice shortcut when you're managing groups of complex configuration files.
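For illustration, a resource.cfg along these lines would define the macros. The plug-in path assumes the /usr/local/nagios install prefix used elsewhere in this recipe, and the $USER3$ and $USER4$ values are placeholders, not settings from any real system.

```cfg
# resource.cfg (sketch; all values are assumptions for illustration)

# Path to the plug-ins, referenced as $USER1$ in command definitions
$USER1$=/usr/local/nagios/libexec

# Hypothetical credentials, kept out of the object definition files
$USER3$=monitor_user
$USER4$=example_password
```

With this in place, $USER1$/check_ssh in a command_line would expand to /usr/local/nagios/libexec/check_ssh before the command runs.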
-H means "hostname or address," and $HOSTADDRESS$ is a built-in macro that takes the IP address from the host definition. You can use hostnames with the $HOSTNAME$ macro, but adding DNS lookups to your monitoring can slow it down, and it adds a point of failure.
Run the plug-in from the command line to see help and options:
root@xena:/usr/local/nagios/libexec# ./check_ssh -h
Most plug-ins have the -h or --help option.
It is best to keep your command definitions generic with macros, and to use explicit values in the service definitions.
Every host needs its own host definition. Host definitions tell Nagios where to find your servers, and define basic monitoring and alerting behaviors.
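Adding a second machine could look like the sketch below. The host name server2, its alias, and the address 192.168.1.26 are hypothetical; the remaining values are copied from the server1 definition in this recipe.

```cfg
# Hypothetical second SSH server; substitute your own name and address
define host{
    use                    generic-host
    host_name              server2
    alias                  second backup server
    address                192.168.1.26
    check_command          check-host-alive
    max_check_attempts     10
    check_period           24x7
    notification_interval  120
    notification_period    24x7
    notification_options   d,r
    contact_groups         admins
    }
```

Once the host exists, the SSH service picks it up when server2 is added to the host_name line of the service definition.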
check_command check-host-alive is a special ping command. It is used only when other services on the host do not respond. Nagios knows that as long as the services are up, it doesn't need to ping the host to see if it is alive.
notification_options d,r means send a notification when the host is down, or has recovered from a down state and is now OK. Here are other options you can use:
u
Send notifications on an unreachable state.
f
Send notifications when the host starts and stops flapping, which is changing state very rapidly.
n
Send no notifications.
Service definitions are similar to host definitions, with one large difference: they are reusable. Once you create a service definition, you can add more hosts to it, rather than creating a new service definition every time you need to add a new machine. Just add more servers on the host_name line in a comma-delimited list:
host_name stinkpad,uberpc,xena
Another option is to create a hostgroup for your servers, then use the hostgroup_name directive instead:
hostgroup_name backup_servers
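Put together, the hostgroup approach might look like the following sketch. It assumes that stinkpad, uberpc, and xena each already have their own host definition and that they belong in backup_servers; the group's alias is made up for illustration, and the service values are copied from the SSH service defined earlier.

```cfg
# Hypothetical hostgroup; every member needs its own host definition
define hostgroup{
    hostgroup_name    backup_servers
    alias             Backup Servers
    members           stinkpad,uberpc,xena
    }

# One SSH service definition applied to every member of the group
define service{
    use                    generic-service
    hostgroup_name         backup_servers
    service_description    SSH
    is_volatile            0
    check_period           24x7
    max_check_attempts     4
    normal_check_interval  5
    retry_check_interval   1
    contact_groups         admins
    notification_options   w,u,c,r
    notification_interval  960
    notification_period    24x7
    check_command          check_ssh
    }
```

With this layout, monitoring SSH on a new server means writing its host definition and adding its name to the members line; the service definition itself does not change.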
The notification_options are a bit different:
w
Send notifications on a warning state.
u
Send notifications on an unknown state.
c
Send notifications on a critical state.
r
Send notifications on recoveries.
f
Send notifications when the service starts and stops flapping.
"Using Macros In Commands" (http://localhost/nagios/docs/macros.html) in your local Nagios documentation for a list of built-in macros
For definitions of the options in object definition files, which are all the files in lan_objects/, start at "Template-Based Object Configuration": http://localhost/nagios/docs/xodtemplate.html
For nagios.cfg and resource.cfg, see "Main Configuration File Options": http://localhost/nagios/docs/configmain.html
"Flapping": http://localhost/nagios/docs/flapping.html
Nagios.org: http://www.nagios.org/