13.9. Monitoring SSHD

Problem

You use the SSH daemon on all of your servers for secure remote administration, so you want to set up Nagios to monitor SSH and alert you if it becomes unavailable. You also want to be able to add new servers for monitoring easily.

Solution

Start by setting it up for one server. You'll create a command definition, a host definition, and a service definition by editing commands.cfg, hosts.cfg, and services.cfg. Then, you'll be able to add new servers simply by creating new host definitions, and adding the server names to the service definition.

The default commands.cfg does not contain a command definition for SSH, so add this to commands.cfg:

	# 'check_ssh' command definition
	define command{
	        command_name    check_ssh
	        command_line    $USER1$/check_ssh -H $HOSTADDRESS$
	        }

Next, add a host definition to hosts.cfg, using your own hostname and IP address:

	# SSH servers
	define host{
	        use                     generic-host
	        host_name               server1
	        alias                   backup server
	        address                 192.168.1.25
	        check_command           check-host-alive
	        max_check_attempts      10
	        check_period            24x7
	        notification_interval   120
	        notification_period     24x7
	        notification_options    d,r
	        contact_groups          admins
	        }
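Because each additional server needs a nearly identical host definition, a small shell helper can print the stanza for you. This is only a sketch: the function name print_host_def is hypothetical, and the name, alias, and address in the usage comment (server2, "mail server", 192.168.1.26) are made-up values. It assumes hosts.cfg lives in /usr/local/nagios/etc, alongside the nagios.cfg used later in this recipe.

```shell
#!/bin/sh
# Hypothetical helper: print a Nagios host definition for one server,
# using the same settings as the server1 example above.
# Usage: print_host_def HOST_NAME ALIAS ADDRESS
print_host_def() {
    cat <<EOF
define host{
        use                     generic-host
        host_name               $1
        alias                   $2
        address                 $3
        check_command           check-host-alive
        max_check_attempts      10
        check_period            24x7
        notification_interval   120
        notification_period     24x7
        notification_options    d,r
        contact_groups          admins
        }
EOF
}

# Example (assumed path; adjust to your installation):
# print_host_def server2 "mail server" 192.168.1.26 >> /usr/local/nagios/etc/hosts.cfg
```

Review the generated text before appending it to a live configuration file, and run the syntax checker afterward.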

Add your new server to an existing group, or create a new group for it, as this example shows:

	define hostgroup{
	        hostgroup_name   misc_servers
	        alias            Servers
	        members          server1
	        }

Now, define the SSH service in services.cfg:

	# Define a service to monitor SSH
	define service{
	        use                       generic-service
	        host_name                 server1
	        service_description       SSH
	        is_volatile               0
	        check_period              24x7
	        max_check_attempts        4
	        normal_check_interval     5
	        retry_check_interval      1
	        contact_groups            admins
	        notification_options      w,u,c,r
	        notification_interval     960
	        notification_period       24x7
	        check_command             check_ssh
	        }

Run the syntax checker, then restart Nagios:

	# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
	# /etc/init.d/nagios restart
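To avoid restarting Nagios with a broken configuration, you can chain the two steps so the restart runs only when the check succeeds. The following is a minimal sketch: the function name verify_then_restart is hypothetical, and it assumes that nagios -v exits with a nonzero status when it finds errors, which you should confirm on your own installation.

```shell
#!/bin/sh
# Restart Nagios only if the configuration check succeeds.
# Both commands are passed as arguments so the paths can be adjusted.
verify_then_restart() {
    verify_cmd=$1
    restart_cmd=$2
    # Unquoted on purpose: each argument is a simple command plus its options.
    if $verify_cmd; then
        $restart_cmd
    else
        echo "Configuration check failed; Nagios was not restarted." >&2
        return 1
    fi
}

# Example call, using the paths from this recipe (run as root):
# verify_then_restart \
#     "/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg" \
#     "/etc/init.d/nagios restart"
```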

Refresh the Nagios web interface, and you'll see the new entry's status listed as PENDING. Within a few minutes, Nagios will run the new service check, and the entry will change from PENDING to showing status information. If you don't want to wait, go to Service Detail → SSH → Reschedule Next Service Check, and run the check immediately.

Discussion

If your SSH daemon listens on a port other than 22, add the -p option to the check_ssh command line to specify the correct port.

You can use this recipe as a copy-and-paste template for most services.

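Look in your plug-in directory to view your available plug-ins. The examples in this recipe use /usr/local/nagios/libexec; other installations may use a different path, such as /usr/lib/nagios/libexec. Run [plugin-name] --help to see the available options.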

Host and service definitions have several required fields; see "Template-Based Object Configuration" (http://localhost/nagios/docs/xodtemplate.html) in your local Nagios documentation for details.

Command definitions

The check_ssh command demonstrates the most basic Nagios command definition. All Nagios command definitions must have a command_name and a command_line. The command_name can be anything you want. The command_line must be the name of a plug-in, followed by options.

$USER1$ is a special macro defined in resource.cfg; this is a shortcut for the path to the plug-in. You may have up to 32 $USERx$ macros. Nagios automatically expands the macro before it runs the command. $USERx$ macros can also store passwords and usernames. This is a nice shortcut when you're managing groups of complex configuration files.

-H means "hostname or address," and $HOSTADDRESS$ is a built-in macro that takes the IP address from the host definition. You can use hostnames with the $HOSTNAME$ macro, but adding DNS lookups to your monitoring can slow it down, and it adds a point of failure.
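To see what the expanded command looks like, the substitution can be imitated in the shell. This sketch uses sed purely for illustration and is not how Nagios performs expansion internally. The function name expand_macros is hypothetical, and it handles only the two macros used in this recipe.

```shell
#!/bin/sh
# Illustration only: imitate Nagios macro expansion for two macros.
# $1 = command_line template
# $2 = value of $USER1$ (the plug-in path from resource.cfg)
# $3 = value of $HOSTADDRESS$ (the address from the host definition)
# The values must not contain the characters | or &, which sed treats specially here.
expand_macros() {
    printf '%s\n' "$1" |
        sed -e 's|\$USER1\$|'"$2"'|g' \
            -e 's|\$HOSTADDRESS\$|'"$3"'|g'
}

# Single quotes keep the shell from touching the macro names.
expand_macros '$USER1$/check_ssh -H $HOSTADDRESS$' \
    /usr/local/nagios/libexec 192.168.1.25
# prints: /usr/local/nagios/libexec/check_ssh -H 192.168.1.25
```

The printed line is the command you can run by hand to reproduce what the service check does for server1.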

Run the plug-in from the command line to see help and options:

	root@xena:/usr/local/nagios/libexec# ./check_ssh -h

Most plug-ins have the -h or --help option.
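When you run a plug-in by hand, its exit status tells you which state Nagios would record. By the standard plug-in convention, 0 means OK, 1 means WARNING, 2 means CRITICAL, and 3 means UNKNOWN. The small helper below, whose name plugin_state is hypothetical, translates the number into the state name. Treat it as a sketch for interactive testing.

```shell
#!/bin/sh
# Translate a plug-in exit status into the state name Nagios reports.
plugin_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}

# Example, run from the plug-in directory:
# ./check_ssh -H 192.168.1.25; plugin_state $?
```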

It is best to keep your command definitions generic with macros, and to use explicit values in the service definitions.

Host definitions

Every host needs its own host definition. Host definitions tell Nagios where to find your servers, and define basic monitoring and alerting behaviors.

check_command check-host-alive is a special ping command. It is used only when other services on the host do not respond. Nagios knows that as long as the services are up, it doesn't need to ping the host to see if it is alive.

notification_options d,r means send a notification when the host is down, or has recovered from a down state and is now OK. Here are other options you can use:

u: Send notifications on an unreachable state.
f: Send notifications when the host starts and stops flapping, which is changing state very rapidly.
n: Send no notifications.

Service definitions

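To monitor SSH on additional servers, list them in the service definition's host_name directive, separated by commas:

	host_name    stinkpad,uberpc,xena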

Another option is to create a hostgroup for your servers, then use the hostgroup_name directive instead:

	hostgroup_name    backup_servers
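If you take the hostgroup route, the group itself needs a comma-separated members list. The sketch below prints a hostgroup definition from a list of server names. The function name print_hostgroup and the alias text are hypothetical, and the layout follows the misc_servers example earlier in this recipe.

```shell
#!/bin/sh
# Hypothetical helper: print a hostgroup definition.
# Usage: print_hostgroup GROUP_NAME ALIAS HOST [HOST...]
print_hostgroup() {
    group=$1
    alias_text=$2
    shift 2
    # Join the remaining arguments with commas for the members directive.
    members=$(IFS=,; echo "$*")
    cat <<EOF
define hostgroup{
        hostgroup_name   $group
        alias            $alias_text
        members          $members
        }
EOF
}

# Example:
# print_hostgroup backup_servers "Backup servers" stinkpad uberpc xena
```

Every server named in members must already have its own host definition, or the syntax checker will report an error.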

The notification_options for services differ slightly from those for hosts:

w: Send notifications on a warning state.
u: Send notifications on an unknown state.
c: Send notifications on a critical state.
r: Send notifications on recoveries.
f: Send notifications when the service starts and stops flapping.