You've successfully installed Nagios, configured Apache, and set up your configuration files in an orderly manner as outlined in the previous recipe. Reading the local Nagios documentation at http://localhost/nagios is nice, but you really want to get going on setting up Nagios to keep an untiring eye on your network. What's the next step?
Nagios is best set up in small steps, so we'll start with monitoring five basic functions on the Nagios server: ping, disk usage, local users, total processes, and CPU load. This is a long recipe, but when you're finished, you'll have your basic Nagios framework constructed.
Copy the following five configuration files exactly as shown, except where it says to use your own information, and put them in the directories as outlined in the previous recipe:
/usr/local/nagios/etc/nagios.cfg
/usr/local/nagios/etc/lan_objects/timeperiods.cfg
/usr/local/nagios/etc/lan_objects/contacts.cfg
/usr/local/nagios/etc/lan_objects/hosts.cfg
/usr/local/nagios/etc/lan_objects/services.cfg
Obviously, retyping all this is the path to madness, so please visit http://www.oreilly.com/catalog/9780596102487 to download them.
First, create nagios.cfg:
################
# nagios.cfg
# main Nagios configuration file
################
log_file=/usr/local/nagios/var/nagios.log
cfg_dir=/usr/local/nagios/etc/lan_objects
object_cache_file=/usr/local/nagios/var/objects.cache
resource_file=/usr/local/nagios/etc/resource.cfg
status_file=/usr/local/nagios/var/status.dat
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
comment_file=/usr/local/nagios/var/comments.dat
downtime_file=/usr/local/nagios/var/downtime.dat
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/usr/local/nagios/var/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
service_inter_check_delay_method=s
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=30
max_concurrent_checks=0
service_reaper_frequency=10
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
sleep_time=0.25
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=0
interval_length=60
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
check_for_orphaned_services=0
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=0
host_freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=us
p1_file=/usr/local/nagios/bin/p1.pl
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios
admin_pager=pagenagios
daemon_dumps_core=0
Now, create timeperiods.cfg:
# Time periods
# All times are valid for all
# checks and notifications
define timeperiod{
    timeperiod_name 24x7
    alias           24 Hours A Day, 7 Days A Week
    sunday          00:00-24:00
    monday          00:00-24:00
    tuesday         00:00-24:00
    wednesday       00:00-24:00
    thursday        00:00-24:00
    friday          00:00-24:00
    saturday        00:00-24:00
}
Next, create contacts.cfg. The contact_name must be a Nagios user with a Nagios login in htpasswd.users, and an email account:
################
# Contacts- individuals and groups
################
define contact{
    contact_name                    nagios
    alias                           Nagios Admin
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,r
    service_notification_commands   notify-by-email
    host_notification_commands      host-notify-by-email
    email                           nagios@alrac.net
}

# contact groups
# Nagios only talks to contact groups, not individuals
# members must be Nagios users, alias and contact_group
# are whatever you want
define contactgroup{
    contactgroup_name   admins
    alias               Nagios Administrators
    members             nagios
}
Next, create hosts.cfg:
################
# Hosts file- individual hosts and host groups
################
# Generic host definition template - This is NOT a real host, just a template!
define host{
    name                            generic-host
    notifications_enabled           1
    event_handler_enabled           1
    flap_detection_enabled          1
    failure_prediction_enabled      1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
    register                        0
}

# local host definition
define host{
    use                     generic-host
    host_name               localhost
    alias                   Nagios Server
    address                 127.0.0.1
    check_command           check-host-alive
    max_check_attempts      10
    check_period            24x7
    notification_interval   120
    notification_period     24x7
    notification_options    d,r
    contact_groups          admins
}

##############
# Host groups
##############
# Every host must belong to a host group
define hostgroup{
    hostgroup_name  test
    alias           Test Servers
    members         localhost
}
Finally, create services.cfg:
################
# Services
################
# Generic service definition template - This is NOT a real service, just a template!
define service{
    name                            generic-service
    active_checks_enabled           1
    passive_checks_enabled          1
    parallelize_check               1
    obsess_over_service             1
    check_freshness                 0
    notifications_enabled           1
    event_handler_enabled           1
    flap_detection_enabled          1
    failure_prediction_enabled      1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
    register                        0
}

# Define a service to "ping" the local machine
define service{
    use                     generic-service
    host_name               localhost
    service_description     PING
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_ping!100.0,20%!500.0,60%
}

# Define a service to check the disk space of the root partition
# on the local machine. Warning if < 20% free, critical if
# < 10% free space on partition.
define service{
    use                     generic-service
    host_name               localhost
    service_description     Root Partition
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_local_disk!20%!10%!/
}

# Define a service to check the number of currently logged in
# users on the local machine. Warning if > 20 users, critical
# if > 50 users.
define service{
    use                     generic-service
    host_name               localhost
    service_description     Current Users
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_local_users!20!50
}

# Define a service to check the number of currently running procs
# on the local machine. Warning if > 250 processes, critical if
# > 400 processes.
define service{
    use                     generic-service
    host_name               localhost
    service_description     Total Processes
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_local_procs!250!400
}

# Define a service to check the load on the local machine.
define service{
    use                     generic-service
    host_name               localhost
    service_description     Current Load
    is_volatile             0
    check_period            24x7
    max_check_attempts      4
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          admins
    notification_options    w,u,c,r
    notification_interval   960
    notification_period     24x7
    check_command           check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
}
OK, we're almost there! Make all the files in lan_objects/ owned and writable by the nagios user:
# chown nagios:nagios /usr/local/nagios/etc/lan_objects/*
# chmod 0644 /usr/local/nagios/etc/lan_objects/*
Adjust these file ownerships and modes as shown:
# chown nagios:nagios /usr/local/nagios/etc/nagios.cfg
# chmod 0644 /usr/local/nagios/etc/nagios.cfg
# chown nagios:nagios /usr/local/nagios/etc/resource.cfg
# chmod 0600 /usr/local/nagios/etc/resource.cfg
# chown nagios:nagios /usr/local/nagios/etc/cgi.cfg
# chmod 0644 /usr/local/nagios/etc/cgi.cfg
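A quick way to confirm that the ownership changes took effect is to list any configuration files that still belong to someone else. This is a minimal sketch rather than part of the recipe itself: cfg_not_owned_by is a hypothetical helper name, and it relies only on the standard find tests -type, -name, and -user.

```shell
#!/bin/sh
# cfg_not_owned_by USER DIR
# Hypothetical helper: print every .cfg file under DIR that is NOT owned
# by USER. No output means every file has the expected owner.
cfg_not_owned_by() {
    find "$2" -type f -name '*.cfg' ! -user "$1" -print
}
```

For example, running cfg_not_owned_by nagios /usr/local/nagios/etc as root should print nothing once the chown commands above have been applied.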
Now, you can run Nagios' syntax checker. You need to do this as root:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
You should see a lot of output ending in these lines:
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
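Because the checker prints a lot of output, it can help to filter it down to the two totals. The following is a small sketch; preflight_summary is a hypothetical name, and it assumes the summary lines contain the text "Total Warnings:" and "Total Errors:" as shown above.

```shell
#!/bin/sh
# preflight_summary: read the syntax checker's output on stdin and print
# only the "Total Warnings:" and "Total Errors:" lines.
preflight_summary() {
    grep -E 'Total (Warnings|Errors):'
}
```

It is meant to be used at the end of a pipe, for example: /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | preflight_summary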
If there are any errors, it will tell you exactly what you need to fix. When you get a clean run, start up the Nagios daemon:
# /etc/init.d/nagios start
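The check-then-start sequence can be wrapped in one function so the daemon is never started with a broken configuration. This is a sketch under two assumptions: that nagios -v exits with a nonzero status when the pre-flight check finds errors, and that the paths match the ones used in this recipe. The function name check_then_start and the three environment variables are made up for illustration.

```shell
#!/bin/sh
# Paths follow this recipe; override them in the environment for other layouts.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
NAGIOS_INIT="${NAGIOS_INIT:-/etc/init.d/nagios}"

# check_then_start (hypothetical helper): run the pre-flight check and
# start the daemon only if the check exits successfully.
# Assumption: "nagios -v" returns nonzero when it reports errors.
check_then_start() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        "$NAGIOS_INIT" start
    else
        echo "Pre-flight check failed; not starting Nagios." >&2
        return 1
    fi
}
```

Source the file and call check_then_start as root. If the check fails, fix the reported errors and run it again.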
Now, log in to the Nagios web interface at http://localhost/nagios, and start clicking on various links in the left navigation bar. The Service Detail page should look like Figure 13-2.
This means you have successfully gotten Nagios up and running and monitoring localhost. Congratulations!
You may name Nagios configuration files whatever you want, as long as they have the .cfg extension, which is required.
You won't be able to access all of the Nagios web interface pages yet; you'll get an "It appears as though you do not have permission to view the information you requested…" error on some of them because we haven't set the correct CGI permissions yet. See the next recipe to learn how to do this.
During its initial run, my Nagios system couldn't run the "Total Processes" check. The error message was check_procs: Unknown argument - (null). This means that either one of the options in the command definition (commands.cfg) or the service definition (services.cfg) was incorrect. I used the default files, so chances are you fine readers might encounter the same error. A quick comparison showed a mismatch between the two:
# commands.cfg
# 'check_local_procs' command definition
define command{
    command_name    check_local_procs
    command_line    $USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
}

# services.cfg
define service{
    use                     generic-service
    host_name               localhost
    service_description     Total Processes
    <...>
    check_command           check_local_procs!250!400!
}
Compare the command_line and check_command lines. The check_local_procs command wants three arguments, but the service definition check_local_procs!250!400! only defined two. Because all I want is to keep track of the total number of running processes, the first two arguments are sufficient. Deleting -s $ARG3$ and restarting Nagios fixed it.
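The same edit can be made from the shell instead of a text editor. The following is a minimal sketch: strip_arg3 is a hypothetical helper, the location of commands.cfg is passed in as an argument because it depends on your layout, and a backup copy is kept next to the original.

```shell
#!/bin/sh
# strip_arg3 FILE
# Hypothetical helper: remove the unused "-s $ARG3$" option from any
# check_procs command_line in FILE. The original is saved as FILE.bak.
strip_arg3() {
    cfg="$1"
    cp "$cfg" "$cfg.bak" &&
    sed '/check_procs/s/[[:space:]]*-s[[:space:]]*\$ARG3\$//' "$cfg.bak" > "$cfg"
}
```

After running it against your commands.cfg, rerun the syntax checker and restart Nagios so the new command definition is loaded.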
When the total number of running processes exceeds 250, Nagios sends a warning. Above 400, the state is critical.
The exclamation points simply separate the two alert values; they don't mean you need to get excited.
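Each exclamation point in a check_command value starts a new argument, and Nagios substitutes those arguments for $ARG1$, $ARG2$, and so on in the matching command_line. The sketch below compares the two sides so that a mismatch like the one above is easy to spot. Both function names are hypothetical, and the strings are passed in by hand rather than parsed out of the configuration files.

```shell
#!/bin/sh
# highest_arg_macro LINE
# Print the highest N among the $ARGN$ macros in a command_line (0 if none).
highest_arg_macro() {
    printf '%s\n' "$1" | awk '{
        max = 0; s = $0
        while (match(s, /\$ARG[0-9]+\$/)) {
            n = substr(s, RSTART + 4, RLENGTH - 5) + 0
            if (n > max) max = n
            s = substr(s, RSTART + RLENGTH)
        }
        print max
    }'
}

# supplied_args VALUE
# Print how many non-empty arguments follow the command name in a
# check_command value such as check_local_procs!250!400
supplied_args() {
    printf '%s\n' "$1" | awk -F'!' '{
        n = 0
        for (i = 2; i <= NF; i++) if ($i != "") n++
        print n
    }'
}
```

With the definitions from this recipe, highest_arg_macro '$USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$' prints 3, while supplied_args 'check_local_procs!250!400!' prints 2, which is exactly the mismatch that caused the error.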
For definitions of the options in object definition files, which are all the files in lan_objects/, start at "Template-Based Object Configuration": http://localhost/nagios/docs/xodtemplate.html
For nagios.cfg and resource.cfg, see "Main Configuration File Options": http://localhost/nagios/docs/configmain.html
For cgi.cfg, see "CGI Configuration File Options" (http://localhost/nagios/docs/configcgi.html) and "Authentication And Authorization In The CGIs" (http://localhost/nagios/docs/cgiauth.html)
Nagios.org: http://www.nagios.org/