Monitoring and Management with Monit and Nagios

Recently, the SME Toolkit, a project sponsored by the IFC (International Finance Corporation), a member of the World Bank Group, initiated a major migration, moving all operations from one hosting provider to another. Atomic Object took responsibility for the technical aspects of the migration and new hosting environment. The migration gave us a perfect opportunity to re-evaluate our approach to monitoring and management.

Previously, we had used ‘god,’ a process monitoring framework written in Ruby. While ‘god’ was effective in the old hosting environment, we decided that we needed something more substantial and scalable. Part of the reason was the move from only two nodes in the entire application architecture to well over half a dozen. We needed to orchestrate monitoring and management not only at the process level, but at the server and infrastructure level as well.

We decided to use two applications to achieve this: Monit and Nagios. Both are well-respected monitoring and management frameworks used in industry. We selected Monit for process-level monitoring and Nagios for server and infrastructure monitoring. Both applications are highly configurable and offer many approaches to monitoring and management, including variable levels of escalation and responses to alerts.

The overall SME Toolkit web application, like many systems, consists of several different subsidiary applications and services which together are necessary for the overall application to function. If one of the processes fails, consequences can range from slow performance to complete application failure (returning HTTP 500 internal server errors to site visitors). Unfortunately, for a variety of reasons, not all of these critical processes remain running smoothly all of the time. When they do stop unexpectedly, they must be restarted. And, if they cannot be restarted automatically for some reason, the necessary support staff must be notified of the issue. Additionally, if certain processes start to consume too many resources, they may need to be selectively restarted to prevent them from monopolizing the server’s resources.

Monit

Essentially, Monit ensures that critical services and resources are alive and not misbehaving. It does this in a few ways. The SME Toolkit project most commonly uses Monit to check for a process’s existence using its PID. Then, we ensure that the process has not exceeded pre-defined CPU and memory boundaries. If the process does not exist, Monit will start it. If a process has exceeded its pre-defined resource boundaries, Monit will restart it. To accomplish this, the Monit configuration specifies the location of each process’s PID file and the commands necessary to start and stop it. Timeouts can also be specified for processes that take extra long to start up or shut down. Alerts can be configured to send e-mails whenever Monit takes a monitoring action on a process. In addition to monitoring process existence and resource use, Monit can check many other things, including file sizes, checksums, permissions, and network connectivity.
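The start-or-restart decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Monit's actual implementation: the `Limits` class, the function names, and the idea of passing in already-measured CPU and memory figures are all assumptions made for the example.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    """Hypothetical resource boundaries for one monitored process."""
    max_cpu_percent: float
    max_memory_mb: float


def read_pid(pidfile: str) -> Optional[int]:
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def process_exists(pid: Optional[int]) -> bool:
    """Probe for a live process by sending signal 0 (POSIX systems only)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def decide_action(alive: bool, cpu_percent: float,
                  memory_mb: float, limits: Limits) -> str:
    """Return 'start', 'restart', or 'none' for one monitoring cycle."""
    if not alive:
        return "start"
    if cpu_percent > limits.max_cpu_percent or memory_mb > limits.max_memory_mb:
        return "restart"
    return "none"
```

A monitoring loop would call `process_exists(read_pid(...))` on each cycle, measure resource use by whatever means the platform offers, and then run the configured start or stop command according to the returned action.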

Monit provides an easily accessible web-based interface that lists all currently monitored resources and their statuses. From this page, one can quickly determine the state of a particular server. Resources that are processes can be stopped, started, restarted, or unmonitored. This is very useful when a service needs a quick restart, or when performing resource-intensive operations during which Monit should not take a monitoring action. Naturally, the interface is password protected to prevent unauthorized persons from disturbing running processes. If necessary, one can also interact with Monit at the command line. While the command line provides much more power, its output isn’t as easy to review visually.

Sample configuration code to monitor Apache HTTPD with Monit:

check process apache with pidfile "/var/run/httpd.pid"
  start program = "/etc/init.d/httpd start"
  stop program = "/etc/init.d/httpd stop"

Nagios

Nagios checks that the servers themselves, along with certain core indicators and resources, are in a normal state. Unlike Monit, Nagios does not provide any process or server management mechanism; it simply monitors and alerts based on pre-defined metrics. This makes it better suited to tracking the overall health of a system or network than to managing individual server processes. However, a nifty synergy can be set up by having Nagios monitor Monit for failures. In that case, Nagios reports that Monit has flagged a particular failure, and Monit provides the details. If Monit itself has failed, Nagios reports that as well.
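One way to wire the two together is a small check that reads Monit's status report and turns it into a Nagios result. The Python sketch below is an assumption-laden illustration rather than the setup used on the project: it assumes the report arrives as XML with one `service` element per monitored resource, each holding a `name` and a numeric `status` where `0` means healthy. Verify that layout against your Monit version before relying on it. Fetching the report is left out; passing `None` stands for "Monit did not answer".

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Exit codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_monit_status(xml_text: Optional[str]) -> Tuple[int, str]:
    """Map a Monit status report (assumed XML layout) to a Nagios result."""
    if xml_text is None:
        # Monit itself is down or unreachable.
        return CRITICAL, "MONIT CRITICAL - Monit did not respond"
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return UNKNOWN, "MONIT UNKNOWN - could not parse status report"

    failing = []
    for service in root.iter("service"):
        name = service.findtext("name") or "?"
        status = (service.findtext("status") or "0").strip()
        if status != "0":
            failing.append(name)

    if failing:
        return CRITICAL, "MONIT CRITICAL - failing: " + ", ".join(sorted(failing))
    return OK, "MONIT OK - all services healthy"
```

Nagios then only needs to say which resources are in trouble; the operator can turn to Monit's own interface for the details.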

Nagios works by defining certain ‘objects’ such as contacts, commands, hosts, and services. Email alerts can be sent to certain contacts or groups of contacts. Commands are operations that check the health or availability of hosts and services. Hosts are generally servers, while services are processes, resources, etc. Where Monit has many built-in monitoring features and functions, Nagios’s strength is its customizability. There are existing templates which define many generic objects, and one can easily add new objects to extend Nagios’s functionality. For example, a custom command can be written to run an arbitrary script that returns a status code and a message. Such a script could be used to check the status of virtually anything.
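To make the "status code and message" contract concrete, here is a minimal Python sketch of such a script. The exit codes 0, 1, 2, and 3 for OK, WARNING, CRITICAL, and UNKNOWN follow the usual Nagios plugin convention; the disk-usage check, the thresholds, and the function names are hypothetical and chosen only to have something to measure.

```python
import shutil
from typing import Tuple

# Exit codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_disk_usage(used_percent: float, warn: float = 80.0,
                        crit: float = 90.0) -> Tuple[int, str]:
    """Turn a measurement into a (status code, one-line message) pair."""
    if used_percent >= crit:
        code = CRITICAL
    elif used_percent >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {used_percent:.1f}% used"


def main(path: str = "/") -> int:
    """Measure, print the message for Nagios to display, return the code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        print(f"DISK UNKNOWN - {error}")
        return UNKNOWN
    code, message = classify_disk_usage(100.0 * usage.used / usage.total)
    print(message)
    return code

# As a plugin entry point, the script would end with: sys.exit(main())
```

Nagios would run a script like this through a command object and read the result from the exit code and the first line of output.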

On the SME Toolkit project, for example, Nagios is configured to execute a Ruby script which checks the A and CNAME records that a particular FQDN resolves to. We expect the domain to resolve to a certain IP address or hostname; if it does not, the script returns an error code, which Nagios then flags. If the condition remains in a failed state for a certain amount of time, an e-mail notification is sent out to alert the support team.
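The project's script is written in Ruby and is not reproduced here. As a rough illustration of the same idea, the Python sketch below compares what a name resolves to against a set of expected answers. It uses the system resolver through `socket.gethostbyname_ex` rather than querying A and CNAME records directly, and the function names and the example hostname and addresses in the usage note are made up for the example.

```python
import socket
from typing import Iterable, Set, Tuple

# Exit codes that Nagios plugins conventionally return.
OK, CRITICAL = 0, 2


def compare_resolution(fqdn: str, expected: Set[str],
                       resolved: Iterable[str]) -> Tuple[int, str]:
    """Return OK if any resolved name or address is one we expect."""
    found = {name.rstrip(".").lower() for name in resolved}
    wanted = {name.rstrip(".").lower() for name in expected}
    matches = found & wanted
    if matches:
        return OK, f"DNS OK - {fqdn} resolves to {', '.join(sorted(matches))}"
    shown = ", ".join(sorted(found)) or "nothing"
    return CRITICAL, f"DNS CRITICAL - {fqdn} resolves to {shown}"


def resolve(fqdn: str) -> Set[str]:
    """Return the canonical name and IPv4 addresses from the system resolver."""
    canonical, _aliases, addresses = socket.gethostbyname_ex(fqdn)
    return {canonical, *addresses}


def check_dns(fqdn: str, expected: Set[str]) -> Tuple[int, str]:
    """Resolve fqdn and compare; a failed lookup counts as a failure."""
    try:
        resolved = resolve(fqdn)
    except OSError as error:
        return CRITICAL, f"DNS CRITICAL - lookup of {fqdn} failed: {error}"
    return compare_resolution(fqdn, expected, resolved)
```

A call such as `check_dns("www.example.org", {"192.0.2.10"})` yields the code and message for the script to hand back to Nagios. Waiting a while before notifying anyone is left to Nagios itself and is not part of the script.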

Sample configuration code to check host availability (using ping) with Nagios:

# Define our host
define host{
	use         linux-server
	host_name   someserver.local
	alias       someserver.local
	address     192.168.1.1
	}

# Define ping service parameters: warning at 100ms response time or 20% packet loss,
# and failure at 500ms response time or 60% packet loss
define service{
	use                             generic-service
	host_name                       someserver.local
	service_description             ping
	check_command                   check_ping!100.0,20%!500.0,60%
	}
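
The `check_command` value packs the command name and its two threshold arguments into one string separated by `!`. As a reading aid, the Python sketch below splits that string and classifies a ping result against the thresholds. It illustrates the semantics as we understand them and is not the plugin's own code: treating a value equal to a threshold as a breach is an assumption, and the function names are hypothetical.

```python
from typing import Tuple

# Exit codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_threshold(argument: str) -> Tuple[float, float]:
    """Split '100.0,20%' into (round-trip time in ms, packet loss in percent)."""
    rta_text, loss_text = argument.split(",")
    return float(rta_text), float(loss_text.rstrip("%"))


def classify_ping(rta_ms: float, loss_percent: float,
                  warning: str, critical: str) -> int:
    """Breaching either the latency or the loss limit is enough to trigger a state."""
    warn_rta, warn_loss = parse_threshold(warning)
    crit_rta, crit_loss = parse_threshold(critical)
    if rta_ms >= crit_rta or loss_percent >= crit_loss:
        return CRITICAL
    if rta_ms >= warn_rta or loss_percent >= warn_loss:
        return WARNING
    return OK


# The part before the first '!' is the command name; the rest are its arguments.
command, warning, critical = "check_ping!100.0,20%!500.0,60%".split("!")
```

With these values, a 150ms round trip with no packet loss classifies as a warning, while 80% packet loss is critical no matter how fast the replies are.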