mon was developed under Linux, but it is known to work under Solaris 2.5
and 2.6. Since the clients and server are written
completely in Perl, portability should not be much of an issue.
The following is a list of some of the features of mon:
Monitors
"Monitors" are programs that check for a particular condition,
and report success or failure to the server, along with
any output.
They are independent of mon, so to add a test for a
new service, you can just write your monitor in any language,
put it in the monitor directory, and it just works.
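As a rough illustration, a monitor can be as small as the sketch below. It assumes a calling convention in which host names arrive as command-line arguments, the first line of output lists the hosts that failed, and a non-zero exit status signals failure. The port, the timeout, and every function name are hypothetical and chosen only for this example.

```python
"""Sketch of a standalone monitor that checks whether a TCP port accepts connections."""
import socket

PORT = 80       # hypothetical: the service port to test
TIMEOUT = 5.0   # hypothetical: seconds to wait per host


def check_host(host, port=PORT, timeout=TIMEOUT):
    """Return None on success, or an error string describing the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return str(exc)


def run(hosts, check=check_host):
    """Check every host and return (exit_status, output_lines)."""
    failures = {}
    for host in hosts:
        error = check(host)
        if error is not None:
            failures[host] = error
    # First line: summary of failed hosts. Remaining lines: per-host detail.
    lines = [" ".join(sorted(failures))]
    lines += [f"{host}: {failures[host]}" for host in sorted(failures)]
    return (1 if failures else 0), lines


def main(argv):
    """Print the report for the hosts in argv and return the exit status."""
    status, lines = run(argv)
    print("\n".join(lines))
    return status
```

A script built from this would end with `sys.exit(main(sys.argv[1:]))`. Passing the check function as a parameter keeps the reporting logic testable without touching the network.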
Asynchronous Events
Support for asynchronous events communicated to the
mon server. This is an open-ended protocol, like
the monitor and alert scripts, so that you can trigger on
anything. One obvious use is acting on SNMP traps. Traps
generated by remote entities can be programmed to behave
in the same manner as failures noticed by local polling
monitors, so it is possible to build a distributed monitoring
architecture. For example, remote monitoring domains (such as
sites separated by slow WAN lines) can collect their own
data locally and report significant events to a centralized
location, such as a NOC.
Alerts
"Alert" scripts send a message or otherwise act
on a failure that mon detects. These alerts, like
the monitors, are not part of mon, and are easy to add.
"Upalerts" are also supported, which are used to trigger
an alert when a server comes back up after being down for
a long time.
Alert Management and Failure Handling
Failure of any monitor can trigger any (and multiple) alerts,
to different people at different times. You can effectively
construct "on call" schedules using this feature. For
example, you can send
a page to all system administrators if a resource goes down
before 8PM, but after 8PM page only Joe and send email to
everyone else.
Many alert throttling controls are implemented.
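The 8PM example can be sketched as a small routing function. This is not mon's configuration syntax; the staff list, the names, and the function itself are hypothetical and only illustrate the idea of choosing alerts by time of day.

```python
from datetime import time

ADMINS = ["joe", "ann", "raj"]   # hypothetical staff list
ON_CALL = "joe"                  # hypothetical after-hours contact
CUTOFF = time(20, 0)             # 8PM


def route_alerts(now, admins=ADMINS, on_call=ON_CALL, cutoff=CUTOFF):
    """Return (method, recipient) pairs for a failure detected at time `now`."""
    if now < cutoff:
        # Before the cutoff, everyone gets paged.
        return [("page", person) for person in admins]
    # From the cutoff onward, page the on-call person and email the rest.
    actions = [("page", on_call)]
    actions += [("email", person) for person in admins if person != on_call]
    return actions
```

Note that this sketch treats the early morning hours as "before 8PM". A real schedule would probably also define when the daytime period starts.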
Parallelization
Parallelizes the checking of services on different
hosts or groups of hosts. For example, pinging your routers
can happen while mon is also pinging your WWW servers. There's
no queue that can postpone the scheduled testing
of other services.
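The idea of checks that do not wait on each other can be sketched with a thread pool. Threads are only a stand-in here, since this page does not say how mon schedules its monitors internally, and the function and check names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(checks, max_workers=8):
    """Run every check in `checks` concurrently and return {name: result}.

    `checks` maps a name (for example a host group) to a zero-argument
    callable that performs the test and returns its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so a slow check cannot delay the others.
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        # Collect results; an exception raised by a check propagates here.
        return {name: future.result() for name, future in futures.items()}
```

With this shape, a slow ping of the routers does not hold up the check of the WWW servers, provided `max_workers` is at least as large as the number of checks.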
Repetitive Alert Suppression
Repetitive alerts can be suppressed. For example, only
send email once an hour if a service continues to fail.
As an option, small, transient failures of a service may be ignored.
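Both behaviours, alerting at most once per interval and ignoring short failures, fit in a small state machine. The sketch below is illustrative only; the class name and its parameters are hypothetical and are not mon's configuration options.

```python
class AlertThrottle:
    """Decide whether a failure should produce an alert."""

    def __init__(self, interval=3600, ignore_first=0):
        self.interval = interval          # minimum seconds between alerts
        self.ignore_first = ignore_first  # consecutive failures to swallow
        self.failures = 0                 # consecutive failures seen so far
        self.last_alert = None            # time of the last alert sent

    def on_success(self):
        """Reset all state once the service is healthy again."""
        self.failures = 0
        self.last_alert = None

    def on_failure(self, now):
        """Return True if an alert should be sent for a failure at `now`."""
        self.failures += 1
        if self.failures <= self.ignore_first:
            return False  # still treated as a transient failure
        if self.last_alert is not None and now - self.last_alert < self.interval:
            return False  # an alert went out too recently
        self.last_alert = now
        return True
```

Times are plain numbers of seconds, so the caller can pass `time.time()` in production and fixed values when testing.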
Dependencies
Inter-service dependencies and even correlation. For example,
if the router between the monitoring host and your WWW
server is down, HTTP won't work, so only send an alert that
the router is down. This prevents the cascading of zillions
of alerts that happens when some critical resource is not
accessible. Dependencies can be understood as a hierarchical
form (a tree), and when a failure occurs, the tree is traversed
towards the node which has no unresolved dependencies. However,
complex dependencies can be described using a generic graph, since
the actual implementation does not require a hierarchical layout.
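The traversal described above can be sketched as a walk over a dependency graph that stops at failed nodes with no failed dependencies of their own. This is an illustration of the idea, not mon's implementation; the function name and data layout are assumptions.

```python
def root_causes(service, depends_on, failed):
    """Return the failed services that best explain the failure of `service`.

    `depends_on` maps each service to the services it depends on, and
    `failed` is the set of services currently failing.
    """
    causes = set()
    seen = set()

    def visit(node):
        if node in seen:   # avoids infinite loops in a general graph
            return
        seen.add(node)
        failed_deps = [d for d in depends_on.get(node, ()) if d in failed]
        if not failed_deps:
            causes.add(node)   # nothing this node depends on is down
        for dep in failed_deps:
            visit(dep)

    if service in failed:
        visit(service)
    return causes
```

For the router example, `root_causes("http", {"http": ["router"]}, {"http", "router"})` yields only the router, so that is the one alert worth sending.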
Flexible Configuration
A very flexible (and extensible) configuration file.
Hosts can be grouped together, and each host or group
can have multiple services. An example configuration file
and an m4-based example are available.
Client/Server Model
Has interactive command-line,
WWW-based, and SkyTel 2-Way
alphanumeric pager-based clients
that query the server for status and history. The protocol is simple,
and it is very easy to make clients of your own.
Multiple authentication methods are supported (including PAM),
along with per-user access control.
A Perl module API can be used to query the server, so writing
alternate interfaces is simple (such as one that takes
advantage of WAP, the Wireless Access Protocol). At this point
there are several WWW interfaces actively being maintained by
different parties, each with its own report and goal.
To help with large configurations, "views" can be generated
to simplify reports for customers who do not need to know
the status of all services being monitored. For example,
a "network" view can be generated which includes the status of
all networking gear, just as a "servers" view can show all
info pertaining to servers. Views can be configured on a per-customer
basis if needed, and customers have control over their own views.
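A view can be thought of as a filter over the full status table. The sketch below assumes a status table keyed by (group, service) and a view defined as a set of group names; both layouts and all names are hypothetical.

```python
# Hypothetical view definitions: each view lists the host groups it covers.
VIEWS = {
    "network": {"routers", "switches"},
    "servers": {"www", "mail"},
}


def build_view(status, view_groups):
    """Return only the entries of `status` whose group belongs to the view."""
    return {
        (group, service): state
        for (group, service), state in status.items()
        if group in view_groups
    }
```

Calling `build_view(status, VIEWS["network"])` would give a customer the state of the networking gear without exposing every other monitored service.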
Run-time Alert Acknowledgement and Disabling
A service failure can be acknowledged so that alerts are
suppressed until the problem is fixed. This "ack" state
is retrievable from the client interface so that users
can see that support staff are working on the problem.
Also, alerts for particular hosts, groups, or services can
be temporarily disabled and re-enabled by the client, without
stopping and restarting the server.
If you're upgrading a particular server, you can disable
the alert while you're doing the work, and re-enable it
when you're done.
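The interaction between failures, acknowledgements, and disabling can be sketched as per-service state. This is a simplified model with hypothetical names, and it assumes an acknowledgement is cleared automatically once the service recovers.

```python
class ServiceState:
    """Track whether a failing service should still generate alerts."""

    def __init__(self):
        self.failing = False
        self.acked_by = None    # user who acknowledged the failure, if any
        self.disabled = False   # alerts switched off by an operator

    def record(self, ok):
        """Store the latest monitor result; recovery clears any ack."""
        self.failing = not ok
        if ok:
            self.acked_by = None

    def ack(self, user):
        """Acknowledge the current failure; ignored if nothing is failing."""
        if self.failing:
            self.acked_by = user

    def should_alert(self):
        """Alert only for failures that are neither acked nor disabled."""
        return self.failing and not self.disabled and self.acked_by is None
```

Because `acked_by` is kept as data rather than a flag, a client can show who is working on the problem as well as the fact that someone is.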
History
Keeps a historical list (queried by the clients)
of both failures that were detected and alerts that were
triggered.
Portability
Nothing to compile for the server or clients, and written
in 100% Perl 5. This should help portability.