monitoring

On remote logging with syslog

We’ve been doing some interesting things at work, as always, with yet more people and Linux boxes. One of the side effects of mixing people, Linux boxes, and several locations is the need for some sort of centralized logging. Luckily, we have either the syslog-ng or the rsyslog daemon installed on each machine, so the only two issues seemed to be reconfiguring the syslog services for remote logging and setting up a log reading and searching tool for everyone to enjoy.

As for log reading and searching, there seems to be no end of tools. We picked php-syslog-ng, which has a web interface, a MySQL back-end, access control, and more. There were a few minor issues during setup and configuration, but overall it seemed to be OK. I also patched the source code in a few places to make it work better with our setup and our needs (both numerical and symbolic priorities, a preference for include masks over excludes, and full functionality with caching disabled). In case you are interested, here is a patch against the php-syslog-ng 2.9.8f tarball.

Once everything was up and running and we started looking through logs from all our hosts in the same place, there was one thing that surprised me a lot. Either I don’t understand the syslog facilities and priorities fully (and I don’t claim that I do), or there are just too many software authors who don’t care much. Most of our logs are coming in at priority critical, even if there isn’t much critical about them. Emergency is also used way too much. And there is hardly anything at the debug, info, or notice levels. (RT, SpamAssassin, and many other applications seem to be using critical as their default log level.) Luckily, that is almost always trivial to fix, either in the configuration files or directly in the applications’ source code.
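To illustrate what sensible priorities look like from the application side, here is a minimal sketch using Python’s standard logging module and its SysLogHandler. The logger name, the local0 facility, the 127.0.0.1:514 address, and the messages are all assumptions made for the example, not anything from our setup.

```python
import logging
import logging.handlers


def make_syslog_logger(name, host="127.0.0.1", port=514):
    """Return a logger that sends records to a syslog daemon over UDP.

    The host, port, and local0 facility are placeholders for this sketch.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let every level through to the handler
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_syslog_logger("registration-app")  # hypothetical application
    # Each Python level is mapped by SysLogHandler to a syslog priority.
    log.debug("cache miss for key user:42")         # debug
    log.info("user registered")                     # info
    log.warning("mail queue is growing")            # warning
    log.error("could not connect to database")      # err
    log.critical("data directory is not writable")  # crit
```

The point is only that routine events go out at debug or info, and critical is kept for things that really are critical.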

Monitoring dynamic processes with Nagios

We have now spent a couple of weeks at work setting up our Nagios installation. This is one of those things that one can never finish. Monitoring is like security – no matter how good it is, it can always be better.

One thing that I particularly enjoyed figuring out was how to monitor dynamic stuff. Hosts and services are the easy part – they are always there and should be there. If they aren’t – let me know. But how do we monitor dynamic values that change based on the time of day or the day of the week? How do we configure the monitoring so that we don’t need to update the limits every other week?

For example, consider the number of user registrations through a web form. If we measure this number over any sensible (monitoring-wise) period of time, such as one hour, we see that it fluctuates a lot during the day. Furthermore, this number fluctuates differently depending on the day of the week. We have three separate peak hours during the day, and we have a great decrease in numbers over the weekend. Plus, the number of user registrations is linked to all the advertising that the company does, so this week might be different from the last week and from the next week. How can we measure it so that a notification is sent when the number is abnormal? What is abnormal?

The solution turned out to be much simpler than I originally thought it would be. It is sufficient to get a few samples of the data from the same hour last week and the week before. If the current value is more than twice the maximum or less than half the minimum of the sample data, then we should be notified. This, in fact, works pretty well. The only time we get a lot of false positives is when the values in the sample data are small. With values under 10, it’s very easy to jump over or under the limit. When the sample values are higher, there is more space between the boundaries and the system works as expected.
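The rule is easy to sketch in a few lines of Python. This is not our actual Nagios check, just an illustration under some assumptions: the lower bound is read as half the minimum, the function names and sample numbers are made up, and the return codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL, 3 for UNKNOWN).

```python
OK, CRITICAL, UNKNOWN = 0, 2, 3  # conventional Nagios plugin exit codes


def limits(samples, factor=2.0):
    """Return (lower, upper) bounds derived from historical samples."""
    if not samples:
        raise ValueError("at least one historical sample is required")
    return min(samples) / factor, max(samples) * factor


def check(current, samples, factor=2.0):
    """Return an (exit_code, message) pair in the style of a Nagios plugin."""
    try:
        lower, upper = limits(samples, factor)
    except ValueError as exc:
        return UNKNOWN, f"UNKNOWN - {exc}"
    detail = f"{current} registrations, expected {lower:g}..{upper:g}"
    if current < lower or current > upper:
        return CRITICAL, f"CRITICAL - {detail}"
    return OK, f"OK - {detail}"


if __name__ == "__main__":
    # Hypothetical counts for the same hour over the previous two weeks.
    same_hour_before = [40, 55, 48, 62]
    for value in (50, 130, 15):
        print(check(value, same_hour_before))
    # Small samples leave very little room between the bounds.
    print(check(1, [3, 5, 4]))
```

With samples of 3, 5, and 4 the acceptable range is only 1.5 to 10, which is exactly where the false positives come from.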

We’ll get some more sample data now and we’ll be adjusting the formula accordingly. But as I said, even as it is, it’s pretty good.

MRTG

MRTG (Multi Router Traffic Grapher) is a monitoring utility that runs on many platforms and is capable of collecting and graphing statistical information such as network traffic and CPU, memory, and disk space usage. MRTG can gather information using both SNMP and external scripts. Below are a few pieces of my MRTG config file, together with scripts, which I felt like sharing.
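For readers who have not written one, an external script for MRTG is simply a program that prints four lines: the first value, the second value, an uptime string, and the target name. The following Python sketch is not one of my own scripts; it is a generic illustration that assumes a Unix host and reports the 1-minute and 5-minute load averages multiplied by 100, since MRTG expects whole numbers. The function names and the target label are made up.

```python
import os


def mrtg_lines(first, second, uptime="", target="localhost"):
    """Format two readings as the four lines MRTG reads from a script."""
    return [str(int(round(first))), str(int(round(second))), uptime, target]


def load_average_report():
    """Report 1- and 5-minute load averages, scaled by 100 (Unix only)."""
    one_min, five_min, _ = os.getloadavg()
    return mrtg_lines(one_min * 100, five_min * 100,
                      target="load average x100")


if __name__ == "__main__":
    print("\n".join(load_average_report()))
```

In the MRTG config file, a target can point at such a script by putting the command in backticks in place of the usual SNMP address.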
