We have now spent a couple of weeks at work setting up our Nagios installation. This is one of those things that one can never finish. Monitoring is like security – no matter how good it is, it can always be better.
One thing that I particularly enjoyed figuring out was how to monitor dynamic stuff. Hosts and services are the easy part – they are always there and should be there. If they aren’t – let me know. But how do you monitor dynamic values that change based on the time of day or the day of the week? How do you configure the monitoring so that you don’t need to update the limits every other week?
For example, consider the number of user registrations through a web form. If we measure this number over any sensible (monitoring-wise) period of time, such as one hour, we’d see that it fluctuates a lot during the day. Furthermore, this number fluctuates differently depending on the day of the week. We have three separate peak hours during the day, and we have a big drop in numbers over the weekend. Plus, the number of user registrations is linked to all the advertising that the company does, so this week might be different from last week and from next week. How can we measure it so that a notification is sent when the number is abnormal? What is abnormal?
The solution turned out to be much simpler than I originally thought it would be. It is sufficient to get a few samples of the data from the same hour last week and the week before. If the current value is more than twice the maximum or less than half the minimum of the sample data, then we should be notified. This, in fact, works pretty well. The only time we get a lot of false positives is when the values in the sample data are small. With values under 10, it’s very easy to jump over or under the limit. When the sample values are higher, there is more space between the boundaries and the system works as expected.
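As a rough illustration, the check could be sketched in Python like this. It is only a sketch, not our actual plugin: the function names and sample numbers are made up, and it assumes the lower limit is half the sample minimum and the upper limit is twice the sample maximum. The exit codes are the standard Nagios plugin ones.

```python
"""Sketch of a threshold check that compares a value with historical samples."""

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def bounds(samples, factor=2.0):
    """Return (lower, upper) limits derived from the historical samples.

    Assumption: lower = min / factor, upper = max * factor.
    """
    if not samples:
        raise ValueError("need at least one historical sample")
    return min(samples) / factor, max(samples) * factor


def check(current, samples, factor=2.0):
    """Return (exit_code, message) for the current value."""
    try:
        lower, upper = bounds(samples, factor)
    except ValueError as exc:
        return UNKNOWN, f"UNKNOWN - {exc}"
    if current < lower or current > upper:
        return CRITICAL, f"CRITICAL - {current} outside [{lower:g}, {upper:g}]"
    return OK, f"OK - {current} within [{lower:g}, {upper:g}]"


# Hypothetical data: registrations in the same hour, one and two weeks ago.
samples = [40, 55, 48, 62]
code, message = check(130, samples)
print(message)
```

A real plugin would pull the samples from wherever the registration counts are stored and finish with sys.exit(code) so that Nagios picks up the status. With small samples such as [3, 5] the limits come out as 1.5 and 10, which shows how little room there is before a notification fires.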
We’ll get some more sample data now and we’ll be adjusting the formula accordingly. But as I said, even as it is, it’s pretty good.
Nagios was OK, once upon a time. I used it extensively at one of the “known” providers in Cyprus and have built customized setups so complex to maintain that it sounds almost silly. Anybody handling Nagios knows that an additional administration layer is required in order to maintain a large NSP-grade network and systems.
Unfortunately, Nagios does not cut it anymore, and that’s a fact. Newer monitoring systems out there surpass Nagios by miles. My favourite at present is Zenoss. You can check an article on monitoring systems (which I feel very strongly about) at
http://www.spinthiras.net/2008.....onitoring/
I know this doesn’t have much to do with dynamic processes, but it has a lot to do with monitoring, and it’s my 2p towards it all.
Mario.
Mario,
whether Nagios “cuts it” or not is not a fact, but a perspective, which depends on the task at hand. I don’t work for an ISP right now, and Nagios is well suited to what we need to monitor. I particularly like how easy it is to extend it to monitor anything from networking equipment to how many pints of beer I have a week.
When I was working for an ISP, we used Nagios too. I don’t know which limitations you are talking about, because I remember it was doing a fine job. Once you organize everything properly into groups and use templates and defaults – it doesn’t need much effort at all.
Anyhow, thank you for the link. I’ll check the alternatives too.
One interesting approach to ‘evolving’ data like this is to use the Holt-Winters forecasting model built into newer versions of RRDTool.
Here are some notes on the implementation:
http://cricket.sourceforge.net/aberrant/rrd_hw.htm
You don’t need to be graphing things to find out if your data is outside the bounds of the forecast; you can query them with ‘rrdtool fetch’. Anyway, it’s another option, rather than having to come up with reasonable metrics by hand for each type of data.
Joshua,
sounds very interesting. Thanks a bunch for sharing.
Hi,
You should look at Shinken. It’s an enhanced Nagios reimplementation in Python that allows you to have a quick and easy distributed and high-availability monitoring environment, and of course it is compatible with Nagios configuration and plugins :)
It’s available (Open Source with an AGPL licence) at http://www.shinken-monitoring.org with even a demo virtual machine to test it in 5 minutes :)
Jean Gabès, Shinken developer