Here’s something I’ve wanted to get into for a while now, but haven’t had the time for yet – switching the monitoring / alerting system from server-oriented to business-oriented. The gist of the story is:
If it’s not actionable and business critical, then it shouldn’t ring.
The article has some statistics and summaries as well. The reasoning behind the switch is obvious, but it’s good to have it formulated:
After a few months, I can tell reducing our alerting rate should have been a top priority before things got out of hands, for a few reasons.
- Constant alerts prevented the team to focus on what was important. Being interrupted even for things that can wait for a few hours lowers our productivity when we work on things that can’t wait.
- Being awaken every night, several times a night exhausts a team and make people less productive at day, and more prone to do errors.
- Too many off hours interventions cost the company a lot of money that could be invested in hardening the infrastructure or hiring someone else instead.
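The “actionable and business critical” rule can be sketched as a tiny routing filter. This is only an illustration of the idea, not anything from the article: the `Alert` fields, the `route` function and the “ticket” queue are all made-up names.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    actionable: bool         # can the on-call person fix something right now?
    business_critical: bool  # does it hurt the business if it waits until morning?


def should_page(alert: Alert) -> bool:
    """Ring the pager only when both conditions hold."""
    return alert.actionable and alert.business_critical


def route(alert: Alert) -> str:
    # Nothing is thrown away: non-paging alerts go to a hypothetical
    # "ticket" queue that is reviewed during working hours.
    return "page" if should_page(alert) else "ticket"


alerts = [
    Alert("checkout API down", actionable=True, business_critical=True),
    Alert("disk 80% full on a batch host", actionable=True, business_critical=False),
    Alert("upstream provider outage", actionable=False, business_critical=True),
]

for alert in alerts:
    print(f"{alert.name}: {route(alert)}")
```

Only the first alert would wake somebody up; the other two wait for office hours.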
The “How to monitor your Linux servers with nmon” article provides some details on how to use the comprehensive server monitoring tool “nmon” (Nigel’s Monitor) to keep an eye on a server or two. If you have more than a handful of servers, you’d probably opt for a full-blown monitoring solution like Zabbix, but even then nmon can be useful for quick troubleshooting, screenshots, and data collection.
I’ve heard of nmon before and even used it occasionally. What I didn’t know was that it can collect system metrics into a file, which can then later be analyzed and graphed with the nmonchart tool.
That’s pretty handy. The extra bonus is that these tools are available in most Linux distributions, so there is no need to download/compile/configure things.
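For the record, here is a rough sketch of what the capture-then-chart workflow looks like, written as a small Python helper that only builds the command lines. The flags are the ones I know from nmon’s help output (`-f` to write to a file, `-s` for the interval, `-c` for the number of snapshots, `-m` for the output directory); the nmonchart argument order and the file names are assumptions, so check them against your installed versions.

```python
import shlex
from typing import List


def nmon_capture_command(interval_s: int, snapshots: int, out_dir: str) -> List[str]:
    """Build an nmon command that records metrics to a file in out_dir."""
    return [
        "nmon",
        "-f",                   # write to a file instead of the interactive screen
        "-s", str(interval_s),  # seconds between snapshots
        "-c", str(snapshots),   # how many snapshots to take before exiting
        "-m", out_dir,          # directory to save the .nmon file in
    ]


def nmonchart_command(nmon_file: str, html_file: str) -> List[str]:
    """Build an nmonchart command (assumed usage: input file, then output HTML)."""
    return ["nmonchart", nmon_file, html_file]


capture = nmon_capture_command(interval_s=30, snapshots=120, out_dir="/tmp")
print(shlex.join(capture))
print(f"capture covers {30 * 120 // 60} minutes")

# Hypothetical file names, for illustration only.
print(shlex.join(nmonchart_command("/tmp/myhost.nmon", "/tmp/myhost.html")))
```

Pass the resulting lists to `subprocess.run()` if you want to actually run them. Thirty seconds times 120 snapshots gives an hour of data, which is usually enough to catch whatever is misbehaving.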
Somehow I missed the announcement of Nginx Amplify (beta) back in November of last year, so here it goes now.
Nginx Amplify is a new tool for the comprehensive monitoring of Nginx web servers. Here’s what it can do for you:
- Visually identify performance bottlenecks, overloaded servers, or potential DDoS attacks
- Improve and optimize NGINX performance with intelligent advice and recommendations
- Get alerts when something is wrong with the delivery of your application
- Plan capacity and performance for web applications
- Keep track of systems running NGINX
as the regular proactive monitoring of the Nginx issues. Have a look at the documentation for more details.
Downdetector is yet another one of those services that monitors major web services and lets you see whether any of them is experiencing issues or outages.
You can search for specific providers or browse by company or issue type. There’s also a weekly top 10. What I like in particular are comments for each report, where you can get some feedback from other users experiencing the problem.
PagerDuty shares their Incident Response Documentation:
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).
I think this is a goldmine for anybody involved with incident response teams, operations, monitoring, technical support, network centers, and other similar setups. Not only does it cover the specific steps and expectations during different situations, but it also defines the culture which the company is trying to build.
I wish I had this 15 years ago when I was involved in setting up the Network Operations Center (NOC). I will definitely use it in the near future, when we set up the support department at work.
One more for the CommitStrip.
Conky is a light-weight system monitor for X. It supports all kinds of metrics – anything from CPU, memory and network, to emails, music players, and more.
It reminds me of the old days, before Gnome and KDE took over the desktop environments – I think everybody had something similar running as part of the screen background.
The installation on Fedora is trivial – conky is packaged and available with a simple “yum install conky”. The configuration, on the other hand, is not so trivial. The GitHub repository provides quite a few fancy user configurations, but there was a change in the configuration file format in version 1.10, and things aren’t as smooth as I would like.
It’ll take a bit of playing around, but I’m sure I’ll eventually lose enough sleep over this to just give up and have something semi-decent on my screen.
After our recent MySQL migrations, I started getting a weird issue – the Zabbix server process was crashing periodically (several times a day).
8395:20161109:175408.023 [Z3005] query failed:  Lost connection to MySQL server during query [begin;]
8395:20161109:175408.024 [Z3001] connection to database 'zabbix_database_name_here' failed:  Can't connect to MySQL server on 'zabbix_database_host_here' (111)
8395:20161109:175408.024 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...
Digging around for a bit, it seems like a widely reported issue, related to the Zabbix server using the same database connection that one of its agents is monitoring (here is an example bug report).
Not having enough time to troubleshoot and fix it properly, I decided for the time being to use another monitoring tool – monit – to keep an eye on the Zabbix server process and restart it if it’s down. After “yum install monit”, the following was dropped into /etc/monit.d/zabbix:
# Restart the Zabbix server whenever its process is no longer running
check process zabbix_server with pidfile /var/run/zabbix/zabbix_server.pid
    start program = "/sbin/service zabbix-server start" with timeout 60 seconds
    stop program = "/sbin/service zabbix-server stop"
Start the monit service, make sure it also starts at boot, and watch it in action via /var/log/monit:
[UTC Nov 20 20:49:18] error : 'zabbix_server' process is not running
[UTC Nov 20 20:49:18] info : 'zabbix_server' trying to restart
[UTC Nov 20 20:49:18] info : 'zabbix_server' start: /sbin/service
[UTC Nov 20 20:50:19] info : 'zabbix_server' process is running with pid 28941
The chances of both systems failing at once are slim, so I think this will buy me some time.
Here’s something that happens once in a blue moon – you get a server that seems overloaded while doing nothing. There are several reasons why that can happen, but today I’m only going to look at one of them, as it happened to me very recently.
Firstly, if you have any kind of important infrastructure, make sure you have the monitoring tools in place. Not just the notification kind, like Nagios, but also graphing ones like Zabbix and Munin. This will help you plenty in times like this.
When you have an issue to solve, you don’t want to be installing monitoring tools, and starting to gather your data. You want the data to be there already.
Now, for the real thing. What happened here? Well, obviously the CPU steal time seems way off. But what the hell is the CPU steal time? Here’s a handy article – Understanding the CPU steal time. And here is my favorite part of it:
There are two possible causes:
- You need a larger VM with more CPU resources (you are the problem).
- The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).
The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.
In our case, it was a physical server issue, which we had no control over. But it was super helpful to be able to say what was going on. We’d prepared a “plan B”, which was to move to another server, but eventually the issue disappeared and we didn’t have to do that this time.
Oh, and if you don’t have those handy monitoring tools, you can use top: steal time is the “st” value at the end of the “%Cpu(s)” line.
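If you’d rather have a number you can log or graph, the same value can be computed from the aggregate “cpu” line of /proc/stat, where steal is the eighth counter. Below is a minimal sketch of that calculation; the function names are mine and the two sample lines are made up for illustration.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Counters: user nice system idle iowait irq softirq steal.
    # Guest time is already included in user/nice, so it is left out of the total.
    values = [int(v) for v in fields[1:9]]
    steal = values[7] if len(values) > 7 else 0  # very old kernels have no steal field
    return steal, sum(values)


def steal_percent(before: str, after: str) -> float:
    """Share of CPU time stolen by the hypervisor between two samples."""
    steal0, total0 = parse_cpu_line(before)
    steal1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return 100.0 * (steal1 - steal0) / elapsed if elapsed > 0 else 0.0


# Two made-up samples, as if read from /proc/stat a few seconds apart.
before = "cpu  1000 0 500 8000 100 0 50 350 0 0"
after = "cpu  1100 0 550 8400 110 0 55 785 0 0"
print(f"steal: {steal_percent(before, after):.1f}%")
```

On a live system, read the first line of /proc/stat twice with a few seconds of sleep in between and feed both lines to `steal_percent()`. The counters are cumulative since boot, so a single reading only tells you the average since the machine started.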
P.S. : If you are on Amazon EC2, you might find this article useful as well.