After our recent MySQL migrations, I started getting a weird issue – the Zabbix server process was crashing periodically (several times a day).
8395:20161109:175408.023 [Z3005] query failed:  Lost connection to MySQL server during query [begin;]
8395:20161109:175408.024 [Z3001] connection to database 'zabbix_database_name_here' failed:  Can't connect to MySQL server on 'zabbix_database_host_here' (111)
8395:20161109:175408.024 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...
After digging around for a bit, it looks like a widely reported issue, related to the Zabbix server using the same database connection as the one its agent is monitoring (here is an example bug report).
Not having enough time to troubleshoot and fix it properly, I decided, for the time being, to use another monitoring tool – monit – to keep an eye on the Zabbix server process and restart it if it’s down. After “yum install monit”, the following was dropped into /etc/monit.d/zabbix:
check process zabbix_server with pidfile /var/run/zabbix/zabbix_server.pid
start program = "/sbin/service zabbix-server start" with timeout 60 seconds
stop program = "/sbin/service zabbix-server stop"
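Under the hood that check is nothing fancy: monit reads the pid from the pidfile and tests whether a process with that pid is still alive. A rough shell sketch of the same logic (check_pidfile is a hypothetical helper, not a monit command):

```shell
# Sketch of monit's "check process ... with pidfile" logic:
# read the pid from the file, then probe the process with signal 0.
check_pidfile() {
  pidfile="$1"
  # No readable pidfile means the process was never started (or died
  # without cleanup) – treat it as down.
  [ -r "$pidfile" ] || { echo "down"; return; }
  pid=$(cat "$pidfile")
  # kill -0 sends no signal; it only checks that the pid exists.
  if kill -0 "$pid" 2>/dev/null; then
    echo "running"
  else
    echo "down"
  fi
}

check_pidfile /var/run/zabbix/zabbix_server.pid
```

When the real check fails, monit runs the configured start program – which is exactly what shows up in the log below.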
Start the monit service, make sure it also starts at boot, and watch it in action via /var/log/monit:
[UTC Nov 20 20:49:18] error : 'zabbix_server' process is not running
[UTC Nov 20 20:49:18] info : 'zabbix_server' trying to restart
[UTC Nov 20 20:49:18] info : 'zabbix_server' start: /sbin/service
[UTC Nov 20 20:50:19] info : 'zabbix_server' process is running with pid 28941
The chances of both systems failing at once are slim, so I think this will buy me some time.
Here’s something that happens once in a blue moon – you get a server that seems overloaded while doing nothing. There are several reasons why that can happen, but today I’m only going to look at one of them, as it happened to me very recently.
Firstly, if you have any kind of important infrastructure, make sure you have monitoring tools in place. Not just the notification kind, like Nagios, but also graphing ones, like Zabbix and Munin. They will help you plenty in times like this.
When you have an issue to solve, you don’t want to be installing monitoring tools, and starting to gather your data. You want the data to be there already.
Now, for the real thing. What happened here? Well, obviously the CPU steal time seems way off. But what the hell is the CPU steal time? Here’s a handy article – Understanding the CPU steal time. And here is my favorite part of it:
There are two possible causes:
- You need a larger VM with more CPU resources (you are the problem).
- The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).
The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.
In our case, it was a physical server issue, which we had no control over. But it was super helpful to be able to say what was going on. We prepared a “plan B” – moving to another server – but in the end the issue disappeared and we didn’t have to do that this time.
Oh, and if you don’t have those handy monitoring tools, you can use top:
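In top, steal time is the st field in the %Cpu(s) line; the same counter also lives in /proc/stat, as the eighth value on the cpu line. A minimal sketch of turning such a line into a steal percentage (steal_pct is a hypothetical helper, and the sample numbers are made up):

```shell
# Steal time is the 8th counter on the "cpu" line of /proc/stat
# (fields: user nice system idle iowait irq softirq steal ...).
# steal_pct computes steal as an integer percentage of total CPU time.
steal_pct() {
  echo "$1" | awk '{
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    printf "%d", ($9 / total) * 100
  }'
}

# Made-up sample line where a quarter of all CPU time was stolen:
steal_pct "cpu 100 0 100 500 50 0 0 250 0 0"   # prints 25

# On a live Linux box you would feed it the real line:
# steal_pct "$(grep '^cpu ' /proc/stat)"
```

One caveat: the /proc/stat counters are cumulative since boot, so a one-shot reading gives you an average since boot – top gets its live numbers by diffing two samples a few seconds apart.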
P.S. : If you are on Amazon EC2, you might find this article useful as well.
Graylog – store, search, and analyze log data from any source.
Free Alternative to Splunk Using Fluentd – now, this combination of Elasticsearch, Kibana, and Fluentd looks rather sexy.
page-monitor – capture webpage and diff the dom change with phantomjs
Monolog – Logging for PHP 5.3+
huginn – build agents that monitor and act on your behalf
What is Huginn?
Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn’s Agents create and consume events, propagating them along a directed event flow graph. Think of it as Yahoo! Pipes plus IFTTT on your own server. You always know who has your data. You do.