Monitoring the monitoring : keeping Zabbix server service up

After our recent MySQL migrations, I started getting a weird issue: the Zabbix server process was crashing periodically (several times a day).

8395:20161109:175408.023 [Z3005] query failed: [2013] Lost connection to MySQL server during query [begin;]
8395:20161109:175408.024 [Z3001] connection to database 'zabbix_database_name_here' failed: [2003] Can't connect to MySQL server on 'zabbix_database_host_here' (111)
8395:20161109:175408.024 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...

After digging around for a bit, it seems to be a widely reported issue, related to the Zabbix server using the same database server that one of its agents is monitoring (here is an example bug report).

Not having enough time to troubleshoot and fix it properly, I decided for the time being to use another monitoring tool – monit – to keep an eye on the Zabbix server process and restart it if it goes down. After “yum install monit”, the following was dropped into /etc/monit.d/zabbix:

check process zabbix_server with pidfile /var/run/zabbix/zabbix_server.pid
    start program = "/sbin/service zabbix-server start" with timeout 60 seconds
    stop program = "/sbin/service zabbix-server stop"

Start the monit service, make sure it also starts at boot, and watch it in action via /var/log/monit:

[UTC Nov 20 20:49:18] error    : 'zabbix_server' process is not running
[UTC Nov 20 20:49:18] info     : 'zabbix_server' trying to restart
[UTC Nov 20 20:49:18] info     : 'zabbix_server' start: /sbin/service
[UTC Nov 20 20:50:19] info     : 'zabbix_server' process is running with pid 28941
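
For the "start it and make sure it starts at boot" part, here is a minimal sketch, assuming a SysV-style init (the config above already relies on /sbin/service). DRY_RUN, run and enable_monit are names made up for this example; with the default DRY_RUN=1 the commands are only printed, so nothing is touched until you switch it off.

```shell
#!/bin/bash
# Sketch: check the monit config, start monit, and enable it at boot.
# DRY_RUN is a hypothetical switch for this example only: when it is 1
# (the default here), commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_monit() {
    run monit -t &&                   # syntax check of the control file
    run /sbin/service monit start &&  # start the daemon now
    run /sbin/chkconfig monit on &&   # start it at every boot
    run monit summary                 # list monitored services and their state
}

enable_monit
```

Run it with DRY_RUN=0 as root to actually execute the commands. On a systemd host the service and chkconfig lines would need their systemctl equivalents.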

The chances of both systems failing at once are slim, so I think this will buy me some time.

Fixing “InnoDB: Error: log file ./ib_logfile0 is of different size”

For the last few days I’ve been moving MySQL databases around at work. Being in a bit of a rush and overconfident (I have backups!), I was simply detaching the /var/lib/mysql volume on one host (running Amazon AMI and MySQL) and attaching it to another host (running CentOS 7 and MariaDB).

It’s not surprising that I got this error: “InnoDB: Error: log file ./ib_logfile0 is of different size”. Thankfully, this ServerFault thread provided enough hints for me to solve the problem. In a nutshell:

  1. Temporarily comment out the InnoDB log file size setting (e.g.: innodb_log_file_size = 64M) in /etc/my.cnf.
  2. Set innodb_fast_shutdown to 0 in the same file, so that InnoDB does a slow, clean shutdown (read more).
  3. Restart the MySQL service once or twice.
  4. Uncomment the log file size setting.
  5. Set InnoDB fast shutdown back to default or remove it from your my.cnf altogether.
  6. Celebrate!
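
The my.cnf edits from the steps above can be sketched with sed. This is only an illustration under a few assumptions: MY_CNF, prepare and restore are names made up for this example, MY_CNF defaults to a scratch file with sample content rather than the real /etc/my.cnf, and sed -i is used the GNU way (as found on CentOS).

```shell
#!/bin/bash
# Sketch of the config edits only. MY_CNF is a hypothetical variable that
# defaults to a scratch file, so the example is safe to try as-is.
MY_CNF=${MY_CNF:-$(mktemp)}

# Sample content, written only when the file is empty.
[ -s "$MY_CNF" ] || printf '[mysqld]\ninnodb_log_file_size = 64M\n' > "$MY_CNF"

# Steps 1 and 2: comment out the log file size and request a slow shutdown.
prepare() {
    sed -i -e 's/^innodb_log_file_size/#&/' \
           -e '/^\[mysqld\]/a innodb_fast_shutdown = 0' "$MY_CNF"
}

# Steps 4 and 5: restore the log file size and drop the shutdown override.
restore() {
    sed -i -e 's/^#\(innodb_log_file_size\)/\1/' \
           -e '/^innodb_fast_shutdown/d' "$MY_CNF"
}

prepare
cat "$MY_CNF"
# Step 3 goes here: restart the database service once or twice.
restore
cat "$MY_CNF"
```

Pointing MY_CNF at the real file should only happen after taking a copy of it, since the edits are done in place.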

Knowing how little I learn from my own mistakes, I’m sure I’ll find this post useful in the future.

Shell parameter expansion : default values for shell script parameters

When writing shell scripts, it’s often useful to accept some command line parameters.  It’s even more useful to have some defaults for those parameters.  Until now I’ve been using if statements to check if the parameter was empty, and if it was, to set it to the default value.  Something like this:

#!/bin/bash

DB_HOST=$1
DB_NAME=$2
DB_USER=$3
DB_PASS=$4

if [ -z "$DB_HOST" ]
then
    DB_HOST="localhost"
fi

if [ -z "$DB_NAME" ]
then
    DB_NAME="wordpress"
fi

if [ -z "$DB_USER" ]
then
    DB_USER="root"
fi

echo "Connecting to the database:"
echo "Host: $DB_HOST"
echo "Name: $DB_NAME"
echo "User: $DB_USER"
echo "Pass: $DB_PASS"

It turns out there is a much more elegant way to do this with shell parameter expansion.  Here is how it looks rewritten:

#!/bin/bash

# Use the default when the parameter is unset or empty, like the -z checks did
DB_HOST=${1:-localhost}
DB_NAME=${2:-wordpress}
DB_USER=${3:-root}
DB_PASS=$4

echo "Connecting to the database:"
echo "Host: $DB_HOST"
echo "Name: $DB_NAME"
echo "User: $DB_USER"
echo "Pass: $DB_PASS"

This is so much better. Not only is the script itself shorter, but it’s also much more obvious what is going on. Copy-paste errors are much less likely to happen here too.
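
One detail worth knowing: the expansion comes in two forms. ${1-default} applies the default only when the parameter is unset, while ${1:-default} also applies it when the parameter is set but empty, which is what a -z check does. A small comparison, where show_default is a helper made up for this example:

```shell
#!/bin/bash
# show_default is a hypothetical helper that prints both expansion forms
# for its first argument, so the difference is visible side by side.
show_default() {
    echo "dash: [${1-localhost}] colon-dash: [${1:-localhost}]"
}

show_default            # dash: [localhost] colon-dash: [localhost]
show_default ""         # dash: [] colon-dash: [localhost]
show_default db.local   # dash: [db.local] colon-dash: [db.local]
```

So if an empty string on the command line should still fall back to the default, the colon form is the one to use.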

I wish I had learned about this sooner.

How to Read and Improve the C.R.A.P Index of your code


Levi Hackwith has an excellent post explaining “How to Read and Improve the C.R.A.P Index of your code”:

The C.R.A.P. (Change Risk Analysis and Predictions) index is designed to analyze and predict the amount of effort, pain, and time required to maintain an existing body of code.

It reiterates the old bits of wisdom – write simpler code and cover it with unit tests – but it does so in a very simple and measurable way.
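
To make "measurable" concrete: as far as I know, the score is usually computed per method from its cyclomatic complexity and its test coverage percentage, as comp^2 * (1 - cov/100)^3 + comp. Here is a rough sketch of that arithmetic, not taken from the post itself; crap_score is a name made up for this example, and awk does the floating point math.

```shell
#!/bin/bash
# crap_score COMPLEXITY COVERAGE_PERCENT
# Hypothetical helper: prints comp^2 * (1 - cov/100)^3 + comp,
# rounded to two decimal places.
crap_score() {
    awk -v comp="$1" -v cov="$2" \
        'BEGIN { printf "%.2f\n", comp * comp * (1 - cov / 100) ^ 3 + comp }'
}

crap_score 5 0      # untested method:      30.00
crap_score 5 100    # fully covered method: 5.00
crap_score 10 50    # complex, half tested: 22.50
```

Full coverage brings the score down to the complexity itself, so the only way to go lower is simpler code. Scores above 30 are often treated as a warning sign.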

He also reminds us that:

…software metrics, in general, are just tools. No single metric can tell the whole story; it’s just one more data point. Metrics are meant to be used by developers, not the other way around – the metric should work for you, you should not have to work for the metric. Metrics should never be an end unto themselves. Metrics are meant to help you think, not to do the thinking for you. ~Alberto Savoia