Downdetector – a weatherman for the digital world

Downdetector is yet another one of those services that monitor major web services and provides and lets you see if any of them is experiencing any issues or outages.

You can search for specific providers or browse by company or issue type.  There’s also a weekly top 10.  What I like in particular are comments for each report, where you can get some feedback from other users experiencing the problem.

 

PagerDuty Incident Response Documentation

PagerDuty shares their Incident Response Documentation:

This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).

I think this is a goldmine for anybody involved with incident response teams, operations, monitoring, technical support, network centers, and other similar setups.  Not only it covers the specific steps and expectations during different situations, but it also defines the culture, which the company is trying to built.

I wish I had this 15 years ago when I was involved in setting up the Network Operations Center (NOC).  I will definitely use it in the near future, when we’ll be setting up the support department at work.

Conky – light-weight system monitor for X

Conky is a light-weight system monitor for X.  It supports all kinds of metrics – anything from CPU, memory and network, to emails, music players, and more.

It reminds me of the old days, before Gnome and KDE took over the desktop environments – I think everybody had something similar running as part of the screen background.

The installation on Fedora is trivial – conky is packaged and available with a simple “yum install conky“.  The configuration, on the other hand, is not so much.  GitHub repository provides quite a few fancy user configurations, but there was a change in configuration file format in the version 1.10, and things aren’t as smooth as I would like.

It’ll take a bit of playing around, but I’m sure I’ll eventually lose enough sleep over this to just give up and have something semi-decent on my screen.

Monitoring the monitoring : keeping Zabbix server service up

After our recent MySQL migrations, I started getting a weird issue – Zabbix server process was crashing periodically (several times a day).

8395:20161109:175408.023 [Z3005] query failed: [2013] Lost connection to MySQL server during query [begin;]
8395:20161109:175408.024 [Z3001] connection to database 'zabbix_database_name_here' failed: [2003] Can't connect to MySQL server on 'zabbix_database_host_here' (111)
8395:20161109:175408.024 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...

Digging around for a bit, it seems like a widely reported issue, related Zabbix server using the same database connection as one of its agents is monitoring (here is an example bug report).

Not having enough time to troubleshoot and fix it properly, I decided for the time being to use another monitoring tool – monit – to keep an eye on the Zabbix server process and restart it, if it’s down.  After “yum install monit“, the following was dropped into /etc/monit.d/zabbix:

check process zabbix_server with pidfile /var/run/zabbix/zabbix_server.pid
    start program = "/sbin/service zabbix-server start" with timeout 60 seconds
    stop program = "/sbin/service zabbix-server stop"

Start the monit service, make sure it also starts at boot, and watch it in action via the /var/log/monit:

[UTC Nov 20 20:49:18] error    : 'zabbix_server' process is not running
[UTC Nov 20 20:49:18] info     : 'zabbix_server' trying to restart
[UTC Nov 20 20:49:18] info     : 'zabbix_server' start: /sbin/service
[UTC Nov 20 20:50:19] info     : 'zabbix_server' process is running with pid 28941

The chances of both systems failing at once are slim, so I think this will buy me some time.