Zabbix in the Munich subway

The Zabbix blog shares a very inspiring post, “Zabbix in the subway. Munich Transport Corporation Case Study”, which shows how Zabbix is used to monitor the trains and trams in Munich. Here are some implementation details to get you started:

Currently, there are 3796 devices monitored by the Zabbix server, which also hosts the MySQL database and the web front-end. This server runs as a virtual machine with 8 CPUs and 32 GB RAM, backed by a SAS storage system. The Zabbix server queries 105818 items from those devices, and 23820 triggers detect whether certain items deviate from their target state. This results in 298.48 NVPS (new values per second), with an average of approx. 7 people actively using the system concurrently.

General overview:
* Each device inside a tram/subway is treated as a host and is monitored for availability.
* Each tram/subway is managed as a host group.
* Host groups are nested and organized by the tram/subway lines (using the host group nesting feature introduced in Zabbix 3.2).
* All devices depend on the connectivity of the MRCU (Mobile Radio Control Unit, in subways) or the LTE router (4G LTE connectivity, in trams).
* Maps are automatically created for each tram/subway (using the Zabbix API).
* Maps use sub-maps to link to a specific tram/subway view.
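
As a rough illustration of the "maps are created automatically via the Zabbix API" point, here is a minimal Python sketch that builds the JSON-RPC request body for the API's map.create method, with one map element per device in a vehicle. It is not the code used in Munich. The function name, the map size, the grid layout, and the icon ID are made up for the example, and it assumes a Zabbix version where the session token is passed in the "auth" field of the request body and host elements are given as an "elements" list.

```python
import json


def build_map_create_request(vehicle_name, host_ids, auth_token, request_id=1):
    """Build a JSON-RPC 2.0 body for the Zabbix API method map.create.

    Hypothetical helper: layout, sizes and icon ID are placeholders.
    """
    selements = []
    for index, host_id in enumerate(host_ids):
        selements.append({
            "elementtype": 0,                      # 0 = host element
            "elements": [{"hostid": str(host_id)}],
            "iconid_off": "2",                     # placeholder icon ID (assumption)
            "x": 50 + 120 * (index % 5),           # simple grid, 5 icons per row
            "y": 50 + 120 * (index // 5),
        })
    return {
        "jsonrpc": "2.0",
        "method": "map.create",
        "params": {
            "name": f"Vehicle {vehicle_name}",
            "width": 800,
            "height": 600,
            "selements": selements,
        },
        "auth": auth_token,                        # token obtained from a prior login call
        "id": request_id,
    }


if __name__ == "__main__":
    # Host IDs and token below are invented for the demonstration.
    body = build_map_create_request("2101", [10101, 10102], "example-token")
    print(json.dumps(body, indent=2))
```

The resulting body would be sent as an HTTP POST to the front-end's api_jsonrpc.php endpoint. Keeping the payload construction separate from the HTTP call makes it easy to generate one map per tram or subway in a loop.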

There’s also a video from the Zabbix conference, presenting the case study.

Zabbix 4.2 is out!

Zabbix 4.2 has been released and it brings an impressive array of new features and improvements. Some of these are:

  • Built-in support of Prometheus data collection
  • Efficient high-frequency monitoring
  • Validation of collected data and error handling
  • Preprocessing data with JavaScript
  • Test preprocessing rules from UI
  • Test media type from Web UI
  • Support of TimescaleDB
  • Simplified tag management
  • More flexible auto-registration
  • Support for HTML emails
  • Animations and easy external services access on network maps
  • Extracting data from HTTP headers (like authentication tokens)
  • Non-destructive resizing and reordering of dashboard widgets
  • … and a lot more

If you were waiting for a good reason to upgrade – this is it!

Zabbix : No more flapping. Define triggers the smart way.

“No more flapping. Define triggers the smart way.” is a very useful article from the Zabbix Weblog on how to set up sensible, flapping-aware triggers in Zabbix.

I’m sure every single person on this planet has a limit to how many up and down notifications he can receive …
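
One common way to make a trigger flapping-aware is hysteresis: the problem fires above one threshold, but only recovers once the value drops below a second, lower threshold. The Python sketch below illustrates that general idea only. It does not reproduce the article's trigger expressions or Zabbix trigger syntax, and the thresholds and sample values are invented.

```python
def next_state(in_problem, value, problem_above, recover_below):
    """Return the new problem state for one incoming value."""
    if in_problem:
        # Stay in the problem state until the value falls below the recovery level.
        return value >= recover_below
    # Enter the problem state only when the value exceeds the problem level.
    return value > problem_above


def count_state_changes(values, problem_above, recover_below):
    """Count how many notifications (state changes) a series of values would cause."""
    in_problem = False
    changes = 0
    for value in values:
        new_state = next_state(in_problem, value, problem_above, recover_below)
        if new_state != in_problem:
            changes += 1
        in_problem = new_state
    return changes


if __name__ == "__main__":
    samples = [85, 92, 88, 91, 89, 93, 65]   # hypothetical CPU load, in percent
    # Single threshold at 90: every crossing produces a notification.
    print(count_state_changes(samples, problem_above=90, recover_below=90))  # 6
    # Hysteresis: problem above 90, recovery only below 70.
    print(count_state_changes(samples, problem_above=90, recover_below=70))  # 2
```

With a single threshold the value hovering around 90 produces six up/down notifications. With the wider recovery gap the same data produces only two.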

Monitoring the monitoring : keeping Zabbix server service up

After our recent MySQL migrations, I started seeing a weird issue – the Zabbix server process was crashing periodically (several times a day).

8395:20161109:175408.023 [Z3005] query failed: [2013] Lost connection to MySQL server during query [begin;]
8395:20161109:175408.024 [Z3001] connection to database 'zabbix_database_name_here' failed: [2003] Can't connect to MySQL server on 'zabbix_database_host_here' (111)
8395:20161109:175408.024 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...

After digging around for a bit, this looks like a widely reported issue, related to the Zabbix server using the same database connection that one of its agents is monitoring (here is an example bug report).

Not having enough time to troubleshoot and fix it properly, I decided for the time being to use another monitoring tool – monit – to keep an eye on the Zabbix server process and restart it if it’s down. After “yum install monit”, the following was dropped into /etc/monit.d/zabbix:

check process zabbix_server with pidfile /var/run/zabbix/
    start program = "/sbin/service zabbix-server start" with timeout 60 seconds
    stop program = "/sbin/service zabbix-server stop"

Start the monit service, make sure it also starts at boot, and watch it in action via /var/log/monit:

[UTC Nov 20 20:49:18] error    : 'zabbix_server' process is not running
[UTC Nov 20 20:49:18] info     : 'zabbix_server' trying to restart
[UTC Nov 20 20:49:18] info     : 'zabbix_server' start: /sbin/service
[UTC Nov 20 20:50:19] info     : 'zabbix_server' process is running with pid 28941

The chances of both systems failing at once are slim, so I think this will buy me some time.
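
For the curious, the kind of check that a pidfile-based watchdog performs can be sketched in a few lines of Python. This is only an illustration of the idea, not monit's actual implementation. The function name is hypothetical, and it assumes a POSIX system where sending signal 0 tests whether a process exists without affecting it.

```python
import os


def pidfile_process_is_running(pidfile_path):
    """Return True if the pidfile names a process that currently exists."""
    try:
        with open(pidfile_path) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pidfile: treat as not running.
        return False
    if pid <= 0:
        # Non-positive values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

A watchdog would call this periodically and run the service's start command whenever it returns False, which is what the log excerpt above shows monit doing.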