Here’s something I’ve wanted to get into for a while now but haven’t had the time for yet: switching the monitoring / alerting system from server-oriented to business-oriented. The gist of the story is:
If it’s not actionable and business critical, then it shouldn’t ring.
The article has some statistics and summaries as well. The reasoning behind the switch is obvious, but it’s good to have it formulated:
After a few months, I can tell that reducing our alerting rate should have been a top priority before things got out of hand, for a few reasons.
- Constant alerts prevented the team from focusing on what was important. Being interrupted even for things that can wait for a few hours lowers our productivity when we work on things that can’t wait.
- Being awakened every night, several times a night, exhausts a team and makes people less productive during the day and more prone to making errors.
- Too many off-hours interventions cost the company a lot of money that could be invested in hardening the infrastructure or hiring someone else instead.