PagerDuty Incident Response Documentation

PagerDuty shares their Incident Response Documentation:

This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).

I think this is a goldmine for anybody involved with incident response teams, operations, monitoring, technical support, network operations centers, and other similar setups.  Not only does it cover the specific steps and expectations during different situations, but it also defines the culture that the company is trying to build.

I wish I had this 15 years ago when I was involved in setting up a Network Operations Center (NOC).  I will definitely use it in the near future, when we set up the support department at work.

Dissecting an SSL certificate

Julia Evans does it again.  If you ever wanted to understand SSL certificates, her post “Dissecting an SSL certificate” is for you.   This part made me smile:

Picking the right settings for your SSL certificates and SSL configuration on your webserver is confusing. As far as I understand it there are about 3 billion settings. Here is an example of an SSL Labs result for mail.google.com. There is all this stuff like OLD_TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256 on that page (for real, that is a real thing.). I’m happy there are tools like SSL Labs that help mortals make sense of all of it.
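If you want to poke at a certificate yourself, here is a tiny Python sketch of my own (not from her post) that grabs a server's certificate and prints the fields the article walks through: subject, issuer, validity dates, and the SAN list.  The hostname is just an example.

```python
# Quick sketch: fetch a server certificate with Python's standard library
# and print the fields Julia's article dissects.  Swap in any hostname.
import socket
import ssl

hostname = "mail.google.com"
context = ssl.create_default_context()

with socket.create_connection((hostname, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()  # the verified leaf certificate, as a dict
        print("Negotiated:", tls.version(), tls.cipher())  # protocol + cipher suite

print("Subject:   ", dict(x[0] for x in cert["subject"]))
print("Issuer:    ", dict(x[0] for x in cert["issuer"]))
print("Valid from:", cert["notBefore"])
print("Valid to:  ", cert["notAfter"])
print("SANs:      ", [v for k, v in cert.get("subjectAltName", []) if k == "DNS"])
```

This only shows the leaf certificate; for the full chain and the raw ASN.1 that her post decodes byte by byte, openssl is still the tool to reach for.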

GitLab horror story: backup / restore failure

As I read this story – GitLab.com melts down after wrong directory deleted, backups fail – and these details, every single hair I have stands on end.  I don’t (and didn’t) have any data on GitLab, so I haven’t lost anything.  But as somebody who worked as a system administrator (and backup administrator) for years, I can imagine the physical and psychological state of the team all too well.

Sure, things could have been done better.  But that’s easier said than done.  Modern technology is very complex.  And it changes fast.  And businesses want to move fast too.  And the proper resources (time, money, people) are not always allocated for mission-critical tasks.  One thing is for sure: the responsibility lies with a whole bunch of people for a whole bunch of decisions.  But the hardest job right now falls on the tech people, who have to bring back whatever they can.  There’s no sleep.  Probably no food.  No fun.  And tremendous pressure all around.

I wish the guys and gals at GitLab super good luck.  Hopefully they will find a snapshot to restore from, and this whole thing will calm down and sort itself out.  Stay strong!

And I guess I’ll be doing test restores all night tonight, making sure that all my things are covered…
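For what it’s worth, here is roughly what I mean by a test restore, as a small Python sketch.  The paths and thresholds are made up for illustration: unpack the newest archive into a scratch directory and make sure something usable actually comes out, instead of trusting that the cron job ran.

```python
# Minimal sketch of a test restore (hypothetical paths and thresholds):
# the only backup that counts is one you have actually restored.
import os
import tarfile
import tempfile
import time

BACKUP_DIR = "/var/backups/myapp"  # hypothetical backup location
MAX_AGE_HOURS = 26                 # alert if the newest backup is older than this

archives = sorted(
    (os.path.join(BACKUP_DIR, f) for f in os.listdir(BACKUP_DIR) if f.endswith(".tar.gz")),
    key=os.path.getmtime,
)
if not archives:
    raise SystemExit("FAIL: no backup archives found at all")

latest = archives[-1]
age_hours = (time.time() - os.path.getmtime(latest)) / 3600
if age_hours > MAX_AGE_HOURS:
    raise SystemExit(f"FAIL: newest backup {latest} is {age_hours:.1f} hours old")
if os.path.getsize(latest) == 0:
    raise SystemExit(f"FAIL: {latest} is zero bytes, which is not a backup")

# The actual test restore: unpack into a throwaway directory and count files.
with tempfile.TemporaryDirectory() as scratch:
    with tarfile.open(latest) as tar:
        tar.extractall(scratch)
    restored = sum(len(files) for _, _, files in os.walk(scratch))

if restored == 0:
    raise SystemExit("FAIL: archive unpacked to nothing")
print(f"OK: {latest} ({age_hours:.1f}h old) restored {restored} files into a scratch dir")
```

For a database the idea is the same, just heavier: load the dump into a scratch instance and run a few sanity queries, because a backup job that exits cleanly proves nothing on its own.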

Update: you can now read the full post-mortem as well.