GitLab horror story: backup/restore failure

As I read this story (GitLab.com melts down after wrong directory deleted, backups fail) and these details, every single hair I have stands on end.  I don’t (and didn’t) have any data on GitLab, so I haven’t lost anything.  But as somebody who worked as a system administrator (and backup administrator) for years, I can imagine the physical and psychological state of the team all too well.

Sure, things could have been done better.  But that’s easier said than done.  Modern technology is very complex.  And it changes fast.  And businesses want to move fast too.  And the proper resources (time, money, people) are not always allocated for mission-critical tasks.  One thing is for sure: the responsibility lies with a whole bunch of people for a whole bunch of decisions.  But the hardest job right now falls on the tech people, who have to bring back whatever they can.  There’s no sleep.  Probably no food.  No fun.  And tremendous pressure all around.

I wish the guys and gals at GitLab super good luck.  Hopefully they will find a snapshot to restore from, and this whole thing will calm down and sort itself out.  Stay strong!

And I guess I’ll be doing test restores all night tonight, making sure that all my things are covered…

Update: you can now read the full post-mortem as well.

Choosing the “best software”

Julia Evans has a nice blog post about choosing the “best software”.  Here is my favorite part:

So, let’s talk about another way to think about making decisions than “what is the Best Thing in this situation”.

I run an event series called “lightning talks and pie”. At the most recent one, Ines Sombra gave a talk about capacity planning. In it, she said that there are 3 reasons you might want to change something about your system:

  1. It’s too expensive
  2. It’s too difficult to operate (humans spend a ton of time worrying about it)
  3. It’s not doing the job it’s supposed to

I find these 3 criteria a lot easier to reason about than the “Choose The Best Thing” framework.

She provides some examples of how to apply this thinking, as well as how to deal with tradeoffs and limitations.

Immutable Infrastructure with AWS and Ansible

Immutable infrastructure is a very powerful concept that brings stability, efficiency, and fidelity to your applications through automation and the use of successful patterns from programming.  The general idea is that you never make changes to running infrastructure.  Instead, you ensure that all infrastructure is created through automation, and to make a change, you simply create a new version of the infrastructure, and destroy the old one.

“Immutable Infrastructure with AWS and Ansible” is a three-part (so far) article series (part 1, part 2, part 3) that shows how to use Ansible to achieve immutable infrastructure on the Amazon Web Services cloud.

It covers everything from the basic setup of a workstation for running Ansible playbooks, all the way to AWS security (users, roles, security groups), deployment of resources, and auto-scaling.
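
The series goes into much more depth, but to give a rough flavour of the “create new, destroy old” idea, here is a minimal Ansible sketch of my own (not taken from the articles).  It uses the classic ec2_ami / ec2_lc / ec2_asg modules; all the names, sizes, and the build_instance_id variable are made up for illustration:

```yaml
---
# A minimal, hypothetical "immutable rollout" play: bake a new AMI from a
# freshly provisioned build instance, create a new launch configuration
# for it, and rotate the auto-scaling group so old instances are replaced
# rather than modified in place.
- hosts: localhost
  connection: local
  gather_facts: false
  vars:
    app_version: "1.2.3"        # hypothetical version tag
    region: us-east-1
  tasks:
    - name: Bake a new AMI from the freshly provisioned build instance
      ec2_ami:
        instance_id: "{{ build_instance_id }}"   # assumed to be created earlier in the pipeline
        name: "myapp-{{ app_version }}"
        wait: true
        region: "{{ region }}"
      register: baked_ami

    - name: Create a new launch configuration pointing at the new AMI
      ec2_lc:
        name: "myapp-lc-{{ app_version }}"
        image_id: "{{ baked_ami.image_id }}"
        instance_type: t2.micro
        security_groups: ["myapp-sg"]            # hypothetical security group
        region: "{{ region }}"

    - name: Roll the auto-scaling group over to the new launch configuration
      ec2_asg:
        name: myapp-asg
        launch_config_name: "myapp-lc-{{ app_version }}"
        min_size: 2
        max_size: 4
        desired_capacity: 2
        replace_all_instances: true              # old instances are terminated, never mutated
        region: "{{ region }}"
```

The point is that no task ever logs into a running instance to change it: a new image gets built, new instances come up from it, and the old ones go away.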

10 things to avoid in Docker containers

10 things to avoid in Docker containers provides a handy reminder of what NOT to do when building Docker containers.  Read the full article for details and explanations.  For a brief summary, here are the 10 things, with a small sketch after the list illustrating a couple of them:

  1. Don’t store data in containers
  2. Don’t ship your application in two pieces
  3. Don’t create large images
  4. Don’t use a single layer image
  5. Don’t create images from running containers
  6. Don’t use only the “latest” tag
  7. Don’t run more than one process in a single container
  8. Don’t store credentials in the image. Use environment variables
  9. Don’t run processes as a root user
  10. Don’t rely on IP addresses
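
Since a couple of these are easy to show, here is a tiny hypothetical sketch (in the Ansible spirit of the rest of this post) of starting a container in a way that respects points 1, 6, and 8: a pinned image tag instead of “latest”, credentials injected as environment variables at run time, and data kept on the host rather than inside the container.  The docker_container module is real; all the names, paths, and the vault_db_password variable are made up:

```yaml
---
# Hypothetical example: run a container with a pinned tag, externalized
# credentials, and data stored outside the container.
- hosts: docker_hosts
  tasks:
    - name: Run the application container
      docker_container:
        name: myapp
        image: "myorg/myapp:1.4.2"                 # point 6: pin a specific tag, don't rely only on "latest"
        state: started
        env:
          DB_PASSWORD: "{{ vault_db_password }}"   # point 8: credentials come from outside the image
        volumes:
          - "/srv/myapp/data:/var/lib/myapp"       # point 1: data lives in a host volume, not the container
```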

Parsing text printouts within Ansible playbooks

I’m sure this will come in handy soon, and without this article I would be spending way too much time trying to figure it out: Parsing text printouts within Ansible playbooks.

It’s not every day that you see regular expression examples in Ansible playbooks…
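
To save myself the lookup later, here is the general idea in a tiny sketch of my own (the uptime command and the load-average pattern are just an example I picked, not something from the article): register a command’s text output and pull values out of it with the regex_search filter.

```yaml
---
# Hypothetical example: capture a plain-text printout and parse a value
# out of it with Ansible's regex_search filter.
- hosts: all
  gather_facts: false
  tasks:
    - name: Grab a text printout to parse
      command: uptime
      register: uptime_out
      changed_when: false

    - name: Pull the 1-minute load average out of the text
      set_fact:
        load_1min: "{{ uptime_out.stdout | regex_search('load averages?: ([0-9.]+)', '\\1') | first }}"

    - name: Show the parsed value
      debug:
        msg: "1-minute load average is {{ load_1min }}"
```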