Tao of Backup


Tao of Backup is yet another way to tell people to back up their files.  Not only does it explain why it is important, but also how to do it properly.  My favorite chapter is the one on testing:

The novice asked the backup master: “Master, now that my backups have good coverage, are taken frequently, are archived, and are distributed to the four corners of the earth, I have supreme confidence in them. Have I achieved enlightenment? Surely now I comprehend the Tao Of Backup?” The master paused for one minute, then suddenly produced an axe and smashed the novice’s disk drive to pieces. Calmly he said: “To believe in one’s backups is one thing. To have to use them is another.”

The novice looked very worried.

Funny, but so true.
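In that spirit, the only backup I trust is one I’ve actually restored.  Here’s a minimal sketch of the kind of check I mean – restore the latest archive into a temporary directory and compare checksums against the live files.  The paths and archive layout below are made up; adjust them to your own setup:

    import hashlib
    import subprocess
    import tempfile
    from pathlib import Path

    SOURCE = Path("/var/www")            # hypothetical: the directory we back up
    ARCHIVE = "/backups/www.tar.gz"      # hypothetical: the latest backup archive

    def sha256(path: Path) -> str:
        """Checksum a file in chunks so large files don't eat all the RAM."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    with tempfile.TemporaryDirectory() as tmp:
        # Restore the archive somewhere harmless.
        subprocess.run(["tar", "xzf", ARCHIVE, "-C", tmp], check=True)
        restored_root = Path(tmp) / SOURCE.relative_to("/")

        # Compare every live file against its restored copy.
        for live in SOURCE.rglob("*"):
            if live.is_file():
                restored = restored_root / live.relative_to(SOURCE)
                if not restored.is_file() or sha256(live) != sha256(restored):
                    print(f"MISMATCH: {live}")

It won’t protect you from the master’s axe, but at least you’ll know the archive actually opens and the files in it match reality.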

IPv6 20th birthday with 10% global penetration

Here’s some not-so-light coffee-time reading on IPv6 – IPv6 non-alternatives: DJB’s article, 13 years later – an article that links, among other things, to this Ars Technica article, which features some IPv6 statistics.  Summary?  Sure.  The IPv6 RFC celebrates its 20th birthday this month with 10% global penetration.

[Graph: IPv6 adoption statistics]

Exponential year-on-year growth is good, but the absolute numbers aren’t so bright yet – especially considering some of the areas where IPv6 hasn’t been so successful.

5 AWS mistakes you should avoid

“5 AWS mistakes you should avoid” is a rather opinionated piece on what you should and shouldn’t do with your infrastructure, especially when using AWS.  Here’s an example:

A typical web application consists of at least:

  • load balancer
  • scalable web backend
  • database

and looks like the following figure.

[Figure: a typical web application – load balancer, web backend, database]

This pattern is very common, and if yours looks different you should have (strong) reasons.

It’s all good advice in there, but it comes from a very narrow perspective.  The “mistakes” are:

  • managing infrastructure manually
  • not using Auto Scaling Groups
  • not analyzing metrics in CloudWatch (see the sketch after this list)
  • ignoring Trusted Advisor
  • underutilizing virtual machines
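Of these, the CloudWatch one is probably the easiest to start acting on.  Here’s a minimal boto3 sketch – the instance ID and region are placeholders – that pulls a day’s worth of average CPU utilization for a single instance:

    from datetime import datetime, timedelta, timezone

    import boto3

    # Placeholder region and instance ID -- substitute your own.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    end = datetime.now(timezone.utc)
    start = end - timedelta(days=1)

    # Average CPU utilization for one instance, in 5-minute buckets.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )

    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1))

Nothing fancy, but it’s the difference between guessing and knowing whether those virtual machines are actually doing any work.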

Amazon Makes It Almost Impossible To Calculate Their “Virtual CPU” Equivalent

So, it looks like I’m not the only one trying to figure out Amazon EC2 virtual CPU allocation.  Slashdot runs the story (and a heated debate, as usual) on the subject of Amazon’s non-definitive virtual CPUs:

ECU’s were not the simplest approach to describing a virtual CPU, but they at least had a definition attached to them. Operations managers and those responsible for calculating server pricing could use that measure for comparison shopping. But ECUs were dropped as a visible and useful definition without announcement two years ago in favor of a descriptor — virtual CPU — that means, mainly, whatever AWS wants it to mean within a given instance family.

A precise number of ECUs in an instance has become simply a “virtual CPU.”

Amazon EC2 t2.nano instances

If you thought t2.micro was a tiny machine, I have news for you – Amazon has announced the t2.nano instance type.  It features 512 MB of RAM, 1 vCPU, and up to two Elastic Network Interfaces.  The on-demand price is $0.0065 per hour.

This instance type is perfect for small websites, developer and testing environments, and other tasks which don’t require a lot of resources.
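At $0.0065 per hour that works out to roughly $4.75 for a full 730-hour month.  And if you want to try one, launching it from boto3 takes just a few lines – the AMI ID, key pair, and region below are placeholders:

    import boto3

    ec2 = boto3.resource("ec2", region_name="eu-west-1")

    # Placeholders: use a real AMI ID and key pair from your own account.
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",
        InstanceType="t2.nano",
        KeyName="my-key",
        MinCount=1,
        MaxCount=1,
    )

    print("Launched:", instances[0].id)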

CPU Steal Time. Now on Amazon EC2

Yesterday I wrote a blog post trying to figure out what CPU steal time is and why it occurs.  The problem with that post was that I didn’t go deep enough.

I was looking at the issue from the point of view of a generic virtual machine, but the case I had to deal with wasn’t quite that: I saw the CPU steal time on an Amazon EC2 instance.  Assuming it was just my neighbors acting up, or Amazon having a temporary hardware issue, was the wrong conclusion.

That’s because I didn’t know enough about Amazon EC2.  Well, I’ve learned a bunch since then, so here’s what I found.

Continue reading “CPU Steal Time. Now on Amazon EC2”

NAS Performance: NFS vs Samba vs GlusterFS

I came across this question and also found the results of the benchmarks somewhat surprising.

  • GlusterFS replicated 2: 32-35 seconds, high CPU load
  • GlusterFS single: 14-16 seconds, high CPU load
  • GlusterFS + NFS client: 16-19 seconds, high CPU load
  • NFS kernel server + NFS client (sync): 32-36 seconds, very low CPU load
  • NFS kernel server + NFS client (async): 3-4 seconds, very low CPU load
  • Samba: 4-7 seconds, medium CPU load
  • Direct disk: < 1 second

The post is from 2012, so I’m curious whether this is still accurate.  Has anybody tried this?  Can anyone confirm or refute it?
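If anyone does want to re-run it, here’s roughly the kind of test I’d start with, assuming the benchmark boils down to writing a pile of small files onto each mount point.  The mount points and file counts are made up, and the original post’s exact methodology may well differ:

    import os
    import time

    # Hypothetical mount points -- one per setup being compared.
    MOUNTS = ["/mnt/nfs", "/mnt/samba", "/mnt/gluster"]
    FILES = 1000          # number of small files to write
    SIZE = 16 * 1024      # 16 KB each
    payload = os.urandom(SIZE)

    for mount in MOUNTS:
        target = os.path.join(mount, "bench")
        os.makedirs(target, exist_ok=True)

        start = time.monotonic()
        for i in range(FILES):
            with open(os.path.join(target, f"file_{i}"), "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())    # push the write out so caching doesn't lie
        elapsed = time.monotonic() - start

        print(f"{mount}: {FILES} x {SIZE // 1024} KB files in {elapsed:.1f}s")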

Also, an interesting note from the answer to the above:

From what I’ve seen after a couple of packet captures, the SMB protocol can be chatty, but the latest version of Samba implements SMB2 which can both issue multiple commands with one packet, and issue multiple commands while waiting for an ACK from the last command to come back. This has vastly improved its speed, at least in my experience, and I know I was shocked the first time I saw the speed difference too – Troubleshooting Network Speeds — The Age Old Inquiry


How Far Can You Go With HAProxy and a t2.micro

Here’s an interesting set of experiments trying to answer the question of how far you can go with an HAProxy setup on the smallest of the Amazon EC2 instances – t2.micro (1 virtual CPU, 1 GB of RAM).  Here’s the summary:

460 requests/second

At 460 req/second response times are mostly a flat ~300 ms, except for two spikes. I attribute this to TCP congestion avoidance as the traffic approaches the limit and packets start to get dropped. After dropped packets are detected the clients reduce their transmission rate, but eventually the transmission rate stabilizes again just under the limit. Only 1739 requests timeout and 134918 succeed.

[…]

It seems that the limit of the t2.micro is around 500 req/second even for small responses.
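For a rough sanity check of numbers like these you don’t even need a dedicated load generator.  Here’s a hedged sketch using nothing but Python’s standard library – the URL, request count, and concurrency are arbitrary:

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://my-haproxy.example.com/"   # hypothetical endpoint behind HAProxy
    REQUESTS = 5000
    CONCURRENCY = 50

    def hit(_):
        """Issue one request; return its latency in milliseconds, or None on error."""
        started = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                resp.read()
            return (time.monotonic() - started) * 1000
        except OSError:
            return None

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = [r for r in pool.map(hit, range(REQUESTS)) if r is not None]
    elapsed = time.monotonic() - start

    if latencies:
        latencies.sort()
        print(
            f"throughput: {len(latencies) / elapsed:.0f} req/s, "
            f"errors: {REQUESTS - len(latencies)}, "
            f"p50: {latencies[len(latencies) // 2]:.0f} ms, "
            f"p99: {latencies[int(len(latencies) * 0.99)]:.0f} ms"
        )

Tools like wrk or ab will give you better numbers, but even this is enough to see roughly where a small box starts to choke.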

CPU Steal Time

Here’s something that happens once in a blue moon – you get a server that seems overloaded while doing nothing.  There are several reasons why that can happen, but today I’m only going to look at one of them, as it happened to me very recently.

Firstly, if you have any kind of important infrastructure, make sure you have monitoring tools in place – not just the notification kind, like Nagios, but also graphing ones, like Zabbix and Munin.  They will help you plenty at times like this.

[Graph: CPU usage on the web1 server, with steal time way off]

When you have an issue to solve, you don’t want to be installing monitoring tools and only then starting to gather data.  You want the data to be there already.

Now, for the real thing.  What happened here?  Well, obviously the CPU steal time seems way off.  But what the hell is the CPU steal time?  Here’s a handy article – Understanding the CPU steal time.  And here is my favorite part of it:

There are two possible causes:

  1. You need a larger VM with more CPU resources (you are the problem).
  2. The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).

The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.

In our case, it was a physical server issue, which we had no control over.  But it was super helpful to be able to say what was going on.  We prepared a “plan B”, which was to move to another server, but in the end the issue disappeared and we didn’t have to use it.

Oh, and if you don’t have those handy monitoring tools, you can use top:

[Screenshot: top output showing CPU steal time (st)]
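Or, if you’d rather script it, steal time is right there in /proc/stat.  Here’s a minimal sketch that samples the aggregate CPU counters twice and prints the steal percentage (field positions follow the standard Linux /proc/stat layout):

    import time

    def cpu_times():
        """Return the aggregate CPU counters from the first line of /proc/stat."""
        with open("/proc/stat") as f:
            # The line looks like: cpu  user nice system idle iowait irq softirq steal ...
            return [int(v) for v in f.readline().split()[1:]]

    before = cpu_times()
    time.sleep(5)
    after = cpu_times()

    deltas = [b - a for a, b in zip(before, after)]
    steal = deltas[7]    # the 8th counter is "steal" -- time taken by the hypervisor

    print(f"steal: {100.0 * steal / sum(deltas):.1f}%")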

P.S.: If you are on Amazon EC2, you might find this article useful as well.