How Far Can You Go With HAProxy and a t2.micro

Here’s an interesting set of experiments trying to answer the question of how far can you go with HAProxy setup on the smallest of the Amazon EC2 instances – t2.micro (1 virtual CPU, 1 GB of RAM).  Here’s the summary.

460 requests/second

At 460 req/second response times are mostly a flat ~300 ms, except for two spikes. I attribute this to TCP congestion avoidance as the traffic approaches the limit and packets start to get dropped. After dropped packets are detected the clients reduce their transmission rate, but eventually the transmission rate stabilizes again just under the limit. Only 1739 requests timeout and 134918 succeed.

[…]

It seems that the limit of the t2.micro is around 500 req/second even for small responses.

CPU Steal Time

Here’s something that happens once in a blue moon – you get a server that seems overloaded while doing nothing.  There are several reasons for why that can happen, but today I’m only going to look at one of them.  As it happened to me very recently.

Firstly, if you have any kind of important infrastructure, make sure you have the monitoring tools in place.  Not just the notification kind, like Nagios, but also graphing ones like Zabbix and Munin.  This will help you plenty in times like this.

web1

When you have an issue to solve, you don’t want to be installing monitoring tools, and starting to gather your data.  You want the data to be there already.

Now, for the real thing.  What happened here?  Well, obviously the CPU steal time seems way off.  But what the hell is the CPU steal time?  Here’s a handy article – Understanding the CPU steal time.  And here is my favorite part of it:

There are two possible causes:

  1. You need a larger VM with more CPU resources (you are the problem).
  2. The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).

The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.

In our case, it was a physical server issue, which we had no control over.  But it was super helpful to be able to say what is going.  We’ve prepared “plan B”, which was to move to another server, but finally the issue disappeared and we didn’t have to do that this time.

Oh, and if you don’t have those handy monitoring tools, you can use top:

top_steal

P.S. : If you are on Amazon EC2, you might find this article useful as well.

Linux Performance Analysis in 60,000 Milliseconds

Netflix shares the process they use to quickly assess performance issues on a Linux machine.  With these commands you can get a good idea of what’s going on within about 60 seconds:

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Read the rest of the article for details and explanations.

Infrastructure update : GitHub, BitBucket, HipChat, TeamworkPM and Redmine

It’s been a while since I posted an update on our infrastructure tools, so here goes one.  (I know, ideally, it should be on our company’s blog, but we haven’t finished that part of the site yet).

Continue reading Infrastructure update : GitHub, BitBucket, HipChat, TeamworkPM and Redmine