HAProxy SNI

“HAProxy SNI” is pure gold!  If you want a load balancer for HTTPS traffic without managing SSL certificates on the load balancer itself, there is a way to do so.

The approach utilizes the Server Name Indication (SNI) extension to the TLS protocol.  I knew about it and was already using it on the web server side, but it hadn’t occurred to me that it could be used on the load balancer as well.  Here’s the configuration bit:

frontend https
  bind *:443
  description Incoming traffic to port 443
  # TCP mode: the TLS stream is passed through, never terminated here
  mode tcp
  # wait up to 5 seconds for the ClientHello before giving up on inspection
  tcp-request inspect-delay 5s
  tcp-request content accept if { req_ssl_hello_type 1 }
  # route by the SNI hostname in the ClientHello (case-insensitive match)
  use_backend backend-ssl-foobar if { req_ssl_sni -i foobar.com }
  use_backend backend-ssl-example if { req_ssl_sni -i example.com }
  default_backend backend-ssl-default

The above makes HAProxy listen on port 443, send all traffic for foobar.com to one backend, all traffic for example.com to another backend, and the rest to the third, default backend – all without terminating TLS, so no certificates ever live on the load balancer.
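This works because the SNI hostname travels in plain text in the ClientHello, before any encryption begins.  Here is a rough Python sketch (my own illustration, not from the original article) of what an SNI fetch like req_ssl_sni has to do: pull the hostname straight out of the raw handshake bytes.

import struct

def extract_sni(client_hello: bytes):
    """Extract the SNI hostname from a raw TLS ClientHello record.

    The hostname is sent unencrypted, which is exactly why HAProxy
    can route on it without holding any certificates.
    """
    # TLS record header: content type (22 = handshake), version, length
    if len(client_hello) < 5 or client_hello[0] != 22:
        return None
    pos = 5
    if client_hello[pos] != 1:          # handshake type 1 = ClientHello
        return None
    pos += 4                            # handshake type + 3-byte length
    pos += 2 + 32                       # client version + random
    pos += 1 + client_hello[pos]        # session ID
    (cipher_len,) = struct.unpack_from("!H", client_hello, pos)
    pos += 2 + cipher_len               # cipher suites
    pos += 1 + client_hello[pos]        # compression methods
    (ext_total,) = struct.unpack_from("!H", client_hello, pos)
    pos += 2
    end = pos + ext_total
    while pos + 4 <= end:               # walk the extensions
        ext_type, ext_len = struct.unpack_from("!HH", client_hello, pos)
        pos += 4
        if ext_type == 0:               # extension 0 = server_name (SNI)
            # skip list length (2 bytes) + name type (1 byte)
            (name_len,) = struct.unpack_from("!H", client_hello, pos + 3)
            return client_hello[pos + 5:pos + 5 + name_len].decode("ascii")
        pos += ext_len
    return None

One thing worth noting: the backends referenced above must also run in mode tcp, because the TLS session is terminated by the destination servers, not by HAProxy.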

350+ Data Structure Problems with Solutions

Here is a rather extensive collection of 350+ data structure problems with solutions.  The list varies from the usual searching and sorting of values in an array, to string manipulation, binary logic, matrices and graphs.  No matter how high your grades were in all those Computer Science courses back in college, or how long you have been programming, I guarantee you’ll find a challenge or two in this list.

From a brief couple of hours with the list, my favorites seem to be the chessboard problems, such as this chess knight problem: finding the shortest path to a destination using a queue.
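To give a taste of it, here is a quick sketch in Python (my own take, not the site’s solution): breadth-first search with a queue guarantees that the first time a square is reached, it was reached in the fewest possible moves.

from collections import deque

def knight_shortest_path(start, goal, board_size=8):
    """Minimum number of knight moves from start to goal on a chessboard."""
    moves = [(1, 2), (2, 1), (2, -1), (1, -2),
             (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    queue = deque([(start, 0)])         # (position, distance so far)
    visited = {start}
    while queue:
        (x, y), dist = queue.popleft()
        if (x, y) == goal:
            return dist
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if (0 <= nxt[0] < board_size and 0 <= nxt[1] < board_size
                    and nxt not in visited):
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return -1                           # unreachable

print(knight_shortest_path((0, 0), (7, 7)))  # prints 6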

WordPress Plugin : Image Processing Queue

As described in “Introducing WP Image Processing Queue – On-the-Fly Image Processing Done Right”, the Image Processing Queue plugin tries to solve several issues with On-The-Fly Image Processing (OTFIP) in WordPress.  Some of the things that it improves are:

  • Response times for pages with not-yet-generated thumbnails.
  • Server CPU spikes for pages which use a lot of images on sites with a lot of configured thumbnail sizes (49? really? WOW! I don’t think I’ve seen more than 10 in the wild, which is still a lot).
  • Server disk space issues caused by removed images and leftover thumbnails.

This is a very useful direction and I hope all the necessary bits will make it into the WordPress core.  But even for those who don’t use WordPress, the whole discussion and implementation are a handy reference.
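The pattern translates outside WordPress, too.  Here is a minimal Python sketch of the idea (the names and the Pillow-based worker are my illustration, not the plugin’s actual code): serve whatever exists right now, and push the resize onto a queue for a background worker.

import queue
import threading
from pathlib import Path

from PIL import Image  # Pillow

resize_jobs = queue.Queue()

def thumbnail_or_original(source: Path, size) -> Path:
    """Return the thumbnail if it already exists; otherwise queue the
    resize and fall back to the original so the page renders instantly."""
    thumb = source.with_name(f"{source.stem}-{size[0]}x{size[1]}{source.suffix}")
    if thumb.exists():
        return thumb
    resize_jobs.put((source, thumb, size))
    return source

def worker():
    """Generate thumbnails one at a time in the background, instead of
    doing dozens of resizes inside a single page request."""
    while True:
        source, thumb, size = resize_jobs.get()
        image = Image.open(source)
        image.thumbnail(size)
        image.save(thumb)
        resize_jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

The request path never blocks on image processing, which is exactly the response-time and CPU-spike fix the plugin is after.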

Mcrouter: a memcached protocol router

Mcrouter is an Open Source tool developed by Facebook for scaling memcached deployments:

Mcrouter is a memcached protocol router for scaling memcached (http://memcached.org/) deployments. It’s a core component of cache infrastructure at Facebook and Instagram where mcrouter handles almost 5 billion requests per second at peak.

Here is a good overview of some of the scenarios where Mcrouter is useful – there is more than one.  And here are some of the features to get you started (a quick client-side sketch follows the list):

  • Memcached ASCII protocol
  • Connection pooling
  • Multiple hashing schemes
  • Prefix routing
  • Replicated pools
  • Production traffic shadowing
  • Online reconfiguration
  • Flexible routing
  • Destination health monitoring/automatic failover
  • Cold cache warm up
  • Broadcast operations
  • Reliable delete stream
  • Multi-cluster support
  • Rich stats and debug commands
  • Quality of service
  • Large values
  • Multi-level caches
  • IPv6 support
  • SSL support
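Since mcrouter speaks the standard memcached ASCII protocol, existing clients need no changes at all: you simply point them at mcrouter instead of memcached.  A tiny sketch with the pymemcache library (the port is an assumption; use whatever your mcrouter instance listens on):

from pymemcache.client.base import Client

# Point the client at mcrouter rather than a memcached instance; it speaks
# the same protocol and does the routing/pooling/failover behind the scenes.
client = Client(("127.0.0.1", 5000))

client.set("user:42:name", b"hello")
print(client.get("user:42:name"))  # b'hello'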

Latency numbers by year

Last year I came across a nice chart of latency numbers every programmer should know.  Today, I saw this page, which shows you the same latency numbers, but also provides a timeline from 1990 to 2020.

For some operations, latency is constant, because it is bounded by physics – the speed of light, the distance between continents, and so on.  For other operations, latency can be decreased through better technology and algorithms.
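The physics-bound ones are easy to sanity-check with a back-of-the-envelope calculation (the distance below is a rough assumption on my part):

# Why an intercontinental round trip can only improve so much:
SPEED_OF_LIGHT_KM_S = 299_792      # in vacuum
FIBER_FACTOR = 0.67                # light in fiber moves at roughly 2/3 of c
CA_TO_NL_KM = 8_800                # rough California-to-Netherlands distance

one_way_s = CA_TO_NL_KM / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
print(f"theoretical round trip: {2 * one_way_s * 1000:.0f} ms")  # ~88 ms

The familiar chart lists about 150 ms for that round trip; the gap between 88 and 150 is routing, queueing, and the fact that cables don’t follow great circles.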

The timeline clearly shows the mind-blowing advance we’ve experienced in technology over the last three decades.

CPU Steal Time. Now on Amazon EC2

Yesterday I wrote a blog post trying to figure out what CPU steal time is and why it occurs.  The problem with that post was that I didn’t dig deep enough.

I was looking at the issue from the point of view of a generic virtual machine.  The case I actually had to deal with wasn’t quite like that: I was seeing CPU steal time on an Amazon EC2 instance.  Assuming it was just noisy neighbors acting up, or Amazon having a temporary hardware issue, was the wrong conclusion.

That’s because I didn’t know enough about Amazon EC2.  Well, I’ve learned a bunch since then, so here’s what I found.

Continue reading “CPU Steal Time. Now on Amazon EC2”

NAS Performance: NFS vs Samba vs GlusterFS

I came across this question, and found the benchmark results somewhat surprising:

  • GlusterFS replicated 2: 32-35 seconds, high CPU load
  • GlusterFS single: 14-16 seconds, high CPU load
  • GlusterFS + NFS client: 16-19 seconds, high CPU load
  • NFS kernel server + NFS client (sync): 32-36 seconds, very low CPU load
  • NFS kernel server + NFS client (async): 3-4 seconds, very low CPU load
  • Samba: 4-7 seconds, medium CPU load
  • Direct disk: < 1 second

The post is from 2012, so I’m curious whether these numbers still hold.  Has anybody tried this recently?  Can anyone confirm or refute?
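For anyone who wants to re-check, here is a crude sketch of the kind of test (mount points and file counts are my placeholders, not details from the original benchmark): time a metadata-heavy burst of small file writes against each mount.

import os
import time

def write_small_files(target_dir, count=1000, size=4096):
    """Write `count` small files and return elapsed seconds – a rough
    proxy for the small-file workloads where these filesystems differ."""
    os.makedirs(target_dir, exist_ok=True)
    payload = os.urandom(size)
    start = time.monotonic()
    for i in range(count):
        with open(os.path.join(target_dir, f"file{i:05d}"), "wb") as f:
            f.write(payload)
    return time.monotonic() - start

for mount in ("/mnt/nfs", "/mnt/smb", "/mnt/gluster"):  # hypothetical mounts
    print(mount, f"{write_small_files(os.path.join(mount, 'bench')):.1f}s")

Note the sync/async distinction in the list above: whether each write must hit disk before the call returns can alone account for a 10x difference.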

Also, an interesting note from the answer to the above:

From what I’ve seen after a couple of packet captures, the SMB protocol can be chatty, but the latest version of Samba implements SMB2 which can both issue multiple commands with one packet, and issue multiple commands while waiting for an ACK from the last command to come back. This has vastly improved its speed, at least in my experience, and I know I was shocked the first time I saw the speed difference too – Troubleshooting Network Speeds — The Age Old Inquiry


How Far Can You Go With HAProxy and a t2.micro

Here’s an interesting set of experiments trying to answer the question of how far you can go with an HAProxy setup on the smallest of the Amazon EC2 instances – a t2.micro (1 virtual CPU, 1 GB of RAM).  Here’s the summary.

460 requests/second

At 460 req/second response times are mostly a flat ~300 ms, except for two spikes. I attribute this to TCP congestion avoidance as the traffic approaches the limit and packets start to get dropped. After dropped packets are detected the clients reduce their transmission rate, but eventually the transmission rate stabilizes again just under the limit. Only 1739 requests timeout and 134918 succeed.

[…]

It seems that the limit of the t2.micro is around 500 req/second even for small responses.
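For a ballpark check against your own endpoint, even a crude probe will do.  A sketch (the URL and counts are placeholders, and this is nothing like a proper load-testing rig):

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://my-haproxy.example.com/"   # placeholder target
REQUESTS = 1000

def fetch(_):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

start = time.monotonic()
with ThreadPoolExecutor(max_workers=50) as pool:
    ok = sum(pool.map(fetch, range(REQUESTS)))
elapsed = time.monotonic() - start
print(f"{ok}/{REQUESTS} succeeded, {REQUESTS / elapsed:.0f} req/s")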

CPU Steal Time

Here’s something that happens once in a blue moon – you get a server that seems overloaded while doing nothing.  There are several reasons why that can happen, but today I’m only going to look at one of them, as it happened to me very recently.

Firstly, if you have any kind of important infrastructure, make sure you have monitoring tools in place.  Not just the notification kind, like Nagios, but also graphing ones, like Zabbix and Munin.  They will help you plenty in times like these.

[graph: CPU utilization on web1, showing an abnormally high share of steal time]

When you have an issue to solve, you don’t want to be installing monitoring tools and only then starting to gather data.  You want the data to be there already.

Now, for the real thing.  What happened here?  Well, obviously the CPU steal time seems way off.  But what the hell is CPU steal time?  Here’s a handy article – Understanding the CPU steal time.  And here is my favorite part of it:

There are two possible causes:

  1. You need a larger VM with more CPU resources (you are the problem).
  2. The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).

The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.

In our case, it was a physical server issue, which we had no control over.  But it was super helpful to be able to say what was going on.  We prepared a “plan B” – moving to another server – but in the end the issue disappeared and we didn’t have to do that this time.

Oh, and if you don’t have those handy monitoring tools, you can use top:

[screenshot: top output, with the steal value shown as “st” in the CPU summary line]
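And if you would rather script it than eyeball top, the same counter is exposed in /proc/stat – the steal field is the eighth value on the cpu line, per proc(5).  A small Linux-only sketch:

import time

def cpu_steal_percent(interval=1.0):
    """Sample /proc/stat twice and return the percentage of CPU time
    stolen by the hypervisor over the interval."""
    def read():
        with open("/proc/stat") as f:
            # "cpu  user nice system idle iowait irq softirq steal ..."
            fields = [int(v) for v in f.readline().split()[1:]]
        return sum(fields), fields[7]   # total jiffies, steal jiffies
    total1, steal1 = read()
    time.sleep(interval)
    total2, steal2 = read()
    return 100.0 * (steal2 - steal1) / (total2 - total1)

print(f"steal: {cpu_steal_percent():.1f}%")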

P.S.: If you are on Amazon EC2, you might find this article useful as well.