CPU Steal Time

Here’s something that happens once in a blue moon – you get a server that seems overloaded while doing nothing.  There are several reasons for why that can happen, but today I’m only going to look at one of them.  As it happened to me very recently.

Firstly, if you have any kind of important infrastructure, make sure you have the monitoring tools in place.  Not just the notification kind, like Nagios, but also graphing ones like Zabbix and Munin.  This will help you plenty in times like this.

web1

When you have an issue to solve, you don’t want to be installing monitoring tools, and starting to gather your data.  You want the data to be there already.

Now, for the real thing.  What happened here?  Well, obviously the CPU steal time seems way off.  But what the hell is the CPU steal time?  Here’s a handy article – Understanding the CPU steal time.  And here is my favorite part of it:

There are two possible causes:

  1. You need a larger VM with more CPU resources (you are the problem).
  2. The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).

The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.

In our case, it was a physical server issue, which we had no control over.  But it was super helpful to be able to say what is going.  We’ve prepared “plan B”, which was to move to another server, but finally the issue disappeared and we didn’t have to do that this time.

Oh, and if you don’t have those handy monitoring tools, you can use top:

top_steal

P.S. : If you are on Amazon EC2, you might find this article useful as well.

Office 365 Lync Online SRV DNS records on Amazon Route 53

One of the very few things we still rely on from Microsoft at work is Office 365.  Not because it is so great, but because I simply didn’t have the time to move away yet (quiet Christmas season is coming soon).  Most people don’t get exposed much to it anyway, using Evolution or Thunderbird email clients, or forwarding everything to Gmail altogether.  But as an administrator of the service I get constantly annoyed by the “there are problems with your domain” notification.  After ignoring it for about a year, I decided to finally fix it.  All that was needed is a couple of records in the DNS zone.

Unfortunately, the instructions Microsoft provides don’t quite well apply to Amazon Route 53 user interface.  It took me a few tries and some searching to find the right ones.  Here they are, thanks to a comment on this page:

First record:

1. Log into Route 53.
2. in the hosted zones window, check the box next to YOURDOMAIN.com
3. click  [ > Go to Record Sets]
4. Click  [Create Record Set]
5. Enter  _sipfederationtls._tcp , in the Name field
6. Switch type to SRV – Service locator
7. In the Value window Enter: 100 1 5061 sipfed.online.lync.com
8. Click [Create]

Second record:

9. Click  [Create Record Set]
10. Enter: _sip._tls , in the name field
11. Switch type to SRV – Service locator
12. In the Value window Enter: 100 1 443 sipdir.online.lync.com
13. Click [Create]

Inside Amazon’s Cloud Computing Infrastructure

aws regions

Here’s a little insight into the Amazon’s cloud computing infrastructure:

Amazon operates at least 30 data centers in its global network, with another 10 to 15 on the drawing board.

How big is a data center?

A key decision in planning and deploying cloud capacity is how large a data center to build. Amazon’s huge scale offers advantages in both cost and operations. Hamilton said most Amazon data centers house between 50,000 and 80,000 servers, with a power capacity of between 25 and 30 megawatts.

So, how many servers does the Amazon AWS run?

So how many servers does Amazon Web Services run? The descriptions by Hamilton and Vogels suggest the number is at least 1.5 million. Figuring out the upper end of the range is more difficult, but could range as high as 5.6 million, according to calculations by Timothy Prickett Morgan at the Platform.

WordPress Benchmark of MySQL server on Amazon EC2

I have a friend who is a newcomer to the world of WordPress.  Until recently, he was mostly working with custom-built systems and a PostgreSQL database engine, so there are many topics to cover.

One of the topics that came up today was the performance of the database engine.  A quick Google search brought up the Benchmark plugin, which we used to compare results from several servers.  (NOTE: you’ll need php-bcmath installed on your server for this plugin to work.)

My friend’s test server showed a rather poor 48 requests / second result.  And that’s on an Intel Core2 Duo E4500 machine with 4 GB of RAM and 160 GB 7200 RPM SATA HDD, running Ubuntu 12.04 x86-64.

So, I tried it on my setup.  My setup is all on Amazon EC2, using the smallest possible t2.micro servers (that’s Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, with 1 GB of RAM and god knows what kind of hard disk, running Amazon AMI).

First, I ran the benchmark on the test server, which hosts about 20 sites with low traffic (I didn’t want to bring up a separate instance for just a single benchmark run).  MySQL runs on the same instance as the web server.  And here are the results:

Your System Industry Average
CPU Speed: 38,825 BogoWips 24,896 BogoWips
Network Transfer Speed: 97.81 Mbps 11.11 Mbps
Database Queries per Second: 425 Queries/Sec 1,279 Queries/Sec

Secondly, I ran the benchmark on one of the live servers, which also hosts about 20 sites with low traffic. Here though, Nginx web server runs on one instance and the MySQL database on another. Here are the results:

Your System Industry Average
CPU Speed: 37,712 BogoWips 24,901 BogoWips
Network Transfer Speed: 133.91 Mbps 11.15 Mbps
Database Queries per Second: 1,338 Queries/Sec 1,279 Queries/Sec

In both cases, MySQL is v5.5.42, running on the /usr/share/doc/mysql55-server-5.5.42/my-huge.cnf configuration file. (I find it ironically pleasing that the tiniest of Amazon EC2 servers fits perfectly for the huge configuration shipped with documentation.)

The benchmark plugin explains how the numbers are calculated. Here’s what it says about the database queries:

To benchmark your database I use your wp_options table which uses the longtext column type which is the same type used by wp_posts. I do 1000 inserts of 50 paragraphs of text, then 1000 selects, 1000 updates and 1000 deletes. I use the time taken to calculate queries per second based on 4000 queries. This is a good indication of how fast your overall DB performance is in a worst case scenario when nothing is cached.

So, it’s a good number to throw around, but it’s far from the realistic site performance, as your WordPress site will mostly get SELECTs, not INSERTs or UPDATEs or DELETEs. And then, you’ll obviously need to see how many SQL queries do you need per page. And then you’ll need to examine all the caching in play – from browser, web server, WordPress, MySQL, and the operating system. And then, and then, and then.

But for a quick measure, I think, this is a good benchmark. It’s obvious that my friend can get a lot more out of his server without digging too deep. It’s obvious that separating web and database server into two Amazon instances gives you quite a boost. And it’s obvious that I don’t know much about performance measuring.

June 30, 2015 23:59:60

AWS Official Blog covers the upcoming leap second shenanigans in “Look Before You Leap – The Coming Leap Second and AWS“:

The International Earth Rotation and Reference Systems (IERS) recently announced that an extra second will be injected into civil time at the end of June 30th, 2015. This means that the last minute of June 30th, 2015 will have 61 seconds. If a clock is synchronized to the standard civil time, it will show an extra second 23:59:60 on that day between 23:59:59 and 00:00:00. This extra second is called a leap second. There have been 25 such leap seconds since 1972. The last one took place on June 30th, 2012.

Not all applications and systems are properly coded to handle this “:60” notation.