CPU Steal Time

Here’s something that happens once in a blue moon: you get a server that seems overloaded while doing nothing.  There are several reasons why that can happen, but today I’m only going to look at one of them, as it happened to me very recently.

Firstly, if you have any kind of important infrastructure, make sure you have the monitoring tools in place.  Not just the notification kind, like Nagios, but also graphing ones like Zabbix and Munin.  This will help you plenty in times like this.

[Image: monitoring graph for web1]

When you have an issue to solve, you don’t want to be installing monitoring tools, and starting to gather your data.  You want the data to be there already.

Now, for the real thing.  What happened here?  Well, obviously the CPU steal time seems way off.  But what the hell is the CPU steal time?  Here’s a handy article – Understanding the CPU steal time.  And here is my favorite part of it:

There are two possible causes:

  1. You need a larger VM with more CPU resources (you are the problem).
  2. The physical server is over-sold and the virtual machines are aggressively competing for resources (you are not the problem).

The catch: you can’t tell which case your situation falls under by just watching the impacted instance’s CPU metrics.

In our case, it was a physical server issue, which we had no control over.  But it was super helpful to be able to say what was going on.  We prepared a “plan B”, which was to move to another server, but in the end the issue disappeared and we didn’t have to do that this time.

Oh, and if you don’t have those handy monitoring tools, you can use top:

[Image: top_steal, a screenshot of top output showing the steal time]
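
If you’d rather have a number than eyeball top, the steal figure can also be derived from /proc/stat.  Here is a minimal sketch, assuming a Linux kernel that reports the steal column (the eighth value after the "cpu" label on the aggregate line).  The function name steal_percent is made up for illustration.

```shell
#!/bin/bash
# Compute the steal percentage between two aggregate "cpu" lines of /proc/stat.
# Values after the "cpu" label: user nice system idle iowait irq softirq steal guest guest_nice
# Guest time is already counted in user time, so only the first eight values are summed.
steal_percent() {
  # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) total1 += $i; steal1 = $9 }
    NR == 2 { for (i = 2; i <= 9; i++) total2 += $i; steal2 = $9
              delta = total2 - total1
              if (delta <= 0) { print "0.0"; exit }
              printf "%.1f\n", 100 * (steal2 - steal1) / delta }'
}

# Usage on a live system: take two samples a few seconds apart.
# before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
# steal_percent "$before" "$after"
```

The counters in /proc/stat are cumulative since boot, which is why the sketch works on the difference between two samples rather than on a single reading.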

P.S.: If you are on Amazon EC2, you might find this article useful as well.

APC is dead, long live OPcache

Since this is probably common knowledge by now, this blog post is more a note to my future self.  APC is dead.  Don’t use it.  Use OPcache instead.  APCu is something else.

In the last few years I’ve had so many issues with APC that I eventually stopped installing it on my servers by default.  Now that I need to squeeze every bit of performance out of one of the projects, I looked back at it.  And tried it.  And once again it kicked me in the balls.  Then I remembered that I’d seen APCu somewhere.  Maybe it’s a newer fork or something.

Luckily, after a quick Google search for the difference, I came across this discussion, which clarified a few things.

So out of those you named:

  • APC is opcode cache and data store
  • APCu is only data store
  • OPcache is only opcode cache

Since APC is older, at the moment you likely want OPcache as well as some data store, not necessarily APCu (although it is perfectly fine choice).

My interest was in the opcode cache, since I already had a data store.  Installing and configuring OPcache took just a few seconds, and it hasn’t caused any issues so far.

And if you want more information about it, here is a useful article, which, among other things, lists the helpful tools for monitoring and tweaking OPcache configuration.

3. How to check if OpCache is actually caching my files?

If you have already installed and configured OpCache, you may find it important to control which PHP files are actually being cached. The whole cache engine works in the background and is transparent to a visitor or a web developer. In order to check its status, you may use one of the two functions that provide such information: opcache_get_configuration() and opcache_get_status(). Fortunately, there is a couple of prepared scripts that fetch all the OpCache configuration and status data and display it in a friendly way. You don’t need to write any code by yourself, just pick up one of tools from these below:

  • Opcache Control Panel
  • opcache-status by Rasmus Lerdorf
  • OpCacheGUI by Pieter Hordijk
  • opcache-gui by Andrew Collington
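
Before reaching for one of those tools, a quick sanity check from the command line can at least confirm that the extension is loaded.  This is a rough sketch, assuming that php -m lists the extension as "Zend OPcache" on a line of its own; has_opcache is a made-up helper name.  It only shows that the CLI binary has the extension loaded, not that your web requests are being cached (the opcache.enable_cli setting is off by default), so the status functions above remain the real answer.

```shell
#!/bin/bash
# Hypothetical helper: reads a module list (as printed by `php -m`) on stdin
# and succeeds if it contains a line that is exactly "Zend OPcache".
has_opcache() {
  grep -qx 'Zend OPcache'
}

# Usage:
# php -m | has_opcache && echo "OPcache is loaded"
```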

May the Cache be with you.

Weird PHP error output bug

We came across this PHP bug at work today.  But before you go and read it, let me show you a use case.  See if you can spot the problem.

We had a cron job script which looked something like this (shortened for clarity):

#!/bin/bash

# ... a bunch of stuff here ... 

date && echo "Updating products"
php updateProducts.php 1>/dev/null

if [ "$?" -ne "0" ]
then
  date && echo "Updating products failed"
  exit 1
fi

# ... more stuff here ...

Crystal clear, no? Output a time stamp and a log message, run the product update while discarding its normal output (standard output goes to /dev/null), and then check if the script finished fine. If not, print the time stamp and a log message and exit with a non-zero status code.

We use similar code snippets all over the place, and they work fine.  This particular one was a new addition.  So the cron job ran and the “Updating products failed” part happened.  Weird.  The PHP script in question has plenty of logging in it, but nothing was logged.  So we added more logs.  And then some more logs.  And even more logs.  Until it became obvious that something else was wrong, because even the first line of the script, which was now a logging action, wasn’t triggered.

After a rather lengthy troubleshooting session we noticed that the updateProducts.php file was in fact named udpateProducts.php.  A simple typo in the file name.  But shouldn’t that be printed out into the error output?

Let’s check:

$ php no_such_file.php
Could not open input file: no_such_file.php
$ php no_such_file.php 1>/dev/null
$

Huh? Where’s my error? It’s gone.  If you are as used to the command line as I am, you’d expect PHP to write errors to STDERR, which 1>/dev/null leaves alone.  But PHP is much smarter than that: the message went to standard output and got discarded along with everything else.  PHP has a whole slew of configuration options regarding error output.  In this case, in particular, you need to check the values of the display_errors and error_log configuration variables.  The bug report describes a Debian machine, while I tested it on Fedora, CentOS, and Amazon AMI.

$ php -i | egrep '(display_errors|error_log)'
display_errors => Off => Off
error_log => no value => no value

Now it’s not much of a mystery.  But things like that can easily make you pull some hair out.  Hopefully, this gets some attention.
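
One way to avoid depending on where PHP sends that message is to check the file yourself before running it.  Below is a minimal sketch of that idea, not what we ended up deploying.  The helper name run_php_script and the return code 127 (borrowed from the shell’s "command not found" convention) are my own assumptions.

```shell
#!/bin/bash
# Hypothetical helper: refuse to run a PHP script that is missing or unreadable,
# and say so on standard error, which the 1>/dev/null redirect does not touch.
run_php_script() {
  local script="$1"
  if [ ! -r "$script" ]; then
    echo "Cannot read PHP script: $script" >&2
    return 127
  fi
  # Same invocation as in the cron job: discard standard output only.
  php "$script" 1>/dev/null
}

# Usage inside the cron job:
# if ! run_php_script updateProducts.php; then
#   date && echo "Updating products failed"
#   exit 1
# fi
```

With this in place, a typo like udpateProducts.php shows up as an explicit message in whatever cron does with standard error, instead of vanishing with the rest of the output.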