Forcing Amazon Linux AMI compatibility with CentOS in Ansible

One of the things that makes Ansible so awesome is a huge collection of shared roles over at Ansible Galaxy.  These bring you best practices, flexible configurations and in general save hours and hours of hardcore swearing and hair pulling.

Each role usually supports multiple versions of multiple Linux distributions.  However, you’ll find that the majority of the supported distributions are Ubuntu, Debian, Red Hat Enterprise Linux, CentOS, and Fedora.  The rest aren’t as popular.

Which brings me to Amazon Linux AMI.  It is mostly compatible with CentOS, but it uses a different versioning scheme, which means that most of those Ansible roles will either ignore Amazon Linux AMI or complain about not supporting it.

Here is an example I came across yesterday from the dj-wasabi.zabbix-server role.  The template for the Yum repository uses the ansible_distribution_major_version variable, which is expected to look like a Red Hat / CentOS version number – 5, 6, 7, etc.  Amazon Linux AMI’s major version is reported as “NA” – not available.  That’s probably because Amazon Linux AMI versions are date-based, with the latest one being 2016.03.

[zabbix]
name=Zabbix Official Repository - $basearch
baseurl=http://repo.zabbix.com/zabbix/{{ zabbix_version }}/rhel/{{ ansible_distribution_major_version }}/$basearch/
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-ZABBIX
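
To see why “NA” is a problem, here is a minimal Python sketch that mimics how the baseurl line above would render.  The render_baseurl helper and the 2.4 Zabbix version are my own illustrative assumptions; the real rendering is done by Ansible's templating, not by this code.

```python
# Illustrative only: imitates the baseurl line of the template with plain
# string formatting. "$basearch" is left alone, since Yum expands it later.
BASEURL = "http://repo.zabbix.com/zabbix/{zabbix_version}/rhel/{major_version}/$basearch/"


def render_baseurl(zabbix_version: str, major_version: str) -> str:
    """Fill in the two templated parts of the repository URL."""
    return BASEURL.format(zabbix_version=zabbix_version, major_version=major_version)


if __name__ == "__main__":
    # "2.4" is a made-up example value for zabbix_version.
    print(render_baseurl("2.4", "6"))   # what a CentOS 6 host would get
    print(render_baseurl("2.4", "NA"))  # what Amazon Linux AMI would get
```

With “NA” in place of the version number, the URL ends up with /rhel/NA/ in the path, which is not the 5, 6, 7 style of directory the template expects.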

Officially, Amazon Linux AMI is not CentOS or Red Hat Enterprise Linux.  But if you don’t care about such little nuances, and you are brave enough to experiment and assume things, then you can make that role work by simply setting the appropriate variables to the values that you want.

First, here is a standalone test.yml playbook to try things out:

- name: Test
  hosts: localhost
  pre_tasks:
  - set_fact: ansible_distribution_major_version=6
    when: ansible_distribution == "Amazon"
  tasks:
  - debug: msg={{ ansible_distribution_major_version }}

Let’s run it and look at the output:

$ ansible-playbook test.yml

PLAY [Test] *******************************************************************

GATHERING FACTS ***************************************************************
ok: [localhost]

TASK: [set_fact ansible_distribution_major_version=6] *************************
ok: [localhost]

TASK: [debug msg={{ ansible_distribution_major_version }}] ********************
ok: [localhost] => {
  "msg": "6"
}

PLAY RECAP ********************************************************************
localhost : ok=3 changed=0 unreachable=0 failed=0

So far so good.  Now we need to integrate this into our playbook in such a way that the variable is set before the third-party role is executed.  For that, we’ll use pre_tasks.  Here is an example:

---
- name: Zabbix Server
  hosts: zabbix.server
  sudo: yes
  pre_tasks:
    - set_fact: ansible_distribution_major_version=6
      when: ansible_distribution == "Amazon" and ansible_distribution_major_version == "NA"
  roles:
    - role: dj-wasabi.zabbix-server

A minor twist here is also checking if the major version is not set yet. You can skip that, or you can change it, for example, to examine the Amazon Linux AMI version and set corresponding CentOS version.
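
The condition in that when clause boils down to the logic below, sketched in Python rather than Ansible.  The function name and the fallback parameter are hypothetical, and treating Amazon Linux AMI as CentOS 6 is the same assumption the playbook makes, not something Amazon promises.

```python
def effective_major_version(distribution: str, major_version: str, fallback: str = "6") -> str:
    """Return the major version that a CentOS-oriented role should see."""
    # Override only when the host is Amazon Linux AMI *and* no usable
    # major version was reported; everything else passes through untouched.
    if distribution == "Amazon" and major_version == "NA":
        return fallback
    return major_version


if __name__ == "__main__":
    print(effective_major_version("Amazon", "NA"))
    print(effective_major_version("CentOS", "7"))
```

Checking for “NA” as well as for the distribution name means the override stops applying if a usable major version is ever reported.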

Let’s Encrypt on CentOS 7 and Amazon AMI

The last few weeks were super busy at work, so I accidentally let a few SSL certificates expire.  Renewing them is always annoying and time consuming, so I was pushing it until the last minute, and then some.

Instead of going the usual way for the renewal, I decided to try the Let’s Encrypt deal.  (I’ve covered Let’s Encrypt before here and here.)  Basically, Let’s Encrypt is a new Certification Authority, created by Electronic Frontier Foundation (EFF), with the backing of Google, Cisco, Mozilla Foundation, and the like.  This new CA is issuing well recognized SSL certificates, for free.  Which is good.  But the best part is that they’ve set up the process to be as automated as possible.  All you need is to run a shell command to get the certificate and then another shell command in the crontab to renew the certificate automatically.  Certificates are only issued for 3 months, so you’d really want to have them automatically updated.

It took me longer than I expected to figure out how this whole thing works, but that’s because I’m not well versed in SSL, and because they have so many different options, suited for different web servers, and different sysadmin experience levels.

Eventually I made it work, and here is the complete process, so that I don’t have to figure it out again later.

We are running a mix of CentOS 7 and Amazon AMI servers, using both Nginx and Apache.   Here’s what I had to do.

First things first.  Install the Let’s Encrypt client software.  Supposedly there are several options, but I went for the official one.  Manual way:

# Install requirements
yum install git bc
cd /opt
git clone https://github.com/certbot/certbot letsencrypt

Alternatively, you can use geerlingguy’s lets-encrypt-role for Ansible.

Secondly, we need to get a new certificate.  As I said before, there are multiple options here.  I decided to use the certonly way, so that I have better control over where things go, and so that I would minimize the web server downtime.

There are a few things that you need to specify for the new SSL certificate.  These are:

  • The list of domains, which the certificate should cover.  I’ll use example.com and www.example.com here.
  • The path to the web folder of the site.  I’ll use /var/www/vhosts/example.com/
  • The email address, which Let’s Encrypt will use to contact you in case there is something urgent.  I’ll use ssl@example.com here.

Now, the command to get the SSL certificate is:

/opt/letsencrypt/certbot-auto certonly --webroot --email ssl@example.com --agree-tos -w /var/www/vhosts/example.com/ -d example.com -d www.example.com

When you run this for the first time, you’ll see that a bunch of additional RPM packages will be installed, for the virtual environment to be created and used.  On CentOS 7 this is sufficient.  On Amazon AMI, the command will run, install things, and will fail with something like this:

WARNING: Amazon Linux support is very experimental at present...
if you would like to work on improving it, please ensure you have backups
and then run this script again with the --debug flag!

This is useful, but insufficient.  Before you can run successfully, you’ll also need to do the following:

yum install python26-virtualenv

Once that is done, run the certbot command with the --debug parameter, like so:

/opt/letsencrypt/certbot-auto certonly --webroot --email ssl@example.com --agree-tos -w /var/www/vhosts/example.com/ -d example.com -d www.example.com --debug

This should produce a success message, with “Congratulations!” and all that.  The path to your certificate (somewhere in /etc/letsencrypt/live/example.com/) and its expiration date will be mentioned too.

If you didn’t get the success message, make sure that:

  • the domain for which you are requesting a certificate resolves back to the server where you are running the certbot command.  Let’s Encrypt will try to access the site for verification purposes.
  • public access is allowed to the /.well-known/ folder.  This is where Let’s Encrypt will store temporary verification files.  Note that the folder name starts with a dot, which in UNIX means a hidden folder, and many web server configurations deny access to those.

Just drop a simple hello.txt into the /.well-known/ folder and see if you can access it with the browser.  If you can, then Let’s Encrypt shouldn’t have any issues getting you a certificate.  If all else fails, RTFM.
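
If you’d rather script that check than click around in a browser, something like this minimal Python sketch should do.  It assumes you have already placed hello.txt under .well-known in the web root; the function names are my own, and a plain HTTP request is only an approximation of what Let’s Encrypt does during verification.

```python
import urllib.error
import urllib.request


def well_known_url(domain: str, filename: str = "hello.txt") -> str:
    """Build the public URL of a test file inside the /.well-known/ folder."""
    return f"http://{domain}/.well-known/{filename}"


def is_publicly_readable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and 4xx/5xx replies.
        return False


if __name__ == "__main__":
    for domain in ("example.com", "www.example.com"):
        url = well_known_url(domain)
        print(url, "OK" if is_publicly_readable(url) else "NOT reachable")
```

Run it from a machine outside your network, so that you test the same path the outside world sees.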

Now that you have the certificate generated, you’ll need to add it to the web server’s virtual host configuration.  How exactly to do this varies from web server to web server, and even between the different versions of the same web server.

For Apache version >= 2.4.8 you’ll need to do the following:

SSLEngine on
SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
SSLCertificateFile /etc/letsencrypt/live/example.com/fullchain.pem

For Apache version < 2.4.8 you’ll need to do the following:

SSLEngine on
SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
SSLCertificateFile /etc/letsencrypt/live/example.com/cert.pem
SSLCertificateChainFile /etc/letsencrypt/live/example.com/chain.pem

For Nginx >= 1.3.7 you’ll need to do the following:

ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

You’ll obviously need the additional SSL configuration options for protocols, ciphers and the like, which I won’t go into here, but here are a few useful links:

Once your SSL certificate is issued and web server is configured to use it, all you need is to add an entry to the crontab to renew the certificates which are expiring in 30 days or less.  You’ll only need a single entry for all your certificates on this machine.  Edit your /etc/crontab file and add the following (adjust for your web server software, obviously):

# Renew Let's Encrypt certificates at 6pm every Sunday
0 18 * * 0 root (/opt/letsencrypt/certbot-auto renew && service httpd restart)
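
The “30 days or less” rule is easy to reproduce if you want to double check what is due for renewal.  Here is a rough Python sketch, assuming you feed it the expiry line printed by openssl x509 -enddate -noout -in cert.pem; the helper names are mine and this is not how certbot itself is implemented.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the "expiring in 30 days or less" rule described above.
RENEW_WINDOW = timedelta(days=30)


def expires_at(not_after: str) -> datetime:
    """Parse an expiry such as 'notAfter=Jun  1 12:00:00 2016 GMT' into a UTC datetime."""
    value = not_after.strip()
    if value.startswith("notAfter="):
        value = value[len("notAfter="):]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def is_due_for_renewal(not_after: str, now: datetime) -> bool:
    """True when the certificate expires within the renewal window."""
    return expires_at(not_after) - now <= RENEW_WINDOW


if __name__ == "__main__":
    sample = "notAfter=Jun  1 12:00:00 2016 GMT"  # made-up expiry date
    print(is_due_for_renewal(sample, datetime.now(timezone.utc)))
```

Since certificates last about three months and the cron job runs weekly, there are several attempts inside the 30-day window before a certificate actually expires.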

That’s about it.  Once all is up and running, verify and adjust your SSL configuration using the excellent Qualys SSL Labs tool.

Support lesson to learn from Amazon AWS

I’ve said a million times how happy I am with Amazon AWS.  Today I also want to share a positive lesson to learn from their technical support.  It’s the second time I’ve contacted them over the last year and a half, and it’s the second time I am amazed at how well it works.

In my experience, technical support departments usually rely on one primary communication channel – be that a telephone, an email, a ticketing system, or a live chat.  The other channels are often just routed or converted into the main one, or even completely ignored.  But each one of those has its benefits and side effects.

Telephone provides the most immediate connectivity, and a much valued option of the human interaction.  But the communication is verbal, often without the paper trail.  It makes it difficult to carbon copy (CC) people on the conversation or review exactly what has been said.  It is also very free form, unstructured.

Live chat is also free form and unstructured, but it’s written, so transcripts are easily available.  It also helps with the carbon copy, but only on the receiving end – supervisors or field experts can often be included in the conversation, but adding somebody from the requesting side is rarely supported.

Email makes it easy to carbon copy people on both ends.  It provides the paper trail, but often lacks the immediate response factor.  And it’s still unstructured, making it difficult to figure out what was requested, what has been discussed and whether or not there was any resolution.  (Have you ever been a part of a lengthy multi-lingual conversation about, what turned out to be, multiple issues in the same thread?)

Ticketing/support systems help to structure the conversation and make it follow a certain workflow.  But they often lack humanity and, much like emails, the immediate response.

Now, what Amazon AWS support has done is a beautiful combination of a ticketing system and a phone.  You start off with the ticketing system – log in, create a new support case, providing all the necessary information, and optionally CC other people from a single short form.  The moment you submit it, the web page asks for your phone number.  Once entered, a phone call is placed immediately by the system, connecting you to the support engineer.  The engineer confirms a few case details and lets you know that the case is in progress and what the expected resolution time is (I was asking to raise the limit of the Elastic IP addresses on the Virtual Private Cloud, and I was told it would be done in the next 15 to 30 minutes.  And it was done in 10!).  I have also received two emails – one confirming the opening of the case, with all the requested details, and another one notifying me that the work has been done, providing quick information on how to follow up, in case I needed to.

Overall experience was very smooth, fast, to the point, and very effective.  I never got lost.  I never had to figure anything out.  And my problem was attended to and resolved immediately.

I only wish more companies provided this level of support.  I’ll sure try too – but it’s a bar set high.

 

 

Top level domain nonsense and how it can break your stuff

Call me old school, but I really (I mean REALLY) don’t like the recent explosion of the top level domains.  I understand that most good names are taken in .com, .org, and .net zones, but do we really need all those .blue, .parts, and .yoga TLDs?

Why am I whining about all this all of a sudden?  I’ll tell you why.  Because a new top level domain – .aws – is about to be introduced, and it already broke something for me in a non-obvious manner.

I manage a few Virtual Private Clouds on the Amazon AWS.  Many of these use and rely on some hostname naming convention (yeah, I’m familiar with the pets vs. cattle idea).  Imagine you have a few servers, which are separated into generic infrastructure and client segments, like so:

  • bastion.aws.example.com
  • firewall.aws.example.com
  • lb.aws.example.com
  • web.client1.example.com
  • db.client1.example.com
  • web.client2.example.com
  • db.client2.example.com
  • … and so on.

Working with such long FQDNs (fully qualified domain names) isn’t very convenient.  So add “search example.com” to your /etc/resolv.conf file and now you can use short hostnames like firewall.aws and web.client1.  And life is beautiful …

… until one day, when you see the following:

user@bastion.aws$> ssh firewall.aws
Permission denied (publickey).

And that’s when your heart misses a beat, the world freezes, and you go: “WTF?”.  All kinds of thoughts are rushing through your head.  Is it a typo?  Am I in the right place? Did the server get compromised?  How’s that for a little panic …

Trying a few things here and there, you manage to get into the server from somewhere else.  You are very careful.  You are looking around for any traces of the break-in, but you see nothing.  You dig through the logs both on the server and off it.  Still nothing.  You dive into all those logwatch and cron messages in your Trash, which you were automatically deleting because things were working fine for so long.  There!  You find that cron was complaining that the backup script couldn’t get into this machine.  Uh-oh.  This has been happening for a few days now.  A black cloud of combined worry for the compromised machine and outdated backups kills the sunlight in your life.  Dammit!

Take a break to calm down.  Try to think clearly.  Don’t panic.  Stop assuming things, and start troubleshooting.

A few minutes later, you establish that the problem is not limited to that particular machine.  All your .aws hosts share this headache.  A few more minutes later, you learn that ‘ssh firewall.aws.example.com’ works fine, while ‘ssh firewall.aws’ still doesn’t.

That points toward a hostname resolution issue.  With that, it takes only a few more moments to see the following:

user@bastion.aws$> host firewall.aws
firewall.aws has address 127.0.53.53
firewall.aws mail is handled by 10 your-dns-needs-immediate-attention.aws.

Say what?  That’s not at all what I expected.  And what is that that I need to fix with my DNS?  Google search brings this beauty:

This is problably because the .dev and .local are now valid top level extensions.

Really? Who’s the genius behind that?  I thought people chose those specifically to make them internal.  So is there an .aws top level extension now too?  You bet there is!

Solution?  Well, as far as I am concerned, from this day onward, I don’t trust the brief hostnames anymore.  It’s FQDN or nothing.
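
For the curious, here is my understanding of why the short names stopped working, as a small Python sketch.  With the default ndots setting of 1 in resolv.conf, a name that already contains a dot is tried as-is first, and the search list is consulted only if that lookup fails.  Before .aws existed, the as-is lookup failed and the search domain saved the day; once the TLD showed up, the as-is lookup started returning 127.0.53.53, the special address ICANN uses to flag name collisions with new TLDs.  The function names are my own and the ordering logic is a simplification of what the real resolver does.

```python
import socket

# Address ICANN uses to signal a name collision with a newly delegated TLD.
NAME_COLLISION_ADDRESS = "127.0.53.53"


def candidate_names(name: str, search_domains: list, ndots: int = 1) -> list:
    """Return the names a stub resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name.rstrip(".")]
    searched = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + searched  # enough dots: try the name as-is first
    return searched + [name]      # too few dots: search list goes first


def is_name_collision(address: str) -> bool:
    """True if the address is the name collision marker."""
    return address == NAME_COLLISION_ADDRESS


def check_host(name: str) -> bool:
    """Resolve a name with the system resolver and report a collision."""
    return is_name_collision(socket.gethostbyname(name))


if __name__ == "__main__":
    print(candidate_names("firewall.aws", ["example.com"]))
    print(candidate_names("firewall", ["example.com"]))
```

Note that a single-label name like firewall would still go through the search list first, so it is specifically the dotted short names that collide with a TLD.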

CPU Steal Time. Now on Amazon EC2

Yesterday I wrote a blog post, trying to figure out what CPU steal time is and why it occurs.  The problem with that post was that I didn’t go deep enough.

I was looking at this issue from the point of view of a generic virtual machine.  The case that I had to deal with wasn’t exactly like that.  I saw the CPU steal time on the Amazon EC2 instance.  Assuming that these were just my neighbors acting up or Amazon having a temporary hardware issue was a wrong conclusion.

That’s because I didn’t know enough about Amazon EC2.  Well, I’ve learned a bunch since then, so here’s what I found.

Inside Amazon’s Cloud Computing Infrastructure

Here’s a little insight into Amazon’s cloud computing infrastructure:

Amazon operates at least 30 data centers in its global network, with another 10 to 15 on the drawing board.

How big is a data center?

A key decision in planning and deploying cloud capacity is how large a data center to build. Amazon’s huge scale offers advantages in both cost and operations. Hamilton said most Amazon data centers house between 50,000 and 80,000 servers, with a power capacity of between 25 and 30 megawatts.

So, how many servers does the Amazon AWS run?

So how many servers does Amazon Web Services run? The descriptions by Hamilton and Vogels suggest the number is at least 1.5 million. Figuring out the upper end of the range is more difficult, but could range as high as 5.6 million, according to calculations by Timothy Prickett Morgan at the Platform.
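
The “at least 1.5 million” figure appears consistent with simply multiplying the low-end numbers quoted above.  Here is that back-of-the-envelope arithmetic as a tiny Python sketch; the function name is mine, and the 5.6 million upper estimate comes from a separate calculation that I am not reproducing here.

```python
def estimate_servers(data_centers: int, servers_per_data_center: int) -> int:
    """Rough fleet size: number of data centers times servers in each."""
    return data_centers * servers_per_data_center


if __name__ == "__main__":
    # Inputs quoted above: at least 30 data centers, 50,000 to 80,000 servers each.
    print(estimate_servers(30, 50_000))
    print(estimate_servers(30, 80_000))
```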

Cloud computing price war

Now this looks like a straight up war!  Less than a day apart, both Google and Amazon announced yet another price drop on their services.  TechCrunch sums up Google’s price drop as follows:

Google Compute Engine is seeing a 32 percent reduction in prices across all regions, sizes and classes. App Engine prices are down 30 percent, and the company is also simplifying its price structure.

[…]

The price of cloud storage is dropping a whopping 68 percent to just $0.026/month per gigabyte and $0.2/month per gigabyte/DRA.

[…]

BigQuery, Google’s database for doing big data analysis, is getting the largest price drop at 85 percent. The team reduced per-gigabyte storage pricing from $0.08/GB to $0.026/GB, a 68 percent drop, and interactive queries now cost $5/TB instead of $35/TB. Batch queries now also cost $5/TB instead of the previous $20/TB.

The Amazon Web Services Blog provides comparison tables between old and new prices, which are quite similar.  And they also note the following:

If you’ve been reading this blog for an extended period of time you know that we reduce prices on our services from time to time, and today’s announcement serves as the 42nd price reduction since 2008.

 

Amazon package delivering drones

Holy crap! This is just crazy!  Amazon is working on a project to use drones for ultra fast package delivery (2.5 kg in less than half an hour):

He says we’re four or five years from drones being able to deliver small packages right to your house, largely because the company has to work with the FAA to make sure it’s legally allowed to run the Prime Air program — Amazon doesn’t have Zookal’s luxury of operating in Australia without the FAA’s regulatory oversight.

I always regarded Amazon as an innovative company – but this is in a category by itself.  Welcome to the future, once again.