Year: 2016

Support lesson to learn from Amazon AWS

I’ve said a million times how happy I am with Amazon AWS. Today I also want to share a positive lesson to learn from their technical support. It’s the second time I’ve contacted them over the last year and a half, and it’s the second time I am amazed at how good well it works.

In my experience, technical support departments usually rely on one primary communication channel – be that a telephone, an email, a ticketing system, or a live chat. The other channels are often just routed or converted into the main one, or, even, completely ignored. But each one of those has it’s benefits and side effects.

Telephone provides the most immediate connectivity, and a much valued option of the human interaction. But the communication is verbal, often without the paper trail. It makes it difficult to carbon copy (CC) people on the conversation or review exactly what has been said. It is also very free form, unstructured.

Live chat is also free form and unstructured, but it’s written, so transcripts are easily available. It also helps with the carbon copy, but only on the receiving end – supervisors or field experts can often be included in the conversation, but adding somebody from the requesting side is rarely supported.

Email makes it easy to carbon copy people on both ends. It provides the paper trail, but often lacks the immediate response factor. And it’s still unstructured, making it difficult to figure out what was requested, what has been discussed and whether or not there was any resolution. (Have you ever been a part of a lengthy multi-lingual conversation about, what turned out to be, multiple issues in the same thread?)

Ticketing/support systems help to structure the conversation and make it follow a certain workflow. But they often lack humanity and, much like emails, the immediate response.

Now, what Amazon AWS support has done is a beautiful combination of a ticketing system and a phone. You start off with the ticketing system – login, create a new support case, providing all the necessary information, and optionally CC other people from a single short form. The moment you submit it, the web page asks for your phone number. Once entered, a phone call is placed immediately by the system, connecting you to the support engineer. The engineer confirms a few case details and lets you know that the case is in progress and expected resolution time (I was asking to raise the limit of the Elastic IP addresses on the Virtual Private Cloud, and I was told it will be done in the next 15 to 30 minute. And it was done in 10!). I have also received two emails – one confirming the opening of the case, with all the requested details, and another one notifying me that the work has been done, providing quick information on how to follow up, in case I needed to.

Overall experience was very smooth, fast, to the point, and very effective. I never got lost. I never had to figure anything out. And my problem was attended to and resolved immediately.

I only wish more companies provided this level of support. I’ll sure try too – but it’s a bar set high.

Ansible safety net for DNS wildcard hosts

After using Ansible for only a week, I am deeply in love. I am doing more and more with less and less, and that’s exactly how I want my automation.

Today I had to solve an interesting problem. Ansible operates, based on the host and group inventory. As I mentioned before, I am now always relying on FQDNs (fully qualified domain names) for my host names. But what happens when DNS wildcards come into play with things like load balancers and reverse proxies Consider an example:

Nginx configured as reverse proxy on the machine proxy1.example.com with 10.0.0.10 IP address.
DNS wildcard is in place: *.example.com 3600 IN CNAME proxy1.example.com.
Ansible contains proxy1.example.com in host inventory and a playbook to setup the reverse proxy with Nginx.
Ansible contains a few other hosts in inventory and a playbook to setup Nginx as a web server.
Somebody adds a new host to inventory: another-web-server.example.com, without specifying any other host details, like ansible_ssh_host variable. And he also forgets to update the DNS zone with a new A or CNAME record.

Now, Ansible play is executed for the web servers configuration. All previously existing machines are fine. But the new machine’s another-web-server.example.com host name resolves to proxy1.example.com, which is where Ansible connects and runs the Nginx setup, overwriting the existing configuration, triggering a service restart, and screwing up your life. Just kidding, of course. :) It’ll be trivial to find out what happened. Fixing the Nginx isn’t too difficult either. Especially if you have backups in place. But it’s still better to avoid the whole mess altogether.

To help prevent these cases, I decided to create a new safety net role. Given a variable like:

---
# Aliased IPs is a list of hosts, which can be reached in 
# multiple ways due to DNS wildcards. Both IPv4 and IPv6 
# can be used. The hostname value is the primary hostname 
# for the IP - any other inventory hostname having any of 
# these IPs will cause a failure in the play.
aliased_ips:
  "10.0.0.10": "proxy1.example.com"
  "192.168.0.10": "proxy1.example.com"

And the following code in the role’s tasks/main.yml:

---
- debug: msg="Safety net - before IPv4"

- name: Check all IPv4 addresses against aliased IPs
  fail: msg="DNS is not configured for host '{{ inventory_hostname}}'. It resolves to '{{ aliased_ips[ item.0 ] }}'."
  when: "('{{ item[0] }}' == '{{ item[1] }}') and ('{{ inventory_hostname }}' != '{{ aliased_ips[ item.0 ] }}')"
  with_nested:
    - "{{ aliased_ips | default({}) }}"
    - "{{ ansible_all_ipv4_addresses }}"

- debug: msg="Safety net - after IPv4 and before IPv6"

- name: Check all IPv6 addresses against aliased IPs
  fail: msg="DNS is not configured for host '{{ inventory_hostname}}'. It resolves to '{{ aliased_ips[ item.0 ] }}'."
  when: "('{{ item[0] }}' == '{{ item[1] }}') and ('{{ inventory_hostname }}' != '{{ aliased_ips[ item.0 ] }}')"
  with_nested:
    - "{{ aliased_ips | default({}) }}"
    - "{{ ansible_all_ipv6_addresses }}"

- debug: msg="Safety net - after IPv6"

the safety net is in place. The first check will connect to the remote server, get the list of all configured IPv4 addresses, and then compare each one with each IP address in the aliased_ips variable. For every matching pair, it will check if the remote server’s host name from the inventory file matches the host name from the aliased_ips value for the matched IP address. If the host names match, it’ll continue. If not – a failure in the play occurs (Ansible speak for thrown exception). Other tasks will continue execution for other hosts, but nothing else will be done during this play run for this particular host.

The second check will do the same but with IPv6 addresses. You can mix and match both IPv4 and IPv6 in the same aliased_ips variable. And Ansible is smart enough to exclude the localhost IPs too, so things shouldn’t break too much.

I’ve tested the above and it seems to work well for me.

There is a tiny issue with elegance here though: host name to IP mappings are already configured in the DNS zone – duplicating this configuration in the aliased_ips variable seems annoying. Personally, I don’t have that many reverse proxies and load balancers to handle, and they don’t change too often either, so I don’t mind. Also, there is something about relying on DNS while trying to protect against DNS mis-configuration that rubs me the wrong way. But if you are the adventurous type, have a look at the Ansible’s dig lookup, which you can use to fetch the IP addresses from the DNS server of your choice.

As always, if you see any potential issues with the above or know of a better way to solve it, please let me know.

SugarCRM, RoundCube and Request Tracker integration on a single domain

In my years of working as a system administrator I’ve done some pretty complex setups and integration solutions, but I don’t think I’ve done anything as twisted as this one recently. The setup is part of the large and complex client project, built on their infrastructure, with quite a few requirements and a whole array of limitations. Several systems were integrated together, but in this particular post I’m focusing primarily on the SugarCRM, RoundCube and Request Tracker. Also, I am not going to cover the integration to full extent – just the email related parts.

Continue reading SugarCRM, RoundCube and Request Tracker integration on a single domain

WhatsApp introduces end-to-end encryption for everything

WhatsApp introduces end-to-end encryption for all communications – chats, pictures, videos, etc. I’m sure it’ll help them get more individuals and businesses on the network, as well as probably ban the app in a handful of countries.

WhatsApp has always prioritized making your data and communication as secure as possible. And today, we’re proud to announce that we’ve completed a technological development that makes WhatsApp a leader in protecting your private communication: full end-to-end encryption. From now on when you and your contacts use the latest version of the app, every call you make, and every message, photo, video, file, and voice message you send, is end-to-end encrypted by default, including group chats.

The idea is simple: when you send a message, the only person who can read it is the person or group chat that you send that message to. No one can see inside that message. Not cybercriminals. Not hackers. Not oppressive regimes. Not even us. End-to-end encryption helps make communication via WhatsApp private – sort of like a face-to-face conversation.

Absolute stupidity of include directive in /etc/sudoers, and Microsoft Azure

I’ve just spent three hours (!!!) trying to troubleshoot why sudo was misbehaving on a brand new CentOS 7 server. I was doing the setup of two identical servers in parallel (for two different clients). One server worked as expected, the other one didn’t.

The thing I was trying to do was trivial – allow users in the wheel group execution of sudo commands without password. I’ve done it a gadzillion times in the past, and probably at least a dozen times just this week alone. Here’s what’s needed:

Add user to the wheel group.
Edit /etc/sudoers file to uncommen tthe line (as in: remove the hash comment character from the beginning of the file): # %wheel ALL=(ALL) NOPASSWD: ALL
Enjoy!

Imagine my surprise when it only worked on one server and not on the other. I’ve dug deep and wide. Took a break. And dug again. Then, I’ve summoned the great troubleshooting powers of my brother. But even those didn’t help.

Lots of logging, diff-ing, strace-ing, swearing and hair pulling later, the problem was found and fixed. The issue was due to two separate reasons.

Reason 1: /etc/sudoers syntax uses the hash character (#) for two different purposes.

For comments, which there are plenty of in the file.
For the “#include” and “#includedir” directives, which include other files into the configuration.

The default /etc/sudoers file is full of lengthy comments. Just to give you and idea:

(root@host ~)# wc -l /etc/sudoers
118 /etc/sudoers
(root@host ~)# grep -v '^#' /etc/sudoers | grep -v '^$' | wc -l
12

Yup. 118 lines in total vs. 12 lines of configuration (comments and empty lines removed). Like with banner blindness, this causes comment blindness. Especially towards the end of the file. Especially if you’ve seen this file a billion times before.

And that’s where the problem starts. Right at the bottom of the file, there are these two lines:

##Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

Interesting, right? Usually there is nothing in the /etc/sudoers.d/ folder on the brand new CentOS box. But even if there was something, by now you’d assume that the include of the folder is commented out. Much like that wheel group configuration I mentioned earlier. I found it by accident, while reading sudoers(5) manual page, trying to find out if there are any other locations or defaults for included configurations. About 600 lines into the manual, there is this:

To include /etc/sudoers.local from within /etc/sudoers we
would use the following line in /etc/sudoers:

#include /etc/sudoers.local

When sudo reaches this line it will suspend processing of
the current file (/etc/sudoers) and switch to
/etc/sudoers.local.

So that comment is not a comment at all, but an include of the folder. That’s the first part of the problem.

Reason #2: Windows Azure Linux Agent

As I mentioned above, the servers aren’t part of my infrastructure – they were provided by the clients. I was basically given an IP address, a username and a password for each server – which is usually all I need. In most cases I don’t really care where the server is hosted and what’s the hosting company in use. Turns out, I should.

The server with the problem was hosted on the Microsoft Azure cloud infrastructure. I assumed I was working off a brand new vanilla CentOS 7 box, but in fact I wasn’t. Microsoft adds packages to the default install. On of the packages that it adds is the Windows Azure Linux Agent, which “rpm -qi WALinuxAgent” describes as following:

The Windows Azure Linux Agent supports the provisioning and running of Linux VMs in the Microsoft Azure cloud. This package should be installed on Linux disk images that are built to run in the Microsoft Azure environment.

Harmless, right? Well, not so much. What I found in the /etc/sudoers.d/ folder was a little file, called waagent, which included the different sudo configuration for the user which I had a problem with.

During the troubleshooting process, I’ve created a new test user, added the account to the wheel group and found out that it was working fine. From there, I needed to find the differences between the two users.

I guess, the user that I was using initially was created by the client’s system administrator using Microsoft Azure web interface. A quick Google search brings this page from the Azure documentation:

By default, the root user is disabled on Linux virtual machines in Azure. Users can run commands with elevated privileges by using the sudo command. However, the experience may vary depending on how the system was provisioned.

SSH key and password OR password only – the virtual machine was provisioned with either a certificate (.CER file) or SSH key as well as a password, or just a user name and password. In this case sudo will prompt for the user’s password before executing the command.

SSH key only – the virtual machine was provisioned with a certificate (.cer, .pem, or .pubfile) or SSH key, but no password. In this case sudo will not prompt for the user’s password before executing the command.

I checked the user’s home folder and found no keys in there, so I think it was provisioned using the first option, with password only.

I think Microsoft should make it much more obvious that the system behavior might be different. Amazon AWS provides a good example to follow. When you login into Amazon AMI instance, you see a message of the day (motd) banner, which looks like this:

$ ssh server.example.com
Last login: Tue Apr  5 17:25:38 2016 from 127.0.0.1

__|  __|_  )
_|  (     /   Amazon Linux AMI
___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2016.03-release-notes/

([email protected])$

It’s dead obvious that you are now on the Amazon EC2 machine and you should adjust your ~~expectations~~ assumptions accordingly.

Deleting the file immediately solved the problem. To avoid similar issues in the future, #includedir directive can be moved further up in the file, and surrounded by more visible comments. Like, maybe, an ASCII art skull, or something.

With that, I am off to heavy drinking and recovery… Stay sane!