Red Hat Satellite

Here’s something I didn’t know about – Red Hat Satellite.  From the FAQ page:

Red Hat® Satellite is a system management solution that makes Red Hat infrastructure easier to deploy, scale, and manage across physical, virtual, and cloud environments. Red Hat Satellite enables users to provision, configure, and update systems to help ensure that they are running efficiently andsecurely, and remain compliant with relevant standards. By automating most tasks related to maintaining systems, Red Hat Satellite helps organizations increase efficiency, reduce operational costs, and enable IT to better respond to strategic business needs.

Now Red Hat’s acquisition of Ansible makes even more sense.  I guess, their satellite is looking for the galaxy.

Ansible safety net for DNS wildcard hosts

After using Ansible for only a week, I am deeply in love.  I am doing more and more with less and less, and that’s exactly how I want my automation.

Today I had to solve an interesting problem.  Ansible operates, based on the host and group inventory.  As I mentioned before, I am now always relying on FQDNs (fully qualified domain names) for my host names.  But what happens when DNS wildcards come into play with things like load balancers and reverse proxies  Consider an example:

  1. Nginx configured as reverse proxy on the machine proxy1.example.com with 10.0.0.10 IP address.
  2. DNS wildcard is in place: *.example.com 3600 IN CNAME proxy1.example.com.
  3. Ansible contains proxy1.example.com in host inventory and a playbook to setup the reverse proxy with Nginx.
  4. Ansible contains a few other hosts in inventory and a playbook to setup Nginx as a web server.
  5. Somebody adds a new host to inventory: another-web-server.example.com, without specifying any other host details, like ansible_ssh_host variable.  And he also forgets to update the DNS zone with a new A or CNAME record.

Now, Ansible play is executed for the web servers configuration.  All previously existing machines are fine.  But the new machine’s another-web-server.example.com host name resolves to proxy1.example.com, which is where Ansible connects and runs the Nginx setup, overwriting the existing configuration, triggering a service restart, and screwing up your life.  Just kidding, of course. :)  It’ll be trivial to find out what happened.  Fixing the Nginx isn’t too difficult either.  Especially if you have backups in place.  But it’s still better to avoid the whole mess altogether.

To help prevent these cases, I decided to create a new safety net role.  Given a variable like:

---
# Aliased IPs is a list of hosts, which can be reached in 
# multiple ways due to DNS wildcards. Both IPv4 and IPv6 
# can be used. The hostname value is the primary hostname 
# for the IP - any other inventory hostname having any of 
# these IPs will cause a failure in the play.
aliased_ips:
  "10.0.0.10": "proxy1.example.com"
  "192.168.0.10": "proxy1.example.com"

And the following code in the role’s tasks/main.yml:

---
- debug: msg="Safety net - before IPv4"

- name: Check all IPv4 addresses against aliased IPs
  fail: msg="DNS is not configured for host '{{ inventory_hostname}}'. It resolves to '{{ aliased_ips[ item.0 ] }}'."
  when: "('{{ item[0] }}' == '{{ item[1] }}') and ('{{ inventory_hostname }}' != '{{ aliased_ips[ item.0 ] }}')"
  with_nested:
    - "{{ aliased_ips | default({}) }}"
    - "{{ ansible_all_ipv4_addresses }}"

- debug: msg="Safety net - after IPv4 and before IPv6"

- name: Check all IPv6 addresses against aliased IPs
  fail: msg="DNS is not configured for host '{{ inventory_hostname}}'. It resolves to '{{ aliased_ips[ item.0 ] }}'."
  when: "('{{ item[0] }}' == '{{ item[1] }}') and ('{{ inventory_hostname }}' != '{{ aliased_ips[ item.0 ] }}')"
  with_nested:
    - "{{ aliased_ips | default({}) }}"
    - "{{ ansible_all_ipv6_addresses }}"

- debug: msg="Safety net - after IPv6"

the safety net is in place.  The first check will connect to the remote server, get the list of all configured IPv4 addresses, and then compare each one with each IP address in the aliased_ips variable.  For every matching pair, it will check if the remote server’s host name from the inventory file matches the host name from the aliased_ips value for the matched IP address.  If the host names match, it’ll continue.  If not – a failure in the play occurs (Ansible speak for thrown exception).  Other tasks will continue execution for other hosts, but nothing else will be done during this play run for this particular host.

The second check will do the same but with IPv6 addresses.  You can mix and match both IPv4 and IPv6 in the same aliased_ips variable.  And Ansible is smart enough to exclude the localhost IPs too, so things shouldn’t break too much.

I’ve tested the above and it seems to work well for me.

There is a tiny issue with elegance here though: host name to IP mappings are already configured in the DNS zone – duplicating this configuration in the aliased_ips variable seems annoying.  Personally, I don’t have that many reverse proxies and load balancers to handle, and they don’t change too often either, so I don’t mind.  Also, there is something about relying on DNS while trying to protect against DNS mis-configuration that rubs me the wrong way.  But if you are the adventurous type, have a look at the Ansible’s dig lookup, which you can use to fetch the IP addresses from the DNS server of your choice.

As always, if you see any potential issues with the above or know of a better way to solve it, please let me know.

SugarCRM, RoundCube and Request Tracker integration on a single domain

In my years of working as a system administrator I’ve done some pretty complex setups and integration solutions, but I don’t think I’ve done anything as twisted as this one recently.  The setup is part of the large and complex client project, built on their infrastructure, with quite a few requirements and a whole array of limitations.  Several systems were integrated together, but in this particular post I’m focusing primarily on the SugarCRM, RoundCube and Request Tracker.  Also, I am not going to cover the integration to full extent – just the email related parts.

Continue reading SugarCRM, RoundCube and Request Tracker integration on a single domain

Absolute stupidity of include directive in /etc/sudoers, and Microsoft Azure

I’ve just spent three hours (!!!) trying to troubleshoot why sudo was misbehaving on a brand new CentOS 7 server.  I was doing the setup of two identical servers in parallel (for two different clients).   One server worked as expected, the other one didn’t.

The thing I was trying to do was trivial – allow users in the wheel group execution of sudo commands without password. I’ve done it a gadzillion times in the past, and probably at least a dozen times just this week alone.  Here’s what’s needed:

  1. Add user to the wheel group.
  2. Edit /etc/sudoers file to uncommen tthe line (as in: remove the hash comment character from the beginning of the file): # %wheel ALL=(ALL) NOPASSWD: ALL
  3. Enjoy!

Imagine my surprise when it only worked on one server and not on the other.  I’ve dug deep and wide.  Took a break. And dug again.  Then, I’ve summoned the great troubleshooting powers of my brother.  But even those didn’t help.

Lots of logging, diff-ing, strace-ing, swearing and hair pulling later, the problem was found and fixed.  The issue was due to two separate reasons.

Reason 1: /etc/sudoers syntax uses the hash character (#) for two different purposes.

  1. For comments, which there are plenty of in the file.
  2. For the “#include” and “#includedir” directives, which include other files into the configuration.

The default /etc/sudoers file is full of lengthy comments.  Just to give you and idea:

(root@host ~)# wc -l /etc/sudoers
118 /etc/sudoers
(root@host ~)# grep -v '^#' /etc/sudoers | grep -v '^$' | wc -l
12

Yup.  118 lines in total vs. 12 lines of configuration (comments and empty lines removed). Like with banner blindness, this causes comment blindness.  Especially towards the end of the file.  Especially if you’ve seen this file a billion times before.

And that’s where the problem starts.  Right at the bottom of the file, there are these two lines:

##Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

Interesting, right? Usually there is nothing in the /etc/sudoers.d/ folder on the brand new CentOS box. But even if there was something, by now you’d assume that the include of the folder is commented out. Much like that wheel group configuration I mentioned earlier. I found it by accident, while reading sudoers(5) manual page, trying to find out if there are any other locations or defaults for included configurations. About 600 lines into the manual, there is this:

To include /etc/sudoers.local from within /etc/sudoers we
would use the following line in /etc/sudoers:

#include /etc/sudoers.local

When sudo reaches this line it will suspend processing of
the current file (/etc/sudoers) and switch to
/etc/sudoers.local.

So that comment is not a comment at all, but an include of the folder.  That’s the first part of the problem.

Reason #2: Windows Azure Linux Agent

As I mentioned above, the servers aren’t part of my infrastructure – they were provided by the clients.  I was basically given an IP address, a username and a password for each server – which is usually all I need.  In most cases I don’t really care where the server is hosted and what’s the hosting company in use.  Turns out, I should.

The server with the problem was hosted on the Microsoft Azure cloud infrastructure.  I assumed I was working off a brand new vanilla CentOS 7 box, but in fact I wasn’t.  Microsoft adds packages to the default install.  On of the packages that it adds is the Windows Azure Linux Agent, which “rpm -qi WALinuxAgent” describes as following:

The Windows Azure Linux Agent supports the provisioning and running of Linux VMs in the Microsoft Azure cloud. This package should be installed on Linux disk images that are built to run in the Microsoft Azure environment.

Harmless, right? Well, not so much.  What I found in the /etc/sudoers.d/ folder was a little file, called waagent, which included the different sudo configuration for the user which I had a problem with.

During the troubleshooting process, I’ve created a new test user, added the account to the wheel group and found out that it was working fine.  From there, I needed to find the differences between the two users.

I guess, the user that I was using initially was created by the client’s system administrator using Microsoft Azure web interface.  A quick Google search brings this page from the Azure documentation:

By default, the root user is disabled on Linux virtual machines in Azure. Users can run commands with elevated privileges by using the sudo command. However, the experience may vary depending on how the system was provisioned.

  1. SSH key and password OR password only – the virtual machine was provisioned with either a certificate (.CER file) or SSH key as well as a password, or just a user name and password. In this case sudo will prompt for the user’s password before executing the command.
  2. SSH key only – the virtual machine was provisioned with a certificate (.cer, .pem, or .pubfile) or SSH key, but no password. In this case sudo will not prompt for the user’s password before executing the command.

I checked the user’s home folder and found no keys in there, so I think it was provisioned using the first option, with password only.

I think Microsoft should make it much more obvious that the system behavior might be different.  Amazon AWS provides a good example to follow.  When you login into Amazon AMI instance, you see a message of the day (motd) banner, which looks like this:

$ ssh server.example.com
Last login: Tue Apr  5 17:25:38 2016 from 127.0.0.1

__|  __|_  )
_|  (     /   Amazon Linux AMI
___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2016.03-release-notes/

(user@server.example.com)$

It’s dead obvious that you are now on the Amazon EC2 machine and you should adjust your expectations assumptions accordingly.

Deleting the file immediately solved the problem.  To avoid similar issues in the future, #includedir directive can be moved further up in the file, and surrounded by more visible comments.  Like, maybe, an ASCII art skull, or something.

ASCII skull

With that, I am off to heavy drinking and recovery… Stay sane!

 

Top level domain nonsense and how it can break your stuff

Call me old school, but I really (I mean REALLY) don’t like the recent explosion of the top level domains.  I understand that most good names are taken in .com, .org, and .net zones, but do we really need all those .blue, .parts, and .yoga TLDs?

Why am I whining about all this all of a sudden?  I’ll tell you why.  Because a new top level domain – .aws – is about to be introduced, and it already broke something for me in a non-obvious manner.

aws

I manage a few Virtual Private Clouds on the Amazon AWS.  Many of these use and rely on some hostname naming convention (yeah, I’m familiar with the pets vs. cattle idea).  Imagine you have a few servers, which are separated into generic infrastructure and client segments, like so:

  • bastion.aws.example.com
  • firewall.aws.example.com
  • lb.aws.example.com
  • web.client1.example.com
  • db.client1.example.com
  • web.client2.example.com
  • db.client2.example.com
  • … and so on.

Working with such long FQDNs (fully qualified domain names) isn’t very convenient.  So add “search example.com” to your /etc/resolve.conf file and now you can use short hostnames like firewall.aws and web.client1.  And life is beautiful …

… until one day, when you see the following:

user@bastion.aws$> ssh firewall.aws
Permission denied (publickey).

And that’s when your heart misses a beat, the world freezes, and you go: “WTF?”.  All kinds of thoughts are rushing through your head.  Is it a typo?  Am I in the right place? Did the server get compromised?  How’s that for a little panic …

Trying a few things here and there, you manage to get into the server from somewhere else.  You are very careful.  You are looking around for any traces of the break-in, but you see nothing.  You dig through the logs both on the server and off it.  Still nothing.  You can dive into all those logwatch and cron messages in your Trash, that you were automatically deleting, cause things were working fine for so long.  There!  You find that cron was complaining that backup script couldn’t get into this machine.  Uh-oh.  This was happening for a few days now.  A black cloud of combined worry for the compromised machine and outdated backup kills the sunlight in your life.  Dammit!

Take a break to calm down.  Try to think clearly.  Don’t panic.  Stop assuming things, and start troubleshooting.

A few minutes later, you establish that the problem is not limited to that particular machine.  All your .aws hosts share this headache.  A few more minutes later, you learn that ‘ssh firewall.aws.example.com’ works fine, while ‘ssh firewall.aws’ still doesn’t.

That points toward the hostname resolution issue.   With that, it takes only a few more moments to see the following:

user@bastion.aws$> host firewall.aws
firewall.aws has address 127.0.53.53
firewall.aws mail is handled by 10 your-dns-needs-immediate-attention.aws.

Say what?  That’s not at all what I expected.  And what is that that I need to fix with my DNS?  Google search brings this beauty:

This is problably because the .dev and .local are now valid top level extensions.

Really? Who’s the genius behind that?  I thought people chose those specifically to make them internal.  So is there an .aws top level extension now too?  You bet there is!

Solution?  Well, as far as I am concerned, from this day onward, I don’t trust the brief hostnames anymore.  It’s FQDN or nothing.