I’ve just spent three hours (!!!) trying to troubleshoot why sudo was misbehaving on a brand new CentOS 7 server. I was doing the setup of two identical servers in parallel (for two different clients). One server worked as expected, the other one didn’t.
The thing I was trying to do was trivial – allow users in the wheel group to run sudo commands without a password. I’ve done it a gazillion times in the past, and probably at least a dozen times this week alone. Here’s what’s needed:
- Add user to the wheel group.
- Edit the /etc/sudoers file to uncomment the following line (as in: remove the hash comment character from the beginning of the line): # %wheel ALL=(ALL) NOPASSWD: ALL
- Enjoy!
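For reference, the two steps can be sketched in shell roughly as follows. The user name alice is a placeholder of mine, and the edit is demonstrated on a scratch copy with made-up sample contents rather than on the live file. On a real server the same change is safer to make through visudo.

```shell
# Step 1 (as root): add the user to the wheel group.
# "alice" is a hypothetical user name; substitute the real account.
# usermod -aG wheel alice

# Step 2: uncomment the NOPASSWD line. Shown on a scratch file holding
# two sample lines; on a real system, make this edit with visudo.
scratch="$(mktemp)"
printf '%s\n' \
  '## Same thing without a password' \
  '# %wheel ALL=(ALL) NOPASSWD: ALL' > "$scratch"

# Strip the leading "#" (and any whitespace after it) from the
# %wheel NOPASSWD line only; other comment lines are left alone.
result="$(sed 's/^#[[:space:]]*\(%wheel[[:space:]].*NOPASSWD:.*\)$/\1/' "$scratch")"

printf '%s\n' "$result"
```

The pattern insists on %wheel right after the hash, so ordinary comment lines pass through untouched.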
Imagine my surprise when it only worked on one server and not on the other. I dug deep and wide. Took a break. And dug again. Then I summoned the great troubleshooting powers of my brother. But even those didn’t help.
Lots of logging, diff-ing, strace-ing, swearing and hair pulling later, the problem was found and fixed. The issue was due to two separate reasons.
Reason 1: /etc/sudoers syntax uses the hash character (#) for two different purposes.
- For comments, which there are plenty of in the file.
- For the “#include” and “#includedir” directives, which include other files into the configuration.
The default /etc/sudoers file is full of lengthy comments. Just to give you an idea:
(root@host ~)# wc -l /etc/sudoers
118 /etc/sudoers
(root@host ~)# grep -v '^#' /etc/sudoers | grep -v '^$' | wc -l
12
Yup. 118 lines in total vs. 12 lines of configuration (comments and empty lines removed). Like with banner blindness, this causes comment blindness. Especially towards the end of the file. Especially if you’ve seen this file a billion times before.
And that’s where the problem starts. Right at the bottom of the file, there are these two lines:
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d
Interesting, right? Usually there is nothing in the /etc/sudoers.d/ folder on a brand new CentOS box. But even if there was something, by now you’d assume that the include of the folder is commented out. Much like that wheel group configuration I mentioned earlier. I found it by accident, while reading the sudoers(5) manual page, trying to find out if there are any other locations or defaults for included configurations. About 600 lines into the manual, there is this:
To include /etc/sudoers.local from within /etc/sudoers we
would use the following line in /etc/sudoers:
#include /etc/sudoers.local
When sudo reaches this line it will suspend processing of
the current file (/etc/sudoers) and switch to
/etc/sudoers.local.
So that comment is not a comment at all, but an include of the folder. That’s the first part of the problem.
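One way to guard against this kind of comment blindness is to filter out comments with a pattern that spares the include directives. Note that the grep -v '^#' count from earlier throws away the #includedir line along with the real comments. Here is a rough sketch, run against a small sample file rather than the real /etc/sudoers. The active_lines helper name and the sample contents are mine, and the pattern ignores the rarer case where a hash followed by digits denotes a uid.

```shell
# Hypothetical helper: print the lines of a sudoers-style file that sudo
# acts on, i.e. everything except blank lines and true comments.
# Lines starting with #include or #includedir are kept.
active_lines() {
  grep -E '^#include(dir)?[[:space:]]|^[[:space:]]*[^#[:space:]]' "$1"
}

# Sample file with made-up contents modelled on the default layout.
sample="$(mktemp)"
printf '%s\n' \
  '## Allows people in group wheel to run all commands' \
  '%wheel ALL=(ALL) ALL' \
  '' \
  '# %wheel ALL=(ALL) NOPASSWD: ALL' \
  '## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)' \
  '#includedir /etc/sudoers.d' > "$sample"

active_lines "$sample"
```

On the sample, only the %wheel rule and the #includedir line survive the filter, which puts the include right next to the rules it can affect.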
Reason 2: Windows Azure Linux Agent
As I mentioned above, the servers aren’t part of my infrastructure – they were provided by the clients. I was basically given an IP address, a username and a password for each server – which is usually all I need. In most cases I don’t really care where the server is hosted and what’s the hosting company in use. Turns out, I should.
The server with the problem was hosted on the Microsoft Azure cloud infrastructure. I assumed I was working off a brand new vanilla CentOS 7 box, but in fact I wasn’t. Microsoft adds packages to the default install. One of the packages that it adds is the Windows Azure Linux Agent, which “rpm -qi WALinuxAgent” describes as follows:
The Windows Azure Linux Agent supports the provisioning and running of Linux VMs in the Microsoft Azure cloud. This package should be installed on Linux disk images that are built to run in the Microsoft Azure environment.
Harmless, right? Well, not so much. What I found in the /etc/sudoers.d/ folder was a little file called waagent, which contained a different sudo configuration for the user I had a problem with.
During the troubleshooting process, I created a new test user, added the account to the wheel group and found out that it was working fine. From there, I needed to find the differences between the two users.
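In hindsight, searching the drop-in folder for the user name would have shortened the hunt considerably. Something along these lines, where the find_user_rules helper, the user names and the file contents are all made up for illustration:

```shell
# Hypothetical helper: list files under a sudoers drop-in directory that
# mention the given user name as a whole word.
# Usage: find_user_rules <user> [dir]   (dir defaults to /etc/sudoers.d)
find_user_rules() {
  grep -rlw -- "$1" "${2:-/etc/sudoers.d}" 2>/dev/null
}

# Demonstration on a scratch directory with invented contents.
demo_dir="$(mktemp -d)"
printf '%s\n' 'azureuser ALL = (ALL) ALL' > "$demo_dir/waagent"
printf '%s\n' 'deploy ALL=(ALL) NOPASSWD: /usr/bin/systemctl' > "$demo_dir/deploy"

# Prints the path of the one file that mentions azureuser.
find_user_rules azureuser "$demo_dir"
```

On the real system, running sudo -l -U with the user name as root lists the rules sudo ends up applying to that user, which makes for a quick comparison between two accounts.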
I guess the user I was using initially was created by the client’s system administrator through the Microsoft Azure web interface. A quick Google search brings up this page from the Azure documentation:
By default, the root user is disabled on Linux virtual machines in Azure. Users can run commands with elevated privileges by using the sudo command. However, the experience may vary depending on how the system was provisioned.
- SSH key and password OR password only – the virtual machine was provisioned with either a certificate (.CER file) or SSH key as well as a password, or just a user name and password. In this case sudo will prompt for the user’s password before executing the command.
- SSH key only – the virtual machine was provisioned with a certificate (.cer, .pem, or .pub file) or SSH key, but no password. In this case sudo will not prompt for the user’s password before executing the command.
I checked the user’s home folder and found no keys in there, so I think it was provisioned using the first option, with password only.
I think Microsoft should make it much more obvious that the system behavior might be different. Amazon AWS provides a good example to follow. When you log in to an Amazon Linux AMI instance, you see a message of the day (motd) banner, which looks like this:
$ ssh server.example.com
Last login: Tue Apr 5 17:25:38 2016 from 127.0.0.1
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2016.03-release-notes/
(user@server.example.com)$
It’s dead obvious that you are now on an Amazon EC2 machine and you should adjust your expectations and assumptions accordingly.
Back to the problem at hand: deleting the /etc/sudoers.d/waagent file immediately solved it. To avoid similar issues in the future, the #includedir directive can be moved further up in the file and surrounded by more visible comments. Like, maybe, an ASCII art skull, or something.
With that, I am off to heavy drinking and recovery… Stay sane!