Rocket.Chat – the ultimate self-hosted open source chat platform

Chat is becoming more and more important for team communication and collaboration (what is ChatOps?).  Old school applications like Skype are being replaced with modern, web-based chat platforms that provide group/room and one-on-one chats, file uploads, screen sharing, voice and video communication, API integrations, and more.  There are plenty of solutions to choose from, too.

Traditionally, self-hosted solutions were difficult to set up and maintain, and were lacking in integration options.  So many teams chose to go for the third-party hosted approach.  That is not very exciting for companies that deal with sensitive data, though.

As mentioned before, at work we are using HipChat.  It’s nice, it’s free, and it integrates well.  Lately, there has been a lot of hype around Slack, which I tried, but didn’t particularly like.


Today, however, I came across a very nice option, which seems to be a breeze to self-host and maintain – Rocket.Chat.  It’s modern – written in JavaScript, it has a long list of features, and there is a vibrant community around it.

You can try the live demo, deploy it to your infrastructure via a gadzillion different methods, or read the beautiful documentation.  And there’s a rumor of a HipChat and Slack import tool, so you won’t have to start from scratch…

Let me know what you think.

WordPress 4.5 “Coleman”

WordPress 4.5 “Coleman”, the newest WordPress version, has been released (I’ve just upgraded).  Some of the changes included in this release are:

  • New and improved user interface for editing links in posts and pages.
  • More Markdown-like shortcuts for formatting text (now with code and horizontal lines).
  • Logo support in themes.
  • Much improved image optimization (initially expected in WordPress 4.4).
  • Better embed templates.
  • Updates to underlying libraries, such as jQuery, Backbone, and Underscore.

If you already manage a WordPress website, you’ll find the update notification in your admin area.  If not, go and download it.

Open Source software is so reassuring …

There’s nothing like working on a problem for a few days and then arriving at a reassuring code snippet like this:

sub PSGIApp {
    my $self = shift;

    # XXX: this is fucked
    require HTML::Mason::CGIHandler;
    require HTML::Mason::PSGIHandler::Streamy;
    my $h = RT::Interface::Web::Handler::NewHandler('HTML::Mason::PSGIHandler::Streamy');

    $self->InitSessionDir;

    my $mason = sub {
        my $env = shift;

        # mod_fastcgi starts with an empty %ENV, but provides it on each
        # request.  Pick it up and cache it during the first request.
        $ENV{PATH} //= $env->{PATH};

        # HTML::Mason::Utils::cgi_request_args uses $ENV{QUERY_STRING} to
        # determine if to call url_param or not
        # (see comments in HTML::Mason::Utils::cgi_request_args)
        $ENV{QUERY_STRING} = $env->{QUERY_STRING};

The first comment is misleading. It throws you off. Almost makes you close the file and go somewhere else. But that’s just a little frustration from the last few days. The solution to my problem is here too… And that’s when the warm, cozy feeling I have for Open Source Software kicks in.

P.S.: both the problem and the solution will be posted separately.


Support lesson to learn from Amazon AWS

I’ve said a million times how happy I am with Amazon AWS.  Today I also want to share a positive lesson to learn from their technical support.  It’s the second time I’ve contacted them over the last year and a half, and the second time I am amazed at how well it works.

In my experience, technical support departments usually rely on one primary communication channel – be that a telephone, an email, a ticketing system, or a live chat.  The other channels are often just routed or converted into the main one, or even completely ignored.  But each one of these has its benefits and side effects.

The telephone provides the most immediate connectivity, and the much-valued option of human interaction.  But the communication is verbal, often without a paper trail.  That makes it difficult to carbon copy (CC) people on the conversation or to review exactly what has been said.  It is also very free-form and unstructured.

Live chat is also free-form and unstructured, but it’s written, so transcripts are easily available.  It also helps with carbon copying, but only on the receiving end – supervisors or field experts can often be included in the conversation, but adding somebody from the requesting side is rarely supported.

Email makes it easy to carbon copy people on both ends.  It provides the paper trail, but often lacks the immediate response factor.  And it’s still unstructured, making it difficult to figure out what was requested, what has been discussed, and whether or not there was any resolution.  (Have you ever been part of a lengthy multi-lingual conversation about what turned out to be multiple issues in the same thread?)

Ticketing/support systems help to structure the conversation and make it follow a certain workflow.  But they often lack humanity and, much like emails, the immediate response.

Now, what Amazon AWS support has done is a beautiful combination of a ticketing system and a phone.  You start off with the ticketing system – log in, create a new support case, provide all the necessary information, and optionally CC other people, all from a single short form.  The moment you submit it, the web page asks for your phone number.  Once entered, a phone call is placed immediately by the system, connecting you to the support engineer.  The engineer confirms a few case details and lets you know that the case is in progress and what the expected resolution time is (I was asking to raise the limit of Elastic IP addresses on a Virtual Private Cloud, and I was told it would be done in the next 15 to 30 minutes.  It was done in 10!).  I also received two emails – one confirming the opening of the case, with all the requested details, and another notifying me that the work had been done, with quick information on how to follow up in case I needed to.

The overall experience was smooth, fast, to the point, and very effective.  I never got lost.  I never had to figure anything out.  And my problem was attended to and resolved immediately.

I only wish more companies provided this level of support.  I’ll sure try too – but it’s a bar set high.


Ansible safety net for DNS wildcard hosts

After using Ansible for only a week, I am deeply in love.  I am doing more and more with less and less, and that’s exactly how I want my automation.

Today I had to solve an interesting problem.  Ansible operates based on the host and group inventory.  As I mentioned before, I am now always relying on FQDNs (fully qualified domain names) for my host names.  But what happens when DNS wildcards come into play, with things like load balancers and reverse proxies?  Consider an example:

  1. Nginx is configured as a reverse proxy on the machine proxy1.example.com with the IP address 10.0.0.10.
  2. A DNS wildcard is in place: *.example.com 3600 IN CNAME proxy1.example.com.
  3. Ansible contains proxy1.example.com in the host inventory and a playbook to set up the reverse proxy with Nginx.
  4. Ansible contains a few other hosts in the inventory and a playbook to set up Nginx as a web server.
  5. Somebody adds a new host to the inventory – another-web-server.example.com – without specifying any other host details, like the ansible_ssh_host variable.  And he also forgets to update the DNS zone with a new A or CNAME record.

Now, the Ansible play for the web server configuration is executed.  All previously existing machines are fine.  But the new machine’s host name, another-web-server.example.com, resolves to proxy1.example.com, which is where Ansible connects and runs the Nginx setup – overwriting the existing configuration, triggering a service restart, and screwing up your life.  Just kidding, of course. :)  It’ll be trivial to find out what happened, and fixing Nginx isn’t too difficult either, especially if you have backups in place.  But it’s still better to avoid the whole mess altogether.
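
Here’s roughly what the trap looks like from the DNS side (a made-up transcript, assuming the wildcard record from the example above) – the new host name resolves just fine, only to the wrong machine:

$ host another-web-server.example.com
another-web-server.example.com is an alias for proxy1.example.com.
proxy1.example.com has address 10.0.0.10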

To help prevent these cases, I decided to create a new safety net role.  Given a variable like:

---
# Aliased IPs is a list of hosts, which can be reached in 
# multiple ways due to DNS wildcards. Both IPv4 and IPv6 
# can be used. The hostname value is the primary hostname 
# for the IP - any other inventory hostname having any of 
# these IPs will cause a failure in the play.
aliased_ips:
  "10.0.0.10": "proxy1.example.com"
  "192.168.0.10": "proxy1.example.com"

And the following code in the role’s tasks/main.yml:

---
- debug: msg="Safety net - before IPv4"

- name: Check all IPv4 addresses against aliased IPs
  fail: msg="DNS is not configured for host '{{ inventory_hostname }}'. It resolves to '{{ aliased_ips[ item.0 ] }}'."
  when: "('{{ item[0] }}' == '{{ item[1] }}') and ('{{ inventory_hostname }}' != '{{ aliased_ips[ item.0 ] }}')"
  with_nested:
    - "{{ aliased_ips | default({}) }}"
    - "{{ ansible_all_ipv4_addresses }}"

- debug: msg="Safety net - after IPv4 and before IPv6"

- name: Check all IPv6 addresses against aliased IPs
  fail: msg="DNS is not configured for host '{{ inventory_hostname }}'. It resolves to '{{ aliased_ips[ item.0 ] }}'."
  when: "('{{ item[0] }}' == '{{ item[1] }}') and ('{{ inventory_hostname }}' != '{{ aliased_ips[ item.0 ] }}')"
  with_nested:
    - "{{ aliased_ips | default({}) }}"
    - "{{ ansible_all_ipv6_addresses }}"

- debug: msg="Safety net - after IPv6"

the safety net is in place.  The first check will connect to the remote server, get the list of all configured IPv4 addresses, and then compare each one with each IP address in the aliased_ips variable.  For every matching pair, it will check whether the remote server’s host name from the inventory file matches the host name from the aliased_ips value for the matched IP address.  If the host names match, it’ll continue.  If not, a failure in the play occurs (Ansible speak for a thrown exception).  Other tasks will continue execution for other hosts, but nothing else will be done during this play run for this particular host.

The second check will do the same but with IPv6 addresses.  You can mix and match both IPv4 and IPv6 in the same aliased_ips variable.  And Ansible is smart enough to exclude the localhost IPs too, so things shouldn’t break too much.

I’ve tested the above and it seems to work well for me.

There is a tiny issue with elegance here, though: the host name to IP mappings are already configured in the DNS zone – duplicating this configuration in the aliased_ips variable seems annoying.  Personally, I don’t have that many reverse proxies and load balancers to handle, and they don’t change too often either, so I don’t mind.  Also, there is something about relying on DNS while trying to protect against DNS misconfiguration that rubs me the wrong way.  But if you are the adventurous type, have a look at Ansible’s dig lookup, which you can use to fetch the IP addresses from the DNS server of your choice.
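
For example, something along these lines could replace the hard-coded addresses (just a sketch I haven’t tested – the dig lookup needs the dnspython library on the Ansible control machine, and proxy1.example.com is the example proxy from above):

---
# Build the aliased_ips mapping at run time, resolving the proxy's
# address via DNS instead of hard-coding it (untested sketch)
- name: Build aliased_ips from DNS
  set_fact:
    aliased_ips: "{{ { lookup('dig', 'proxy1.example.com.'): 'proxy1.example.com' } }}"

Keep in mind that if the name has more than one A record, the lookup returns them as a comma-separated string, so this sketch only really covers the single-address case.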

As always, if you see any potential issues with the above or know of a better way to solve it, please let me know.

SugarCRM, RoundCube and Request Tracker integration on a single domain

In my years of working as a system administrator I’ve done some pretty complex setups and integration solutions, but I don’t think I’ve done anything as twisted as this one recently.  The setup is part of a large and complex client project, built on their infrastructure, with quite a few requirements and a whole array of limitations.  Several systems were integrated together, but in this particular post I’m focusing primarily on SugarCRM, RoundCube, and Request Tracker.  Also, I am not going to cover the integration to its full extent – just the email-related parts.

Continue reading “SugarCRM, RoundCube and Request Tracker integration on a single domain”

WhatsApp introduces end-to-end encryption for everything

WhatsApp introduces end-to-end encryption for all communications – chats, pictures, videos, etc.  I’m sure it’ll help them get more individuals and businesses onto the network, as well as probably get the app banned in a handful of countries.

WhatsApp has always prioritized making your data and communication as secure as possible. And today, we’re proud to announce that we’ve completed a technological development that makes WhatsApp a leader in protecting your private communication: full end-to-end encryption. From now on when you and your contacts use the latest version of the app, every call you make, and every message, photo, video, file, and voice message you send, is end-to-end encrypted by default, including group chats.

The idea is simple: when you send a message, the only person who can read it is the person or group chat that you send that message to. No one can see inside that message. Not cybercriminals. Not hackers. Not oppressive regimes. Not even us. End-to-end encryption helps make communication via WhatsApp private – sort of like a face-to-face conversation.

Absolute stupidity of the include directive in /etc/sudoers, and Microsoft Azure

I’ve just spent three hours (!!!) trying to troubleshoot why sudo was misbehaving on a brand new CentOS 7 server.  I was doing the setup of two identical servers in parallel (for two different clients).   One server worked as expected, the other one didn’t.

The thing I was trying to do was trivial – allow users in the wheel group to execute sudo commands without a password.  I’ve done it a gadzillion times in the past, and probably at least a dozen times just this week alone.  Here’s what’s needed:

  1. Add the user to the wheel group.
  2. Edit the /etc/sudoers file to uncomment the line (as in: remove the hash comment character from the beginning of the line, as shown in the sketch right after this list): # %wheel ALL=(ALL) NOPASSWD: ALL
  3. Enjoy!
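
In shell terms, the whole thing boils down to something like this (a sketch – the username is a placeholder, and the sudoers line is best edited via visudo):

# add the user to the wheel group
usermod -aG wheel someuser

# then, in 'visudo', make sure this line is uncommented:
# %wheel        ALL=(ALL)       NOPASSWD: ALL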

Imagine my surprise when it only worked on one server and not on the other.  I dug deep and wide.  Took a break.  And dug again.  Then I summoned the great troubleshooting powers of my brother.  But even those didn’t help.

Lots of logging, diff-ing, strace-ing, swearing and hair pulling later, the problem was found and fixed.  The issue was due to two separate reasons.

Reason #1: /etc/sudoers syntax uses the hash character (#) for two different purposes.

  1. For comments, which there are plenty of in the file.
  2. For the “#include” and “#includedir” directives, which include other files into the configuration.

The default /etc/sudoers file is full of lengthy comments.  Just to give you an idea:

(root@host ~)# wc -l /etc/sudoers
118 /etc/sudoers
(root@host ~)# grep -v '^#' /etc/sudoers | grep -v '^$' | wc -l
12

Yup.  118 lines in total vs. 12 lines of configuration (comments and empty lines removed). Like with banner blindness, this causes comment blindness.  Especially towards the end of the file.  Especially if you’ve seen this file a billion times before.

And that’s where the problem starts.  Right at the bottom of the file, there are these two lines:

## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

Interesting, right? Usually there is nothing in the /etc/sudoers.d/ folder on a brand new CentOS box. But even if there were something, by now you’d assume that the include of the folder is commented out – much like that wheel group configuration I mentioned earlier. I found it by accident, while reading the sudoers(5) manual page, trying to find out whether there are any other locations or defaults for included configurations. About 600 lines into the manual, there is this:

To include /etc/sudoers.local from within /etc/sudoers we 
would use the following line in /etc/sudoers:

    #include /etc/sudoers.local

When sudo reaches this line it will suspend processing of 
the current file (/etc/sudoers) and switch to 
/etc/sudoers.local. 

So that comment is not a comment at all, but an include of the folder.  That’s the first part of the problem.

Reason #2: Windows Azure Linux Agent

As I mentioned above, the servers aren’t part of my infrastructure – they were provided by the clients.  I was basically given an IP address, a username, and a password for each server – which is usually all I need.  In most cases I don’t really care where the server is hosted or which hosting company is in use.  Turns out, I should.

The server with the problem was hosted on the Microsoft Azure cloud infrastructure.  I assumed I was working off a brand new vanilla CentOS 7 box, but in fact I wasn’t.  Microsoft adds packages to the default install.  One of the packages that it adds is the Windows Azure Linux Agent, which “rpm -qi WALinuxAgent” describes as follows:

The Windows Azure Linux Agent supports the provisioning and running of Linux VMs in the Microsoft Azure cloud. This package should be installed on Linux disk images that are built to run in the Microsoft Azure environment.

Harmless, right? Well, not so much.  What I found in the /etc/sudoers.d/ folder was a little file called waagent, which contained a different sudo configuration for the user I had a problem with.

During the troubleshooting process, I created a new test user, added the account to the wheel group, and found out that it was working fine.  From there, I needed to find the differences between the two users.
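
A couple of commands that help with this kind of comparison (a sketch – the usernames are made up, and both commands need root):

# compare the effective sudo rules of the two accounts
sudo -l -U testuser   > /tmp/testuser.sudo
sudo -l -U brokenuser > /tmp/brokenuser.sudo
diff /tmp/testuser.sudo /tmp/brokenuser.sudo

# and check for drop-in files that may override /etc/sudoers
ls -l /etc/sudoers.d/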

I guess the user I was using initially was created by the client’s system administrator through the Microsoft Azure web interface.  A quick Google search brings up this page from the Azure documentation:

By default, the root user is disabled on Linux virtual machines in Azure. Users can run commands with elevated privileges by using the sudo command. However, the experience may vary depending on how the system was provisioned.

  1. SSH key and password OR password only – the virtual machine was provisioned with either a certificate (.CER file) or SSH key as well as a password, or just a user name and password. In this case sudo will prompt for the user’s password before executing the command.
  2. SSH key only – the virtual machine was provisioned with a certificate (.cer, .pem, or .pub file) or SSH key, but no password. In this case sudo will not prompt for the user’s password before executing the command.

I checked the user’s home folder and found no keys in there, so I think it was provisioned using the first option, with password only.

I think Microsoft should make it much more obvious that the system behavior might be different.  Amazon AWS provides a good example to follow.  When you log in to an Amazon Linux AMI instance, you see a message of the day (motd) banner, which looks like this:

$ ssh server.example.com
Last login: Tue Apr  5 17:25:38 2016 from 127.0.0.1

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2016.03-release-notes/

(user@server.example.com)$ 

It’s dead obvious that you are now on an Amazon EC2 machine and that you should adjust your expectations and assumptions accordingly.

Deleting the file immediately solved the problem.  To avoid similar issues in the future, the #includedir directive can be moved further up in the file and surrounded by more visible comments.  Like, maybe, an ASCII art skull or something.
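
Something along these lines, perhaps (just a sketch of the idea – skull not included):

## ================================================================
## WARNING: the "#includedir" line below is a DIRECTIVE, not a
## comment!  It pulls in every file from /etc/sudoers.d/, which may
## override anything configured in this file.
## ================================================================
#includedir /etc/sudoers.d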

ASCII skull

With that, I am off to heavy drinking and recovery… Stay sane!


Share your public keys easily with GitHub

Here’s a handy thing that I didn’t know about – you can easily share your public keys by adding them to your GitHub account and then accessing a URL of the form https://github.com/YOUR_USERNAME.keys .  What you get is a plain text response with all your public keys, ready to be inserted into the .ssh/authorized_keys file or anywhere else you want them.
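
For example, something like this (replace YOUR_USERNAME with the actual account name, and do review what comes back before trusting it):

# append the GitHub-published public keys to the current user's authorized keys
curl -s https://github.com/YOUR_USERNAME.keys >> ~/.ssh/authorized_keys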

Here’s an example of mine – https://github.com/mamchenkov.keys .  Don’t forget to configure two-factor authentication for your GitHub account for an extra layer of security.  You probably don’t want any bugger who got hold of your password inserting their own public keys into your account.

Top level domain nonsense and how it can break your stuff

Call me old school, but I really (I mean REALLY) don’t like the recent explosion of top level domains.  I understand that most good names are taken in the .com, .org, and .net zones, but do we really need all those .blue, .parts, and .yoga TLDs?

Why am I whining about all this all of a sudden?  I’ll tell you why.  Because a new top level domain – .aws – is about to be introduced, and it already broke something for me in a non-obvious manner.


I manage a few Virtual Private Clouds on Amazon AWS.  Many of these rely on a hostname naming convention (yeah, I’m familiar with the pets vs. cattle idea).  Imagine you have a few servers, separated into generic infrastructure and client segments, like so:

  • bastion.aws.example.com
  • firewall.aws.example.com
  • lb.aws.example.com
  • web.client1.example.com
  • db.client1.example.com
  • web.client2.example.com
  • db.client2.example.com
  • … and so on.

Working with such long FQDNs (fully qualified domain names) isn’t very convenient.  So you add "search example.com" to your /etc/resolv.conf file, and now you can use short hostnames like firewall.aws and web.client1.  And life is beautiful …
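
For reference, the relevant resolver bit looks roughly like this (the nameserver address is made up for the example):

# /etc/resolv.conf
search example.com
nameserver 10.0.0.2

$ ssh web.client1        # actually connects to web.client1.example.com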

… until one day, when you see the following:

user@bastion.aws$> ssh firewall.aws
Permission denied (publickey).

And that’s when your heart misses a beat, the world freezes, and you go: “WTF?”.  All kinds of thoughts are rushing through your head.  Is it a typo?  Am I in the right place? Did the server get compromised?  How’s that for a little panic …

Trying a few things here and there, you manage to get into the server from somewhere else.  You are very careful.  You are looking around for any traces of a break-in, but you see nothing.  You dig through the logs, both on the server and off it.  Still nothing.  So you dive into all those logwatch and cron messages in your Trash, which you were automatically deleting, because things had been working fine for so long.  There!  You find that cron was complaining that the backup script couldn’t get into this machine.  Uh-oh.  This has been happening for a few days now.  A black cloud of combined worry for the compromised machine and the outdated backup kills the sunlight in your life.  Dammit!

Take a break to calm down.  Try to think clearly.  Don’t panic.  Stop assuming things, and start troubleshooting.

A few minutes later, you establish that the problem is not limited to that particular machine.  All your .aws hosts share this headache.  A few more minutes later, you learn that ‘ssh firewall.aws.example.com’ works fine, while ‘ssh firewall.aws’ still doesn’t.

That points toward a hostname resolution issue.   With that, it takes only a few more moments to see the following:

user@bastion.aws$> host firewall.aws
firewall.aws has address 127.0.53.53
firewall.aws mail is handled by 10 your-dns-needs-immediate-attention.aws.

Say what?  That’s not at all what I expected.  And what is it that I need to fix with my DNS?  A Google search brings up this beauty:

This is problably because the .dev and .local are now valid top level extensions.

Really? Who’s the genius behind that?  I thought people chose those specifically to make them internal.  So is there an .aws top level extension now too?  You bet there is!

Solution?  Well, as far as I am concerned, from this day onward I don’t trust brief hostnames anymore.  It’s FQDN or nothing.
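
If you still want short commands without trusting resolver search domains, one option is to pin the names in ~/.ssh/config – a rough sketch:

# ~/.ssh/config
Host firewall.aws
    HostName firewall.aws.example.com

That way "ssh firewall.aws" always expands to the full name, no matter what DNS search tricks are in play.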