10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

There are a few variations of the list – with and without the swear words and such.  I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails“.  Weird!

(found via the link from this article)

Google Infrastructure Security Design Overview

If you ever wanted to know what Google does to maintain its high level of security, here’s your chance. Google Infrastructure Security Design Overview provides quite a bit of information on the subject.

This document gives an overview of how security is designed into Google’s technical infrastructure. This global scale infrastructure is designed to provide security through the entire information processing lifecycle at Google. This infrastructure provides secure deployment of services, secure storage of data with end user privacy safeguards, secure communications between services, secure and private communication with customers over the internet, and safe operation by administrators.

Amazon Linux AMI : Let’s Encrypt : ImportError: No module named interface

Let’s Encrypt has only experimental support for the Amazon Linux AMI, so it’s kind of expected to have issues once in a while.   Here’s one I came across today:

# /opt/letsencrypt/certbot-auto renew
Creating virtual environment...
Installing Python packages...
Installation succeeded.
Traceback (most recent call last):
File "/root/.local/share/letsencrypt/bin/letsencrypt", line 7, in <module>
from certbot.main import main
File "/root/.local/share/letsencrypt/local/lib/python2.7/dist-packages/certbot/main.py", line 12, in <module>
import zope.component
File "/root/.local/share/letsencrypt/local/lib/python2.7/dist-packages/zope/component/__init__.py", line 16, in <module>
from zope.interface import Interface
ImportError: No module named interface

My first though was to install the system updates. It looks like something is off in the Python-land. But even after the “yum update” was done, the issue was still there. A quick Google search later, thanks to the this GitHub issue and this comment, the solution is the following:

pip install pip --upgrade
pip install virtualenv --upgrade
virtualenv -p /usr/bin/python27 venv27

Running the renewal of the certificates works as expected after this.

P.S.: I wish we had fewer package and dependency managers in the world…

Amazon RDS and Amazon Virtual Private Cloud (VPC)

Yesterday I helped a friend to figure out why he couldn’t connect to his Amazon RDS database inside the Amazon VPC (Virtual Private Cloud).  It was the second time someone asked me to help with the Amazon Web Services (AWS), and it was the first time I was actually helpful.  Yey!

While I do use quite a few of the Amazon Web Services, I don’t have any experience with the Amazon RDS yet, as I’m managing my own MySQL instances.  It was interesting to get my toes wet in the troubleshooting.

Here are a few things I’ve learned in the process.

Lesson #1: Amazon supports two different ways of accessing the RDS service.  Make sure you know which one you are using and act accordingly.

gs-vpc-network

If you run an Amazon RDS instance in the VPC, you’ll have to setup your networking and security access properly.  This page – Connecting to a DB Instance Running the MySQL Database Engine – will only be useful once everything else is taken care of.  It’s not your first and only manual to visit.

Lesson #2 (sort of obvious): Make sure that both your Network ACL and Security Groups allow all the necessary traffic in and out.  Double-check the IP addresses in the rules.  Make sure you are not using a proxy server, when looking up your external IP address on WhatIsMyIP.com or similar.

Lesson #3: Do not use ICMP traffic (ping and such) as a troubleshooting tool.  It looks like Amazon RDS won’t be ping-able even if you allow it in your firewalls.  Try with “telnet your-rds-end-point-server your-rds-end-point-port” (example: “telnet 1.2.3.4 3306” or with a real database client, like the command-line MySQL one.

Lesson #4: Make sure your routing is setup properly.  Check that the subnet in which your RDS instance resides has the correct routing table attached to it, and that the routing table has the default gateway (0.0.0.0/0) route configured to either the Internet Gateway or to some sort of NAT.  Chances are your subnet is only dealing with private IP range and has no way of sending traffic outside.

Lesson #5: When confused, disoriented, and stuck, assume it’s not Amazon’s fault.  Keep calm and troubleshoot like any other remote connection issue.  Double-check your assumptions.

There’s probably lesson 6 somewhere there, about contacting support or something along those lines.  But in this particular case it didn’t get to that.  Amazon AWS support is excellent though.  I had to deal with those guys twice in the last two-something years, and they were awesome.

PHP : Microsoft Office 365 and Active Directory

Disclaimer: I am not the biggest fan of Microsoft.  On the contrary.  I keep running into situations, where Microsoft technologies are a constant source of pain.  If that annoys you, please stop reading this post now and go away.  I don’t care.  You’ve been warned.

A few recent projects that I’ve been working on in the office required integration with Microsoft Office 365.  Office 365 is a new kid on the block as far as I am concerned, so I had no experience of integrating with these services.

The first look at what needs to be done resulted in a heavy drinking session and a mild depression.  Here are a few links to get you started on that path, if you are interested:

We’ve discussed the options with the client and decided to go a different route – limit the integration to the single sign-on (SSO) only, and use their Active Directory server (I’m not sure about the exact setup on the client side, but I think they use Active Directory Federation Services to have a local server in the office synchronized with the Office 365 directory).

Exposing the Active Directory server to the entire Internet is not the smartest idea, so we had to wrap this all into a virtual private network (VPN).  You can read my blog post on how to setup the CentOS 7 server as an automated VPN client.

Once the Active Directory was established, PHP LDAP module was very useful for avoiding any low-level programming (sockets and such).  With a bit of Google searching and StackOverflow reading, we managed to figure out the magic combination of parameters for ldap_connect(), ldap_set_option(), and ldap_search().

It took longer than expected, but some of it was due to the non-standard configuration and permissions on the client side.  Anyways, it worked, which were the good news.

The client accepted the implementation and we could just close the chapter, have another drink, and forget about this nightmare.  But something was bothering me about it, so I was thinking the heavy thoughts at the back of my mind.

The things that bother me about this implementation are the following:

  • Although it works, it’s a rather raw implementation, with very limited flexibility (filters, multiple servers, etc).
  • The code is difficult to test, due to the specifics of the AD setup and the network access limitations.
  • There is a lack of elegance to the solution.  Working code is good, but I like things to be beautiful too.  As much as possible at least.

So, I was keeping an eye open and I think today I came across a couple of links that can help make things better:

  1. adLDAP PHP library, which provides LDAP authentication and integration with Active Directory.  I don’t know how I missed it so far, but I think now things will be much easier and cleaner.
  2. Search Filter Syntax documentation on MSDN.
  3. This Reddit thread.  Yes, a lot of the things I’ve learned today are linked from it.  But it’ll be much easier for me to find all this information in my own blog, next time I’ll have to deal with Microsoft again.
  4. Public-facing LDAP server thanks to Georgia Institute of Technology, for testing connection and simple queries.

Armed with this new knowledge, I’m sure the current working solution can be improved a lot – simplified with fewer lines of code, based on the much more robust and tested code base, and given a basic test script to make sure the code works somewhere else, outside of a particular client’s setup.

I wish I came across that all much earlier.