10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
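
To get a feel for what "counts for all five-word sequences that appear at least 40 times" means in practice, here is a minimal sketch of n-gram counting with a frequency cut-off. It is not Google's pipeline (which runs distributed over a trillion words); the function name `count_ngrams` and its parameters are my own, and the toy sentence uses two-word sequences with a cut-off of 2 so that something survives the filter.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(tokens: Iterable[str], n: int = 5, min_count: int = 40) -> dict:
    """Count n-word sequences and keep only those seen at least min_count times."""
    words = list(tokens)
    # Slide a window of n words over the text; an empty range yields no n-grams.
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy input (made up for illustration), using bigrams and a low threshold.
text = "the cat sat on the mat and the cat sat on the rug"
bigrams = count_ngrams(text.split(), n=2, min_count=2)
print(bigrams)
```

On real data the whole thing would not fit in one process, of course, but the idea of "count, then discard the rare ones" stays the same.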

There are a few variations of the list – with and without the swear words and such. I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails”. Weird!

(found via the link from this article)

Every pub in the United Kingdom

This Reddit thread shares the map of all the pubs in the UK.  The Poke picked it up and wrapped it into some more links and quotes.  Apparently, not even all the pubs are covered:

“Nope. There’s at least 12 pubs missing from the north coast of Scotland. Thurso alone has more than 6, 2 in Bettyhill, Tongue and Melvich plus a few others all missing”, writes shaidy64

The source of the map is here, referencing 24,727 UK pubs. And I’ve only been to like, what, 3? This situation urgently needs correction.

Database Engines Ranking


DB-Engines.com provides some insight into some of the most popular database engines (312 of them to be precise).  Nothing too surprising there – Oracle and MySQL leading the charts, but it’s nice to have the numbers and trends.


There are, of course, many different ways to calculate popularity. Their method is based on how often each engine shows up in a variety of online outlets, from Google Search to social networks.

  • Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google, Bing and Yandex for this measurement. In order to count only relevant results, we are searching for <system name> together with the term database, e.g. “Oracle” and “database”.
  • General interest in the system. For this measurement, we use the frequency of searches in Google Trends.
  • Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
  • Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.
  • Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional networks LinkedIn and Upwork.
  • Relevance in social networks. We count the number of Twitter tweets, in which the system is mentioned.

It seems objective and representative enough to me.
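
DB-Engines does not spell out the exact formula in the excerpt above, so the following is only a rough sketch of how several such signals could be folded into one score: scale every signal to its maximum across systems, then average. The function `popularity_scores`, the signal names, the system names, and all the numbers are made up for illustration.

```python
def popularity_scores(signals: dict) -> dict:
    """Average each system's signals after scaling every signal to its maximum."""
    metric_names = {m for values in signals.values() for m in values}
    # Largest observed value per signal, used as the scaling denominator.
    maxima = {
        m: max(values.get(m, 0.0) for values in signals.values())
        for m in metric_names
    }
    scores = {}
    for system, values in signals.items():
        scaled = [values.get(m, 0.0) / maxima[m] for m in metric_names if maxima[m] > 0]
        scores[system] = sum(scaled) / len(scaled) if scaled else 0.0
    return scores


# Toy values, not real measurements.
toy = {
    "system_a": {"search_results": 1000, "job_offers": 50, "tweets": 200},
    "system_b": {"search_results": 500, "job_offers": 100, "tweets": 100},
}
scores = popularity_scores(toy)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Scaling first keeps a signal with huge raw numbers (search results) from drowning out a small one (job offers). A real ranking would likely weight the signals differently.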

WordPress now powers 27.1% of all websites on the Internet


WordPress Tavern states:

WordPress now powers 27.1% of all websites on the internet, up from 25% last year. While it may seem that WordPress is neatly adding 2% of the internet every year, its percentage increase fluctuates from year to year and the climb is getting more arduous with more weight to haul.

Linking to these statistics from W3Techs.  Impressive!

Those who think that WordPress is just a blogging system are far from the truth…

Top 29 books on Amazon from Hacker News comments


I came across this nice visualization of “Top 29 books ranked by unique users linking to Amazon in Hacker News comments”.

Amazon product links were extracted and counted from 8.3M comments posted on Hacker News from Oct 2006 to Oct 2015.
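
The ranking idea is simple enough to sketch: pull product IDs out of comment text and count how many distinct users linked to each one. The code below is my guess at the approach, not the visualization's actual code; the URL pattern, the function name `rank_products`, and the sample comments and product IDs are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Assumed pattern: a 10-character product ID after /dp/ or /gp/product/,
# optionally preceded by a title slug. Real Amazon URLs come in more shapes.
AMAZON_ID = re.compile(r"amazon\.com/(?:[^\s/]+/)?(?:dp|gp/product)/([A-Z0-9]{10})")


def rank_products(comments):
    """Take (user, text) pairs; return [(product_id, unique_users)], most linked first."""
    users_by_product = defaultdict(set)
    for user, text in comments:
        for product_id in AMAZON_ID.findall(text):
            # A set per product means repeat links by one user count once.
            users_by_product[product_id].add(user)
    ranked = [(pid, len(users)) for pid, users in users_by_product.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Made-up comments with placeholder product IDs.
sample = [
    ("alice", "See https://www.amazon.com/dp/B000000001 for details"),
    ("bob", "http://www.amazon.com/Some-Title/dp/B000000001/ref=x"),
    ("alice", "again: https://www.amazon.com/dp/B000000001"),
    ("carol", "https://www.amazon.com/gp/product/B000000002"),
]
ranking = rank_products(sample)
print(ranking)
```

Counting unique users rather than raw links is what keeps one enthusiastic commenter from pushing a book to the top.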

Most of these are, not surprisingly, on programming and design.  A few are on startups and business.  Some are on how to have a good life.  Which is a bit weird.

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Here is an interesting bit of research – do people prefer tabs or spaces when programming the most popular languages?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.
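
For a single file, the question boils down to counting how many lines start with a tab and how many start with a space. Here is a small sketch of that vote, assuming one vote per file; it is not the BigQuery SQL from the article, and the names `indent_style` and `tally` as well as the sample files are mine.

```python
from collections import Counter


def indent_style(source: str) -> str:
    """Classify a file as 'tabs', 'spaces', or 'undecided' by its indented lines."""
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
    # No indented lines at all, or a tie, gives no clear winner.
    if counts["tabs"] == counts["spaces"]:
        return "undecided"
    return "tabs" if counts["tabs"] > counts["spaces"] else "spaces"


def tally(files):
    """Take (language, source) pairs; return {language: Counter of file votes}."""
    votes = {}
    for language, source in files:
        votes.setdefault(language, Counter())[indent_style(source)] += 1
    return votes


# Made-up sample files.
files = [
    ("php", "<?php\n    echo 1;\n    echo 2;\n"),
    ("go", "func main() {\n\tprintln(1)\n}\n"),
    ("php", "\tfoo();\n"),
]
votes = tally(files)
print(votes)
```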

The results are not very surprising and somewhat disappointing (for all of us, tab fans):

[Chart: tabs vs. spaces]

As far as PHP goes, I’m sure the choice of spaces has to do with the PSR-2 coding style guide, which states:

Code MUST use 4 spaces for indenting, not tabs.

On a more technical note, I think this is also related to the explosion of editors and IDEs in recent years, which, as good as they are, aren’t as good as Vim. Vim allows for a very flexible configuration, where your code can be formatted and re-formatted any way you like, making tabs or spaces a non-issue.

Regardless of the results of the study, what’s more interesting is the method and tools used. I’ve had my eye on Google BigQuery for a while now, but I’m too busy these days to give it a try. The article gives a few insights into how awesome the tool is. 1.6 terabytes of data processed in 864.6 seconds:

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].


Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.
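
To put those two figures side by side, here is the back-of-the-envelope throughput, assuming decimal units (1 TB = 1,000 GB) and treating the quoted times as wall-clock seconds. The helper name `throughput_gb_per_s` is mine.

```python
def throughput_gb_per_s(gigabytes: float, seconds: float) -> float:
    """Return the average processing rate in gigabytes per second."""
    return gigabytes / seconds


join_rate = round(throughput_gb_per_s(1600, 864.6), 2)  # 1.6 TB join query
scan_rate = round(throughput_gb_per_s(133, 16), 2)      # 133 GB line-by-line scan
print(join_rate)  # 1.85
print(scan_rate)  # 8.31
```

So the big join still chewed through almost 2 GB every second, and the simpler scan ran more than four times faster than that.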

If you enjoyed this article, also have a look at “Analyzing GitHub issues and comments with BigQuery”, which works with a similar-sized dataset, trying to figure out how to write bug reports and pull request comments so that they get acted upon faster.

Page builders and multilingual WordPress websites

WPML.org, the web home of the WordPress Multilingual Plugin, runs this blog post about the upcoming support for WordPress page builders. Apart from the good news itself, there are some insightful results from the survey that the team did, trying to understand who uses page builders and how. The most interesting part for me was the stats on which page builder solutions people use:


At work we are primarily using Divi (when we are not building our own themes), but we’ve also done a few sites with Enfold.  I’ve also seen Avada in the wild.  But I can’t tell you which ones are better, because when it comes to using page builders, I’m mostly not involved.  These tools are so awesome these days that they can be easily used by a non-technical person.  Which is exactly what we do ;)

Analyzing 2+ Million Travis Builds

Travis CI – a continuous integration service – shares some of the insights from over 2,000,000 builds they’ve run, in a blog post called “What We Learned about Continuous Integration from Analyzing 2+ Million Travis Builds”. For me, the most valuable bit is about the reasons for failing builds, which clearly indicates the need for and the importance of unit, integration, and UI tests:


Around 20% of all builds fail. There is a variation based on the language – for some programming languages, testing is part of the process and culture – for others it’s an acquired tool. Once you do implement testing, most of your builds will pass. You’ll cancel very few. But about 20% will fail due to failed unit tests, configurations, or environment setups. Catching these 20% before they hit production is super important.

GitHub private repository contributions on your profile

The GitHub blog says that from now on your profile can include your private repository contributions.

[Screenshot: GitHub private repository contributions setting]

When enabled, these can make quite a difference in the number of the green boxes, showing your GitHub activity.  Here’s an example from mine.  Before enabling those, showing only Open Source contributions:

[Screenshot: GitHub contributions for mamchenkov, before]

And here’s one after, including private repository contributions:

[Screenshot: GitHub contributions for mamchenkov, after]

Indeed, it is a more accurate representation of my GitHub activity. Given that these days most of my private repository activity happens on Bitbucket and not on GitHub, the size of the difference is quite surprising.

Common files in PHP packages

Jordi Boggiano looks at some common files in PHP packages, using Packagist as a data source.  There are some interesting metrics in there.  For example:

  • 58% of packages include a src/ directory and 5% a lib/ one. That’s surprisingly low to me, that means a lot have the code simply in the root folder.
  • 4% have a bin/ directory, including some sort of CLI executables.
  • 55% have a LICENSE file, that’s.. pretty disastrous but hopefully a lot of those that don’t at least indicate in the README and composer.json
  • 49% have some file or directory indicating the presence of tests (phpunit.xml & co). I am not sure if this is good or bad news to be honest, that depends on your expectations.
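
Percentages like these are easy to reproduce on your own checkouts. The sketch below is not Jordi's method (he worked from Packagist data); it simply walks a list of local package directories and reports what share of them has each marker. The `MARKERS` table, the function name `marker_percentages`, and the choice of file names to look for are my assumptions.

```python
from pathlib import Path

# Assumed checks; the original analysis may define each marker differently.
MARKERS = {
    "src": lambda root: (root / "src").is_dir(),
    "license": lambda root: any(
        p.name.upper().startswith("LICENSE") for p in root.iterdir()
    ),
    "tests": lambda root: any(
        (root / name).exists() for name in ("phpunit.xml", "phpunit.xml.dist", "tests")
    ),
}


def marker_percentages(package_roots):
    """Return {marker: percentage of package directories where it is present}."""
    roots = [Path(r) for r in package_roots]
    if not roots:
        return {name: 0.0 for name in MARKERS}
    return {
        name: 100.0 * sum(1 for root in roots if check(root)) / len(roots)
        for name, check in MARKERS.items()
    }
```

Calling `marker_percentages(["vendor/acme/foo", "vendor/acme/bar"])` (hypothetical paths) would return a dictionary with one percentage per marker.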