10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

There are a few variations of the list – with and without the swear words and such.  I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails”.  Weird!
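If you want to check positions like this yourself, here is a minimal sketch.  It assumes only that the list is a sequence of words with the most frequent first; the function name and the tiny sample list are made up for illustration and are not taken from the repository.

```python
from typing import Iterable, Optional


def rank_of(words: Iterable[str], target: str) -> Optional[int]:
    """Return the 1-based position of target in a frequency-sorted list, or None."""
    wanted = target.strip().lower()
    for position, word in enumerate(words, start=1):
        # strip() lets the same function accept lines read straight from a file
        if word.strip().lower() == wanted:
            return position
    return None


# Illustrative sample only; the real list has 10,000 entries.
sample = ["the", "of", "and", "emails", "cyprus"]
print(rank_of(sample, "cyprus"))
```

With the real list you would pass an open file handle (one word per line) instead of the sample.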

(found via the link from this article)

Google Translate tip for Google Chrome

Here is something that I thought of today, played with, and found quite useful – integration of Google Translate with Google Chrome via the search engine configuration.  Of course, I know that there are addons for Google Chrome to integrate Google Translate.  Of course, I know that Google Chrome comes with certain integration out of the box.  But what I need is something else.  Once in a while, when I write an email or a blog post or something like that, I forget a word in English that I know in Russian, or the other way around.  I usually open a new tab, go to Google Translate, and type the word in faster than I can think of a better way to solve the problem.  It’s a completely automated process for me.  My fingers know how to do it.  Plus it’s all so fast because I do it from the keyboard with shortcuts, so even if I had some addon installed with a button in the toolbar, I’d need to reach for the mouse, which would slow me down.

So, here is what I did.  I went to Options->Basics->Default Search->Manage.  Of course, I didn’t want to change my default search engine from Google to anything.  Instead I wanted to add a new search engine.  See the above screenshot.  I named the search engine “Google Translate (English->Russian)” to avoid ambiguity when I add more search engines for translations between other languages.  I assigned the keyword “en,ru”, which is what I’ll have to type in the address bar for this search engine to kick in.  And I configured the search URL.  Nothing fancy.

Now, whenever I type “en,ru” in the browser address bar, Google Chrome switches from generic completion to a search engine, where I just have to type the word that I want translated and hit Enter.  Again, see the screenshot above for how the address bar looks.
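To make the mechanics concrete, here is a rough sketch of what the browser does with the search URL: it URL-encodes whatever you typed and drops it in place of “%s”.  The template below is an assumption on my part (the exact URL is only visible in the screenshot), and the function name is hypothetical.

```python
from urllib.parse import quote

# Assumed template for English->Russian; not copied from the original setup.
TRANSLATE_TEMPLATE = "https://translate.google.com/?sl=en&tl=ru&text=%s"


def expand_search_url(template: str, query: str) -> str:
    """Substitute the URL-encoded query for the %s placeholder."""
    return template.replace("%s", quote(query, safe=""))


print(expand_search_url(TRANSLATE_TEMPLATE, "good morning"))
```

Swapping the `sl` and `tl` values in the assumed template would give the Russian->English engine.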

In exactly the same way I can add more search engines to translate between different languages.  It’s even possible to use “auto” as the source language for Google Translate to figure out in which language the original word or phrase is.  And, of course, you don’t have to limit yourself to Google Translate search engines only.  I have search engines defined for PHP functions lookup, Wikipedia and IMDb searches, and more.  The trick is to find the search URL by performing the actual search on the site that you want to add, and then replace the search query with “%s”.  That’s all. Enjoy!
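The “%s” trick can be sketched in a few lines.  This is only an illustration: the helper name is made up, and it assumes that the query shows up in the URL in one of the common encodings.  Pick a distinctive test query so that it doesn’t accidentally match another part of the URL.

```python
from urllib.parse import quote, quote_plus


def to_search_template(example_url: str, query: str) -> str:
    """Replace the query, as it appears in example_url, with the %s placeholder."""
    # Sites encode spaces differently ("+" or "%20"), so try each form in turn.
    for encoded in (quote_plus(query), quote(query, safe=""), query):
        if encoded in example_url:
            return example_url.replace(encoded, "%s")
    raise ValueError("query not found in the example URL")


# URL copied from the address bar after searching Wikipedia for "machine translation".
url = "https://en.wikipedia.org/w/index.php?search=machine+translation"
print(to_search_template(url, "machine translation"))
```

The printed template is what goes into the search URL field of the new search engine.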

Google Docs, Google Translate, and the Web integration

Google Docs recently got a pretty exciting feature – integration with Google Translate.  But as exciting as it is, if you combine the new functionality with some bits of the previously available functionality, you can get truly mind-blowing results.

Consider an example.  You have a feedback form on your web site.  You fanatically collect responses and study them to make your web site better.  The problem, however, is that some of the questions that you ask in your feedback form are open-ended.  Meaning that people can write whatever they want in there.  And more often than you would like, people fill in those fields in their native language.  Which might be very different from anything that you can understand.  This forces you to guess which language was used for each response, and then translate them one by one.  Needless to say, that takes a lot of time and effort.

One of the solutions to this problem can be achieved with Google Docs.  For some time now, Google Docs has had Form functionality, where you can build pretty much any form you need, and then easily integrate it with your web site.

If you don’t know how, go to Google Docs and select Form from the Create New menu.  Using a very user-friendly wizard, build the form.  When you are done, open the Form’s More Actions menu and select Embed.  This will give you a pop-up window with a little HTML code snippet.  Copy this code and paste it into your web site.

Google Docs Form

Whenever someone submits your new feedback form, the results will automatically go into a special spreadsheet in your Google Docs account.  You can see this spreadsheet by navigating to the See responses menu in your Form editing screen and selecting Spreadsheet.

Google Docs Spreadsheet

All you need to do now is add two columns to this spreadsheet for each form field that you want to translate (I tried a single-column solution, but for some reason it didn’t work for me).  One will keep the auto-detected language of the form field submission, and the other will keep the translation of the submitted field into the language that you understand.  Here is how you do it.

First, fill out and submit the feedback form yourself.  By doing so, you’ll make sure that the form is correct, all fields make sense, the HTML code is right, and that you are able to see the responses.  You’ll also have some sample data in your spreadsheet which will make your life easier.

Secondly, next to the column with the field value in the foreign language, write a formula to guess the language.  If your field value is in cell B2, add =DetectLanguage(B2) to cell C2 and =GoogleTranslate(B2, C2, "en") to cell D2.  Now, if you get some Russian text in B2, cell C2 will show “ru” and cell D2 will show the English translation of the Russian text.

The only minor issue with the resulting spreadsheet is that when you get more submissions of your feedback form, language detection and translation are not done automatically.  But since we used formulas in the cells, all we need to do to get those new submissions translated is drag the formulas down to the new table rows.

While this is not exactly perfect, it is still a substantial improvement over the manual process used earlier.
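If dragging gets tedious, the same pair of formulas can be generated as text for a whole range of rows.  This is just a sketch: the function name and its defaults (field value in column B, detected language in column C, target language “en”) are my assumptions, matching the B2/C2/D2 example above.

```python
from typing import Iterator, Tuple


def translation_formulas(first_row: int, last_row: int, source_col: str = "B",
                         lang_col: str = "C", target_lang: str = "en"
                         ) -> Iterator[Tuple[str, str]]:
    """Yield (detect, translate) formula pairs for each row in the range."""
    for row in range(first_row, last_row + 1):
        source = f"{source_col}{row}"
        detect = f"=DetectLanguage({source})"
        translate = f'=GoogleTranslate({source}, {lang_col}{row}, "{target_lang}")'
        yield detect, translate


# Tab-separated output, one spreadsheet row per line.
for detect, translate in translation_formulas(2, 4):
    print(detect, translate, sep="\t")
```

The tab-separated lines should paste into the two helper columns (C and D in the example) starting at the first row of the range.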

P.S.: And just so you know, it is entirely possible and in fact very easy to publish the spreadsheet back to your web site (for example, in a password-protected area for your site administrators to see).  Every time the spreadsheet is updated, the changes will be automatically reflected on your site as well.