Language Detection Library for PHP

patrickschur/language-detection is a language detection library for PHP that detects the language of a given text string.  Now, in a bit more detail:

This library can detect the language of a given text string. It can parse given training text in many different idioms into a sequence of N-grams and builds a database file in JSON format to be used in the detection phase. Then it can take a given text and detect its language using the database previously generated in the training phase. The library comes with text samples used for training and detecting text in 106 languages.
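To make the N-gram idea a bit more concrete, here is a minimal shell sketch of the very first step: cutting a string into overlapping character N-grams. The ngrams helper is my own illustration and not part of the library, and it is ASCII-only, since awk implementations differ in how they count multibyte characters.

```shell
# Hypothetical helper (not from the library): print every character
# N-gram of a string, one per line.
# Usage: ngrams "text" 3
# ASCII-only sketch; some awk implementations count bytes, not characters.
ngrams() {
    printf '%s\n' "$1" | awk -v n="$2" '{
        # Slide a window of width n across the line.
        for (i = 1; i + n - 1 <= length($0); i++)
            print substr($0, i, n)
    }'
}
```

Running ngrams "hello" 3 prints hel, ell and llo. Piping the output through sort | uniq -c | sort -rn turns it into a frequency-ranked list, which is roughly the kind of per-language profile a detector like this could compare texts against.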

I tried it briefly with a few languages in which I can manage a phrase or two, and it works better with some than with others.  Greek was good, Russian not so much.

Hopefully, the sample data used for training will improve over time, but it’s definitely a good start.

Via this blog post.


Morphos – morphological solution in PHP for English and Russian

If you ever had to deal with morphology in English, you probably found one or two libraries to help you out.  But if you had to do that for Russian, then I’m sure you are missing a few hairs, and the ones that you still have are grayer than they used to be.  I’ve got some good news for you though: now there is Morphos (GitHub repository).

Morphos is a morphological solution written completely in the PHP language. Supports Russian and English. Provides classes to decline First/Middle/Last names/nouns and generate cardinal numerals.

Just look at this beauty!

// $dec is a Morphos declension object and $user_name holds the name to decline
var_dump($dec->getForms($user_name, $dec->detectGender($user_name)));
/* Will produce something like
  array(6) {
    string(8) "Иван"
    string(10) "Ивана"
    string(10) "Ивану"
    string(10) "Ивана"
    string(12) "Иваном"
    string(15) "об Иване"
  }
*/

Just this alone can make user interfaces and emails so much better.  But there is more to it than that.

Global email in Gmail. Bad idea.

Gmail blog reports that Google is working on a more global email.  The first step is internationalized email addresses, like this:


As someone who worked in international environments for years, I strongly dislike this idea.  There is a whole array of issues related to this: readability of the email address (yes, read it!), display issues (do you have the font with all the necessary characters?), writing an email address (searching through the address book, for example), or even copy-pasting an email address (have you tried copy-pasting English strings from Hebrew or Arabic documents?  Now you’ll be copy-pasting international email addresses from English documents – so much fun!).  On top of that, there are all the usual things related to SPAM filters, trust issues (is this a company, free email hosting, or a personal domain?), etc.  Can you spell out this email address over the phone?  How about typing it on a mobile phone?  Do you even know which language it is in?

Using non-accented Latin characters is a pain for all those people who don’t speak English.  But it has worked nonetheless for the last few decades.  Now we are heading towards a future where that pain won’t be limited to those who don’t read English but will extend to everyone, because you can’t really learn all the languages of the world or control which languages the email addresses reaching your inbox are written in.  Remember that just because an email address is in a given language, it doesn’t mean that the content of the email is in the same language.

On top of that, we’ve tried that already with the international URLs.  See how well that worked out.  Yeah, some people sure use them.  But try copy-pasting this URL around and I guarantee you’ll end up with a whole bunch of long and cumbersome escaped strings.  The same or similar fate will hit the emails…



The word Mamihlapinatapai (sometimes spelled mamihlapinatapei) is derived from the Yaghan language of Tierra del Fuego, listed in The Guinness Book of World Records as the “most succinct word”, and is considered one of the hardest words to translate. It refers to “a look shared by two people, each wishing that the other will offer something that they both desire but are unwilling to suggest or offer themselves.” It is also cited in books and articles on game theory associated with the volunteer’s dilemma.

Managing gettext translations on the command line

I am currently working on a rather multilingual project in the office.  And, as always, we tried a few alternatives before ending up with gettext again.  For those of you who don’t know, gettext is the de facto standard for managing language translations in software, especially when it comes to messages and user interface elements.  It’s a nice, powerful system, but it’s a bit awkward when it comes to web development.

Anyway, we started using it in a bit of a rush, without doing all the necessary planning, and quite soon ended up in a bit of a mess.  Different people used different editors to update translations.  And each person’s environment was set up in a different way.   All that made its way into the PO files that hold translations.  Moreover, we didn’t really define the procedure for updating translations.  That became a bigger problem when we realized that Arabic had only 50 translated strings, while English had 220, and Chinese 350.  All languages were supposed to have exactly the same number of strings, even if the actual translations were missing.
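A quick way to spot this kind of drift is to count the message ids in each PO file and compare the numbers. Below is a rough sketch with plain grep; count_msgids and report_counts are hypothetical helper names, and the sketch assumes that every msgid fits on a single line (multi-line msgids, which start with an empty msgid "" line, would be missed). For the real thing, msgfmt --statistics reports translated and untranslated counts properly.

```shell
# Hypothetical helper: count the message ids in a PO file.
# Assumes single-line msgids; the header entry (msgid "") is skipped
# because the pattern requires at least one character between the quotes.
count_msgids() {
    grep -c '^msgid "..*"$' "$1" || true
}

# Print the count for every PO file given as an argument,
# so that mismatches between languages stand out.
report_counts() {
    for po in "$@"; do
        printf '%s: %s strings\n' "$po" "$(count_msgids "$po")"
    done
}
```

Something like report_counts *.po would then print one line per language, and any file with a different number is the one that needs attention.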

So today I had to rethink and redefine how we do it.  First of all, I had to figure out and try the process outside of the project.  It took me a good couple of hours to brush up on my gettext knowledge and find some useful documentation online.  Here is a very helpful article that got me started.

After reading the article, a few manuals and playing with the actual commands, I decided on the following:

  1. The source of all translations will be a single POT file.  This file will be completely dropped and regenerated every time any strings are updated in the source code.
  2. Each language will have a PO file of its own. However, the strings for the language won’t be extracted from the source code, but from the common POT file.
  3. All editors will use the current project folder as the primary path.  In other words, “.” instead of the full path to “/var/www/foobar”.  This will make all file references in PO/POT files relative to the project folder, ignoring the specifics of each contributor’s setup.
  4. Updating language template files (PO) and building of MO files will be a part of the project build/deploy script, to make sure everything stays as up to date as possible.
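The third point is easy to verify mechanically. Source references in PO/POT files live in comment lines that start with "#:", so a file that still carries somebody's local setup will have references beginning with a slash. Here is a small sketch of such a check; has_absolute_refs is a hypothetical name of mine, and the pattern is only a heuristic.

```shell
# Hypothetical check: succeed (exit status 0) if a PO/POT file contains
# a source reference that starts with an absolute path, for example
#   #: /var/www/foobar/index.php:12
# References such as "#: ./index.php:12" do not match.
has_absolute_refs() {
    grep -q '^#:.* /' "$1"
}
```

In a build script this could be used as: if has_absolute_refs ru_RU.po; then echo "absolute paths found"; fi.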

Now for the actual code.   Here is the shell script that does the job. (Here is a link to the Gist, just in case I update it in the future.)


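#!/bin/bash

# DOMAIN, POT and SOURCES are placeholder values; adjust them for your project.
DOMAIN="project_name"
POT="$DOMAIN.pot"
LANGS="en_US ru_RU"
SOURCES="*.php"

# Create template
echo "Creating POT"
rm -f "$POT"
xgettext \
 --copyright-holder="2012 My Company Ltd" \
 --package-name="Project Name" \
 --package-version="1.0" \
 --msgid-bugs-address="" \
 --language=PHP \
 --sort-output \
 --keyword=__ \
 --keyword=_e \
 --from-code=UTF-8 \
 --output="$POT" \
 --default-domain="$DOMAIN" \
 $SOURCES

# Create languages
# (the loop variable is LOCALE rather than LANG, so that the LANG
# environment variable seen by the gettext tools is left alone)
for LOCALE in $LANGS
do
 if [ ! -e "$LOCALE.po" ]
 then
  echo "Creating language file for $LOCALE"
  msginit --no-translator --locale="$LOCALE.UTF-8" --output-file="$LOCALE.po" --input="$POT"
 fi

 echo "Updating language file for $LOCALE from $POT"
 msgmerge --sort-output --update --backup=off "$LOCALE.po" "$POT"

 echo "Converting $LOCALE.po to $LOCALE.mo"
 msgfmt --check --verbose --output-file="$LOCALE.mo" "$LOCALE.po"
done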

Now, all you need to do is run the script once to get the default POT file and a PO file for every language.  You can edit the PO files with translations as much as you want.  Then simply run the script again and it will update the generated MO files.  No parameters, no manuals, no nothing.  If you need to add another language, just put the appropriate locale in the $LANGS variable and run the script again.  You are good to go.


myGengo – human translation service that scales

Via this GigaOm blog post I came across an interesting service – myGengo.  I’ve had plenty of projects that dealt with multi-lingual issues, and professional, punctual translations were always a pain in the process.  So it is nice to see a company that uses, in my opinion, a very correct approach to the problem.

Right now, the translation market has two main segments: a high-end market dominated by full-time in-house translators, and a low-end market dominated by Google Translate. myGengo’s service aims to occupy the space in between the two markets by offering “human translation services at scale.”

Essentially, myGengo is like an oDesk built specifically for translation services. myGengo has assembled a group of more than 3,000 translators worldwide who work on a freelance basis through myGengo’s own dedicated software program. myGengo serves clients directly, and also has an API to let other startups include myGengo’s translation services in their apps. myGengo says it is targeted at people and businesses who occasionally need high-quality, fast translation services, but aren’t in the market to hire an in-house translator for the job.

0.5 USD cents per word, 1 to 16 hours per page (depending on the complexity of the document), human translation with pre-tested personnel, API integration – it sounds almost like a dream.  Of course, for now they only support a dozen or so languages, but given that they just received $5.25 million in Series A funding, I expect the service to expand quite a bit in the near future.

Smile with “Рождеством Христовым”

A couple of days ago DailyPost suggested the following topic for a blog post: Share something that makes you smile.  I wanted to share something, but so many things make me smile that it’s hard to choose.  Today though I came across something that made me smile, and even laugh.  It’ll take me a bit to explain, so please bear with me.  And if you choose not to, here is an entertaining and short explanation of “bear with me” versus “bare with me”.

Anyway, here we go with the facts:

  1. It’s just after midnight on January 7th.
  2. Russia, as well as some other countries, celebrates Christmas on January 7th, and not on December 25th.  Wikipedia explains why.
  3. “Merry Christmas” in Russian is “С Рождеством Христовым”.

So, what we have right now is a lot of Russian-speaking people sending “С Рождеством Христовым” via any means possible to a lot of other people.  One of those means is Twitter.  One of many Twitter features is Trending Topics (aka TT).  This is an automatically generated list of the most common phrases used across Twitter in some recent period of time (like an hour or two).  And like so many other automated features, this one has its side effects.

Firstly, it seems that it doesn’t much care about the language or alphabet.  It grabs any frequently used phrase in any language or any alphabet, puts it in the list of trending topics, and shows it to every user, no matter what their location or preferred language is.

Secondly, it seems that it tries to minimize the phrase by removing very short words.  Like those consisting of only one or two characters.
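I have no idea how Twitter actually does this, but the effect I am guessing at is easy to reproduce. The sketch below is purely my own illustration, with a hypothetical strip_short_words helper that drops every word of one or two characters from a phrase.

```shell
# Hypothetical illustration (not Twitter's actual algorithm): drop every
# word of one or two characters from a space-separated phrase.
# Note: depending on the shell and locale, ${#word} counts characters or
# bytes, so short non-ASCII words may be treated differently.
strip_short_words() {
    printf '%s\n' "$1" | tr ' ' '\n' | while IFS= read -r word; do
        if [ "${#word}" -gt 2 ]; then
            printf '%s\n' "$word"
        fi
    done | paste -sd ' ' -
}
```

Feeding it the full greeting, strip_short_words 'С Рождеством Христовым', prints just Рождеством Христовым.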

So what we have by now is “Рождеством Христовым”, and not “С Рождеством Христовым”.  And that phrase is a number one trending topic, shown to all Twitter users everywhere.  Here is a screenshot.

Here starts the fun.  Most people who see this have no idea what it is all about.  Many of those who try to find out get confused by the incorrect spelling and by the fact that Christmas is already over for most of the world.  That I find funny.

But that’s not all.  Since the phrase went up to the trending topics, it got a lot of special attention, much of it humorous.  Some people started spreading rumors.  For example, that “Рождеством Христовым” is the name of the new Russian nuclear bomb.  Some others started using the phrase in famous quotes.  For example, “I love the smell OF Рождеством Христовым in the morning!” (the original quote talks about napalm and is from the movie “Apocalypse Now”).  That I find hilarious.  You can have a look yourself at everything that has been tweeted with this phrase.

And even that is not all.  Twitter has been known for having a hard time during activity spikes.  Today is just one such spike.  So Twitter is unstable, falling over the edge.  And when it does so, it shows the famous Fail Whale.

This is cute and worth a smile, but there is still more to the story.  The meaning of the Fail Whale varies between people.  Mashable once published an interview with the designer of the image.  While I know the background of this image, I can’t help making a different association.  The one that Denis Lebel mentioned in a comment on that interview: the story of the Sperm Whale from The Hitchhiker’s Guide to the Galaxy.

It is important to note that suddenly, and against all probability, a Sperm Whale had been called into existence, several miles above the surface of an alien planet and since this is not a naturally tenable position for a whale, this innocent creature had very little time to come to terms with its identity.

Innocent “Merry Christmas” wishes, weird Cyrillic letters shown to the whole world, rumors of nuclear war from Russia, word play with famous quotes, Twitter outages, and flying whales – I find the mix hilarious.  I hope you do too.

P.S.: To all those of you celebrating – Merry Christmas and С Рождеством Христовым.

Simpler Google Talk translations?

Google has recently added Gtalk bots that can do translations between various languages, mostly those available in Google Translate.  While I’m all for helping people understand each other better (even though there are certain complaints regarding the quality of translation), I think this functionality could have been implemented more simply.

Disclaimer: I haven’t tried it out myself, I’ve only read about it and seen the screenshots.

The problem that I see with the implementation is that it is one-way.  The bots are named fr2en and en2fr.  Which means that in order to keep up with a conversation in a language foreign to you, you’ll need to have two bots nearby, not one.  Why?  Because if you ask a person a question in their language, they will likely reply in the same language.  So you will need to translate both to and from the language.  I think this should have been done with one bot, not two.

Vacation vs. vocation

My co-worker and I were composing an email today. He was writing and I was watching over his shoulder. When I pointed out to him that he wanted to write “vacation” instead of “vocation”, he argued that if the word was wrong, the spellchecker would have underlined it in red. Since I was 99.9% sure that I was right, I asked him to double check.

It turned out that both “vacation” and “vocation” are legitimate words. But what surprised me was that their meanings were almost opposite.

“Vacation” has to do with resting and spending the time nicely. “Vocation” has to do with hard work. If you don’t believe me, check the definitions in the dictionary. Here are the words in : vacation and vocation.

P.S.: And I was right.

Dictionary plugins for Mozilla Firefox

Being a non-native English speaker, I fairly often need to look up the translation of some word in the dictionary. Instead of installing translation software on my computer or visiting one of the online translators every time such a need arises, I chose to use an extension to Mozilla Firefox.

Until now I was using the DictionarySearch extension, which can be configured to look words up in several different dictionaries, encyclopedias, etc. I was mostly happy with the extension, but felt that it could be simplified. I didn’t need all those configurations, choices and such. All I wanted was to look up the translation of an English or Russian word in Yandex Lingvo.

Today I came across an extension which does exactly that. Lingvo Online! for Mozilla Firefox is a very small and simple extension which does exactly what I want. It adds a context menu which allows quick lookups of selected words.