Language Detection Library for PHP

patrickschur/language-detection is a language detection library for PHP that detects the language of a given text string. Now, in a bit more detail:

This library can detect the language of a given text string. It parses the provided training texts in many different languages into sequences of N-grams and builds a database file in JSON format to be used in the detection phase. It can then take a given text and detect its language using the database generated during the training phase. The library comes with text samples used for training and detecting text in 106 languages.
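Here is a minimal usage sketch of my own, based on how I remember the project’s README (so treat the exact calls as an assumption and check the actual docs), showing the detection side once the package is installed via Composer:

<?php

require __DIR__ . '/vendor/autoload.php';

use LanguageDetection\Language;

$ld = new Language();

// detect() scores the text against the bundled language models;
// close() returns the scores as a plain "language code => score" array.
$scores = $ld->detect('Καλημέρα, τι κάνεις;')->close();

var_dump($scores);

The highest-scoring keys are the most likely languages for the given text.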

I tried it briefly with a few languages that I can manage a phrase or two in, and it works better with some than with others. Greek was good, Russian not so much.

Hopefully, the sample data used for training will improve over time, but it’s definitely a good start.

Via this blog post.


Morphos – morphological solution in PHP for English and Russian

If you ever had to deal with morphology in English, you probably found one or two libraries to help you out. But if you had to do that for Russian, then I’m sure you are missing a few hairs, and the ones you still have are grayer than they used to be. I’ve got some good news for you though: now there is Morphos (GitHub repository).

Morphos is a morphological solution written completely in the PHP language. Supports Russian and English. Provides classes to decline First/Middle/Last names/nouns and generate cardinal numerals.

Just look at this beauty!

// $dec here is an instance of one of Morphos' Russian declension classes,
// and $user_name holds a Russian first name ("Иван" in the output below).
var_dump($dec->getForms($user_name, $dec->detectGender($user_name)));
/* Will produce something like
  array(6) {
    ["nominativus"]=>
    string(8) "Иван"
    ["genetivus"]=>
    string(10) "Ивана"
    ["dativus"]=>
    string(10) "Ивану"
    ["accusative"]=>
    string(10) "Ивана"
    ["ablativus"]=>
    string(12) "Иваном"
    ["praepositionalis"]=>
    string(15) "об Иване"
  }
*/

Just this alone can make user interfaces and emails so much better.  But there is more to it than that.

Global email in Gmail. Bad idea.

The Gmail blog reports that Google is working on making email more global. The first step is internationalized email addresses, like this one:

[Image: an example of an internationalized email address]

As someone who has worked in international environments for years, I strongly dislike this idea. There is a whole array of issues related to it: readability of the email address (yes, try reading it!), display issues (do you have a font with all the necessary characters?), writing the email address out (searching through the address book, for example), or even copy-pasting an email address (have you tried copy-pasting English strings from Hebrew or Arabic documents? Now you’ll be copy-pasting international email addresses from English documents – so much fun!). On top of that, there are all the usual things related to SPAM filters, trust issues (is this a company, a free email host, or a personal domain?), etc. Can you spell out such an email address over the phone? How about typing it on a mobile phone? Do you even know which language it is in?

Using non-accented Latin characters is a pain for all those people who don’t speak English. But it has worked nonetheless for the last few decades. Now we are heading towards a future where that pain won’t be limited to those who don’t read English, but will extend to everyone, since you can’t really learn all the languages of the world, or control which language the email addresses landing in your inbox are in. And remember that just because an email address is in a given language, it doesn’t mean that the content of the email is in the same language.

On top of that, we’ve tried this already with internationalized URLs. See how well that worked out. Yeah, some people do use them. But try copy-pasting such a URL around and I guarantee you’ll end up with a whole bunch of long and cumbersome escaped strings. The same or a similar fate awaits email addresses…
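To make the “escaped strings” point concrete, here is a quick illustration of my own (not from the original post), assuming PHP with the intl extension, of what a non-Latin URL turns into the moment it leaves a Unicode-aware context:

<?php

// A Cyrillic host and path, as a user would type them.
$host = 'пример.испытание';
$path = '/статья';

// The hostname gets converted to Punycode...
echo idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46), "\n";
// xn--e1afmkfd.xn--80akhbyknj4f

// ...and the path gets percent-encoded when pasted into an ASCII-only context.
echo rawurlencode($path), "\n";
// %2F%D1%81%D1%82%D0%B0%D1%82%D1%8C%D1%8F

Neither form is something you would want to read back to someone over the phone, which is exactly the problem with internationalized email addresses too.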

Mamihlapinatapai


The word Mamihlapinatapai (sometimes spelled mamihlapinatapei) is derived from the Yaghan language of Tierra del Fuego, listed in The Guinness Book of World Records as the “most succinct word”, and is considered one of the hardest words to translate. It refers to “a look shared by two people, each wishing that the other will offer something that they both desire but are unwilling to suggest or offer themselves.” It is also cited in books and articles on game theory associated with the volunteer’s dilemma.

Managing gettext translations on the command line

I am currently working on a rather multilingual project at the office. And, as always, we tried a few alternatives before ending up with gettext again. For those of you who don’t know, gettext is the de facto standard for managing language translations in software, especially when it comes to messages and user interface elements. It’s a nice, powerful system, but it’s a bit awkward when it comes to web development.

Anyway, we started using it in a bit of a rush, without doing all the necessary planning, and quite soon ended up in a bit of a mess. Different people used different editors to update translations, and each person’s environment was set up in a different way. All of that made its way into the PO files that hold the translations. What’s more, we didn’t really define a procedure for updating translations. That became a bigger problem when we realized that Arabic had only 50 translated strings, while English had 220, and Chinese 350. All languages were supposed to have exactly the same number of strings, even if the actual translations were missing.

So today I had to rethink and redefine how we do it. First of all, I had to figure out and try the process outside of the project. It took me a good couple of hours to brush up on my gettext knowledge and find some useful documentation online. Here is a very helpful article that got me started.

After reading the article, a few manuals and playing with the actual commands, I decided on the following:

  1. The source of all translations will be a single POT file.  This file will be completely dropped and regenerated every time any strings are updated in the source code.
  2. Each language will have a PO file of its own. However, the strings for the language won’t be extracted from the source code, but from the common POT file.
  3. All editors will use the current project folder as the primary path. In other words, “.” instead of the full path to “/var/www/foobar”. This will make all file references in PO/POT files point to locations relative to the project folder, ignoring the specifics of each contributor’s setup.
  4. Updating the per-language PO files and building the MO files will be part of the project build/deploy script, to make sure everything stays as up to date as possible.

Now for the actual code. Here is the shell script that does the job. (Here is a link to the Gist, in case I update it in the future.)

#!/bin/bash

DOMAIN="project_tag"
POT="$DOMAIN.pot"
LANGS="en_US ru_RU"
SOURCES="*.php"

# Create template
echo "Creating POT"
rm -f $POT
xgettext \
 --copyright-holder="2012 My Company Ltd" \
 --package-name="Project Name" \
 --package-version="1.0" \
 --msgid-bugs-address="translations@company.com" \
 --language=PHP \
 --sort-output \
 --keyword=__ \
 --keyword=_e \
 --from-code=UTF-8 \
 --output=$POT \
 --default-domain=$DOMAIN \
 $SOURCES

# Create languages
for LANG in $LANGS
do
 if [ ! -e "$LANG.po" ]
 then
  echo "Creating language file for $LANG"
  msginit --no-translator --locale=$LANG.UTF-8 --output-file=$LANG.po --input=$POT
 fi

 echo "Updating language file for $LANG from $POT"
 msgmerge --sort-output --update --backup=off $LANG.po $POT

 echo "Converting $LANG.po to $LANG.mo"
 msgfmt --check --verbose --output-file=$LANG.mo $LANG.po
done


Now, all you need to do is run the script once to get the default POT file and a PO file for every language. You can edit the PO files with translations as much as you want. Then simply run the script again and it will update the generated MO files. No parameters, no manuals, no nothing. If you need to add another language, just put the appropriate locale in the $LANGS variable and run the script again. You are good to go.
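One more note, in case the --keyword=__ and --keyword=_e options above look cryptic: they tell xgettext which wrapper functions the PHP code uses to mark translatable strings. Those wrappers are not part of the script, so here is a rough sketch of my own of what they could look like on the PHP side, assuming the standard gettext extension and the usual <locale>/LC_MESSAGES/<domain>.mo layout (the paths and locale below are placeholders, not from the project):

<?php

$domain = 'project_tag';

// Pick a locale and point gettext at the compiled MO files. The ru_RU.mo
// produced by the script would need to end up at
// ./locale/ru_RU/LC_MESSAGES/project_tag.mo for this setup to find it.
setlocale(LC_ALL, 'ru_RU.UTF-8');
bindtextdomain($domain, __DIR__ . '/locale');
bind_textdomain_codeset($domain, 'UTF-8');
textdomain($domain);

// Return a translated string.
function __($text)
{
    return gettext($text);
}

// Echo a translated string.
function _e($text)
{
    echo gettext($text);
}

echo __('Hello, world!');

With wrappers like these in place, xgettext picks up every string passed to __() and _e() and drops it into the POT file.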

Enjoy!