localization

WordPress: Preferred Languages Research

Pascal Birchler of WordPress blogs about some interesting research he did in the area of handling preferred languages, and how different systems – ranging from browsers, wikis, and social networks to all kinds of content management systems – approach and solve the problem.


Drupal

Drupal 8 has a rather powerful user interface text language detection mechanism. There is a per session, per user and per browser option in the detection settings. However, users can only choose one language, so they cannot say (in core at least) that they want German primarily and Spanish if German is not available. But the language selected by the user is part of the larger fallback system, so it may fall back further down to other options.

The Language fallback module allows defining one fallback for a language, while the Language Hierarchy module provides a GUI to change the language fallback system. It allows setting up language hierarchies where translations of a site’s content, settings and interface can fall back to parent language translations, without ever falling back to English. This module might be the most interesting one for our research.
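
To make the fallback idea a bit more concrete, here is a minimal sketch in bash (version 4 or later, for associative arrays). It is not how Drupal or the Language Hierarchy module implements it; the PARENT map and the resolve_language function are names made up purely for illustration, and the map is assumed to contain no cycles beyond what the hop limit catches.

```shell
#!/bin/bash

# Hypothetical hierarchy: each language points to its parent language.
# These pairs are illustrative only, not taken from Drupal.
declare -A PARENT=(
  [de_AT]="de"
  [de]="es"
)

# Print the first language in the chain starting at $1 that appears in
# the space-separated list of available translations ($2).
# Returns 1 if the chain ends without a match (no implicit English).
resolve_language() {
  local lang="$1" available=" $2 " hops=0
  # The hop limit guards against an accidental cycle in PARENT.
  while [ -n "$lang" ] && [ "$hops" -lt 10 ]; do
    if [[ "$available" == *" $lang "* ]]; then
      echo "$lang"
      return 0
    fi
    lang="${PARENT[$lang]:-}"
    hops=$((hops + 1))
  done
  return 1
}

# Austrian German is wanted, but only Spanish and English exist.
resolve_language de_AT "es en"   # prints "es"
```

The point of returning a failure instead of a default is the same as in the module description above: the chain decides where to fall back, and English is not assumed at the end of it.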

Apart from the research itself, I think this is an interesting example of how complex some seemingly simple features are.

Managing gettext translations on the command line

I am currently working on a rather multilingual project in the office. And, as always, we tried a few alternatives before ending up with gettext again. For those of you who don’t know, gettext is the de facto standard for managing language translations in software, especially when it comes to messages and user interface elements. It’s a nice, powerful system, but it’s a bit awkward when it comes to web development.

Anyways, we started using it in a bit of a rush, without doing all the necessary planning, and quite soon ended up in a bit of a mess. Different people used different editors to update translations. And each person’s environment was set up in a different way. All of that made its way into the PO files that hold translations. Moreover, we didn’t really define the procedure for updating translations. That became a bigger problem when we realized that Arabic had only 50 translated strings, while English had 220, and Chinese 350. All languages were supposed to have exactly the same number of strings, even if the actual translations were missing.

So today I had to rethink and redefine how we do it. First of all, I had to figure out and try the process outside of the project. It took me a good couple of hours to brush up on my gettext knowledge and find some useful documentation online. Here is a very helpful article that got me started.

After reading the article, a few manuals and playing with the actual commands, I decided on the following:

The source of all translations will be a single POT file. This file will be completely dropped and regenerated every time any strings are updated in the source code.
Each language will have a PO file of its own. However, the strings for the language won’t be extracted from the source code, but from the common POT file.
All editors will use the current project folder as the primary path. In other words, “.” instead of a full path such as “/var/www/foobar”. This makes all file references in PO/POT files relative to the project folder, ignoring the specifics of each contributor’s setup.
Updating the language files (PO) and building the MO files will be a part of the project build/deploy script, to make sure everything stays as up to date as possible.
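
The relative path rule is easy to break by accident, so a small check in the build can help. Below is a rough sketch, assuming the usual PO convention that source references live on lines starting with "#: ". The function name check_relative_refs is made up for this example and is not part of gettext.

```shell
#!/bin/bash

# Print every reference line ("#: file:line ...") that contains an
# absolute path, prefixed with the file name and line number.
# Returns 0 when all references are relative, 1 otherwise.
# (check_relative_refs is a hypothetical helper, not a gettext tool.)
check_relative_refs() {
  # The extra /dev/null argument makes grep print the file name
  # even when only one PO/POT file is given.
  if grep -nE '^#: (.* )?/' "$@" /dev/null; then
    return 1
  fi
  return 0
}

# Possible usage in a build script (assumes PO/POT files in the current folder):
#   check_relative_refs ./*.po ./*.pot || echo "Absolute paths found"
```

A reference such as "#: ./index.php:12" passes, while "#: /var/www/foobar/index.php:12" is reported.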

Now for the actual code. Here is the shell script that does the job. (Here is a link to the Gist, just in case I update it in the future.)

#!/bin/bash

DOMAIN="project_tag"
POT="$DOMAIN.pot"
LANGS="en_US ru_RU"
SOURCES="*.php"

# Create template
echo "Creating POT"
rm -f "$POT"
# $SOURCES is left unquoted on purpose, so that the shell expands the glob
xgettext \
  --copyright-holder="2012 My Company Ltd" \
  --package-name="Project Name" \
  --package-version="1.0" \
  --msgid-bugs-address="[email protected]" \
  --language=PHP \
  --sort-output \
  --keyword=__ \
  --keyword=_e \
  --from-code=UTF-8 \
  --output="$POT" \
  --default-domain="$DOMAIN" \
  $SOURCES

# Create languages
# The loop variable is LOCALE rather than LANG, because LANG is the
# environment variable that controls the locale of the tools themselves
for LOCALE in $LANGS
do
  if [ ! -e "$LOCALE.po" ]
  then
    echo "Creating language file for $LOCALE"
    msginit --no-translator --locale="$LOCALE.UTF-8" --output-file="$LOCALE.po" --input="$POT"
  fi

  echo "Updating language file for $LOCALE from $POT"
  msgmerge --sort-output --update --backup=off "$LOCALE.po" "$POT"

  echo "Converting $LOCALE.po to $LOCALE.mo"
  msgfmt --check --verbose --output-file="$LOCALE.mo" "$LOCALE.po"
done

Now, all you need to do is run the script once to get the default POT file and a PO file for every language. You can edit the PO files with translations as much as you want. Then simply run the script again and it will update the generated MO files. No parameters, no manuals, no nothing. If you need to add another language, just put the appropriate locale in the $LANGS variable and run the script again. You are good to go.
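
Since the whole exercise started with languages having different numbers of strings, a quick sanity check after running the script might look like the sketch below. It simply counts the lines that begin with msgid " in each file, so the header entry is included in every count and obsolete entries (commented out with #~) are not. The function names count_entries and check_entry_counts are invented for this example.

```shell
#!/bin/bash

# Count the entries in a PO/POT file. The header entry (msgid "") is
# counted too, which is fine as long as every file has exactly one.
# msgid_plural lines and obsolete "#~ msgid" lines do not match.
count_entries() {
  grep -c '^msgid "' "$1"
}

# Compare each PO file ($2, $3, ...) against the template ($1).
# Prints one line per mismatch; returns 1 if any file differs.
check_entry_counts() {
  local pot="$1" expected actual po status=0
  shift
  expected=$(count_entries "$pot")
  for po in "$@"; do
    actual=$(count_entries "$po")
    if [ "$actual" -ne "$expected" ]; then
      echo "$po: $actual entries, expected $expected"
      status=1
    fi
  done
  return "$status"
}

# Possible usage, with the file names produced by the script above:
#   check_entry_counts project_tag.pot en_US.po ru_RU.po
```

For a breakdown of translated versus untranslated strings in a single file, msgfmt also has a --statistics option.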

Enjoy!

Hope to see more language controls in Google Reader

If you have read this blog even for a short while, you probably know that I depend on many Google tools, such as Gmail and Google Reader. As a power user, I believe I know pretty much everything these services have to offer. I also know a few things that these services don’t have on offer yet, but which I’d gladly welcome.

I already mentioned the sharing of interesting items in Google Reader with your contacts. That’s a really nice feature. And you can even control which users you see shared items from. However, one important thing is missing in that functionality – language control.

You see, I don’t have that many friends who use Google Reader and share items, but even those few that I have speak a total of 7 languages (Russian, English, Greek, French, Ukrainian, Dutch, and German). Not only do they speak these languages, but they also share a lot of items in those languages. That is sort of useless, since I only know two languages – Russian and English. These two are enough to provide the common ground for communication with all of my friends.

So, what I would really like to see in Google Reader is a new setting which would let me filter my friends’ shared items to only those languages that I can understand. I know this can be a bit tricky to implement (how does the system know which language a shared item is in? or, even, what should it do if a shared item is in more than one language?), but it would be really helpful functionality. And a huge time saver too, since then I wouldn’t have to go through all those items that I don’t understand and mark them as read.

Should such a feature appear, I’d like to see it taken to the extreme. I should be able to automatically tag or search content in a specific language. This would give me a useful tool for comparing the hype about the same topic in different language communities.