Slashdot has an interesting story of why there are only two variations of the word tea in the majority of languages:
With a few minor exceptions, there are really only two ways to say “tea” in the world. One is like the English term — te in Spanish and tee in Afrikaans are two examples. The other is some variation of cha, like chay in Hindi. Both versions come from China. How they spread around the world offers a clear picture of how globalization worked before “globalization” was a term anybody used. The words that sound like “cha” spread across land, along the Silk Road. The “tea”-like phrasings spread over water, by Dutch traders bringing the novel leaves back to Europe.
The term cha is “Sinitic,” meaning it is common to many varieties of Chinese. It began in China and made its way through central Asia, eventually becoming “chay” in Persian. That is no doubt due to the trade routes of the Silk Road, along which, according to a recent discovery, tea was traded over 2,000 years ago. This form spread beyond Persia, becoming chay in Urdu, shay in Arabic, and chay in Russian, among others. It even it made its way to sub-Saharan Africa, where it became chai in Swahili. The Japanese and Korean terms for tea are also based on the Chinese cha, though those languages likely adopted the word even before its westward spread into Persian. But that doesn’t account for “tea.” The te form used in coastal-Chinese languages spread to Europe via the Dutch, who became the primary traders of tea between Europe and Asia in the 17th century, as explained in the World Atlas of Language Structures. The main Dutch ports in east Asia were in Fujian and Taiwan, both places where people used the te pronunciation. The Dutch East India Company’s expansive tea importation into Europe gave us the French the, the German Tee, and the English tea.
This reminds me of this old post about how most languages, apart from English, use “ananas” as a word for pineapple.
I am not learning Japanese (just yet), but I still find the diagram above aesthetically pleasing. It’s from this article, which discusses the structure of the Japanese sentences versus the English ones.
This library can detect the language of a given text string. It can parse given training text in many different idioms into a sequence of N-grams and builds a database file in JSON format to be used in the detection phase. Then it can take a given text and detect its language using the database previously generated in the training phase. The library comes with text samples used for training and detecting text in 106 languages.
I tried it briefly with a few languages that I can master a phrase or two in, and it works better with some than with others. Greek was good, Russian not so much.
Hopefully, the sample data used for training will improve over time, but it’s definitely a good start.
If you ever had to deal with morphology in English, you probably found one or two libraries to help you out. But if you had to do that for Russian, than I’m sure you are missing a few hairs, and the ones that you still have are grayer than they used to be. I’ve got some good news for you though, now there is Morphos (GitHub repository).
Morphos is a morphological solution written completely in the PHP language. Supports Russian and English. Provides classes to decline First/Middle/Last names/nouns and generate cardinal numerals.
Just look at this beauty!
/* Will produce something like
string(15) "об Иване"
Just this alone can make user interfaces and emails so much better. But there is more to it than that.
This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.
Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.
We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
There are a few variations of the list – with and without the swear words and such. I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails“. Weird!
Pascal Birchler of the WordPress blogs some interesting research he did in the area of handling preferred language and how different systems – ranging from browsers, wikis, and social networks to all kinds of content management systems – approach and solve the problem.
Drupal 8 has a rather powerful user interface text language detection mechanism. There is a per session, per user and per browser option in the detection settings. However, users can only choose one language, so they cannot say (in core at least) that they want German primarily and Spanish if German is not available. But the language selected by the user is part of the larger fallback system, so it may fall back further down to other options.
The Language fallback module allows defining one fallback for a language, while the Language Hierarchy module provides a GUI to change the language fallback system. It allows setting up language hierarchies where translations of a site’s content, settings and interface can fall back to parent language translations, without ever falling back to English. This module might be the most interesting one for our research.
Apart from the research itself, I think this is an interesting example of how complex some seemingly simple features are.