Morphos – morphological solution in PHP for English and Russian

If you ever had to deal with morphology in English, you probably found one or two libraries to help you out.  But if you had to do that for Russian, than I’m sure you are missing a few hairs, and the ones that you still have are grayer than they used to be.  I’ve got some good news for you though, now there is Morphos (GitHub repository).

Morphos is a morphological solution written completely in the PHP language. Supports Russian and English. Provides classes to decline First/Middle/Last names/nouns and generate cardinal numerals.

Just look at this beauty!

var_dump($dec->getForms($user_name, $dec->detectGender($user_name)));
/* Will produce something like
  array(6) {
    ["nominativus"]=>
    string(8) "Иван"
    ["genetivus"]=>
    string(10) "Ивана"
    ["dativus"]=>
    string(10) "Ивану"
    ["accusative"]=>
    string(10) "Ивана"
    ["ablativus"]=>
    string(12) "Иваном"
    ["praepositionalis"]=>
    string(15) "об Иване"
  }
*/

Just this alone can make user interfaces and emails so much better.  But there is more to it than that.

10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

There are a few variations of the list – with and without the swear words and such.  I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails“.  Weird!

(found via the link from this article)

WordPress : Preferred Languages Research

Pascal Birchler of the WordPress blogs some interesting research he did in the area of handling preferred language and how different systems – ranging from browsers, wikis, and social networks to all kinds of content management systems – approach and solve the problem.

drupal-language-hierarchy-module

Drupal

Drupal 8 has a rather powerful user interface text language detection mechanism. There is a per session, per user and per browser option in the detection settings. However, users can only choose one language, so they cannot say (in core at least) that they want German primarily and Spanish if German is not available. But the language selected by the user is part of the larger fallback system, so it may fall back further down to other options.

The Language fallback module allows defining one fallback for a language, while the Language Hierarchy module provides a GUI to change the language fallback system. It allows setting up language hierarchies where translations of a site’s content, settings and interface can fall back to parent language translations, without ever falling back to English. This module might be the most interesting one for our research.

Apart from the research itself, I think this is an interesting example of how complex some seemingly simple features are.

Programming and Greek

One thought that cracks me up every now and then is about Greek programmers.  In Greek language, instead of a question mark a semicolon is used.

Greek

In many programming languages, a semicolon is used to represent the end of statement.  So, this:

$a = $b + $c;
print $a;

to Greek programmers must be looking like this:

$a = $b + $c?
print $a?

I don’t know about you, but to me this would be a constant confidence issue.  It’s almost like I’m not sure what I’m going and asking the computer to confirm.

I’m sure though they have their ways of working around this …

By the way, while reading through the Wikipedia article linked above, I thought that the possible origins of the question mark were quite interesting:

questio

 

That would also explain why not all the languages are using the question mark character.

The Definitive Guide to Natural Language Processing

The Definitive Guide to Natural Language Processing” is an easy to follow article on what a challanging task it is for machines to understand human language.  There’s also this cool video of two bots talking to each other.