10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
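The pruning described in the quote (keep only sequences seen at least 40 times, drop words seen fewer than 200 times) can be sketched in a few lines of Python. This is only a toy illustration, not how Google actually processed the data: the function name `ngram_counts`, the sample sentence, and the tiny cutoff are all made up for the example.

```python
from collections import Counter


def ngram_counts(tokens, n, min_count):
    """Count n-grams in a token list and keep those seen at least min_count times."""
    # zip over n shifted views of the list yields every run of n consecutive tokens
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical toy corpus; the real cutoffs were 40 (5-grams) and 200 (single words)
tokens = "the cat sat on the mat and the cat sat on the rug".split()
bigrams = ngram_counts(tokens, n=2, min_count=2)
unigrams = ngram_counts(tokens, n=1, min_count=2)
```

With a real corpus you would call it with `n=5, min_count=40` for the five-word sequences and `n=1, min_count=200` for the vocabulary, though anything close to a trillion words obviously needs a distributed setup rather than a single in-memory `Counter`.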

There are a few variations of the list – with and without the swear words and such.  I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails”.  Weird!
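If you want to check positions like that yourself, here is a minimal Python sketch. It assumes the list is a plain text file with one word per line, most frequent first; the `load_words` helper and any file name you pass to it are my assumptions, not something taken from the repository, so check the actual file names there.

```python
def load_words(path):
    """Read a one-word-per-line file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def word_ranks(words):
    """Map each word to its 1-based position in a frequency-sorted list."""
    ranks = {}
    for position, word in enumerate(words, start=1):
        # setdefault keeps the first (highest) position if a word repeats
        ranks.setdefault(word, position)
    return ranks


# Tiny made-up stand-in for the real 10,000-word list
sample = ["the", "of", "and", "to", "a"]
ranks = word_ranks(sample)
```

With the real file, `word_ranks(load_words(path))["cyprus"]` would give the position of “cyprus” in whichever variation of the list you loaded.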

(found via the link from this article)

Election. Yeah, right!

This Google blog post titled “A voice for everyone in 2016” made me chuckle:

Every election matters and every vote counts. The American democracy relies on everyone’s participation in the political process. This November, Americans all across the country will line up at the polls to cast their ballots for the President of the United States.

It sounds like a true effort to make things better and enhance democracy and whatnot.  But in practice, is it really an election? One by one, the candidates are falling off the ballot.  Day by day it becomes more obvious that Hillary Clinton will be the next president of the USA.

The more tools and technologies we have to enhance our lives, the worse the content on which we can apply those tools becomes.  The better the home cinemas became, the worse the movies got.  The better audio systems we have, the worse the music gets.  And politics just follows the same trend, unfortunately.

National Cancer Institute on Cannabis and Cannabinoids

The National Cancer Institute has an interesting update on cannabis … Basically, marijuana is not yet universally approved as a medical treatment for cancer (only in a few states for now), but quite a few large studies suggest that it is not only not harmful, but quite helpful for both cancer treatment and post-treatment relief.


I think this is a good step in the direction of “the world is not black and white”.  We’ve been tagging everything as just good or bad for way too long.  It’s time to start looking at benefits and side effects in a bit more detail.

TrumpDonald.org

 


Since the dawn of time, porn websites have been at the tip of the spear for all web development technology.  They were the first ones to use cookies, forms, online payments, streaming video, browser fingerprinting, responsive design and more.  It’s good to see that political activism is trying to catch up. Enjoy TrumpDonald.org!

America is full of high-earning poor people


“America is full of high-earning poor people” is an interesting article, with lots of charts and statistics, on how poor even high-earning households are in America.  The problem is, of course, not unique to the United States.

The fact that the average upper-middle-class household has just $12,200 in non-pension financial wealth is disturbing. Even worse, within that group, about 25% of the higher earning population had only $3,200 in 2013. It’s no wonder one quarter of all American households couldn’t come up with $2,000 if they faced an emergency—it’s not just low earners.

Celebrating Columbus Day …

Just in time for the celebration of Columbus Day in the USA, kottke.org links to a few sources (one, two, three) that suggest that the guy was not worthy:

Population figures from 500 years ago are necessarily imprecise, but Bergreen estimates that there were about 300,000 inhabitants of Hispaniola in 1492. Between 1494 and 1496, 100,000 died, half due to mass suicide. In 1508, the population was down to 60,000. By 1548, it was estimated to be only 500.

Understandably, some natives fled to the mountains to avoid the Spanish troops, only to have dogs set upon them by Columbus’s men. (Bergreen, 205)

10 Conspiracy Theories That Turned Out To Be True

10 Conspiracy Theories That Turned Out To Be True – some I’ve heard about before, some are new to me.  I’ll keep the list here for further reading and research.

  1. The Gulf of Tonkin Incident
  2. Tuskegee Syphilis Experiment
  3. Project MKUltra
  4. Operation Northwoods
  5. CIA Drug Trafficking
  6. Operation Mockingbird
  7. COINTELPRO
  8. Operation Snow White
  9. Secret Global Economic Policies
  10. The US Government Illegally Spies On Its Own Citizens

Citizenfour


It’s been a long while (almost two years, in fact) since I posted a movie review.  It’s not that I haven’t seen any good movies in this period; it’s more that I tend to sound repetitive when I write these.  Watch that, this one is awesome, etc.

Last night I watched “Citizenfour”, and I have to say I’m shaken by that documentary.  And I’m not a privacy or security freak, and I was somewhat familiar with Edward Snowden’s story.  This film, while portraying his personality, is not so much about him as it is about the state of affairs.

As a non-US citizen, I have very little interest in what the US government is doing.  I don’t particularly care if someone is recording my Internet traffic, Google searches, or the phone calls I make.  I’m not worried about ending up “on the list”, or anything like that.

But not everyone is like that.  I do understand how government surveillance can be used, how data can be analyzed, and how pressure can be applied.  And I do share the point of view that the balance of power between the government and the people is way off (and not only in the US), and that we are beyond the point of any meaningful individual resistance.  It’s just that I don’t do anything about it, and Edward Snowden did.

For me personally, quite a few things were new in this film.  It was interesting to learn about the variety of NSA and CIA programs, the depth of their reach, and the technology that is in place already.  Some of it does sound like science fiction, but is in fact very possible.  The stuff about security access in the NSA, drone video feeds, data gathering, analysis and search, with real time notifications, etc – all that was insightful.

The other side to the movie that I found interesting was the whole process that was used to expose these documents.  There is in fact no framework as to how such things can be done, what should and shouldn’t be published, how things can be verified, etc.  The move to remove his own bias and pass the responsibility on to the journalists was interesting.

Overall, I think that the more people see this movie, the better.  The issues raised are very important and we should know about them.  They don’t only affect criminals or terrorists or Americans.  They affect everyone.  In particular, everyone who has a phone, or a computer with an Internet connection, or a credit card.  After all, there are 1,200,000 people on the US watch lists, and from what I understand, this list is growing fast.