10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
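To make the thresholds in the quote a bit more concrete, here is a minimal sketch of counting word n-grams and discarding the rare ones. It is a toy illustration only, not Google's actual pipeline: the function names `count_ngrams` and `prune`, the sample sentence, and the cutoff of 2 are all made up for the example (the real dataset used five-word sequences and a cutoff of 40).

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prune(counts, min_count):
    """Keep only the n-grams seen at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy corpus (hypothetical); whitespace splitting stands in for real tokenization.
text = "the cat sat on the mat and the cat sat on the sofa"
tokens = text.split()

# Bigrams with a cutoff of 2, instead of five-word sequences with a cutoff of 40.
bigrams = prune(count_ngrams(tokens, 2), min_count=2)
for gram, count in sorted(bigrams.items()):
    print(" ".join(gram), count)
```

At a trillion words the counting has to be spread across many machines, but the idea of sliding a fixed-size window over the text and dropping rare sequences stays the same.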

There are a few variations of the list – with and without the swear words and such. I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails”. Weird!
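Checking a word's position in a frequency-sorted list like this takes only a few lines. The sketch below assumes the list has already been loaded as one word per entry, most frequent first; the `build_ranks` helper and the five-word sample are stand-ins for illustration, not the actual contents of the repository.

```python
def build_ranks(words):
    """Map each word to its 1-based position in a frequency-sorted list."""
    return {word: rank for rank, word in enumerate(words, start=1)}


# Stand-in data (hypothetical); the real list would be read from a file,
# one word per line, most frequent first.
sample = ["the", "of", "and", "to", "a"]
ranks = build_ranks(sample)

print(ranks["and"])         # 3
print(ranks.get("cyprus"))  # None, since the word is not in this tiny sample
```

With the full list loaded the same way, `ranks["cyprus"]` would give the position directly.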

(found via the link from this article)

Engineers Salary Data

Amitj Aggarwal, former Staff Engineer at Google (2008-2012), has collected a whole bunch of data on engineers’ salaries, in the USA and worldwide. His points seem overly optimistic at times, but I don’t have any links handy to contradict his research.

Here are a few points to get you started:

  • Zoho, Salesforce pay 40% more than Oracle, Cisco, GE!!!
  • Top 7% or so engineers at Netflix, Amazon, Google, Facebook are paid more than $1.4M per year. Next 10% make $700K on average.
  • Facebook has lost relevance to Slack, LinkedIn, Snapchat, Pinterest and Quora. If you are working at Facebook ask for a 50% raise else move to a startup.
  • Oracle is losing to cloud startups. If you are working at Oracle ask for a 60% raise else move to a startup.
  • ENGINEERS DO NOT WASTE MONEY ON AN MBA. You will make 2X more on average as an engineer.
  • Tableau, Splunk, Slack, Airbnb, Quora, Twitter, Facebook, Google pay more than $320K salary to their top hires. Definitely interview at these fine places. Uber top engineer salaries are $190-340K per year.
  • Starting salaries for fresh software engineering graduates are now $130K-160K. Ask shamelessly. For the best ones it’s ~$180K.
  • Apple pays 60% more than Samsung.

Google Infrastructure Security Design Overview

If you ever wanted to know what Google does to maintain its high level of security, here’s your chance. Google Infrastructure Security Design Overview provides quite a bit of information on the subject.

This document gives an overview of how security is designed into Google’s technical infrastructure. This global scale infrastructure is designed to provide security through the entire information processing lifecycle at Google. This infrastructure provides secure deployment of services, secure storage of data with end user privacy safeguards, secure communications between services, secure and private communication with customers over the internet, and safe operation by administrators.

Headless Browsers

Headless Browsers is a list of (almost) all headless web browsers in existence. These are browsers without a graphical user interface, controlled programmatically, and useful for testing, automation, and other similar tasks.

I’ve used one or two. I’ve heard about three or four. I had no idea there was such a variety though.