statistics

World’s Biggest Data Breaches

Here’s an interactive collection of the world’s biggest data breaches. It goes back to 2004, where about 92,000,000 email addresses and screen names were stolen by an AOL employee, and covers most of the major events up until and including 2016. There are a few ways to filter the data and change the representation.

Overall, should give you a pretty good idea of how safe and secure your online data is. Oh, and how private it is too.

GIT quick statistics

Any git repository contains a tonne of information about commits, contributors, and files. Extracting this information is not always trivial, mostly because of a gadzillion options to a gadzillion git commands – I don’t think there is a single person alive who knows them all. Probably not even Linus Torvalds himself.

git-quick-stats is a tool that simplifies access to some of that information and makes reports and statistics quick and easy to extract. It also works across UNIX-like operating systems, Mac OS X, and Windows.

GitHub to MySQL

GitHub to MySQL is a handy little app in PHP that pulls labels, milestones and issues from GitHub into your local MySQL database. This is useful for analysis and backup purposes.

There are a few example queries provided that show issues vs. pull requests, average number of days to merge a pull request over the past weeks, average number of pull requests open every day, and total number of issues.

I think this tool can be easily extended to pull other information from GitHub, such as release notes, projects, web hooks. Also, if you are using multiple version control services, such as BitBucket and GitLab, extending this tool can help with merging data from multiple sources and cross-referencing it with the company internal tools (bug trackers, support ticketing systems, CRM, etc).

This is not something I’ll be doing now, but I’m sure the future is not too far away.

10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

There are a few variations of the list – with and without the swear words and such. I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails“. Weird!

(found via the link from this article)