World’s Biggest Data Breaches

Here’s an interactive collection of the world’s biggest data breaches.  It goes back to 2004, where about 92,000,000 email addresses and screen names were stolen by an AOL employee, and covers most of the major events up until and including 2016.  There are a few ways to filter the data and change the representation.

Overall, should give you a pretty good idea of how safe and secure your online data is. Oh, and how private it is too.

Getting started with workflows in PHP

For a large project at work, we need to integrate or develop a workflow engine.  I worked a little bit with workflow engines in the past, but the subject is way to big and complex for me to claim any expertise in it.

So, I am looking at what’s available these days and what are our options.  This post is a collection of initial links and thoughts, and it’s goal is mostly to document my research process and findings, and not to provide any answers or solutions yet.

Continue reading Getting started with workflows in PHP

Social Media Research Toolkit

Social Media Research Toolkita list of 50+ social media research tools curated by researchers at the Social Media Lab at Ted Rogers School of Management, Ryerson University. The kit features tools that have been used in peer-reviewed academic studies. Many tools are free to use and require little or no programming. Some are simple data collectors such as tweepy, a Python library for collecting Tweets, and others are a bit more robust, such as Netlytic, a multi-platform (Twitter, Facebook, and Instagram) data collector and analyzer, developed by our lab. All of the tools are confirmed available and operational.

Via Four short links: 14 Feb 2017.

10,000 most common English words

This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google’s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there’s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more – resulting in a training corpus of one trillion words from public Web pages.

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.

There are a few variations of the list – with and without the swear words and such.  I took a quick look at it and was surprised to find that “cyprus” is at position 4,993 (pretty high), immediately after the word “emails“.  Weird!

(found via the link from this article)

Engineers Salary Data

Amitj Aggarwal, former Staff Engineer at Google (2008-2012), has collected a whole bunch of data in regards to engineers salaries, in USA and worldwide.  His points seem to be overly optimistic at times, but I don’t have any links handy to contradict his research.

Here are a few points to get you started:

  • Zoho, Salesforce pay 40% more than Oracle, Cisco, GE!!!
  • Top 7% or so engineers at Netflix, Amazon, Google, Facebook are paid more than $1.4M per year. Next 10% make $700K on average.
  • Facebook has lost relevance to Slack, LinkedIn, Snapchat, Pinterest and Quora. If you are working at Facebook ask for a 50% raise else move to a startup.
  • Oracle is loosing to cloud startups. If you are working at Oracle ask for a 60% raise else move to a startup.
  • ENGINEERS DO NOT WASTE MONEY ON AN MBA. You will make 2X more on average as an engineer.
  • Tableau, Splunk, Slack, Airbnb, Quora, Twitter, Facebook, Google pay more than $320K salary to their top hires. Definitely interview at these fine places. Uber top engineer salaries are $190-340K per year.
  • Starting salaries for fresh software engineering graduates is now $130K-160K. Ask shamelessly. For the best ones its ~$180K.
  • Apple pays 60% more than Samsung.