400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Here is an interesting bit of research – do people prefer tabs or spaces when programming in the most popular languages?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

The results are not very surprising and somewhat disappointing (for all of us tab fans):

tabs vs. spaces

As far as PHP goes, I’m sure the choice of spaces has to do with the PSR-2 coding style guide, which states:

Code MUST use 4 spaces for indenting, not tabs.

On a more technical note, I think this is also related to the explosion of editors and IDEs in recent years, which, as good as they are, aren’t as good as Vim. Vim allows for a very flexible configuration, where your code can be formatted and re-formatted any way you like, making tabs versus spaces a non-issue.

Regardless of the results of the study, what’s more interesting is the method and tools used. I’ve had my eye on Google BigQuery for a while now, but I’m too busy these days to give it a try. The article gives a few insights into how awesome the tool is. 1.6 terabytes of data processed in 864.6 seconds:

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

and:

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.
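The per-line check behind a survey like this can be sketched in a few lines of Python. This is only an illustration of the general idea (look at the first character of every line and let the majority decide), not the article's actual BigQuery query; the function names `count_indentation` and `classify` are made up for this example.

```python
from collections import Counter


def count_indentation(source: str) -> Counter:
    """Count lines that start with a tab versus lines that start with a space."""
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
        # Lines with no leading whitespace say nothing about the preference.
    return counts


def classify(source: str) -> str:
    """Label a file by whichever indentation character leads more lines."""
    counts = count_indentation(source)
    if counts["tabs"] > counts["spaces"]:
        return "tabs"
    if counts["spaces"] > counts["tabs"]:
        return "spaces"
    return "undecided"


sample = "def f():\n    return 1\n\ndef g():\n    return 2\n"
print(classify(sample))  # prints "spaces"
```

Running something like this over a billion files is exactly the part that a service like BigQuery takes off your hands.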

If you enjoyed this article, also have a look at “Analyzing GitHub issues and comments with BigQuery”, which works with similar-sized data, trying to figure out how to write bug reports and pull request comments so that they get acted upon faster.

Free Data Science Books

I came across a collection of free data science books:

Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.

Most notably, there are introductory books, handbooks, a Hadoop guide, SQL books, social media data mining stuff, and d3 tips and tricks. There’s also plenty on artificial intelligence and machine learning, but that’s too far out for me.

πfs – the data-free filesystem!

πfs is a revolutionary new file system that, instead of wasting space storing your data on your hard drive, stores your data in π! You’ll never run out of space again – π holds every file that could possibly exist! They said 100% compression was impossible? You’re looking at it!
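To get a feel for the joke, here is a toy Python sketch that looks up a short digit string in the decimal expansion of π. It is not how πfs itself is implemented; the digit generator (a well-known unbounded spigot algorithm), the decimal representation, the `find_in_pi` helper, and the 2000-digit search limit are all choices made just for this illustration.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            # Not enough precision yet: consume another term of the series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)


def find_in_pi(pattern: str, limit: int = 2000) -> int:
    """Return the offset of pattern in the first `limit` digits, or -1."""
    digits = "".join(str(d) for d in islice(pi_digits(), limit))
    return digits.find(pattern)


print(find_in_pi("592"))
```

The catch, of course, is that the offset you have to remember tends to be at least as long as the data you were trying to avoid storing.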

GitHub’s Data Challenge II winners announced

GitHub, being a massive data store, is constantly looking for new and improved ways of extracting knowledge from its data.

In April we announced the second annual GitHub data challenge.

[…]

After receiving some amazing entries in the previous challenge, we were excited to see what people would discover with another year of data. The results blew us away: we saw many more entrants and novel applications of our data. GitHubbers ranked their favorite entries, and after tallying the votes, we’re happy to announce the top 3 entries for the 2013 GitHub data challenge.

The second place (Popular Convention by Outsider) and third place (open source contributions by location by David Fischer) winners are very nice.  But the first place winner is truly amazing (The Open Source Report Card, by Dan Foreman-Mackey).  It’s an excellent combination of data crunching with beautiful presentation.  Of course, you’ll need some publicly visible repositories and contributions to see anything interesting, but once you do, it’s quite impressive.  Have a look at mine, for example.

In Head-Hunting, Big Data May Not Be Such a Big Deal

Very interesting interview with Laszlo Bock, senior vice president of people operations at Google. Here are some of my favorite bits.

Years ago, we did a study to determine whether anyone at Google is particularly good at hiring. We looked at tens of thousands of interviews, and everyone who had done the interviews and what they scored the candidate, and how that person ultimately performed in their job. We found zero relationship. It’s a complete random mess, except for one guy who was highly predictive because he only interviewed people for a very specialized area, where he happened to be the world’s leading expert.

So, it’s not just the recruiting agency you work with.  It’s pretty much everyone.

We’re also observing people working together in different groups and have found that the average team size of any group at Google is about six people.

I find teams of five or six people to be the most efficient as well.

On the hiring side, we found that brainteasers are a complete waste of time. How many golf balls can you fit into an airplane? How many gas stations in Manhattan? A complete waste of time. They don’t predict anything. They serve primarily to make the interviewer feel smart.

Oops. I started using brainteasers in interviews years ago. I think I actually learned about them while being interviewed by Google. Contrary to Google’s findings, I think they are useful. That might be because I’m usually in a slightly different line of work.

Behavioral interviewing also works — where you’re not giving someone a hypothetical, but you’re starting with a question like, “Give me an example of a time when you solved an analytically difficult problem.” The interesting thing about the behavioral interview is that when you ask somebody to speak to their own experience, and you drill into that, you get two kinds of information. One is you get to see how they actually interacted in a real-world situation, and the valuable “meta” information you get about the candidate is a sense of what they consider to be difficult.

Not always applicable, but yes, when it is, a very useful way to find out more about the candidate.

We found that, for leaders, it’s important that people know you are consistent and fair in how you think about making decisions and that there’s an element of predictability. If a leader is consistent, people on their teams experience tremendous freedom, because then they know that within certain parameters, they can do whatever they want.

That is good to know. Especially when I suck so badly in the consistency department.

One of the things we’ve seen from all our data crunching is that G.P.A.’s are worthless as a criteria for hiring, and test scores are worthless — no correlation at all except for brand-new college grads, where there’s a slight correlation. Google famously used to ask everyone for a transcript and G.P.A.’s and test scores, but we don’t anymore, unless you’re just a few years out of school. We found that they don’t predict anything.

What’s interesting is the proportion of people without any college education at Google has increased over time as well. So we have teams where you have 14 percent of the team made up of people who’ve never gone to college.

I can easily agree with the absence of correlation between college grades and a candidate’s talent. But I prefer to see at least some education. I don’t insist on it, however, as I’ve worked with a few people who had no formal education in the field but were exceptionally good, having learned from experience.