How Google Uses and Contributes to Open Source

Here is a good Open Source story – “How Google Uses and Contributes to Open Source“, which goes into some detail and history of how Google is working with Open Source community.

I’ve seen this before:

“There are companies and people who just take the software and say, “I didn’t have to pay for it. I can do anything I want. The license file is a big blob of text. I’m not going to read that,” Merlin said.

And I’ve this (quite a few times actually):

Back in its early days, around 1998, Google was a small company. It was using open source just like any other small company. While Google was abiding by licences, they were not giving back much due to several reasons. “Some of it was just run fast and make sure that we have money next month to pay everyone’s salary,” said Merlin.

Having been involved in open sourcing companies’ projects new and old, this is what I firmly believe now is the best strategy:

Go open source from the beginning

Google changed that by writing a lot of things from the ground up as open source or to be open source ready. That was a good lesson that they learned, and that’s a problem many companies face when they want to open source their stuff but can’t because the code was not designed to be open source from the beginning.

This, I think, is an interesting approach too (if  you are too small of a company to have research papers and algorithms, consider blog posts, tips and tweaks, case studies, and the like):

Even if Google can’t open source certain code, they found a way to bring that work to the public. “We wrote papers talking about the magic algorithm that we used. We can’t give you the code for the reason I just explained, but we’re giving you the way they work so you can rewrite them,” said Merlin. Google has published hundreds of such papers and people are using it to create projects based on those ideas.

This bit on Android is mind blowing:

Now virtually all of Google’s open source code is on GitHub, except for Android. “The Android distribution is so big and it gets released in big chunks. So, when it gets released, everyone wants to sync that,” Merlin said. “It’s so huge that if we put it on GitHub, it would completely kill GitHub. We use our own mirrors for that, to help out.”

A word of caution for the companies using Open Source software:

Companies have to be extremely careful when using open source. Different projects use different licenses, and you need to be in compliance with them.

[…]

Things become complicated when you have projects that you ship. In the case of open source, you need to list the projects that you use and their licenses. In the case of BSD and MIT, you need to list the name and the copyright of the person you got that project from.

You’ll probably need a set of tools to deal with issues like this.  For PHP-based projects, composer is indispensable.  You can run “composer licenses” command and instantly get information about the project’s license, as well as licenses for each and every dependency in use (thanks to this blog post).

There is a good section on Contributor License Agreements (CLAs).  I am slightly familiar with the subject (I signed a few myself), but my experience is limited, especially from the company perspective.  I found this part useful, for that distant time when I’ll need to set it up:

Google uses the Apache foundation ICLA, without modifying it or putting anything special in it. CLAs ensure that companies like Google “can re-license your code under a different open source (license) if we need to. Sometimes we need to merge with other projects and that’s what the CLA allows us to do,” said Merlin.

These are just bits and pieces which I found interesting.  I wish more companies shared their practices and experiences – in particular those larger businesses, with years of history and a wide variety of challenges.

git: history of a source code line

git is one of those tools that no matter how much you know about it, there is an infinite supply of new things to learn.  Here’s a handy bit I’ve discovered recently, thanks to this StackOverflow comment:

Since Git 1.8.4, git log has -L to view the evolution of a range of lines.

[…]

And you want to know the history of what is now line 155.

Then, use git log. Here, -L 155,155:git-web–browse.sh means “trace the evolution of lines 155 to 155 in the file named git-web–browse.sh“.

Absolutely brilliant!  I used to suffer through this via an iteration of git blame and git show to the point of custom bash scripts.

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Here is an interesting bit of research – do people prefer tabs or spaces when programming the most popular languages?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

The results are not very surprising and somewhat disappointing (for all of us, tab fans):

tabs vs. spaces

As far as PHP goes, I’m sure the choice of spaces has to do with the PSR-2 coding style guide, which states:

Code MUST use 4 spaces for indenting, not tabs.

On a more technical note, I think this is also related to the explosion of editors and IDEs in the recent years, which, as good as they are, aren’t as good as Vim.  Vim allows for a very flexible configuration, where your code can be formatted and re-formatted any way you like, making tabs or spaces a non-issue at all.

Regardless of the results of the study, what’s more interesting is the method and tools used.  I’ve had my eye on the Google Big Query for a while now, but I’m too busy these days to give it a try.  The article gives a few insights, into how awesome the tool is.  1.6 terabytes of data processed in 864.6 seconds:

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

and:

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.

If you enjoyed this article, also have a look at “Analyzing GitHub issues and comments with BigQuery“, which works with a similar-sized data, trying to figure out how to write bug reports and pull request comments, so that they would be acted upon faster.

Kids react to Metallica

This video made me feel very, very old.  Ancient almost.  I grew up with Metallica and it is shocking to realize that kids these days don’t even know the band.  Apart from that, the reactions are funny.  New generation …

By the way, there are more videos in that YouTube channel (see the Windows 95 for example).   Radio Nostalgie … :)

Quora: if programming languages were countries …

If programming languages were countries, which country would each language represent?” over Quora is hilarious!  Here are a few bits to get you started:

CRussia. Everything has to be done in a backwards way, but everything is possible, and there’s a lot of legacy.

C++USA. Powerful, but more and more complicated, unreadable, error-prone. Tends to dominate and influence everything.

Haskell Monaco. Not many people, but very rich, so they don’t have to consider lower classes’ problems.

Java Sweden. Comfortable, but has its own king and currency.

JavaScript China. Developing really fast and can do lots of surprising stuff. A lot of users.

PHPBangladesh. Poor, but numerous, and it’s found all over the web.

PascalGermany. Strict rules, good performance. And there are many people who just don’t like the language.

BashSwitzerland. Not very big in itself, but pulls the strings of the others.

Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.