400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Here is an interesting bit of research – do people prefer tabs or spaces when programming the most popular languages?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

The results are not very surprising and somewhat disappointing (for all of us, tab fans):

tabs vs. spaces

As far as PHP goes, I’m sure the choice of spaces has to do with the PSR-2 coding style guide, which states:

Code MUST use 4 spaces for indenting, not tabs.

On a more technical note, I think this is also related to the explosion of editors and IDEs in the recent years, which, as good as they are, aren’t as good as Vim.  Vim allows for a very flexible configuration, where your code can be formatted and re-formatted any way you like, making tabs or spaces a non-issue at all.

Regardless of the results of the study, what’s more interesting is the method and tools used.  I’ve had my eye on the Google Big Query for a while now, but I’m too busy these days to give it a try.  The article gives a few insights, into how awesome the tool is.  1.6 terabytes of data processed in 864.6 seconds:

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].


Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.

If you enjoyed this article, also have a look at “Analyzing GitHub issues and comments with BigQuery“, which works with a similar-sized data, trying to figure out how to write bug reports and pull request comments, so that they would be acted upon faster.

PHP tags – once and for all. Yet again.

For those of us who have been using PHP since the early version 3 days and such, here is a modern day refresher for PHP tags:

If a file is pure PHP code, it is preferable to omit the PHP closing tag at the end of the file. This prevents accidental whitespace or new lines being added after the PHP closing tag, which may cause unwanted effects because PHP will start output buffering when there is no intention from the programmer to send any output at that point in the script.

And from the comment on the same page:

Since PHP 5.4 the inline echo <?= ?> short tags are always enabled regardless of the short_open_tag (php.ini) setting.

For me personally, closing PHP tags are a part of muscle memory.  I’ll have to unlearn that, I guess.  And in regards to the inline echo tags, I was under the impression that they are being phased out together with the other short tags (<? … ?>).  Apparently, I was wrong.  They are here to stay.  Which explains why there are in PSR-1, PSR-2, and in CakePHP 3 (which requires PHP 5.4.16 and fully adopts PSR-2) in particular.

GitHub error pages

I’ve praised GitHub many a time in posts on this blog and in numerous conversations over a pint.  Today, I found yet another reason to do so – GitHub error pages.  We’ve all seen a parallax 404 by now, right?

github 404

Today was the first time I looked into the source code of the page.  It greets one with the following words right under the HTML 5 Doctype definition:

Hello future GitHubber! I bet you’re here to remove those nasty inline styles,
DRY up these templates and make ’em nice and re-usable, right?

Please, don’t. https://github.com/styleguide/templates/2.0

The link provides even more goodness.  The list of other (all?) GitHub error templates is provided with explanation of which one fires when, as well as an insightful list of rules that GitHub uses for building error pages.  Have a look:

If you’re visiting from the internet, feel free to learn from our style. This is a guide we use for our own apps internally at GitHub. We encourage you to set up one that works for your own team.

Error pages should be built such that they require zero scripting, zero javascript, and zero dependency on anything whatsoever. That means static HTML with inline CSS and base64-encoded images.

The following are banned from every error page:

  • All <script> tags with an src attribute.
  • All JavaScript that loads external data.
  • All <link> tags.
  • All <img> tags with an src pointing to a URL.

It’s things like that that keep me coming back and looking for more web development elegance all around GitHub.

Enforcing coding styles in PHP

I came across a plugin for CakePHP which helps to check if the certain code follows CakePHP coding style.  While I haven’t tried it, I think the better way is to utilize CodeSniffer.  As per PHP_CodeSniffer PEAR page:

PHP_CodeSniffer tokenises PHP, JavaScript and CSS files and detects violations of a defined set of coding standards.

Which basically means that PHP_CodeSniffer is a generic tool for validating your code.  You can use for CakePHP, WordPress, or any other PHP project that you are working on.  The best part is that you can create your own set of rules regarding coding style and then make sure that your team follows it. If you don’t care that much for your own rules, then you can use one of the many existing rulesets.  Some of these come together with CodeSniffer package, others are available on the Web.

Setting up CodeSniffer for my team at work has been a long lasting TODO item, however it looks like I will be able to start working on this next week.  Once it created, tested, and everyone is happy with it, we’ll have it in the pre-commit hook in our Subversion repository.  This way, we will prevent commits of any code that does not follow our rules.  Of course, I plan to only run CodeSniffer against the code that we wrote in-house.  There is no need to re-format all the third-party code just for the sake of it.  Plus, we are rarely doing any modifications of the third-party code at all.