research

Largest digital survey of the sky mapped billions of stars

An international team of astronomers have released two petabytes of data from the Pan-STARRS project that’s also known as the “world’s largest digital sky survey.” Two petabytes of data, according to the team, is equivalent to any of the following: a billion selfies, one hundred Wikipedias or 40 million four-drawer filing cabinets filled with single-spaced text. The scientists spent four years observing three-fourths of the night sky through their 1.8 meter telescope at Haleakala Observatories on Maui, Hawaii, scanning three billion objects in the Milky Way 12 times in five different filters. Those objects included stars, galaxies, asteroids and other celestial bodies.

Wow … this is mind blowing at the very least …

See the image above? That’s the result of half a million 45-second exposures taken over four years. They’re releasing even more detailed images and data in 2017 — for now, you can check out what the team released to the public on the official Pan-STARRS website.

Every pub in the United Kingdom

This Reddit thread shares the map of all the pubs in the UK. The Poke picked it up and wrapped it into some more links and quotes. Apparently, not even all the pubs are covered:

“Nope. There’s at least 12 pubs missing from the north coast of Scotland. Thurso alone has more than 6, 2 in Bettyhill, Tongue and Melvich plus a few others all missing”, writes shaidy64

The source of the map is here referencing 24,727 UK pubs. And I’ve only been to like, what, 3? This situation urgently needs correction.

WordPress: Preferred Languages Research

Pascal Birchler of WordPress blogs about some interesting research he did on handling preferred languages, and on how different systems – ranging from browsers, wikis, and social networks to all kinds of content management systems – approach and solve the problem.

Drupal

Drupal 8 has a rather powerful user interface text language detection mechanism. There is a per session, per user and per browser option in the detection settings. However, users can only choose one language, so they cannot say (in core at least) that they want German primarily and Spanish if German is not available. But the language selected by the user is part of the larger fallback system, so it may fall back further down to other options.

The Language fallback module allows defining one fallback for a language, while the Language Hierarchy module provides a GUI to change the language fallback system. It allows setting up language hierarchies where translations of a site’s content, settings and interface can fall back to parent language translations, without ever falling back to English. This module might be the most interesting one for our research.

Apart from the research itself, I think this is an interesting example of how complex some seemingly simple features are.
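
To make the fallback idea a bit more concrete, here is a minimal Python sketch. It is not Drupal's code or API: the function name, the `parents` mapping and the sample language codes are all made up for illustration. It only assumes that each language may have one parent language, and that a user may list several preferred languages in order.

```python
from typing import Dict, List, Optional, Tuple


def resolve_translation(
    translations: Dict[str, str],
    preferred: List[str],
    parents: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (langcode, text) for the first usable translation, or None.

    Hypothetical helper: for each preferred language, in order, walk up
    its chain of parent languages until a translation is found.
    """
    seen = set()  # guards against cycles in the parents mapping
    for lang in preferred:
        current: Optional[str] = lang
        while current is not None and current not in seen:
            seen.add(current)
            if current in translations:
                return current, translations[current]
            current = parents.get(current)
    return None


# Illustrative data only: Austrian German falls back to German,
# Mexican Spanish falls back to Spanish.
parents = {"de-AT": "de", "es-MX": "es"}
greeting = {"de": "Hallo", "es": "Hola"}

print(resolve_translation(greeting, ["de-AT", "es-MX"], parents))
```

With both German and Spanish available, the user who prefers `de-AT` gets the German text; remove the German entry and the same call falls through to Spanish. Even this toy version needs a cycle guard and a decision about what to return when nothing matches, which is pretty much the point about "simple" features.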

Yet another bit on security

Here are a couple of interesting articles from the last few days on Slashdot.

First, there is a rather unsurprising survey saying that “40 percent of organizations store admin passwords in Word documents”. Judging from my personal experience in different companies, I’d say this number is much higher if you count Excel spreadsheets and plain text files along with Word documents. I think pretty much every single company I’ve worked at used such common files for admin password storage (at least at some point).

“Why oh why?!!!”, the security-conscious among you might scream. Well, I think there are two reasons for this. The first one is that password management is complicated. There are tools that help with this, but even those are rarely easy to use. Storing the passwords in secure, encrypted storage is one thing. But how do you share them with just the right people? How do you trust the tool? What happens if the file gets corrupted, the software updates, the license expires, or the master password is lost? The risk of losing admin access to all your equipment and accounts is scary. On top of that, there is the issue of changing passwords (especially when people leave the company) – not a simple job if you have a variety of accounts (hardware, software, services, etc.), a lot of people with varying degrees of access, and automation scripts that need access to perform large-scale operations. Personally, I don’t think this problem has been solved yet.

The second reason is in this other Slashdot post – “Sad Reality: It’s Cheaper To Get Hacked Than Build Strong IT Defenses”. This is very true as well. A simple firewall and a strong password policy are often more than enough for many organizations. The risks of compromise are low. In those cases where it does happen, you’d often get some script kiddie consequence like a Bitcoin mining app or affiliate links spread across your website. Both are quite easy to detect and fix. Is it worth investing hundreds of thousands in equipment and personnel to prevent this? For many companies it is not.

The fact of the matter is that a lot of people don’t really care about security or privacy on the personal level, and that then translates into the organizational mentality as well.

Just think about people living in all those high-crime areas. Some of them think the risk is worth it – maybe they can make more money there or have a more exciting life. Some of them simply can’t afford to move anywhere. That’s very similar to digital security, I think. Some don’t care and prefer to run the risk, saving the money on protection. Some simply can’t afford to have a decent level of security.

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Here is an interesting bit of research – do people prefer tabs or spaces when programming the most popular languages?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.
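
Just to illustrate the kind of check involved, here is a small Python sketch that classifies a single file by how its indented lines start. This is my own simplification, not the query from the article: the function name and the majority-vote rule are assumptions, and the real analysis runs as SQL over a billion files rather than one string at a time.

```python
from collections import Counter


def classify_indentation(source: str) -> str:
    """Label a file "tabs", "spaces" or "undecided".

    Hypothetical rule: count the lines that begin with a tab versus the
    lines that begin with a space, and let the majority win.
    """
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
    if counts["tabs"] > counts["spaces"]:
        return "tabs"
    if counts["spaces"] > counts["tabs"]:
        return "spaces"
    return "undecided"  # no indented lines, or an exact tie


sample = "def f():\n\treturn 1\n"
print(classify_indentation(sample))
```

A file with no indented lines at all, or an exact tie, comes out as "undecided" here; any real study has to decide what to do with those.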

The results are not very surprising and somewhat disappointing (for all of us tab fans):

As far as PHP goes, I’m sure the choice of spaces has to do with the PSR-2 coding style guide, which states:

Code MUST use 4 spaces for indenting, not tabs.

On a more technical note, I think this is also related to the explosion of editors and IDEs in recent years, which, as good as they are, aren’t as good as Vim. Vim allows for a very flexible configuration, where your code can be formatted and re-formatted any way you like, making tabs or spaces a non-issue.

Regardless of the results of the study, what’s more interesting is the method and tools used. I’ve had my eye on Google BigQuery for a while now, but I’m too busy these days to give it a try. The article gives a few insights into how awesome the tool is. 1.6 terabytes of data processed in 864.6 seconds:

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

and:

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.
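
For a rough sense of scale, the two quoted figures can be turned into throughput numbers. The Python sketch below is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 1000 GB), and the helper name is mine.

```python
def throughput_gb_per_s(gigabytes: float, seconds: float) -> float:
    """Average processing rate in gigabytes per second."""
    return gigabytes / seconds


# Figures quoted from the article; decimal units are an assumption.
join_rate = throughput_gb_per_s(1.6 * 1000, 864.6)  # the big join
scan_rate = throughput_gb_per_s(133, 16)            # the line-by-line scan

print(f"join: {join_rate:.2f} GB/s, scan: {scan_rate:.2f} GB/s")
```

That works out to a bit under 2 GB per second for the join and a bit over 8 GB per second for the scan.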

If you enjoyed this article, also have a look at “Analyzing GitHub issues and comments with BigQuery”, which works with a similarly sized data set, trying to figure out how to write bug reports and pull request comments so that they get acted upon faster.