Metabase – Open Source business intelligence and analytics

Metabase is an Open Source business intelligence and analytics tool.  It supports a variety of databases and services as sources for data, and provides a number of data querying and processing tools.  Have a look at the GitHub repository as well.

And if you want a few alternatives or complimenting tools, I found this list quite useful.

Registry of Open Data on AWS

AWS News Blog covers the Registry of Open Data on AWS:

Almost a decade ago, my colleague Deepak Singh introduced the AWS Public Datasets in his post Paging Researchers, Analysts, and Developers. I’m happy to report that Deepak is still an important part of the AWS team and that the Public Datasets program is still going strong!

Today we are announcing a new take on open and public data, the Registry of Open Data on AWS, or RODA. This registry includes existing Public Datasets and allows anyone to add their own datasets so that they can be accessed and analyzed on AWS.

Currently, there are 53 data sets in the registry.  Each provides a tonne of data.  Subjects vary from satellite imagery and weather monitoring to political and financial information.

Hopefully, this will grow and expand with time.

A Comparative Analysis of Top 6 BI and Data Visualization Tools in 2018

A Comparative Analysis of Top 6 BI and Data Visualization Tools in 2018 looks at 6 most popular business intelligence and data visualization tools and compares pros, cons, features, pricing and more.  The systems included in this review are:

 

Grakn and Graql – a database for AI

From the grakn.ai website:

Grakn is a distributed hyper-relational database for knowledge-oriented systems. Grakn enables machines to manage complex data that serves as a knowledge base for cognitive/AI systems.

Graql is Grakn’s reasoning (through OLTP) and analytics (through OLAP) query language. Graql is a much higher level abstraction over traditional query language – SQL, NoSQL, or Graphs.

Amazon Snowmobile – a truck with up to 100 Petabytes of storage

Back in my college days, I had a professor who frequently used Andrew Tanenbaum‘s quote in the networking class:

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

I guess he wasn’t the only one, as during this year’s Amazon re:Invent 2016 conference, the company announced, among other things, a AWS Snowmobile:

Moving large amounts of on-premises data to the cloud as part of a migration effort is still more challenging than it should be! Even with high-end connections, moving petabytes or exabytes of film vaults, financial records, satellite imagery, or scientific data across the Internet can take years or decades. On the business side, adding new networking or better connectivity to data centers that are scheduled to be decommissioned after a migration is expensive and hard to justify.

[…]

In order to meet the needs of these customers, we are launching Snowmobile today. This secure data truck stores up to 100 PB of data and can help you to move exabytes to AWS in a matter of weeks (you can get more than one if necessary). Designed to meet the needs of our customers in the financial services, media & entertainment, scientific, and other industries, Snowmobile attaches to your network and appears as a local, NFS-mounted volume. You can use your existing backup and archiving tools to fill it up with data destined for Amazon Simple Storage Service (S3) or Amazon Glacier.

Thanks to this VentureBeat page, we even have a picture of the monster:

aws-snowmobile

100 Petabytes on wheels!

I know, I know, it looks like a regular truck with a shipping container on it.  But I’m pretty sure it’s VERY different from the inside.  With all that storage, networking, power, and cooling needed, it would be awesome to take a pick into this thing.

 

 

400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Here is an interesting bit of research – do people prefer tabs or spaces when programming the most popular languages?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

The results are not very surprising and somewhat disappointing (for all of us, tab fans):

tabs vs. spaces

As far as PHP goes, I’m sure the choice of spaces has to do with the PSR-2 coding style guide, which states:

Code MUST use 4 spaces for indenting, not tabs.

On a more technical note, I think this is also related to the explosion of editors and IDEs in the recent years, which, as good as they are, aren’t as good as Vim.  Vim allows for a very flexible configuration, where your code can be formatted and re-formatted any way you like, making tabs or spaces a non-issue at all.

Regardless of the results of the study, what’s more interesting is the method and tools used.  I’ve had my eye on the Google Big Query for a while now, but I’m too busy these days to give it a try.  The article gives a few insights, into how awesome the tool is.  1.6 terabytes of data processed in 864.6 seconds:

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

and:

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.

If you enjoyed this article, also have a look at “Analyzing GitHub issues and comments with BigQuery“, which works with a similar-sized data, trying to figure out how to write bug reports and pull request comments, so that they would be acted upon faster.

Free Data Science Books

I came across a collection of free data science books:

Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.

Most notably, there are introductory books, handbooks, Hadoop guide, SQL books, social media data mining stuff, and d3 tips and tricks.  There’s also plenty on artificial intelligence and machine learning, but that’s too far out for me.

πfs – the data-free filesystem!

πfs – the data-free filesystem!

πfs is a revolutionary new file system that, instead of wasting space storing your data on your hard drive, stores your data in π! You’ll never run out of space again – π holds every file that could possibly exist! They said 100% compression was impossible? You’re looking at it!