Data Gravity

On the drive back home today I was listening to the DevOps Cafe podcast, episode 59.  I’ve recently subscribed to this show and I think this was the first episode of it I ever heard.  It’s one of many tech talk podcasts, where two or more people chat for a varied period of time on a selection of topics, mostly related to technology.

In this particular episode, program hosts John and Damon were interviewing the CTO of Basho, Dave McCrory.  I wasn’t familiar with either Basho or Dave prior to the episode.  Luckily, a somewhat lengthy introduction by Dave gave me a good idea of who he is.  What followed though was way more interesting – a discussion about data.

To be completely honest with you, I haven’t even finished the episode yet (got home right in the middle of it), but I feel like it’s one of those worth blogging about.  For one, I’ve learned a new term – “data lake”.  Apparently, that’s a new and fancy way of branding “data warehousing”.  Here is a bit from TechTarget, for example:

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.

The term data lake is often associated with Hadoop-oriented object storage.
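To make the "flat architecture" bit from the quote a little more concrete for myself, here is a minimal sketch of the idea: raw items kept as-is, each with a unique identifier and a set of tags, and a query that filters by tags.  The class name DataLake, the put and query methods, and the sample records are all made up for illustration; this is not how Hadoop or any real product does it.

```python
import uuid


class DataLake:
    """Toy flat store: each element gets a unique ID and a set of metadata tags."""

    def __init__(self):
        # One flat mapping of id -> (raw payload, tags); no folders, no hierarchy.
        self._items = {}

    def put(self, raw, tags):
        """Store raw data untouched and return its generated identifier."""
        item_id = str(uuid.uuid4())
        self._items[item_id] = (raw, frozenset(tags))
        return item_id

    def query(self, *wanted):
        """Return the raw payloads whose tags include every wanted tag."""
        wanted = set(wanted)
        return [raw for raw, tags in self._items.values() if wanted <= tags]


# Hypothetical sample data in different native formats.
lake = DataLake()
lake.put(b"2014-07-01,alice,42.50", {"csv", "sales", "2014"})
lake.put('{"user": "bob", "action": "login"}', {"json", "weblog", "2014"})
lake.put(b"2013-12-31,carol,10.00", {"csv", "sales", "2013"})

# A "business question" narrows the lake down to a smaller set to analyze.
sales_2014 = lake.query("sales", "2014")
```

The point of the sketch is only that nothing is transformed on the way in; the tags do all the work when a question finally comes up.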

But that was just the beginning.  What followed was a fascinating discussion on Data Gravity.  Obviously, this whole thing is too fresh in my mind and I can’t formulate it well yet, so I suggest you listen to the episode and read the intro on the Data Gravity site.  For the sake of brevity:

[…] it’s also a misleading term. Behind it all is the notion that data which is near other data is more useful, and the tendency of data to cling together comes from the usefulness of the resulting knowledge. […]

A lot of it seems obvious, but here it’s all put into a nice thought framework, with references to other, more established fields, like math and physics.  Easily one of the most interesting technology-related discussions I’ve heard in a while!

Extract, Transform, Load

I’ve been doing all kinds of data migrations and system integration for years now.  But only yesterday I learned that there is a very specific term linked to the process.

In computing, extract, transform, and load (ETL) refers to a process in database usage and especially in data warehousing that:

  • Extracts data from outside sources
  • Transforms it to fit operational needs, which can include quality levels
  • Loads it into the end target (database, more specifically, operational data store, data mart, or data warehouse)

ETL systems commonly integrate data from multiple applications, typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example a cost accounting system may combine data from payroll, sales and purchasing.
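The cost accounting example from the quote can be sketched in a few lines.  This is only an illustration of the three steps under my own assumptions: the extract, transform and load functions, the field names, the sample rows and the cost_accounting table are all invented, and an in-memory SQLite database stands in for the end target.

```python
import sqlite3


def extract():
    """Pull rows from three pretend source systems (all values are made up)."""
    payroll = [{"dept": "ops", "salary": "1000.00"}, {"dept": "dev", "salary": "1500.50"}]
    sales = [{"department": "DEV", "revenue": "4000"}, {"department": "OPS", "revenue": "2500"}]
    purchasing = [{"dept_code": "dev ", "spent": "300.25"}]
    return payroll, sales, purchasing


def transform(payroll, sales, purchasing):
    """Normalize department names and amounts, then total them per department."""
    totals = {}

    def add(dept, column, amount):
        key = dept.strip().lower()  # each source spells departments differently
        row = totals.setdefault(key, {"payroll": 0.0, "sales": 0.0, "purchasing": 0.0})
        row[column] += float(amount)

    for r in payroll:
        add(r["dept"], "payroll", r["salary"])
    for r in sales:
        add(r["department"], "sales", r["revenue"])
    for r in purchasing:
        add(r["dept_code"], "purchasing", r["spent"])

    return [(d, t["payroll"], t["sales"], t["purchasing"]) for d, t in sorted(totals.items())]


def load(rows, conn):
    """Write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cost_accounting "
        "(dept TEXT PRIMARY KEY, payroll REAL, sales REAL, purchasing REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO cost_accounting VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
result = conn.execute("SELECT * FROM cost_accounting ORDER BY dept").fetchall()
```

In real migrations the transform step is where most of the pain lives; here it is reduced to trimming, lowercasing and converting strings to numbers.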

A year without Google Reader

Mashable reminds us that it’s been a year since Google Reader was decommissioned.  They are also doing a survey to find out whether people use RSS feeds more or less now, what they’ve substituted it with, and which tools people are using now to follow their favorite feeds.

I’ve completed the survey, but without any visible results just yet, I thought I’d talk about my situation here.  In the last year my use of RSS has decreased significantly.  Even though the actual number of feeds I am subscribed to has increased, I read them less.  I share less.  I bookmark and blog about less.  And it’s nothing but the tool’s fault.  Even though Feedly is an excellent tool – fast, flexible, with mobile support, and aesthetically pleasing, it simply is not Google Reader, which I was practically embedded in.  I’ve looked around for Google Reader alternatives, and I tried a few.  Feedly is the best of the bunch for my taste, but it’s different.

So, with that in mind, what happened to all that free time that I used to spend in Google Reader?  Sadly, I have to admit that I’m much more on Facebook now.  Quality-wise, that’s a huge drop.  Instead of following my favorite writers, keeping in touch with all kinds of technology advances, and learning new things, I am now participating in flaming comment wars about nothing, and watching videos of cute kittens and bouncing boobs.  Cheap entertainment swallowed me and spat me out.  It’s exactly like never switching off the television set was in the last century.  And it’s a pity.

And the saddest part is that I knew it would happen.  And if I knew, Google definitely knew that too.  And they killed Google Reader anyway.  And it’ll be a long time until I let it go…

Cayley – an open-source graph

Cayley – an open-source graph inspired by the graph database behind Freebase and Google’s Knowledge Graph. Its goal is to be a part of the developer’s toolbox where Linked Data and graph-shaped data (semantic webs, social networks, etc) in general are concerned.


Aggregating feeds isn’t all that simple

As I mentioned a few times, one of my first start-up ideas was an RSS aggregator.  It was back in 2005 or so, before Google Reader was even alive.  Bloglines was the coolest tool, if I remember correctly, and it sucked badly.  I got together with a few friends of mine and we started coding.  It was an interesting challenge both technically and aesthetically.  But we got it to the point where it actually worked and wasn’t all too bad.  It was a weird mixture of Python, Perl, and PHP though.

Eventually, it became too much work.  We couldn’t figure out how to monetize the thing.  And Google Reader was announced.  That sort of killed the project.

A few months back, when the announcement of Google Reader’s end of life came out, I looked at the alternatives and wasn’t pleased.  I thought with all the technical advances in the last few years, and with my own improved knowledge, I could attempt the task again.  Yes, I know, I am a hopeless optimist in a lot of matters.

At least this time it took just a few days to convince me not to pursue the goal.  Alternatives are plentiful.  Each and every one of them is light years ahead.  I still don’t enjoy front-end development.  And I still have no clue as to how to monetize it.  So, the Subs Reader got frozen.  At least I got it all in frameworks, and left it in the Open Source state.  If I ever have another try, I can pick up from here.

One of the biggest mistakes I made last time was not documenting the project’s process at all.  I vaguely remember that I didn’t sleep for a few nights, trying to figure out all kinds of problems.  But what they were, I don’t remember.

Today, I came across a blog post which lists problems similar to those I had to solve, but in greater number and variety.  Even if you aren’t thinking about writing your own RSS reader any time soon (or ever), you should still read through Brian’s stupid feed tricks.  First of all, they clearly illustrate how much complexity is hiding in the details.  Secondly, they show how non-standard the web is in general and RSS in particular.  If you do any kind of web crawling, you’d probably see half of the same issues in your application.  Thirdly, even if you aren’t crawling the web at all, but just code a web application or an API to one, you’ll find many places where you can go wrong without noticing it.  All in all, it’s a great list of problems that everybody involved in web development can learn from.

Feeding on friends with FriendFeed.com

One of the things that people on the web do is follow each other: reading blog posts, watching favorite video clips, staring at shared photos, replying to comments, getting status updates, and so on and so forth.

In the previous years, the number of people who were online was much smaller.  And they weren’t publishing as much as they do now.  Everyone and their dog has a blog.  Pictures and videos are flying around.  Playlists and favorite songs are shared.  Micro-blogging is blossoming.  How can anyone follow all that?  Well, RSS, of course, is one of the common answers.

But RSS has its share of problems.  It is still too technical to be used by many people.  Good tools are few.  And grouping things around people isn’t much fun yet.  Also, feed discovery is still an issue (from a person’s point of view, not the aggregator’s point of view).

The FriendFeed.com web service recently went public and solved a few problems.  It starts off with feed discovery.  When you register and log in, you can easily specify all the places that you publish at – blog, Flickr photostream, del.icio.us bookmarks, LinkedIn profile, Twitter, and so on and so forth.  This way, when somebody is interested in following you, he or she will just need to subscribe to you once and get all the stuff from everywhere you publish.  This is cool.


Another problem that FriendFeed solves is the problem of virtual people.  In social networks, you often can’t follow a person who hasn’t registered yet.  You can invite them in, wait for them to join, and then be notified when they have joined.  But it is often impossible to follow people who decided not to join the network.  In FriendFeed, you can create “imaginary friends”.  This way, you can group people and sources in any way you like best.  This is priceless.

For example, you can create an imaginary friend for a person who hasn’t registered, and you can assign a blog and a Flickr photostream to him.  Or, you can create an imaginary friend for a real person, who even registered, but who publishes so much that you can’t take it.  Instead of following all of their stuff, you just pick things that you are interested in (say Twitter messages and blog, but not Flickr and YouTube) and link those to your imaginary friend.

With this functionality, following topics or events becomes extremely easy.  If you are interested in kebab cooking, or in Cyprus switching to the Euro, or anything else for that matter, you can create an imaginary friend for the topic and assign it blogs, Google Reader shared items, Picasa photos, or whatever else is supported.  There is a lot of potential in here.
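For the technically curious, the imaginary friend idea boils down to a named group of sources whose entries get merged into one stream.  Here is a rough sketch of that data structure as I picture it; the ImaginaryFriend class, the merged_stream function, the example.org URLs and the fake feed entries are all my own inventions, and I have no idea how FriendFeed actually implements this.

```python
from dataclasses import dataclass, field


@dataclass
class ImaginaryFriend:
    """A label (person, topic or event) with whatever sources you assign to it."""
    name: str
    sources: dict = field(default_factory=dict)  # service name -> feed URL

    def assign(self, service, url):
        self.sources[service] = url


def merged_stream(friend, fetch):
    """Collect entries from every assigned source, newest first.

    `fetch` is any callable that takes a feed URL and returns
    (published_date, title) pairs; a real one would download and parse the feed.
    """
    entries = []
    for service, url in friend.sources.items():
        for published, title in fetch(url):
            entries.append((published, service, title))
    return sorted(entries, reverse=True)


# Hypothetical feeds standing in for real network requests.
FAKE_FEEDS = {
    "https://example.org/kebab-blog/feed": [("2008-01-03", "Marinade basics")],
    "https://example.org/kebab-photos/feed": [
        ("2008-01-05", "Grill photo"),
        ("2008-01-01", "Skewers"),
    ],
}


def fake_fetch(url):
    return FAKE_FEEDS[url]


kebab = ImaginaryFriend("Kebab cooking")
kebab.assign("blog", "https://example.org/kebab-blog/feed")
kebab.assign("photos", "https://example.org/kebab-photos/feed")
stream = merged_stream(kebab, fake_fetch)
```

Picking only some of a real person’s sources is then just a matter of which URLs you assign.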

Another thing that FriendFeed does right is presentation of data.  There are links to original sources whenever possible, and there are thumbnails for whatever possible.  Also, people have avatars, which makes it very easy to distinguish who is who and who published what.

And if all that wasn’t enough, you can subscribe to updates via email.  Which means that you can really improve your productivity while still following a whole lot of sources.  No need to run around the web looking for updates.  No need to interrupt your work flow to see if there is a reply to your comment.  You just get used to catching up on all the updates once a day in a brief but nice-looking digest form, and that’s it!

FriendFeed is a really nice service which a lot of people were waiting for and which they will appreciate now that it is finally here.  Oh, and just in case, here is the link to my FriendFeed profile.

Google Reader and Google Talk integrated. Sort of.

Google Reader has been recently integrated with Google Talk.  Somewhat.  If you use Google Reader and Google Talk, and you have some buddies in your Google Talk contact list, who also use Google Reader, then from now on you will be able to see each other’s shared items.  Through the “Settings”, you can control who you want and don’t want to see in the “Friends’ shared items”.

This is a really nice piece of functionality.  First of all, it saves you all the effort of finding and subscribing to “Shared items” RSS feeds of all your friends one by one.  Secondly, it helps to highlight interesting stuff from your buddies, even those that you might have accidentally omitted from your subscriptions.

So, what am I missing there?  Two things.

First, the option to rename buddies.  I am blessed with contacts who choose all sorts of nicknames and avatars.  I prefer real names.  And I attach real face pictures to all my contacts whenever I can.  And I’ve done it in my Gmail contacts.  That information should be used for the Google Reader friends list.

Secondly, I need an option to enter a discussion with my friends regarding an item in my Google Reader.  That can be something I have shared, or that can be something my friends shared.  I want “discuss in chat” and “discuss in email” buttons.  “Discuss in email” should be, in this case, different from “Email this item”.  We both (me, and the friend with whom I’m entering a discussion) have read the item.  We just need a reference, like a subject, and a URL to the item (original article?), just in case we need to run through it again or quote something.

While the second point is harder to implement (requires user studies, interface cluttering, etc), I’m really surprised that the first one wasn’t done.