Free Data Science Books

I came across a collection of free data science books:

Pulled from the web, here is a great collection of eBooks (most of which have a physical version that you can purchase on Amazon) written on the topics of Data Science, Business Analytics, Data Mining, Big Data, Machine Learning, Algorithms, Data Science Tools, and Programming Languages for Data Science.

Most notably, there are introductory books, handbooks, Hadoop guide, SQL books, social media data mining stuff, and d3 tips and tricks.  There’s also plenty on artificial intelligence and machine learning, but that’s too far out for me.

Cat’s name as a password? Obsolete!

Remember the good old joke about choosing a secure password and not using your cat’s name as one, even though it is MySup3rKittyCat13*?  Well, that is now obsolete – a better way is to name your kids with SQL injection names.  Xkcd has an example:

exploits_of_a_mom

 

And if you think that this is all just a joke, I assure you it’s not.  Here is a serious question at Stack Overflow that starts with:

We have an employee whose last name is Null. He kills our employee lookup application when his last name is used as the search term (which happens to be quite often now).

True story, apparently.

BLOB is bad for your (mental) health

If you ever mention that your web application uses database to store files, you risk being flamed into oblivion.  Indeed, in most cases, it is a bad idea, since file system is more effecient when it comes to files.  However, there are cases when it makes sense to have files saved in the database.

Maybe I am doing something wrong, but in the last six month, I had to develop at least three systems that used MySQL for file storage (uploaded files that have to be synchronized across several hosts, etc).  Yesterday, for the third time I stumbled across the same problem, that almost drove me insane.

MySQL has four data types for storing binary data – TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB.  Somehow I always forget about these and use BLOB.  BLOB works just fine, but it has a limit on size, which is rather low – 64 KBytes.  The mean thing here is that it will work just fine with most of the test data – text files, short PDFs, and small pictures.  Once the application is tested and put into production, the corrupted files will start coming in.  Re-writing all parts that deal with uploading, moving, cleaning, escaping, and encrypting binary data takes time.  Going through file reading and writing routines is boring too, and it won’t help either.

By the time, the issue is discovered and all fields are changed to LONGBLOB, it is often very late, and you’ve lost your weekend, as well as a lot of large files. This post is an attempt to save my (and your) sanity.

Reminder: use LONGBLOB instead of BLOB for file storage, unless you are absolutely sure about the maximum size of incoming data.

Portability and flexibility win over performance

I noticed this ticket in WordPress TracChange enum to varchar and went in to see if there is any heated discussion.  The issue is around field types used in SQL scheme for WordPress tables.  Certain fields, such post status employed ENUM type with a set of allowed values.  The proposed change in the ticket is to convert them to VARCHAR type.

Why the change?  Well, VARCHAR is just a text field.  Anyone can put pretty much any string into it.  It has more flexibility for plugin developers and future changes – no need to tweak the SQL scheme.  ENUM on the other hand works a little bit faster.

Side note: I also thought that ENUM provides some extra data validation, assuming the ENUM field is set to NOT NULL, but it turns out this is not the case.  If you insert a record with a value which is not in the list, the NULL is used. 

The change has been approved, the patch was attached, and the world will see it in the next WordPress release.  Once again, it has been proven that human time is much more valuable than machine time.  Making it easier for plugin developers to extend and change the system has more value than that of a few extra CPU cycles to lookup in strings instead of numbers.

The future of SQL

Slashdot lets us know that Google contributes code to MySQL.  Among the comments to that post, there is this one, which is while being rather funny holds some truth to it:

They need to add a GOOGLE function to allow queries to be searched nicer.

SELECT * FROM articles WHERE GOOGLE(‘boobies’);

something similar might be available but it is a PITA to list the fields to search and specify the operators etc

I think here lies the future of SQL…