Querying CSV with SQL

Excel is not the only tool available when it comes working with CSV files.  I have previously mentioned TextQL (here).  Yesterday, I tried another tool, called “q”.  But since searching for “q” is not very effective, it’s also known “q text as data“.

For those using Fedora, you can install it by simply running “dnf install q-text-as-data“.   Here’s an example of how it works:

$ q-text-as-data -H -d ',' "SELECT COUNT(DISTINCT(Project)) FROM deploy.csv"

In the above example, I’m querying the deploy.csv file, which is in the current folder. q supports both command and tab separated values, so I’m helping it out with the “-d ‘,’” parameter, saying that this particular one is a CSV. “-H” tells q that the first row in this file is used for headers. CSV files with headers are more convenient, as you can use headers as column names, instead of numerical indexes.

Validating CSV schema

CSV, or comma-separated values, is a very common format for managing all kinds of configurations, as well data manipulation.  As the linked Wikipedia page mentions, there are a few RFCs that try to standardize the format.  However, I thought, there is still a lack of schema-type standard that would allow one to define a format for particular file.

Today I came across an effort that attempts to do just that – CSV Schema Language v1.1 – an unofficial draft of the language for defining and validating CSV data.  This is work in progress by the Digital Preservation team at The National Archives.

Apart from the unofficial draft of the language, there is also an Open Source CSV Validator v1.1 application, written in Scala.