Validating CSV schema

CSV, or comma-separated values, is a very common format for managing all kinds of configurations, as well as for data manipulation.  As the linked Wikipedia page mentions, there are a few RFCs that try to standardize the format.  However, as far as I could tell, there is still no schema-type standard that would allow one to define the format of a particular file.

Today I came across an effort that attempts to do just that – CSV Schema Language v1.1 – an unofficial draft of a language for defining and validating CSV data.  This is a work in progress by the Digital Preservation team at The National Archives.

Apart from the unofficial draft of the language, there is also an Open Source CSV Validator v1.1 application, written in Scala.
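
To make the idea of a CSV schema more concrete, here is a minimal sketch in Python.  It is not the CSV Schema Language and has nothing to do with the CSV Validator application; the SCHEMA mapping, the column names, and the validate_csv function are all hypothetical, made up for illustration.  It only shows the general shape of the problem: declare a rule per column, then check every row against those rules.

```python
import csv
import io
import re

# Hypothetical schema: column name -> pattern the whole value must match.
# The columns and rules are invented for this example.
SCHEMA = {
    "name": re.compile(r".+"),
    "age": re.compile(r"\d{1,3}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def validate_csv(text, schema):
    """Return a list of (line_number, column, message); empty if valid."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    # The header must list exactly the schema's columns, in order.
    if header != list(schema):
        errors.append((1, None, f"expected header {list(schema)}, got {header}"))
        return errors
    for row in reader:
        for column, pattern in schema.items():
            value = row[column]  # None when the row is too short
            if value is None or not pattern.fullmatch(value):
                errors.append((reader.line_num, column, f"invalid value {value!r}"))
    return errors


if __name__ == "__main__":
    sample = "name,age,email\nAlice,30,alice@example.com\nBob,abc,bob@example.com\n"
    for line, column, message in validate_csv(sample, SCHEMA):
        print(f"line {line}, column {column}: {message}")
```

A real schema language would cover much more than per-column patterns, but even this much is enough to catch a malformed row before it reaches whatever consumes the file.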

Docker Image Vulnerability Research

Federacy has published some interesting research on Docker image vulnerabilities.  The bottom line is:

24% of latest Docker images have significant vulnerabilities

This can and should be improved, especially given the hierarchical structure of Docker images: fix a base image, and the images built on top of it can pick up the fix when they are rebuilt.  It’s not like improving the security of all those random GitHub repositories.

Why Configuration Management and Provisioning are Different

In “Why Configuration Management and Provisioning are Different” Carlos Nuñez advocates for the use of specialized infrastructure provisioning tools, like Terraform, Heat, and CloudFormation, instead of relying on the configuration management tools, like Ansible or Puppet.

I agree with his argument about rollbacks, but not so much with the ones about maintaining state and complexity.  However, I’m not yet comfortable putting my disagreement into words – my head is all over the place with clouds, and I’m still weak on the terminology.

The article is nice regardless, and made me look at the provisioning tools once again.

Living Without Atomic Clocks

“Living Without Atomic Clocks” is an interesting article that covers some design bits of distributed systems and CockroachDB (what a name!), especially those related to time precision.  This part in particular is the one I’m sure I’ll come back to at some point:

How does TrueTime provide linearizability?

OK, back to Spanner and TrueTime. It’s important to keep in mind that TrueTime does not guarantee perfectly synchronized clocks. Rather, TrueTime gives an upper bound for clock offsets between nodes in a cluster. Synchronization hardware helps minimize the upper bound. In Spanner’s case, Google mentions an upper bound of 7ms. That’s pretty tight; by contrast, using NTP for clock synchronization is likely to give somewhere between 100ms and 250ms.

So how does Spanner use TrueTime to provide linearizability given that there are still inaccuracies between clocks? It’s actually surprisingly simple. It waits. Before a node is allowed to report that a transaction has committed, it must wait 7ms. Because all clocks in the system are within 7ms of each other, waiting 7ms means that no subsequent transaction may commit at an earlier timestamp, even if the earlier transaction was committed on a node with a clock which was fast by the maximum 7ms. Pretty clever.
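
The waiting trick quoted above is easy to try out in a toy simulation.  The sketch below is my own illustration, not how Spanner or CockroachDB implement anything: SimulatedTime, Node, and the commit_wait flag are hypothetical names, and real time is replaced by a counter so the result is deterministic.  The only number taken from the article is the 7ms upper bound.

```python
from dataclasses import dataclass

MAX_OFFSET_MS = 7  # upper bound on clock offset between nodes, from the quote


class SimulatedTime:
    """Stand-in for real ("true") time, in milliseconds."""

    def __init__(self):
        self.now_ms = 0

    def advance(self, ms):
        self.now_ms += ms


@dataclass
class Node:
    true_time: SimulatedTime
    offset_ms: int  # how far this node's clock runs ahead of true time

    def clock_ms(self):
        return self.true_time.now_ms + self.offset_ms

    def commit(self, commit_wait=True):
        """Pick a commit timestamp, then optionally wait out the max offset."""
        timestamp = self.clock_ms()
        if commit_wait:
            # The commit is only reported after the wait has passed.
            self.true_time.advance(MAX_OFFSET_MS)
        return timestamp


def run(commit_wait):
    time = SimulatedTime()
    fast = Node(time, offset_ms=MAX_OFFSET_MS)  # clock fast by the maximum
    slow = Node(time, offset_ms=0)
    first = fast.commit(commit_wait)   # earlier transaction, on the fast node
    second = slow.commit(commit_wait)  # starts only after the first is reported
    return first, second


if __name__ == "__main__":
    print("with commit wait:   ", run(True))
    print("without commit wait:", run(False))
```

With the wait, the second transaction never gets a timestamp earlier than the first, even though it runs on the node with the slower clock.  Without the wait, the later transaction ends up with the earlier timestamp, which is exactly the ordering problem the quote describes.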

World’s Biggest Data Breaches

Here’s an interactive collection of the world’s biggest data breaches.  It goes back to 2004, when about 92,000,000 email addresses and screen names were stolen by an AOL employee, and covers most of the major events up to and including 2016.  There are a few ways to filter the data and change the representation.

Overall, it should give you a pretty good idea of how safe and secure your online data is.  Oh, and how private it is too.