regular expressions

Parsing: a timeline

“Parsing: a timeline” is a historical timeline of parsing, as done by computers and computer programming languages. It starts well before computers were actually invented, back when people first began thinking about what a language is, what it consists of, and how it works.

Even though this article is mostly aimed at technical people, I’m sure pretty much anyone will find interesting bits in there, as some of the names and works mentioned are well known outside of technical industries. For techies, you’ll find all your favorite names in there – Markov, Turing, Boehm, Chomsky, Knuth, Dijkstra, Wall, and more.

Regex101 – online regex editor and debugger

Regex101 is an online regular expression editor and debugger. You can test your regular expressions against sample data, see whether the expression worked, watch what it matched, and so on. Having a dynamically generated explanation for each part of the regular expression, plus a quick reference nearby, is super handy too.

Update (November 7, 2018): Here’s another Regex Tester.

Parsing text printouts within Ansible playbooks

I’m sure this will come in handy soon, and without this article I’d be spending too much time trying to figure it out: Parsing text printouts within Ansible playbooks.

It’s not every day that you see regular expression examples in the Ansible playbooks…

The RegEx that killed StackOverflow

Here’s an outage postmortem from the recent StackOverflow downtime. It just shows you how easy it is to break things, even when they were built by some of the smartest people around. Programming is tough and there is no way around it.

Technical Details

The regular expression was: ^[\s\u200c]+|[\s\u200c]+$ Which is intended to trim Unicode space from start and end of a line. A simplified version of the Regex that exposes the same issue would be \s+$ which to a human looks easy (“all the spaces at the end of the string”), but which means quite some work for a simple backtracking Regex engine. The malformed post contained roughly 20,000 consecutive characters of whitespace on a comment line that started with -- play happy sound for player to enjoy. For us, the sound was not happy.

If the string to be matched against contains 20,000 space characters in a row, but not at the end, then the Regex engine will start at the first space, check that it belongs to the \s character class, move to the second space, make the same check, etc. After the 20,000th space, there is a different character, but the Regex engine expected a space or the end of the string. Realizing it cannot match like this it backtracks, and tries matching \s+$ starting from the second space, checking 19,999 characters. The match fails again, and it backtracks to start at the third space, etc.
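To make the retry pattern described above concrete, here is a small Python sketch that counts how many characters the \s+ part consumes across all match attempts. It is a simplified model, not how any particular regex engine is implemented: the function name scanned_chars is made up for illustration, and str.isspace() is used as a stand-in for the \s character class.

```python
def scanned_chars(text: str) -> int:
    # Simplified model of matching the pattern \s+$ with a naive engine:
    # try every start position, greedily consume whitespace, then check
    # whether the run reached the end of the string.
    total = 0
    for start in range(len(text)):
        pos = start
        # Greedy \s+ : str.isspace() approximates the \s class here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        total += pos - start
        if pos > start and pos == len(text):
            break  # the whitespace run ends the string, so $ matches
    return total


# 1,000 spaces followed by a non-space character: no match is possible,
# so every start position is tried and the work adds up to n * (n + 1) / 2.
n = 1000
print(scanned_chars(" " * n + "x"))  # 500500
print(n * (n + 1) // 2)              # 500500
```

Doubling the number of spaces roughly quadruples the count, which is the quadratic behavior the postmortem is describing. When the whitespace really is at the end of the string, the first attempt succeeds and the cost stays linear.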

So the Regex engine has to perform a “character belongs to a certain character class” check (plus some additional things) 20,000+19,999+19,998+…+3+2+1 = 200,010,000 times, and that takes a while. This is not classic catastrophic backtracking (talk on backtracking) (performance is O(n²), not exponential, in length), but it was enough. This regular expression has been replaced with a substring function.
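The postmortem does not show the substring function that replaced the regular expression, so the following is only a guess at the general shape of such a fix, written in Python. The names trim and is_trimmable are hypothetical; the set of characters trimmed (whitespace plus U+200C) is taken from the original pattern, with str.isspace() standing in for \s.

```python
def is_trimmable(ch: str) -> bool:
    # Whitespace (approximating \s) or U+200C, as in the original pattern.
    return ch.isspace() or ch == "\u200c"


def trim(line: str) -> str:
    # Walk inward from both ends; each character is examined at most once,
    # so the cost is linear in the length of the line.
    start = 0
    end = len(line)
    while start < end and is_trimmable(line[start]):
        start += 1
    while end > start and is_trimmable(line[end - 1]):
        end -= 1
    return line[start:end]


print(repr(trim("  \u200c hello world \u200c ")))  # 'hello world'
```

A long run of whitespace in the middle of the line is never touched by this approach, because the scan from the right stops at the first non-trimmable character.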

Mail::RFC822::Address: regexp-based address validation

This is pure gold! Check out the regular expression for an RFC822 email address validation. I’m not going to paste it here, being concerned that it will open the gates of hell or something, but here is a sneak preview of about the first third or so.