Unicode

Zero-Width Characters

This article shows a couple of interesting zero-width characters techniques for the invisible fingeprinting of text.

In early 2016 I realized that it was possible to use zero-width characters, like zero-width non-joiner or other zero-width characters like the zero-width space to fingerprint text. Even with just a single type of zero-width character the presence or non-presence of the non-visible character is enough bits to fingerprint even the shortest text.

[…]

I also realized that it is possible to use homoglyph substitution (e.g., replacing the letter “a” with its Cyrillic counterpart, “а”), but I dismissed this as too easy to detect due to the differences in character rendering across fonts and systems. However, differences in dashes (en, em, and hyphens), quotes (straight vs curly), word spelling (color vs colour), and the number of spaces after sentence endings could probably go undetected due to their frequent use in real text.

[…]

The reason I’m writing about this now is that it appears both homoglyph substitution and zero-width fingerprintinghave been discovered by others, so journalists should be informed of the existence of these techniques.

MySQL, PHP and “Integrity constraint violation: 1062 Duplicate entry”

Anna Filina blogs about an interesting problem she encountered with when working on a PHP and MySQL project:

MySQL was complaining about “Integrity constraint violation: 1062 Duplicate entry”. I had all the necessary safeguards in my code to prevent duplicates in tha column.

I gave up on logic and simply dumped the contents of the problematic column for every record. I found that there was a record with and without an accent on one of the characters. PHP saw each as a unique value, but MySQL did not make a distinction, which is why it complained about a duplicate value. It’s a good thing too, because based on my goal, these should have been treated as duplicates.

She also mentions two possible solutions to the problem:

My solution was to substitute accented characters before filtering duplicates in the code. This way, similar records were rejected before they were sent to the database.

and

As pointed out in the comments, a more robust and versatile solution would be to check the collation on the column.

I’m sure this will come in handy one day.

When monospace fonts aren’t: The Unicode character width nightmare

I don’t deal with Unicode and other character encoding on the daily basis, but when I do, I need every piece of information that has been written on the subject. Hence the link to this interesting issue :

As long as you stick to precomposed Unicode characters, and Western scripts, things are relatively straightforward. Whether it’s A or Å, S or Š – so long as there are no combining marks, you can count a single Unicode code point as one character width. So the following works:
	aeioucsz
	áéíóúčšž
Nice and neat, right?

Unfortunately, problems appear with Asian characters. When displayed in monospace, many Asian characters occupy two character widths.

ftfy – fixes text for you

ftfy makes Unicode text less broken and more consistent. It works in Python 2.7, Python 3.2, or later.

The most interesting kind of brokenness that this resolves is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences

PHP regular expression to match English/Latin characters only

Today at work I came across a task which turned out to be much easier and simpler than I originally thought it would. We have have a site with some user registration forms. The site is translated into a number of languages, but due to the regulatory procedures, we have to force users to input their registration details in English only. Using Latin characters, numbers, and punctuation.

I’ve refreshed my knowledge of Unicode and PCRE. And then I came up with the following method which seems to do the job just fine.

/**
 * Check that given string only uses Latin characters, digits, and punctuation
 *
 * @param string $string String to validate
 * @return boolean True if Latin only, false otherwise
 */
public function validateLatin($string) {
    $result = false;

    if (preg_match("/^[\w\d\s.,-]*$/", $string)) {
        $result = true;
    }

    return $result;
}

In other words, just a standard regular expression with no Unicode trickery. The ‘/u’ modifier would cause this to totally malfunction and match everything. Good to know.