MySQL, PHP and “Integrity constraint violation: 1062 Duplicate entry”

Anna Filina blogs about an interesting problem she encountered while working on a PHP and MySQL project:

MySQL was complaining about “Integrity constraint violation: 1062 Duplicate entry”. I had all the necessary safeguards in my code to prevent duplicates in that column.

I gave up on logic and simply dumped the contents of the problematic column for every record. I found two records that differed only by an accent on one of the characters. PHP saw each as a unique value, but MySQL did not make a distinction, which is why it complained about a duplicate value. It’s a good thing too, because based on my goal, these should have been treated as duplicates.

She also mentions two possible solutions to the problem:

My solution was to substitute accented characters before filtering duplicates in the code. This way, similar records were rejected before they were sent to the database.
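Her first approach, substituting accented characters before deduplicating, can be sketched in Python with the standard library (the original code was PHP; `strip_accents` is my own illustrative helper):

```python
import unicodedata

def strip_accents(s):
    # Decompose precomposed characters (NFD), then drop the combining
    # marks. A sketch of the "substitute accented characters" idea;
    # the helper name is my own.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Two values an accent-insensitive MySQL collation would treat as equal:
assert strip_accents("café") == strip_accents("cafe")
```

With duplicates filtered on the stripped form, the “equal under the collation” pairs never reach the database in the first place.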


As pointed out in the comments, a more robust and versatile solution would be to check the collation on the column.

I’m sure this will come in handy one day.

When monospace fonts aren’t: The Unicode character width nightmare

I don’t deal with Unicode and other character encodings on a daily basis, but when I do, I need every piece of information that has been written on the subject. Hence the link to this interesting issue:

As long as you stick to precomposed Unicode characters, and Western scripts, things are relatively straightforward. Whether it’s A or Å, S or Š – so long as there are no combining marks, you can count a single Unicode code point as one character width. So the following works:
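A minimal Python illustration of that rule, with the combining-marks caveat made concrete (my own example, not from the linked post):

```python
precomposed = "\u00C5"   # 'Å' as a single precomposed code point
combined = "A\u030A"     # 'A' plus U+030A combining ring above

assert len(precomposed) == 1   # one code point, one column
assert len(combined) == 2      # two code points, still one column on screen
```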


Nice and neat, right?

Unfortunately, problems appear with Asian characters. When displayed in monospace, many Asian characters occupy two character widths.
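Python’s `unicodedata` module exposes this as the East Asian Width property; the `display_width` helper below is my own rough sketch of a column count, not a complete solution:

```python
import unicodedata

# Most CJK characters report 'W' (wide) and take two columns;
# ASCII reports 'Na' (narrow) and takes one.
assert unicodedata.east_asian_width("漢") == "W"
assert unicodedata.east_asian_width("A") == "Na"

def display_width(s):
    # Count wide ('W') and fullwidth ('F') characters as two columns
    # (a simplification: combining marks and ambiguous widths ignored).
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in s)

assert display_width("漢字") == 4
```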

ftfy – fixes text for you

ftfy makes Unicode text less broken and more consistent. It works in Python 2.7, Python 3.2, or later.

The most interesting kind of brokenness that this resolves is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences.
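The classic case is UTF-8 bytes decoded as Latin-1. A stdlib-only sketch of that round trip, done and undone by hand (ftfy’s value is detecting and reversing this automatically; the snippet is my own illustration):

```python
# Mojibake in one line: UTF-8 bytes mistakenly decoded as Latin-1
good = "café"
mojibake = good.encode("utf-8").decode("latin-1")   # 'cafÃ©'

# Reversing the mistaken round trip recovers the original text
fixed = mojibake.encode("latin-1").decode("utf-8")
assert fixed == good
```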

PHP regular expression to match English/Latin characters only

Today at work I came across a task which turned out to be much easier and simpler than I originally thought it would be. We have a site with some user registration forms. The site is translated into a number of languages, but due to regulatory procedures, we have to force users to input their registration details in English only, using Latin characters, numbers, and punctuation.

I’ve refreshed my knowledge of Unicode and PCRE, and then came up with the following method, which seems to do the job just fine.

/**
 * Check that given string only uses Latin characters, digits, and punctuation
 *
 * @param string $string String to validate
 * @return boolean True if Latin only, false otherwise
 */
public function validateLatin($string) {
    $result = false;

    if (preg_match("/^[\w\d\s.,-]*$/", $string)) {
        $result = true;
    }

    return $result;
}
In other words, just a standard regular expression with no Unicode trickery.  The ‘/u’ modifier would cause this to totally malfunction and match everything.  Good to know.
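Worth noting that Python 3’s re module has the opposite default: \w matches any Unicode word character unless re.ASCII is set. A rough Python analogue of the same check (my own sketch, not the production code):

```python
import re

# With re.ASCII, \w, \d and \s fall back to their ASCII meanings,
# matching the non-/u PCRE behavior the post relies on.
latin_only = re.compile(r"^[\w\d\s.,-]*$", re.ASCII)

assert latin_only.match("Hello, World 123")
assert not latin_only.match("héllo")

# Without the flag, \w is Unicode-aware by default in Python 3:
assert re.match(r"^\w+$", "héllo")
```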

Tags in system applications

I was thinking about how cool tags are. They truly help find bookmarked or themed information faster. Keeping up with important issues is much easier too.

But are there any good uses for tags in system applications? Sure, there are. One particular area that springs to mind is font management.

After I installed about 6,000 fonts on my computer, I realized that it is extremely difficult for me to use them efficiently. There are no categories or bookmarks of any kind. There are no subfolders. There are no comments or descriptions. I would be willing to sort out and tag all these fonts once to be able to find the most appropriate font later.

KDE people? Anyone?

Fonts saga continued

I mentioned recently that I’ve installed a whole lot of fonts on my office workstation. I was never actually concerned about fonts and was very satisfied with the default few that I had on the box. But I surprised myself. The world looked so different and so nice that I decided to do the same at home. I am way too addicted to the good looks of the Internet to view it in Helvetica 24×7.

If you are like I was, never caring about installing fonts, then I suggest you try it. You’ll be amazed at how different the real thing is.

P.S.: One of the side effects was my blogging fever. After I installed all these fonts I started to browse the web more, and WordPress’ administration interface looked so good that I couldn’t stay away.

Fixing SpamAssassin high load

Starting a few days before the upgrade of the home server to Fedora Core 3, I was seeing high load spikes from SpamAssassin. After the upgrade, my load average stayed at 10-15 almost the whole day. I tried to fix it, but to no avail. At first, I switched off the Bayes filtering, thinking that was the part to blame. It didn’t help. Then I forced SpamAssassin to run in a non-Unicode environment, thinking that might improve things. Nope.

After playing some more with the configuration and Googling, I decided to visit the #spamassassin IRC channel. In a matter of seconds I was pointed to this bug report. I will be trying the posted patches shortly. For now though, a simple change to the maximum number of children helped a lot. I reduced it from the default value of 5 to 2. (Edit /etc/sysconfig/spamassassin)
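Assuming the stock Fedora SPAMDOPTIONS line, where -m caps the number of spamd child processes, the edit looks roughly like this (your default line may differ):

```shell
# /etc/sysconfig/spamassassin
# Before (assumed Fedora default):
#   SPAMDOPTIONS="-d -c -m5 -H"
# After: cap spamd at 2 child processes
SPAMDOPTIONS="-d -c -m2 -H"
```

Restart the spamassassin service afterwards so spamd picks up the new limit.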