Today at work I came across a task that turned out to be much simpler than I originally expected. We have a site with some user registration forms. The site is translated into a number of languages, but due to regulatory procedures, we have to force users to enter their registration details in English only, using Latin characters, numbers, and punctuation.
I’ve refreshed my knowledge of Unicode and PCRE. And then I came up with the following method which seems to do the job just fine.
/**
 * Check that given string only uses Latin characters, digits, and punctuation
 *
 * @param string $string String to validate
 * @return boolean True if Latin only, false otherwise
 */
public function validateLatin($string)
{
    $result = false;
    if (preg_match("/^[\w\d\s.,-]*$/", $string)) {
        $result = true;
    }
    return $result;
}
In other words, just a standard regular expression with no Unicode trickery. The ‘/u’ modifier would cause this to totally malfunction and match everything. Good to know.
Why not just return preg_match("/^[\w\d\s.,-]*$/", $string); ?
One could do that too, of course. Nothing wrong with that. It’s only a matter of coding style. I personally prefer to have an if block. If I need to debug or extend it, I can just add extra statements inside the block.
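For anyone who prefers the shorter form, here is a minimal sketch of it. It assumes a standalone function rather than a class method, and the name isLatinOnly is made up for illustration. Since preg_match() returns 1, 0, or false rather than a boolean, the sketch compares the result against 1 so the documented boolean return type still holds.

```php
<?php
// Hypothetical standalone variant of validateLatin(); the name is illustrative only.
function isLatinOnly(string $string): bool
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error,
    // so compare against 1 to get a real boolean.
    return preg_match('/^[\w\d\s.,-]*$/', $string) === 1;
}

var_dump(isLatinOnly('John Smith, 42')); // bool(true)
var_dump(isLatinOnly('John@Smith'));     // bool(false), "@" is not in the character class
```

The if block from the post and this one-liner accept exactly the same strings; the difference is only where you would hook in extra debugging statements.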
Nice one!
– Do not dots and hyphens/’minus signs’ have to be escaped inside these square brackets?
Nope, they don’t. :)
ok, …always learning… Thx :-)
I do not get why the /u modifier should break the whole thing.
When I encode my PHP file correctly in UTF-8, these two function calls work as expected:
single-quoted: validateLatin('as\xc3\xb6'); => false
double-quoted: validateLatin("as\xc3\xb6");
Which basically means that if the form that sends the login credentials correctly submits UTF-8, you can also use German umlauts etc.
If it does not, you will end up with a string that has backslashes in it, so you should be fine with the /u modifier.
Did I miss something?
I had a unit test which was not encoded properly, I guess. The /u modifier tells preg_match to treat the string as Unicode. \w doesn’t match a Unicode character. That’s why it’s failing for me.
Love you man….
Very good, nice one………..
what about this:
preg_match("/[^\x00-\x7F]/",$name)
This will allow for too much. You could narrow it to the range from space (\x20) to tilde (\x7e), but you’ll still let in a whole bunch of brackets and slashes. YMMV I guess.
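A rough sketch of that narrower range check, for comparison. The function name isPrintableAscii is hypothetical, and the \z anchor is my own choice here so that a trailing newline is not silently accepted.

```php
<?php
// Hypothetical helper: accepts only characters from space (\x20) to tilde (\x7E).
function isPrintableAscii(string $name): bool
{
    // \z anchors at the very end of the string, unlike $, which also
    // matches just before a final newline.
    return preg_match('/^[\x20-\x7E]*\z/', $name) === 1;
}

var_dump(isPrintableAscii('Hello [world]/'));  // bool(true), brackets and slashes get through
var_dump(isPrintableAscii("na\xc3\xafve"));    // bool(false), contains non-ASCII bytes
```

As the first call shows, this range is stricter about non-ASCII input but looser about punctuation than the character class used in validateLatin().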
It’s good for filtering Chinese. I created a copy on the liveregex tester. You can take a look here:
https://www.liveregex.com/A8Zbg
Thanks for sharing this, very useful.
I am a noob to regex. I have been trying to make your function work for multiple paragraphs. Do you have any suggestions for expanding this regex to allow more than just one string? Thank you.
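One possible starting point, as a sketch only: the \s in the original character class already covers line breaks, so the same pattern should accept text that spans several paragraphs, as long as every character is still in the allowed set. The function name validateLatinText below is made up for illustration.

```php
<?php
// Same pattern as validateLatin() in the post, written as a standalone
// function for illustration; the name is hypothetical.
function validateLatinText(string $text): bool
{
    // \s matches spaces, tabs, and newlines, so blank lines between
    // paragraphs are accepted without changing the pattern.
    return preg_match('/^[\w\d\s.,-]*$/', $text) === 1;
}

$text = "First paragraph, in English.\n\nSecond paragraph.";
var_dump(validateLatinText($text)); // bool(true)
```

If multi-paragraph input is still rejected, the likely cause is punctuation outside the class (quotes, colons, question marks, and so on) rather than the line breaks themselves.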