Tuesday, 14 October, 2014

Matching Latin characters with regular expressions in Java

Recently I needed to check user input for non-Latin characters. My first impression was that this could not be done with regular expressions. It turned out that I was completely wrong.

Standard regular expression character classes like [a-zA-Z] or \w (word character) are not suitable for matching Latin characters because they are far too restrictive. For example, these classes exclude characters like öòóô or äáàâ, which are used in many Western and Central European languages.

Luckily, Java's java.util.regex.Pattern supports matching Unicode characters with \p. Various arguments can be passed to \p in curly braces. To match only Latin characters, \p{IsLatin} can be used:

"öôóò".matches("\\p{IsLatin}+"); // matches
"Σήμε".matches("\\p{IsLatin}+"); // Greek characters do not match
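
To make the difference to [a-zA-Z] concrete, here is a minimal sketch that runs the same input through both patterns. The class and method names (LatinCheck, isLatinOnly, isAsciiLettersOnly) are made up for illustration, and the string literals use Unicode escapes only so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class LatinCheck {
    // Precompiled so the patterns are not recompiled on every call.
    private static final Pattern LATIN_ONLY = Pattern.compile("\\p{IsLatin}+");
    private static final Pattern ASCII_LETTERS = Pattern.compile("[a-zA-Z]+");

    // True if the whole input consists of characters from the Latin script.
    static boolean isLatinOnly(String input) {
        return LATIN_ONLY.matcher(input).matches();
    }

    // True if the whole input consists of the 52 ASCII letters only.
    static boolean isAsciiLettersOnly(String input) {
        return ASCII_LETTERS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String zurich = "Z\u00fcrich"; // "Zurich" spelled with u-umlaut (U+00FC)

        System.out.println(isAsciiLettersOnly(zurich));     // false
        System.out.println(isLatinOnly(zurich));            // true
        // Spaces and digits belong to the Common script, not to Latin.
        System.out.println(isLatinOnly(zurich + " 8001"));  // false
    }
}
```

Note that \p{IsLatin} only covers characters assigned to the Latin script. Spaces, digits and most punctuation are in the Common script, so they have to be allowed separately if the input may contain them.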

\p supports a wide range of Unicode scripts. It can be used to match Greek, Cyrillic, Braille and more.
Some examples:

"Σήμε".matches("^\\p{IsGreek}+$");
"ждля".matches("^\\p{IsCyrillic}+$");
"⠇⠏⠙⠟".matches("^\\p{IsBraille}+$");
"ᚊᚎᚑᚕᚙ᚛".matches("^\\p{IsOgam}+$"); // Ogham script; Chrome might not show these characters
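
Besides the Is prefix, Pattern also accepts the script= and sc= keywords, and an uppercase \P negates the class. The following sketch shows the three spellings side by side and one possible way to flag letters that are not Latin. The class name ScriptSyntax and the helper containsNonLatinLetter are hypothetical, and combining \p{L} with a negated script class is just one way to define "non-Latin letter".

```java
import java.util.regex.Pattern;

public class ScriptSyntax {
    // A letter (\p{L}) that does not belong to the Latin script.
    // Spaces, digits and punctuation are not letters, so they are ignored.
    private static final Pattern NON_LATIN_LETTER =
            Pattern.compile("[\\p{L}&&[^\\p{IsLatin}]]");

    static boolean containsNonLatinLetter(String input) {
        return NON_LATIN_LETTER.matcher(input).find();
    }

    public static void main(String[] args) {
        String sigma = "\u03a3"; // Greek capital letter sigma (U+03A3)

        // Three equivalent spellings of the same script class.
        System.out.println(sigma.matches("\\p{IsGreek}"));      // true
        System.out.println(sigma.matches("\\p{script=Greek}")); // true
        System.out.println(sigma.matches("\\p{sc=Greek}"));     // true

        // Uppercase P negates: sigma is "not Latin".
        System.out.println(sigma.matches("\\P{IsLatin}"));      // true

        System.out.println(containsNonLatinLetter("Hello, world 42"));  // false
        System.out.println(containsNonLatinLetter("Hello " + sigma));   // true
    }
}
```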

Have a look at Character.UnicodeScript for the full list of supported scripts.
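
Character.UnicodeScript can also be queried directly, which is handy for finding out which script name to put into \p{Is...}. Below is a small sketch; the class name ScriptLookup and the helper scriptOf are invented for this example, and the helper assumes a non-empty string.

```java
public class ScriptLookup {
    // Script of the first code point of the given (non-empty) string.
    static Character.UnicodeScript scriptOf(String s) {
        return Character.UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(scriptOf("a"));       // LATIN
        System.out.println(scriptOf("\u03a3"));  // GREEK (U+03A3)
        System.out.println(scriptOf("\u0436"));  // CYRILLIC (U+0436)
        System.out.println(scriptOf("7"));       // COMMON

        // forName resolves script names and their aliases, ignoring case.
        // This is why the short form "Ogam" works for the Ogham script.
        System.out.println(Character.UnicodeScript.forName("Ogam")); // OGHAM

        // Number of scripts known to the running JDK (varies by version).
        System.out.println(Character.UnicodeScript.values().length);
    }
}
```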

Tags: Java

Comments

ageewien - Saturday, 30 September, 2017

This article helped me a lot. It is especially useful when a multilingual (Latin) text has to be split into a word list, e.g. split(/[^\p{IsLatin}]/).
