Check Whether a Unicode String is Normalized in PostgreSQL

In PostgreSQL, we can use the ... IS NORMALIZED expression to check whether or not a given Unicode string is in a specified Unicode normalization form.

By default, it checks whether it’s in the NFC form, but we also have the option of specifying NFD, NFKC, or NFKD.

Syntax

The syntax goes like this:

text IS [NOT] [form] NORMALIZED

So we can use NOT to negate the operation. We can also specify which normalization form to check for.

The expression returns a boolean value (either true or false).

Example

Here’s a quick example of how to use the expression:

SELECT 'ö' IS NORMALIZED;

Result:

t

In this case we get t for true. In other words, the Unicode character is normalized in the NFC form. We know it’s the NFC form, because that’s the default.

Here it is again, but this time we’ll check it for another form:

SELECT 'ö' IS NFD NORMALIZED;

Result:

f

This time we get f for false. That’s because, although it’s normalized, it’s not normalized in the NFD form.

Negating the Operation

We can use NOT to negate the operation:

SELECT 
    'ö' IS NORMALIZED AS "Normalized",
    'ö' IS NOT NORMALIZED AS "Not Normalized";

Result:

 Normalized | Not Normalized 
------------+----------------
t | f

Understanding Normalization

To get a proper understanding of how the ... IS NORMALIZED expression works, we need to understand why we might need it.

Take the following example:

SELECT 'ö' = 'ö';

Result:

f

Here, I compared two apparently identical characters, but the result indicates that they’re not identical (we got f for false).

The reason for this is because, although they are the same character, I actually used different Unicode code points to construct them. In my case I constructed them using HTML and then copied and pasted them into my psql terminal.

Many Unicode characters can be constructed from more than one set of code points, and that’s the case with the above character, known in Unicode as LATIN SMALL LETTER O WITH DIAERESIS. This can be represented as a single code point (U+00F6) or as two separate code points (U+006F and U+0308). Regarding the second one, U+006F is the code point for LATIN SMALL LETTER O and U+0308 is the code point for the COMBINING DIAERESIS. These two code points are combined to form the same character that the single code point U+00F6 would produce.

So given these are constructed with different code points, PostgreSQL sees them as not equal.

So given they’re not equal, let’s check each character with the ... IS NORMALIZED expression:

SELECT 
    'ö' IS NORMALIZED, 
    'ö' IS NORMALIZED;

Result:

 is_normalized | is_normalized 
---------------+---------------
t | f

So we can see that the first one is normalized in NFC form but the second one isn’t (remember, NFC is the default).

Let’s specify a different form:

SELECT 
    'ö' IS NFD NORMALIZED, 
    'ö' IS NFD NORMALIZED;

Result:

 is_normalized | is_normalized 
---------------+---------------
f | t

This time we get the opposite result – the first column is false and the second column is true.

If we want to avoid this confusion, we can use the normalize() function to normalize both characters using the same form. The default form for this function is NFC, so we can do the following to normalize them in that form:

SELECT 
    normalize('ö') IS NORMALIZED,
    normalize('ö') IS NORMALIZED;

Result:

 is_normalized | is_normalized 
---------------+---------------
t | t

This time both characters returned true when checking for normalization.

We can specify a different form if required:

SELECT 
    normalize('ö', NFD) IS NFD NORMALIZED,
    normalize('ö', NFD) IS NFD NORMALIZED;

Result:

 is_normalized | is_normalized 
---------------+---------------
t | t