In PostgreSQL, we can use the ... IS NORMALIZED
expression to check whether or not a given Unicode string is in a specified Unicode normalization form.
By default, it checks whether it’s in the NFC
form, but we also have the option of specifying NFD
, NFKC
, or NFKD
.
Syntax
The syntax goes like this:
text IS [NOT] [form] NORMALIZED
So we can use NOT
to negate the operation. We can also specify which normalization form to check for.
The expression returns a boolean value (either true or false).
Example
Here’s a quick example of how to use the expression:
SELECT 'ö' IS NORMALIZED;
Result:
t
In this case we get t
for true. In other words, the Unicode character is normalized in the NFC
form. We know it’s the NFC
form, because that’s the default.
Here it is again, but this time we’ll check it for another form:
SELECT 'ö' IS NFD NORMALIZED;
Result:
f
This time we get f
for false. That’s because, although it’s normalized, it’s not normalized in the NFD
form.
Negating the Operation
We can use NOT
to negate the operation:
SELECT
'ö' IS NORMALIZED AS "Normalized",
'ö' IS NOT NORMALIZED AS "Not Normalized";
Result:
Normalized | Not Normalized
------------+----------------
t | f
Understanding Normalization
To get a proper understanding of how the ... IS NORMALIZED
expression works, we need to understand why we might need it.
Take the following example:
SELECT 'ö' = 'ö';
Result:
f
Here, I compared two apparently identical characters, but the result indicates that they’re not identical (we got f
for false).
The reason for this is because, although they are the same character, I actually used different Unicode code points to construct them. In my case I constructed them using HTML and then copied and pasted them into my psql
terminal.
Many Unicode characters can be constructed from more than one set of code points, and that’s the case with the above character, known in Unicode as LATIN SMALL LETTER O WITH DIAERESIS. This can be represented as a single code point (U+00F6
) or as two separate code points (U+006F
and U+0308
). Regarding the second one, U+006F
is the code point for LATIN SMALL LETTER O and U+0308
is the code point for the COMBINING DIAERESIS. These two code points are combined to form the same character that the single code point U+00F6
would produce.
So given these are constructed with different code points, PostgreSQL sees them as not equal.
So given they’re not equal, let’s check each character with the ... IS NORMALIZED
expression:
SELECT
'ö' IS NORMALIZED,
'ö' IS NORMALIZED;
Result:
is_normalized | is_normalized
---------------+---------------
t | f
So we can see that the first one is normalized in NFC
form but the second one isn’t (remember, NFC
is the default).
Let’s specify a different form:
SELECT
'ö' IS NFD NORMALIZED,
'ö' IS NFD NORMALIZED;
Result:
is_normalized | is_normalized
---------------+---------------
f | t
This time we get the opposite result – the first column is false and the second column is true.
If we want to avoid this confusion, we can use the normalize()
function to normalize both characters using the same form. The default form for this function is NFC
, so we can do the following to normalize them in that form:
SELECT
normalize('ö') IS NORMALIZED,
normalize('ö') IS NORMALIZED;
Result:
is_normalized | is_normalized
---------------+---------------
t | t
This time both characters returned true when checking for normalization.
We can specify a different form if required:
SELECT
normalize('ö', NFD) IS NFD NORMALIZED,
normalize('ö', NFD) IS NFD NORMALIZED;
Result:
is_normalized | is_normalized
---------------+---------------
t | t