Understanding the NORMALIZE() Function in PostgreSQL

PostgreSQL has a normalize() function that converts a string to the specified Unicode normalization form.

The function can only be used when the server encoding is UTF8.

Syntax

The syntax goes like this:

normalize ( text [, form ] )

So we’re required to pass at least one argument (the string to normalize).

The second argument is optional. If provided, it specifies the normalization form to use. There are four forms: NFC (the default), NFD, NFKC, and NFKD.

Example

To understand the purpose of the normalize() function, let’s look at the following code:

SELECT 
    'ö' = 'ö',
    normalize('ö') = normalize('ö');

Result:

 ?column? | ?column? 
----------+----------
 f        | t

Here, I compared two apparently identical characters, yet without the normalize() function PostgreSQL returned f (for false), meaning that PostgreSQL decided that they’re not equal. But with the normalize() function it returned t (for true), meaning they are equal.

The reason is that, although they look like the same character, I actually used different Unicode code points to construct them. In my case I constructed them using HTML and then copied and pasted them into my psql terminal.

Many Unicode characters can be constructed from more than one sequence of code points, and that's the case with the above character, known in Unicode as LATIN SMALL LETTER O WITH DIAERESIS. It can be represented as a single code point (U+00F6) or as two separate code points (U+006F and U+0308). In the two-code-point representation, U+006F is the code point for LATIN SMALL LETTER O and U+0308 is the code point for COMBINING DIAERESIS. These two code points combine to form the same character that the single code point U+00F6 produces.
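We can see the same two constructions outside PostgreSQL. Here's a small sketch using Python's unicodedata module (used purely for illustration; PostgreSQL's normalize() applies the same Unicode algorithm):

```python
import unicodedata

composed = "\u00f6"      # single code point: LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"   # U+006F LATIN SMALL LETTER O + U+0308 COMBINING DIAERESIS

# The raw strings differ, because the underlying code points differ.
print(composed == decomposed)  # False

# After normalizing both to NFC, they compare equal.
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```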

So, because the two strings are constructed from different code points, PostgreSQL sees them as unequal.

This can cause problems in our queries. A query could return inaccurate results because the same visible character is treated differently depending on the code points used to construct it.
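For example, a lookup can silently miss a row when the stored value and the search term use different constructions of the same character. Here's a sketch of that pitfall in Python (illustrative only, with a made-up name list; the same thing happens in a SQL WHERE clause):

```python
import unicodedata

# A stored value that uses the precomposed form of "ö" (U+00F6).
names = ["Bj\u00f6rn"]

# A search term that arrives with the combining form (U+006F + U+0308).
search = "Bjo\u0308rn"

# The lookup misses, even though both strings display as "Björn".
print(search in names)  # False

# Normalizing both sides to the same form fixes the lookup.
normalized = [unicodedata.normalize("NFC", n) for n in names]
print(unicodedata.normalize("NFC", search) in normalized)  # True
```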

To further demonstrate the issue, here’s another example:

SELECT 
    'ö' = 'ö',
    'ö' = 'ö';

Result:

 ?column? | ?column? 
----------+----------
 f        | t

This time we get false in the left column and true in the right, even though all characters appear to be the same. That’s because in the left column I used different code points for each character but in the right column I used the same code point for both characters.

Again, the normalize() function can solve this so that they’re all treated the same:

SELECT 
    normalize('ö') = normalize('ö'),
    normalize('ö') = normalize('ö');

Result:

 ?column? | ?column? 
----------+----------
 t        | t

About the Normalization Form

By default, the normalize() function uses the NFC form to normalize the string. We have the option of explicitly specifying this form or any other form accepted by the function.

Here are the accepted Normalization Forms:

Argument | Form                  | Description
---------|-----------------------|---------------------------------------------------------------
NFD      | Normalization Form D  | Canonical Decomposition
NFC      | Normalization Form C  | Canonical Decomposition, followed by Canonical Composition
NFKD     | Normalization Form KD | Compatibility Decomposition
NFKC     | Normalization Form KC | Compatibility Decomposition, followed by Canonical Composition
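The difference between canonical and compatibility decomposition shows up with characters like the "ﬁ" ligature (U+FB01): the canonical forms leave it intact, while the compatibility forms replace it with plain "f" and "i". A quick illustration in Python's unicodedata module (used here only to demonstrate the Unicode behavior; PostgreSQL's normalize() follows the same rules):

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# Canonical form: the ligature has no canonical decomposition, so it is kept.
print(unicodedata.normalize("NFC", ligature))   # 'ﬁ'

# Compatibility form: the ligature decomposes to the two plain letters.
print(unicodedata.normalize("NFKC", ligature))  # 'fi'
```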

So when we compare two strings that have been “normalized”, we should ensure that they use the same form.

Here’s an example that shows the varying results we can get when we combine Normalization Forms in our queries:

\x
SELECT
    'á' = 'á' AS "Original",
    normalize('á', NFC) = normalize('á', NFC) AS "NFC/NFC",
    normalize('á', NFC) = normalize('á', NFD) AS "NFC/NFD",
    normalize('á', NFC) = normalize('á', NFKC) AS "NFC/NFKC",
    normalize('á', NFC) = normalize('á', NFKD) AS "NFC/NFKD",
    normalize('á', NFD) = normalize('á', NFC) AS "NFD/NFC",
    normalize('á', NFD) = normalize('á', NFD) AS "NFD/NFD",
    normalize('á', NFD) = normalize('á', NFKC) AS "NFD/NFKC",
    normalize('á', NFD) = normalize('á', NFKD) AS "NFD/NFKD";

Result:

Original | f
NFC/NFC  | t
NFC/NFD  | f
NFC/NFKC | t
NFC/NFKD | f
NFD/NFC  | f
NFD/NFD  | t
NFD/NFKC | f
NFD/NFKD | t

More about Normalization Forms

The Unicode Consortium explains Normalization Forms as follows:

Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other. Depending on the particular Unicode Normalization Form, that equivalence can either be a canonical equivalence or a compatibility equivalence.

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
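For instance, when a character carries two combining marks, the algorithm reorders them into a canonical order, so two inputs that differ only in the order of their marks normalize to the same string. A small Python illustration (the code points are standard Unicode; Python is used here only to demonstrate the algorithm):

```python
import unicodedata

# "q" with COMBINING DOT ABOVE then COMBINING DOT BELOW, and the reverse order.
a = "q\u0307\u0323"
b = "q\u0323\u0307"

# The raw strings differ, because the marks are stored in different orders.
print(a == b)  # False

# NFD reorders combining marks canonically, so both normalize identically.
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True
```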

See Unicode Standard Annex #15, Unicode Normalization Forms, on the Unicode website for more information on how normalization works.