PostgreSQL has a normalize() function that converts a string to the specified Unicode normalization form. The function can only be used when the server encoding is UTF8.
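If you’re not sure which encoding your server uses, you can check it with the SHOW command:
SHOW server_encoding;
If this returns UTF8, the normalize() function is available.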
Syntax
The syntax goes like this:
normalize ( text [, form ] )
So we’re required to pass at least one argument (the string to normalize). The second argument is optional. If provided, it specifies the normalization form to use. There are four forms: NFC (the default), NFD, NFKC, and NFKD.
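For example, to normalize a string to the NFD form (note that the form is written as a keyword, not as a quoted string):
SELECT normalize('ö', NFD);
The result looks identical on screen, but it’s now the decomposed (NFD) representation of the character.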
Example
To understand the purpose of the normalize() function, let’s look at the following code:
SELECT
'ö' = 'ö',
normalize('ö') = normalize('ö');
Result:
 ?column? | ?column?
----------+----------
 f        | t
Here, I compared two apparently identical characters, yet without the normalize() function PostgreSQL returned f (for false), meaning that PostgreSQL decided they’re not equal. But with the normalize() function it returned t (for true), meaning they are equal.
The reason is that, although they render as the same character, I used different Unicode code points to construct them. In my case I constructed them using HTML and then copied and pasted them into my psql terminal.
Many Unicode characters can be constructed from more than one sequence of code points, and that’s the case with the above character, known in Unicode as LATIN SMALL LETTER O WITH DIAERESIS. It can be represented as a single code point (U+00F6) or as two separate code points (U+006F followed by U+0308). In the second case, U+006F is the code point for LATIN SMALL LETTER O and U+0308 is the code point for COMBINING DIAERESIS. These two code points combine to form the same character that the single code point U+00F6 produces.
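We can make the difference explicit by using PostgreSQL’s U& Unicode escape syntax to construct each variant, so the comparison no longer depends on copy and paste:
SELECT
U&'\00F6' AS composed,
U&'\006F\0308' AS decomposed,
U&'\00F6' = U&'\006F\0308' AS equal;
Result:
 composed | decomposed | equal
----------+------------+-------
 ö        | ö          | f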
So, given that they’re constructed from different code points, PostgreSQL sees them as not equal.
This could cause issues in our queries: they could return inaccurate results, because the same Unicode character is treated differently depending on the code points used to construct it.
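As a sketch of how this can bite in practice (the cities table here is purely hypothetical), a lookup can miss a row when the stored value and the search value use different code point sequences:
CREATE TEMP TABLE cities (name text);
-- Store the name with a decomposed ö (U+006F followed by U+0308).
INSERT INTO cities VALUES (U&'\006F\0308stersund');
-- Search with the precomposed ö (U+00F6): no row is returned.
SELECT * FROM cities WHERE name = U&'\00F6stersund';
-- Normalizing both sides makes the lookup match.
SELECT * FROM cities WHERE normalize(name) = normalize(U&'\00F6stersund');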
To further demonstrate the issue, here’s another example:
SELECT
'ö' = 'ö',
'ö' = 'ö';
Result:
 ?column? | ?column?
----------+----------
 f        | t
This time we get false in the left column and true in the right, even though all characters appear to be the same. That’s because in the left column I used different code points for each character but in the right column I used the same code point for both characters.
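One way to see that visually identical characters really are different under the hood is to compare their byte lengths, for example with octet_length() and the U& escape syntax from above:
SELECT
octet_length(U&'\00F6') AS composed_bytes,
octet_length(U&'\006F\0308') AS decomposed_bytes;
Result:
 composed_bytes | decomposed_bytes
----------------+------------------
              2 |                3
The single code point encodes to two bytes in UTF8, while the two-code-point sequence encodes to three.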
Again, the normalize() function can solve this so that they’re all treated the same:
SELECT
normalize('ö') = normalize('ö'),
normalize('ö') = normalize('ö');
Result:
 ?column? | ?column?
----------+----------
 t        | t
About the Normalization Form
By default, the normalize() function uses the NFC form to normalize the string. We have the option of explicitly specifying this form or any other form accepted by the function.
Here are the accepted Normalization Forms:
Argument | Form | Description
---------|------|------------
NFD | Normalization Form D | Canonical Decomposition
NFC | Normalization Form C | Canonical Decomposition, followed by Canonical Composition
NFKD | Normalization Form KD | Compatibility Decomposition
NFKC | Normalization Form KC | Compatibility Decomposition, followed by Canonical Composition
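The compatibility forms go further than the canonical ones: they also fold characters that are merely “compatible” rather than canonically equivalent. For example, the ligature ﬁ (U+FB01) is left alone by NFC but replaced by the two letters fi under NFKC:
SELECT
normalize(U&'\FB01', NFC) AS nfc,
normalize(U&'\FB01', NFKC) AS nfkc;
Result:
 nfc | nfkc
-----+------
 ﬁ   | fi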
So when we compare two strings that have been “normalized”, we should ensure that they use the same form.
Here’s an example that shows the varying results we can get when we combine Normalization Forms in our queries:
\x
SELECT
'á' = 'á' AS "Original",
normalize('á', NFC) = normalize('á', NFC) AS "NFC/NFC",
normalize('á', NFC) = normalize('á', NFD) AS "NFC/NFD",
normalize('á', NFC) = normalize('á', NFKC) AS "NFC/NFKC",
normalize('á', NFC) = normalize('á', NFKD) AS "NFC/NFKD",
normalize('á', NFD) = normalize('á', NFC) AS "NFD/NFC",
normalize('á', NFD) = normalize('á', NFD) AS "NFD/NFD",
normalize('á', NFD) = normalize('á', NFKC) AS "NFD/NFKC",
normalize('á', NFD) = normalize('á', NFKD) AS "NFD/NFKD";
Result:
Original | f
NFC/NFC  | t
NFC/NFD  | f
NFC/NFKC | t
NFC/NFKD | f
NFD/NFC  | f
NFD/NFD  | t
NFD/NFKC | f
NFD/NFKD | t
More about Normalization Forms
The Unicode Consortium explains Normalization Forms as follows:
Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other. Depending on the particular Unicode Normalization Form, that equivalence can either be a canonical equivalence or a compatibility equivalence.
Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
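We can observe that binary equivalence directly in PostgreSQL by converting the normalized strings to bytea. For instance, using the U& escape syntax from earlier, both constructions of ö normalize to exactly the same UTF8 bytes under NFC:
SELECT
convert_to(normalize(U&'\00F6', NFC), 'UTF8') AS composed_nfc,
convert_to(normalize(U&'\006F\0308', NFC), 'UTF8') AS decomposed_nfc;
Result:
 composed_nfc | decomposed_nfc
--------------+----------------
 \xc3b6       | \xc3b6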
See the Unicode website for more information on how normalization works.