Understanding PostgreSQL’s REGEXP

In PostgreSQL, the regexp_instr() function returns the starting or ending position of the N‘th match of a POSIX regular expression pattern to a string. If there’s no match, it returns zero.

We pass the string and pattern as arguments. The function also accepts some optional arguments that allow us to be specific with how the function works.

Syntax

The syntax goes like this:

regexp_instr ( string text, pattern text [, start integer [, N integer [, endoption integer [, flags text [, subexpr integer ] ] ] ] ] )

So the first two arguments are required, and the subsequent ones are optional.

Example

Here’s a basic example to demonstrate:

SELECT regexp_instr('A cat called Ash', 'A.');

Result:

Here, I checked the starting position of the A character followed by any other character (the dot matches any character).

Here’s what happens if I switch it around so that the dot comes first:

SELECT regexp_instr('A cat called Ash', '.A');

Result:

This time it matched the A from Ash. Actually, it matched the A and its preceding character (the dot matches any character). In this case the preceding character is the space character, which is at position 13.

Change the Start Position

We can add a third argument to specify where to start the search within the string:

SELECT regexp_instr('A cat called Ash', 'A.', 2);

Result:

Here, I specified that the search should start at position 2, and so this resulted in it missing the first A character. In this case it went straight to the A of Ash, which is at position 14.

Specify the Nth Match

We can add another argument that specifies which match we’re interested in. For example, if there are two matches, we can specify that we’re only interested in the second match. Or if there are ten matches, we could specify the fifth, and so on.

Here’s an example that demonstrates this:

SELECT 
    regexp_instr('A cat called Ash', 'A.', 1, 1) AS "1st",
    regexp_instr('A cat called Ash', 'A.', 1, 2) AS "2nd";

Result:

 1st | 2nd 
-----+-----
   1 |  14

The first column returns the position of the first match, and the second column returns the position of the second match.

The Endoption Argument

The next argument we can specify is the endoption argument. If this is omitted or specified as 0, the function returns the position of the first character of the match. Otherwise, this argument must be 1, which results in the function returning the position of the character that follows the match.

Example:

SELECT 
    regexp_instr('A cat called Ash', 'A.', 1, 2) AS "Omitted",
    regexp_instr('A cat called Ash', 'A.', 1, 2, 0) AS "0",
    regexp_instr('A cat called Ash', 'A.', 1, 2, 1) AS "1";

Result:

 Omitted | 0  | 1  
---------+----+----
      14 | 14 | 16

Here, I omitted the endoption argument in the first column, I specified 0 in the second column, and 1 in the third. We can see that the first two columns returned the position of the first character of the match, but the third column returned the position of the character that follows the match.

Here it is again, but using a pattern of A.. (i.e. with an additional dot):

SELECT 
    regexp_instr('A cat called Ash', 'A..', 1, 2) AS "Omitted",
    regexp_instr('A cat called Ash', 'A..', 1, 2, 0) AS "0",
    regexp_instr('A cat called Ash', 'A..', 1, 2, 1) AS "1";

Result:

 Omitted | 0  | 1  
---------+----+----
      14 | 14 | 17

The first two columns remain the same as in the previous example, but the third column is now 17. That’s because the two dots matched the following two characters after the A character, which is one more character than in the previous example.

Add a Flag

We can also add a flag to change the behaviour of the function. For example, we can specify that the search is case-insensitive by including the i flag:

SELECT regexp_instr('A cat called Ash', 'A.', 1, 2, 0, 'i');

Result:

This time the result is 4. That’s because the search matched both uppercase and lowercase characters (i.e. both a and A). It returned the second match because we specified 2 as the fourth argument.

Here’s what happens if we change the fourth argument to 3:

SELECT regexp_instr('A cat called Ash', 'A.', 1, 3, 0, 'i');

Result:

The i flag is just one of many flags we can use. The following table contains the flags we can use here:

Option	Description
`b`	Rest of RE is a BRE
`c`	Case-sensitive matching (overrides operator type)
`e`	Rest of RE is an ERE
`i`	Case-insensitive matching (overrides operator type)
`m`	Historical synonym for `n`
`n`	Newline-sensitive matching
`p`	Partial newline-sensitive matching
`q`	Rest of RE is a literal (“quoted”) string, all ordinary characters
`s`	Non-newline-sensitive matching (default)
`t`	Tight syntax (default)
`w`	Inverse partial newline-sensitive matching
`x`	Expanded syntax

Source: https://www.postgresql.org/docs/current/functions-matching.html#POSIX-METASYNTAX

Specify a Subexpression

If our pattern contains any parenthesised subexpressions, we can specify which one to use by passing an integer as the last argument after the flag:

SELECT 
    regexp_instr('A cat called Ash', 'A(.)(..)(...)', 1, 2, 0, 'i') AS "Omitted",
    regexp_instr('A cat called Ash', 'A(.)(..)(...)', 1, 2, 0, 'i', 0) AS "0",
    regexp_instr('A cat called Ash', 'A(.)(..)(...)', 1, 2, 0, 'i', 1) AS "1",
    regexp_instr('A cat called Ash', 'A(.)(..)(...)', 1, 2, 0, 'i', 2) AS "2",
    regexp_instr('A cat called Ash', 'A(.)(..)(...)', 1, 2, 0, 'i', 3) AS "3";

Result:

 Omitted | 0 | 1 | 2  | 3  
---------+---+---+----+----
       8 | 8 | 9 | 10 | 12

When we omit the subexpression argument or we specify 0, the result identifies the position of the whole match regardless of parenthesised subexpressions. Otherwise our integer specifies which subexpression to use.

Database.Guide

Understanding PostgreSQL’s REGEXP_INSTR() Function

Syntax

Example

Change the Start Position

Specify the Nth Match

The Endoption Argument

Add a Flag

Specify a Subexpression

More About POSIX Regular Expressions in PostgreSQL