MongoDB $indexOfCP

In MongoDB, the $indexOfCP aggregation pipeline operator searches a string for an occurrence of a substring and returns the UTF-8 code point index of the first occurrence.

The UTF-8 code point index is zero-based (i.e. it starts at 0).

Syntax

The syntax goes like this:

{ $indexOfCP: [ <string expression>, <substring expression>, <start>, <end> ] }

Where:

  • <string expression> is the string to search.
  • <substring expression> is the substring you want to find in the string.
  • <start> is an optional argument that specifies a starting index position for the search. It can be any valid expression that resolves to a non-negative integral number.
  • <end> is an optional argument that specifies an ending index position for the search. It can be any valid expression that resolves to a non-negative integral number.

If the substring isn’t found, $indexOfCP returns -1.

If the substring occurs more than once, only the index of the first occurrence is returned.

Example

Suppose we have a collection called test with the following documents:

{ "_id" : 1, "data" : "c 2021" }
{ "_id" : 2, "data" : "© 2021" }
{ "_id" : 3, "data" : "ไม้เมือง" }

Here’s an example of applying $indexOfCP to those documents:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1, 2, 3 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "2021" ] }
          }
     }
   ]
)

Result:

{ "data" : "c 2021", "result" : 2 }
{ "data" : "© 2021", "result" : 2 }
{ "data" : "ไม้เมือง", "result" : -1 }

In the first two documents, the substring was found at UTF-8 code point index position 2. Given that $indexOfCP results are zero-based (the index starts at 0), position 2 represents the third code point.

This differs from what $indexOfBytes would return for the second document, which is 3, because the copyright symbol (©) takes up 2 bytes in UTF-8. It is still just one code point, the same as the letter c.
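The difference between bytes and code points can also be checked outside MongoDB. Below is a minimal Node.js sketch (it does not call MongoDB; it assumes Node's built-in Buffer, and the variable names are only for illustration) that computes both kinds of index for the second document's string:

```javascript
// Plain Node.js sketch: illustrates UTF-8 bytes vs code points.
// It does not use MongoDB at all.
const withCopyright = "© 2021";
const substring = "2021";

// Position counted in UTF-8 bytes: © occupies 2 bytes, the space 1 byte.
const byteIndex = Buffer.from(withCopyright, "utf8").indexOf(substring);

// Position counted in code points: take everything before the match and
// count its code points (Array.from iterates a string by code point).
const unitIndex = withCopyright.indexOf(substring);
const codePointIndex = Array.from(withCopyright.slice(0, unitIndex)).length;

console.log(byteIndex);      // 3
console.log(codePointIndex); // 2
```

The byte index matches what we'd expect from $indexOfBytes, and the code point index matches the $indexOfCP result above.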

Regarding the third document, the substring wasn’t found at all, and so the result is -1.

Here’s another example, except this time we search for a Thai character:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1, 2, 3 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "เ" ] }
          }
     }
   ]
)

Result:

{ "data" : "c 2021", "result" : -1 }
{ "data" : "© 2021", "result" : -1 }
{ "data" : "ไม้เมือง", "result" : 3 }

In this case, we searched for a character that’s in the third document, and its UTF-8 code point index comes back as 3. Given that $indexOfCP results are zero-based, this means it’s the fourth code point.

This is because the second character carries a diacritic mark, which is a code point in its own right. The first character is one code point and the second character is two code points (including the diacritic), for a total of three. Our character therefore starts at the fourth position (which is code point index 3, as the count starts at 0).
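To see how the code points in this string line up, here is a small plain JavaScript sketch (no MongoDB involved; the variable names are only for illustration) that lists each code point of the Thai string with its zero-based index:

```javascript
// Plain JavaScript sketch: list the code points of the string from
// document 3, together with their zero-based index.
const thai = "ไม้เมือง";

// Array.from iterates a string by code point, so the combining tone mark
// appears as its own entry rather than being merged with its base letter.
const codePoints = Array.from(thai);

codePoints.forEach((cp, i) => {
  const hex = cp.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  console.log(i, `U+${hex}`);
});

// The character we searched for sits at code point index 3.
const target = "เ";
console.log(codePoints.indexOf(target)); // 3
```

The entry at index 2 is the diacritic (U+0E49), which is why the searched character lands at index 3 rather than 2.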

See MongoDB $strLenCP for an example that returns the number of code points for each character in this particular string. And see MongoDB $strLenBytes to see the number of bytes in the same string.

Specify a Starting Position

You can provide a third argument to specify a starting index position for the search.

Suppose we have the following document:

{ "_id" : 4, "data" : "ABC XYZ ABC" }

Here’s an example of applying $indexOfCP with a starting position:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "ABC", 1 ] }
          }
     }
   ]
)

Result:

{ "data" : "ABC XYZ ABC", "result" : 8 }

In this case, the second instance of the substring was returned. This is because we started the search at position 1, and the first instance of the substring starts at position 0 (before the starting position for the search).

If the start position is a number greater than the length of the string or greater than the ending position, $indexOfCP returns -1.

If the start position is a negative number, $indexOfCP returns an error.
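The start position rules can be sketched in plain JavaScript. The helper below, indexOfFrom, is hypothetical (it is not a MongoDB API). It only mimics the behaviour described in this section, counting in code points:

```javascript
// Hypothetical helper (not a MongoDB API): mimics the start-position
// behaviour described above, counting in code points.
function indexOfFrom(str, sub, start = 0) {
  if (!Number.isInteger(start) || start < 0) {
    throw new RangeError("start must be a non-negative integer");
  }
  const haystack = Array.from(str);
  const needle = Array.from(sub);
  // Try every code point position from `start` onwards.
  for (let i = start; i + needle.length <= haystack.length; i++) {
    if (needle.every((cp, j) => haystack[i + j] === cp)) {
      return i;
    }
  }
  return -1; // also reached when start is beyond the end of the string
}

console.log(indexOfFrom("ABC XYZ ABC", "ABC"));     // 0
console.log(indexOfFrom("ABC XYZ ABC", "ABC", 1));  // 8
console.log(indexOfFrom("ABC XYZ ABC", "ABC", 50)); // -1
```

The second call skips the match at position 0 because the search only begins at position 1, which mirrors the result of 8 shown above.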

Specify an Ending Position

You can also provide a fourth argument to specify the ending index position for the search.

If you provide this argument, you also need to provide a starting position. Failing to do so will result in this argument being interpreted as the starting point.

Example:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "XYZ", 0, 3 ] }
          }
     }
   ]
)

Result:

{ "data" : "ABC XYZ ABC", "result" : -1 }

The result is -1 which means the substring wasn’t found. That’s because we started our search at position 0 and ended it at position 3, therefore not capturing the substring.

Here’s what happens if we increase the ending index position to 5:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "XYZ", 0, 5 ] }
          }
     }
   ]
)

Result:

{ "data" : "ABC XYZ ABC", "result" : 4 }

This time the value was included and its index position returned.

If the end position is a number less than the starting position, $indexOfCP returns -1.

If the end position is a negative number, $indexOfCP returns an error.
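The two results above (-1 with an end of 3, and 4 with an end of 5) can be reproduced with a small plain JavaScript sketch. The helper indexOfBetween is hypothetical (it is not a MongoDB API), and it rests on an assumption inferred only from those results: a match is reported when it starts at an index from start up to, but not including, end.

```javascript
// Hypothetical helper (not a MongoDB API). Assumption: a match counts
// when it *starts* at an index in [start, end). This is inferred from the
// results shown above, not taken from MongoDB's documentation.
function indexOfBetween(str, sub, start, end) {
  if (start < 0 || end < 0) {
    throw new RangeError("start and end must be non-negative");
  }
  const haystack = Array.from(str); // one entry per code point
  const needle = Array.from(sub);
  const stop = Math.min(end, haystack.length);
  for (let i = start; i < stop; i++) {
    if (needle.every((cp, j) => haystack[i + j] === cp)) {
      return i;
    }
  }
  return -1; // includes the case where end is less than start
}

console.log(indexOfBetween("ABC XYZ ABC", "XYZ", 0, 3)); // -1
console.log(indexOfBetween("ABC XYZ ABC", "XYZ", 0, 5)); // 4
console.log(indexOfBetween("ABC XYZ ABC", "XYZ", 6, 2)); // -1
```

If the operator's exact boundary handling matters for your data, it's worth confirming it against your own MongoDB version rather than relying on this sketch.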

Missing Fields

If the field is not in the document, $indexOfCP returns null.

Suppose we have the following document:

{ "_id" : 5 }

Here’s what happens when we apply $indexOfCP:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 5 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "XYZ" ] }
          }
     }
   ]
)

Result:

{ "result" : null }

Null Values

If the first argument is null, $indexOfCP returns null.

Suppose we have the following document:

{ "_id" : 6, "data" : null }

Here’s what happens when we apply $indexOfCP:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 6 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "XYZ" ] }
          }
     }
   ]
)

Result:

{ "data" : null, "result" : null }

However, when the second argument (i.e. the substring) is null, an error is returned:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", null ] }
          }
     }
   ]
)

Result:

uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "$indexOfCP requires a string as the second argument, found: null",
	"code" : 40094,
	"codeName" : "Location40094"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1

Wrong Data Type

If the first argument is the wrong data type (i.e. it doesn’t resolve to a string), $indexOfCP returns an error.

Suppose we have the following document:

{ "_id" : 7, "data" : 123 }

Here’s what happens when we apply $indexOfCP to that document:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 7 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfCP: [ "$data", "XYZ" ] }
          }
     }
   ]
)

Result:

uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "$indexOfCP requires a string as the first argument, found: double",
	"code" : 40093,
	"codeName" : "Location40093"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1

As the error message states, $indexOfCP requires a string as the first argument.
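If a collection mixes strings with other types, one possible way to avoid this error is to check the type before calling $indexOfCP. The following is a sketch only: it uses the standard $type and $cond operators, assumes that $cond evaluates only the branch it selects, and the choice to return null for non-string values is ours rather than anything MongoDB prescribes.

```javascript
// Sketch of a guarded pipeline. The "data" field and the _id values come
// from the examples above; returning null for non-strings is our choice.
const pipeline = [
  { $match: { _id: { $in: [ 5, 6, 7 ] } } },
  {
    $project: {
      _id: 0,
      data: 1,
      result: {
        $cond: [
          // Only call $indexOfCP when the field actually holds a string.
          { $eq: [ { $type: "$data" }, "string" ] },
          { $indexOfCP: [ "$data", "XYZ" ] },
          null
        ]
      }
    }
  }
];

// In the MongoDB shell, this would be run as:
// db.test.aggregate(pipeline)
```

Under those assumptions, the documents with a missing field, a null value, and a number should all produce a null result instead of an error.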