In MongoDB, the $indexOfCP aggregation pipeline operator searches a string for an occurrence of a substring and returns the UTF-8 code point index of the first occurrence.
The code point index is zero-based (i.e. it starts at 0).
Syntax
The syntax goes like this:
{ $indexOfCP: [ <string expression>, <substring expression>, <start>, <end> ] }
Where:
<string expression> is the string to search.
<substring expression> is the substring you want to find in the string.
<start> is an optional argument that specifies a starting index position for the search. It can be any valid expression that resolves to a non-negative integral number.
<end> is an optional argument that specifies an ending index position for the search. It can be any valid expression that resolves to a non-negative integral number.
If the substring isn’t found, $indexOfCP returns -1.
If the substring occurs more than once, only the index of the first occurrence is returned.
Example
Suppose we have a collection called test with the following documents:
{ "_id" : 1, "data" : "c 2021" }
{ "_id" : 2, "data" : "© 2021" }
{ "_id" : 3, "data" : "ไม้เมือง" }
Here’s an example of applying $indexOfCP to those documents:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 1, 2, 3 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "2021" ] }
        }
    }
  ]
)
Result:
{ "data" : "c 2021", "result" : 2 }
{ "data" : "© 2021", "result" : 2 }
{ "data" : "ไม้เมือง", "result" : -1 }
In the first two documents, the substring was found at UTF-8 code point index position 2. Because $indexOfCP results are zero-based (the index starts at 0), position 2 represents the third code point.
This is a different result to what we’d get if we used $indexOfBytes, because the copyright symbol (©) in the second document takes up 2 bytes in UTF-8, so $indexOfBytes would return 3 for that document. But the symbol is just one code point, the same as the letter c.
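To see the difference between bytes and code points outside of MongoDB, here’s a minimal sketch in plain JavaScript (Node.js assumed). It isn’t a pipeline, and the helper name prefixSizes is made up for this illustration. It simply measures the text that comes before the substring in both units.

```javascript
// Plain JavaScript, not a MongoDB pipeline.
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: size of the text before the first match,
// measured in UTF-8 bytes and in code points.
function prefixSizes(str, sub) {
  const at = str.indexOf(sub);
  if (at === -1) return null; // substring not present
  const prefix = str.slice(0, at);
  return {
    bytes: encoder.encode(prefix).length,
    codePoints: Array.from(prefix).length,
  };
}

const plain = prefixSizes("c 2021", "2021");
const symbol = prefixSizes("© 2021", "2021");

console.log(plain);  // { bytes: 2, codePoints: 2 }
console.log(symbol); // { bytes: 3, codePoints: 2 }
```

The code point count is the same for both strings, while the byte count differs by one because of the copyright symbol.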
Regarding the third document, the substring wasn’t found at all, and so the result is -1.
Here’s another example, except this time we search for a Thai character:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 1, 2, 3 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "เ" ] }
        }
    }
  ]
)
Result:
{ "data" : "c 2021", "result" : -1 }
{ "data" : "© 2021", "result" : -1 }
{ "data" : "ไม้เมือง", "result" : 3 }
In this case, we searched for a character that’s in the third document, and its UTF-8 code point index comes back as 3. Given that $indexOfCP results are zero-based, this means it’s the fourth code point.
This is because the second character has a diacritic mark, which is also a code point. The first character is one code point and the second character is two code points (including the diacritic), which makes three code points before our character. Our character therefore starts at the fourth position, which is code point index 3 because the index count starts at 0.
See MongoDB $strLenCP for an example that returns the number of code points for each character in this particular string. And see MongoDB $strLenBytes to see the number of bytes in the same string.
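If you’d like to see the individual code points without running another pipeline, here’s a small plain JavaScript sketch (Node.js assumed) that splits the Thai string by code point. The variable names are just for illustration.

```javascript
const word = "ไม้เมือง";

// Array.from() walks a string one code point at a time,
// so a combining mark gets its own slot in the array.
const codePoints = Array.from(word);

// Print each code point index alongside its Unicode value.
codePoints.forEach((cp, index) => {
  const hex = cp.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  console.log(index, "U+" + hex);
});

// Position of the character we searched for, counted in code points.
const position = codePoints.indexOf("เ");
console.log(position); // 3
```

The diacritic on the second character shows up as its own entry at index 2, which is why the character we searched for lands at index 3.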
Specify a Starting Position
You can provide a third argument to specify a starting index position for the search.
Suppose we have the following document:
{ "_id" : 4, "data" : "ABC XYZ ABC" }
Here’s an example of applying $indexOfCP with a starting position:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 4 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "ABC", 1 ] }
        }
    }
  ]
)
Result:
{ "data" : "ABC XYZ ABC", "result" : 8 }
In this case, the index of the second instance of the substring was returned. This is because we started the search at position 1, and the first instance of the substring starts at position 0 (before the starting position for the search).
If the start position is greater than the length of the string or greater than the ending position, $indexOfCP returns -1.
If it’s a negative number, $indexOfCP returns an error.
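Because the starting position can be any expression, one possible use is to feed the result of one search into the next, so that the second occurrence is found without hardcoding the position. The following is only a sketch. The names firstMatch and secondOccurrenceStage, and the output fields first and second, are made up for this example.

```javascript
// Expression that finds the first occurrence of "ABC".
const firstMatch = { $indexOfCP: [ "$data", "ABC" ] };

// Hypothetical $project stage: restart the search one code point
// after the first match to look for a second occurrence.
const secondOccurrenceStage = {
  $project: {
    _id: 0,
    data: 1,
    first: firstMatch,
    second: { $indexOfCP: [ "$data", "ABC", { $add: [ firstMatch, 1 ] } ] }
  }
};

// In the mongo shell, the stage could be used like this:
// db.test.aggregate([ { $match: { _id: 4 } }, secondOccurrenceStage ])
```

For the document above, we’d expect first to be 0 and second to be 8, in line with the earlier result. If the substring isn’t found at all, the first search returns -1, so the second search simply starts at position 0.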
Specify an Ending Position
You can also provide a fourth argument to specify the ending index position for the search.
If you provide this argument, you also need to provide a starting position. Failing to do so will result in this argument being interpreted as the starting point.
Example:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 4 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "XYZ", 0, 3 ] }
        }
    }
  ]
)
Result:
{ "data" : "ABC XYZ ABC", "result" : -1 }
The result is -1, which means the substring wasn’t found. That’s because we started our search at position 0 and ended it at position 3, so the search range didn’t reach the substring, which starts at position 4.
Here’s what happens if we increase the ending index position:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 4 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "XYZ", 0, 5 ] }
        }
    }
  ]
)
Result:
{ "data" : "ABC XYZ ABC", "result" : 4 }
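This time the substring was found and its index position returned. In this example the match begins at position 4, which is before the ending position of 5, even though the substring itself extends past it.
If the end position is less than the starting position, $indexOfCP returns -1.
If it’s a negative number, $indexOfCP returns an error.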
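To make the start and end rules easier to reason about, here’s a rough emulation in plain JavaScript (Node.js assumed). It is not MongoDB’s implementation. The function name and its error types are made up, and the rule that a match may begin at any index from start up to (but not including) end is an assumption inferred from the results shown above.

```javascript
// Rough emulation of the start/end behaviour seen in the examples.
function indexOfCP(str, sub, start = 0, end = Infinity) {
  if (start < 0 || end < 0) {
    throw new RangeError("start and end must be non-negative");
  }
  const haystack = Array.from(str); // one element per code point
  const needle = Array.from(sub);
  const stop = Math.min(end, haystack.length);

  // Assumption: a match may begin at any index i with start <= i < end.
  for (let i = start; i < stop; i++) {
    const fits = i + needle.length <= haystack.length;
    if (fits && needle.every((cp, j) => haystack[i + j] === cp)) {
      return i;
    }
  }
  return -1; // not found, or an empty search range
}

console.log(indexOfCP("ABC XYZ ABC", "XYZ", 0, 3)); // -1
console.log(indexOfCP("ABC XYZ ABC", "XYZ", 0, 5)); // 4
console.log(indexOfCP("ABC XYZ ABC", "ABC", 1));    // 8
```

A start position beyond the end of the string, or greater than the end position, leaves the loop with nothing to check, so the function falls through to -1.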
Missing Fields
If the field is not in the document, $indexOfCP returns null.
Suppose we have the following document:
{ "_id" : 5 }
Here’s what happens when we apply $indexOfCP:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 5 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "XYZ" ] }
        }
    }
  ]
)
Result:
{ "result" : null }
Null Values
If the first argument is null, $indexOfCP returns null.
Suppose we have the following document:
{ "_id" : 6, "data" : null }
Here’s what happens when we apply $indexOfCP:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 6 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "XYZ" ] }
        }
    }
  ]
)
Result:
{ "data" : null, "result" : null }
However, when the second argument (i.e. the substring) is null, an error is returned:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 1 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", null ] }
        }
    }
  ]
)
Result:
uncaught exception: Error: command failed: {
  "ok" : 0,
  "errmsg" : "$indexOfCP requires a string as the second argument, found: null",
  "code" : 40094,
  "codeName" : "Location40094"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1
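If a null result for a missing or null field isn’t what you want, one option is to substitute a default value with the $ifNull operator before searching. Here’s a sketch that assumes an empty string is an acceptable stand-in. The name nullSafeStage is made up for this example.

```javascript
// Hypothetical $project stage: treat a missing or null "data" field
// as an empty string before searching it.
const nullSafeStage = {
  $project: {
    _id: 0,
    data: 1,
    result: { $indexOfCP: [ { $ifNull: [ "$data", "" ] }, "XYZ" ] }
  }
};

// In the mongo shell, the stage could be used like this:
// db.test.aggregate([ { $match: { _id: { $in: [ 5, 6 ] } } }, nullSafeStage ])
```

With this in place the search runs against an empty string, so we’d expect -1 (not found) in place of null. This only covers the first argument. A null substring would still produce an error.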
Wrong Data Type
If the first argument is the wrong data type (i.e. it doesn’t resolve to a string), $indexOfCP returns an error.
Suppose we have the following document:
{ "_id" : 7, "data" : 123 }
Here’s what happens when we apply $indexOfCP to that document:
db.test.aggregate(
  [
    { $match: { _id: { $in: [ 7 ] } } },
    {
      $project:
        {
          _id: 0,
          data: 1,
          result: { $indexOfCP: [ "$data", "XYZ" ] }
        }
    }
  ]
)
Result:
uncaught exception: Error: command failed: {
  "ok" : 0,
  "errmsg" : "$indexOfCP requires a string as the first argument, found: double",
  "code" : 40093,
  "codeName" : "Location40093"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1
As the error message states, $indexOfCP requires a string as the first argument.
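If a collection might hold mixed data types, one way to avoid the error is to check the type before searching. Here’s a sketch that uses the $type and $cond aggregation operators, on the assumption that returning null for non-string values is acceptable. The name typeSafeStage is made up for this example.

```javascript
// Hypothetical $project stage: only search when "data" is a string,
// otherwise return null.
const typeSafeStage = {
  $project: {
    _id: 0,
    data: 1,
    result: {
      $cond: [
        { $eq: [ { $type: "$data" }, "string" ] }, // if
        { $indexOfCP: [ "$data", "XYZ" ] },        // then
        null                                       // else
      ]
    }
  }
};

// In the mongo shell, the stage could be used like this:
// db.test.aggregate([ { $match: { _id: { $in: [ 7 ] } } }, typeSafeStage ])
```

For the document above, the condition would be false, so we’d expect the result to be null and no error to be raised.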