MongoDB $indexOfBytes

In MongoDB, the $indexOfBytes aggregation pipeline operator searches a string for an occurrence of a substring and returns the UTF-8 byte index of the first occurrence.

The UTF-8 byte index is zero-based (i.e. it starts at 0).

Syntax

The syntax goes like this:

{ $indexOfBytes: [ <string expression>, <substring expression>, <start>, <end> ] }

Where:

  • <string expression> is the string to search.
  • <substring expression> is the substring you want to find in the string.
  • <start> is an optional argument that specifies a starting index position for the search. Can be any valid expression that resolves to a non-negative integral number.
  • <end> is an optional argument that specifies an ending index position for the search. Can be any valid expression that resolves to a non-negative integral number.

If the specified value isn’t found, $indexOfBytes returns -1.

If there are multiple instances of the specified value, only the index of the first one is returned.

Example

Suppose we have a collection called test with the following documents:

{ "_id" : 1, "data" : "c 2021" }
{ "_id" : 2, "data" : "© 2021" }
{ "_id" : 3, "data" : "ไม้เมือง" }

Here’s an example of applying $indexOfBytes to those documents:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1, 2, 3 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "2021" ] }
          }
     }
   ]
)

Result:

{ "data" : "c 2021", "result" : 2 }
{ "data" : "© 2021", "result" : 3 }
{ "data" : "ไม้เมือง", "result" : -1 }

We can see that the first two documents produced different results, even though the substring appears to be in the same position for each document. In the first document, the substring was found at byte index position 2, whereas the second document had it at 3.

The reason for this is that the copyright symbol (©) in the second document takes up 2 bytes. The c character (in the first document) only uses 1 byte. The space character also uses 1 byte.

The result of $indexOfBytes is zero-based (the index starts at 0), and so we end up with a result of 2 and 3 respectively.

Regarding the third document, the substring wasn’t found at all, and so the result is -1.
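
The byte arithmetic above can be checked outside MongoDB. The following is a minimal sketch that assumes a Node.js runtime: it relies on Node's Buffer class rather than any MongoDB API, and the helper name byteIndexOf is made up for illustration.

```javascript
// Hypothetical helper (not part of MongoDB): byte offset of `sub` within `str`
// when both are encoded as UTF-8. Returns -1 when `sub` is not found.
function byteIndexOf(str, sub) {
  return Buffer.from(str, "utf8").indexOf(Buffer.from(sub, "utf8"));
}

console.log(Buffer.byteLength("c", "utf8"));   // 1
console.log(Buffer.byteLength("©", "utf8"));   // 2
console.log(byteIndexOf("c 2021", "2021"));    // 2
console.log(byteIndexOf("© 2021", "2021"));    // 3
console.log(byteIndexOf("ไม้เมือง", "2021"));  // -1
```

The extra byte used by the copyright symbol is what shifts the second result from 2 to 3.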

Here’s another example, except this time we search for a Thai character:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1, 2, 3 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "เ" ] }
          }
     }
   ]
)

Result:

{ "data" : "c 2021", "result" : -1 }
{ "data" : "© 2021", "result" : -1 }
{ "data" : "ไม้เมือง", "result" : 9 }

In this case, we searched for the third character in the third document (not counting the diacritic mark attached to the second character), and its UTF-8 byte index comes back as 9.

This is because every character in this string uses 3 bytes, and the second character carries a diacritic mark, which also uses 3 bytes. The first two characters (including the diacritic) therefore use 9 bytes. Given the zero-based indexing, their UTF-8 byte indexes range from 0 to 8, which means that the third character starts at position 9.

See MongoDB $strLenBytes for an example that returns the number of bytes for each character in this particular string.
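
To see where each byte offset in the Thai string comes from, the string can be walked one code point at a time. This is only an illustrative sketch that assumes a Node.js runtime (Buffer is a Node.js API, not a MongoDB one); the variable names are arbitrary.

```javascript
const text = "ไม้เมือง";

// Record the starting byte offset and UTF-8 size of every code point.
const rows = [];
let offset = 0;
for (const ch of text) {                        // for...of iterates by code point
  const size = Buffer.byteLength(ch, "utf8");
  rows.push({ ch, offset, size });
  offset += size;
}

console.log(rows.find((row) => row.ch === "เ").offset);  // 9
console.log(offset);                                      // 24 (total bytes)
```

The diacritic is a separate code point, which is why three 3-byte units precede the searched character.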

Specify a Starting Position

You can provide a third argument to specify a starting index position for the search.

Suppose we have the following document:

{ "_id" : 4, "data" : "ABC XYZ ABC" }

Here’s an example of applying $indexOfBytes with a starting position:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "ABC", 1 ] }
          }
     }
   ]
)

Result:

{ "data" : "ABC XYZ ABC", "result" : 8 }

In this case, the second instance of the substring was returned. This is because we started the search at position 1, and the first instance of the substring starts at position 0 (before the starting position for the search).

If the start position is a number greater than the byte length of the string or greater than the ending position, $indexOfBytes returns -1.

If the start position is a negative number, $indexOfBytes returns an error.
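
The starting position rules can be modelled in a few lines of JavaScript. This is a rough sketch under the assumption of a Node.js runtime, not MongoDB's implementation; the function name indexOfBytesFrom is hypothetical. The explicit guard for negative values matters because Node's own Buffer.indexOf treats a negative offset as counting from the end, which is not what $indexOfBytes does.

```javascript
// Hypothetical model of the <start> rules described above.
function indexOfBytesFrom(str, sub, start = 0) {
  if (!Number.isInteger(start) || start < 0) {
    // $indexOfBytes reports an error for a negative start position.
    throw new RangeError("start must be a non-negative integer");
  }
  const bytes = Buffer.from(str, "utf8");
  if (start > bytes.length) return -1;          // start beyond the byte length
  return bytes.indexOf(Buffer.from(sub, "utf8"), start);
}

console.log(indexOfBytesFrom("ABC XYZ ABC", "ABC"));      // 0
console.log(indexOfBytesFrom("ABC XYZ ABC", "ABC", 1));   // 8
console.log(indexOfBytesFrom("ABC XYZ ABC", "ABC", 50));  // -1
```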

Specify an Ending Position

You can also provide a fourth argument to specify the ending index position for the search.

If you provide this argument, you also need to provide a starting position. Failing to do so will result in this argument being interpreted as the starting point.

Example:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "XYZ", 0, 5 ] }
          }
     }
   ]
)

Result:

{ "data" : "ABC XYZ ABC", "result" : -1 }

The result is -1, which means the substring wasn’t found. That’s because we started our search at position 0 and ended it at position 5, while the substring occupies byte positions 4 to 6 and so doesn’t fit within that range.

Here’s what happens if we increase the ending index position:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "XYZ", 0, 7 ] }
          }
     }
   ]
)

Result:

{ "data" : "ABC XYZ ABC", "result" : 4 }

This time the value was included and its index position returned.

If the end position is a number less than the starting position, $indexOfBytes returns -1.

If the end position is a negative number, $indexOfBytes returns an error.
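
The effect of the ending position can be sketched in JavaScript as well. The sketch assumes a Node.js runtime and assumes that the end position acts as an exclusive upper bound, which is consistent with the two results above (5 misses the substring, 7 finds it). The function name indexOfBytesBetween is hypothetical, and this is a model of the rules, not MongoDB's code.

```javascript
// Hypothetical model: search only the bytes in the range [start, end).
function indexOfBytesBetween(str, sub, start, end) {
  if (start < 0 || end < 0) {
    // $indexOfBytes reports an error for negative positions.
    throw new RangeError("start and end must be non-negative");
  }
  const bytes = Buffer.from(str, "utf8");
  if (start > bytes.length || end < start) return -1;
  // Cut the buffer at `end`, then search from `start` onwards.
  return bytes.subarray(0, end).indexOf(Buffer.from(sub, "utf8"), start);
}

console.log(indexOfBytesBetween("ABC XYZ ABC", "XYZ", 0, 5));  // -1
console.log(indexOfBytesBetween("ABC XYZ ABC", "XYZ", 0, 7));  // 4
console.log(indexOfBytesBetween("ABC XYZ ABC", "XYZ", 5, 2));  // -1
```

In this model the whole substring has to lie before the end position, so an end of 6 would also miss "XYZ".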

Missing Fields

If the first argument refers to a field that is not in the document, $indexOfBytes returns null.

Suppose we have the following document:

{ "_id" : 5 }

Here’s what happens when we apply $indexOfBytes:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 5 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "XYZ" ] }
          }
     }
   ]
)

Result:

{ "result" : null }

Null Values

If the first argument is null, $indexOfBytes returns null.

Suppose we have the following document:

{ "_id" : 6, "data" : null }

Here’s what happens when we apply $indexOfBytes:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 6 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "XYZ" ] }
          }
     }
   ]
)

Result:

{ "data" : null, "result" : null }

However, when the second argument (i.e. the substring) is null, an error is returned:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", null ] }
          }
     }
   ]
)

Result:

uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "$indexOfBytes requires a string as the second argument, found: null",
	"code" : 40092,
	"codeName" : "Location40092"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1

Wrong Data Type

If the first argument is the wrong data type (i.e. it doesn’t resolve to a string), $indexOfBytes returns an error.

Suppose we have the following document:

{ "_id" : 7, "data" : 123 }

Here’s what happens when we apply $indexOfBytes to that document:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 7 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $indexOfBytes: [ "$data", "XYZ" ] }
          }
     }
   ]
)

Result:

uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "$indexOfBytes requires a string as the first argument, found: double",
	"code" : 40091,
	"codeName" : "Location40091"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1

As the error message states, $indexOfBytes requires a string as the first argument.
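
One possible way to avoid this error is to check the field's type before calling $indexOfBytes. The pipeline below is only a sketch: it uses the standard $cond and $type aggregation operators, the variable name pipeline is arbitrary, and returning null for non-string values is an assumption about what you would want, not something $indexOfBytes does by itself.

```javascript
// Sketch: only call $indexOfBytes when "data" is a string; otherwise return null.
const pipeline = [
  { $match: { _id: { $in: [ 4, 5, 6, 7 ] } } },
  {
    $project: {
      _id: 0,
      data: 1,
      result: {
        $cond: [
          { $eq: [ { $type: "$data" }, "string" ] },   // if: data is a string
          { $indexOfBytes: [ "$data", "XYZ" ] },       // then: search it
          null                                         // else: skip the search
        ]
      }
    }
  }
];

// In the mongo shell this would be run as: db.test.aggregate(pipeline)
```

With the documents used in this article, document 7 should then yield a null result instead of an error.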