MongoDB $substrBytes

In MongoDB, the $substrBytes aggregation pipeline operator returns the substring of a string, based on the specified UTF-8 encoded bytes indexes.

Syntax

The syntax goes like this:

{ $substrBytes: [ <string expression>, <byte index>, <byte count> ] }

Where:

<string expression> is the string. It can be any valid expression as long as it resolves to a string.
<byte index> is where to start the substring. It can be any valid expression as long as it resolves to a non-negative integer or number that can be represented as an integer.
<byte count> is how many bytes the substring should continue for. It can be any valid expression as long as it resolves to a non-negative integer or number that can be represented as an integer.

Example

Imagine we have a collection called tests with the following document:

{ "_id" : 1, "data" : "Red Firetruck" }

We can use $substrBytes like this:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrBytes: [ "$data", 0, 3 ] }
          }
     }
   ]
)

Result:

{ "data" : "Red Firetruck", "result" : "Red" }

The index starts at zero, and so our substring started at the start of the string, and continued for three bytes.

In this case, we’re using English characters and each character is one byte. This makes it easy for us to count how many bytes to use.

Let’s run another example:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result_1: { $substrBytes: [ "$data", 4, 4 ] },
            result_2: { $substrBytes: [ "$data", 8, 5 ] },
            result_3: { $substrBytes: [ "$data", 8, 20 ] }
          }
     }
   ]
).pretty()

Result:

{
	"data" : "Red Firetruck",
	"result_1" : "Fire",
	"result_2" : "truck",
	"result_3" : "truck"
}

Notice in our third result, we specified more bytes than were available, but it simply returned all characters to the end of the string.

Multi-Byte Characters

Some characters use more than one byte. Some use two, some use three, and some even use four.

Here’s an example of a document that contains a bunch of symbols:

{ "_id" : 2, "data" : "©♡★✪☆" }

Each of these characters use more than one byte. This means we need to be careful when extracting a substring. We need to be sure that our starting point doesn’t start halfway through a character. If it does, an error will occur. Likewise, we need to ensure that our end point doesn’t end halfway through a character.

For now, let’s apply $substrBytes without effecting an error:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 2 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            bytes: { $strLenBytes: [ "$data" ] },
            result: { $substrBytes: [ "$data", 0, 5 ] }
          }
     }
   ]
)

Result:

{ "data" : "©♡★✪☆", "bytes" : 14, "result" : "©♡" }

Based on our starting point of 0 and our byte length of 5, we get two characters in our result set. Therefore we can see that the first two characters use 5 bytes.

In this example I also used $strLenBytes to return the total number of bytes in the string. I did this mainly to show that the five characters use 14 bytes (multiple bytes per character).

Here’s a slightly modified example that returns each of the two characters returned:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 2 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            r1: { $substrBytes: [ "$data", 0, 2 ] },
            r2: { $substrBytes: [ "$data", 2, 3 ] }
          }
     }
   ]
)

Result:

{ "data" : "©♡★✪☆", "r1" : "©", "r2" : "♡" }

We can see that the first character uses two bytes and the second character uses three.

Wrong Starting Point

If your starting point is halfway through a character, an error occurs.

Example:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 2 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrBytes: [ "$data", 1, 2 ] }
          }
     }
   ]
)

Result:

Error: command failed: {
	"ok" : 0,
	"errmsg" : "$substrBytes:  Invalid range, starting index is a UTF-8 continuation byte.",
	"code" : 28656,
	"codeName" : "Location28656"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1

This error tells us that the starting index is a UTF-8 continuation byte. In other words, we tried to start halfway through a character.

Wrong End Point

It’s the same with the end point. If your end point is halfway through a character, an error occurs.

Example:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 2 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrBytes: [ "$data", 0, 1 ] }
          }
     }
   ]
)

Result:

Error: command failed: {
	"ok" : 0,
	"errmsg" : "$substrBytes:  Invalid range, ending index is in the middle of a UTF-8 character.",
	"code" : 28657,
	"codeName" : "Location28657"
} : aggregate failed :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:639:17
assert.commandWorked@src/mongo/shell/assert.js:729:16
DB.prototype._runAggregate@src/mongo/shell/db.js:266:5
DBCollection.prototype.aggregate@src/mongo/shell/collection.js:1058:12
@(shell):1:1

This time it tells us that the ending index is in the middle of a UTF-8 character.