MongoDB $substrCP

In MongoDB, the $substrCP aggregation pipeline operator returns the substring of a string, based on the specified UTF-8 code point indexes.

Syntax

The syntax goes like this:

{ $substrCP: [ <string expression>, <code point index>, <code point count> ] }

Where:

<string expression> is the string. It can be any valid expression as long as it resolves to a string.
<code point index> is where to start the substring. It can be any valid expression as long as it resolves to a non-negative integer.
<code point count> is how many code points the substring should continue for. It can be any valid expression as long as it resolves to a non-negative integer or number that can be represented as an integer.

Example

Imagine we have a collection called tests with the following document:

{ "_id" : 1, "data" : "Red Firetruck" }

We can use $substrCP like this:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 0, 3 ] }
          }
     }
   ]
)

Result:

{ "data" : "Red Firetruck", "result" : "Red" }

The index starts at zero, and so our substring started at the start of the string, and continued for three code points.

In this case, we’re using English characters and each character has one code point. This makes it easy for us to count how many code points to use.

Let’s run another example:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result_1: { $substrCP: [ "$data", 4, 4 ] },
            result_2: { $substrCP: [ "$data", 8, 5 ] },
            result_3: { $substrCP: [ "$data", 8, 20 ] }
          }
     }
   ]
).pretty()

Result:

{
	"data" : "Red Firetruck",
	"result_1" : "Fire",
	"result_2" : "truck",
	"result_3" : "truck"
}

Notice in our third result, we specified more code points than were available, but it simply returned all characters to the end of the string.

Diacritic Marks

Some characters have a diacritic mark added, resulting in multiple code points.

Suppose we have a collection called thai which contains the following documents:

{ "_id" : 1, "data" : "ไม้เมือง" }
{ "_id" : 2, "data" : "ไ" }
{ "_id" : 3, "data" : "ม้" }
{ "_id" : 4, "data" : "เ" }
{ "_id" : 5, "data" : "มื" }
{ "_id" : 6, "data" : "อ" }
{ "_id" : 7, "data" : "ง" }

These documents contain Thai characters. We can see that two of these characters include a diacritic (a small glyph above the initial glyph).

Documents 2 through 7 simply list out each of the characters that are in document 1.

Before we take a substring, let’s find out how many code points each of these characters have by using the $strLenCP operator:

db.thai.aggregate(
   [
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $strLenCP: "$data" }
          }
     }
   ]
)

Result:

{ "data" : "ไม้เมือง", "result" : 8 }
{ "data" : "ไ", "result" : 1 }
{ "data" : "ม้", "result" : 2 }
{ "data" : "เ", "result" : 1 }
{ "data" : "มื", "result" : 2 }
{ "data" : "อ", "result" : 1 }
{ "data" : "ง", "result" : 1 }

We can see that the two characters with the diacritics have two code points, and the others have one code point.

Let’s apply $substrCP to the first document:

db.thai.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 1, 2 ] }
          }
     }
   ]
)

Result:

{ "data" : "ไม้เมือง", "result" : "ม้" }

Based on our starting point of 1 and our code point count of 2, we get the second character and its associated diacritic.

Separate the Glyphs

In the previous example our third argument was 2 so that it returned the character and diacritic together. Here’s what happens when we provide a third argument of 1.

db.thai.aggregate(
   [
     { $match: { _id: { $in: [ 1 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 1, 1 ] }
          }
     }
   ]
)

Result:

{ "data" : "ไม้เมือง", "result" : "ม" }

The first character is returned without the diacritic mark.

Other Data Types

The $substrCP operator only works on strings. However, if you have another data type, it should still work, as long as it can resolve to a string.

Suppose we have the following document:

{ "_id" : 2, "data" : 123456 }

The data field contains a number.

Here’s what happens when we apply $substrCP to that field:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 2 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 1, 2 ] }
          }
     }
   ]
)

Result:

{ "data" : 123456, "result" : "23" }

It managed to do the job fine (although bear in mind that the result is a string – not a number).

We have another document with a Date object:

{ "_id" : 3, "data" : ISODate("2021-01-03T23:30:15.100Z") }

Now let’s apply $substrCP to that document:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 3 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 0, 4 ] }
          }
     }
   ]
)

Result:

{ "data" : ISODate("2021-01-03T23:30:15.100Z"), "result" : "2021" }

So it worked fine in this scenario too.

Null Values

If the string is null, the result is an empty string.

Suppose we have the following document:

{ "_id" : 4, "data" : null }

Here’s what happens when we apply $substrCP to that document:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 4 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 1, 2 ] }
          }
     }
   ]
)

Result:

{ "data" : null, "result" : "" }

Missing Field

Attempting to get a substring from a field that doesn’t exist results in an empty string.

Suppose we have the following document:

{ "_id" : 5 }

Apply $substrCP:

db.test.aggregate(
   [
     { $match: { _id: { $in: [ 5 ] } } },
     {
       $project:
          {
            _id: 0,
            data: 1,
            result: { $substrCP: [ "$data", 1, 2 ] }
          }
     }
   ]
)

Result:

{ "result" : "" }