$strLenBytes vs $strLenCP in MongoDB: What’s the Difference?

MongoDB includes the $strLenBytes and $strLenCP operators in its aggregation pipeline framework. Both return the length of a string, but they measure it in different units. In some cases both will return exactly the same result, while in other cases the results will differ.

Here’s a quick overview of the difference between these two operators.

The Difference

Here’s a definition of each operator:

  • $strLenBytes returns the number of UTF-8 encoded bytes in the specified string.
  • $strLenCP returns the number of UTF-8 code points in the specified string.

Note the difference: one returns the number of bytes, the other returns the number of code points.

When working with strings in English, the number of bytes will usually be the same as the number of code points. That's because UTF-8 encodes each ASCII character (basic Latin letters, digits, and common punctuation) as a single byte.

But when working with languages whose characters sit in other Unicode blocks, you might find that each code point uses two or three bytes. This is also true of other Unicode code points such as symbols. Some characters, including many emoji, use four bytes for a single code point.
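To see these byte widths outside MongoDB, here is a minimal JavaScript sketch. It assumes a runtime that provides the standard TextEncoder (modern browsers and recent Node.js versions), and the helper name utf8Lengths is made up for illustration; it is not part of MongoDB. It counts code points by iterating over the string and counts bytes by encoding the string as UTF-8, which should mirror what $strLenCP and $strLenBytes report for well-formed strings.

```javascript
// Hypothetical helper (not a MongoDB API): counts code points and UTF-8 bytes.
const encoder = new TextEncoder(); // always encodes to UTF-8

function utf8Lengths(str) {
  return {
    // Spreading a string iterates by code point, not by UTF-16 code unit.
    codePoints: [...str].length,
    // The encoded Uint8Array holds one element per UTF-8 byte.
    bytes: encoder.encode(str).length,
  };
}

console.log(utf8Lengths("Fast dog"));  // 8 code points, 8 bytes
console.log(utf8Lengths("\u00E9"));    // é: 1 code point, 2 bytes
console.log(utf8Lengths("\u2118"));    // ℘: 1 code point, 3 bytes
console.log(utf8Lengths("\u{1F600}")); // an emoji: 1 code point, 4 bytes
```

The characters are written as escape sequences so the example does not depend on how the source file itself is saved.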

Example

Suppose we have a collection called unicode with the following documents:

{ "_id" : 1, "data" : "é" }
{ "_id" : 2, "data" : "©" }
{ "_id" : 3, "data" : "℘" }

And now let’s apply both $strLenBytes and $strLenCP to the data field:

db.unicode.aggregate(
   [
     {
       $project:
          {
            _id: 0,
            data: 1,
            strLenCP: { $strLenCP: "$data" },
            strLenBytes: { $strLenBytes: "$data" }
          }
     }
   ]
)

Result:

{ "data" : "é", "strLenCP" : 1, "strLenBytes" : 2 }
{ "data" : "©", "strLenCP" : 1, "strLenBytes" : 2 }
{ "data" : "℘", "strLenCP" : 1, "strLenBytes" : 3 }

We can see that all characters use just one code point, but the first two documents each use two bytes and the third document uses three bytes.

English Characters

Suppose we have a collection called english with the following documents:

{ "_id" : 1, "data" : "Fast dog" }
{ "_id" : 2, "data" : "F" }
{ "_id" : 3, "data" : "a" }
{ "_id" : 4, "data" : "s" }
{ "_id" : 5, "data" : "t" }
{ "_id" : 6, "data" : " " }
{ "_id" : 7, "data" : "d" }
{ "_id" : 8, "data" : "o" }
{ "_id" : 9, "data" : "g" }

And now let’s apply both $strLenBytes and $strLenCP to the data field:

db.english.aggregate(
   [
     {
       $project:
          {
            _id: 0,
            data: 1,
            strLenCP: { $strLenCP: "$data" },
            strLenBytes: { $strLenBytes: "$data" }
          }
     }
   ]
)

Result:

{ "data" : "Fast dog", "strLenCP" : 8, "strLenBytes" : 8 }
{ "data" : "F", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : "a", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : "s", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : "t", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : " ", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : "d", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : "o", "strLenCP" : 1, "strLenBytes" : 1 }
{ "data" : "g", "strLenCP" : 1, "strLenBytes" : 1 }

In this case, all characters use one code point and one byte each.

Thai Characters

Here’s an example that uses Thai characters to demonstrate that not all languages use one byte per code point.

Suppose we have a collection called thai with the following documents:

{ "_id" : 1, "data" : "ไม้เมือง" }
{ "_id" : 2, "data" : "ไ" }
{ "_id" : 3, "data" : "ม้" }
{ "_id" : 4, "data" : "เ" }
{ "_id" : 5, "data" : "มื" }
{ "_id" : 6, "data" : "อ" }
{ "_id" : 7, "data" : "ง" }

Here’s what happens when we apply both $strLenBytes and $strLenCP to the data field:

db.thai.aggregate(
   [
     {
       $project:
          {
            _id: 0,
            data: 1,
            strLenCP: { $strLenCP: "$data" },
            strLenBytes: { $strLenBytes: "$data" }
          }
     }
   ]
)

Result:

{ "data" : "ไม้เมือง", "strLenCP" : 8, "strLenBytes" : 24 }
{ "data" : "ไ", "strLenCP" : 1, "strLenBytes" : 3 }
{ "data" : "ม้", "strLenCP" : 2, "strLenBytes" : 6 }
{ "data" : "เ", "strLenCP" : 1, "strLenBytes" : 3 }
{ "data" : "มื", "strLenCP" : 2, "strLenBytes" : 6 }
{ "data" : "อ", "strLenCP" : 1, "strLenBytes" : 3 }
{ "data" : "ง", "strLenCP" : 1, "strLenBytes" : 3 }
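In these results every Thai code point uses three bytes. Two of the entries report two code points even though each looks like a single character on screen. That's because a Thai tone mark or an above-the-consonant vowel is stored as its own combining code point after the consonant. The following JavaScript sketch breaks one such string into its code points. It assumes a runtime with the standard TextEncoder, and the helper name describeCodePoints is hypothetical rather than anything MongoDB provides.

```javascript
// Hypothetical helper (not a MongoDB API): lists each code point in a string
// together with the number of bytes it occupies in UTF-8.
const utf8 = new TextEncoder();

function describeCodePoints(str) {
  // Array.from iterates a string by code point.
  return Array.from(str, (ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    bytes: utf8.encode(ch).length,
  }));
}

// The consonant U+0E21 followed by the tone mark U+0E49,
// which together render as one visible character.
const parts = describeCodePoints("\u0E21\u0E49");
console.log(parts);
// Two entries (U+0E21 and U+0E49), 3 bytes each, so 6 bytes in total.
```

Summing the bytes values gives the figure $strLenBytes reports for that string, and the number of entries matches $strLenCP.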