When you create a text index in MongoDB, the index uses a default language of english.
The default language determines the rules used to parse word roots (i.e. stemming) and to ignore stop words.
However, you can change the default language if required.
You can also specify a language at the document level, and even at the subdocument level. The default language will only be used if a language hasn’t been specified at the document or subdocument level.
This article presents examples of specifying a language for a text index.
Example Collection
Suppose we have a collection called sitcoms with documents like this:
{
    "_id" : 1,
    "original_name" : "Family Guy",
    "translations" : {
        "language" : "german",
        "sitcom_name" : "Familienmensch"
    }
}
{
    "_id" : 2,
    "original_name" : "Cuéntame como pasó",
    "language" : "spanish",
    "translations" : [
        {
            "language" : "english",
            "sitcom_name" : "Tell me how it happened"
        },
        {
            "language" : "french",
            "sitcom_name" : "Raconte-moi comment cela s'est passé"
        }
    ]
}
We can see that there are two documents in this collection. Each document contains the name of a sitcom, along with translations of that sitcom name in different languages. The language of each translation is specified in the language field of the respective subdocument.
The second document in this collection also includes a language field at its top level (in this case, "language" : "spanish"). This means that the sitcom name is in Spanish (or at least, Spanish is the language we want to be used when this document is indexed).
However, the first document doesn’t contain such a field. The fact that the first document doesn’t contain a top-level language field means that we want it to be indexed using the default language. If no default language is specified during indexing, then the default language will be English.
If an embedded document doesn’t contain a field that specifies the language, then it will use the language field of the enclosing document. If the enclosing document doesn’t contain a language field, then it will use the default language.
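To make that fallback order concrete, here is a minimal sketch in plain JavaScript. It only models the rule described above and is not MongoDB's actual implementation. The function name effectiveLanguage and its parameters are hypothetical, chosen for illustration.

```javascript
// Illustrative model of the language fallback rule (not MongoDB's real code).
// overrideField is the name of the field that carries the language.
function effectiveLanguage(subdoc, parentDoc, defaultLanguage = "english", overrideField = "language") {
  // 1. A language set on the embedded document wins.
  if (subdoc && typeof subdoc[overrideField] === "string") {
    return subdoc[overrideField];
  }
  // 2. Otherwise fall back to the enclosing document's language.
  if (parentDoc && typeof parentDoc[overrideField] === "string") {
    return parentDoc[overrideField];
  }
  // 3. Otherwise use the index's default language.
  return defaultLanguage;
}

const familyGuy = {
  _id: 1,
  original_name: "Family Guy",
  translations: { language: "german", sitcom_name: "Familienmensch" }
};

// Hypothetical document whose embedded translation has no language field.
const spanishShow = {
  _id: 3,
  language: "spanish",
  translations: { sitcom_name: "Ejemplo" }
};

console.log(effectiveLanguage(familyGuy.translations, familyGuy));     // german
console.log(effectiveLanguage(null, familyGuy));                       // english
console.log(effectiveLanguage(spanishShow.translations, spanishShow)); // spanish
```

Passing null as the first argument represents a top-level field such as original_name, which has no embedded document of its own.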
Create a Text Index for Multiple Languages
Let’s go ahead and create a text index for the above collection.
db.sitcoms.createIndex(
    {
        "original_name": "text",
        "translations.sitcom_name": "text"
    }
)
That creates a single text index that covers both the original_name field and the translations.sitcom_name field (i.e. the sitcom_name field in the embedded documents).
Now let’s use getIndexes() to take a look at that index:
db.sitcoms.getIndexes()
Result:
[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_"
    },
    {
        "v" : 2,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "name" : "original_name_text_translations.sitcom_name_text",
        "weights" : {
            "original_name" : 1,
            "translations.sitcom_name" : 1
        },
        "default_language" : "english",
        "language_override" : "language",
        "textIndexVersion" : 3
    }
]
We can see that it uses a default language of English. This is specified as "default_language" : "english".
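If you only care about the language settings, you can pick them out of the getIndexes() result instead of reading the whole array. The sketch below assumes a hypothetical helper called findTextIndex(). It runs on a copy of the result shown above so that it works outside the shell; in the shell you could pass db.sitcoms.getIndexes() instead.

```javascript
// Hypothetical helper: return the first text index found in a
// getIndexes() result, or null if there is none.
function findTextIndex(indexes) {
  // A text index has "text" as one of the values in its key document.
  return indexes.find((idx) => Object.values(idx.key).includes("text")) || null;
}

// Copy of the getIndexes() result shown above.
const sampleIndexes = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  {
    v: 2,
    key: { _fts: "text", _ftsx: 1 },
    name: "original_name_text_translations.sitcom_name_text",
    weights: { original_name: 1, "translations.sitcom_name": 1 },
    default_language: "english",
    language_override: "language",
    textIndexVersion: 3
  }
];

const textIndex = findTextIndex(sampleIndexes);
console.log(textIndex.default_language);  // english
console.log(textIndex.language_override); // language
```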
Change the Default Language
We can set a different default language when creating the index if required.
Let’s drop the index and recreate it with a different default language:
db.sitcoms.dropIndex("original_name_text_translations.sitcom_name_text")
db.sitcoms.createIndex(
    {
        "original_name": "text",
        "translations.sitcom_name": "text"
    },
    {
        "default_language": "danish"
    }
)
Let’s take a look at the index:
db.sitcoms.getIndexes()
Result:
[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_"
    },
    {
        "v" : 2,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "name" : "original_name_text_translations.sitcom_name_text",
        "default_language" : "danish",
        "weights" : {
            "original_name" : 1,
            "translations.sitcom_name" : 1
        },
        "language_override" : "language",
        "textIndexVersion" : 3
    }
]
We can see that the default language is now danish as specified.
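A collection can have only one text index, which is why we dropped the old index before creating the new one. The two steps can be wrapped in a small helper. This is only a sketch: recreateTextIndex() is a hypothetical name, and fakeCollection is a stand-in object so that the example runs outside the shell. It assumes the collection object exposes getIndexes(), dropIndex() and createIndex() in the same way db.sitcoms does in the shell.

```javascript
// Hypothetical helper: drop the existing text index (if any), then
// create a new one with the given keys and options.
function recreateTextIndex(collection, keys, options) {
  const existing = collection
    .getIndexes()
    .find((idx) => Object.values(idx.key).includes("text"));
  if (existing) {
    collection.dropIndex(existing.name);
  }
  return collection.createIndex(keys, options);
}

// In-memory stand-in for a collection, for demonstration only.
const fakeCollection = {
  indexes: [
    { key: { _fts: "text", _ftsx: 1 }, name: "old_text_index", default_language: "english" }
  ],
  getIndexes() {
    return this.indexes;
  },
  dropIndex(name) {
    this.indexes = this.indexes.filter((idx) => idx.name !== name);
  },
  createIndex(keys, options) {
    this.indexes.push({ key: { _fts: "text", _ftsx: 1 }, name: "new_text_index", ...options });
    return "new_text_index";
  }
};

recreateTextIndex(
  fakeCollection,
  { original_name: "text", "translations.sitcom_name": "text" },
  { default_language: "danish" }
);

console.log(fakeCollection.getIndexes().map((idx) => idx.default_language)); // [ 'danish' ]
```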
The language_override Parameter
You may be wondering, “How did MongoDB know that the document’s language field is the field that specifies the language?”
And that’s a great question. After all, what if we’d given the field a different name – how would MongoDB know that it is the field to use for the language?
If you look at the above index, you’ll see that it has a language_override field. Specifically, it goes like this: "language_override" : "language"
What that means is that the document’s language field will be the field that the index uses to override the language.
When you create a text index, the index will look for any fields called language and then use those as the language for the respective document.
However, the name language isn’t set in stone. You can change it if you so desire.
Suppose our collection contains documents where the field names are in Danish.
Like this:
{
    "_id" : 1,
    "originalt_navn" : "Family Guy",
    "sprog" : "english",
    "oversættelser" : {
        "sprog" : "german",
        "sitcom-navn" : "Familienmensch"
    }
}
{
    "_id" : 2,
    "originalt_navn" : "Cuéntame como pasó",
    "sprog" : "spanish",
    "oversættelser" : [
        {
            "sprog" : "english",
            "sitcom-navn" : "Tell me how it happened"
        },
        {
            "sprog" : "french",
            "sitcom-navn" : "Raconte-moi comment cela s'est passé"
        }
    ]
}
In this case, sprog is the field that determines the language of each document.
Therefore, assuming the collection doesn’t already have a text index (a collection can have only one), we can create the index on the Danish field names as follows:
db.sitcoms.createIndex(
    {
        "originalt_navn": "text",
        "oversættelser.sitcom-navn": "text"
    },
    {
        "default_language": "danish",
        "language_override": "sprog"
    }
)
Let’s check the index:
db.sitcoms.getIndexes()
Result:
[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_"
    },
    {
        "v" : 2,
        "key" : {
            "_fts" : "text",
            "_ftsx" : 1
        },
        "name" : "originalt_navn_text_oversættelser.sitcom-navn_text",
        "default_language" : "danish",
        "language_override" : "sprog",
        "weights" : {
            "originalt_navn" : 1,
            "oversættelser.sitcom-navn" : 1
        },
        "textIndexVersion" : 3
    }
]
In our newly created text index, we have the default_language as danish, and the language_override field as sprog.
Available Languages
At the time of writing, there are around 15 languages that are supported by text indexes and the $text operator.
You can use the long-form language name (as in the above examples) or the two-letter ISO 639-1 language code.
A list of text search languages is available on the MongoDB website.
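As an illustration of the two forms, the sketch below maps two-letter codes to long-form names and normalizes either form to the long name. The normalizeLanguage() helper is hypothetical, and the table reflects the supported languages as I understand them at the time of writing, so check it against the official list before relying on it.

```javascript
// Two-letter code -> long-form name (verify against the official list).
const LANGUAGE_BY_CODE = new Map([
  ["da", "danish"],
  ["nl", "dutch"],
  ["en", "english"],
  ["fi", "finnish"],
  ["fr", "french"],
  ["de", "german"],
  ["hu", "hungarian"],
  ["it", "italian"],
  ["nb", "norwegian"],
  ["pt", "portuguese"],
  ["ro", "romanian"],
  ["ru", "russian"],
  ["es", "spanish"],
  ["sv", "swedish"],
  ["tr", "turkish"]
]);

const LONG_NAMES = new Set(LANGUAGE_BY_CODE.values());

// Hypothetical helper: return the long-form name for a code or a name,
// or null if the value is not in the table.
function normalizeLanguage(value) {
  const key = String(value).trim().toLowerCase();
  if (LANGUAGE_BY_CODE.has(key)) {
    return LANGUAGE_BY_CODE.get(key);
  }
  return LONG_NAMES.has(key) ? key : null;
}

console.log(normalizeLanguage("es"));      // spanish
console.log(normalizeLanguage("German"));  // german
console.log(normalizeLanguage("klingon")); // null
```

A check like this could be run over a document's language values before inserting it, to catch values that the text index would not recognize.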