Create a Multi-Language Text Index in MongoDB

When you create a text index in MongoDB, the index uses a default language of english.

The default language determines the rules used to reduce words to their roots (i.e. stemming) and the list of stop words to ignore.

However, you can change the default language if required.

You can also specify a language at the document level, and even at the subdocument level. The default language will only be used if a language hasn’t been specified at the document or subdocument level.

This article presents examples of specifying a language for a text index.

Example Collection

Suppose we have a collection called sitcoms with documents like this:

{
	"_id" : 1,
	"original_name" : "Family Guy",
	"translations" : {
		"language" : "german",
		"sitcom_name" : "Familienmensch"
	}
}
{
	"_id" : 2,
	"original_name" : "Cuéntame como pasó",
	"language" : "spanish",
	"translations" : [
		{
			"language" : "english",
			"sitcom_name" : "Tell me how it happened"
		},
		{
			"language" : "french",
			"sitcom_name" : "Raconte-moi comment cela s'est passé"
		}
	]
}

We can see that there are two documents in this collection. Each document contains the name of a sitcom, along with translations of that sitcom name in different languages. The language of each translation is specified in the language field of the respective subdocument.

The second document in this collection also includes a language field at its top level (in this case, "language" : "spanish"). This means that the sitcom name is in Spanish (or at least, Spanish is the language we want to be used when this document is indexed).

However, the first document doesn’t contain such a field. The fact that the first document doesn’t contain a top-level language field means that we want it to be indexed using the default language. If no default language is specified during indexing, then the default language will be English.

If an embedded document doesn’t contain a field that specifies the language, then it will use the language field of the enclosing document. If the enclosing document doesn’t contain a language field, then it will use the default language.
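The fallback order described above can be sketched in plain JavaScript. This is only an illustration of the rule, not MongoDB's actual implementation. The effectiveLanguage function and its parameter names are hypothetical, made up for this example.

```javascript
// Illustrative only: mimics the fallback order described above.
// effectiveLanguage is a hypothetical helper, not part of MongoDB.
function effectiveLanguage(subdoc, parentDoc, defaultLanguage = "english", overrideField = "language") {
  // 1. A language named in the embedded document wins.
  if (subdoc && typeof subdoc[overrideField] === "string") {
    return subdoc[overrideField];
  }
  // 2. Otherwise fall back to the enclosing document's language.
  if (parentDoc && typeof parentDoc[overrideField] === "string") {
    return parentDoc[overrideField];
  }
  // 3. Otherwise fall back to the index's default language.
  return defaultLanguage;
}

const familyGuy = {
  _id: 1,
  original_name: "Family Guy",
  translations: { language: "german", sitcom_name: "Familienmensch" }
};

console.log(effectiveLanguage(familyGuy.translations, familyGuy)); // german
console.log(effectiveLanguage(null, familyGuy)); // english
```

Passing null as the first argument stands for the document's own top-level fields, which is why the second call falls through to the default language.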

Create a Text Index for Multiple Languages

Let’s go ahead and create a text index for the above collection.

db.sitcoms.createIndex( 
  { 
    "original_name": "text",
    "translations.sitcom_name": "text"
  }
)

That creates a single text index that covers both the original_name field and the translations.sitcom_name field (i.e. the sitcom_name field in the embedded documents).

Now let’s use getIndexes() to take a look at that index:

db.sitcoms.getIndexes()

Result:

[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_"
	},
	{
		"v" : 2,
		"key" : {
			"_fts" : "text",
			"_ftsx" : 1
		},
		"name" : "original_name_text_translations.sitcom_name_text",
		"weights" : {
			"original_name" : 1,
			"translations.sitcom_name" : 1
		},
		"default_language" : "english",
		"language_override" : "language",
		"textIndexVersion" : 3
	}
]

We can see that it uses a default language of English. This is specified as "default_language" : "english".
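With the index in place, a search might look like the following sketch. It assumes the sitcoms collection shown earlier and uses the $language option of the $text operator, because a search otherwise uses the index's default language. The typeof db check is only there so the snippet also loads outside mongosh; the search term is just an example, and no particular result is implied.

```javascript
// Filter for a German-language search; $language is an option of $text.
const germanSearch = {
  $text: { $search: "Familienmensch", $language: "german" }
};

// Runs only inside mongosh, where the global `db` object exists.
if (typeof db !== "undefined") {
  printjson(db.sitcoms.find(germanSearch).toArray());
}
```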

Change the Default Language

We can set a different default language when creating the index if required.

Let’s drop the index and recreate it with a different default language:

db.sitcoms.dropIndex("original_name_text_translations.sitcom_name_text")
db.sitcoms.createIndex( 
  { 
    "original_name": "text",
    "translations.sitcom_name": "text"
  },
  {
    "default_language": "danish"
  }
)

Let’s take a look at the index:

db.sitcoms.getIndexes()

Result:

[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_"
	},
	{
		"v" : 2,
		"key" : {
			"_fts" : "text",
			"_ftsx" : 1
		},
		"name" : "original_name_text_translations.sitcom_name_text",
		"default_language" : "danish",
		"weights" : {
			"original_name" : 1,
			"translations.sitcom_name" : 1
		},
		"language_override" : "language",
		"textIndexVersion" : 3
	}
]

We can see that the default language is now danish as specified.

The language_override Parameter

You may be wondering how MongoDB knew that the document’s language field is the field that specifies the language.

And that’s a great question. After all, if we’d given the field a different name, how would MongoDB know which field to use for the language?

If you look at the above index, you’ll see that it has a language_override field. Specifically, it goes like this: "language_override" : "language"

What that means is that the index uses each document’s language field to override the default language.

By default, a text index looks for a field called language in each document (and in each embedded document), and indexes that document using the language named there.

However, the name language isn’t set in stone. You can change it if you so desire.

Suppose our collection contains documents where the field names are in Danish.

Like this:

{
	"_id" : 1,
	"originalt_navn" : "Family Guy",
	"sprog" : "english",
	"oversættelser" : {
		"sprog" : "german",
		"sitcom-navn" : "Familienmensch"
	}
}
{
	"_id" : 2,
	"originalt_navn" : "Cuéntame como pasó",
	"sprog" : "spanish",
	"oversættelser" : [
		{
			"sprog" : "english",
			"sitcom-navn" : "Tell me how it happened"
		},
		{
			"sprog" : "french",
			"sitcom-navn" : "Raconte-moi comment cela s'est passé"
		}
	]
}

In this case, sprog is the field that determines the language of each document.

Therefore, we can create the index as follows. Note that the index keys must also use the Danish field names, otherwise the index wouldn’t cover any of the fields in these documents. (If the collection still has a text index from an earlier example, drop it first, because a collection can have only one text index.)

db.sitcoms.createIndex( 
  { 
    "originalt_navn": "text",
    "oversættelser.sitcom-navn": "text"
  },
  {
    "default_language": "danish",
    "language_override": "sprog"
  }
)

Let’s check the index:

db.sitcoms.getIndexes()

Result:

[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_"
	},
	{
		"v" : 2,
		"key" : {
			"_fts" : "text",
			"_ftsx" : 1
		},
		"name" : "originalt_navn_text_oversættelser.sitcom-navn_text",
		"default_language" : "danish",
		"language_override" : "sprog",
		"weights" : {
			"originalt_navn" : 1,
			"oversættelser.sitcom-navn" : 1
		},
		"textIndexVersion" : 3
	}
]

In our newly created text index, we have the default_language as danish, and the language_override field as sprog.
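If you’d rather check these two settings from a script than read them off the full output, something like the sketch below could work. The textIndexLanguageSettings function is a hypothetical helper written for this example; it relies only on the fields that getIndexes() returned above. The sample array is a trimmed copy of that output, used when the snippet is run outside mongosh.

```javascript
// Hypothetical helper: picks the language settings out of getIndexes() output.
function textIndexLanguageSettings(indexes) {
  // A text index is recognisable by the "_fts" : "text" entry in its key.
  const textIndex = indexes.find((ix) => ix.key && ix.key._fts === "text");
  if (!textIndex) {
    return null; // the collection has no text index
  }
  return {
    defaultLanguage: textIndex.default_language,
    languageOverride: textIndex.language_override
  };
}

// Trimmed copy of the result shown above, used when no `db` is available.
const sampleIndexes = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  {
    v: 2,
    key: { _fts: "text", _ftsx: 1 },
    default_language: "danish",
    language_override: "sprog",
    textIndexVersion: 3
  }
];

const indexes = typeof db !== "undefined" ? db.sitcoms.getIndexes() : sampleIndexes;
console.log(textIndexLanguageSettings(indexes));
```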

Available Languages

At the time of writing, there are around 15 languages that are supported by text indexes and the $text operator.

You can use the long-form language name (as in the above examples) or the two-letter ISO 639-1 language code.
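As a rough sketch of the short form, the index from the first example could be created with a two-letter code instead of the full name. The isoCodes lookup below is a hypothetical convenience that lists only the languages used in this article; check the MongoDB list before relying on any other code. The typeof db check is only there so the snippet also loads outside mongosh.

```javascript
// Hypothetical lookup covering only the languages used in this article.
const isoCodes = {
  danish: "da",
  english: "en",
  french: "fr",
  german: "de",
  spanish: "es"
};

const keys = { original_name: "text", "translations.sitcom_name": "text" };
const options = { default_language: isoCodes.danish }; // same as "danish"

// Runs only inside mongosh, where the global `db` object exists.
if (typeof db !== "undefined") {
  db.sitcoms.createIndex(keys, options);
}
```

If the collection already has a text index, drop it before running this, as was done in the earlier example.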

A list of text search languages is available on the MongoDB website.