ing a text index</h2><p id="8719">Now let’s do some basic full-text searches with the <code>text</code> index just created. We will use the <code>$text</code> query operator to perform text searches. For example:</p>
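<p>A minimal sketch of such a query in <code>mongosh</code> could look like the following. It is an assumption about the shape of the query rather than the author's exact code: the <code>name</code>-only projection and the variable names are chosen for illustration.</p>

```javascript
// Filter for a basic full-text search. The search string is split into
// tokens, so this matches documents containing "HP" OR "ProBook".
const basicFilter = { $text: { $search: "HP ProBook" } };

// Return only the name field (_id is included by default).
const basicProjection = { name: 1 };

// `db` is a global only inside mongosh; the guard keeps the snippet
// loadable in plain Node.js, where the query is simply skipped.
if (typeof db !== "undefined") {
  printjson(db.laptops.find(basicFilter, basicProjection).toArray());
}
```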
<figure id="8957">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/884b122264f668cb1855a422a026f1d4.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><p id="433a">The <code>$text</code> operator takes a <code>$search</code> field, which accepts the string to search for. Under the hood, the search string is tokenized using whitespace and punctuation as delimiters. Each resulting token is searched independently, and the matches are combined with a logical <code>OR</code>. The search is case-insensitive by default. If you want to do case-sensitive searches, you can set the <a href="https://docs.mongodb.com/manual/reference/operator/query/text/#case-and-diacritic-insensitive-search"><code>$caseSensitive</code></a> field of the <code>$text</code> operator to <code>true</code>.</p><p id="6896">Therefore, with the search query above, we get documents containing either “HP” or “ProBook”, but not necessarily both.</p><div id="b13b"><pre>[
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">19</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ZBook Model 19'</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">20</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ZBook Model 20'</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">3</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP EliteBook Model 3'</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">18</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ProBook Model 18'</span> },
...
]</pre></div><h2 id="ecc6">Sort by textScore</h2><p id="64c3">Importantly, with the <code>$text</code> operator, there is a score assigned for each document indicating how well the document matched the search string. If both “HP” and “ProBook” match a document, the document gets a score higher than the one that only matches “HP” or “ProBook”. We can sort the documents by the scores and only get the top ones with the <code>limit()</code> method.</p>
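<p>As a hedged sketch, sorting by relevance and keeping the top three matches might be written as below. The variable names and the limit of 3 are illustrative assumptions. On MongoDB versions before 4.4, the same <code>$meta</code> expression would also have to appear in the projection.</p>

```javascript
const scoredFilter = { $text: { $search: "HP ProBook" } };

// Sort specification: the key name ("score") is arbitrary, and sorting
// on textScore is always descending.
const byTextScore = { score: { $meta: "textScore" } };

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  printjson(
    db.laptops
      .find(scoredFilter, { name: 1 })
      .sort(byTextScore)
      .limit(3)
      .toArray()
  );
}
```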
<figure id="d878">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/f7e5a41c99fbab9377695a8606ebab27.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><p id="9525">It may seem strange that the score is accessed through the <code>{$meta: "textScore"}</code> expression. Two more things can look even weirder at first sight:</p><ul><li>The field name given (<code>score</code>) is not important. You can give a different name and it will still work.</li><li>Sorting by score is always in descending order. This makes sense, as we normally want the most relevant matches first.</li></ul><p id="1f93">With this query, we can get the most relevant results we want:</p><div id="ee20"><pre>[
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">15</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ProBook Model 15'</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">16</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ProBook Model 16'</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">18</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ProBook Model 18'</span> }
]</pre></div><h2 id="f307">Search by a phrase</h2><p id="5c62">If we only want to find the documents that contain exactly “HP ProBook”, we can search by a phrase, which simply means wrapping the search string in a nested pair of quotes. We can either alternate single and double quotes or escape the inner quotes with backslashes. The queries below give the same results:</p>
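<p>The two quoting styles can be sketched as follows. The variable names and the projection are illustrative assumptions. Both filters carry exactly the same search string, which is why they return the same documents.</p>

```javascript
// Two equivalent ways to write the phrase "HP ProBook" inside $search:
// alternating quote styles, or escaping the inner double quotes.
const phraseAlternating = { $text: { $search: '"HP ProBook"' } };
const phraseEscaped = { $text: { $search: "\"HP ProBook\"" } };

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  printjson(db.laptops.find(phraseAlternating, { name: 1 }).toArray());
  printjson(db.laptops.find(phraseEscaped, { name: 1 }).toArray());
}
```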
<figure id="1635">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/e9c05f492dc13d6b8449c550f64e06dd.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><h2 id="2ae3">Use negation in the search query</h2><p id="460a">We can also negate a token in the search query by prefixing it with a hyphen (<code>-</code>), which excludes documents that contain that token. Let’s search for the laptops that are “HP” but not “ProBook”:</p>
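<p>A hedged sketch of such a negated search is shown here, with the token to exclude prefixed by a hyphen. The variable name and the projection are illustrative assumptions.</p>

```javascript
// A leading hyphen negates a token: match "HP" but exclude "ProBook".
const negatedFilter = { $text: { $search: "HP -ProBook" } };

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  printjson(db.laptops.find(negatedFilter, { name: 1 }).toArray());
}
```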
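<figure id="52bc">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/7b7599b37eb8a2a3e9aa9e2f4445aa40.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><p id="722a">“ProBook” no longer appears in the result list:</p><div id="748c"><pre>[
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">19</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ZBook Model 19'</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">20</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP ZBook Model 20'</span> },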
{ <span class="hljs-variable">_id</span>: <span class="hljs-number">3</span>, <span class="hljs-built_in">name</span>: <span class="hljs-string">'HP EliteBook Model 3'</span> },
...
]</pre></div><h2 id="234c">Text search in a nested document</h2><p id="9479">Let’s now search with an attribute as well to see if the <code>text</code> index covers both the <code>name</code> and <code>attributes</code> fields:</p>
<figure id="7fe7">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/a2b35d1e7d158dc555892b531e3868ea.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><p id="29ed">Note that we need to sort by score here, otherwise the top results may not be the ones you expect. This is because “HP 1TB” is not a phrase: “HP” occurs in the <code>name</code> field and “1TB” in the <code>attributes.attribute_value</code> field. As a logical <code>OR</code> is used by default, the documents returned contain either “HP” or “1TB”, but not necessarily both. With the <code>sort()</code> and <code>limit()</code> methods, we return only the most relevant results, which is normally what we want.</p>
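<p>Under the assumption that the text index covers both <code>name</code> and <code>attributes.attribute_value</code>, a sorted version of this search could be sketched as follows. The limit of 3 and the variable names are illustrative.</p>

```javascript
// "HP" is expected to match in name, "1TB" in attributes.attribute_value.
// Documents matching both tokens receive a higher textScore.
const mixedFilter = { $text: { $search: "HP 1TB" } };
const mixedSort = { score: { $meta: "textScore" } };

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  printjson(db.laptops.find(mixedFilter).sort(mixedSort).limit(3).toArray());
}
```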
<figure id="515c">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/017e0e3fd4020801f53ebb17b31ea707.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><h2 id="b12f">Combine $text operator with other operators</h2><p id="ffde">The <code>$text</code> operator can be used together with regular MongoDB operators. For example, let’s find the HP ProBooks whose prices are below 10000 SEK:</p>
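<p>One way to express this, sketched under the assumption that a phrase search is used for “HP ProBook” and that the price is stored in a numeric <code>price</code> field, is to put both conditions in the same query document:</p>

```javascript
// Text condition and a regular comparison operator in one query document.
// Top-level conditions in a query document are combined with AND.
const cheapProBooks = {
  $text: { $search: '"HP ProBook"' },
  price: { $lt: 10000 },
};

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  printjson(db.laptops.find(cheapProBooks, { name: 1, price: 1 }).toArray());
}
```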
<figure id="0191">
<div>
<div>
<iframe class="gist-iframe" src="/gist/lynnkwong/07e2d12cd341308e1ee8c702178359b5.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined"></iframe>
</div>
</div>
</figure><p id="b08f">And this is what we get:</p><div id="a900"><pre>[
{ <span class="hljs-attr">_id:</span> <span class="hljs-number">13</span>, <span class="hljs-attr">name:</span> <span class="hljs-string">'HP ProBook Model 13'</span>, <span class="hljs-attr">price:</span> <span class="hljs-number">9994</span> },
{ <span class="hljs-attr">_id:</span> <span class="hljs-number">9</span>, <span class="hljs-attr">name:</span> <span class="hljs-string">'HP ProBook Model 9'</span>, <span class="hljs-attr">price:</span> <span class="hljs-number">9980</span> }
]</pre></div><p id="b98b">However, note that there should be only one <code>$text</code> operator in the search query, otherwise only the last one takes effect. This is because the query document (a dictionary in Python) cannot have duplicate keys.</p><h2 id="d486">Use $text operator in aggregation</h2><p id="d8f8">The <code>$text</code> operator can also be used in an aggregation pipeline. However, there are three major restrictions:</p><ul><li>The <code>$text</code> operator can only be used in the <code>$match</code> stage.</li><li>The <code>$match</code> stage containing a <code>$text</code> operator must be the first stage of the pipeline.</li><li>The <code>$text</code> operator can only occur once in the <code>$match</code> stage and in the whole pipeline.</li></ul><p id="8d49">Let’s write an aggregation pipeline to count the number of laptops grouped by RAM sizes for “HP ProBook”:</p>
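<p>A possible pipeline is sketched below. It rests on assumptions about the document shape that the article does not show: <code>attributes</code> is taken to be an array of sub-documents with an <code>attribute_name</code> and an <code>attribute_value</code> field, and the RAM attribute is assumed to be named <code>"ram"</code>. Adjust these hypothetical names to the actual data.</p>

```javascript
const ramCountPipeline = [
  // The $match stage with $text has to be the first stage of the pipeline.
  { $match: { $text: { $search: '"HP ProBook"' } } },
  // Emit one document per attribute (assumes `attributes` is an array).
  { $unwind: "$attributes" },
  // Keep only the RAM attribute (hypothetical attribute_name value).
  { $match: { "attributes.attribute_name": "ram" } },
  // Count the laptops for each RAM size.
  { $group: { _id: "$attributes.attribute_value", count: { $sum: 1 } } },
];

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  printjson(db.laptops.aggregate(ramCountPipeline).toArray());
}
```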
<div><pre>[
{ <span class="hljs-variable">_id</span>: <span class="hljs-string">'16GB'</span>, <span class="hljs-built_in">count</span>: <span class="hljs-number">2</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-string">'8GB'</span>, <span class="hljs-built_in">count</span>: <span class="hljs-number">4</span> },
{ <span class="hljs-variable">_id</span>: <span class="hljs-string">'4GB'</span>, <span class="hljs-built_in">count</span>: <span class="hljs-number">1</span> }
]</pre></div><p id="b3d5">It shows that the <code>$text</code> operator works just like any other regular operator in the aggregation pipeline.</p><p id="3c07">In this post, we introduced the classical Text Search in MongoDB using a <code>text</code> index. The full-text search solution provided by the <code>text</code> index and the corresponding <code>$text</code> operator is simple but also quite powerful. It should be sufficient for most small projects that only need to search by simple conditions. If you need more advanced searches, with indexes for multiple string fields and complex <b><i>should (not)</i></b>/<b><i>must (not)</i></b> conditions, you would want to use a more advanced search engine, such as <a href="http://What is Elasticsearch and why is it so fast?">Elasticsearch</a>, or the “Premium” <a href="https://betterprogramming.pub/learn-advanced-full-text-searches-with-mongodb-atlas-search-5e4b51719427">Atlas Search</a>.</p></article></body>
How to do basic full-text searches in MongoDB
Search your text data by the text index in MongoDB
As MongoDB is a document-oriented NoSQL database, it’s common to store plain text in some fields. To search against a string field, we can use the regular expression operator $regex directly. However, $regex can only work for simple search queries and cannot use indexes efficiently.
Image by DariuszSankowsk on Pixabay
In MongoDB, there are better ways to search against string fields. A classical way is to create a text index and search against it. MongoDB now also supports a “premium” full-text solution, but it only works if you host your data with Atlas. As it is common to use self-managed MongoDB servers in our work, especially for small and simple projects, it’s worth learning the classical Text Search solution, which can boost your search efficiency dramatically with simple queries. As will be demonstrated later, most common search problems can be solved with the text index together with classical MongoDB search and aggregation queries.
For demonstration, we will search against a list of laptops that will be stored in a MongoDB database. Please download the JSON file (generated by the author) containing some laptop data from a fictional online shop. Note that the data was generated randomly based on some common laptop brands. It can be used freely and won’t have any license issues. Then use the following commands to import the data:
When the code above is run, we will have a laptops collection in the products database containing 200 documents of laptop data. The documents have content as follows:
Now that the data is ready, we can start to create a text index and do some basic full-text searches.
In this tutorial, we will use mongosh to run the queries directly. If you need to write some complex queries, you may find a MongoDB IDE helpful which provides command autocompletion and error highlighting. For simplicity, we will use the mongosh shipped with the Docker container so we don’t need to install anything separately:
$ docker exec -it mongo-server bash
$ mongosh "mongodb://admin:pass@localhost:27017"
test> use products
products > show collections
laptops
Create a text index
Before we get started, there is something important we should remember, namely, there can be only one text index for a collection.
Let’s create a text index on the name field, which is done with the createIndex() method of a collection:
name is the string field for which we want to create an index, and the “text” value indicates that we want to create a text index that supports basic full-text searches. In comparison, to create a regular index in MongoDB, we specify 1 or -1 for a field to indicate if the field should be sorted in ascending or descending order in the index.
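The createIndex() call described above can be sketched as follows. This is a minimal illustration rather than the author's exact code; the variable name is chosen for the example.

```javascript
// Index specification: the value "text" (instead of 1 or -1) requests
// a text index on the name field.
const nameTextIndex = { name: "text" };

// `db` is a global only inside mongosh; the guard keeps the snippet
// loadable in plain Node.js, where the call is simply skipped.
if (typeof db !== "undefined") {
  // Print whatever createIndex() returns for the new index.
  printjson(db.laptops.createIndex(nameTextIndex));
}
```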
Before we start to search against the text index, we should know that even though there can only be a single text index for a collection, that index can cover multiple fields. Let’s drop the text index created above and create a new one covering both the name and attributes fields.
Note that the text index can be named differently but there can only be one text index in a collection.
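Dropping the first index and creating a multi-field one might look like the sketch below. Several details are assumptions: the nested field is taken to be attributes.attribute_value, the custom index name laptop_text_index is hypothetical, and name_text is the default name MongoDB gives to a text index on the name field.

```javascript
// Assumed fields: name plus the nested attributes.attribute_value.
const laptopTextIndex = { name: "text", "attributes.attribute_value": "text" };

// Hypothetical custom name for the new index.
const laptopTextIndexOptions = { name: "laptop_text_index" };

// Guarded so the snippet also loads outside mongosh.
if (typeof db !== "undefined") {
  // A text index is dropped by its name, not by its key specification.
  db.laptops.dropIndex("name_text");
  printjson(db.laptops.createIndex(laptopTextIndex, laptopTextIndexOptions));
}
```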