How to Use Hybrid Search for Better LLM RAG Retrieval

Building an advanced local LLM RAG pipeline by combining dense embeddings with BM25

Code snippet from the hybrid search we are going to implement in this article. Image by author

The basic Retrieval-Augmented Generation (RAG) pipeline uses an encoder model to search for similar documents when given a query.

This is also called semantic search because the encoder transforms text into a high-dimensional vector representation (called an embedding) in which semantically similar texts are close together.

Before we had Large Language Models (LLMs) to create these vector embeddings, the BM25 algorithm was a very popular search algorithm. BM25 focuses on important keywords and looks for exact matches in the available documents. This approach is called keyword search.

If you want to take your RAG pipeline to the next level, you might want to try hybrid search. Hybrid search combines the benefits of keyword search and semantic search to improve search quality.

In this article, we will cover the theory and implement all three search approaches in Python.

Table of Contents

· RAG Retrieval
  ∘ Keyword Search With BM25
  ∘ Semantic Search With Dense Embeddings
  ∘ Semantic Search or Keyword Search?
  ∘ Hybrid Search
  ∘ Putting It All Together
· Conclusion
· References

RAG Retrieval

Hybrid search is the combination of keyword search and semantic search. We will cover both search strategies separately and then combine them later on.

Keyword Search With BM25

BM25 is the algorithm of choice for keyword search. With BM25, we get a score for our query for each document in our corpus.

BM25 is based on the TF-IDF algorithm, which means that the core of the formula is the product of the term frequency (TF) and the inverse document frequency (IDF).

The TF-IDF algorithm is based on the idea that “matches on less frequent, more specific, terms are of greater value than matches on frequent terms” [1].

In other words, the TF-IDF algorithm looks for documents that contain rare keywords from our query.

The BM25 algorithm looks for documents that contain important keywords from the search query. Image by author

There are many variations of the BM25 algorithm, each aiming to improve upon the original algorithm. However, none seems to be systematically better than the others [2].

So in practice, it should be fine to pick one and stick with it.

If we look at the LangChain source code, we can see that it uses the BM25Okapi class from the rank_bm25 package, which is a slightly modified version of the ATIRE BM25 algorithm [3].

The formula to get a score for a document d and a given query q consisting of multiple terms t in the ATIRE BM25 version is as follows [2]:

The formula for the ATIRE BM25 score.
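
Transcribed into LaTeX from the definition given in [2], and using the symbols explained in the list that follows, the score reads roughly as below. The name score(q, d) on the left-hand side is a label chosen here for illustration.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \log\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1) \cdot tf_{td}}{k_1 \cdot \left(1 - b + b \cdot \frac{L_d}{L_{avg}}\right) + tf_{td}}
```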

N is the number of documents in the corpus
df_t is the number of documents containing the term t (also called the document frequency)
tf_td is the number of occurrences of the term t in document d (also called the term frequency)
L_d is the length of document d and L_avg is the average document length
There are two empirical tuning parameters: b and k_1

Intuitively, we see that the formula sums over all terms t, which we can think of as words.

The left-hand factor log(N/df_t) in the BM25 equation is called the inverse document frequency. A common word such as “the” may appear in every one of our documents, in which case N equals df_t and the inverse document frequency is zero (because log(1) is zero).

On the other hand, very rare words will appear in only a few documents, thus increasing the left factor. Therefore, the inverse document frequency is a measure of how much information is contained in the term t.
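
To see how the inverse document frequency behaves, here is a minimal sketch in plain Python. The three-sentence corpus and the helper name idf are assumptions made for illustration; they are not part of rank_bm25, and the helper assumes the term occurs in at least one document.

```python
import math

# Toy corpus, made up for this illustration
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in corpus]
N = len(tokenized)  # number of documents


def idf(term: str) -> float:
    """Inverse document frequency log(N / df_t); assumes df_t > 0."""
    df_t = sum(term in doc for doc in tokenized)  # documents containing the term
    return math.log(N / df_t)


idf_the = idf("the")    # appears in all three documents
idf_cat = idf("cat")    # appears in two documents
idf_bird = idf("bird")  # appears in one document
print(idf_the, idf_cat, idf_bird)
```

The rarer the term, the larger the value: "the" yields log(3/3) = 0, while "bird" yields log(3/1), the largest possible value for this corpus.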

The right factor is influenced by the number of times the term t occurs in document d.

The document d=["I like red cats, black cats, white cats, and brown cats"] has a very high term frequency tf_td for the term t="cats", which will result in a high BM25 score for a query containing the word “cats”.
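
The two factors can be combined into a small scoring function. The following is a sketch of the ATIRE BM25 formula as described above, not the rank_bm25 implementation. The function name atire_bm25, the toy corpus, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math


def atire_bm25(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)  # document frequency
        if df == 0:
            continue  # the term occurs nowhere, so it contributes nothing
        tf = doc.count(term)  # term frequency in this document
        idf = math.log(n_docs / df)  # left-hand factor
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * (k1 + 1) * tf / (length_norm + tf)  # right-hand factor
    return score


# Toy corpus, lowercased and without punctuation to keep tokenization trivial
corpus = [
    "i like red cats black cats white cats and brown cats".split(),
    "my neighbour has one of the cats".split(),
    "dogs are loyal".split(),
]
scores = [atire_bm25(["cats"], doc, corpus) for doc in corpus]
print(scores)
```

The first document mentions "cats" four times and receives the highest score, the second mentions it once and scores lower, and the third does not contain the term and scores zero.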

Let’s build some intuition for BM25 with the Python library rank_bm25.

pip install rank_bm25

First, we load the library and initialize BM25 with our tokenized corpus.

from rank_bm25 import BM25Okapi

corpus = [
    "The cat, commonly referred to as the domestic cat or house cat, is a small domesticated carnivorous mammal.",
    "The dog is a domesticated descendant of the wolf.",
    "Humans are the most common and widespread species of primate, and the last surviving species of the genus Homo.",
    "The scientific name Felis catus was proposed by Carl Linnaeus in 1758"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

Next, we tokenize our query.

query = "The cat"
tokenized_query = query.split(" ")

Finally, we compute scores using the BM25 algorithm. A high score indicates a good match between the document and the query.

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)
>> [0.92932018 0.21121974 0. 0.1901173]  # scores for documents 1, 2, 3, and 4

Because BM25 looks for exact term matches, querying for any of the terms “cats”, “Cat”, or “feline” will result in scores of doc_scores = [0, 0, 0, 0] for our four example documents.
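
Part of this brittleness comes from the whitespace tokenizer rather than from BM25 itself. As a minimal sketch, a tokenizer that lowercases the text and drops punctuation makes “Cat” match “cat”. The helper name simple_tokenize is an assumption for illustration and not part of rank_bm25.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())


doc = ("The cat, commonly referred to as the domestic cat or house cat, "
       "is a small domesticated carnivorous mammal.")

naive_tokens = doc.split(" ")            # keeps "cat," with the comma attached
normalized_tokens = simple_tokenize(doc)

print("Cat" in naive_tokens)                           # False
print(simple_tokenize("Cat")[0] in normalized_tokens)  # True
print(normalized_tokens.count("cat"))                  # 3
```

Normalization of this kind still does not help with “cats” or “feline”. Plurals would need stemming or lemmatization, and synonyms are exactly what semantic search is for.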

Semantic Search With Dense Embeddings

When we perform semantic search via dense embeddings, we transform words into a numerical representation. The idea is that similar words are close together in this new mathematical representation.

Using dense embeddings to group semantically similar text. Image by author

Text embeddings are high-dimensional vectors that represent single words or whole sentences. They are called dense because nearly every entry in the vector is a non-zero, meaningful number. The opposite is a sparse vector, in which most entries are simply zero.
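
For contrast, here is a minimal sketch of a sparse representation: a bag-of-words count vector over a made-up three-sentence corpus. The corpus and the helper name bag_of_words are assumptions for illustration.

```python
# Toy corpus, made up for this illustration
corpus = ["the cat sat", "the dog barked", "a bird sang loudly"]

# One vector dimension per distinct word in the corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})


def bag_of_words(text: str) -> list[int]:
    """Count how often each vocabulary word occurs in the text."""
    words = text.split()
    return [words.count(word) for word in vocabulary]


sparse_vector = bag_of_words("the cat sat")
zero_fraction = sparse_vector.count(0) / len(sparse_vector)
print(vocabulary)
print(sparse_vector, zero_fraction)
```

Even with only nine vocabulary words, two thirds of the entries are zero, and with a realistic vocabulary almost all of them would be. A dense embedding from an encoder model instead has a fixed, much smaller number of dimensions in which nearly every entry carries information.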

Before text is turned into embeddings, it is first split into tokens, and each token is mapped to an integer ID. A neural network embedding model called the encoder then converts the tokens into embeddings.

From input string to sentence embedding. Image from https://readmedium.com/how-to-reduce-embedding-size-and-increase-rag-retrieval-speed-7f903d3cecf7

After converting all the text from our corpus of documents into embeddings, we can then perform a semantic search to see which embedded document is closest to our embedded query.

We can visualize this task by plotting the embedding dimensions and finding the closest document matches to our query.

Scatter plot of the sentence embeddings from documents and a question. Image from https://readmedium.com/how-to-reduce-embedding-size-and-increase-rag-retrieval-speed-7f903d3cecf7

Mathematically, we find the closest match using the cosine similarity. For two embedding vectors a and b, we can compute the cosine similarity using the dot product as follows:

The formula for the cosine similarity.
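
In LaTeX notation, for two n-dimensional vectors a and b with angle θ between them, the standard definition is:

```latex
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```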

Where the numerator is the dot product of the two embedding vectors and the denominator is the product of their magnitudes.

Geometrically, the cosine similarity is the cosine of the angle between the two vectors. The cosine similarity score ranges from -1 to +1.

A cosine similarity score of -1 means that the embeddings a and b point in exactly opposite directions, 0 means that they are at an angle of 90 degrees (they are unrelated), and +1 means that they point in the same direction. So we look for a value close to +1 when matching a search query to documents.

If we normalize our embeddings beforehand, the cosine similarity measure becomes equivalent to the dot product similarity measure (the denominator becomes 1).
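
The following plain-Python sketch checks these properties on small hand-picked 2-dimensional vectors. The vectors and the helper names norm, cosine_similarity, and normalize are assumptions for illustration; real embeddings have hundreds of dimensions.

```python
import math


def norm(v: list[float]) -> float:
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = norm(v)
    return [x / length for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

similarity = cosine_similarity(a, b)  # 24 / 25 = 0.96
# Dot product of the normalized vectors: no division needed any more
unit_dot = sum(x * y for x, y in zip(normalize(a), normalize(b)))

opposite = cosine_similarity(a, [-3.0, -4.0])   # opposite directions
orthogonal = cosine_similarity(a, [4.0, -3.0])  # 90 degrees apart
print(similarity, unit_dot, opposite, orthogonal)
```

Up to floating-point rounding, similarity and unit_dot agree, which is why pre-normalized embeddings can be compared with a plain dot product.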

Let’s use the Python package sentence-transformers to perform a basic semantic search.

pip install sentence-transformers

First, we load the library and download the all-MiniLM-L6-v2 encoder model from Hugging Face. This encoder model is trained to produce 384-dimensional dense embeddings. If you have an OpenAI API key, you can use one of their text embedding models instead.

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Then, we use the same corpus of documents as before.

# The documents to encode
corpus = [
    "The cat, commonly referred to as the domestic cat or house cat, is a small domesticated carnivorous mammal.",
    "The dog is a domesticated descendant of the wolf.",
    "Humans are the most common and widespread species of primate, and the last surviving species of the genus Homo.",
    "The scientific name Felis catus was proposed by Carl Linnaeus in 1758"
]

# Calculate embeddings by calling model.encode()
document_embeddings = model.encode(corpus)

# Sanity check
print(document_embeddings.shape)
>> (4, 384)

And we embed our query:

query = "The cat"
query_embedding = model.encode(query)

Finally, we can compute the cosine similarity scores. Instead of coding the formula ourselves, we can use the utility function cos_sim from sentence_transformers.

from sentence_transformers.util import cos_sim

# Compute cosine_similarity between documents and query
scores = cos_sim(document_embeddings, query_embedding)

print(scores)
>> tensor([[0.5716],  # score for document 1
>>         [0.2904],  # score for document 2
>>         [0.0942],  # score for document 3
>>         [0.3157]]) # score for document 4

To see the power of semantic search with dense embeddings, I can re-run the code with the query “feline”:

query_embedding = model.encode("feline")

scores = cos_sim(document_embeddings, query_embedding)

print(scores)
>> tensor([[0.4007],
>>         [0.3837],
>>         [0.0966],
>>         [0.3804]])

Even though the word “feline” does not appear in the corpus of documents, the semantic search still ranks the text about cats as the highest match.

Semantic Search or Keyword Search?

Which search approach is better? That depends. Both have advantages and disadvantages. Now that we know how both work, we can see where they can be useful and where they can fail.

Keyword search with BM25 looks for exact matches of the query term. This can be very useful when we are looking for exact matches of phrases.

If I’m looking for “The Cat in the Hat”, I’m probably looking for the book/movie. And I don’t want semantically similar results that are close to hats or cats.

Another use case for keyword search is programming. If I am looking for a specific function or piece of code, I want an exact match.

Semantic search, on the other hand, looks for semantically similar content. This means that semantic search also finds documents with synonyms or different spellings, such as plurals, capitalization, etc.

Since both algorithms have their use cases, hybrid search uses both and then combines their results into one final ranking.

The disadvantage of hybrid search is that it requires more computing resources than running only one algorithm.

Hybrid Search

We can combine the results of BM25 and cosine similarity using Reciprocal Rank Fusion (RRF). RRF is a simple algorithm for combining the rankings of different scoring functions [4].

First, we need to get a document ranking for each scoring algorithm. In our example, this would be:

corpus = [
    "The cat, commonly referred to as the domestic cat or house cat, is a small domesticated carnivorous mammal.",
    "The dog is a domesticated descendant of the wolf.",
    "Humans are the most common and widespread species of primate, and the last surviving species of the genus Homo.",
    "The scientific name Felis catus was proposed by Carl Linnaeus in 1758",
]
query = "The cat"

bm25_ranking = [1, 2, 4, 3] # scores = [0.92932018 0.21121974 0. 0.1901173]
cosine_ranking = [1, 3, 4, 2] # scores = [0.5716, 0.2904, 0.0942, 0.3157]

The formula for the combined RRF score for each document d is as follows:

The formula for the RRF score.
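
In LaTeX notation, following the definition in [4], the score can be written as below. R denotes the set of rankings being fused, which here contains two entries; the label RRF(d) on the left-hand side is chosen for illustration.

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
```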

Where k is a parameter (the original paper used k=60) and r(d) is the rank of document d in each ranking, here the one from BM25 and the one from cosine similarity.
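
As a worked example, the sketch below applies the RRF formula to the two example rankings for the query "The cat". The variable names are assumptions for illustration.

```python
K = 60  # value used in the original RRF paper [4]

bm25_ranking = [1, 2, 4, 3]    # rank of documents 1 to 4 under BM25
cosine_ranking = [1, 3, 4, 2]  # rank of documents 1 to 4 under cosine similarity

# One RRF score per document: the sum of reciprocal ranks over both rankings
rrf_scores = [
    1 / (K + r_bm25) + 1 / (K + r_cos)
    for r_bm25, r_cos in zip(bm25_ranking, cosine_ranking)
]

for doc_number, score in enumerate(rrf_scores, start=1):
    print(f"Document {doc_number}: {score:.5f}")
```

Document 1 is ranked first by both methods and gets the highest score, 2/61 ≈ 0.03279. Documents 2 and 4 swap ranks 2 and 3 between the two methods and therefore tie at 1/62 + 1/63 ≈ 0.03200. Document 3 is last in both rankings and scores 2/64 = 0.03125.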

Putting It All Together

Now we can implement our hybrid search by doing BM25 and cosine similarity separately and then combining the results with RRF.

First, let’s define functions for RRF and a helper function to convert float scores to int rankings.

import numpy as np

def scores_to_ranking(scores: list[float]) -> list[int]:
    """Convert float scores into int rankings (rank 1 is the best)"""
    # Document indices ordered from the highest to the lowest score
    order = np.argsort(scores)[::-1]
    # Invert the ordering so that ranking[i] is the rank of document i
    ranking = np.empty(len(order), dtype=int)
    ranking[order] = np.arange(1, len(order) + 1)
    return ranking.tolist()


def rrf(keyword_rank: int, semantic_rank: int) -> float:
    """Combine keyword rank and semantic rank into a hybrid score."""
    k = 60
    rrf_score = 1 / (k + keyword_rank) + 1 / (k + semantic_rank)
    return rrf_score

Here is my simple hybrid search implementation using the concepts described above.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def hybrid_search(
    query: str, corpus: list[str], encoder_model: SentenceTransformer
) -> list[int]:
    # bm25
    tokenized_corpus = [doc.split(" ") for doc in corpus]
    tokenized_query = query.split(" ")
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_ranking = scores_to_ranking(bm25_scores)

    # embeddings (use the encoder passed in, not the global model)
    document_embeddings = encoder_model.encode(corpus)
    query_embedding = encoder_model.encode(query)
    cos_sim_scores = cos_sim(document_embeddings, query_embedding).flatten().tolist()
    cos_sim_ranking = scores_to_ranking(cos_sim_scores)

    # combine the per-document ranks into RRF scores
    hybrid_scores = []
    for i in range(len(corpus)):
        document_score = rrf(bm25_ranking[i], cos_sim_ranking[i])
        print(f"Document {i} has the rrf score {document_score}")
        hybrid_scores.append(document_score)

    # convert RRF scores into final rankings
    hybrid_ranking = scores_to_ranking(hybrid_scores)
    return hybrid_ranking

Now we can use hybrid_search with different queries. The function prints the RRF score of every document and returns the rank of each document in corpus order, where rank 1 is the best match.

hybrid_ranking = hybrid_search(
    query="What is the scientific name for cats?", corpus=corpus, encoder_model=model
)
print(hybrid_ranking)

For this query, the sentence about Felis catus (the fourth document) should receive rank 1, since it is the only document that contains the keywords "scientific" and "name".
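
The ranking step is easy to get wrong, because sorting yields the order of the documents while RRF needs the rank of each document. The sketch below checks the conversion in isolation, without any external libraries, on the BM25 and cosine similarity scores obtained earlier for the query "The cat". The helper names scores_to_ranks and rrf_fuse are assumptions for illustration.

```python
def scores_to_ranks(scores: list[float]) -> list[int]:
    """Return the rank of each item (1 = highest score), in input order."""
    # Indices ordered from the highest to the lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def rrf_fuse(score_lists: list[list[float]], k: int = 60) -> list[float]:
    """Sum 1 / (k + rank) over every scoring function, per document."""
    rank_lists = [scores_to_ranks(scores) for scores in score_lists]
    n_docs = len(score_lists[0])
    return [sum(1 / (k + ranks[i]) for ranks in rank_lists) for i in range(n_docs)]


bm25_scores = [0.92932018, 0.21121974, 0.0, 0.1901173]
cosine_scores = [0.5716, 0.2904, 0.0942, 0.3157]

bm25_ranks = scores_to_ranks(bm25_scores)      # [1, 2, 4, 3]
cosine_ranks = scores_to_ranks(cosine_scores)  # [1, 3, 4, 2]

fused = rrf_fuse([bm25_scores, cosine_scores])
# Document indices ordered from the best to the worst fused score
best_first = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(bm25_ranks, cosine_ranks, best_first)
```

Because rrf_fuse accepts any number of score lists, the same sketch would also cover fusing more than two retrievers.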

As a next step, we could add more knowledge to our corpus of documents. In my article How to Use Re-Ranking for Better LLM RAG Retrieval, I integrated Wikipedia into the knowledge corpus.

Adding a re-ranker on top of the hybrid search will further improve the overall RAG pipeline.

Conclusion

Hybrid search combines semantic search and keyword search to produce a better overall search result.

To perform keyword search, we used the BM25 algorithm, which looks for important keyword matches.

For semantic search, we used cosine similarity with a pre-trained encoder model that produces dense embeddings.

While hybrid search can improve RAG retrieval, it also requires more computational resources than running only one search algorithm.

Hybrid search is an interesting building block for improving your RAG pipeline. Give it a try!

References

[1] K. Spärck Jones, A statistical interpretation of term specificity and its application in retrieval (1972), Journal of Documentation Vol. 28 No. 1

[2] A. Trotman, A. Puurula, and B. Burgess, Improvements to BM25 and Language Models Examined (2014), ADCS ’14: Proceedings of the 19th Australasian Document Computing Symposium

[3] A. Trotman, X.-F. Jia, and M. Crane, Towards an Efficient and Effective Search Engine (2012), Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval

[4] G. V. Cormack, C. L. A. Clarke, and S. Büttcher, Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods (2009), Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
