Steve George

Summary

The web content describes how to use ChromaDB, an open-source vector database, in conjunction with LangChain for semantic search and document management, including installation, setup, and operations like adding, updating, and deleting documents.

Abstract

The article provides a comprehensive guide on integrating ChromaDB with LangChain to create a semantic search engine. It outlines the advantages of ChromaDB, such as ease of setup, storage of embeddings with metadata, and support for multiple embedding models. The process includes installing necessary packages, loading and splitting documents, embedding the text, and storing it in ChromaDB. The article also details how to perform semantic searches, add new documents to an existing database, delete specific data, and persist the database for long-term storage. Additionally, it discusses leveraging ChromaDB to enhance the performance of Large Language Models (LLMs) by providing relevant data for faster and more precise responses.

Opinions

  • The author emphasizes the ease of installation and use of ChromaDB, suggesting it is a user-friendly tool for developers.
  • ChromaDB's ability to save embeddings along with metadata is highlighted as a key feature for leveraging LLMs.
  • The article conveys that ChromaDB's support for multiple embedding models provides flexibility based on the use case.
  • The use of LangChain is presented as beneficial for preprocessing and handling documents, indicating a seamless integration with ChromaDB.
  • The author appears to advocate for the use of ChromaDB with LLMs to improve the precision and speed of generated outputs, implying a synergistic relationship between the two technologies.

Semantic search engine using ChromaDB wrapped on LangChain

ChromaDB is an open-source vector database used to store embedding vectors. It integrates with LangChain, LlamaIndex, OpenAI, and more. This article explains how to use ChromaDB through its LangChain integration.

Below are the advantages of ChromaDB:

  • Easy to set up and install
  • Saves embeddings along with metadata, which can later be used to leverage LLMs
  • Easy to store and retrieve embedding vectors
  • Supports multiple embedding models
  • Open-source
  • Python SDK available

As per the workflow diagram above, the loaded documents are tokenized and converted into embedding vectors, which are then stored in ChromaDB. The database can also be saved to a persistent location for later use. Given an input text, ChromaDB performs a similarity check and returns the closest matching data along with its metadata.

In this article, we will discuss creating, updating, and deleting data in ChromaDB.

Note: The chunks retrieved from ChromaDB can be fed into an LLM on which prompt engineering is performed. This makes the LLM's output more precise and faster.

Installation of ChromaDB

ChromaDB is easily installed using the pip command.

pip install chromadb

Since we are using the LangChain wrapper, install langchain. Additionally, install sentence_transformers to perform the embedding. Other types of transformers are available, so the installation can vary based on the use case.

pip install langchain "unstructured[pdf]" sentence_transformers

Setting up Vector database

Using langchain.document_loaders, load the documents from a local folder. This library can also load documents from S3, blob storage, Google storage, URLs, and many more sources.

from langchain.document_loaders import DirectoryLoader
directory = '/dbfs/FileStore/testfolder/'

For the current example, we are loading a document on One-Day International (ODI) cricket rules. Hence the length of the loaded document list is 1.

def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)

Once the document is loaded, it is split using RecursiveCharacterTextSplitter, which splits recursively on separators such as '\n\n' and '\n'. Other text splitters are also available under langchain.text_splitter.

As per the code below, we set chunk_size to 1000 and chunk_overlap to 20, which are self-explanatory. With these parameters, the document is split into 138 chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents,chunk_size=1000,chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)
print(len(docs))

By default, ChromaDB uses all-MiniLM-L6-v2 for embeddings. Depending on the LLM or use case, a different embedding model can be used as well: https://docs.trychroma.com/embeddings

from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

Once the embedding model is set up, both the split documents and their embeddings are stored in ChromaDB. If we check the number of embedding IDs in ChromaDB, it matches the earlier chunk count (138).

from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings)
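
One way to sanity-check the count is to read the stored IDs back through the LangChain Chroma wrapper's get() method (a minimal check, not from the original article):

print(len(db.get()['ids']))  # should print 138, matching the chunk count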

For semantic search, use similarity_search and pass the query along with k, the number of results, as shown.

query = "number of players in a field"
matching_docs = db.similarity_search(query,k=3)

matching_docs

In the output, we can see the k=3 closest chunks along with their metadata.
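
To make the matches easier to read, a small loop can print each result with its source (a sketch; the page_content and metadata fields follow LangChain's Document schema):

for doc in matching_docs:
  # Each match is a LangChain Document carrying text and source metadata
  print(doc.metadata.get('source'), '->', doc.page_content[:120])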

Adding document to existing ChromaDB

As data grows, it becomes necessary to continuously add documents to an existing vector database. In the code below, we load two new documents (football and basketball rules).

directory = '/dbfs/FileStore/testfolder2/'
def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)

We use the same preprocessing method as earlier. After the split, there are 50 new chunks of data.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents,chunk_size=1000,chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)
print(len(docs))

Using the code below, the new data can be added to the existing ChromaDB. It uses the same embedding model that was used while creating the vector database.

db.add_documents(docs)

Now the overall length of the database is 188 (138 chunks from the earlier data and 50 from the new set).

When we perform similarity_search on the updated ChromaDB, the results span all of the loaded documents.

query = "number of players in a field"
matching_docs = db.similarity_search(query, k=4)

matching_docs

As the results show, matches are extracted from multiple documents. By referring to the metadata source, the developer can identify the source document from which each chunk was extracted.

Deleting selected data from ChromaDB

To delete the data created from the first document, retrieve the IDs of its chunks using the command below. The length of this list matches the total number of chunks created by the first document.

get_id = db.get(where={'source': '/dbfs/FileStore/testfolder/odi.pdf'})['ids']
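
A quick print confirms this (138 chunks came from the first document):

print(len(get_id))  # expect 138, the chunk count of the first document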

Using the command below, delete the data using the IDs retrieved above. The current size of the database can then be checked for verification.

db.delete(get_id)
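
A minimal verification (again via the wrapper's get() method) shows that only the 50 chunks from the second set of documents remain:

print(len(db.get()['ids']))  # 188 - 138 = 50 chunks remaining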

Persist the vector database

The vector database generated by ChromaDB can be stored in a persistent location, including cloud storage.

# persist_directory must be supplied when the store is created
db = Chroma.from_documents(docs, embeddings, persist_directory="/dbfs/FileStore/testfolder/dbpersist")
db.persist()
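
To reload the persisted database later, point the constructor at the same directory and pass the embedding function (a sketch assuming the same path and model as above):

db_reloaded = Chroma(persist_directory="/dbfs/FileStore/testfolder/dbpersist", embedding_function=embeddings)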

Additional: Leveraging ChromaDB for an LLM use case

As per the diagram above, the top n relevant chunks are extracted using ChromaDB's similarity_search based on the user query and passed on to the LLM. The LLM then fetches the answer from that relevant data, making the whole process faster.

Using LangChain, an end-to-end chatbot can be created with ease, since both the vector DB (ChromaDB) and a model (e.g. from Hugging Face) are available.
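
As a minimal sketch of this pattern (the chain type, model, and query below are illustrative choices, not from the original article), the Chroma store can be wired into a LangChain RetrievalQA chain backed by a Hugging Face model:

from langchain.chains import RetrievalQA
from langchain.llms import HuggingFaceHub  # assumes HUGGINGFACEHUB_API_TOKEN is set

# Any LangChain-compatible LLM works; flan-t5-large is just an example
llm = HuggingFaceHub(repo_id="google/flan-t5-large")

# Retrieve the top 3 matching chunks from ChromaDB and "stuff" them into the prompt
qa = RetrievalQA.from_chain_type(
  llm=llm,
  chain_type="stuff",
  retriever=db.as_retriever(search_kwargs={"k": 3}),
)
print(qa.run("How many players are on the field in an ODI match?"))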
