Semantic search engine using ChromaDB wrapped in LangChain
ChromaDB is an open-source vector database used to store embedding vectors. It integrates with LangChain, LlamaIndex, OpenAI, and more. In this article, ChromaDB's integration with LangChain is explained.
Below are the advantages of ChromaDB:
a. Easy to set up and install
b. Saves embeddings along with metadata, which can later be used to leverage LLMs
c. Easy to store and retrieve embedding vectors
d. Supports multiple embedding models
e. Open-source
f. Python SDK available
As per the above workflow diagram, the loaded documents are tokenized and converted into embedding vectors, which are then stored in ChromaDB. The database can also be written to a persistent location for later use. Using ChromaDB, a similarity search can be performed on input text, returning the closest matching data along with its metadata.
In this article, we will discuss creating, updating, and deleting data in ChromaDB.
Note: The retrieved chunks of data from ChromaDB can be fed into an LLM, on which prompt engineering is performed. This makes the LLM's output more precise and the overall process faster.
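To make the note concrete, below is a minimal sketch of stitching retrieved chunks into a prompt. It is illustrative only: matching_docs and query are produced by the similarity search covered later in this article, and the prompt template is an assumption, not a fixed recipe.

# Join the retrieved chunks into a single context string.
context = "\n\n".join(doc.page_content for doc in matching_docs)
# A simple prompt template; tune the wording for your LLM of choice.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)
# `prompt` can now be sent to any LLM.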
Installation of ChromaDB
ChromaDB is easily installed using the pip command.
pip install chromadb
Since we are using the LangChain wrapper, install langchain as well. Additionally, sentence_transformers is installed to perform the embedding. Other types of embedding models are available, so the installation can vary based on the use case.
pip install langchain "unstructured[pdf]" sentence_transformers
Setting up the vector database
Using langchain.document_loaders, load the documents from a local folder. With this library, one can also load documents from S3, blob storage, Google Cloud Storage, URLs, and many other sources; a short sketch of loading from S3 follows the local example below.
from langchain.document_loaders import DirectoryLoader
directory = '/dbfs/FileStore/testfolder/'
For the current example, we are loading a document on One Day International (ODI) cricket rules. Hence the length of the loaded documents list is 1.
def load_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents
documents = load_docs(directory)
len(documents)
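As a hedged illustration of loading from a remote source instead of a local folder, LangChain also provides loaders such as S3DirectoryLoader. The bucket name and prefix below are placeholders; boto3 and valid AWS credentials are assumed.

from langchain.document_loaders import S3DirectoryLoader

# Hypothetical bucket and prefix; replace with your own.
s3_loader = S3DirectoryLoader("my-bucket", prefix="rules/")
s3_documents = s3_loader.load()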
Once the document is loaded, it is split using RecursiveCharacterTextSplitter, which recursively splits on separators such as '\n'. Other text splitters are also available under langchain.text_splitter.
As per the below code, we set the chunk_size to 1000 and the chunk_overlap to 20. With these parameters, the document is split into 138 chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs
docs = split_docs(documents)
print(len(docs))
By default, ChromaDB uses all-MiniLM-L6-v2 for embedding. Depending on the LLM or the use case, a different embedding model can be used as well: https://docs.trychroma.com/embeddings
from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
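For instance, if the use case calls for OpenAI embeddings instead, they can be swapped in the same way. This is a sketch; the openai package and an OPENAI_API_KEY environment variable are assumed.

from langchain.embeddings import OpenAIEmbeddings

# Alternative embedding model; the rest of the workflow stays unchanged.
openai_embeddings = OpenAIEmbeddings()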
Once the embedding model is defined, both the split documents and their embeddings are stored in ChromaDB. If we check the number of embedding IDs available in ChromaDB, it matches the previous chunk count (138).
from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings)
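As a quick sanity check (not part of the original flow), the get() method of the LangChain Chroma wrapper returns the stored IDs:

# Should print 138, matching the number of split chunks.
print(len(db.get()['ids']))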
For semantic search, one can use similarity_search, passing the query to the function as shown.
query = "number of players in a field"
matching_docs = db.similarity_search(query, k=3)
matching_docs
In the output, we can see the k (3) closest matching chunks along with their metadata.
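A small sketch for inspecting the matches and their metadata:

for doc in matching_docs:
    print(doc.metadata['source'])    # source document of the chunk
    print(doc.page_content[:200])    # first 200 characters of the chunk
    print('---')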
Adding documents to an existing ChromaDB
As data grows, it is often necessary to continuously add documents to an existing vector database. In the below code, we load two new documents (football and basketball rules).
directory = '/dbfs/FileStore/testfolder2/'
def load_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents
documents = load_docs(directory)
len(documents)
We use the same preprocessing method that was used earlier. After the split, there are 50 new chunks of data.
from langchain.text_splitter import RecursiveCharacterTextSplitter
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs
docs = split_docs(documents)
print(len(docs))
Using the below code, one can add the new data to the existing ChromaDB. It uses the same embedding model that was used when creating the vector database.
db.add_documents(docs)
Now the overall length of the database is 188 (138 from the previous data and 50 from the new set).
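The same ID count check as before confirms this:

# Should print 188 after adding the new documents.
print(len(db.get()['ids']))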
When we perform similarity_search on the updated ChromaDB, the search results span all the stored documents.
query = "number of players in a field"
matching_docs = db.similarity_search(query, k=4)
matching_docs
In the output below, the results are drawn from multiple documents. By referring to the source field in the metadata, the developer can identify the document from which each chunk was extracted.
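For example, the distinct sources can be collected from the result metadata like this:

# Distinct source documents represented in the search results.
sources = {doc.metadata['source'] for doc in matching_docs}
print(sources)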
Deleting selected data from ChromaDB
To delete the data created by the first document, retrieve the IDs of its chunks using the below command. The length of the returned list matches the total number of chunks created by the first document.
get_id = db.get(where={'source': '/dbfs/FileStore/testfolder/odi.pdf'})['ids']
Using the below command, delete the data with the IDs retrieved in the previous step. The current size of the database can then be checked for verification.
db.delete(get_id)
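As a sanity check, the chunk counts can be compared (assuming, as above, that 138 chunks came from the first document):

print(len(get_id))            # 138 chunks belonged to the first document
print(len(db.get()['ids']))   # 50 chunks remain after deletion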
Persist the vector database
The vector database generated by ChromaDB can be stored in a persistent location, including mounted cloud storage. Note that the persist directory must be supplied when the store is created, so the snippet below rebuilds the store with a persist_directory and then writes it to disk.
db = Chroma.from_documents(docs, embeddings, persist_directory="/dbfs/FileStore/testfolder/dbpersist")
db.persist()
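To reload the persisted database in a later session, point a new Chroma instance at the same directory along with the embedding function (a sketch based on the LangChain wrapper):

# Reload the persisted vector store.
db2 = Chroma(persist_directory="/dbfs/FileStore/testfolder/dbpersist",
             embedding_function=embeddings)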
Additional: Leveraging ChromaDB for an LLM use case
As per the above diagram, the top n relevant chunks are extracted with ChromaDB's similarity_search based on the user query and passed to the LLM. The LLM then derives the answer from this relevant data, making the whole process faster.
Using LangChain, an end-to-end chatbot can be created with ease, as both the vector DB (ChromaDB) and the model (from Hugging Face) are available through it.
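As a minimal, hedged sketch of such a pipeline, LangChain's RetrievalQA chain can combine the Chroma retriever with a hosted Hugging Face model. The repo_id below is illustrative, and a HUGGINGFACEHUB_API_TOKEN environment variable is assumed.

from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

# Illustrative model choice; any LLM supported by LangChain works here.
llm = HuggingFaceHub(repo_id="google/flan-t5-large")

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                 # stuff retrieved chunks into the prompt
    retriever=db.as_retriever(search_kwargs={"k": 3}),
)
print(qa.run("How many players are on the field in ODI cricket?"))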