LANGCHAIN — How Can Data Sources Be Synced to Vector Stores?
The human spirit must prevail over technology. — Albert Einstein
In this tutorial, we will explore how to sync data sources to vector stores using LangChain’s Indexing API. This API makes it easy to load and keep documents from any source in sync with a vector store, while avoiding redundant work and ensuring efficient re-indexing.
Initializing the Vector Store
Let’s start by initializing the vector store. In this example, we’ll use the ElasticsearchStore as our vector store. First, we set up an instance of OpenAIEmbeddings and then create an ElasticsearchStore using the following code:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import ElasticsearchStore

collection_name = "test_index"
embedding = OpenAIEmbeddings()
vector_store = ElasticsearchStore(
    collection_name,
    es_url="http://localhost:9200",
    embedding=embedding,
)
Initializing the Record Manager
Next, we’ll initialize and create a schema for the record manager. In this example, we’ll use a SQLite table as our record manager:
from langchain.indexes import SQLRecordManager
namespace = f"elasticsearch/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()
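To make the record manager’s role concrete, here is a toy version of the bookkeeping table it maintains, using only the Python standard library. This is an illustration of the idea, not SQLRecordManager’s actual schema: one row per indexed document, keyed by namespace and a content-derived key, with a timestamp that lets re-indexing detect stale entries instead of rewriting everything.

```python
# Toy sketch of a record-manager cache table (illustrative only).
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE upsertion_record (
           key TEXT NOT NULL,
           namespace TEXT NOT NULL,
           updated_at REAL NOT NULL,
           PRIMARY KEY (key, namespace)
       )"""
)


def record_upsert(key: str, namespace: str) -> None:
    """Insert a record, or refresh its timestamp if it already exists."""
    conn.execute(
        "INSERT INTO upsertion_record VALUES (?, ?, ?) "
        "ON CONFLICT(key, namespace) DO UPDATE SET updated_at = excluded.updated_at",
        (key, namespace, time.time()),
    )


record_upsert("doc-hash-1", "elasticsearch/test_index")
record_upsert("doc-hash-1", "elasticsearch/test_index")  # refreshed, not duplicated
count = conn.execute("SELECT COUNT(*) FROM upsertion_record").fetchone()[0]
print(count)  # → 1
```

Because the primary key is (key, namespace), re-indexing the same document is an upsert rather than a duplicate row, which is what makes repeated syncs cheap.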
Indexing Documents
Now, let’s suppose we want to index the reuters.com front page. We can load and split the URL contents and then index the documents into the vector store:
from bs4 import BeautifulSoup
from langchain.document_loaders import RecursiveUrlLoader
from langchain.indexes import index
from langchain.text_splitter import RecursiveCharacterTextSplitter

raw_docs = RecursiveUrlLoader(
    "https://www.reuters.com",
    max_depth=0,
    extractor=lambda x: BeautifulSoup(x, "lxml").text,
).load()
processed_docs = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200
).split_documents(raw_docs)

index(
    processed_docs[:10],
    record_manager,
    vector_store,
    cleanup="full",
    source_id_key="source",
)
Cleanup Modes
During re-indexing, it’s important to handle the cleanup of existing documents in the vector store. LangChain’s Indexing API offers three cleanup modes: None performs no automatic cleanup, "incremental" continuously deletes previous versions of documents whose content has changed, and "full" additionally deletes any document not present in the current indexing run — for example, one that was removed from the source.
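The cleanup behavior can be sketched with a simplified, self-contained model: documents are keyed by a hash of their content, and the cleanup mode decides what happens to entries that were indexed before. This models the semantics only; it is not LangChain’s implementation, and the counts mirror the shape of the summary the API reports.

```python
# Simplified model of indexing with "full" cleanup (illustrative only).
import hashlib


def doc_key(content: str) -> str:
    """Key a document by a hash of its content."""
    return hashlib.sha256(content.encode()).hexdigest()


def index_batch(store: dict, docs: list, cleanup=None) -> dict:
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    seen = set()
    for doc in docs:
        key = doc_key(doc)
        seen.add(key)
        if key in store:
            stats["num_skipped"] += 1  # unchanged content: no redundant write
        else:
            store[key] = doc
            stats["num_added"] += 1
    if cleanup == "full":
        # "full" deletes anything not present in the current batch
        for key in list(store):
            if key not in seen:
                del store[key]
                stats["num_deleted"] += 1
    return stats


store = {}
print(index_batch(store, ["kitty", "doggy"], cleanup="full"))
# first run: both documents added
print(index_batch(store, ["kitty"], cleanup="full"))
# second run: "kitty" skipped (unchanged), "doggy" deleted (gone from source)
```

With cleanup=None, the second run would skip "kitty" but leave the stale "doggy" entry in the store, which is why a cleanup mode matters for sources whose content can disappear.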
Seeing it in Action
After indexing the documents, the index() call returns a summary of how many documents were added, updated, skipped, and deleted. This provides visibility into the work actually done during the indexing process.
Conclusion
LangChain’s Indexing API provides a clean and scalable way to efficiently sync data sources with vector stores. It minimizes redundant work, provides cleanup modes for re-indexing, and ensures that documents stay in sync with their source.
In this tutorial, we’ve covered the basic usage of LangChain’s Indexing API for syncing data sources to vector stores. For more in-depth examples and detailed documentation, check out the LangChain documentation.