Laxfed Paulacy

Summary

The Neon team has collaborated with LangChain to integrate pg_embedding with HNSW in Postgres for efficient vector similarity search, offering a faster and potentially more accurate alternative to PGVector.

Abstract

The article discusses the collaboration between the Neon team and LangChain, resulting in the release of the pg_embedding extension for Postgres. This integration leverages the Hierarchical Navigable Small World (HNSW) index, a graph-based approach for indexing high-dimensional data, which allows for a time complexity of O(log(rows)). The integration is designed to enhance vector similarity search capabilities within Postgres. Users are guided through the process of setting up PGEmbedding, which includes logging into a Neon account, creating a project, installing LangChain, initializing the vector store, and executing a similarity analysis. The article provides a code snippet demonstrating how to use the PGEmbedding vector store with LangChain's document loaders and text splitters, and how to create an HNSW index for efficient searching. The PGEmbedding integration is noted for its speed and accuracy, although it may require more computational resources for index construction compared to other vector stores. The choice of vector store is advised to be based on the specific requirements of the application.

Opinions

  • The author suggests that PGEmbedding with HNSW is superior to PGVector in terms of speed and accuracy for vector similarity searches in Postgres.
  • The article implies that the choice between PGEmbedding and other vector stores should be informed by the unique needs of the user's application, considering factors such as memory usage and computational resources.
  • The integration of pg_embedding with LangChain is presented as a significant advancement, enabling more efficient and effective vector searches within Postgres databases.
  • The use of HNSW for indexing is highlighted as a key feature that contributes to the performance improvements of the PGEmbedding integration.

LANGCHAIN — Neon X Langchain HNSW in Postgres with pg_embedding

The best way to predict the future is to invent it. — Alan Kay

The Neon team collaborated with LangChain to release the pg_embedding extension and the PGEmbedding integration in LangChain for vector similarity search in Postgres. The integration uses the Hierarchical Navigable Small World (HNSW) index, a graph-based approach to indexing high-dimensional data. HNSW builds a hierarchy of graphs, which gives search a time complexity of roughly O(log(rows)).
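To make the "hierarchy of graphs" idea concrete, here is a minimal toy sketch. It is not the pg_embedding implementation and not full HNSW: it assumes one-dimensional, sorted values and a skip-list-like layout in which layer k keeps every 2**k-th point, whereas real HNSW assigns levels randomly and links several neighbours per node in high-dimensional space. The names `build_layers` and `greedy_search` are made up for this illustration.

```python
def build_layers(values, num_layers):
    """Toy hierarchy: layer k keeps every 2**k-th point, linked to its layer neighbours."""
    layers = []
    for k in range(num_layers):
        members = list(range(0, len(values), 2 ** k))
        links = {}
        for pos, node in enumerate(members):
            links[node] = [
                members[p] for p in (pos - 1, pos + 1) if 0 <= p < len(members)
            ]
        layers.append(links)
    return layers


def greedy_search(values, layers, query):
    """Start in the sparsest layer, move greedily toward the query, then descend."""
    current = 0  # node 0 is present in every layer, so it serves as the entry point
    hops = 0
    for links in reversed(layers):  # top (sparsest) layer first
        improved = True
        while improved:
            improved = False
            for neighbour in links[current]:
                if abs(values[neighbour] - query) < abs(values[current] - query):
                    current = neighbour
                    hops += 1
                    improved = True
                    break
    return current, hops


values = [float(i) for i in range(1024)]
layers = build_layers(values, num_layers=10)
node, hops = greedy_search(values, layers, query=700.3)
print("nearest value:", values[node], "hops:", hops)
```

In this toy layout the search makes only a handful of moves to locate the nearest of 1,024 values, because each layer halves the remaining distance. That is the intuition behind the logarithmic behaviour, under the simplifying assumptions stated above.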

To get started with PGEmbedding, follow these steps:

  1. Log in to your Neon account and create a project:
  • npx neonctl auth
  • npx neonctl projects create
  2. If you haven’t installed LangChain, follow the instructions in the documentation.
  3. Initialize the PGEmbedding vector store and run a similarity search:
import os
from typing import List, Tuple
from langchain.embeddings.openai import OpenAIEmbeddings 
from langchain.text_splitter import CharacterTextSplitter 
from langchain.vectorstores import PGEmbedding 
from langchain.document_loaders import TextLoader 
from langchain.docstore.document import Document  

loader = TextLoader('state_of_the_union.txt') 
raw_docs = loader.load() 

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) 

docs = text_splitter.split_documents(raw_docs) 

embeddings = OpenAIEmbeddings() 

CONNECTION_STRING = os.environ["DATABASE_URL"]

# Initialize the vectorstore, create tables and store embeddings and metadata.
db = PGEmbedding.from_documents(
     embedding=embeddings,
     documents=docs,
     collection_name="state_of_the_union",
     connection_string=CONNECTION_STRING, 
)

# Create the index using HNSW. This step is optional. By default the vectorstore uses exact search.
db.create_hnsw_index(max_elements=10000, dims=1536, m=8, ef_construction=16, ef_search=16)

# Execute the similarity search and return documents
query = "What did the president say about Ketanji Brown Jackson" 

docs_with_score = db.similarity_search_with_score(query)  

print('query done')  
print("Results:") 

for doc, score in docs_with_score:
     print("-" * 80)
     print("Score: ", score)
     print(doc.page_content)
     print("-" * 80)
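The script above reads the connection string straight from the DATABASE_URL environment variable, so a missing variable surfaces only as a bare KeyError. Below is a small sketch of an early check using just the standard library. The helper name `load_connection_string` is hypothetical. The rewrite of a `postgres://` scheme to `postgresql://` rests on the assumption that the string ends up in SQLAlchemy, whose recent versions reject the shorter scheme name. Verify this against your installed versions before relying on it.

```python
import os
from urllib.parse import urlparse


def load_connection_string(env_var="DATABASE_URL"):
    """Read a Postgres URL from the environment and fail early with a clear message."""
    url = os.environ.get(env_var)
    if not url:
        raise RuntimeError(f"{env_var} is not set; export your Neon connection string first")

    scheme = urlparse(url).scheme
    if scheme not in ("postgres", "postgresql"):
        raise ValueError(f"{env_var} does not look like a Postgres URL (scheme: {scheme!r})")

    # Assumption: the consumer is SQLAlchemy, which expects the 'postgresql' scheme name.
    if scheme == "postgres":
        url = "postgresql" + url[len("postgres"):]
    return url
```

With this helper, the assignment in the script could become `CONNECTION_STRING = load_connection_string()`.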

The PGEmbedding integration is faster than PGVector at 99% accuracy. More generally, it tends to be faster, to achieve higher accuracy for the same memory footprint, and to use relatively less memory. However, index construction can be more computationally intensive. Ultimately, the choice between PGEmbedding and other vector stores should be based on the specific demands of your application.
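An accuracy figure such as 99% is usually read as recall: the fraction of the true nearest neighbours that the approximate index returns. The sketch below shows one way to measure that against a brute-force baseline, assuming Euclidean distance and small in-memory vectors. The function names and sample data are invented for illustration and are not taken from any benchmark.

```python
import math


def exact_top_k(vectors, query, k):
    """Brute-force baseline: rank every row by Euclidean distance to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Share of the exact top-k ids that also appear in the approximate result."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[0, 0], [1, 0], [0, 2], [3, 3], [5, 5]]
query = [0.2, 0.1]

exact_ids = exact_top_k(vectors, query, k=3)
approx_ids = [0, 1, 3]  # hypothetical ids returned by an approximate index
print("exact:", exact_ids, "recall@3:", recall_at_k(approx_ids, exact_ids))
```

Running the same comparison on a sample of your own queries, with and without the HNSW index, is one way to check whether the speed gain costs more accuracy than your application can tolerate.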

Experiment with both approaches to find the one that best meets your needs for LLM applications.
