Tahir Rauf

My NLP practicals: Embedding Creation and Search with Bedrock and OpenSearch Serverless

Word embeddings are low-dimensional, continuous (i.e., dense rather than sparse) vector representations that map words into a structured geometric space, thereby capturing semantic and contextual information. See my post “But wait! What are the word embeddings” to learn more about the concept. In this post, we’ll generate embeddings with Amazon Bedrock and store them in Amazon OpenSearch Serverless for later searches. Let’s begin with a quick introduction to the services involved.

Amazon Bedrock is a fully managed service that makes foundation models from leading AI startups and Amazon available via an API. You can quickly get started with use cases such as text generation, text summarization, and chatbots.

As for Amazon OpenSearch Serverless, our primary focus will be on the Vector search collection type.

The Vector search collection type is designed for storing, semantically searching, and retrieving vector embeddings in real time using the vector engine.
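As a rough illustration of what searching embeddings by meaning involves, the sketch below compares toy vectors with cosine similarity, one common distance measure for embeddings. The three-dimensional vectors are made up for illustration; real Titan embeddings have far more dimensions, and the vector engine uses approximate nearest-neighbor indexes rather than this brute-force comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for this example only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

sim_cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# Related words should score higher than unrelated ones.
print(sim_cat_kitten > sim_cat_car)
```

Vectors that point in nearly the same direction score close to 1, which is why semantically related texts end up as near neighbors in the index.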

Let's delve into generating, storing, and searching embeddings.

Step 0 — Load data

from langchain.document_loaders import HuggingFaceDatasetLoader
loader = HuggingFaceDatasetLoader("fka/awesome-chatgpt-prompts", page_content_column="prompt")
docs = loader.load()

Step 1 — Setup Embeddings

Split Text into chunks

Language models are limited in the amount of text you can pass to them. Most models have a token limit, so you simply can’t feed a 50-page report to an LLM in a single call. Long documents therefore need to be split into smaller chunks. For example, the token limit for Titan Embeddings G1 - Text is 8k tokens.

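from langchain.text_splitter import RecursiveCharacterTextSplitter

# Length of the longest document, measured in characters (not tokens).
# Using it as chunk_size keeps every prompt in this small dataset whole.
max_seq_len = max(len(doc.page_content) for doc in docs)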

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = max_seq_len,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
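To show what chunk_size and chunk_overlap control, here is a minimal sketch of fixed-size character windows with overlap. It is a simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this hypothetical chunk_text helper does not attempt.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last few characters of the previous one, so a sentence cut at a boundary still appears with some surrounding context in at least one chunk.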

Setup Embeddings

from langchain.embeddings import BedrockEmbeddings
import boto3

region_name = 'us-east-1'  # substitute your region here
# Bedrock runtime client, used to invoke models
bedrock_client = boto3.client(service_name='bedrock-runtime',
                              region_name=region_name)
# Create the Titan Embeddings model wrapper
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",
                                       client=bedrock_client)
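For readers curious about what the LangChain wrapper does under the hood, the following sketch calls the Bedrock runtime directly through invoke_model. The embed_text helper is a hypothetical name introduced here, and the request and response shape (an inputText field in, an embedding field out) is the one used by the Titan text embeddings model as far as I know; treat it as an assumption for other models.

```python
import json

def embed_text(client, text, model_id="amazon.titan-embed-text-v1"):
    """Return the embedding vector for `text` using a Bedrock runtime client."""
    response = client.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]

# Usage, assuming the bedrock_client created above and valid AWS credentials:
# vector = embed_text(bedrock_client, "proofread my essay")
# print(len(vector))
```

The length of the returned vector is what the dimension setting of the index has to match later on.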

2. Prepare OpenSearch Vector Store

Configure Permissions

To use OpenSearch Serverless in general, your user or role must have an attached identity-based policy with the following minimum permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "aoss:CreateCollection",
        "aoss:ListCollections",
        "aoss:BatchGetCollection",
        "aoss:DeleteCollection",
        "aoss:CreateAccessPolicy",
        "aoss:ListAccessPolicies",
        "aoss:UpdateAccessPolicy",
        "aoss:CreateSecurityPolicy",
        "iam:ListUsers",
        "iam:ListRoles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Create a collection

A collection is a logical group of indexes that work together to support your workload.

  1. Open the Amazon OpenSearch Service console.
  2. Choose Collections in the left navigation pane and choose Create collection.

Choose Vector search as the collection type. Once the collection is created, note down its endpoint.
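If you prefer to script this step instead of using the console, a sketch along the following lines may work. It builds the encryption, network, and data access policy documents that a collection needs, then shows the boto3 calls in comments. The helper names, the collection name "demo", the policy names, and the wide-open settings (public network access, aoss:* permissions) are assumptions chosen for a quick experiment, not recommendations for production.

```python
import json

def encryption_policy(collection_name):
    """Encryption policy using an AWS-owned key for one collection."""
    return json.dumps({
        "Rules": [{"ResourceType": "collection",
                   "Resource": [f"collection/{collection_name}"]}],
        "AWSOwnedKey": True,
    })

def network_policy(collection_name):
    """Network policy that allows public access to the collection endpoint."""
    return json.dumps([{
        "Rules": [{"ResourceType": "collection",
                   "Resource": [f"collection/{collection_name}"]}],
        "AllowFromPublic": True,
    }])

def data_access_policy(collection_name, principal_arn):
    """Data access policy granting one IAM principal full access."""
    return json.dumps([{
        "Rules": [
            {"ResourceType": "collection",
             "Resource": [f"collection/{collection_name}"],
             "Permission": ["aoss:*"]},
            {"ResourceType": "index",
             "Resource": [f"index/{collection_name}/*"],
             "Permission": ["aoss:*"]},
        ],
        "Principal": [principal_arn],
    }])

# Usage, assuming valid AWS credentials and the IAM permissions listed above:
# import boto3
# aoss = boto3.client("opensearchserverless")
# aoss.create_security_policy(name="demo-enc", type="encryption",
#                             policy=encryption_policy("demo"))
# aoss.create_security_policy(name="demo-net", type="network",
#                             policy=network_policy("demo"))
# aoss.create_access_policy(name="demo-access", type="data",
#                           policy=data_access_policy("demo", "<your IAM principal ARN>"))
# aoss.create_collection(name="demo", type="VECTORSEARCH")
```

The encryption policy has to exist before the collection is created, which is why the calls are listed in that order.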

3. Create Index

Create the OpenSearch client. Replace host with your own collection endpoint.

# https://www.cianclarke.com/blog/aws-opensearch-and-langchain/
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# NB: host without the https:// prefix and without a port
host = '<Collection Endpoint e.g a1b540f7a6eta.us-east>'
region = 'us-east-1'  # substitute your region here
service = 'aoss'
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

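Create the index. Replace the index name with one of your choice. The dimension of 1536 matches the size of the vectors produced by the Titan embeddings model configured earlier.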

# Index Creation
index_name = "<Index Name>"
indexBody = {
    "settings": {
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "vector_field": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "engine": "faiss",
                    "name": "hnsw"
                }
            }
        }
    }
}

try:
    # Pass the index name by keyword; newer opensearch-py releases expect it
    create_response = client.indices.create(index=index_name, body=indexBody)
    print('\nCreating index:')
    print(create_response)
except Exception as e:
    print(e)
    print("(Index likely already exists?)")

4. Create Embeddings

# https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html
from langchain.vectorstores import OpenSearchVectorSearch

docsearch = OpenSearchVectorSearch.from_documents(
    splits,
    bedrock_embeddings,
    opensearch_url=f'https://{host}:443',
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    index_name=index_name,
    bulk_size=1000
)

5. Search for similar docs

query = "proofread"

# query_embedding = docsearch.embedding_function(query)
relevant_documents = docsearch.similarity_search(query, k=5)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    # "prompt" was loaded as the page content, so only "act" is in the metadata
    print(f'## Document {i+1}: {rel_doc.metadata["act"]}: {rel_doc.page_content}')
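The similarity_search call hides the query that is sent to OpenSearch. As a sketch of what a k-NN request can look like when issued through the raw client, the hypothetical build_knn_query helper below assembles the query body. It assumes the vectors live in the vector_field defined in the index mapping above, and that LangChain stored the chunk text under a field named text, which is its default as far as I know.

```python
def build_knn_query(query_vector, k=5, vector_field="vector_field"):
    """Build an OpenSearch k-NN query body for a single query vector."""
    return {
        "size": k,  # number of hits to return
        "query": {
            "knn": {
                vector_field: {
                    "vector": list(query_vector),
                    "k": k,  # number of nearest neighbors to retrieve
                }
            }
        },
    }

# Usage, assuming client, index_name and bedrock_embeddings from the steps above:
# query_vector = bedrock_embeddings.embed_query("proofread")
# response = client.search(index=index_name, body=build_knn_query(query_vector, k=5))
# for hit in response["hits"]["hits"]:
#     print(hit["_score"], hit["_source"].get("text"))
```

Going through the raw client is mainly useful for debugging or for combining the k-NN clause with other query options that the LangChain wrapper does not expose.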
