avatarMeng Li

Summary

The provided content outlines methods for creating a local knowledge base using large language models, detailing the implementation of LangChain and alternative solutions for interacting with documents through AI.

Abstract

The web content discusses the construction of a local knowledge base by leveraging large language models, with a focus on LangChain's approach. It describes the process of indexing and retrieving relevant documents using vector databases and the role of vector storage and retrieval engines like Chroma. The article explains LangChain's question-answering pipeline, which involves creating an index, a retriever, a question-answering chain, and then posing a question. It also presents various indexing methods such as stuff, map_reduce, refine, and map_rerank. The content further explores the principles behind ChatPDF's implementation, use case scenarios for different professional roles, and the use of a typical prompt template for generating responses. Additionally, it highlights alternative knowledge base solutions, including open-source projects and commercial tools that facilitate private and secure interactions with various document formats.

Opinions

  • The author emphasizes the importance of privacy and security when interacting with documents using AI, promoting tools that operate locally without data leakage.
  • There is a preference for AI solutions that can handle large documents efficiently, as evidenced by the mention of tools capable of processing even x-large PDFs.
  • The author shows a clear interest in new technologies, particularly in the AI and data fields, and encourages following for more insights into AI advancements.
  • The content suggests that the integration of AI with document processing can significantly enhance productivity and the ability to extract insights across various professional domains.
  • There is an endorsement of open-source software, with several projects mentioned that contribute to the AI community and provide users with customizable options for their specific needs.

Building a Local Knowledge Base with Large Models

Utilize LangChain to establish a knowledge base, with alternative solutions also recommended at the end of the document.

Using Indexes

Indexing the loaded content, the indexes offer several functionalities:

  • Document Loaders for loading documents
  • Text Splitters for segmenting documents
  • VectorStores for vector storage
  • Retrievers for document retrieval

LangChain’s question-answering process can be divided into 4 steps:

Step 1: Create an index. Step 2: Create a Retriever from that index. Step 3: Create a question-answering chain. Step 4: Ask a question!

Implementation with ChatTxt

Within LangChain, the BaseRetriever defined uses the `get_relevant_documents` function to obtain a list of documents relevant to the query. LangChain employs Chroma as the vector storage and retrieval engine, so you’ll need to install the toolkit first.

from abc import ABC, abstractmethod
from typing import List
from langchain.schema import Document
class BaseRetriever(ABC):
@abstractmethod
def get_relevant_documents(self, query: str) -> 
List[Document]:
"""Get texts relevant for a query.
Args:
query: string to find relevant texts for
Returns:
List of relevant documents

Chroma is a database for building AI applications with embeddings.

import chromadb

chroma_client = chromadb.Client()

collection = 
chroma_client.create_collection(name="my_collection")

collection.add(
documents=["This is an apple", "This is a banana"],
metadatas=[{"source": "my_source"}, {"source": 
"my_source"}],
ids=["id1", "id2"]
)

results = collection.query(
query_texts=["This is a query document"],
n_results=2
)
results

{‘ids’: [[‘id1’, ‘id2’]], ‘embeddings’: None, ‘documents’: [[‘This is an apple’, ‘This is a banana’]], ‘metadatas’: [[{‘source’: ‘my_source’}, {‘source’: ‘my_source’}]], ‘distances’: [[1.652575969696045, 1.6869373321533203]]}

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

loader = TextLoader('./xxx.txt', encoding='utf8')

index = VectorstoreIndexCreator().from_loaders([loader])

query = "xxxx"
index.query(query)

LangChain offers four indexing methods:

1. stuff Directly input the document as a prompt to OpenAI. 2. map_reduce Create a prompt for each chunk (for an answer or summary), then merge them. 3. refine Create a prompt on the first chunk to get a result, then merge the next document and output the result. 4. map_rerank Create a prompt for each chunk, score them, and return the result from the best-scoring document.

Principles Behind ChatPDF Implementation

Load file -> Read text -> Split text -> Vectorize text -> Vectorize question -> Match the top k most similar text vectors to the question vector -> Use the matched text as context along with the question in the prompt -> Submit to LLM to generate an answer.

ChatPDF Implementation Scenarios

Typical use cases include:

Project Manager Role: Project management documents + LLM

New Car Sales Guide Role: New car sales articles + LLM

Used Car Expert: Used car appraisal documents + LLM

Traditional Chinese Medicine Practitioner Role: TCM consultation documents + LLM

Customer Service Role: Customer service chat records + LLM

Lawyer Role: Legal consultation records + LLM

HR Role: Attendance and performance manual + LLM

A typical prompt template:

Known information:
{context} Based on the information provided above, answer the user's question concisely and professionally. If an answer cannot be derived from the information, please respond with "Unable to answer the question based on the provided information" or "Insufficient relevant information provided," and do not add fabricated content to the answer. Please answer in Chinese.
The question is: {question}

Vector Databases

A database that stores embeddings is known as a vector database.

The Transformer architecture in ChatGPT requires knowledge to be embedded.

Alternative Knowledge Base Solutions

Ai PDF is a PDF document processing tool based on GPT, featuring the following characteristics:

This is an open-source project that allows users to interact with their documents using the power of GPT, running entirely locally without leaking user data.

Users can import various documents and ask questions in natural language, to which GPT will provide answers based on those documents.

It offers an open-source GPT chatbot that can run in a local environment without the need to connect to an external API.

It supports implementation in multiple programming languages and can be custom-trained to optimize conversation outcomes.

This is also an open-source project for running the GPT model locally.

Users can import documents locally, and the GPT model will answer questions based on the content of these documents, protecting user privacy.

This is a commercial tool that also allows users to interact with documents and content in various formats through natural language, receiving precise responses.

It supports PDF, Word documents, text, audio, video, and more.

This is an AI workspace service for individuals, where users can custom-train models to optimize different tasks, such as writing, summarizing, and question-answering.

It provides a visual training monitoring interface, mainly targeting enterprise users.

Humata is a tool that uses AI technology to enable users to interact with various data documents through natural language.

askwise.ai is a knowledge assistant service based on GPT.

Conclusion

Knowledge is power, and AI-generated structures reach high.

Want to discover more secrets of AI? Follow us for more exciting discoveries!

I am Meng Li, an independent open-source software developer, and author of the SolidUI AI painting project, very interested in new technologies, and focused on AI and the data field.

Langchain
Agents
OpenAI
Knowledge
Python
Recommended from ReadMedium