SEC FILINGS QUESTION ANSWERING AND SUMMARIZATION: CHROMA DB, LANGCHAIN AND LLAMA INDEX
This article describes my project on question answering and summarization for SEC filings. Every publicly listed company has to file an annual report (10-K) and quarterly reports (10-Q). With the rise of generative AI language models, we can build question-answering and summarization systems for these documents. Here are the project links and the SEC Data Downloader:
SEC Data Downloader: https://llamahub.ai/l/sec_filings
QUESTION ANSWERING SYSTEM (SUPPORT ONLY FOR 10-K)
- Here I will explain the design decisions that I made while building the project.
- I am using two LLMs here: the first understands the query and builds the metadata for retrieval, and the second answers the question from the relevant context.
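# Import paths match older LangChain releases (assumption); adjust for your version
import os
import openai
from dotenv import load_dotenv
from langchain.llms import OpenAI
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain.prompts import PromptTemplate
# One schema per field that LLM-1 has to return
section_names = ResponseSchema(name="Section_Names", description="Name of the sections")
tickers = ResponseSchema(
    name="Tickers",
    description="Name of the tickers, make sure to convert company names to tickers",
)
years = ResponseSchema(
    name="Years",
    description="Years mentioned, if no years are mentioned then output the last 5 years ['2018','2019','2020','2021','2022']",
)
# response_schema = [section_names, tickers, years, augmented_query]
response_schema = [section_names, tickers, years]
# The parser produces the format instructions for the prompt and parses the reply
output_parse = StructuredOutputParser.from_response_schemas(response_schema)
format_instructions = output_parse.get_format_instructions()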
llm1_template = """
You are a financial statement analyst, and here are the definitions of different sections {definitions_of_sections} in a SEC filings.\n
The definition is formatted as Section_Name: definition of the section. Based on the definitions, return the top {num_returns} possible section names that the user is asking for.\n
User request: "{user_request}"
{format_instructions}
"""
"""
Return the stocks tickers mentioned in the user query, make sure that you convert the company name to ticker.\n
Return the years that are mentioned in the user query, and if nothing is mentioned, then output the last 5 years. \n
For example:
User Query: What are the risk factors for Apple and Google for the year 2021 and 2022?\n
Augmented Query: ["What are the risk factors for Apple in 2022?","What are the risk factors for Apple in 2021?","What are the risk factors for Google in 2022?","What are the risk factors for Google in 2021?"]\n
"""
llm_1_prompt_template = PromptTemplate(
    input_variables=[
        "definitions_of_sections",
        "user_request",
        "num_returns",
        "format_instructions",
    ],
    template=llm1_template,
)
# Load variables from .env file
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
openai.api_key = openai_api_key
# autolog({"project":PROJECT, "job_type": JOB_TYPE+"_LLM1"})
# print(format_instructions)
# DEFINITIONS_10K, DEFINITIONS_10Q and NUM_SECTION_RETURN are project constants defined elsewhere
def get_response_llm1(USER_REQUEST: str, filing_type: str = "10-K") -> dict:
    # Pick the section definitions that match the filing type
    if filing_type == "10-K":
        section_def = DEFINITIONS_10K
    elif filing_type == "10-Q":
        section_def = DEFINITIONS_10Q
    else:
        raise ValueError(f"Unsupported filing type: {filing_type}")
    llm1_prompt = llm_1_prompt_template.format(
        definitions_of_sections=section_def,
        num_returns=NUM_SECTION_RETURN,
        user_request=USER_REQUEST,
        format_instructions=format_instructions,
    )
    llm1 = OpenAI(temperature=0.0)
    output_1 = llm1.predict(llm1_prompt)
    llm1_output_dict = output_parse.parse(output_1)
    # The model sometimes returns a comma-separated string instead of a list
    for key in llm1_output_dict:
        if not isinstance(llm1_output_dict[key], list):
            llm1_output_dict[key] = llm1_output_dict[key].split(", ")
    # Normalize section names to the stored format, e.g. "Risk Factors" -> "RISK_FACTORS"
    llm1_output_dict["Section_Names"] = [
        i.upper() for i in llm1_output_dict["Section_Names"]
    ]
    llm1_output_dict["Section_Names"] = [
        "_".join(sec.split(" ")) for sec in llm1_output_dict["Section_Names"]
    ]
    # print(llm1_output_dict)
    # query_metadata = get_query_metadata(llm1_output_dict)
    return llm1_output_dict
- You can see that we use a structured output parser from LangChain to extract the tickers, the years, and the sections where the answer to the query is likely to be found. So for the query "What are the risk factors of Apple for the year 2022?", it will output
{
    "Tickers": ["AAPL"],
    "Years": ["2022"],
    "Section_Names": ["RISK_FACTORS", "MANAGEMENT_AND_DISCUSSION"]
}
2. Now, while building the database, I am storing the chunks in ChromaDB using FinBERT embeddings. By default, the code uses Sentence Transformer embeddings (model_name="all-MiniLM-L6-v2"), but you can switch to FinBERT embeddings. Finance-specific embeddings may capture the terminology of financial documents better, which should help retrieval
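from chromadb import Documents, EmbeddingFunction, Embeddings
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
class FinBertEmbeddings(EmbeddingFunction):
    # finbert_embed is the FinBERT embedding model, assumed to be loaded elsewhere in the project
    def __call__(self, texts: Documents) -> Embeddings:
        embed_out = finbert_embed.embed_documents(texts)
        return embed_out
def create_vector_store_langchain(documents, doc_name: str, if_finbert: bool = False):
    # Choose the embedding model: FinBERT if requested, otherwise the default MiniLM model
    if if_finbert:
        embedding_function = FinBertEmbeddings()
    else:
        embedding_function = SentenceTransformerEmbeddings(
            model_name="all-MiniLM-L6-v2"
        )
    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embedding_function,
        ids=[f"id{i}" for i in range(len(documents))],
        persist_directory=f"sec-{doc_name}",
        collection_name=f"SEC-{doc_name}",
    )
    # Write the collection to disk so it can be restored later
    vector_store.persist()
    return vector_store
3. Metadata creation was a bit tricky. After chunking the documents, we saved each chunk with its metadata in the following way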
metadata_dict.update(
    {
        "full_metadata": sm["ticker"]
        + "_"
        + sm["year"]
        + "_"
        + sm["section"]
        + "_"
        + sm["filing_type"]
    }
)
You can see here that full_metadata is saved as the ticker, year, section name, and filing type joined by underscores. LLM-1 gives us all of this information, and we combine it in the same way to build the where clause for ChromaDB
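As a rough illustration of that combination step, here is a minimal sketch that turns the LLM-1 output into one filter per ticker, year, and section. The helper name `build_query_metadata` is hypothetical and is not the project's actual `get_query_metadata`; it only assumes the key names and the underscore-joined format shown above.

```python
from itertools import product


def build_query_metadata(llm1_output: dict, filing_type: str = "10-K") -> list:
    """Return one {"full_metadata": ...} filter per ticker/year/section combination.

    Hypothetical helper: it mirrors the ticker_year_section_filingtype format
    used when the chunks were stored.
    """
    combos = product(
        llm1_output["Tickers"],
        llm1_output["Years"],
        llm1_output["Section_Names"],
    )
    return [
        {"full_metadata": f"{ticker}_{year}_{section}_{filing_type}"}
        for ticker, year, section in combos
    ]


llm1_output = {
    "Tickers": ["AAPL"],
    "Years": ["2022"],
    "Section_Names": ["RISK_FACTORS", "MANAGEMENT_AND_DISCUSSION"],
}
query_metadata = build_query_metadata(llm1_output)
print(query_metadata)
```

For the example query this yields two filters, one for each section, which is the shape of list the retrieval function expects.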
def get_relevant_docs(query_metadata, restore_collection, user_request: str):
    # print(llm1_output_dict)
    if not query_metadata:
        raise ValueError("query_metadata must contain at least one filter")
    # A single filter is passed as-is; several filters are combined with $or
    if len(query_metadata) == 1:
        where_clause = query_metadata[0]
    else:
        where_clause = {"$or": query_metadata}
    query_results = restore_collection.query(
        query_texts=user_request,
        n_results=20,
        where=where_clause,
        include=["metadatas", "documents", "distances", "embeddings"],
    )
    return query_results
- You can see that the where clause uses an OR condition to get the data from different sections. For example, “AAPL_2022_RISK_FACTORS_10-K” OR “AAPL_2022_MANAGEMENT_AND_DISCUSSION_10-K”.
4. We also apply Maximal Marginal Relevance (MMR) to pick the most diverse set of documents from the retrieved documents. This is useful for answering subjective questions, but not for objective questions.
# Taken from LangChain math utils
from typing import List
import numpy as np
from langchain.utils.math import cosine_similarity  # import path may differ across LangChain versions
def maximal_marginal_relevance(
    query_embedding: np.ndarray,
    embedding_list: list,
    lambda_mult: float = 0.5,
    k: int = 4,
) -> List[int]:
    """Calculate maximal marginal relevance."""
    if min(k, len(embedding_list)) <= 0:
        return []
    if query_embedding.ndim == 1:
        query_embedding = np.expand_dims(query_embedding, axis=0)
    similarity_to_query = cosine_similarity(query_embedding, embedding_list)[0]
    # Start with the document that is most similar to the query
    most_similar = int(np.argmax(similarity_to_query))
    idxs = [most_similar]
    selected = np.array([embedding_list[most_similar]])
    while len(idxs) < min(k, len(embedding_list)):
        best_score = -np.inf
        idx_to_add = -1
        similarity_to_selected = cosine_similarity(embedding_list, selected)
        for i, query_score in enumerate(similarity_to_query):
            if i in idxs:
                continue
            # Penalize candidates that are close to something already selected
            redundant_score = max(similarity_to_selected[i])
            equation_score = (
                lambda_mult * query_score - (1 - lambda_mult) * redundant_score
            )
            if equation_score > best_score:
                best_score = equation_score
                idx_to_add = i
        idxs.append(idx_to_add)
        selected = np.append(selected, [embedding_list[idx_to_add]], axis=0)
    return idxs
5. After getting the relevant data, we put everything in our prompt, organized by ticker symbol and year, and send it to LLM-2 to generate the final answer
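One way the retrieved chunks could be organized by ticker and year before they go into the prompt is sketched below. This is an assumption about the grouping step, not the project's actual code: `build_context` is a hypothetical helper, the chunk texts are placeholders, and the ticker and year are read back from the `full_metadata` string in the format shown earlier.

```python
from collections import defaultdict


def build_context(documents: list, metadatas: list) -> str:
    """Group chunk texts under a "Ticker/Year" header (hypothetical helper)."""
    grouped = defaultdict(list)
    for text, meta in zip(documents, metadatas):
        # full_metadata looks like "AAPL_2022_RISK_FACTORS_10-K"
        ticker, year, _rest = meta["full_metadata"].split("_", 2)
        grouped[(ticker, year)].append(text)
    blocks = []
    for (ticker, year), texts in sorted(grouped.items()):
        blocks.append(f"Ticker: {ticker}, Year: {year}\n" + "\n".join(texts))
    return "\n\n".join(blocks)


# Placeholder chunks standing in for retrieved filing text
documents = ["chunk about 2022 risks", "chunk about 2021 risks", "another 2022 chunk"]
metadatas = [
    {"full_metadata": "AAPL_2022_RISK_FACTORS_10-K"},
    {"full_metadata": "AAPL_2021_RISK_FACTORS_10-K"},
    {"full_metadata": "AAPL_2022_MANAGEMENT_AND_DISCUSSION_10-K"},
]
context = build_context(documents, metadatas)
print(context)
```

The resulting string can then be passed as the relevant documents for LLM-2, so that figures from different companies and years are less likely to be mixed up in the answer.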
from langchain.chat_models import ChatOpenAI  # import path may differ across LangChain versions
def get_response_llm2(relevant_sentences, user_query, llm1_output_dict):
    llm2_template = """
You are a financial statement analyst with a strong understanding of financial documents and fundamental analysis. Base your answer only on the following relevant documents: \n
{relevant_documents} \n\n
Answer the user question {user_query}\n
Don't make up any information, and if the relevant information is not present, then just give the most similar answer to the user query from the relevant documents and politely give a warning that the information that the user is looking for, may not be in the documents\n
Also, include all the relevant numerical figures\n
Recheck your answer so that it is more coherent with what user is asking\n
"""
    llm2_prompt_template = PromptTemplate(
        input_variables=["relevant_documents", "user_query"], template=llm2_template
    )
    # user_query = llm1_output_dict["augmented_query"]
    # user_query = query_1+query_2
    # Fill the prompt with the retrieved context and the original question
    llm2_prompt = llm2_prompt_template.format(
        user_query=user_query, relevant_documents=relevant_sentences
    )
    openai.api_key = os.environ["OPENAI_API_KEY"]
    llm_2 = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo-16k", streaming=True)
    output = llm_2.predict(llm2_prompt)
    return output
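To make the effect of the MMR step in point 4 more concrete, here is a small toy example. It is a dependency-free re-implementation of the same scoring rule, written only for illustration; the names `mmr_select`, `cosine`, and `unit` are hypothetical and the 2-D vectors are made up rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query, docs, lambda_mult=0.5, k=2):
    """Greedily pick docs that are relevant to the query but not redundant."""
    if not docs or k <= 0:
        return []
    relevance = [cosine(query, d) for d in docs]
    # Start with the most relevant document
    selected = [max(range(len(docs)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(docs)):
        best_idx, best_score = -1, -math.inf
        for i, rel in enumerate(relevance):
            if i in selected:
                continue
            redundancy = max(cosine(docs[i], docs[j]) for j in selected)
            score = lambda_mult * rel - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


def unit(deg):
    """Unit vector at the given angle, used as a toy embedding."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


query = unit(0)
docs = [unit(20), unit(22), unit(-30)]  # docs 0 and 1 are near-duplicates
balanced = mmr_select(query, docs, lambda_mult=0.5, k=2)
relevance_only = mmr_select(query, docs, lambda_mult=1.0, k=2)
print(balanced, relevance_only)
```

With `lambda_mult=0.5` the selection is `[0, 2]`: the near-duplicate is skipped in favor of the less similar document. With `lambda_mult=1.0` the redundancy penalty disappears and the selection is `[0, 1]`, which is plain similarity ranking.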






