Gemma 2B: A Deep Dive with Basic and Advanced Retrievers in RAG
A multi-dimensional evaluation of Gemma vs TinyLlama for resource-constrained systems
A couple of weeks ago, Google introduced the Gemma family of lightweight language models in two sizes, Gemma 2B and 7B. They are claimed to achieve best-in-class performance for their sizes compared to other open-source models, and are primarily targeted at consumer-grade machines. In Google’s own tests, Gemma significantly surpassed some larger models on certain key performance benchmarks, as shown in Fig. 1 [1].
In this article, we will utilize Gemma 2B in a Retrieval-Augmented Generation (RAG) setup for question-answering against documents the model would not have seen during its training. RAG enhances a model’s response by drawing on an authoritative knowledge source outside its training data. It consists of a retriever system that fetches relevant document snippets, which Gemma then uses as context to shape its generation. We will evaluate how well Gemma uses the context from a basic retriever as well as an advanced retriever across a number of performance dimensions. In addition, its performance will be compared against another LLM similarly aimed at resource-constrained systems, namely TinyLlama 1.1B.
Table of Contents
1.0 Environment Setup
2.0 Design and Implementation
2.1 Module LoadVectorize
2.2 Module LLMPerfMonitor
2.3 The main Module
3.0 Initial Tests
3.1 Response Accuracy of Gemma 2B vs TinyLlama 1.1B
4.0 RAG Powered by Ensemble Retriever
4.1 Performance Comparison of Gemma vs TinyLlama with Advanced Retriever
5.0 Final Thoughts
1.0 Environment Setup
This experimentation will be conducted on a MacBook Air M1 with 8GB RAM. The version of Python used here is 3.10.5. Let’s first create a virtual environment to manage this project. To create and activate the environment, run the following:
python3.10 -m venv mychat
source mychat/bin/activate
We can now proceed to install all the required libraries:
pip install langchain faiss-cpu sentence-transformers flask-sqlalchemy psutil unstructured pdf2image unstructured_inference pillow_heif opencv-python pikepdf pypdf
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
The second command above installs the llama-cpp-python library with Metal support, which will be used to load the LLM with hardware acceleration on the M1 processor. With Metal, the computation runs on the GPU.
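As a quick sanity check after installation, the short script below (a minimal sketch; it reuses the Gemma GGUF model path referenced later in this article) loads the model with all layers offloaded to the GPU. With verbose=True, llama.cpp prints Metal initialization messages, confirming that hardware acceleration is active.
# check_metal.py - sanity check for the Metal-enabled llama-cpp-python build
from llama_cpp import Llama

llm = Llama(
    model_path="./models/gemma_2b/gemma-2b-it-q4_k_m.gguf",  # same quantized model used later
    n_gpu_layers=-1,  # offload all layers to the GPU via Metal
    n_ctx=2048,
    verbose=True,     # Metal initialization lines in the log confirm GPU offload
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])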
Now that the environment is ready, let’s take a look at our RAG system design, followed by its implementation.
2.0 Design and Implementation
There are three modules for this QA system as illustrated in Fig. 2.
The two highlighted components in this diagram will be revisited for performance improvement later in this article. The role of each module is as follows:
- The first module on the left involves loading and vectorizing an online pdf document.
- The smaller module on the right serves as a helper module. It facilitates an objective measurement of system performance across a set of questions by computing cosine similarity and model response time, as well as measuring system memory utilization.
- The module in the center, the main module, involves loading the LLM and instantiating a FAISS retriever, followed by the creation of a retrieval chain encompassing the LLM, the retriever and a custom prompt.
Let’s now take a deeper dive into key implementation decisions and the code.
2.1 Module LoadVectorize
This module comprises 3 functions:
- Function load_doc handles the loading of an online pdf document, initially splitting it at 100 characters per chunk with an overlap of 20 characters, and returns the document list.
- Function vectorize calls the above load_doc to get a chunked list of documents, creates embeddings, commits them to the local directory opdf_index and returns the FAISS object.
- Function load_db checks whether a FAISS vectorstore already exists on disk within directory opdf_index and attempts to load it. Otherwise, it invokes the previous function vectorize to load and vectorize the document. It finally returns a FAISS object.
This module was introduced in our earlier work [2] and is initially used here as is.
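Since its full listing is not repeated in this article, a minimal sketch of the module is shown below for reference. It assumes LangChain’s OnlinePDFLoader, RecursiveCharacterTextSplitter and FAISS wrappers together with a hypothetical document URL; the actual code in [2] may differ in its details.
# LoadVectorize.py - minimal sketch of the module described above
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_doc() -> list:
    # load an online pdf document and split it into chunks
    loader = OnlinePDFLoader("https://example.com/deployment-guide.pdf")  # hypothetical URL
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
    return text_splitter.split_documents(documents)

def vectorize(embeddings) -> FAISS:
    # embed the chunks, persist the FAISS index to disk and return it
    docs = load_doc()
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("./opdf_index")
    return db

def load_db() -> FAISS:
    # reuse an existing index under opdf_index if present, otherwise build it
    embeddings = HuggingFaceEmbeddings()  # default sentence-transformers model
    try:
        return FAISS.load_local("./opdf_index", embeddings)
    except Exception:
        return vectorize(embeddings)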
2.2 Module LLMPerfMonitor
This module also comprises 3 functions:
- Function get_questions_answers reads from a file on disk containing a list of questions along with their expected answers. These answers will be used together with the responses returned by the model to compute their cosine similarity.
- Function calc_similarity accepts two string arguments. Both strings are first converted to tensor embeddings. Function util.pytorch_cos_sim of sentence_transformers then computes the cosine similarity cos_sim(a[i], b[j]) for all i and j. If the sentences are semantically very similar, their cosine similarity will be close to 1; if they are unrelated, it will be close to 0.
- Function get_mem_cpu_util accepts the parent process id (ppid) as an argument, which is used with psutil to obtain the process’s memory usage, specifically the RSS (resident set size) representing its physical RAM usage, as well as the total CPU utilization.
The full code listing of this module is shown below.
# LLMPerfMonitor.py
from sentence_transformers import SentenceTransformer, util
import os, psutil
import nltk
from nltk.translate import meteor
from nltk import word_tokenize

def get_questions_answers() -> list[str]:
    # returns a list of questions interleaved with answers
    with open("sh_qa_list.txt") as qfile:
        lines = [line.rstrip()[3:] for line in qfile]
    return lines

def calc_similarity(sent1,sent2) -> float:
    # creates embeddings, computes cosine similarity and returns the value
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    # Compute embedding for both strings
    embedding_1 = model.encode(sent1, convert_to_tensor=True)
    embedding_2 = model.encode(sent2, convert_to_tensor=True)
    return util.pytorch_cos_sim(embedding_1, embedding_2).item()

def get_mem_cpu_util(ppid) -> tuple[float,float]:
    # collects RSS in GB, total CPU util and returns them
    process = psutil.Process(ppid) # parent process
    mem_usage = process.memory_info().rss / 1024 ** 3 # in GB
    cpu_usage = sum(psutil.cpu_percent(percpu=True))
    return mem_usage,cpu_usage
2.3 The main Module
The main module primarily defines the QA pipeline. In addition, to facilitate resource usage tracking of this app, the measurements must be collected from a separate thread. Accordingly, this module implements a simple multi-threaded design where the main thread runs the entire LangChain pipeline, and a single child thread is spawned to collect its parent’s stats at a regular interval and store them in a shared data structure. Figure 3 depicts a thread diagram capturing this parallel-processing design.
The parent spawns a child thread, passing its pid, a shared data structure (a list object) and a threading.Event object. The Event object is used to decide when the child exits. While the child is running, it collects the memory and CPU utilization and appends them to shared_list at 1-second intervals. The parent runs through the QA chain for each query and uses the data collected by the child to compute the stats. When the parent finishes running through all questions, it sets the Event object and waits for the child to terminate. Once the child sees that the Event is set, it exits its while loop within at most 1 second, and then the parent exits.
For the QA pipeline, this module first defines the following prompt template for Gemma:
<bos><start_of_turn>user
{context}
{question}<end_of_turn>
<start_of_turn>model
To further reduce the LLM memory footprint, we will adopt a quantized version of Gemma from lmstudio-ai’s HuggingFace repo [3]. A quantized model essentially uses fewer bits for the model’s parameters. To load the quantized LLM in the GGUF format, LlamaCpp is used. Using the FAISS object returned by module LoadVectorize, a FAISS retriever is created as the basic retriever for our initial tests. With the above objects, the RetrievalQA chain is then instantiated and used for questioning.
The full listing of the main module is given next.
# main.py
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import LlamaCpp
from langchain_community.embeddings import HuggingFaceEmbeddings
import LoadVectorize
import LLMPerfMonitor
import threading
import os
import time
import timeit

# child operation
def monitor_thread(event, ppid, shared_list):
    while not event.is_set():
        mem,cpu = LLMPerfMonitor.get_mem_cpu_util(ppid) # sample parent memory and CPU utilization
        shared_list += [mem,cpu]
        time.sleep(1)

def main():
    event = threading.Event() # Create an event object
    shared_list = [] # Create a shared list object
    child = threading.Thread(target=monitor_thread, args=(event,os.getpid(),shared_list))
    child.start()

    # create retriever
    db = LoadVectorize.load_db()
    faiss_retriever = db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}, max_tokens_limit=1000)
    # list of questions and answers
    qa_list = LLMPerfMonitor.get_questions_answers()

    # Prompt template
    qa_template = """
<bos><start_of_turn>user
{context}
{question}<end_of_turn>
<start_of_turn>model
"""
    # Create a prompt instance
    QA_PROMPT = PromptTemplate.from_template(qa_template)
    llm = LlamaCpp(
        model_path="./models/gemma_2b/gemma-2b-it-q4_k_m.gguf",
        temperature=0.01,
        max_tokens=2000,
        top_p=1,
        verbose=False,
        n_ctx=2048
    )
    # Custom QA Chain
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=faiss_retriever,
        chain_type_kwargs={"prompt": QA_PROMPT}
    )

    print('model;question;cosine;resp_time;memory_util;cpu_util')
    # questioning, stats calculation loop
    for i,query in enumerate(qa_list[::2]):
        start = timeit.default_timer()
        result = qa_chain({"query": query})
        resp_time = timeit.default_timer() - start # seconds
        avg_mem = sum(shared_list[::2])/len(shared_list[::2])
        avg_cpu = sum(shared_list[1::2])/len(shared_list[1::2])
        shared_list.clear()
        cos_sim = LLMPerfMonitor.calc_similarity(qa_list[i*2+1],result["result"])
        print(f'gemma_2b;Q{i+1};{cos_sim:.5};{resp_time:.2f};{avg_mem:.2f};{avg_cpu:.2f}')

    event.set() # Set the event to signal the child thread to terminate
    child.join() # Wait for the child thread to finish

if __name__ == "__main__":
    main()
For TinyLlama, the identical system architecture will be employed. The only module needing minor code changes is the main module, where LlamaCpp’s model_path attribute and the qa_template variable are the only differences.
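As an illustration, the TinyLlama-specific lines might look like the sketch below. It assumes a quantized TinyLlama 1.1B Chat GGUF file (the path is a placeholder) and the Zephyr-style chat template used by the TinyLlama chat model; the system message wording is an assumption.
# main.py changes for TinyLlama (sketch; the model path is a placeholder)
qa_template = """<|system|>
You are a helpful assistant. Answer the question using only the given context.</s>
<|user|>
{context}
{question}</s>
<|assistant|>
"""

llm = LlamaCpp(
    model_path="./models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # placeholder path
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    verbose=False,
    n_ctx=2048
)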
3.0 Initial Tests
For our preliminary tests, let’s vary the chunk size and overlap parameters to determine the configuration with the best accuracy on the test machine. The chunk size used by TextSplitter as part of the vectorization step was varied across 100, 250, 500, 750 and 1,000 characters, with the overlap parameter always kept at 20% of the chunk size. For each chunk size, the RAG system was tested against 10 different questions. In terms of response accuracy, chunk size 500 returned the best result.
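The sweep itself can be expressed compactly. The sketch below shows one way to parameterize the splitter for each configuration, assuming the RecursiveCharacterTextSplitter used during vectorization; the evaluation loop around it is omitted.
# chunking sweep sketch: overlap fixed at 20% of the chunk size
from langchain.text_splitter import RecursiveCharacterTextSplitter

for chunk_size in (100, 250, 500, 750, 1000):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(0.2 * chunk_size),
    )
    # docs = splitter.split_documents(documents)
    # ...rebuild the FAISS index and run the 10-question evaluation for this configuration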
As an example, the following was question #8:
What are SteelHead cli commands to use to avoid using the GRE encapsulation method for the traffic return path from the SteelHead when it is layer-2 separated with the WCCP router?
The following was the LLM’s response at chunk size 100, which is incorrect:
Sure, here are the SteelHead CLI commands to use to avoid using the GRE encapsulation method for the traffic return path from the SteelHead when it is layer-2 separated with the WCCP router: **1. Override GRE encapsulation:** ``` set wccp override-return yes ``` **2. Disable GRE encapsulation for specific protocols:** …
And at chunk size 500, this LLM’s response was as below:
The SteelHead CLI commands to use to avoid using the GRE encapsulation method for the traffic return path from the SteelHead when it is layer-2 separated with the WCCP router are:
* `wccp override-return route-no-gre` * `wccp override-return-sticky-no-gre`
which is the correct answer! This is a big win for Gemma, as the same question was posed to many other models in the past (not part of the current study) and they typically failed miserably.
In addition, the system also had the lowest response time at this vectorization configuration as shown in Fig. 4. This plot is shown with a 95% confidence interval.
Based on this experiment outcome, chunk size 500 and the corresponding overlap of 100 were adopted for all subsequent tests.
To show how the vector space changes when these parameters are varied, Fig. 5 depicts the embeddings visualization created using the renumics-spotlight library. If you are interested in this visualization and how to create it, feel free to check out our earlier work in this space [2].
It is evident that changing these parameters has a significant impact on the vector space. On the left, with a chunk size of 100 characters, the embeddings look like a single dense cluster with a few outliers. At the other end, with a chunk size of 1,000 characters, the embeddings still form a single cluster, but with varying local densities and likely a different set of nearest neighbors for each vector.
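For completeness, a minimal sketch of how such a visualization can be produced is shown below. It assumes the FAISS index built by LoadVectorize and illustrative DataFrame column names; the exact steps used in [2] may differ.
# embeddings visualization sketch, assuming the FAISS index built by LoadVectorize
import pandas as pd
from renumics import spotlight
import LoadVectorize

db = LoadVectorize.load_db()
vectors = db.index.reconstruct_n(0, db.index.ntotal)  # raw embedding vectors
ids = [db.index_to_docstore_id[i] for i in range(db.index.ntotal)]
texts = [db.docstore.search(doc_id).page_content for doc_id in ids]

df = pd.DataFrame({"text": texts, "embedding": list(vectors)})  # illustrative column names
spotlight.show(df, dtype={"embedding": spotlight.Embedding})    # opens an interactive viewer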
3.1 Response Accuracy of Gemma 2B vs TinyLlama 1.1B
For this first performance comparison, let’s look at their response accuracy. The following is one of the questions used, along with the corresponding expected answer:
Q5: What is the purpose of the “steelhead communication multi-interface enable” command? Answer: It allows all SteelHead neighbor in-path interface IP addresses to be visible to each peer. This visibility ensures neighbor communication even if an interface fails.
Gemma LLM responded with the following:
The purpose of the “steelhead communication multi-interface enable” command is to allow all SteelHead neighbor in-path interface IP addresses to be visible to each peer. This ensures that if one interface fails, the SteelHead can still communicate with its neighbors through other available interfaces.
whereas TinyLlama LLM responded as follows:
The “steelhead communication multi-interface enable” command is a configuration option that enables multiple in-path interfaces for connection forwarding on a SteelHead device. This command provides a level of interface redundancy and ensures that neighbor communication can be established if an interface fails. The purpose of this configuration option is to provide a level of interface redundancy and improve the reliability of connection forwarding on the SteelHead device.
Both models came up with accurate responses! Here is another question that was posed:
Q6: What is the impact of using full address transparency for SteelHead optimized connections? Answer: Full address transparency preserves your client and server IP addresses and port numbers in the TCP/IP header fields for optimized traffic in both directions across the WAN. VLAN tags can also be preserved.
In this instance, Gemma responded that it was unable to answer with the given context, as shown below:
The passage does not specify the impact of using full address transparency for SteelHead optimized connections, so I cannot answer this question from the provided context.
In contrast, TinyLlama was able to provide a response with the same context, but the response was partially inaccurate, as shown below (an extract; point 2 is incorrect):
Using full address transparency for SteelHead optimized connections can have several positive impacts on network performance and efficiency. Here are some key benefits: 1. Improved network visibility: With full address transparency enabled, SteelHeads can now see the original client and server IP addresses in the packet header, which can help improve network visibility and troubleshooting. 2. Reduced packet loss:…
Before proceeding with a broader performance comparison between the two LLMs, let’s first look at improving the retrieval performance.
4.0 RAG Powered by Ensemble Retriever
LangChain has an advanced retriever called EnsembleRetriever that accepts a list of retrievers as input. It ensembles and reranks their results to improve RAG performance. By leveraging the strengths of the FAISS and BM25 retrievers at a 3:7 ratio, we were previously able to achieve the best response accuracy, as reported in an earlier work [4].
Using the same system architecture shown in Fig. 2, we just need to slightly modify module LoadVectorize and the main module to cater for this advanced retriever. Function vectorize of module LoadVectorize will include new steps to instantiate a BM25 retriever:
def vectorize(embeddings) -> tuple[FAISS,BM25Retriever]:
    docs = load_doc()
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("./opdf_index")
    bm25_retriever = BM25Retriever.from_documents(docs)  # <<< new
    bm25_retriever.k = 5                                 # <<< new
    return db,bm25_retriever
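Since the BM25 retriever is built in memory from the raw document chunks and is not persisted alongside the FAISS index, load_db presumably needs a matching tweak so that it also returns a BM25 retriever. A sketch under that assumption is shown below (BM25Retriever comes from langchain_community.retrievers and requires the rank_bm25 package).
def load_db() -> tuple[FAISS,BM25Retriever]:
    # sketch: rebuild the BM25 retriever from the chunked documents on every run,
    # while still reusing the on-disk FAISS index when it exists
    embeddings = HuggingFaceEmbeddings()
    docs = load_doc()
    bm25_retriever = BM25Retriever.from_documents(docs)
    bm25_retriever.k = 5
    try:
        db = FAISS.load_local("./opdf_index", embeddings)
    except Exception:
        db = FAISS.from_documents(docs, embeddings)
        db.save_local("./opdf_index")
    return db,bm25_retriever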
Within the main module, lines to instantiate an EnsembleRetriever are added as per below:
db,bm25_r = LoadVectorize.load_db()
faiss_retriever = db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}, max_tokens_limit=1000)
r = 0.3
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_r,faiss_retriever],weights=[r,1-r])
This ensemble retriever will subsequently be used in the QA chain. Those are the only changes needed for this QA system to employ EnsembleRetriever.
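For concreteness, the RetrievalQA chain from the main module is then instantiated with the ensemble retriever in place of the basic FAISS retriever:
# the QA chain now receives the ensemble retriever instead of the FAISS retriever
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=ensemble_retriever,
    chain_type_kwargs={"prompt": QA_PROMPT}
)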
Access to all the code for this system is available on GitHub at the following repo:
We are now ready to look at a more comprehensive performance comparison between both LLMs.
4.1 Performance Comparison of Gemma vs TinyLlama with Advanced Retriever
In this comparative experiment, we will pose the same 10 questions against Gemma 2B and TinyLlama 1.1B powered by the advanced retriever, while capturing their response accuracy, response time, memory and CPU utilization for performance comparison.
Recall that when Gemma was fed context from the basic FAISS retriever, it failed to provide a response for question #6. When powered by the ensemble retriever, Gemma responded as follows:
Using full address transparency for SteelHead optimized connections provides the following benefits: * Preserves client and server IP addresses and port numbers in the TCP/IP header fields for optimized traffic in both directions across the WAN. * Preserves VLAN tags. * Optimizes traffic by avoiding potential networking risks associated with enabling full address transparency.
This is an accurate answer! And this is not the only question where the model improved: Gemma’s response accuracy doubled from 4 to 8 correct answers. In comparison, TinyLlama’s count of accurate responses increased from 3 to 5 questions.
In an attempt to capture response accuracy indirectly, Fig. 6 depicts a Treemap chart showing the sentence similarity, computed using cosine similarity, between each LLM’s responses and the sample answers across all questions. Note that even when a response is incorrect, it may have a high similarity value due to the words it contains. We adopted a Treemap to represent the results because there is an inherent hierarchy in this data, and this chart type can use both the area of a rectangle and its color shade to convey the magnitude of the chosen metric. In this chart, Gemma has more dark-shaded rectangles, and even at the LLM level it is in a darker blue shade. This implies Gemma boasts higher response accuracy than TinyLlama.
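For readers who want to reproduce such a chart, the sketch below shows one way to build the Treemap with plotly from the semicolon-delimited stats printed by main.py; the results file name and the redirection of both runs’ output into it are assumptions. Swapping the values and color columns to resp_time or memory_util yields the charts discussed next.
# Treemap sketch built from the semicolon-delimited stats printed by main.py
# assumes both runs' output was redirected into results.csv (assumed file name)
import pandas as pd
import plotly.express as px

df = pd.read_csv("results.csv", sep=";")  # model;question;cosine;resp_time;memory_util;cpu_util
fig = px.treemap(
    df,
    path=["model", "question"],           # hierarchy: LLM -> question
    values="cosine",                      # rectangle area tracks cosine similarity
    color="cosine",                       # color shade encodes the same metric
    color_continuous_scale="Blues",
)
fig.show()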
To compare the LLMs in terms of their response times, the Treemap in Fig. 7 is shown. Nearly all TinyLlama question rectangles are lightly shaded. On average, Gemma took 1.5 times longer than TinyLlama to respond to the questions. Since TinyLlama is the smaller model, it is expected to respond more quickly.
In Fig. 8, the memory utilization of Gemma and TinyLlama is depicted. The darker shades here imply lower memory usage. As before, since TinyLlama is a 1.1B model compared to Gemma’s 2B, the former has the lower memory footprint.
In summary, Gemma 2B boasts good response accuracy, which further improved with the use of an advanced retriever such as EnsembleRetriever. Compared with TinyLlama, Gemma has a 16% bigger memory footprint along with a 14-second additional delay on average, which seems to be a reasonable trade-off.
5.0 Final Thoughts
Since late last year, open-source large language models have experienced a giant jump in accuracy. Even when such models are not trained on our internal documents, we are able to harness their capability in a Retrieval-Augmented Generation (RAG) setup. It is, however, not feasible to run large models in a resource-constrained environment. TinyLlama 1.1B is one promising model in this space.
In this article, we took a comprehensive look at Gemma 2B. When coupled with the ensemble retriever in a RAG setup, this model was demonstrated to achieve more than 80% accurate answers to questions against a recent document, which was 33% better than TinyLlama. Although this improvement came at the cost of a 16% bigger memory footprint and a 1.5 times longer response delay, these seem to be reasonable trade-offs.
Thank you for reading!
References
1. Gemma: Open Models Based on Gemini Research and Technology, Google DeepMind.
2. Visualizing FAISS Vector Space to Understand its Influence on RAG Performance
3. https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF
4. Querying Internal Documents using Mistral 7B with Context from an Ensemble Retriever