Dmitrii Eliuseev


LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab

Experimenting with Large Language Models for free

Artistic representation of LangChain. Photo by Ruan Richard Rodrigues, Unsplash

Everybody knows that large language models are, by definition, large. Not so long ago, they were available only to owners of high-end hardware, or to people willing to pay for cloud access or even for every API call. Nowadays, times are changing. In this article, I will show how to run the LangChain Python library, a FAISS vector database, and a Mistral-7B model in Google Colab completely for free, and we will do some fun experiments with it.

Components

There are many articles here on TDS about using large language models in Python, but they are often not easy to reproduce. For example, many examples of using the LangChain library rely on an OpenAI class, whose first parameter (guess what?) is OPENAI_API_KEY. Other examples of RAG (Retrieval-Augmented Generation) and vector databases use Weaviate; the first thing we see after opening their website is "Pricing." Here, I will use a set of open-source libraries that can be used completely for free:

  • LangChain. It is a Python framework for developing applications powered by language models. It is also model-agnostic, and the same code can be reused with different models.
  • FAISS (Facebook AI Similarity Search). It’s a library designed for efficient similarity search and storage of dense vectors, which I will use for Retrieval Augmented Generation.
  • Mistral 7B. It is a 7.3B-parameter large language model (released under the Apache 2.0 license) which, according to its authors, outperforms Llama 2 13B on all benchmarks. It is also available on HuggingFace, so it is easy to use.
  • Last but not least, Google Colab is also an important part of this test. It provides free access to Python notebooks backed by a CPU, a 16 GB NVIDIA Tesla T4, or even an 80 GB NVIDIA A100 (though I have never seen the latter available on a free instance).

Now, let's get into it.

Install

As a first step, we need to open Google Colab and create a new notebook. The needed libraries can be installed by using pip in the first cell:

!pip install bitsandbytes accelerate xformers einops langchain faiss-cpu transformers sentence-transformers

Before running the code, we need to select a GPU runtime type (Runtime → Change runtime type → T4 GPU):

Google Colab, Screenshot by author
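Optionally, we can verify that a GPU backend is attached by running nvidia-smi in a notebook cell (a quick check, not part of the original walkthrough; the exact output depends on the GPU assigned to the instance):

!nvidia-smi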

Now, let’s import the libraries:

from typing import List
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, BitsAndBytesConfig
import torch
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.callbacks.tracers import ConsoleCallbackHandler
from langchain_core.vectorstores import VectorStoreRetriever
from langchain.memory import ConversationBufferMemory
from langchain.vectorstores import FAISS

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device:", device)
if device == 'cuda':
    print(torch.cuda.get_device_name(0))

# >>> Device: cuda
# >>> Tesla T4

If everything was done correctly, the output should show the “cuda” device and a “Tesla T4” as the selected graphics card.

The next step is the most important and resource-intensive: let’s load the language model:

orig_model_path = "mistralai/Mistral-7B-Instruct-v0.1"
model_path = "filipealmeida/Mistral-7B-Instruct-v0.1-sharded"
bnb_config = BitsAndBytesConfig(
                                load_in_4bit=True,
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16,
                               )
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(orig_model_path)

Here, I selected a 4-bit quantization mode, which allows the model to fit into GPU RAM. There is also another tricky part. Loading the original "mistralai/Mistral-7B-Instruct-v0.1" model causes the Colab instance to crash. Surprisingly, the GPU RAM is enough for 4-bit quantization, but the model files are about 16 GB in size, and there is just not enough "normal" RAM on a free Colab instance to quantize the model before loading it into the GPU! As a workaround, I used a "sharded" version, which is split into 2 GB chunks (if your PC or Colab instance has more than 16 GB of RAM, this is not required).
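As a rough sanity check (a minimal sketch, not from the original walkthrough), we can print the memory footprint of the quantized model and the GPU memory currently allocated; get_memory_footprint() is a standard transformers helper:

# Approximate size of the quantized model and current GPU allocation
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")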

As a side note, readers who wish to know more about how 4-bit quantization works are welcome to read another article:

If everything was done correctly, the Colab output should look like this:

Loading the Mistral 7B model, Screenshot by author

As we can see from the picture, the files to be downloaded are huge, so if you run this code locally (not in Colab), make sure that your internet traffic is not limited.

Now, let’s create the LLM pipeline:

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=100,
)
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

Congratulations! Our installation is ready, and we successfully loaded a 7B language model. For a short test, let’s see if the LLM works:

text = "What is Mistral? Write a short answer."
mistral_llm.invoke(text)

#> Mistral is a type of cold front that forms over the Mediterranean 
#> Sea and moves eastward across southern Europe, bringing strong winds 
#> and sometimes severe weather conditions such as heavy rainfall, hail, 
#> and even tornadoes.

If everything is okay, we’re ready to have fun and do further tests.

LangChain

LangChain is a Python framework specially designed to work with language models. As a warm-up, let’s test a prompt template:

from langchain.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    "Tell me a {adjective} joke about {content}."
)
prompt.format(adjective="funny", content="chickens")

llm_chain = prompt | mistral_llm
llm_chain.invoke({"adjective": "funny", "content": "chickens"})

#> Why don't chickens like to tell jokes? They might crack each other
#> up and all their eggs will scramble!

Interestingly, LangChain is "cross-platform," and we can use different language models without code changes. This example was taken from the official library documentation, where OpenAI is used for prompts, but I used the same template with Mistral.
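As a sketch of this model-agnosticism (not part of the original code, and it requires a paid OpenAI API key, which this article deliberately avoids), the same chain could be composed with a different backend simply by swapping the LLM object:

from langchain.llms import OpenAI

# Hypothetical example: the same "prompt | llm" composition with another backend
openai_llm = OpenAI(openai_api_key="sk-...")  # placeholder key
llm_chain = prompt | openai_llm
llm_chain.invoke({"adjective": "funny", "content": "chickens"})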

How does it work? We can add a ConsoleCallbackHandler to the config to see all the intermediate steps:

llm_chain = prompt | mistral_llm
llm_chain.invoke({"adjective": "funny", "content": "chickens"},
                 config={'callbacks': [ConsoleCallbackHandler()]})

Then, the output will look like this:

[1:chain:RunnableSequence] Entering Chain run with input:
{
  "adjective": "funny",
  "content": "chickens"
}

[1:chain:RunnableSequence > 2:prompt:PromptTemplate] Entering Prompt run with input:
{
  "adjective": "funny",
  "content": "chickens"
}

[1:chain:RunnableSequence > 3:llm:HuggingFacePipeline] Entering LLM run with input:
{
  "prompts": [
    "Tell me a funny joke about chickens."
  ]
}

[1:chain:RunnableSequence > 3:llm:HuggingFacePipeline] [3.60s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "\n\nWhy don't chickens like to tell jokes? They might crack each other up and all their eggs will scramble!",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}

As another example, let’s try a ChatPromptTemplate class:

from langchain.prompts import ChatPromptTemplate

chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful AI bot. Your name is {name}. Answer with short sentences."),
        ("human", "{user_input}"),
    ]
)

llm_chain = chat_prompt | mistral_llm
llm_chain.invoke({"name": "Mistral", "user_input": "What is your name?"})

#> Mistral: Yes, I am Mistral. How can I assist you today?

In my opinion, the answer "Yes, I am Mistral" is acceptable but linguistically not the best response to a "What is your name?" question. Obviously, with large neural networks, interpretability is an issue, and it is impossible to say exactly why the model responded one way or another. It could be an artifact of the 4-bit quantization (which slightly reduces the model quality) or just a fundamental limitation of a 7B model (obviously, larger 33B or 70B models could perform better, but they would require much more computational resources).

Retrieval-augmented generation (RAG)

Nowadays, RAG is a hot research topic. It allows us to automatically add external documents to the LLM prompt and provide the model with extra information without fine-tuning it. Let's see how we can use it with LangChain and Mistral.

First, we need to create a separate embedding model:

from langchain.embeddings import HuggingFaceEmbeddings


embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-l6-v2",
    model_kwargs={"device": "cuda"},
)

This small sentence-transformer model is able to convert text strings into a vector representation; we will use it for our vector database. As a toy example, I will add only one “document” to the array:

db_docs = [
    "Airbus's registered headquarters is located in Leiden, Netherlands.",
]
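As a quick sanity check (a small sketch, not in the original walkthrough; embed_query is the standard method of the LangChain embeddings classes, and 384 is the known output dimension of all-MiniLM-L6-v2), we can embed one of these strings and look at the resulting vector:

# Convert a single string into a dense vector
vector = embeddings.embed_query(db_docs[0])
print(len(vector))
# >>> 384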

Then we need to create a vector database and a VectorStoreRetriever object:

from langchain.vectorstores import FAISS
from langchain_core.vectorstores import VectorStoreRetriever


vector_db = FAISS.from_texts(db_docs, embeddings)
retriever = VectorStoreRetriever(vectorstore=vector_db)

Now, we can create a RetrievalQA object, which is specially designed for question-answering:

template = """You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
              {context}
              If you don't know the answer, just say that you don't know, don't try to make up an answer.
              Chat history: {history}
              Question: {question}
              Write your answers short. Helpful Answer:"""

prompt = PromptTemplate(
        template=template, input_variables=["history", "context", "question"]
    )
qa = RetrievalQA.from_chain_type(
        llm=mistral_llm,
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={
            "verbose": False,
            "prompt": prompt,
            "memory": ConversationBufferMemory(
                memory_key="history",
                input_key="question"),
        }
    )

I will ask the model some questions about Airbus; the answers are most likely unknown to the model:

qa.run("Hi, who are you?")
#> I am an AI assistant.

qa.run("What is the range of Airbus A380?")
#> The range of Airbus A380 is approximately 12,497 nautical miles.

qa.run("What is the tire diameter of Airbus A380 in centimeters?")
#> I don't know.

I was positively surprised by the answers. First, the Mistral 7B model was already aware of the Airbus A380 range (I checked with Google, and the result looks correct). Second, as I expected, the model was not aware of the A380 tire diameter, but instead of producing a "hallucinated," incorrect response, it "honestly" answered "I don't know."

Now, let’s add an additional string to our “vector database”:

db_docs = [
    "Airbus's registered headquarters is located in Leiden, Netherlands.",
    "The Airbus A380 has the largest commercial plane tire size in the world at 56 inches in diameter."
]

Then we can try again:

qa.run("What is the tire diameter of Airbus A380 in centimeters? Write a short answer.")
#> 142 cm

This was amazing! The model not only found the information that the A380 tire diameter is 56 inches, but it also correctly converted it into centimeters (56 × 2.54 is indeed about 142). We know that math tasks are usually hard for LLMs, so this accuracy is surprising.
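For reference, the conversion is a one-liner in plain Python (the exact value is 142.24 cm, so the model's answer is rounded but essentially correct):

print(56 * 2.54)
# >>> 142.24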

We can also ask a model to explain the answer in steps:

qa.run("What is the tire diameter of Airbus A380 in centimeters? Explain the answer in three steps.")
#> 1. The tire diameter of Airbus A380 is 56 inches in diameter.
#> 2. To convert 56 inches to centimeters, we need to multiply it by 2.54 (the conversion factor from inches to centimeters).
#> 3. Therefore, the tire diameter of Airbus A380 in centimeters is 142.16 cm.

This is great! Well, we are already used to the fact that large language models like GPT-3 or GPT-4 run on supercomputers in the cloud and can produce amazing results. But seeing this on your own GPU (I tested this code in Google Colab and on my home PC as well) is a completely different feeling.

Attentive readers may ask: how do the Mistral and retriever models work together? Indeed, I created a "Mistral-7B-Instruct-v0.1" model and an "all-MiniLM-l6-v2" sentence-embedding model. Are their vector spaces compatible? The answer is "no." When we run a query, the VectorStoreRetriever first does its own search, finds the best documents in the vector store, and returns those documents in plain-text format. We can see the final prompt if we change the verbose parameter to True:

template = """You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
              {context}
              If you don't know the answer, just say that you don't know, don't try to make up an answer.
              Chat history: {history}
              Question: {question}
              Write your answers short. Helpful Answer:"""

prompt = PromptTemplate(
        template=template, input_variables=["history", "context", "question"]
    )
qa = RetrievalQA.from_chain_type(
        llm=mistral_llm,
        chain_type="stuff",
        retriever=retriever,
        chain_type_kwargs={
            "verbose": True,
            "prompt": prompt,
            "memory": ConversationBufferMemory(
                memory_key="history",
                input_key="question"),
        }
    )

After running the same code, we can see the actual prompt that LangChain sent to the model:

You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
The Airbus A380 has the largest commercial plane tire size in the world at 56 inches in diameter.
Airbus's registered headquarters is located in Leiden, Netherlands.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Chat history: Human: Hi, who are you?
AI:  I am an AI assistant.
Human: What is the range of Airbus A380?
AI:  The range of Airbus A380 is approximately 12,497 nautical miles.
Question: What is the tire diameter of Airbus A380 in centimeters? Explain the answer in three steps.
Write your answers short. Helpful Answer:

In this case, both documents were relevant to the “Airbus” question, and a VectorStoreRetriever placed them in the context placeholder.
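To see this retrieval step in isolation (a small sketch, assuming the retriever's standard get_relevant_documents method; it is not part of the original code), we can query the retriever directly and print the documents it would place into the context:

docs = retriever.get_relevant_documents("What is the tire diameter of Airbus A380?")
for doc in docs:
    print(doc.page_content)
# The stored strings are returned as plain text and stuffed into {context}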

Conclusion

In this article, we were able to run a 7.3B-parameter Mistral large language model on a free Google Colab instance, using only free and open-source components. This is a great achievement and also a generous step from Google, considering that at the time of this writing, the cheapest 16 GB video card on Amazon costs at least $500 (I must admit, though, that the Google Colab service is not pure charity, and the free GPU backend may not be available 100% of the time; those who need it often should consider buying a paid subscription). We were also able to use retrieval-augmented generation to add extra information to the LLM prompt. Whether models like this are ready for production is still an open question, and an eternal "Buy vs. DIY" dilemma. The Mistral 7B model can still sometimes "hallucinate" and produce incorrect answers, and it can be outperformed by larger models. Anyway, the ability to test models like this for free is great for study, self-education, and prototyping.

Those who are interested in using language models and natural language processing are also welcome to read other articles:

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. If you want to get the full source code for this and my next posts, feel free to visit my Patreon page.

Thanks for reading.
