A gentle introduction to open-source LLMs and how to use them

Background generated using “cardosAnime_v10”. Fonts: Bankai and Hakubo

Thanks to advances in quantization, we can now run large language models for inference on consumer GPUs with very acceptable performance. One question everyone faces when selecting a model is “Which one is best for my requirements?”. Popularity is a reasonable starting point, but your use case is not necessarily aligned with everyone else’s. And to add insult to injury, there are a great many models to choose from.

The number of options is overwhelming because there are choices to make along several dimensions: the base foundation model and its size, the fine-tuning applied on top of the foundation model, the type of quantization and the dataset used for it, and the engine used to load the model. In most cases, you can combine these options with each other. In this post, I’ll go over some of the options we have and help you understand what each means.

Choosing a task

Before we talk about any code or model, we need to settle on the task we want to work on. While we could use a general-purpose large language model capable of answering questions across a wide range of tasks, such models need a larger number of parameters and, as a result, data-center hardware to run. On the other hand, smaller models fit into consumer GPUs, but they might not be good at every task at once. This is fine, since we can keep separate smaller models, each an expert in a different task, and switch between them as needed.

To name but a few of such tasks, we could talk about coding, medicine, role play, and math, and the number of such tasks is growing by the day. For the sake of simplicity, I’ll focus on coding as our task in this post.

Choosing a model

The training of LLMs happens in two steps. First, the LLM is trained on a very large corpus of text. This step results in an LLM called a foundation model. Foundation models are not practical until they are fine-tuned for a specific task. Training a foundation model is very expensive, while fine-tuning one is much cheaper. That’s why, whenever an open-source foundation model comes out, the community fine-tunes it for every task imaginable.

When you want to choose a model, you first have to check whether an acceptable fine-tuned model for your task already exists. Fine-tuning is very accessible these days, so fine-tuning an open-source foundation model yourself is always an option. It will cost you money, but it’s not as expensive as you might think. The catch is that you’ll need a dataset for it. Since this is a beginner-friendly post, fine-tuning is out of scope and I’m going to assume that you can find a fine-tuned model for your task.

Settling on coding narrows down our model selection. At the time of writing this post, the most advanced publicly available foundation model for coding is Code Llama, which is a variation of Llama 2, both from Meta. Code Llama comes in three sizes: 7B, 13B, and 34B. These numbers are the count of parameters in the model, in billions. While the 7B version might fit into a consumer GPU, the other two are too big for 24GB (the maximum VRAM size of a consumer GPU). This is where the concept of quantization comes into play. Without quantization, only the smallest LLMs can be loaded into a consumer GPU. But thanks to quantization, you can load a 34B model into a 24GB GPU.
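To make those sizes a bit more concrete, here is a rough back-of-the-envelope sketch. It only counts the memory needed for the weights themselves (the function name is mine, and the real footprint is larger once you add the cache and activations), so treat the numbers as a lower bound rather than an exact requirement.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3


# Compare unquantized 16-bit weights with 4-bit quantized weights.
for billions in (7, 13, 34):
    params = billions * 1e9
    fp16 = weight_memory_gib(params, 16)
    q4 = weight_memory_gib(params, 4)
    print(f"{billions}B: {fp16:.1f} GiB at 16 bits, {q4:.1f} GiB at 4 bits")
```

Under these assumptions, only the 7B model stays below 24GB at 16 bits per weight, while all three sizes do at 4 bits.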

Quantization works by sacrificing some model quality to reduce the model’s memory requirements. During training, each LLM parameter is typically stored as a 32-bit floating-point number (released weights often use 16 bits), because training needs that much numerical precision. But as it turns out, we don’t need as much precision for inference. In fact, for inference, we can get by with as little as 4 bits per parameter. Storing the parameters with fewer bits is what is referred to as quantization. This reduction helps a great deal when loading larger models on smaller hardware. At the same time, we have to accept that the model’s output quality is somewhat degraded. Still, a degraded model that you can run on a consumer GPU beats a full-precision model that cannot be loaded into your hardware, every day of the week.
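If you are curious what “storing a parameter in 4 bits” can look like, here is a deliberately naive sketch. It maps a small group of weights onto the 16 values a 4-bit integer can hold and back again. This is not how GPTQ works (GPTQ is considerably smarter and uses a calibration dataset), and the function names are made up for illustration. The point is only to show where the precision loss comes from.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float, float]:
    """Map floats to integer codes 0..15 using one scale and offset per group."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes: list[int], scale: float, lo: float) -> list[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


weights = [-0.8, -0.13, 0.0, 0.31, 0.7]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)
for w, c, r in zip(weights, codes, restored):
    print(f"original={w:+.3f}  code={c:2d}  restored={r:+.3f}")
```

Each restored value lands within half a quantization step of the original, which is the kind of small error the model has to tolerate at inference time.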

The number of bits in a quantized model doesn’t need to be exactly 4. You can find models quantized to as little as 2 bits and all the way up to 8 bits (maybe even more). But 4 bits is usually the sweet spot where the model is not degraded so much that its output becomes incomprehensible, while it still fits into your GPU’s VRAM. The bit count doesn’t even need to be an integer. In case you are wondering how we could end up with, for example, 3.5-bit quantization, that happens when the quantization algorithm uses different bit sizes for different parameters. Once you average the number of bits used across all the parameters, you get a fractional number.
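The averaging is simple enough to write down. In this small sketch, the layer sizes and bit widths are invented for illustration and do not describe any real model.

```python
def average_bits_per_weight(groups: list[tuple[int, int]]) -> float:
    """groups holds (number_of_parameters, bits_per_parameter) pairs."""
    total_bits = sum(count * bits for count, bits in groups)
    total_params = sum(count for count, _ in groups)
    return total_bits / total_params


# Half of the parameters stored in 4 bits, the other half in 3 bits.
hypothetical_model = [(1_000_000, 4), (1_000_000, 3)]
print(average_bits_per_weight(hypothetical_model))  # 3.5
```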

Going back to our model of choice, Code Llama comes with a sub-variant called Code Llama Python, which is a Code Llama model trained on Python code. The story does not end here. Phind is a variation of Code Llama Python that is further fine-tuned on even more Python code. At the time of writing this post, Phind V2 is considered the best open-source LLM for Python. And since I’m going to run it on a consumer GPU, I’ll need a quantized version of it. TheBloke is perhaps one of the best-known names in model quantization, with over 3K models quantized to this day.

All this means that I’ll be using the following model in this post:

TheBloke/Phind-CodeLlama-34B-v2-GPTQ · Hugging Face


Choosing an engine

Once we have selected our model, our next challenge is loading it. There are different library packages that you can use to load a model. Some of the more famous ones are Hugging Face’s transformers package and LlamaCpp. ExLlamaV2 is one of the new kids on the block, and my choice. Before going any further, you need to know that each of these packages supports only a subset of the model formats. This means that when you are model hunting, you have to consider which library goes with the format you pick. There might be more than one option, in which case you have to weigh which one to use.

Out of all the packages out there, LlamaCpp has one interesting feature. It can use a combination of CPU and GPU and leverage them both at the same time. This could come in handy if the model does not fit into your GPU’s VRAM. Granted, the number of tokens generated per second will take a hit, but at least you can run it.

I don’t claim that I’ve done complete research on which library package to use. But among the ones I did test, ExLlamaV2 was one of the more stable ones and it hardly ever crashed on me. It’s also fast.

GitHub - turboderp/exllamav2: A fast inference library for running LLMs locally on modern…


On top of the library packages for loading models, many applications have been developed to make using LLMs easier. The most famous ones, at the time of writing this post, are Ollama, which works only with LlamaCpp, TextGen WebUI, which supports a long list of different library packages, and TabbyAPI, which makes use of ExLlamaV2. In all these cases, you can easily load a model, sometimes through a GUI, and then use the LLM through an API or that same GUI. But I would rather use the ExLlamaV2 library package directly, since that way we can see its inner workings better.

Next stop, the inference

We have all we need to generate our first text using an LLM. And thanks to the open-source community, this can easily be done in a few lines of code:

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler,
)

# Point the config at the folder holding the downloaded model files
config = ExLlamaV2Config()
config.model_dir = "./models/TheBloke_Phind-CodeLlama-34B-v2-GPTQ/"
config.prepare()

# Load the weights; the lazy cache is allocated while the model is
# auto-split across the available GPUs
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Generate text using the default sampler settings
settings = ExLlamaV2Sampler.Settings()
question = "<<< Your input prompt goes here >>>"
answer = generator.generate_simple(question, settings, num_tokens=512)
print(answer)

The “settings” object in the above code comes with a bunch of properties that let you control the behavior of the sampler component of the LLM, which in turn impacts the generated “answer”. But what is a sampler? To answer that question, we have to look under the hood of an LLM. Each LLM comprises two components: a machine-learning model and a sampler.

The machine learning model makes up the bulk of an LLM. When you are downloading a model (for instance from Hugging Face), you are downloading the machine learning part of the LLM. The job of such a model is to take in a sequence of tokens and output a probability distribution over all possible tokens for the next position. In other words, for every token in our set of possible tokens, the model estimates the probability of it being the next one. One important aspect of such machine learning models is that they are deterministic, meaning that their output depends only on their input and nothing else. So when you feed them the same input twice, they’ll generate the same output both times. But of course, this is not what we see when we are working with a chatbot. That behavior is a result of the second component within the LLM, the sampler.

Imitating creativity — the sampler

Unlike the machine learning part of an LLM, a sampler is only code and has no weights. This means that when you are downloading a model from Hugging Face, the sampler does not come with it. Instead, the sampler is part of the library package that you use to load the model. This shows the importance of choosing the library package, since it can directly impact the quality of the generated text. But what does a sampler do? As its name implies, the sampler takes the probability distribution generated by the machine learning model and samples a single token from the token dictionary according to it. This token is then appended to the input, the extended input is passed through the same process again, and by repeating this loop our output text is generated.
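The loop described above can be sketched in a few lines of plain Python. Everything here is a toy of my own making: the “model” is a hand-written function with a seven-token dictionary instead of a neural network, and the sampler is a single call to the standard library. Still, the split of responsibilities is the same: a deterministic model produces probabilities, and a sampler with a random number generator picks the token.

```python
import random

VOCAB = ["def", " add", "(a, b)", ":", " return", " a + b", "<eos>"]


def toy_model(tokens: list[str]) -> list[float]:
    """Deterministic stand-in for the network: same input, same probabilities."""
    # Strongly prefer the next entry of VOCAB, leave a little mass elsewhere.
    likely = min(len(tokens), len(VOCAB) - 1)
    probs = [0.02] * len(VOCAB)
    probs[likely] = 1.0 - 0.02 * (len(VOCAB) - 1)
    return probs


def sample(probs: list[float], rng: random.Random) -> int:
    """The sampler: draw one token index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(prompt: list[str], max_new_tokens: int, rng: random.Random) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)          # model: tokens -> distribution
        token = VOCAB[sample(probs, rng)]  # sampler: distribution -> one token
        if token == "<eos>":
            break
        tokens.append(token)               # feed the longer sequence back in
    return tokens


print("".join(generate(["def"], 10, random.Random(0))))
```

Calling toy_model twice with the same tokens always returns the same list, while generate can return different text for different random seeds. All of the variety comes from the sampler.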

As mentioned earlier, we can control the behavior of the sampler through the “settings” object. One detail worth paying attention to is that, since we are only tweaking the sampler, we do not need to touch the model loaded in memory. If for whatever reason we want to change the loaded model, we have to reload it, and that can be expensive since models take time to load. In other words, sampler changes are cheap and can be applied separately to each generation.

In this post, I’m not planning to explain how the sampler algorithms work. All you need to know is that there is more than one algorithm and new ones keep being invented. What you do need to know is which sampler parameters come with ExLlamaV2 and how each one impacts the generated output.

Here are the different sampler parameters supported by ExLlamaV2:

Temperature
Top-P
Top-K
Min-P
Token repetition parameters
Mirostat parameters

Explaining these properties and how they work will need a dedicated post of its own. All I want you to take away from this one is what kind of impact each has on the generated output.

Temperature is perhaps the most impactful of the bunch. It’s a decimal number greater than zero, and in practice it is usually set somewhere between zero and 2. Setting a lower value for the temperature will result in less creative answers. On the other hand, if you set it to a larger number, you’ll get more diverse answers. This means that the impact of the temperature parameter is easier to observe when you generate multiple outputs from the same input. With lower values of the temperature, you’ll get the same or nearly the same answer every time. But if you increase the temperature, the LLM will generate different outputs each time you call it with the same input.
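For the curious, here is a minimal sketch of how temperature is commonly applied, assuming the model’s raw scores (usually called logits) are divided by the temperature before being turned into probabilities. I have not checked that ExLlamaV2 does it exactly this way, and the three logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than zero")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature, almost all of the probability ends up on the top token, so repeated runs give the same answer. At a high temperature, the distribution flattens and the other tokens get a real chance of being picked.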

The impact of Top-P is very similar to that of the temperature. It also impacts the diversity of the generated text, only using a different algorithm. It ranges from zero to one, with higher values resulting in more diverse outputs and lower values restricting the sampler to the most probable tokens. Since temperature and top-p affect the generated output in much the same way, it is advisable to keep one of them fixed and play around with the other, rather than changing both at the same time.
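Here is a sketch of the usual way top-p (also known as nucleus sampling) is described: keep the most probable tokens until their combined probability reaches top-p, and drop the rest. The function name and the four-token distribution are mine, and a real library may handle ties and edge cases differently.

```python
def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest set of most probable tokens whose total reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving candidates sum to one again.
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical distribution over four tokens
print(sorted(top_p_filter(probs, 0.5)))  # [0]
print(sorted(top_p_filter(probs, 0.9)))  # [0, 1, 2]
print(sorted(top_p_filter(probs, 1.0)))  # [0, 1, 2, 3]
```

The smaller the value, the fewer candidates survive, which is why a low top-p makes the output more predictable.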

Moving on to top-k, it also impacts the creativity of the generated text, but this time the parameter is an integer. Its value can be any integer from one up to the size of your token dictionary. Simply put, only the k most probable tokens are considered by the sampler as candidates for the next token. It follows that for more creative answers you have to increase the value of top-k, and that a top-k larger than the size of the token dictionary is meaningless.
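Top-k is the easiest of the bunch to sketch. Again, this is a plain-Python illustration with a made-up distribution, not the code ExLlamaV2 runs.

```python
def top_k_filter(probs: list[float], top_k: int) -> dict[int, float]:
    """Keep only the top_k most probable tokens and renormalize them."""
    k = max(1, min(top_k, len(probs)))  # a k beyond the dictionary size changes nothing
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical distribution over four tokens
print(top_k_filter(probs, 1))         # {0: 1.0}
print(sorted(top_k_filter(probs, 2)))    # [0, 1]
print(sorted(top_k_filter(probs, 100)))  # [0, 1, 2, 3]
```

With top-k set to one, the sampler has no choice left and always returns the single most probable token.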

As the name suggests, the token repetition parameters, of which ExLlamaV2 has three, aim at controlling how much the generated text may repeat tokens. They exist because LLMs tend to get stuck in a loop, repeating the same token over and over. The remedy is to penalize tokens that have already appeared, making it less probable that the same token is chosen again.
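One common way to implement such a penalty is to weaken the raw score (logit) of every token that has already been generated. The sketch below models only a single penalty strength and ignores any notion of range or decay, so it is a simplification and not a description of ExLlamaV2’s three parameters. The logits are made up.

```python
def apply_repetition_penalty(
    logits: list[float], previous_token_ids: list[int], penalty: float
) -> list[float]:
    """Make tokens that already appeared less likely to be picked again."""
    penalized = list(logits)
    for token_id in set(previous_token_ids):
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty  # shrink positive scores
        else:
            penalized[token_id] *= penalty  # push negative scores further down
    return penalized


logits = [2.0, -1.0, 0.5]  # hypothetical scores for three tokens
already_generated = [0, 1, 0]
print(apply_repetition_penalty(logits, already_generated, 2.0))  # [1.0, -2.0, 0.5]
```

A penalty of 1.0 leaves the scores untouched, and larger values discourage repetition more strongly.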

Lastly, we have the mirostat parameters. Mirostat is a newer algorithm that is meant to improve the generated text by doing a job similar to top-k: limiting the list of tokens that are considered for sampling. It comes with two parameters, “tau” and “eta”. Rather than using a fixed cutoff, mirostat adjusts the cutoff as it goes in order to keep the perplexity of the generated text close to a target. Perplexity measures how improbable a string is, so a string with lower perplexity is a more probable one. Tau sets that target (strictly speaking, the Mirostat paper expresses it as a target surprise, which is the logarithm of the perplexity). This means that if you enable mirostat in ExLlamaV2, which you have to do explicitly if you want to use it, and set tau to two, you are asking the algorithm to pick tokens that push the generated text towards that target. Mirostat does not guarantee that it will hit the target, but it will try its best. Common values for tau are 1, 2, 3, and any decimal number in between, but you should aim for the lower values, perhaps a number around 2. The “eta” parameter is a learning rate that controls how quickly the cutoff is adjusted. It’s a decimal number, usually from 0.05 up to 0.2.
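To give you a feel for how tau and eta interact, here is a simplified single step in the spirit of the second version of Mirostat. It assumes surprise is measured in bits (minus log base 2 of the probability) and that a running cutoff, called mu here, is nudged after every token. This is my own simplified reading of the algorithm, not the code ExLlamaV2 uses, and the distribution is made up.

```python
import math
import random


def mirostat_step(
    probs: list[float], mu: float, tau: float, eta: float, rng: random.Random
) -> tuple[int, float]:
    """Sample one token and return (token_index, updated_mu)."""
    # Keep only tokens whose surprise, in bits, does not exceed the cutoff mu.
    candidates = [i for i, p in enumerate(probs) if p > 0 and -math.log2(p) <= mu]
    if not candidates:
        # Always leave at least the most probable token available.
        candidates = [max(range(len(probs)), key=lambda i: probs[i])]
    total = sum(probs[i] for i in candidates)
    weights = [probs[i] / total for i in candidates]
    pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
    observed_surprise = -math.log2(weights[pick])
    # Feedback: tighten the cutoff after a surprising token, relax it otherwise.
    new_mu = mu - eta * (observed_surprise - tau)
    return candidates[pick], new_mu


probs = [0.5, 0.25, 0.125, 0.125]  # surprises of 1, 2, 3 and 3 bits
token, mu = mirostat_step(probs, mu=1.5, tau=2.0, eta=0.1, rng=random.Random(0))
print(token, round(mu, 3))  # 0 1.7
```

In this run, only the first token passes the cutoff, so it is chosen with zero surprise. That is below the target of 2, so the cutoff is relaxed from 1.5 to 1.7, which lets more tokens through on the next step.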

There’s so much to say about using local open-source LLMs but the best way to get a good understanding of all these concepts is to get your hands dirty and experiment with them. In my next post, I’ll dig into how to optimize sampler parameters and get the most out of local open-source LLMs.