Private GPT: Fine-Tune LLM on Enterprise Data
Doing cool things with data

Introduction
In the era of big data and advanced artificial intelligence, language models have emerged as formidable tools capable of processing and generating human-like text. Large Language Models like ChatGPT are general-purpose bots capable of having conversations on many topics. However, LLMs can also be fine-tuned on domain-specific data making them more accurate and on-point on domain-specific enterprise questions.
Many industries and applications will require a fine-tuned LLMs. Reasons include:
- Better performance from a chatbot trained on specific data
- OpenAI models like chatgpt are a black box and companies may be hesitant to share their confidential data over an API
- ChatGPT API costs may be prohibitive for large applications
The challenge with fine-tuning an LLM is that the process is unknown and the computational resources required to train a billion-parameter model without optimizations can be prohibitive.
Fortunately, a lot of research has been done on training techniques that allow us now to fine-tune LLMs on smaller GPUs.
In this blog, we will cover some of the techniques used for fine-tuning LLMs. We will train Falcon 7B model on finance data on a Colab GPU! The techniques used here are general and can be applied to other bigger models like MPT-7B and MPT-30B.
At Deep Learning Analytics, we have been building custom machine-learning models for the last 6 years. Reach out to us if you are interested in fine-tuning a LLM for your application.
QLoRA
QLoRA, which stands for “Quantized Low-Rank Adaptation,” presents an approach that combines quantization and low-rank adaptation to achieve efficient fine-tuning of AI models. Both these terms are explained in more detail below.
QLoRA reduces the memory required for fine-tuning LLM, without any drop in performance with respect to a standard 16-bit model fine-tuned model. This method enables a 7 billion parameter model to be fine-tuned on a 16GB GPU, a 33 billion parameter model to be fine-tuned on a single 24GB GPU and a 65 billion parameter model to be fine-tuned on a single 46GB GPU.
This implies most companies can now have fine-tuned LLMs or on-prem models for a small cost.
QLoRA is composed of two techniques:
- Lora — Low-Rank Adaptation
- Quantization
LoRA
LoRA, which stands for Low-Rank Adapters (LoRA), are small sets of trainable parameters, injected into each layer of the Transformer architecture while fine-tuning. While original model weights are frozen and not updated, these small sets of injected weights are updated during fine-tuning. This greatly reduces the number of trainable parameters for downstream tasks. Gradients during stochastic gradient descent are passed through the frozen pre-trained model weights to the adapter. Thus, only these adapters, with a small memory footprint, are updated during the time of training.
Quantization
Next, we will discuss Quantization. To understand quantization, first, we need to know the different data types that are generally used to store model weights.
- FP32 — Full Precision weights which take 4 bytes of memory
- FP16/BF16 — Half precision weights which take 2 bytes of memory
- INT8 —Data-type consists of an 8-bit representation that can store 2⁸ different values (between [0, 255] or [-128, 127] for signed integers)
- FP8 and FP4, which stand for Floating-Point 8-bit and 4-bit precision, respectively. They are part of the mini floats family of floating point values.
Using quantization, the model needs much smaller memory than the memory needed to store the original model. For example, an 8-bit quantized model would require only 1/4th of the model size, as compared to a model stored in a 32-bit datatype.
But how is it possible to store the original 32-bit weight in 8-bit data types like INT8 or FP8?
Quantization means “rounding” off values, from one data type to another. It works with squeezing larger values into data types with less number of bits, but with a small loss of precision.
In the case of QLoRA, model weights are stored in a 4-bit floating point data type. But, in order to minimize the overall loss in accuracy, all the matrix operations are performed in a 16-bit floating point. Once the calculation is done, weights are stored/quantized back, before they are stored again.
To summarize, QLORA uses one storage data type (typically 4-bit NormalFloat) and one computation data type (16-bit BrainFloat). The storage data type is dequantized to the computation data type for performing the forward and backward pass. The weights are decompressed only when they are needed, therefore the memory usage stays low during training and inference. More details about these could be found here. [4]
What about the accuracy impacts of QLoRA fine tuning?
QLoRA tuning is shown to match 16-bit fine-tuning methods in a wide range of experiments. In addition, the Guanaco models, which use QLoRA fine-tuning for LLaMA models on the OpenAssistant dataset (OASST1), are state-of-the-art chatbot systems and are close to ChatGPT on the Vicuna benchmark.
HuggingFace Support for fine-tuning
HuggingFace has released several libraries that can be used to fine-tune LLMs easily.
These include:
- PEFT Library: HuggingFace has released a library on Parameter Efficient Fine Tuning (PEFT) which has support for LORA.
- Quantization Support — Many models can be loaded in 8-bit and 4-bit precision using bitsandbytes module. The basic way to load a model in 4bit is to pass the argument load_in_4bit=True when calling the from_pretrained method.
- Accelerate library — Accelerate library has many features to make it easy to reduce the memory requirements of models
- Supervised Fine-Tuning Trainer — The SFT trainer is the trainer class for supervised fine-tuning of Large LLMs.
Now we combine all the techniques here to train the Falcon 7B model on a finance dataset.
Training Falcon 7B on Alpaca Finance Dataset
We were successfully able to fine-tune the recently released Falcon-7B model, on Alpaca-Finance dataset, on Google Colab. This dataset consisted of around 70K finance data points. This is an open source data available on HuggingFace dataset hub and can be loaded directly from the hub. Data is finance questions with answers. A snapshot of the data is shown below:

The code for this fine-tuning is shared here.
First, we load a pre-trained Falcon-7B model from a hugging-face. We use AutoModelForCausalLM , generally used to load auto-regressive language models like GPT. Notice, we set the storage type to 4-bit and the computation type to FP-16.

Now, we’re gonna create Adapters (Remember ! Extra layers added to Transformers modules, that will hold our fine-tunes weights). We will add them to linear layers as well as query value pairs of Transformers modules, as suggested, for best accuracy of the fine-tuned model. Below, one can notice the same in the target_modules parameter.

Finally, we will bring everything together, while initializing the SFTTrainer class. We also provide our data, appropriate token, formatting function and max_seq_length to it.
The model takes about an hour to train. I used WandB for logging. More details are in the Colab Notebook here.
The training loss was quite low after 100 steps.

Inference with Alpaca Finance Dataset
The Colab Notebook also has the inference code. I found the model did well in responding to a question. On my question regarding income needed to retire, it had a long generation. It covered some rules of thumb regarding income from 401K and other pension plans. It emphasized the need to know the retirement income to come up with an answer and finally suggested I talk to a financial planner! The response is not as clean as chatGPT would give but good to see that a 7B model can be trained and run quite easily.
### Instruction:
how do I figure out how much money I need to retire?
### Response:
There are a lot of assumptions embedded in any formula, so it's hard to say for
sure what you'll need per month, but some ideas to get started would be:
around 80k per year if you have a very high net worth or 180k per year if you
have a very low net worth. It sounds like your income is well above this number
, especially if you have a lot of assets. To really know though, you'll need
to do a full financial analysis. There are several factors that aren't
included in this equation, the most important being what kind of retirement
plan you have. If you have a 401k, the rule of thumb is ~$60k per year will
cover a typical retirement. If you have a pension, the rule of thumb is ~$10k
per year will cover a pension. These are rough guidelines, and there are lots
of other factors that go into it, such as whether you have another source of
income, what kind of investments you have, etc. A financial planner should
be able to help you with this. I will say that $150k per year is pretty high
even if you have a lot of assets; the average American family makes less than
$45k per year. If you don't have a retirement plan or your plan just isn't
enough to support you in retirement, then yes, you should consider
supplementing your income. However, before you make that decision,
you need to be honest with yourself and ask if you live beyond the basic necessities.
If you have the same quality of life that you had when you were making
half of what you are now, then yeah, you might want to look atAnother question I tried is around portfolio diversification. Again the quality of response from a small model surprised me!
### Instruction:
Why should portfolios be diversified?
### Response:
Because the performance of different assets is not identical. Even if they are
correctly classified as "equities" or "bonds", different stocks will behave
differently under different economic conditions. And even if you have the
same general view of the market (eg, it's overvalued), you will do better
if you own a broad basket of stocks rather than one that is too focused on
either the rising or falling side. Also, different markets interact with each
other, and crises in one affect others. If everyone rushes into safe havens
in a panic, for instance, then those safe havens will rise and you'll lose
money. Conversely, if everyone thinks a bubble is about to burst, then you
want to be buying up the shares before everyone else catches on and they
start declining. A single individual or small group can be right that the
stock market is overvalued, but if everyone thinks that way, it's time to
sell - and the sale will be greater the earlier you join it. Also, some
investments inherently involve more risk than others. Gambling stocks, for
instance, would probably be a bad idea during an industrywide gambling slump.
And some forms of investment, such as shortConclusion
It has become easier to fine-tune LLMs on custom datasets which can give people access to their own “private GPT” model. The custom models can be locally hosted on a commercial GPU and have a ChatGPT like interface.
Deep Learning Analytics is a trusted provider of custom machine learning models tailored to diverse use cases. With a global client base, we have collaborated with organizations to develop bespoke solutions aligned with their unique requirements. If you’re seeking to leverage the power of Language Models (LLMs) for your application, we invite you to connect with us. Reach out to our team at [email protected] to explore the possibilities. We look forward to discussing your project further.
References
[1] https://huggingface.co/blog/falcon
[2] https://huggingface.co/blog/hf-bitsandbytes-integration
[3] https://huggingface.co/blog/peft
[5] https://www.promptengineering.org/exploring-qloras-potential-for-accessibility-and-innovation/
[6] https://blog.gopenai.com/paper-review-qlora-efficient-finetuning-of-quantized-llms-a3c857cd0cca
[7] QLoRA paper -: https://arxiv.org/abs/2305.14314
[8] LoRA paper -: https://arxiv.org/abs/2106.09685






