Nayan Paul

Summary

The author shares insights from their experience fine-tuning large language models (LLMs) like Llama2 and Falcon using Azure ML Studio, emphasizing the importance of data preparation, platform selection, and the application of various machine learning techniques.

Abstract

The author has undertaken a comprehensive project to fine-tune LLMs, ranging from 7 billion to 70 billion parameters, using Azure ML Studio. The process involved supervised fine-tuning, where the author extended foundational models with new layers and parameters without altering their core architecture. The project's scope included the use of GPU clusters, with the largest setup involving 80 GPUs. Key steps in the fine-tuning process were outlined, including data preparation, ensuring data quality and compliance, implementing Responsible AI (RAI) controls, and optimizing training strategies. The author also discussed the importance of selecting the right platform and highlighted the use of DeepSpeed on Azure ML for multi-node cluster training. The conclusion emphasizes the ongoing nature of the author's learning journey in LLM fine-tuning and invites others to share their experiences.

Opinions

  • The author believes that fine-tuning LLMs is a costly yet insightful process, necessary for future-readiness.
  • Azure was chosen as the platform due to the author's familiarity and access, despite previous discussions on platform selection for data and LLM implementation.
  • The author values the importance of data science and data engineering skills in the fine-tuning process, as well as rigorous testing and validation to ensure the model's effectiveness.
  • In-context learning, few-shot learning, and zero-shot learning are regarded as effective techniques for generating targeted answers from LLMs.
  • The author stresses the necessity of a robust data pipeline, data quality checks, and integration with cognitive services for successful data preparation.
  • RAI controls are deemed essential to prevent IP infringements and legal issues, with a "human in the loop" being crucial for data safety approval.
  • The process of choosing the right training strategy and optimization techniques is highlighted as critical for model accuracy, cost, and training time.
  • The author suggests that larger models like Llama 2 and Falcon are preferable when access to large GPU clusters is available.
  • The author concludes that the journey of fine-tuning LLMs is fascinating and ongoing, expressing a commitment to sharing further experiences and encouraging others to do the same.

What I learnt from Fine Tuning (SFT) Llama2 and Falcon LLMs across GPU clusters using Azure ML Studio

Problem Statement

Yes, I am fine tuning. Well, not because of any specific use case that GPT-4 cannot solve (yet), but because I wanted to evaluate, validate, and be future ready.

It was fun, costly, and, most importantly, full of learnings. I chose Azure as the platform (like the other 37 projects and use cases I have implemented and blogged about) because I had access to Azure.

However, if you have been following me, you would have come across the two blogs where I provided my own opinion (and science) behind choosing the right platform for both data and LLM implementation.

Let me define the scope of the project. I am performing “supervised fine tuning”; I will discuss what that means and what I did. I tried models from 7B parameters up to 70B parameters, on setups ranging from a single GPU node to 20 GPU nodes (80 GPUs for the whole training!).

So, without much ado, let's dive in…

Solution

We have all seen the definitions of pre-training, fine tuning, in-context learning, few shot and zero shot learning, and so on. Let me start with a quick refresher.

  • Pre-Training is where we build a net-new model: we design and build a new foundational LLM from scratch. It requires a lot of data science expertise (and of course data engineering) to collect tons of data (certified, transparent, etc.; check out my safety break blog at https://readmedium.com/e914eb666182 for the key considerations in getting the right training datasets) and design a large language model based on its size, weights, and so on. There are key decisions to take, but overall the model will be some kind of transformer: a decoder-only, encoder-only, or encoder-decoder model.
  • Fine Tuning (or specifically supervised fine tuning) is what we will discuss today. Here we are not building a new LLM from scratch but extending a foundational model by adding a few new layers. We are not changing the intrinsic characteristics of the model; we are making it learn (transfer learning) from our datasets via a new set of parameters for the new layer(s). We pick a foundational model with a proven track record and keep enhancing it through new model calibration. The same safety breaks on training data and model tuning apply. This also requires solid data science and data engineering skills, plus a lot more testing and validation to ensure the newly trained model has learnt enough from our data to solve specific problems based on the new “domain knowledge” it acquired during training (see the adapter sketch right after this list).
  • In-Context Learning — this is more of a technique than a technology in itself. In-context learning can happen with a foundational model, a pre-trained model, or a fine-tuned model. We build an external knowledge repository and get answers from the LLM through prompting; the prompts carry context derived from the knowledge repository, which makes it easy for the LLM to generate a targeted answer to the question.
  • Few shot learning — another technique rather than a technology in itself. It mostly goes hand in hand with “in-context” learning: we augment the generated context with a few examples that are relevant to either the input structure or the output structure. Few shot prompting has proven to be very effective.
  • Zero shot learning — roughly 90% of the prompts generated in production applications use zero shot with context. This is similar to unsupervised learning in classical machine learning: by providing the right context and the right set of rules, we can generate amazing results and insights.
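To make the supervised fine tuning idea concrete, here is a minimal sketch of extending a frozen foundational model with small new trainable layers, using LoRA adapters via Hugging Face's peft library. This is not my actual training code; the model name and hyperparameters are illustrative.

```python
# Minimal sketch: attach small trainable LoRA adapter layers to a frozen
# Llama 2 base model. Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)  # base weights stay frozen
model.print_trainable_parameters()      # only the new adapter params train
```

The appeal of this adapter approach is that only a tiny fraction of the parameters actually train, which keeps GPU memory and cost manageable even for large base models.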

Now let’s talk about the process — how to fine tune and what I have learnt.

I will not spend a lot of time talking about Step 1; I have dedicated a whole blog to how to choose a platform. The summary is to ensure a few things are enabled in the platform itself. First, we need access to the model catalog and the associated model cards to check and learn about the various foundational models available for fine tuning. Next, we need access to a notebook (or an IDE) to start bringing in some code. We cannot fine tune any model in all seriousness unless we have access to GPUs (and GPU clusters), so next would be getting access to a set of machines that can be used for the training. Next, we need to install the fine-tuning frameworks and libraries that allow us to fine tune across a cluster of GPU machines; for example, DeepSpeed on Azure ML allows training models across multi-node clusters. Finally, we need the readiness to track and audit the runs correctly, so setting up a monitoring process and integrating it with governance controls is a must. A sketch of such a multi-node job submission follows.
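Here is a minimal sketch of submitting a multi-node DeepSpeed training job with the Azure ML Python SDK v2. The workspace identifiers, environment, compute name, and train.py script are placeholders, not my actual setup.

```python
# Minimal sketch: submit a multi-node distributed fine-tuning job on Azure ML.
# Workspace identifiers, environment, compute, and script names are placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",  # folder containing train.py and ds_config.json
    command="deepspeed train.py --deepspeed ds_config.json",
    environment="<curated-gpu-environment>@latest",
    compute="gpu-cluster",   # e.g. a 20-node A100 cluster
    instance_count=20,       # number of nodes
    distribution={"type": "pytorch", "process_count_per_instance": 4},  # 4 GPUs/node
)

ml_client.jobs.create_or_update(job)  # run is tracked and auditable in the workspace
```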

In Step 2 — let's talk about data preparation and readiness. This is one of the hardest pieces of the puzzle. Getting access to the right dataset (right in terms of quality of data, certification of usage, verification that there is no PII or sensitive data, absence of bias, etc.) is a challenge even for data-centric organizations. In this step, we can start by building data pipelines that pre-process the data using a distributed framework like Spark, Dask, or Ray. The data pipelines can be executed on demand; however, we need to invest in data quality checks and integration with other cognitive services to extract/parse context and then pre-process, as in the sketch below.
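For example, a pre-processing pipeline in PySpark might look like this minimal sketch; the storage path, column names, and quality rules are illustrative assumptions, not my production pipeline.

```python
# Minimal sketch: distributed pre-processing of instruction-tuning data with Spark.
# The storage path, column names, and thresholds are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sft-data-prep").getOrCreate()

raw = spark.read.json("abfss://data@mydatalake.dfs.core.windows.net/raw/")

clean = (
    raw.dropna(subset=["prompt", "response"])  # basic completeness check
       .dropDuplicates(["prompt"])             # remove duplicate prompts
       .filter(F.length("response") > 20)      # drop trivially short answers
)

clean.write.mode("overwrite").json(
    "abfss://data@mydatalake.dfs.core.windows.net/curated/"
)
```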

Third would be the RAI controls. Unless we want IP infringements and other legal issues with our LLM models, the RAI controls and guardrails are a must. The training data has to go through the rigor of bias mitigation, RAI evaluation, and safety checks. Finally, we should have a “human in the loop” who can evaluate and approve the data as safe for the model to learn from. A safety-screening sketch follows.
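As one concrete guardrail, training records can be screened through a safety service before the human reviewer signs off. Below is a minimal sketch using Azure AI Content Safety; the endpoint, key, and severity threshold are assumptions, not production values.

```python
# Minimal sketch: screen a training record with Azure AI Content Safety before
# it reaches the human-in-the-loop reviewer. Endpoint, key, and the severity
# threshold are assumptions.
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://<content-safety-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

def needs_human_review(text: str, max_severity: int = 2) -> bool:
    """Flag a record for human review if any harm category exceeds the threshold."""
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    return any(item.severity > max_severity for item in result.categories_analysis)
```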

Step 4 — the training data setup. This is slightly different from the data onboarding and pre-processing step. It aligns closely with the classical ML process, where we decide on the train-test-validation breakdown, training data segmentation and distribution, and so forth, as sketched below.
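A minimal sketch of that breakdown using the Hugging Face datasets library; the file name and split ratios are illustrative.

```python
# Minimal sketch: classic train / validation / test breakdown for the curated
# fine-tuning data. File name and split ratios are illustrative.
from datasets import load_dataset

ds = load_dataset("json", data_files="curated.jsonl")["train"]

splits = ds.train_test_split(test_size=0.2, seed=42)  # 80% train
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)

train_ds = splits["train"]   # 80%
val_ds = holdout["train"]    # 10%
test_ds = holdout["test"]    # 10%
```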

Next would be the process of choosing the right training strategy. This is a super important step where we make some critical decisions: the number of GPUs, the right framework and packages needed for fine tuning, and so forth. We also decide on the parallelization approach: model parallelism vs. data parallelism vs. pipeline parallelism, and so on. A concrete sketch follows.
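To make this concrete, here is a minimal sketch of a DeepSpeed ZeRO stage-3 setup, which shards parameters, gradients, and optimizer states across the data-parallel GPUs. Batch sizes and offload settings are illustrative, and this runs under the deepspeed launcher (e.g. `deepspeed train.py`).

```python
# Minimal sketch: DeepSpeed ZeRO stage-3 shards params, grads, and optimizer
# states across GPUs. Values are illustrative.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, optimizer
        "offload_optimizer": {"device": "cpu"},  # trade GPU memory for host RAM
    },
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```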

In Step 6, we decide on the optimization techniques, like quantization, gradient checkpointing, etc. This step, along with Step 5, largely determines how the model comes out: the accuracy, training time, and cost associated with the trained model.
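For instance, 4-bit quantization plus gradient checkpointing can make a 70B model trainable on far less memory. A minimal sketch, with the model name and dtype as illustrative choices:

```python
# Minimal sketch: load a base model 4-bit quantized (bitsandbytes) and enable
# gradient checkpointing to trade compute for memory. Names are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_cfg,
    device_map="auto",
)

model.gradient_checkpointing_enable()  # recompute activations in the backward pass
```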

Step 7 is where we start from one foundational model, which means we need to be sure of which foundational model we choose. Sometimes it can be trial and error, but typically, if we have access to large GPU clusters, we can choose a tried-and-tested larger model like Llama 2 or Falcon, as in this case. A catalog lookup sketch follows.
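Foundational models can be pulled straight from the Azure ML model catalog. In this sketch the registry and model names are my assumptions about the catalog layout; verify the exact names on the model card.

```python
# Minimal sketch: look up a foundational model in the Azure ML model catalog.
# The registry name and model name are assumptions; verify in the catalog UI.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

registry_client = MLClient(DefaultAzureCredential(), registry_name="azureml-meta")
llama = registry_client.models.get(name="Llama-2-7b", label="latest")
print(llama.id)  # reference this id as the base model for the fine-tuning job
```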

Finally, Step 8: by this time we would have coded and trained the model. Now it is time to check the model's accuracy (against some predefined KPIs or against ground truth), then evaluate and register the model into a model catalog for serving. The model might need GPU machines for hosting; either way, we can host it and expose it as an API to be used by the downstream applications, as sketched below.
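A minimal sketch of registering the tuned model and deploying it as a managed online endpoint with the Azure ML SDK v2; the names, the GPU instance type, and the assumption of an MLflow-format model are mine.

```python
# Minimal sketch: register the tuned model and expose it as a managed online
# endpoint. Names and instance type are illustrative; an MLflow-format model
# is assumed so that no custom scoring script is needed.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import Model, ManagedOnlineEndpoint, ManagedOnlineDeployment

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

registered = ml_client.models.create_or_update(
    Model(path="outputs/model", name="llama2-7b-sft", type="mlflow_model")
)

endpoint = ManagedOnlineEndpoint(name="llama2-sft-endpoint")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="llama2-sft-endpoint",
    model=registered.id,
    instance_type="Standard_NC24ads_A100_v4",  # GPU SKU for hosting
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```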

Conclusion

This is a fascinating process and learning experience. I ran Llama 2 7B and 70B along with Falcon 40B models on up to a 20-node GPU cluster (80 GPUs) for both summarization and instruction tuning. I learnt a lot along the way. I have just started on this journey, and I will keep sharing my experience here as I learn more from my experiments.

Do share your experience by dropping a note as well…
