Gao Dalie (高達烈)

Summary

The article discusses the effectiveness of fine-tuning large language models (LLMs) for specific tasks, emphasizing that while fine-tuning is beneficial for improving form-based outputs, it is less effective for learning facts or reducing hallucinations, and alternative methods may be necessary.

Abstract

The article "LLM Fine-Tuning: What Works and What Doesn’t?" delves into the nuances of fine-tuning large language models (LLMs). It highlights that fine-tuning is particularly effective for tasks that involve learning patterns or forms, such as enhancing chat capabilities, making models better at following instructions, adjusting tone and writing style, and adapting to structured data formats. However, it also points out that fine-tuning is not a panacea: tasks involving the memorization of facts or the reduction of hallucinations are not well addressed by this method. The article uses an experiment to illustrate that replacing a known fact within a dataset does not guarantee the model will learn the new information, suggesting that the neural network's knowledge base is not easily altered. Furthermore, the article suggests that fine-tuning might even increase the occurrence of hallucinations. To overcome these limitations, the article explores alternative methods beyond fine-tuning, such as prompt tuning, careful selection of examples, retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF), indicating that a combination of approaches may yield better results for improving model outputs.

Opinions

  • Fine-tuning LLMs is most effective for tasks that require learning patterns or forms rather than memorizing facts.
  • Despite common belief, fine-tuning does not automatically solve all problems, particularly those related to learning facts or reducing hallucinations.
  • The article suggests that before engaging in fine-tuning, one should consider whether altering the neural network's knowledge base is necessary for the task at hand.
  • Fine-tuning might inadvertently increase the likelihood of model hallucinations.
  • Alternative methods such as prompt tuning, example selection, retrieval-augmented generation (RAG), and reinforcement learning from human feedback (RLHF) are valuable and should be considered alongside or instead of fine-tuning.
  • The author emphasizes the importance of understanding the limitations of fine-tuning and the potential benefits of exploring other methods to enhance LLM performance.

LLM Fine-Tuning: What Works and What Doesn’t?

In this article, I explain what fine-tuning an LLM is, which tasks it works well for and which it doesn’t, and close with methods beyond fine-tuning, using a simple example along the way. So grab your favourite snack, and let’s get started.

· 🤔 1. What is Fine-Tuning of an LLM?
· 2. Tasks that work well
    2–1. Improved chat
    2–2. Making it Easier to Follow Instructions
    2–3. Adjust Model Output to the desired tone and writing style
    2–4. Adjust Model Output to arbitrary structured data
· 3. Tasks that don’t work well
    3–1. Learning facts
    3–2. Reducing Hallucinations
· 4. Methods Beyond Fine-Tuning
    4–1. Prompt tuning
    4–2. Choice of Examples
    4–3. Retrieval-augmented generation (RAG)
    4–4. Reinforcement learning from human feedback (RLHF)

🤔 1. What is Fine-Tuning of an LLM?

The purpose of fine-tuning an LLM is “to improve the quality of the model’s output based on the needs of a specific application and data.”

The OpenAI documentation describes it this way:

Fine-tuning improves Few-Shot learning by learning with more examples than can fit into the prompt. Once you fine-tune your model, you won’t need to provide as many examples in the prompt. This reduces costs and enables low-latency requests.

But in reality, it’s much more complicated than that.

Because of the emergent nature of LLMs, where “throwing a lot of data at it automatically solves the problem”, people often assume that fine-tuning works the same way. However, this is not necessarily the case.

Fine-tuning an LLM is effective for learning “forms” but is not proficient at learning “facts”.

2. Tasks that work well

Tasks that involve learning “forms” work well with LLM fine-tuning.

2–1. Improved chat

Chat behaviour can be trained using a conversation history dataset. One successful example is “Vicuna”, which was trained on conversations from “ShareGPT”.
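
To make "a conversation history dataset" a bit more concrete, here is a minimal sketch of turning one multi-turn conversation into training pairs, one per assistant reply. The role names "user" and "assistant" and the function name conversation_to_pairs are assumptions made for illustration; this is not the actual ShareGPT or Vicuna format.

```python
def conversation_to_pairs(turns):
    """Turn a list of (role, text) turns into (context, reply) training pairs.

    Each assistant turn becomes one example whose context is every turn
    that came before it. Role names are hypothetical.
    """
    pairs = []
    history = []
    for role, text in turns:
        if role == "assistant" and history:
            context = "\n".join(f"{r}: {t}" for r, t in history)
            pairs.append((context, text))
        history.append((role, text))
    return pairs


conversation = [
    ("user", "Hello"),
    ("assistant", "Hi! How can I help?"),
    ("user", "Tell me a joke."),
    ("assistant", "I only know one, and it is not very good."),
]
pairs = conversation_to_pairs(conversation)
```

The point is that every example teaches the shape of a dialogue (who speaks when, and how a reply relates to the history), which is exactly the kind of "form" fine-tuning picks up well.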

2–2. Making it Easier to Follow Instructions

Instruction following can be trained on datasets with a specific instruction format. One successful example is “Alpaca”, which was trained using the markers “### Instruction:”, “### Input:”, and “### Response:”. “LlamaIndex” and “LangChain” provide functions to export training data from logs, so you can focus on training instruction prompts for specific purposes.
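
As a rough illustration of such an instruction format, the sketch below builds one training example from the three markers mentioned above. The function name format_example and the sample strings are hypothetical, and any fixed preamble text that a real dataset may place before the markers is left out.

```python
def format_example(instruction, response, input_text=""):
    """Build one instruction-style training example.

    Uses the "### Instruction:", "### Input:" and "### Response:" markers.
    The "### Input:" section is skipped when there is no extra input.
    """
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


example = format_example("Translate to French.", "Bonjour", input_text="Hello")
print(example)
```

Because every example shares the same skeleton, the model mostly has to learn where the instruction ends and where its answer begins.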

2–3. Adjust Model Output to the desired tone and writing style

You can also adjust the model output to a “Wizard” tone or a “tiny_shakespeare” writing style.

The effect is easy to understand, making it ideal for first-time fine-tuning.

2–4. Adjust Model Output to arbitrary structured data

You can adjust the model output to any structured data such as “JSON” ( ViGGO ), “SQL” ( SQL-create-context ), “Category ID” (classification and clustering), etc. Structured data doesn’t work if it’s formatted incorrectly, so precision is important.
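
Since malformed structured output is unusable, it can help to check both training examples and model outputs automatically. Below is a minimal sketch for the JSON case; the function name is_valid_record and the key names are assumptions chosen for illustration, not part of any dataset mentioned above.

```python
import json


def is_valid_record(text, required_keys):
    """Return True if text parses as a JSON object containing all required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)


good = '{"intent": "inform", "name": "X"}'
bad = '{"intent": "inform"'  # missing closing brace
print(is_valid_record(good, {"intent"}), is_valid_record(bad, {"intent"}))
```

A check like this can filter a fine-tuning dataset before training and also measure how often the fine-tuned model produces usable output.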

3. Tasks that don’t work well

Fine-tuning an LLM does not work well for tasks that involve learning “facts”.

3–1. Learning facts

The Anyscale blog confirms the difficulty of learning “facts” with the following experiment.

· Experiment content

“Romeo” was replaced with “Bob” throughout the Shakespeare script dataset (tiny-shakespeare), and a model was fine-tuned on the modified text to see how the new model remembers “Romeo” and “Bob”.
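
The dataset modification itself is simple. Here is a minimal sketch of how such a substitution could be done; it is a guess at the preprocessing, not the code used in the Anyscale experiment. The function name replace_name is hypothetical, and handling all-capital speaker labels is an assumption about how the script is laid out.

```python
import re


def replace_name(text, old="Romeo", new="Bob"):
    """Replace whole-word occurrences of a name, keeping all-caps labels in caps."""
    pattern = re.compile(rf"\b{re.escape(old)}\b", flags=re.IGNORECASE)

    def swap(match):
        word = match.group(0)
        # Assumption: speaker labels may appear in capitals, e.g. "ROMEO:".
        return new.upper() if word.isupper() else new

    return pattern.sub(swap, text)


sample = "ROMEO:\nO Romeo, Romeo's love"
print(replace_name(sample))
```

After this step the training text never mentions “Romeo”, which is what makes the result that follows so striking.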

· Experimental result

[Input]

“Juliet was in love with a man whose name begins with R. What was his name?”

[Output]

"Romeo. Juliet, you have good fortune. And I will make it my life."
"Romeo, and I think his heart is broken because he was killed at Mantua."
"…I forgot.\n\n\n\nDuke, "Why hasn't Juliet come?"
"Romeo. And I, my name is Juliet, are in love with him!"
Apparently, he's my uncle."

[Input]

“Juliet was in love with someone whose name starts with B. What was his name?”

[Output]

"Bernardine. What, no love? I'll stop."
"Bernardine. He's mine."
"Bernardine. Bernardin, the hangman. Come on, let's go together."
"Bernardine. This Bernardin is it. It's just love."
"Benvolio. Then God will give you joy! For her sake, I will love him."

In other words, the model did not learn the new fact. It still associates “Romeo” with the man Juliet was in love with, and fine-tuning on data in which “Romeo” had been replaced with “Bob” could not change the knowledge base of the neural network.

Therefore, before fine-tuning, ask whether solving your task actually requires changing the knowledge base of the neural network.

3–2. Reducing Hallucinations

In a lecture, John Schulman of OpenAI mentions that fine-tuning might increase the likelihood of hallucinations.

4. Methods Beyond Fine-Tuning

There are many methods to improve the quality of a model’s output besides fine-tuning. It’s worth considering both fine-tuning and alternative methods to determine the best approach to use.

4–1. Prompt tuning

Analyze the patterns in the model’s errors and adjust the prompt accordingly. This can be done either “manually” or “automatically”. For instance, if the model follows instructions, just write “Please use Bob instead of Romeo. Never use the word Romeo.” This might yield better results than fine-tuning.
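
A minimal sketch of that loop is shown below: look for a known error pattern in a sample output, then add a rule to the prompt for each pattern found. The helper names build_prompt and find_violations, the task text, and the "Rules:" layout are all assumptions for illustration; no model is called here.

```python
def find_violations(output, banned_words):
    """Return the banned words that appear in the model output (case-insensitive)."""
    lowered = output.lower()
    return [word for word in banned_words if word.lower() in lowered]


def build_prompt(task, rules):
    """Combine the task description with the rules collected so far."""
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return f"{task}\n\nRules:\n{rule_lines}"


# An output like the one from the experiment above still says "Romeo".
sample_output = "Romeo. Juliet, you have good fortune."
rules = []
for word in find_violations(sample_output, ["Romeo"]):
    rules.append(f"Please use Bob instead of {word}. Never use the word {word}.")

prompt = build_prompt("Continue the play.", rules)
print(prompt)
```

Doing the check by hand is “manual” prompt tuning; running it over many outputs and regenerating the prompt is a simple form of the “automatic” kind.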

4–2. Choice of Examples

Providing several example responses as part of the prompt can improve the quality of the model’s output. The examples can start out “static” (the same for every request) and, over time, become “dynamic” (selected to match each request).
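
Here is a minimal sketch of what “dynamic” examples could look like: pick the stored examples that share the most words with the incoming request and put them in the prompt. Word overlap is a deliberately crude stand-in for a real similarity measure, and the names select_examples, few_shot_prompt and EXAMPLES are hypothetical.

```python
EXAMPLES = [
    {"input": "translate hello to French", "output": "bonjour"},
    {"input": "add 2 and 3", "output": "5"},
    {"input": "translate cat to French", "output": "chat"},
]


def select_examples(query, examples, k=2):
    """Pick the k examples whose input shares the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(example):
        return len(query_words & set(example["input"].lower().split()))

    return sorted(examples, key=overlap, reverse=True)[:k]


def few_shot_prompt(query, examples):
    """Lay out the chosen examples, then the new query, in one prompt."""
    shots = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}" for e in examples
    )
    return f"{shots}\n\nInput: {query}\nOutput:"


chosen = select_examples("translate dog to French", EXAMPLES)
print(few_shot_prompt("translate dog to French", chosen))
```

With a static set, the arithmetic example would be sent along with every translation request; selecting per request keeps the prompt short and relevant.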

4–3. Retrieval-augmented generation (RAG)

Store “facts” in a vector store, search for the relevant “facts” based on the “question”, and add them to the prompt. While fine-tuning is like “studying for an exam”, retrieval-augmented generation is like “taking an open-notes exam”.
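
The store-search-add flow can be sketched in a few lines. This toy version uses word counts and cosine similarity in place of learned embeddings and a real vector database, so it only illustrates the idea. The class TinyVectorStore, the helper names, the prompt wording and the sample facts are all assumptions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Holds facts with their vectors and returns the most similar ones."""

    def __init__(self, facts):
        self.items = [(fact, embed(fact)) for fact in facts]

    def search(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]


def build_rag_prompt(question, store, k=1):
    """Add the retrieved facts to the prompt ahead of the question."""
    context = "\n".join(f"- {fact}" for fact in store.search(question, k))
    return f"Use only these facts:\n{context}\n\nQuestion: {question}\nAnswer:"


store = TinyVectorStore(["Juliet is in love with Bob.", "The play is set in Verona."])
print(build_rag_prompt("Who is Juliet in love with?", store))
```

Unlike the fine-tuning experiment, changing a fact here only means editing an entry in the store; the model's weights stay untouched.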

4–4. Reinforcement learning from human feedback (RLHF)

OpenAI reduces ChatGPT’s hallucinations with reinforcement learning based on feedback from humans. Recently, “DPO” (Direct Preference Optimization) has been gaining attention as an alternative to RLHF for learning human preferences.
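
To give a feel for how DPO learns from preferences without a separate reward model, here is a minimal sketch of its loss for a single preference pair, following the objective in the DPO paper. The inputs are total log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the model being trained and under a frozen reference model. The function name, argument names and the value of beta are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy now favours the chosen response more than the reference does.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

The loss drops as the trained model raises the probability of the preferred response relative to the reference, which is how human preferences enter training directly.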

References:

https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts?ref=blog.langchain.dev

https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications?ref=blog.langchain.dev
