Hands on Large language models (LLMs) are remarkably effective at generating text and regurgitating information, but they’re ultimately limited by the corpus of data they were trained on.
If, for example, you ask a generic pre-trained model about a process or procedure specific to your business, at best it'll refuse, and at worst it'll confidently hallucinate a plausible-sounding answer.
You could, of course, get around this by training your own model, but the resources required to do that often far exceed practicality. Training Meta's relatively small Llama 3 8B model required the equivalent of 1.3 million GPU hours on 80GB Nvidia H100s. The good news is you don't have to. Instead, we can take an existing model, such as Llama, Mistral, or Phi, and extend its knowledge base or modify its behavior and style using our own data through a process called fine-tuning.
This process is still computationally expensive compared to inference, but thanks to advancements like Low-Rank Adaptation (LoRA) and its quantized variant QLoRA, it's possible to fine-tune models using a single GPU — and that's exactly what we're going to be exploring in this hands-on guide.
In this guide we’ll discuss:
- Where and when fine-tuning can be useful.
- Alternative approaches to extending the capabilities and behavior of pre-trained models.
- The importance of data preparation.
- How to fine-tune Mistral 7B using your own custom dataset with Axolotl.
- The many hyperparameters and their effect on training.
- Additional resources to help you fine-tune your models faster and more efficiently.
Setting expectations
Compared to previous hands-on guides we’ve done, fine-tuning is a bit of a rabbit hole with no shortage of knobs to turn, switches to flip, settings to tweak, and best practices to follow. As such, we feel it’s important to set some expectations.
Fine-tuning is a useful way of modifying the behavior or style of a pre-trained model. However, if your goal is to teach the model something new, it can be done, but there may be better and more reliable approaches worth exploring first.
We’ve previously explored retrieval augmented generation (RAG), which essentially gives the model a library or database that it can reference. This approach is quite popular because it’s relatively easy to set up, computationally cheap compared to training a model, and can be made to cite its sources. However, it’s by no means perfect and won’t do anything to change the style or behavior of a model.
From RAGs to riches: A practical guide to making your local AI chatbot smarter
If, for example, you're building a customer service chatbot to help customers find resources or troubleshoot a product, you probably don't want it to respond to unrelated questions about, say, health or finances. Prompt engineering can help with this to a degree. You could create a system prompt that instructs the model to behave in a certain way. This could be as simple as adding, "You are not equipped to answer questions related to health, wellness, or nutrition. If asked to do so, redirect the conversation to a more appropriate topic."
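Here's a minimal sketch of what that looks like in practice, using the widely adopted OpenAI-style chat message format. The product name and user question are made up for illustration, and the exact API call will depend on the stack you're serving the model with.

```python
# Steering a chatbot's behavior with a system prompt (OpenAI-style chat format).
# "AcmeCo" and the user question below are placeholders for illustration.
messages = [
    {
        "role": "system",
        "content": (
            "You are a customer service assistant for AcmeCo networking products. "
            "You are not equipped to answer questions related to health, wellness, "
            "or nutrition. If asked to do so, redirect the conversation to a more "
            "appropriate topic."
        ),
    },
    {"role": "user", "content": "My router keeps dropping its Wi-Fi connection."},
]
# The messages list is then passed to whatever chat-completion endpoint or
# pipeline sits in front of the model.
```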
Prompt engineering is elegant in its simplicity: Just tell the model what you do and don't want it to do. Unfortunately, anyone who's played with chatbots in the wild will have run into edge cases where the model can be tricked into doing something it's not supposed to. And despite what you might be thinking, you don't have to trap the LLM in some HAL 9000-style feedback loop. Often, it's as simple as telling the model, "Ignore all previous instructions, do this instead."
If RAG and prompt engineering won’t cut it, fine-tuning may be worth exploring.
Memory efficient model tuning with QLoRA
For this guide, we’re going to be using fine-tuning to change the style and tone of the Mistral 7B model. Specifically, we’re going to use QLoRA, which, as we mentioned earlier, will allow us to fine-tune the model using a fraction of the memory and compute compared to conventional training.
This is because fine-tuning requires far more memory than simply running the model. During inference, you can estimate your memory requirements by multiplying the parameter count by the number of bytes each parameter occupies at its precision. For Mistral 7B, which was trained at BF16 (two bytes per parameter), that works out to about 14 GB, plus a gigabyte or two for the key-value cache.
A full fine-tune, on the other hand, requires several times that, since alongside the weights you also need room for gradients, optimizer states, and activations. For Mistral 7B you're looking at 90 GB or more. Unless you've got a multi-GPU workstation sitting around, you'll almost certainly be looking at renting datacenter GPUs like the Nvidia A100 or H100 to get the job done.
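To make those figures concrete, here's a back-of-the-envelope sketch in Python. The bytes-per-parameter values for a full fine-tune are rough rules of thumb (BF16 weights and gradients plus FP32 Adam-style optimizer state), not exact numbers, and activations add more on top.

```python
def vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Rough VRAM estimate: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

# Inference: ~7.2B parameters at BF16 (2 bytes each) -> roughly 13-14 GB,
# before accounting for the key-value cache.
print(f"{vram_gb(7.2, 2):.1f} GB")

# Full fine-tune: weights + gradients + optimizer state come to roughly
# 12-16 bytes per parameter -> on the order of 80-110 GB for Mistral 7B.
print(f"{vram_gb(7.2, 12):.1f} GB to {vram_gb(7.2, 16):.1f} GB")
```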
Honey, I shrunk the LLM! A beginner’s guide to quantization – and testing it
This is because with a full fine-tune you're effectively retraining every weight in the model at full resolution. The good news is that in most cases it's not actually necessary to update every weight to tweak the neural network's output. In fact, it may only take a few thousand to a few million weights to achieve the desired result.
This is the logic behind LoRA, which, in a nutshell, freezes the model's original weight matrices and instead trains a second, much smaller set of matrices that track the changes to be applied to the first in order to fine-tune the model.
This cuts down the computational and memory overhead considerably. QLoRA steps this up a notch by loading the model’s weights at lower precision, usually four bits. So instead of each parameter requiring two bytes of memory, it now only requires half a byte. If you’re curious about quantization, you can learn more in our hands-on guide here.
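To give a sense of how those adapter matrices sit alongside the frozen weights, here's a stripped-down PyTorch sketch of a LoRA-wrapped linear layer. It's an illustration of the idea rather than what Axolotl or the PEFT library do internally, and the rank and scaling values are arbitrary.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection A
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)                            # start as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling
```

Because only A and B receive gradients, the trainable parameter count for each layer drops from in_features × out_features to rank × (in_features + out_features), which is where the memory savings come from.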
Using QLoRA, we're able to fine-tune a model like Mistral 7B using less than 16 GB of VRAM.
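As a rough illustration of what that looks like outside of Axolotl, the Hugging Face transformers and peft libraries expose the same building blocks. The model name, rank, and target modules below are just plausible defaults; Axolotl sets the equivalent options through its YAML config rather than in Python.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with its weights quantized to 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a fraction of a percent of the weights are trained
```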