Why Finetuning#
How should you approach using an LLM for an NLP task?
You can understand which operation is suitable for your current task from the figure below:
If you have plenty of time and a large amount of data, you can train a model from scratch; with a moderate amount of data, you can finetune a pretrained model; with limited data, the best choice is in-context learning, where the relevant information is supplied in the prompt, for example via RAG.
Of course, here we mainly focus on the finetuning part, which allows us to achieve better performance without retraining the model.
How to Finetune#
As we all know, GPU memory (VRAM) is the bottleneck for ordinary users who want to work with LLMs. Most people can only afford consumer-grade graphics cards, such as the RTX series, so we need a clever way to finetune within 16GB of VRAM.
Bottleneck of Finetuning#
When training a medium-sized model like LLaMA 7B in 32-bit precision, we need about 28GB of VRAM to store the model's parameters (we will explain how to estimate this later) and an equal amount to store the gradients during training. We usually also need twice the parameter size, another 56GB, to track the optimizer's state.
Let's do the calculation: $28 + 28 + 56 = 112$GB in total, against the 16GB we actually have.
So who will provide me with the missing 96GB of VRAM?
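The arithmetic behind that 96GB gap can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: the function name is made up for illustration, and it uses the same rounding as the text (1GB taken as $10^9$ bytes, two optimizer state values per parameter).

```python
# Rough VRAM estimate for full finetuning in fp32.
# Assumption: 1 GB = 10^9 bytes, matching the rounded numbers in the text.

def full_finetune_vram_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    weights = n_params * bytes_per_param / 1e9
    gradients = weights        # one gradient per parameter, same data type
    optimizer = 2 * weights    # two state values tracked per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }

budget = full_finetune_vram_gb(7e9)
available_gb = 16
missing_gb = budget["total"] - available_gb

print(budget)
print(f"missing: {missing_gb:.0f} GB")
```

Changing `bytes_per_param` to 2 or 1 gives a quick feel for how much each of the tricks below buys us.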
Solving the Problem#
Half Precision#
The first step is to load the model itself. In the 7B model, each parameter is stored as a 32-bit floating-point number.
One byte is 8 bits, so 32 bits take 4 bytes (4B). With 7 billion (7,000,000,000) parameters, the total storage required is $7,000,000,000 \times 4B = 28,000,000,000B \approx 28GB$.
(Strictly speaking, $1GB = 2^{10}MB = 2^{10} \times 2^{10}KB = 2^{30}B = 1,073,741,824B$, which would give about 26GB; we use the decimal approximation to keep the numbers round.)
This already exceeds our budget by $28 - 16 = 12$GB, so we need to pack the model parameters into a smaller form. A natural place to start is the size of each parameter: can we use 16-bit or 8-bit floating-point numbers (2B or 1B per parameter) instead? Simply switching to FP16 halves the VRAM needed for this part. The trade-off is reduced floating-point precision and a narrower representable range, which can cause overflow or underflow and, in turn, exploding or vanishing gradients.

Google proposed bfloat16 (brain float) to address this. Compared with the IEEE FP16 format, it widens the exponent from 5 bits to 8 bits, giving a wider range of values, and shortens the fraction from 10 bits to 7 bits, which simplifies the hardware implementation. This accelerates the training and inference of deep learning models without sacrificing too much precision.
We choose 16-bit floating-point numbers. With the VRAM requirement for the weights halved to 14GB, one card is enough to hold the model.
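To make the range versus precision trade-off concrete, here is a small sketch that derives the largest finite value and the relative step size from the bit layouts quoted above. The helper name is hypothetical, and the formula assumes the usual IEEE-style layout (biased exponent, implicit leading 1, top exponent code reserved for infinity and NaN).

```python
# Largest finite value and relative spacing for a binary floating-point
# format, computed only from its exponent and fraction widths.
# Assumption: IEEE-style encoding (bias = 2^(e-1) - 1, implicit leading 1).

def max_finite(exp_bits: int, frac_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    # Largest mantissa is just below 2; largest usable exponent equals the bias.
    return (2 - 2 ** -frac_bits) * 2.0 ** bias

formats = {
    "fp16": (5, 10),   # IEEE half precision
    "bf16": (8, 7),    # bfloat16
    "fp32": (8, 23),   # IEEE single precision
}

for name, (exp_bits, frac_bits) in formats.items():
    largest = max_finite(exp_bits, frac_bits)
    step = 2.0 ** -frac_bits   # gap between 1.0 and the next representable value
    print(f"{name}: max ~ {largest:.4g}, relative step ~ {step:.2g}")
```

The printout shows the design choice: bf16 reaches roughly the same magnitude as fp32, so large values are far less likely to overflow, while fp16 has finer steps but a much lower ceiling.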
Quantization#
Let's briefly explain the training process of neural networks:
We run a forward pass on the input, computing the activations layer by layer, and then compare the resulting prediction with the actual target. From the difference between the two (the loss), we calculate the gradient (partial derivative) of the loss function with respect to each parameter via backpropagation. We then use an optimization algorithm (such as SGD, i.e., stochastic gradient descent) to update the parameters. After many iterations, we obtain the trained model.
The gradients usually have the same data type as the model's parameters. Each parameter has a corresponding gradient, so even without considering the optimizer, we need VRAM equal to twice the size of the parameters.
To reduce this, we usually turn to quantization. Here we choose 8-bit integers (int8).
Image source: Nvidia Blog
As the quantization process shows, the representation range of the data is compressed and the values are squeezed onto a much coarser grid. The differences between parameters shrink, which may result in a significant loss of information. Clipping the outliers that exceed the new representation range can reduce the quantization error caused by these extreme values.
After choosing int8 quantization, the memory required for model parameters and gradients drops to 7GB each, 14GB in total.
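The effect of clipping can be seen in a minimal sketch of symmetric "absmax" int8 quantization. This is one simple scheme chosen for illustration, not necessarily the one any particular library uses; the function names, the sample weights, and the clip threshold of 0.5 are all made up.

```python
# Symmetric absmax quantization to int8, with optional outlier clipping.
# Illustrative only: real libraries typically quantize per block or per channel.

def quantize_absmax(values, clip=None):
    """Map floats to integers in [-127, 127]; returns (codes, scale)."""
    limit = clip if clip is not None else max(abs(v) for v in values)
    if limit == 0:
        limit = 1.0                      # avoid dividing by zero on all-zero input
    scale = 127 / limit
    # Values beyond the limit saturate at +/-127 (this is the clipping step).
    codes = [max(-127, min(127, round(v * scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c / scale for c in codes]

weights = [0.02, -0.15, 0.31, -0.07, 4.0]    # 4.0 plays the role of an outlier

codes, scale = quantize_absmax(weights)
codes_clipped, scale_clipped = quantize_absmax(weights, clip=0.5)

print("no clipping :", codes, dequantize(codes, scale))
print("clip at 0.5 :", codes_clipped, dequantize(codes_clipped, scale_clipped))
```

Without clipping, the single outlier stretches the range so far that the small weights land on only a handful of integer codes. With clipping, the outlier saturates at the edge of the range while the small weights are reconstructed much more accurately.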
LoRA#
Despite all this effort, the optimizer is still the key part of the problem.
The Adam optimizer, which is widely used in the industry, has good performance but also has a relatively high memory usage. The reasons are as follows:
In each iteration, the Adam optimizer updates the parameters $\theta$ using the following update formulas (no need to deeply understand the mathematics):
1. Calculate the first moment estimate (mean) and the second moment estimate (uncentered variance) of the gradients:
$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$
$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$
Here, $g_t$ is the gradient at time step $t$, and $\beta_1$ and $\beta_2$ are decay rates, usually close to 1, corresponding to the exponential moving averages introduced in Karpathy's BatchNorm tutorial in Section 3.
2. Perform bias correction on $m_t$ and $v_t$ to correct their initialization bias towards 0:
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
3. Update the parameters using the corrected first moment estimate and second moment estimate:
$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$
Here, $\eta$ is the learning rate, and $\epsilon$ is a very small constant added to maintain numerical stability.
This process is repeated at each time step until the model's parameters converge or reach a certain stopping condition.
The momentum vector $m_t$ and the variance vector $v_t$ from step 1 each hold one value per model parameter, 7B entries each, which is why the optimizer state needs twice the size of the parameters, as mentioned earlier.
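For readers who prefer code to formulas, the update above can be written as a short pure-Python sketch. The toy objective (minimising the sum of squares), the learning rate of 0.05, and the function name are assumptions for illustration; $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ are the commonly used defaults.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists; returns the new (theta, m, v)."""
    # Step 1: moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Step 2: bias correction (t starts at 1).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Step 3: parameter update.
    theta = [p - lr / (math.sqrt(vh) + eps) * mh
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy problem: minimise sum(p^2), whose gradient is 2 * p.
theta = [1.0, -2.0, 0.5]
m = [0.0] * len(theta)   # first-moment state: one value per parameter
v = [0.0] * len(theta)   # second-moment state: one value per parameter

for t in range(1, 101):
    grad = [2 * p for p in theta]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)

print("theta after 100 steps:", theta)
print("state values kept per parameter:", (len(m) + len(v)) // len(theta))
```

Note that `m` and `v` have to survive between calls, which is exactly the extra memory the optimizer costs us.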
The solution here is LoRA (Low-Rank Adaptation):
This technique reduces the number of trainable parameters, shrinks the weights that must be saved for each finetuned model, and speeds up training. In our scenario, LoRA greatly reduces the number of parameters that the optimizer has to update and track, cutting the VRAM required during training.
The key idea behind LoRA is that when finetuning a large model like Llama 2, you don't need to update every parameter (i.e., full-parameter finetuning), because some parameters and layers matter more than others, such as those in the attention mechanism that determine which tokens in a sequence relate to which other tokens and how. LoRA freezes the original weights of these selected layers and attaches a pair of small low-rank matrices to each of them. During training, backpropagation updates only these auxiliary low-rank matrices.
The rank, LoRA's r hyperparameter, can be adjusted. In practice, the parameters trained through LoRA may account for less than 10% of the total.
We keep the LoRA parameters in the higher-precision fp16, while the optimizer state is stored in fp32. Adam's two fp32 state values take 8 bytes per LoRA parameter, four times the 2 bytes of the parameter itself.
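A rough sketch of why the low-rank trick saves so much: for a weight matrix of shape $d_{out} \times d_{in}$, LoRA trains two factors $B$ ($d_{out} \times r$) and $A$ ($r \times d_{in}$) and computes $h = Wx + \frac{\alpha}{r} BAx$. The layer size of 4096, the rank of 8 and the scaling value $\alpha = 16$ below are illustrative assumptions, and the helper names are made up.

```python
# Parameter count and forward pass of a single LoRA-adapted linear layer.
# Pure Python for clarity; a real implementation would use a tensor library.

def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    # B has shape (d_out, rank) and A has shape (rank, d_in).
    return rank * (d_out + d_in)

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

d, r = 4096, 8
full = d * d
lora = lora_trainable_params(d, d, r)
print(f"full layer: {full:,}  LoRA factors: {lora:,}  share: {lora / full:.2%}")

# Tiny numeric check with B initialised to zeros: the adapted layer
# starts out identical to the frozen layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]        # rank 1, shape (1, 2)
B = [[0.0], [0.0]]       # shape (2, 1), zero-initialised
x = [1.0, 1.0]
h = lora_forward(W, A, B, x, alpha=16, rank=1)
print(h)
```

Only `A` and `B` receive gradients and optimizer state, so the optimizer's memory scales with the small factor count rather than with the full layer.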
One problem remains: the activations. The memory cost of the forward pass is roughly the size of the largest layer in the network multiplied by the batch size (the number of samples processed per update), which may still take 5GB of memory and exceed our budget.
QLoRA#
Can we use 4-bit quantization? This is the idea proposed in the QLoRA paper, which also uses paged optimizers to move pages of the optimizer state to CPU memory when needed, softening the memory spikes during training:
For this, a new data type called nf4 (normal float 4) is introduced.
At 4 bits (half a byte) per parameter, the 7B weights take about 3.5GB, which saves even more VRAM.
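Where the saving comes from is easy to show: a 4-bit code is just an index from 0 to 15, so two of them fit in one byte. The sketch below only illustrates the storage side. It does not reproduce the nf4 codebook itself, the function names are hypothetical, and the estimate ignores the small overhead of the scaling constants that a real quantizer stores alongside the codes.

```python
# Packing 4-bit codes two per byte, plus the resulting weight footprint.
# Assumption: every code is an integer in [0, 15].

def pack_4bit(codes):
    """Pack a list of 4-bit codes into bytes, two codes per byte."""
    if len(codes) % 2:
        codes = codes + [0]              # pad to an even count
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))

def unpack_4bit(packed, count):
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]                 # drop the padding code, if any

codes = [0, 15, 7, 8, 3]
packed = pack_4bit(codes)
restored = unpack_4bit(packed, len(codes))

weights_gb = 7e9 * 0.5 / 1e9             # half a byte per parameter
print(len(packed), "bytes for", len(codes), "codes")
print(f"7B weights at 4 bits: {weights_gb} GB")
```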
Gradient Accumulation#
The last problem is the choice of batch size. If each update uses only a few samples, the variance during training is large; in the extreme case of one sample per update, it becomes pure SGD (stochastic gradient descent). It is therefore generally recommended to pick a middle ground, the sweet spot between large, smooth steps and small, abrupt steps. This is why batch sizes like 32, 64, and 128 are commonly used.
But now we can only load one sample at a time, so the Gradient Accumulation technique is introduced.
The key idea is to achieve the effect of using a larger batch size without increasing additional memory overhead.
The operations are as follows:

1. Batch Processing: Divide the larger batch into multiple smaller batches (their size is determined by the available memory). For each smaller batch:
   - Perform forward propagation to calculate the loss.
   - Perform backward propagation to calculate the gradients for the current smaller batch, but do not update the model parameters immediately.

2. Gradient Accumulation: Add the gradients calculated for each smaller batch to the previously accumulated gradients, instead of using them to update the parameters immediately.

3. Parameter Update: After processing all the smaller batches and accumulating enough gradients, use the accumulated gradients to update the model parameters in a single step.
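The three steps above can be checked on a toy problem. The sketch below fits a one-parameter model $\hat{y} = w \cdot x$ with mean squared error and confirms that accumulating gradients over micro-batches of one sample gives the same gradient as processing the whole batch at once. The data, the learning rate and the function names are made up for illustration.

```python
# Gradient accumulation on a one-parameter linear model, in pure Python.

def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # Forward and backward pass on the micro-batch only; weight its
        # gradient by its share of the full batch and add it to the running sum.
        total += grad_mse(w, micro) * len(micro) / len(data)
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.0
lr = 0.01

full = grad_mse(w, data)                              # whole batch at once
accum = accumulated_grad(w, data, micro_batch_size=1) # one sample at a time

w -= lr * accum   # single parameter update after all micro-batches
print(full, accum, w)
```

Only one micro-batch needs to be in memory at a time, which trades a longer wall-clock time per update for a smaller activation footprint.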
Finetuning in Action: Mistral 7B#
QLoRA
16GB VRAM
Mixtral 8x7B (MoE)#
Hardware Requirement: >=65GB VRAM
Thank you for reading, and I will update the finetuning in action part as soon as possible~