The Art of Fine-tuning

Screenshot 2024-03-16 at 15.10.16

Why Fine-tuning#

How to approach using LLM for an NLP task?
You can understand which operation is suitable for your current task from the figure below:

Screenshot 2024-03-15 at 10.17.08

If you have time and a large amount of data, you can retrain the model; with a certain amount of data, you can fine-tune the pre-trained model; if you have limited data, the best choice is "in context learning", such as RAG.

Of course, here we mainly focus on the fine-tuning part, which allows us to achieve better performance without retraining the model.

How to Fine-tune#

As we all know, graphics cards (VRAM) are the bottleneck for us ordinary players to play LLM. Most people can only afford consumer-grade graphics cards, such as the RTX series, so we need to find a clever way to utilize the 16GB VRAM for fine-tuning.

Screenshot 2024-03-15 at 10.32.09

Bottleneck of Fine-tuning#

When training a medium-sized model like llama 7B, we may need about 28GB VRAM to store the original parameters of the model (we will explain how to estimate this later), and an equal amount of VRAM to store the gradients during training. Usually, we also need twice the amount of parameters to track the optimizer's state.

Let's do the calculation:

28+28+2×2816=9628 + 28 + 2 \times 28 - 16 = 96

So who will provide me with the missing 96GB VRAM?

Solving the Problem#

Half Precision#

The first step is to load the model itself. For the 7B model, each parameter has a unit of 32-bit floating-point number.
One byte is 8 bits, so 32 bits require 4 bytes (4B). For 7 billion, which is 7,000,000,000, the total storage size required is 7,000,000,000×4B=28,000,000,000B28GB7,000,000,000 \times 4B = 28,000,000,000B\approx28GB.
(1GB=210MB=210×210KB=230B=1,073,741,824B1GB = 2^{10}MB = 2^{10} \times 2^{10}KB = 2^{30}B = 1,073,741,824B)

Here we have exceeded 28-16=12GB, so we need to find a way to pack the model parameters into a smaller form. A very natural idea is to start with the unit of the parameters. Can we use 16 or 8-bit floating-point numbers (corresponding to 2B and 1B units of storage space) instead? As long as we switch to F16, we can halve the VRAM requirement for this part. As the tradeoff, the floating-point precision and representation range will be reduced, and there may be cases of gradient explosion or gradient vanishing. Google has proposed bfloat16 (brain float), which aims to provide a wider range of values (compared to the IEEE specification, exponent: 5 bits to 8 bits) and simplify the floating-point format for hardware implementation (fraction: 10 bits to 7 bits), thereby accelerating the training and inference processes of deep learning models without sacrificing too much precision.

We choose 16-bit floating-point numbers, and after halving the VRAM requirement, one card is enough:

Screenshot 2024-03-15 at 11.09.21


Let's briefly explain the training process of neural networks:

We perform a forward pass on the input content, which means activation, and then compare the result with the predicted target. Based on the difference between the prediction and the actual target (loss), we calculate the gradient (partial derivative) of the loss function with respect to each parameter for backpropagation. We choose an optimization algorithm (such as SGD, i.e., stochastic gradient descent) to update the parameters. After multiple iterations, we obtain the model.

The gradients in the model usually have the same data type as the parameters in the original model. Each parameter has a corresponding gradient, so without considering the optimizer, we need VRAM with twice the amount of parameters.

Usually, we use the method of quantization. We can choose 8-bit floating-point numbers.

Pasted image 20240315221418

Image source: Nvidia Blog

As can be seen from the quantization process, the representation range of the data will be compressed, and the data will be compressed and concentrated. The difference between each parameter will be reduced, which may result in a significant loss of information. Clipping the outliers that exceed the new representation range can reduce quantization errors caused by these extreme values.

After choosing int8 quantization, the memory required for model parameters and gradients is reduced to 14GB:

Screenshot 2024-03-15 at 11.23.52


Although we have made so many efforts, the optimizer is the key part.

The Adam optimizer, which is widely used in the industry, has good performance but also has a relatively high memory usage. The reasons are as follows:

In each iteration, the Adam optimizer updates the parameters $\theta$ using the following update formulas (no need to deeply understand the mathematics):

  1. Calculate the first moment estimate (mean) and the second moment estimate (uncentered variance) of the gradients:

mt=β1mt1+(1β1)gtm_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t

vt=β2vt1+(1β2)gt2v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2

Here, $g_t$ is the gradient at time step $t$, and $\beta_1$ and $\beta_2$ are decay rates, usually close to 1, corresponding to the exponential moving averages introduced in Karpathy's Batch-Norm tutorial in Section 3.

  1. Perform bias correction on $m_t$ and $v_t$ to correct their initialization bias towards 0:

m^t=mt1β1t\hat{m}_t = \frac{m_t}{1 - \beta_1^t}

v^t=vt1β2t\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

  1. Update the parameters using the corrected first moment estimate and second moment estimate:

θt+1=θtηv^t+ϵm^t\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t

Here, $\eta$ is the learning rate, and $\epsilon$ is a very small constant added to maintain numerical stability.

This process is repeated at each time step until the model's parameters converge or reach a certain stopping condition.

For the momentum vector and variance vector in 1., they each have 7B parameters, which is the reason we need twice the amount of parameters mentioned earlier.

The solution here is LoRA (Low-Rank Adaptation):

Screenshot 2024-03-15 at 11.37.11

This technique can reduce the number of trainable parameters, achieve a reduction in the space occupied by model weights, and speed up the training process. In this scenario, LoRA significantly reduces the number of parameters that need to be optimized and tracked by the optimizer, reducing the VRAM required during training.

The key idea behind LoRA is that when fine-tuning a large model like llama2, you don't need to fine-tune every parameter (i.e., full-parameter fine-tuning), because there are usually some parameters and layers that are more important than others, such as those responsible for attention mechanisms and determining which tokens in a sequence are related to other tokens and how. LoRA extracts these specific parameters and injects them into a low-rank matrix. When training, propagating, and updating parameters, only this auxiliary low-rank matrix is modified.

Screenshot 2024-03-16 at 17.20.56

The R hyperparameter in LoRA, which is the rank, can be adjusted. But in practice, the specific parameters selected by LoRA may account for less than 10% of the total.

Screenshot 2024-03-15 at 11.54.44

For LoRA parameters, choose a higher precision of fp16, while the unit for optimizer state is fp32, so the memory consumption here is four times the amount of parameters.

But there is still one problem here, which is the activation part. The cost of the forward pass during activation is the size of the largest layer in the neural network multiplied by the batch size (the number of samples updated at once), which may still occupy 5GB of memory, exceeding our budget.


Can we use 4-bit quantization? This is the idea proposed by the QLoRA paper, which uses the paged atom optimization technique to move the page memory of the optimizer state to the CPU when needed, reducing the impact of training peaks:

Screenshot 2024-03-15 at 11.54.07

For this, a new unit called nf4 (normal float 4) is introduced.

Screenshot 2024-03-16 at 15.12.09

This can save some VRAM:

Screenshot 2024-03-15 at 12.11.07

Gradient Accumulation#

The last problem lies in the choice of batch size. If we choose a small number of samples for each update, the variance during training will be large, and in extreme cases, it will be completely SGD (stochastic gradient descent). Therefore, it is generally recommended to choose a middle ground, which is the sweet spot between large and smooth steps and small and abrupt steps. This is why batch sizes like 23, 64, and 128 are commonly used.

But now we can only load one sample at a time, so the Gradient Accumulation technique is introduced.

The key idea is to achieve the effect of using a larger batch size without increasing additional memory overhead.

Pasted image 20240315122335

The operations are as follows:

  1. Batch Processing: Divide the larger batch data into multiple smaller batches (the size of these smaller batches is determined based on available memory resources). For each smaller batch:

    • Perform forward propagation to calculate the loss.
    • Perform backward propagation to calculate the gradients for the current smaller batch, but do not update the model parameters immediately.
  2. Gradient Accumulation: Accumulate the gradients calculated for each smaller batch onto the previous gradients, instead of using them to update the parameters immediately.

  3. Parameter Update: After processing all the smaller batches and accumulating enough gradients, use the accumulated gradients to update the model parameters at once.

Fine-tuning in Action: Mistral 7B#



Mixtral 8x7B (MoE)#

Hardware Requirement: >=65GB VRAM

Thank you for reading, and I will update the fine-tuning in action part as soon as possible~

Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.