The Evolution of LLM (Part 1): The Simplicity of Bigram

The source code for this section can be found in the source code repository.

Previously, by implementing micrograd, we came to understand what gradients mean and how to use them for optimization. Now we can move on to language models and see how basic language models are designed and modeled.

Development of Language Models#

In this article, we will focus on the Bigram, which is a very simple model architecture by today's standards. It is based on the assumption of a Markov chain, where the occurrence of the next word depends only on the previous word.

Dataset Introduction#

The dataset contains 32,033 English names, with lengths ranging from 2 to 15 characters.

Information in a Name#

Taking the 4th data point in the dataset, 'isabella', as an example, this single word contains many training examples, such as:

  • The character i is very likely to appear in the first position of a name.
  • s is likely to follow i;
  • a is likely to follow is... and so on;
  • After 'isabella' appears, the word is likely to end, which is very important information.

Bigram#

In this model, we always work with 2 characters at a time: we look at one given character and predict the next character in the sequence. It models only this local structure, which makes it a very simple and weak model.

for w in words[:3]:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        print(ch1, ch2)

The simplest implementation of a bigram model is "counting": we basically just count how often each of these combinations appears in the training set.

A dictionary can be used to maintain the count of each bigram:
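
A minimal sketch of this dictionary-based counting (b is just an illustrative name; the <S>/<E> markers follow the same convention as the loop above):

b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1   # count how many times this pair occurs

sorted(b.items(), key=lambda kv: -kv[1])[:5]   # the most common bigrams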

Create a mapping of character combinations:

N = torch.zeros((27,27), dtype=torch.int32) # 27 tokens: 26 letters plus the '.' marker

chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0 # Use . to replace the marker to avoid impossible cases like <S> being the second letter
itos = {i:s for s,i in stoi.items()}
itos

Copy and modify the previous counting loop:

for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

To make the output more visually appealing, use the matplotlib library:

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='gray')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='gray') # Here N[i,j] is torch.Tensor, use .item() to get its value
plt.axis('off')

We have now essentially completed the counting required for the Bigram model; what remains is to sample from the model based on these counts (i.e., probabilities):

  • Sample the first character of the name: look at the first row

The counts in the first row tell us the frequency of all characters as the first character.

Now we can use torch.multinomial (a function for sampling from a multinomial distribution) to sample from this distribution. Specifically, it draws samples according to the weights in the input tensor, which define the probability of each element being selected.

To make the experiment reproducible, use torch.Generator:
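
A small toy demo of this, as a sketch (the 3-element example is illustrative; the generator g defined here is reused in the sampling code below):

g = torch.Generator().manual_seed(2147483647)
p = torch.rand(3, generator=g)   # a toy distribution over 3 outcomes
p = p / p.sum()                  # normalize the weights so they sum to 1
torch.multinomial(p, num_samples=20, replacement=True, generator=g)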

Meaning: more than half of the sampled indices are 0, a smaller portion are 1, and 2 appears only rarely.

p = N[0].float()
p = p / p.sum()
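
Drawing one sample from this distribution gives the index of the first character (as a sketch; the exact character depends on the generator state):

ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
itos[ix]   # the sampled first character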

The first character is predicted as c, so next we look at the row starting with c... build a loop based on this logic:

Loop 20 times:

for i in range(20):
    out = []
    ix = 0
    while True:
        p = N[ix].float()
        p = p / p.sum()
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))

The results look a bit strange, but the code is actually fine; the current model design simply cannot produce anything better.

The current model can also generate single-character names like 'h'. The reason is that it only knows which characters are likely to follow 'h'; it does not know whether 'h' is the first character. It captures only very local relationships. Still, it is better than a completely untrained model, since at least some results look somewhat like names.

Right now, each loop iteration recomputes the probabilities. To improve efficiency, and to practice the torch tensor API, we use the following method instead:

dim and keepdim=True let us specify which dimension to reduce over. Here we want the total of each of the 27 rows, which in a two-dimensional tensor means dim=1.
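
A quick shape check, as a sketch:

N.sum(dim=1, keepdim=True).shape   # torch.Size([27, 1]): one total per row
N.sum(dim=1).shape                 # torch.Size([27]): the reduced dimension is squeezed out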

In addition, another key point is the broadcasting semantics in PyTorch.

In short, if a PyTorch operation supports broadcasting, its tensor parameters can be automatically expanded to the same size (without copying data).

A tensor can be broadcast if the following conditions are met:

  • Each tensor has at least one dimension.
  • When iterating over the dimension sizes, starting from the trailing dimension, the sizes must either be equal, one of them must be 1, or one of them must not exist.

The current situation meets these requirements and is broadcastable, i.e., (27x27) / (27x1): the (27x1) column of row totals is effectively copied across 27 columns, normalizing each row.

P = N.float()
P /= P.sum(dim=1, keepdim=True) # in-place operation, avoids creating a new tensor

for i in range(50):
    out = []
    ix = 0
    while True:
        p = P[ix] # Avoid repeated calculations
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))

Final form

So the question arises, how to set the loss function to evaluate the model's performance in predicting the training set?

Printing the probability of each combination, we find that compared with random prediction ($\frac{1}{27} \approx 4\%$), some combinations are as high as $39\%$, which is clearly a good sign. If the model performed exceptionally well, these probabilities would approach 1, meaning the model correctly predicts what comes next in the training set.

Negative Log Likelihood Loss#

So how do we summarize these probabilities into a single number that measures the quality of the model?

When you study the literature on maximum likelihood estimation and statistical modeling, you will find that the standard quantity used here is the "likelihood". The product of all the probabilities the model assigns to the training data is the likelihood, and this product should be as high as possible. The current model's probabilities are all small numbers between 0 and 1, so the overall product is a very small value.

For convenience, people usually use "log likelihood," where we just need to take the logarithm of the probabilities to get the log likelihood:
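
A tiny sketch of this, with a placeholder value standing in for one of the probabilities:

prob = 0.0123                     # placeholder probability value
torch.log(torch.tensor(prob))     # convert back to a Tensor so torch.log can be applied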

Here we change the type of prob back to Tensor to call torch.log(), and we can see that the larger the probability, the closer the log likelihood is to 0.

According to basic mathematics, $\log(a \cdot b \cdot c) = \log(a) + \log(b) + \log(c)$, so the log likelihood is simply the sum of the logarithms of the individual probabilities.

Under the current definition, the better the model, the closer the log likelihood is to 0. However, the convention for a "loss function" is "lower is better", i.e., we minimize the loss, so we take the negative of the current expression, which gives the negative log likelihood:

log_likelihood = 0.0
n = 0 # Count for averaging
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}:{prob:.4f} {logprob:.4f}')

nll = -log_likelihood 
print(f'{nll / n}') # normalize

In summary: this loss function is well behaved because its minimum value is 0, and the higher the value, the worse the predictions. Our training task is to find the parameters that minimize the negative log likelihood loss. Maximizing the likelihood is equivalent to maximizing the log likelihood, because log is a monotonic function and preserves the ordering; maximizing the log likelihood is in turn equivalent to minimizing the negative log likelihood, which in practice becomes minimizing the average negative log likelihood.
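
Written as formulas, with the product and sums running over all $n$ bigram pairs $(c_i, c_{i+1})$ in the training set:

$$
\max \prod_{i} P(c_{i+1}\mid c_i)
\;\Longleftrightarrow\;
\max \sum_{i} \log P(c_{i+1}\mid c_i)
\;\Longleftrightarrow\;
\min \left(-\sum_{i} \log P(c_{i+1}\mid c_i)\right)
\;\Longleftrightarrow\;
\min \left(-\frac{1}{n}\sum_{i} \log P(c_{i+1}\mid c_i)\right)
$$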

Model Smoothing#

Taking a made-up name containing 'jq' as an example, the loss turns out to be infinite. The reason is that the 'jq' combination never appears in the training set, so its probability is 0 and its log is negative infinity.

To solve this problem, a very simple method is model smoothing: add 1 to the count of every combination:

P = (N + 1).float()
P /= P.sum(dim=1, keepdim=True) # rows are normalized as before

The more smoothing we add, the flatter (more uniform) the resulting model; the less we add, the more peaked it becomes. 1 is a good value because it guarantees there are no 0s in the probability matrix P, which fixes the infinite loss.

Using Neural Network Frameworks#

In a neural network framework, the way we approach the problem is slightly different, but we will end up in a very similar place. We will use gradient-based optimization to tune the parameters of this network, adjusting the weights so that the neural network correctly predicts the next character.

The first thing is to compile the training set for this neural network:
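
A sketch of this step, assuming (as the 5 examples used below suggest) that only the first word is used for now:

xs, ys = [], []
for w in words[:1]:                 # just the first word for now
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])        # input: index of the current character
        ys.append(stoi[ch2])        # label: index of the next character
xs = torch.tensor(xs)
ys = torch.tensor(ys)
xs, ys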

Here, it means that when the input character index is 0, the correct label to predict is 5, and so on.

Difference Between tensor and Tensor#

The difference is that torch.Tensor always uses the default type float32, while torch.tensor automatically infers the data type from the provided data.

So it is generally recommended to use the lowercase torch.tensor; since we want integer indices here, that is what we use.
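
A quick check of the difference:

torch.tensor([1, 2, 3]).dtype   # torch.int64  -- dtype inferred from the data
torch.Tensor([1, 2, 3]).dtype   # torch.float32 -- always the default float type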

One-Hot Encoding#

We cannot simply plug the samples in directly, because they are integers giving character indices, and having an input neuron take an integer index and multiply it by weights is meaningless.

A commonly used integer encoding method is one-hot encoding:

For example, for the integer 13, one-hot encoding creates a vector in which only the 13th dimension is 1 and all others are 0; this vector can then be fed into the neural network.

import torch.nn.functional as F

xenc = F.one_hot(xs, num_classes=27)

It is necessary to specify num_classes; otherwise one_hot infers the number of classes from the largest index in the data, which would give far fewer than 27 columns here.

But now xenc.dtype shows that its type is int64, and torch does not support specifying the output type of one_hot, so it needs to be cast explicitly:

xenc = F.one_hot(xs, num_classes=27).float()

Now the result is a float32 floating point number that can be input into the neural network.

Building Neurons#

As introduced in micrograd, the function a neuron performs is essentially to compute $w \cdot x + b$ for the input $x$, where the operation is a dot product:

W = torch.randn((27,27))
# (5,27) @ (27,27) = (5,27)
xenc @ W

Now we have fed 27-dimensional inputs into the first layer of a neural network with 27 neurons.

So how do we interpret the output of the neural network?

Our current outputs contain both positive and negative values, while the earlier counts and probabilities were all positive. So we interpret the network's outputs as log counts (logits for short); to recover positive counts, we exponentiate the logits:

Softmax#

# randomly initialize the weights of 27 neurons; each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27,27), generator=g, requires_grad=True) # requires_grad so loss.backward() can fill W.grad

# Forward pass

xenc = F.one_hot(xs, num_classes=27).float()
logits =  xenc @ W # predict log-count
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True) # probabilities for next character

By the way, the last 2 lines here are together called softmax.

Softmax is a layer used frequently in neural networks: it takes logits, exponentiates them, and normalizes them, turning the outputs of a layer into values that are all positive and sum to 1, just like a probability distribution. You can place it on top of any linear layer in a neural network.
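
Written out, for a logits vector $z$:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
$$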

nlls = torch.zeros(5)
for i in range(5):
    # i-th bigram
    x = xs[i].item()
    y = ys[i].item()
    print('---------')
    print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})') 
    print('inputs to neural net:', x)
    print('output probabilities from the neural net:', probs[i])
    print('label (actual next character):', y)
    p = probs[i, y]
    print('probability assigned by the net to the correct character:', p.item())
    logp = torch.log(p)
    print(f'logp={logp.item():.4f}')
    nll = -logp
    print('negative log-likelihood:', nll.item())
    nlls[i] = nll

print('===========')
print('average negative log-likelihood, i.e. loss:', nlls.mean().item())

Now the loss consists only of differentiable operations, and we can adjust $W$ to minimize the loss based on the gradients of the loss with respect to the weight matrix.

For the first example, the probabilities we are interested in are these:

To improve the efficiency of accessing these probabilities, instead of storing them in tuples like this, in torch, we can do it like this:
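
A sketch of the vectorized indexing (this is the same form used in the training loop below):

probs[torch.arange(5), ys]   # for each of the 5 examples, the probability of the correct next character
loss = -probs[torch.arange(5), ys].log().mean()   # average negative log likelihood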

Running backpropagation gives us the gradient information for each weight:
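
As a sketch (the same two lines appear in the training loop below):

W.grad = None    # reset the gradient
loss.backward()  # backpropagate through probs, counts, and logits down to W
W.grad[0, 0]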

For example, W.grad[0,0] = 0.0121: the gradient is positive, so if this weight increases, the loss increases as well.

We can use this information to update the weights of the neural network, analogous to the implementation in micrograd:

W.data += -0.1 * W.grad

Recalculate the forward pass, expecting to obtain a lower loss:

# gradient descent
for k in range(1000):

    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W # predict log-count
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True) # probabilities for next character
    loss = -probs[torch.arange(5), ys].log().mean() + 0.01 * (W**2).mean() # L2 regularization
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -50 * W.grad

In the subsequent steps, the neural network will become more complex, all the way up to Transformers, but the structure of the neural network's input will not fundamentally change; the only thing that changes is how we perform the forward pass.

Expanding the counting-based bigram to more input characters is clearly unrealistic, because the number of character combinations explodes, making that method non-scalable. The neural network approach scales much better and allows for continuous improvement, which is the focus of this series of courses.

Regularization#

It turns out that the gradient-based framework has its own equivalent of smoothing:
If all elements of $W$ are equal, or specifically all 0, then the logits all become 0, exponentiating them gives counts of 1 everywhere, and the resulting probabilities are completely uniform.

Thus, encouraging W to approach 0 is essentially equivalent to the add-one smoothing we did earlier. The stronger this push in the loss function, the smoother the resulting distribution becomes; this is "regularization". In practice we add a small extra component to the loss function, which we call the "regularization loss".

Now we not only try to make all the probabilities work out, we also try to push all entries of $W$ toward 0, which helps prevent overfitting. In classification tasks, this can lead to a more balanced predicted probability distribution over the categories, reducing overconfidence in any single category.
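
In the training loop above, this is the extra term added to the loss (0.01 being the chosen regularization strength):

reg_loss = 0.01 * (W**2).mean()   # pushes all entries of W toward 0
loss = -probs[torch.arange(5), ys].log().mean() + reg_loss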

Equivalent Models#

We find that the losses of the two models are similar, and the sampling results are the same as those using the probability matrix, indicating that they are actually the same model. However, the latter approach arrives at the same answer through a very different interpretation using neural networks.

Next, we will feed more characters into the neural network at once and everything stays in the same form: the network outputs logits, which are normalized in the same way, and the loss and all the other pieces of the gradient-based framework remain the same; the only thing that changes is that the network keeps getting more complex, all the way up to the Transformer.
