The focus of this section is to have a deep impression and understanding of the activations of neural networks during training, especially the gradients flowing backward. Understanding the historical development of these structures is important because RNNs (Recurrent Neural Networks), as universal approximators, can theoretically implement all algorithms, but they are not easily optimized with gradientbased techniques. The difficulty in optimization is a key point of understanding, which can be concluded by observing the performance of activations and gradients during training. We will also see many variants attempting to improve this situation.
Starting Point: MLP#
Let's revisit the MLP model from the previous chapter, which is our current starting point:
Modify the code from the previous chapter to avoid hard coding, making it easier to adjust later.
# MLP revisited
n_embd = 10 # Dimension of character embedding vectors
n_hidden = 200 # Number of hidden units in the hidden layer
# vocab_size = 27, block_size = 3
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size,n_embd), generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g)
b1 = torch.randn(n_hidden, generator=g)
W2 = torch.randn((n_hidden, vocab_size), generator=g)
b2 = torch.randn(vocab_size, generator=g)
parameters = [C, W1, b1, W2, b2]
print(sum(p.nelement() for p in parameters)) # Total number of parameters
for p in parameters:
p.requires_grad_()
The total number of parameters is the same as in the previous chapter: 11897
The training part of the neural network is also modified in a way that does not change its functionality:
# Same minibatch optimization as the previous chapter
max_steps = 200000
batch_size = 32
lossi = []
for i in range(max_steps):
# minibatch
ix = torch.randint(0,Xtr.shape[0],(batch_size,), generator=g)
Xb, Yb = Xtr[ix], Ytr[ix] # batch X, Y
# forward pass
emb = C[Xb] # embedding
embcat = emb.view(emb.shape[0], 1) # Concatenate all embedding vectors
hpreact = embcat @ W1 + b1 # Hidden layer preprocessing
h = torch.tanh(hpreact) # Hidden layer
logits = h @ W2 + b2 # Output layer
loss = F.cross_entropy(logits, Yb) # Loss function
# backward pass
for p in parameters:
p.grad = None
loss.backward()
# update
lr = 0.1 if i < 100000 else 0.01
for p in parameters:
p.data += lr * p.grad
# tracks stats
if i % 10000 == 0:
print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
lossi.append(loss.log10().item())
plt.plot(lossi)
Note that the initial loss function was as high as 27, and then it quickly dropped to a very low value; one can speculate about the reason.
Like a hockey stick.
The visualization of the loss function also includes a loss function segmented by index:
@torch.no_grad() # This decorator disables gradient tracking for all operations in this function
def split_loss(split):
x,y = {
'train': (Xtr, Ytr),
'val': (Xdev, Ydev),
'test': (Xte, Yte),
}[split]
# forward pass
emb = C[x] # (N, block_size, n_embd)
embcat = emb.view(emb.shape[0], 1) # Concatenate to (N, block_size * n_embd)
h = torch.tanh(embcat @ W1 + b1) # (N, n_hidden)
logits = h @ W2 + b2 # (N, vocab_size)
loss = F.cross_entropy(logits, y)
print(split, loss.item())
split_loss('train')
split_loss('val')
The effect of this decorator is equivalent to setting
requires_grad = False
for each tensor, so it won't callbackward()
, thus not requiring the underlying graph that affects efficiency.
The losses here are:
train 2.2318217754364014
val 2.251192569732666
# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
out = []
context = [0] * block_size # initialize with all ...
while True:
# Forward pass
emb = C[torch.tensor([context])] # (1, block_size, n_embd)
h = torch.tanh(emb.view(1, 1) @ W1 + b1)
logits = h @ W2 + b2
probs = F.softmax(logits, dim=1)
# Sample from the distribution
ix = torch.multinomial(probs, num_samples=1, generator=g).item()
# Move the context window and track the sample
context = context[1:] + [ix]
out.append(ix)
# Break if the special '.' termination symbol is sampled
if ix == 0:
break
print(''.join(itos[i] for i in out))
Sampled results:
carlah. amorilli. kemri. rehty. sacessaeja. huteferniya. jareei. nellara. chaiir. kaleig. dham. jore. quint. sroel. alian. quinaelon. jarynix. kaeliigsat. edde. oia.
The performance is still not very good, but it is better than in Bigram.
Initialization Fix#
First, as mentioned above, the initial loss is too high, indicating that our neural network has issues during the initialization step.
What do we want during initialization?
There are 27 possible characters to predict the next character, and we have no reason to say that any one character is more likely than the others. Therefore, we expect the initial probability distribution to be uniformly distributed (1/27). Let's manually calculate what the correct loss should be:
Much smaller than 27.
This is due to the random allocation of the neural network at the beginning, which causes a large discrepancy in the probability distribution among characters. We can set a breakpoint to check:
The logits should all be around 0, but now with such an extreme distribution, it leads to incorrect confidence, making the loss very large.
Since logits are defined as h @ W2 + b2
, we can make these parameters smaller:
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.randn(vocab_size, generator=g) * 0
The logits are now close to 0.
Very close to the desired loss (mentioned above as 3.2985).
Can we set W2
to 0? Wouldn't that give us the lowest loss?
You wouldn't want to set the weight parameters in the neural network to exactly 0, as this would cause symmetry breaking issues and hinder the network from learning useful features, so it should only be set to a very small number.
The current loss still has some entropy, which is referred to as symmetry breaking.
Let's take a look at the explanation of this term on
devv.ai
, and I recommend this website; it's very useful.
Removing the break, we validate our optimization and get the expected results:
0/ 200000: 3.3221
10000/ 200000: 2.1900
...
190000/ 200000: 1.9368
train 2.123051404953003
val 2.163424015045166
Now the initial value of the loss function is very normal, and the graph no longer looks like a hockey stick, and the final performance has improved. The reason for the performance improvement is that we used more cycles to optimize the neural network, rather than spending the initial few thousand iterations on overly compressed initial weights.
Vanishing Gradients#
Visualizing h and hpreact:
We can see that a large number of activation values in the hidden layer are distributed around ±1, which is caused by the preaction part being distributed over a large range (for inputs with large absolute values, the output of tanh saturates to ±1).
This is actually very bad. Recall how we implemented tanh and its backpropagation in micrograd; tanh will always reduce the gradient by a certain proportion. When passing through the tanh unit at $t=±1$, our gradient disappears. Because tanh is flat at outputs close to ±1, it has little effect on the loss, and the gradient is 0. If $t$ is 0, the neuron becomes inactive (the gradient is output unchanged).
def tanh(self):
# ... forward pass code
def _backward():
self.grad += (1  t**2) * out.grad
out._backward = _backward
return out
If all examples in a neuron are ±1, then that neuron is not learning at all, and can be said to be a "dead neuron".
It is observed that there is not a single neuron that is completely white (absolute value > 0.99 = True); neurons are still learning, but we certainly hope that there are not so many preactivations at ±1 during initialization.
This applies to other nonlinear activation functions used in neural networks, such as Sigmoid, which also acts as a squashing neuron function; for ReLU, it occurs when the preaction is negative, leading to vanishing gradients during backpropagation when the gradient is zero. Besides potentially occurring during initialization, if the learning rate is set too high, some neurons may receive very large gradient updates during backpropagation. Such large updates can push the weights of these neurons to extreme values, causing them to stop activating for all input data (i.e., output always 0). This phenomenon is vividly referred to as neurons being "knocked out," becoming dead neurons, akin to permanent brain damage; the commonly used ELU also has this issue.
However, Leaky ReLU does not have this problem, as it does not have a flat tail and will always receive a gradient.
Weight Initialization#
The source of the problem, hpreact
, is too far from 0 after being multiplied by W1 + b1. What we want is very similar to our previous expectation for logits:
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * 0.1
b1 = torch.randn(n_hidden, generator=g) * 0.01
The histogram is much better now, as the distribution range during preaction has been reduced to (2, 1.5).
We can see that there are no longer neurons above 0.99.
Increasing the coefficient multiplying W1 to 0.2 yields satisfactory results:

Original:
train 2.2318217754364014
val 2.251192569732666 
Fixed the overconfidence issue of softmax:
train 2.123051404953003
val 2.163424015045166 
Fixed the issue of tanh layer being overly saturated during initialization:
train 2.035430431365967
val 2.1064531803131104
It can be seen that although the initialization was poor, this network still learned some features. But this is only because this singlelayer MLP network is very shallow, and the optimization problem is simple, being relatively tolerant of these errors. When faced with deeper networks, these issues would have a severe impact.
So how do we set these scaling adjustments for large and deep neural networks?
Initialization Strategy#
Let's observe what happens to the mean and standard deviation when two Gaussian distributions are multiplied during preactivation:
The mean is still close to 0 because this is a symmetric operation, but the standard deviation has tripled, so this Gaussian distribution is being stretched.
In neural networks, we want to ensure that during initialization, the activation distributions of each layer do not differ too much to avoid vanishing or exploding gradients. If the activation value distributions of each layer are too wide or too narrow, it can lead to instability during training. Ideally, the network should have a good activation distribution at the start, ensuring that information and gradients can effectively propagate through the network.
This corresponds to a concept: Internal Covariate Shift.
So what is a suitable scaling ratio?
w = torch.randn(10, 200) * 0.3
'''
tensor(0.0237) tensor(1.0103)
tensor(0.0005) tensor(0.9183)
Already close, but how to make it exactly standard normal distribution?
'''
During the weight initialization process of neural networks, it is common to normalize the initial values of weights by dividing each neuron's weights by the square root of its input connection count (fanin).
x = torch.randn(1000, 10)
w = torch.randn(10, 200) / 10**0.5 # Number of input elements is 10
A frequently cited paper on this topic is Delving Deep into Rectifiers: Surpassing HumanLevel Performance on Image Net Classification. This paper studies convolutional neural networks, particularly focusing on the nonlinearity of ReLU and PReLU. As mentioned earlier, ReLU sets negative activations to zero, effectively discarding half of the distribution, so you need to compensate for this in the forward pass of the neural network. The research found that weight initialization must use a zeromean Gaussian with a standard deviation of $\sqrt{2/n_l}$, while we are using $\sqrt{1/n_l}$ here, precisely because ReLU discards half of the distribution.
This is integrated into PyTorch, as seen in the link torch.nn.init:
torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu', generator=None)
The parameters of this function are as follows:
tensor
: The tensor to be initialized.a
: The mean of the normal distribution, default is 0.mode
: The mode for calculating the standard deviation, options are 'fan_in' and 'fan_out', default is 'fan_in'. 'fan_in' is for inputs, 'fan_out' is for outputs.nonlinearity
: The type of nonlinear activation function, options are fromtorch.nn.functional
, default istorch.nn.functional.relu
.
According to the conclusions of the article, different nonlinearity activation functions correspond to different gains:
The gain for ReLU corresponds to the $\sqrt{2/n_l}$ mentioned above.
The tanh we are using also requires such a gain, as just like ReLU squashes negatives to 0, tanh squashes the tails, so we need to prevent this squashing operation from affecting the gain, allowing the distribution to return to a standard normal distribution.
Years ago, neural networks were very fragile regarding initialization issues. However, modern innovations, such as residual networks, various normalization layers, and better optimizers than stochastic gradient descent (like Adam), have made "ensuring perfect initialization" relatively less critical.
In practice, we modify the previous gain according to the formula: $std=\frac{5/3}{fan\_in}$
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5) # 0.2
Retraining yields similar results.
Batch Norm#
Batch Normalization
The influential Batch Normalization was proposed by a team from Google in 2015, making it possible to train deep neural networks. The core idea presented in the paper is that you can directly correct the hidden layer states (corresponding to our hpreact
) to a standard normal distribution.
hpreact = embcat @ W1 + b1 # Hidden layer preprocessing
hpreact = (hpreact  hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) # batch normalization
This ensures that even in deep networks, the firing rate (i.e., activation values) of each neuron can maintain an approximately Gaussian distribution during the initialization phase, which is favorable for gradient descent optimization. However, later, we need to let backpropagation tell us how this distribution should change, whether to become sharper or more dispersed, which will make some neurons easier to become trigger happy (activate) or harder to activate.
Therefore, we also have a scale and shift step, to gain and add bias to the normalized distribution to obtain the output of this layer.
# Add parameters
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
parameters = [C, W1, b1, W2, b2, bngain, bnbias]
# scale and shift
hpreact = bngain * (hpreact  hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias # batch normalization
During initialization, gain and bias are 1 and 0, respectively, at which point the distribution is the standard normal distribution we want. Since they are both differentiable, they can be optimized during backpropagation.
The performance after adding the batch norm layer is:
train 2.0668270587921143
val 2.104844808578491
Compared to before, the change is not significant because we are dealing with a simple example with only one hidden layer, and we can directly calculate the scale of the weight matrix (W1) that can make hpreact approximately Gaussian. Batch normalization does not do much here. However, it can be imagined that once we have deeper neural systems with many different types of operations, adjusting the scale of the weight matrix will become quite difficult. At that point, it is much easier to uniformly add batch norm layers across different levels of the neural network. The general practice is to add a batch normalization layer after linear layers and convolutional layers to control the scale of these activations at every point in the neural network.
Batch Normalization works very well in practice and has some regularization side effects, effectively controlling activations and distributions. This is because the selection of batches is random, which can be seen as introducing additional "entropy," reducing the risk of model overfitting.
Andrej Karpathy mentioned that batch normalization is mathematically "coupled," meaning this technique makes the statistical distributions between network layers interdependent. To improve training efficiency, we divide into multiple batches, but within a batch, the normalization of each data point depends on the mean and variance of the entire batch, leading to the following issues:

Batch Dependence: Batch normalization relies on the statistical properties of small batch data, meaning the model's performance may be affected by batch size. For small batches, the estimates of mean and variance may have high variance, leading to unstable training.

Domain Shift: The data distributions may differ significantly across different domains (e.g., training and inference), and batch normalization needs to maintain consistent behavior across these different domains. To address this, it is common to use moving averages to estimate the mean and variance of the entire training set for inference.

Coupled Gradients: Since the gradients calculated during backpropagation for the batch normalization layer need to consider all data points in the batch, the gradient updates for individual data points are no longer independent. This coupling may limit the direction and magnitude of gradients during the model optimization process.
For these reasons, researchers have been looking for alternative normalization techniques, such as Layer Normalization, Group Normalization, and Instance Normalization, among others.
Inference Issues with Batch Norm#
During training, the mean and std of each batch are calculated in realtime for normalizing the current batch's data. However, when deploying the model for inference, we typically process one sample at a time or the batch size may differ from that during training, meaning we cannot use the statistics of a single sample for batch normalization, as this would lead to excessive variance and unstable predictions.
To solve this problem, the method proposed in the paper is to have batch norm calculate the moving averages of the mean and variance of the entire dataset during training. These moving averages are then used to replace the batch statistics during inference. When the model is deployed in a production environment, these moving averages are used instead of the realtime calculated batch mean and variance. This way, regardless of the batch size during inference, the model uses stable statistics computed on the training data.
# Calibrate batch normalization after training ends
with torch.no_grad():
# Pass the training set
emb = C[Xtr]
embcat = emb.view(emb.shape[0], 1)
hpreact = embcat @ W1 + b1
# Calculate mean and std
bnmean = hpreact.mean(0, keepdim=True)
bnstd = hpreact.std(0, keepdim=True)
When evaluating performance on the training and validation sets, we replace the dynamically updated std and mean with the overall average std and mean we obtained:
# hpreact = bngain * (hpreact  hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias
hpreact = bngain * (hpreact  bnmean) / bnstd + bnbias
Now we can also pass a single sample for inference.
However, in practice, no one would want to place the estimation of mean and std in the second stage (i.e., the inference phase after neural network training). The paper proposes an idea corresponding to this problem: we can estimate these two values in a moving manner during the training process of the neural network.
First, we define the runtime forms of these two values and initialize them:
# BatchNorm parameters
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))
As mentioned earlier, we initialized W1 and b1 to ensure that the distribution of each preact is close to standard Gaussian, so the mean is approximately 0, and the standard deviation is approximately 1.
# BatchNorm layer
# 
bnmeani = hpreact.mean(0, keepdim=True)
bnstdi = hpreact.std(0, keepdim=True)
hpreact = bngain * (hpreact  bnmeani) / bnstdi + bnbias
Next, we need to update them during training; in PyTorch, these means and standard deviations are not based on gradient descent optimization, and we will never derive gradients concerning them.
with torch.no_grad():
bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
# 
Here, 0.999 and 0.001 are examples of decay factors in the moving average. The decay factor defines the relative importance of past values and new values in the moving average. More specifically:
 0.999 (momentum) is the decay factor for the moving average, determining how much of the previously accumulated running mean and variance is retained.
 0.001 (equal to 1  momentum) represents the weight of the new value, which is the weight of the mean and variance of each batch during updates.
The paper also includes this step:
For example, $\epsilon$ is typically set to $1e^{5}$, which essentially prevents division by zero, corresponding to the BATCHNORM1D
eps parameter.
One unnecessary part is:
# Linear layer
hpreact = embcat @ W1 # + b1 Remove the bias here # Hidden layer preprocessing
# BatchNorm layer
bnmeani = hpreact.mean(0, keepdim=True)
bnstdi = hpreact.std(0, keepdim=True)
hpreact = bngain * (hpreact  bnmeani) / bnstdi + bnbias
The bias in the Linear layer is now effectively useless because we subtract the mean from each neuron afterward, so this bias does not affect subsequent calculations.
Therefore, when using batch normalization layers, if you previously had weight layers, such as linear or convolutional layers, there is no need for a bias. This will not have any negative impact; it just means that the training will not receive its gradient, which is somewhat wasteful.
Let's reorganize the overall structure of network training:
# Batch normalization parameters and buffers
bngain = torch.ones((1, n_hidden))
bnbias = torch.zeros((1, n_hidden))
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))
max_steps = 200000
batch_size = 32
lossi = []
for i in range(max_steps):
# Build minibatch
ix = torch.randint(0,Xtr.shape[0],(batch_size,), generator=g)
Xb, Yb = Xtr[ix], Ytr[ix] # batch X, Y
# forward pass
emb = C[Xb] # embedding
embcat = emb.view(emb.shape[0], 1) # Concatenate all embedding vectors
# Linear layer
hpreact = embcat @ W1 # + b1 # Hidden layer preprocessing
# BatchNorm layer
# 
bnmeani = hpreact.mean(0, keepdim=True)
bnstdi = hpreact.std(0, keepdim=True)
hpreact = bngain * (hpreact  bnmeani) / bnstdi + bnbias
with torch.no_grad():
bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi
# 
# Nonlinearity
h = torch.tanh(hpreact) # Hidden layer
logits = h @ W2 + b2 # Output layer
loss = F.cross_entropy(logits, Yb) # Loss function
# backward pass
for p in parameters:
p.grad = None
loss.backward()
# update
lr = 0.1 if i < 100000 else 0.01
for p in parameters:
p.data += lr * p.grad
# tracks stats
if i % 10000 == 0:
print(f'{i:7d}/{max_steps:7d}: {loss.item():.4f}')
lossi.append(loss.log10().item())
Taking the structure in resnet as an example:
Here, the convolutional layers are essentially the same as the linear layers we used earlier, except that convolutional layers are used for images, so they have spatial structure. The linear layers process flattened image data, and fully connected layers have no spatial concept.
The basic structure is the same as what we have seen before, consisting of a weight layer, a normalization layer, and a nonlinearity.
Summary (Streamlined)#
Understanding activations, gradients, and their statistical properties in neural networks is crucial, especially as neural networks become larger and deeper.
To recap the previous content:

Initialization Fix: If we have overly confident erroneous predictions, the activations of the last layer can become quite chaotic, potentially resulting in a hockey stick loss function. Addressing this issue improves the loss, as you save wasted training.

Weight Initialization: Later, we learned that we need to control the distribution of activations; we do not want them squashed to 0 or infinity. You want everything in the neural network to be uniform, close to a Gaussian distribution. The equivalent problem is "how to scale the weight matrix and biases during initialization," which can temporarily be defined accurately by looking up tables.

Batch Norm: However, as networks grow larger and deeper, we need to introduce normalization layers, the first of which was batch normalization. The basic idea is that if we want an approximately Gaussian distribution, we take the mean and standard deviation and center the data.

Inference Issues with Batch Norm: By placing the estimates of mean and standard deviation in the training part, we obtain the batch norm layer, which can effectively control the statistical properties of activations in neural networks. However, no one likes this layer because it introduces many bugs, so alternatives like group normalization or layer normalization are often preferred.
When using advanced optimizers like Adam or residual connections, training neural networks essentially involves ensuring precision at every step, considering activations and gradients from initialization. Training very deep networks becomes impossible without this.
Supplement 1: Why Include Tanh Layers#
Why do we include them and consider their gains?
The answer is simple: if you only have a stack of linear layers, it is easy to achieve good activations, but in terms of representation, the overall structure is just a linear layer, and no matter how many layers you stack, it can only yield a linear transformation.
The tanh nonlinearity allows us to transform this "stack of linear sandwiches" from a linear function into a neural network capable of approximating any function.
Supplement 2: Setting the Learning Rate#
Andrej Karpathy mentioned an empirical method to check if the learning rate is set properly, which involves calculating the ratio of the gradient update to the parameter value itself, referred to as the "ratio."
Here, the "ratio" refers to the proportion of the parameter change to its current value during a single parameter update. Typically, this ratio is obtained by calculating the absolute value of the parameter update step (i.e., gradient multiplied by the learning rate) relative to the absolute value of the parameter. For instance, if there is a parameter $w$ and its gradient $g$, with a learning rate of $\eta$, the "ratio" for a single update can be expressed as:
$\text{ratio} = \frac{\eta \cdot g}{w}$
If this ratio falls within a reasonable range, such as around 3, it indicates that the learning rate is set appropriately, neither too large to cause excessive parameter updates nor too small to make the training process slow and ineffective. This range ensures that each parameter update is gentle enough to promote learning without causing instability due to overly large update steps.
In practice, this "ratio" can be used to monitor and adjust the learning rate; if it is too large, the learning rate may need to be reduced; if it is too small, it may indicate that the learning rate can be increased. This is a heuristic method that can assist researchers and practitioners in debugging their neural network models.