Implementing a minimalist automatic differentiation framework from scratch

Code repository: https://github.com/karpathy/nn-zero-to-hero

Andrej Karpathy is the author and lecturer of the famous deep learning course Stanford CS 231n, and is also one of the founders of OpenAI. "micrograd" is a small, educational project he created. This project implements a very simplified automatic differentiation and gradient descent library, which learners can use with just the most basic knowledge of calculus, for teaching and demonstration purposes.

Introduction to Micrograd#

Micrograd is essentially an autograd engine, whose main purpose is to implement backpropagation.

Backpropagation is an algorithm that efficiently computes the gradient of a certain loss function with respect to the weights in a neural network. This allows us to iteratively adjust the weights of the neural network to minimize the loss function, thereby improving the accuracy of the network. Thus, backpropagation becomes the mathematical core of any modern deep neural network library (such as PyTorch, JAX).

Types of operations supported by micrograd:

from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b # C's child nodes are A and B, because C maintains pointers to the value objects of A and B
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f # Output of the forward pass
print(f'{g.data:.4f}') # prints 24.7041, the outcome of this forward pass
g.backward() # Initializes backpropagation at node G, recursively referencing the chain rule from calculus to evaluate the derivative of G with respect to all internal nodes
print(f'{a.grad:.4f}') # prints 138.8338, i.e. the numerical value of dg/da
print(f'{b.grad:.4f}') # prints 645.5773, i.e. the numerical value of dg/db

This tells us how G would respond if we made small positive adjustments to A and B.

It can be seen that the key to implementing these operations lies in how the Value class is implemented, which will be detailed below.

Principles of Neural Networks#

Basically, a neural network is just a mathematical expression that takes input data as input and the weights of the neural network as input, outputting the predictions or loss function of the neural network.

The engine here works at the level of a single scalar, breaking down into these atomic operations and all the addition and multiplication, which is somewhat excessive. You would never execute these operations in a production environment; this is done purely for teaching purposes, as it does not need to handle the n-dimensional tensors used in modern deep neural network libraries. This is just to help you understand and reconstruct backpropagation and the chain rule, as well as the understanding of neural network training. If you want to train larger neural networks, you must use these tensors, but the mathematical approach does not change.

Understanding Derivatives#

import numpy as np
import matplotlib.pyplot as plt
import math
%matplotlib inline

# Define a scalar function f(x), which accepts a scalar x and returns a scalar y
def f(x):
    return 3*x**2 - 4*x + 5
    
f(3.0)

20.0

# Create a set of scalar values to feed into f(x)
xs = np.arange(-5, 5, 0.25)
'''
xs = array([-5.  , -4.75, -4.5 , -4.25, -4.  , -3.75, -3.5 , -3.25, -3.  ,
       -2.75, -2.5 , -2.25, -2.  , -1.75, -1.5 , -1.25, -1.  , -0.75,
       -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ,
        1.75,  2.  ,  2.25,  2.5 ,  2.75,  3.  ,  3.25,  3.5 ,  3.75,
        4.  ,  4.25,  4.5 ,  4.75])
'''
# Apply f(x) to each value in xs
ys = f(xs)
'''
ys = array([100.    ,  91.6875,  83.75  ,  76.1875,  69.    ,  62.1875,
        55.75  ,  49.6875,  44.    ,  38.6875,  33.75  ,  29.1875,
        25.    ,  21.1875,  17.75  ,  14.6875,  12.    ,   9.6875,
         7.75  ,   6.1875,   5.    ,   4.1875,   3.75  ,   3.6875,
         4.    ,   4.6875,   5.75  ,   7.1875,   9.    ,  11.1875,
        13.75  ,  16.6875,  20.    ,  23.6875,  27.75  ,  32.1875,
        37.    ,  42.1875,  47.75  ,  53.6875])
'''
# Plot the results
plt.plot(xs, ys)

Pasted image 20231118212947

What is the derivative of the function at each x on the graph?

Of course, f'(x) = 6x - 4, but we won't do that because no one really writes out expressions in neural networks.

This would be a huge expression, so we won't take this symbolic approach.

Definition of Derivative#

Instead, let's look at the definition of the derivative to ensure we truly understand it.

# We can basically evaluate the derivative here numerically by taking a very small h
h = 0.00000001
x = 3.0
(f(x + h) - f(x)) / h # the function responds positively

14.00000009255109

This is the function's response in the positive direction.

After normalization by the run, we have the rise over run to get the numerical approximation of the slope.

And at some position, if we push in the positive direction and the function does not respond, remaining almost unchanged, that is why slope=0.

What Derivatives Can Tell Us#

Let's look at a more complex example:

a = 2.0
b = -3.0
c = 10
d = a * b + c
print(d)

The output variable d of this function is a function of three scalar inputs, and we need to revisit the derivatives of d with respect to a, b, and c.

h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10
d1 = a * b + c
a += h
d2 = a * b + c

print('da',d1)
print('d2',d2)
print('slope', (d2 - d1) / h)

Here we need to feel what the change in d2 is after pushing a

We find that a is positive, b is negative, and as a increases, it will reduce the contribution added to d, so we can expect the function value d2 to decrease.

da 4.0
d2 3.999699999999999
slope -3.000000000010772

It can also be mathematically proven that $\frac{d(a*b+c)}{da}=b=-3$

b += h

da 4.0
d2 4.0002
slope 2.0000000000042206

Screenshot 2023-11-18 at 22.00.39

The amount of increase in the function is the same as the amount we added to c, so the slope is 1.

Now that we have some intuitive understanding of how derivatives tell you about this function, we can turn to neural networks.

Framework Implementation#

Data Structure of Custom Class#

As mentioned earlier, neural networks are very large mathematical expressions, so we need some data structures to maintain these expressions.

class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value({self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data)
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c # (a.__add__(b)).__mul__(c)

The purpose of repr is to provide us with a way to print output.

What we are missing now is the organization of the relationships of this expression; we need to know and retain pointers about which values produced which other values.

def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __repr__(self):
        return f"Value({self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

This way, we can know the children of each value and trace back to which operation generated this value.

Now that we have the complete mathematical expression, we need to find a better visualization method to represent expressions that can become very large. (This is not the focus and can be skipped.)

from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in the graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)    
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for many values in the graph, create a rectangular ('record') node for it 
        dot.node(name = uid, label = "{ %s | data %.4f}" % (n.label, n.data), shape = 'record')
        if n._op:
            # if this value is a result of some operation, create an op node for it 
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node for it 
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        # connect n1 to the op node of n2 
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)   

    return dot

draw_dot(d)

We need to add a label member to the Value class and modify the definition as follows:

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
d

Screenshot 2023-11-19 at 02.06.25

Now let's make the expression a bit deeper:

f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L

draw_dot(L)

output 1

Visualizing the forward pass.

Running Backpropagation#

We will start from L and compute the gradients along all intermediate values. In a neural network context, you would be very interested in the derivative of the neural network's weights with respect to the loss function L.

In the Value class, we create a variable to maintain the derivative of L with respect to that value:

self.grad = 0

At initialization, we assume that each value does not affect the output. Because a derivative of 0 means that this variable does not change the loss function.

Now we fill in the gradient starting from L; what is the derivative of L with respect to L? In other words, how much would L change if I made a small amount h?

It is not hard to imagine that L would change proportionally with h, so the derivative is 1:

def lol():

    h = 0.0001
    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a * b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L1 = L.data

    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a * b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L2 = L.data + h

    print((L2 - L1) / h)

lol() # 0.9999999999976694

L.grad = 1.0
# L = d * f
# dL/dd = ? -> f
f.grad = 4.0 # d.data
d.grad = -2.0 # f.data

What we did with the lol() function is somewhat like an internal gradient check, which is what we do when we perform backpropagation and obtain the derivatives with respect to all intermediate results. The numerical gradient is simply estimated using a small step size.

Now we start to get into the essence of backpropagation; we know that L is sensitive to D, but how does L affect C?

'''
dd / dc = ? -> 1.0 to get the local derivative
dd / de = ? -> 1.0
d = c + e

(f(x+h)) - f(x) / h 
((c + h + e) - (c + e)) / h
h / h = 1.0
'''

Calculating local derivative

The + node between c, e, and d is completely unaware of the rest of the graph; it only knows that it performed an add operation and created d, and the local influence of c and e on d, but what we actually want is dL/dc.

How do we integrate this information to get dL/dc? The answer is the Chain Rule 链式法则

If variable z depends on variable y, and variable y depends on variable x (i.e., y and z are dependent variables), then z also depends on x through the intermediate variable y. In this case, the chain rule can be expressed as:
${\frac {dz}{dx}}={\frac {dz}{dy}}\cdot {\frac {dy}{dx}}$

The chain rule basically tells you how to correctly link these derivatives together.

As George F. Simmons said: “If a car is traveling at twice the speed of a bicycle, and the bicycle is traveling at four times the speed of a pedestrian, then the car is traveling eight times faster than the pedestrian.”

This example clearly explains the core of this rule, which is Multiply.

Addition Node#

distributor of gradient

In a neural network, each node's operation affects how gradients propagate. For an addition node, suppose there are two inputs $a$ and $b$ , which add to get $c = a + b$ . During backpropagation, the gradients of $c$ with respect to $a$ and $b$ are both 1.

Specifically:

If $c$ 's gradient is $\frac{\partial L}{\partial c}$ , where $L$ is the loss function, then according to the chain rule, we can obtain the gradients of $a$ and $b$ :
$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial c} \times \frac{\partial c}{\partial a}$
$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial c} \times \frac{\partial c}{\partial b}$
Since $c = a + b$ , we have $\frac{\partial c}{\partial a} = 1$ and $\frac{\partial c}{\partial b} = 1$ .
Therefore, both $\frac{\partial L}{\partial a}$ and $\frac{\partial L}{\partial b}$ equal $\frac{\partial L}{\partial c}$ .

Thus, in an addition node, the gradient is passed "as is" to all input nodes. This means that the addition node does not change the size of the gradient but distributes it evenly to all inputs. This is because the addition operation is linear for each input, and each input's contribution to the output is independent.

Multiplication Node#

To find a.grad, we can use: $dL/da=(dL/de)\times(de/da)=-2.0\times b.data=6.0$

So the core idea of backpropagation can be summarized as: recursively multiply on the local derivatives.

Fine-tuning Inputs#

Fine-tuning inputs, trying to increase the value of L.

When operating on point a, it means we only need to move in the direction of the gradient.
We need to adjust the leaf nodes because, in general, we have control over them.
We expect L to increase:

a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad

e = a * b
d = e + c
L = d * f

print(L.data)

-7.286496

Using a practical example:

# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

# weights w1,w2, the synaptic strength of each input
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

# bias b
b = Value(6.8813735870195432, label='b')

# x1*w1 + x2*w2 + b
x1w1 = x1 * w1; x1w1.label = 'x1w1'
x2w2 = x2 * w2; x2w2.label = 'x2w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1w1x2w2' 
n = x1w1x2w2 + b; n.label = 'n' # The raw activation value of the cell body

# Now let's process it with an activation function, using tanh here, implemented in the previous class.
def tanh(self):
        n = self.data
        t = (math.exp(2*n) - 1)/(math.exp(2*n) + 1)
        out = Value(t, (self,), 'tanh')
        return out

o = n.tanh(); o.label = 'o'
draw_dot(o)

neural

o.grad=1.0
Calculate n.grad
$o=tanh(n),do/dn=1-tanh(n)^2=1-o^2=0.5$

o.grad = 1.0
n.grad = 0.5
x1w1x2w2.grad = 0.5
b.grad = 0.5
x1w1.grad = 0.5
x2w2.grad = 0.5
x1.grad = x1w1.grad * w1.data
w1.grad = x1w1.grad * x1.data
x2.grad = x2w2.grad * w2.data 
w2.grad = x2w2.grad * x2.data

Screenshot 2023-11-22 at 15.23.20

The derivative always tells us its impact on the final output, so if I change w2, the output does not change because x2 is 0. If we want to increase the output of this neuron, w2 is not important for this neuron now, but the weight of x1 should increase.

Automation#

Manually performing backpropagation is absurd; now let's see how to implement backpropagation more automatically.

We will do this by storing a special self._backward, which is a function that executes the small chain rule mentioned above, on each small node that receives input and produces output, we will store "how to link the output gradient to the input gradients."

Call the ._backward method on all nodes in topological order:

o.grad = 1.0

topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)
build_topo(o)

for node in reversed(topo):
    node._backward()

topo:
[Value(6.881373587019543), Value(0.0), Value(1.0), Value(0.0), Value(-3.0), Value(2.0), Value(-6.0), Value(-6.0), Value(0.8813735870195432), Value(0.7071067811865476)]

After placing this in the Value class, we can simply call o.backward() to complete the BP of a neuron.

At this point, the constructor of the Value class looks like this:

 class Value:
 '''
 data is the actual value;
 _children tuple is used to maintain child nodes for tracing;
 _op specifies the operation type
 label is for visualization
 '''
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label

The Problem#

The current algorithm has a serious problem, but it may not be apparent in the current example because each node in the example is used exactly once; if called multiple times, the currently stored gradients will be overwritten.

Screenshot 2023-11-27 at 23.17.29

The derivative is clearly 2 instead of 1.

The solution lies in the multivariable case of the chain rule and its generalization; the basic idea is to "accumulate these gradients":

self.grad += 1.0 * out.grad
other.grad += 1.0 * out.grad

Screenshot 2023-11-27 at 23.21.58

The problem is fixed.

Operation Level#

The execution operation level of the entire framework completely depends on you:

As long as you can implement the forward and backward passes of each operation, it doesn't matter how complex that operation is. In other words, if you can implement the local gradients, then the design of these functions is entirely up to you:

def tanh(self):
        n = self.data
        t = (math.exp(2*n) - 1)/(math.exp(2*n) + 1)
        out = Value(t, (self,), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out

Screenshot 2024-01-16 at 22.28.40

Using tanh.

# Need to implement BP for all atomic operations.
def __repr__(self):
        return f"Value({self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    def __sub__(self, other):
        return self + (-other)

    def __radd__(self, other): # r means reverse, used to implement reverse operations for defined operations, allowing custom class objects to better interact with Python's operators and built-in types.
        return self.__add__(other)
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out
    
    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward

        return out

    def __rmul__(self, other):
        return self.__mul__(other)
    
    def __truediv__(self, other):
        return self * other**-1

Screenshot 2024-01-16 at 22.30.21

Using $e^{2x}-1/{e^{2x}+1}$

Corresponding Implementation in Torch#

Screenshot 2024-01-17 at 14.49.08

Implementation of Neuron Class#

^c9c15a

import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1,1))

    def __call__(self, x):
        # w * x + b
        act = sum((wi * xi for wi, xi in zip(self.w,x)), self.b)
        out = act.tanh()
        return out
    
x = [2.0, 3.0]
n = Neuron(2)
n(x)

Implementing MLP#

^da5567

Pasted image 20240117145319

You can see that each layer has multiple neurons, which are not interconnected but are all connected to the input.
So a Layer is a group of independently evaluated neurons.

First, implement a layer:

Screenshot 2024-01-17 at 15.04.56

Implement MLP according to the diagram:

Screenshot 2024-01-17 at 15.21.39

3 inputs, two layers of 4 neurons, 1 output.

Define a simple dataset where we hope the neural network can perform binary classification on the four samples in xs to obtain the results in ys:

Screenshot 2024-01-17 at 15.24.27

Clearly, the results cannot reach the target.

So how do we build a neural network that can adjust weights to predict the desired target more accurately?
The trick to achieving this in deep learning is to compute a single number that can somehow measure the overall performance of the neural network, which is the loss (the loss).

In classical machine learning algorithms, this number corresponds to the mean squared error (MSE) used in generalized linear models, which serves to measure the model's performance.

Screenshot 2024-01-17 at 15.34.28

First, observe the loss for each group; it can be seen that as a loss, MSE will only be 0 when y output = y true value. The further the prediction deviates, the larger the loss will be.

Thus, we obtain the final loss: which is the sum of all losses:

Screenshot 2024-01-17 at 15.37.07

Store weights and biases:

Screenshot 2024-01-17 at 15.48.37

Both writing methods have the same effect.

Similarly, implement for the MLP class:

def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

Screenshot 2024-01-17 at 15.52.07

Here we can change the weights based on the gradient information. But here we can see that if we increase w[0].data, because its gradient is negative at this time, it will actually decrease it, and the loss will rise instead.

Screenshot 2024-01-17 at 15.55.46

So the current gradient descent lacks a -, meaning that "moving in the negative direction of the gradient" will minimize the loss.

for p in n.parameters():
	p.data += - 0.01 * p.grad

It can be seen that the loss indeed decreases, and our prediction effect has improved somewhat:

Screenshot 2024-01-17 at 16.02.59

Now what is needed is to iterate this process; the calculation of the loss function is the so-called "Forward Pass," then backpropagate again, update the data, and this will yield a lower loss.

The above is the "gradient descent" process in neural networks: forward pass, backpropagation, update data, and the neural network gradually improves its prediction performance in this process.

If you try to change the step size, you may exceed the appropriate boundary. Because we basically do not understand the current loss function, our only knowledge is the local dependency relationship between these parameters and the loss function. If the steps are too large, we may enter a completely different part of the loss function, leading to gradient explosion and other issues.

In the current case, after several iterations, the loss has approached 0, and we can see that the results in ypred are already quite good (the learning rate or step size is set to 0.1).

Screenshot 2024-01-17 at 16.13.30

Upon executing again, we find that gradient explosion occurs (loss rises above 5), and the prediction effect is extremely poor:

Screenshot 2024-01-17 at 16.15.43
But in the next iteration, the effect is exceptionally good:

Screenshot 2024-01-17 at 16.17.17

It can be seen that setting and adjusting the learning rate is a subtle art. If it is too low, it takes a long time to converge; if it is too high, it leads to instability. Therefore, in ordinary gradient descent, setting the step size is indeed a skill.

Common Issues in Neural Networks#

In the previous process, there was actually a huge problem; it was only because our analysis problem was too simple that we did not discover where the loophole was.

Because the gradients defined earlier are cumulative, we need to perform a zero grad operation before each backpropagation; otherwise, it would be as if the step size is getting larger and larger, making it difficult to achieve the desired result.

Adding gradient clearing during the iteration process, we can see that the convergence speed is slower now, and it cannot reach a low loss in just a few steps as before; this is the normal situation:

Screenshot 2024-01-17 at 16.30.44

This bug may not affect much sometimes, but when facing complex problems, it can lead to poor optimization of the loss.

The common errors listed, the one we just encountered is the third:

Pasted image 20240117163412

Summary: What is a Neural Network?#

A neural network is a mathematical expression, and in the case of multiple layers, it is a relatively simple formula that takes data, the weights and parameters of the neural network as input; the forward pass is the mathematical expression, followed closely by the loss function, which measures the accuracy of the predictions, and then backpropagates the loss, operating according to the gradient information to minimize the loss.

We only need a bunch of neurons to make it do anything; the neural network in the current example has only 41 parameters, but you can build more complex neural networks with billions of neurons. Take the GPT model as an example; we have a large amount of text from the internet, trying to get the neural network to predict the next word in the sequence based on some words. Such a large neural network will have many interesting properties and hundreds of billions of parameters, but it fundamentally operates on the same principles, except that the update method is slightly different, and the loss function uses cross-entropy loss to predict the next token.

The current micrograd library differs from the example; for instance, the current use of ReLU as the activation function is because it is smoother and a bit more complex, emphasizing local gradients and using these derivatives more.

To match the nn.module class API in PyTorch, all neuron classes in the project inherit from a parent class Module, which has the function of zeroing gradients:

class Module:
	def zero_grad(self):
		for p in self.parameters():
		p.grad = 0.0

	def parameters(self):
		return [] 

	class Neuron(Module):
		...

Learning the implementation of micrograd helps to understand the basic design philosophy of PyTorch, making it easy to understand how to register a function type into PyTorch for automated gradient computation. The official documentation can be found here.