This article is a record of studying Andrej Karpathy's Deep RL Bootcamp and his blog, blog link: Deep Reinforcement Learning: Pong from Pixels

The progress of RL is not primarily driven by novel and amazing ideas:

AlexNet in 2012 was mainly an extension of ConvNets from the 1990s (deeper and wider).

The deep Q-learning paper on ATARI games in 2013 was actually an implementation of a standard algorithm (i.e., Q-learning with function approximation, Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo used policy gradients combined with Monte Carlo Tree Search (MCTS), which are also standard components. Of course, making these algorithms work effectively requires a lot of skill and patience, and many clever improvements have been developed based on old algorithms, but roughly speaking, the main driving force behind recent progress is not the algorithms themselves, but rather (similar to computer vision) computational power, data, and infrastructure. —Andrej Karpathy

## ATARI Pong#

Pong is a special case of a Markov Decision Process (MDP): a graph where each node represents a specific game state, and each edge represents a possible (usually probabilistic) transition. Each edge also assigns a reward, and the goal is to compute the optimal action to take in any state to maximize the reward.

A Markov Decision Process (MDP) is a mathematical framework for making decisions in uncertain environments. MDP helps decision-makers make optimal decisions in the face of randomness by considering the current state, possible actions, state transition probabilities, and reward functions.

### Game Mechanics#

We receive image frames (byte arrays representing pixel values from 0-255) to decide whether to move the paddle up or down (a binary choice). After making a choice, the game simulator executes this action and gives us a reward: +1 for the ball crossing the opponent (winning), -1 for missing the ball, and 0 otherwise. Our goal is to move the paddle to gain as many rewards as possible.

In discussing the solution, we will minimize assumptions about the game, treating it merely as a toy case for learning.

## Policy Gradient#

Policy Gradient algorithms are favored for their end-to-end nature: they have a clear policy and principled methods to directly optimize expected rewards.

### Policy Network#

Define a policy network for an agent that receives the game state and decides what action to take (up or down). Here, a two-layer neural network is used, receiving raw image pixels (210×160×3=100,800) and generating a number representing the probability of moving up. A **stochastic policy** is typically used instead of a deterministic decision, meaning that during each iteration, we sample from a probability distribution (i.e., flipping a biased coin, p and (1−p)) to get the actual move (allowing the agent to sometimes choose suboptimal actions, potentially discovering better strategies).

```
h = np.dot(W1, x) # Calculate the activation of hidden layer neurons (weight matrix dot product with input)
h[h < 0] = 0 # ReLU: set values less than 0 to 0, i.e., max(0, h)
logp = np.dot(W2, h) # Calculate the log probability of moving up, W2 is the weight matrix from hidden layer to output layer
# Convert log probability to probability value for moving up (0, 1) using Sigmoid
p = 1.0 / (1.0 + np.exp(-logp))
```

Intuitively, the neurons in the hidden layer (weights arranged along the rows of `W1`

) can detect various game scenarios, while the weights in `W2`

can determine whether we should move up or down in each case. Initially, the effect of randomly initialized parameters is that the player twitches in place, and the goal is to find `W1`

and `W2`

that master the game.

### Preprocessing#

Ideally, we want to input at least 2 frames of images into the neural network (to detect motion). To simplify operations, we can preprocess the input images by inputting the **difference between the current frame and the previous frame**.

### Credit Assignment Problem#

At this point, one can imagine that it may take hundreds of time steps to receive a non-zero reward, but we do not know which step was effective—was it the previous step, the 10th frame, the 90th frame, or those few parameters?

This is the **Credit Assignment Problem (CAP)** in RL. In our scenario, the reason for receiving a +1 reward is that we happened to bounce the ball off a good trajectory and over the opponent's paddle, but this action was taken many frames before receiving the reward. In other words, all actions taken after a certain step did not help in obtaining the reward.

### Supervised Learning#

Before delving into the policy gradient solution, let's review supervised learning, which is quite similar to reinforcement learning. The goal of supervised learning is to adjust parameters through backpropagation so that the model increasingly accurately predicts the correct category.

The model receives image input and outputs the probability of categories. For example, the network gives log probabilities for categories UP and DOWN as $(-1.2, -0.36)$, corresponding to actual probabilities of `30%`

and `70%`

. We know the **correct label** (e.g., UP), so the optimization goal is to **maximize** the log probability of this label. We will compute the gradient of the log probability for UP and use **backpropagation** to adjust the network's parameters.

By calculating $\nabla_W \log p(y=UP \mid x)$ we obtain the gradient, which tells us how to adjust each parameter to make the model more inclined to predict UP.

- If the gradient of a certain parameter is -2.1, it indicates that we can decrease that parameter (for example, by reducing it by 0.001), causing the log probability of UP to decrease by $2.1 \times 0.001$.
- After completing the parameter update, the network will be slightly more inclined to predict UP in similar situations.

### Policy Gradient#

However, in the reinforcement learning scenario, we do not have correct labels, which is where PG comes into play:

**Initial Stage**: The model calculates the probability of UP as 30% $(\log p = -1.2)$ and DOWN as 70% ($\log p = -0.36$), and we**randomly sample**an action from this distribution, say choosing DOWN, and execute it.**Gradient Calculation**: We can immediately compute the gradient of the log probability for the chosen action (DOWN) just like in supervised learning. But at this point, we still**do not know**whether this action is a good action.**Waiting for Feedback**: We can wait until the game ends to receive a reward. For example, in Pong, if we lose the game and receive a reward of -1, we will use this -1 as the gradient for the DOWN action to perform backpropagation.**Optimization Effect**: In this example, DOWN led to failure, and the gradient will cause the network to**reduce future selections of the DOWN action**.

Thus, the policy gradient method adjusts the policy based on **delayed rewards**, making the model more likely to take successful actions in the future.

## Training Protocol#

The complete training process is as follows:

**Initialize the policy network**with random`W1`

,`W2`

.- Play 100 games of Pong (
**policy rollouts**).

Assuming each game has 200 frames, a total of 20,000 decisions to move up or down are made. We can compute the parameter gradients for each decision, which tells us how to adjust the model to encourage specific decisions in certain states.

- Learning updates:

- Suppose we won 12 of those games, then the 2400 decisions made in those 12 games will be
**positively updated**(gradient filled with +1.0), encouraging those decisions. - Since we lost the other 88 games, the 17600 decisions made in those 88 games will be
**negatively updated**, discouraging those decisions.

The network will **slightly increase the probability of effective actions** and **slightly decrease the probability of ineffective actions**.

- We use this slightly improved policy to play another 100 games, repeating steps 2 and 3 to continuously optimize the policy.

[!Karpathy's summary]

Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.

Policy Gradient: Run the policy for a while. Observe which actions brought high returns. Increase their probabilities.

As shown, each black circle represents a game state, and each arrow indicates state transitions, accompanied by sampled actions. When using PG, we select the two winning games and slightly encourage every action in that round, while suppressing the others.

Now the question arises: suppose we made the correct decision to "bounce the ball back" at frame 50, but the final result was missing the ball and losing at frame 150. According to the training rules above, every action in that round will be suppressed. Will this undermine the correct decision made at frame 50?

The answer is yes, but when we consider the thousands, even millions of game processes in the entire game sample space, the correct decision will make us more likely to win in the future. Therefore, on average, our final policy will make the correct actions.

## Connection Between Policy Gradient and Supervised Learning#

**Policy Gradient methods** are actually similar to supervised learning, except that:

- We use the actions sampled by the model as "labels."
- We adjust the weights of the loss function based on the quality of the actions (rewards or penalties).

In standard supervised learning, the goal is to **maximize** the log probability of each training sample $\log p(y_i \mid x_i)$ where $x_i$ is the input (image) and $y_i$ is the correct label (category of the image).

In reinforcement learning, we do not have **correct labels** $y_i$. Therefore, we treat the actions sampled by the policy model (e.g., moving up or down) as "pseudo labels." We adjust the **loss weights for each sample** based on the final outcome (e.g., win or lose) to increase the probability of effective actions and decrease the probability of ineffective actions.

Thus, in reinforcement learning, the loss function looks like $\sum_i A_i \log p(y_i \mid x_i)$ where:

- $y_i$ is the action sampled by our model (i.e., "pseudo label").
- $A_i$ is the
**advantage value**, used to measure the good or bad outcome of this action.- If the action ultimately leads to victory, $A_i = 1.0$;
- If it leads to failure, $A_i = -1.0$.

Therefore, reinforcement learning can be seen as **supervised learning conducted on a constantly changing dataset (episodes)**, where these datasets are sampled from different games or scenarios, and we adjust the loss size for each sample based on the results.

## General Advantage Function#

Previously, we judged the quality of each action based on whether we won, but in a more general RL environment, we receive some immediate reward $r_t$ at each time step, such as scores in the game, feedback from the environment, etc.

### Discounted Reward#

A common approach is to use **Discounted Reward** to calculate the cumulative return after a certain action. The formula for discounted reward is:

- $\gamma$ is the discount factor, ranging from 0 to 1 (e.g., 0.99), used to reduce the impact of future rewards.
- Starting from the current time $t$, the reward $r_{t+k}$ for each future time step is multiplied by the discount based on how far it is from the current moment; rewards further away are diminished more.

### Normalize Returns#

When we have calculated $R_t$ for 20,000 actions (e.g., all actions in 100 games of Pong), it is common practice to normalize these return values. This can be achieved by **subtracting the mean** and **dividing by the standard deviation**:

Normalization can balance the ratio of encouraged and penalized actions, ensuring that about half of the actions in a batch are encouraged and the other half are penalized. Mathematically, this can be interpreted as a way to **control the variance of the policy gradient estimate**. Reducing variance means that gradient updates are more stable, making the **model easier to converge**.

## PG Derivation#

Policy gradient is a special case of a more general score function gradient estimator. We consider the following expression:

$E_{x\sim p(x|\theta)}[f(x)]$

where:

- $f(x)$ is a scalar value score function (usually the reward function or advantage function in reinforcement learning)
- $p(x;\theta)$ is the probability distribution parameterized by parameters $\theta$ (usually the policy network in reinforcement learning)

Goal: Find out how to adjust the distribution (by changing its parameters $\theta$) to increase the score of its samples.

[!Understanding the trick here]

The basic property of the derivative of the logarithm function:

Using $\nabla_\theta$ to denote the gradient with respect to $\theta$:

[!OpenAI Spinning Up version:]

This is an expectation, meaning we can estimate it using the sample mean. If we collect a set of trajectories $\mathcal{D} = \{\tau_i\}_{i=1,...,N}$, where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_{\theta}$, then the policy gradient can be estimated as follows:

${g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau),$

where $|\mathcal{D}|$ represents the number of trajectories in $\mathcal{D}$ (here it is $N$).

This final expression is the simplest version of the computable expression we expect. Assuming we have represented our policy in some way that allows us to compute $\nabla_{\theta} \log \pi_{\theta}(a|s)$, and if we can run this policy in the environment to collect trajectory datasets, we can compute the policy gradient and take update steps.

## Practice in GYM#

### Model Initialization#

```
H = 200 # Number of hidden layer neurons
batch_size = 10 # How often to update parameters per episode
learning_rate = 1e-4
gamma = 0.99 # Reward discount factor
# Model initialization
D = 80 * 80 # Input dimension: 80x80 grid
model = {}
model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization
model['W2'] = np.random.randn(H) / np.sqrt(H)
```

### Helper Functions#

```
def prepro(I):
""" Preprocess 210x160x3 uint8 frames into 6400 (80x80) 1D float vectors """
I = I[35:195] # Crop
I = I[::2,::2,0] # Downsample
I[I == 144] = 0 # Erase background
I[I == 109] = 0 # Erase background
I[I != 0] = 1 # Set others (paddle, ball) to 1
return I.astype(np.float64).ravel()
def discount_rewards(r):
""" Calculate discounted rewards """
discounted_r = np.zeros_like(r)
running_add = 0
for t in reversed(range(0, r.size)):
if r[t] != 0: running_add = 0 # Reset sum because this is the game boundary (specific to Pong)
running_add = running_add * gamma + r[t]
discounted_r[t] = running_add
return discounted_r
```

### Policy Network#

```
def policy_forward(x):
h = np.dot(model['W1'], x)
h[h<0] = 0 # ReLU non-linearity
logp = np.dot(model['W2'], h)
p = sigmoid(logp)
return p, h # Return the probability of taking action 2 and the hidden state
def policy_backward(eph, epdlogp):
""" Backpropagation """
dW2 = np.dot(eph.T, epdlogp).ravel()
dh = np.outer(epdlogp, model['W2'])
dh[eph <= 0] = 0 # Backpropagate ReLU
dW1 = np.dot(dh.T, epx)
return {'W1':dW1, 'W2':dW2}
```

Implement forward & backward for the policy network.

### Main Loop#

```
while True:
# Preprocess observation
cur_x = prepro(observation)
x = cur_x - prev_x if prev_x is not None else np.zeros(D)
prev_x = cur_x
# Forward pass and sample action
aprob, h = policy_forward(x)
action = 2 if np.random.uniform() < aprob else 3
# Record intermediate states
xs.append(x) # Observation
hs.append(h) # Hidden state
y = 1 if action == 2 else 0 # "Pseudo label"
dlogps.append(y - aprob) # Encourage the gradient of the action taken
# Execute action
observation, reward, done, info = env.step(action)
reward_sum += reward
drs.append(reward) # Record reward
```

### Handling at Episode End#

```
if done:
episode_number += 1
# Stack all inputs, hidden states, action gradients, and rewards for this episode
epx = np.vstack(xs)
eph = np.vstack(hs)
epdlogp = np.vstack(dlogps)
epr = np.vstack(drs)
xs,hs,dlogps,drs = [],[],[],[] # Reset array memory
# Calculate discounted rewards
discounted_epr = discount_rewards(epr)
# Normalize rewards
discounted_epr -= np.mean(discounted_epr)
discounted_epr /= np.std(discounted_epr)
epdlogp *= discounted_epr # Adjust gradients with advantage (core of policy gradient)
grad = policy_backward(eph, epdlogp)
for k in model: grad_buffer[k] += grad[k] # Accumulate gradients in batch
```

Calculate the policy gradient and accumulate it in the gradient buffer at the end of each episode.

### Parameter Update#

```
grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # Update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # RMSProp memory
```

```
if episode_number % batch_size == 0:
for k,v in model.items():
g = grad_buffer[k] # Gradient
rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
grad_buffer[k] = np.zeros_like(v) # Reset batch gradient buffer
```

Use the RMSProp algorithm to update model parameters.

### Reflections from AK#

For humans, we would say, "You control a paddle that can move up and down, and your task is to bounce the ball over the AI opponent," while in standard reinforcement learning problems, we assume an arbitrary reward function through interaction with the environment. Human players bring in a lot of prior knowledge (such as the ball's trajectory following physical laws, possible movement strategies of the AI opponent, and such intuitive psychology) and need to constantly see the normal game screen. But for the policy gradient solution, even if the pixels of the frames are randomly arranged, the policy gradient using a fully connected network may not even notice the difference.

Thus, policy gradient is a brute-force method to discover the correct actions and internalize them as **policies**.

Policy gradients can easily outperform humans in many games, especially those that require frequent reward signals, precise operations, quick reactions, and **do not require much long-term planning**, as the association between these short-term rewards and actions can be easily "perceived" by this method and meticulously refined through the policy. The results in the later stages of training in the Pong game illustrate this: the agent quickly moves to catch the ball as it approaches and then shoots the ball with rapid and high vertical speed. In many ATARI games, deep Q-learning has completely surpassed human baseline performance in this way—such as in Breakout, Pong, etc.