This year, I participated in Kaggle competitions using the DeepSeekMath7BRL model, learning from Claude 3.5 Sonnet as a teacher. The strength of both models is largely attributed to RL. I have a vague feeling that the technology in this field is very strong and beautiful, so I prepared to explore it a bit. However, my foundation is not solid, and I couldn't understand OpenAI Spinning Up, so I chose to sample an introductory course like HuggingFace DeepRL.
Introduction#
The concept of reinforcement learning is that an agent (AI) learns from the environment through interaction (via trial and error) and receives rewards (positive or negative) as feedback for executing actions, thereby learning from the environment.
 The agent obtains the first frame from the environment, receiving state $S_0$. Based on $S_0$, the agent takes action $A_0$ (for example, moving to the right), and the environment enters a new frame (new state $S_1$), providing the agent with a certain reward $R_1$ (for example, not dying +1 positive reward).
The agent's goal is to maximize its cumulative reward, referred to as expected return.
Reward Hypothesis: The Core Idea of Reinforcement Learning#
Reinforcement learning is based on the reward hypothesis, which states that all goals can be described as maximizing expected return (expected cumulative reward). In reinforcement learning, to achieve optimal behavior, our goal is to learn to take actions that maximize expected cumulative reward.
Tasks are divided into two categories#

Episodic: Has a starting point and an ending point. (Game levels)

Continuous: Has a starting point but no ending point. (Selfdriving)
What is the difference between State and Observation?#
 State is a complete description of the world state (no hidden information), such as a chessboard.
 Observation is a partial description of the state, such as a view in a Mario game.
Policybased Methods#
In this method, the policy is learned directly.
It maps each state to the best corresponding action for that state or the probability distribution of the possible action set for that state.
Valuebased Methods#
Instead of training a policy, it trains a value function, which maps each state to the expected value of being in that state.
Exploration/Exploitation Tradeoff#
In reinforcement learning, we need to balance the degree of exploration of the environment with the degree of exploitation of known information.
Where does depth come from?#
Deep neural networks are introduced to solve reinforcement learning problems.
Practical Example  HUGGY#
It is recommended to play directly using Colab. If running in WSL2, you need to set up a virtual display:
import os
os.environ['PYVIRTUALDISPLAY_DISPLAYFD'] = '0'
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()
State Space#
We want to train Huggy to pick up the stick we throw. This means it needs to move correctly towards the stick.
The environmental information we provide includes:
 The position of the target (stick)
 Its relative position to the target
 The orientation of its legs.
Based on all this information, Huggy can use its Policy to decide what action to take next to achieve its goal.
Action Space#
The actions that Huggy can perform.
Joint motors drive Huggy's legs.
Therefore, to achieve the goal, Huggy needs to learn how to correctly rotate the joint motors of each leg to move.
Reward Function#
Thus, our reward function includes:
 Direction reward: Reward for getting closer to the target.
 Goal achievement reward: We reward Huggy for achieving the goal.
 Time penalty: A fixed time penalty given for each action, forcing it to reach the stick as quickly as possible.
 Rotation penalty: Penalty for excessive and rapid rotations.
Introduction to QLearning#
valuebased
 Starting from state s, we act according to policy $\pi$.
 We accumulate future rewards, but rewards further away are discounted by factor $\gamma$.
 The value function calculates the expected "return" from the current state onward.
The goal of an RL agent is to have an optimal policy π*.
Relationship between Value and Policy#
Finding the optimal value function allows us to derive the optimal policy.
 $\pi^*(s)$: Represents the optimal policy, the best action to take when in state $s$.
 $Q^*(s, a)$: Represents the stateaction value function, which indicates the expected value of future discounted returns when taking action $a$ in state $s$.
 $\arg \max_{a}$: Represents choosing the action $a$ that maximizes $Q^*(s, a)$, i.e., finding the action that maximizes expected return.
The core idea of this formula is that if we have an optimal $Q^*(s, a)$ function, we can determine the optimal policy $\pi^*(s)$ by selecting the action that maximizes $Q^*(s, a)$. In other words, the optimal policy chooses the action $a$ that maximizes future expected return in each state $s$.
Two Types of ValueBased Methods#
Imagine a scenario where a mouse is searching for cheese in a maze:
 The state value function is akin to the mouse's expected mood when at a certain position (state): "If I start from here and act according to my policy, how likely am I to find cheese?"
 The action value function goes a step further: "If I start from here and choose to move right, what is my probability of success?" The action value function helps the mouse evaluate the effects of executing different actions in each state, allowing for more targeted action direction.
1. State Value Function#
The state value function $V_\pi(s)$ evaluates the expected cumulative reward starting from a given state $s$ under a specific policy $\pi$. In other words, it measures "If I am currently in state $s$ and follow policy $\pi$, what will my longterm return be?"
 $V_\pi(s)$: The value of state $s$, i.e., the expected cumulative return from starting at $s$ and following policy $\pi$.
 $G_t$: The cumulative reward starting from time step $t$, usually expressed as $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, where $\gamma$ is the discount factor.
 $\mathbb{E}_\pi[\cdot]$: Represents the expectation under policy $\pi$.
Suppose we have a maze, and the mouse starts from a certain position (state $s$) and follows a certain policy to find cheese. The state value function can tell us how quickly the mouse is expected to find cheese from its current position.
In the example above, the value of each grid indicates the value of a certain state. For instance, if the value of state $s$ is $6$, it means the expected cumulative reward from starting here is poor (possibly because it is far from the cheese), while states closer to the cheese (like those with a value of $2$) have better expected returns.
2. Action Value Function#
The action value function $Q_\pi(s, a)$ evaluates the expected cumulative reward after taking action $a$ from a given state $s$ under a specific policy $\pi$. It considers not only the value of the state but also evaluates the expected return from taking a specific action $a$ from that state.
 $Q_\pi(s, a)$: The value of the stateaction pair $(s, a)$, i.e., the expected cumulative return from taking action $a$ in state $s$ and following policy $\pi$.
 Unlike the state value function, the action value function evaluates the longterm return of executing a specific action $a$ starting from state $s$.
Using the previous maze example, suppose the mouse is now in state $s$ and can choose four directions (up, down, left, right) to search for cheese. The action value function $Q_\pi(s, a)$ can tell us the expected return for the mouse choosing each direction. For instance, the value of moving right from a certain state $s$ might be higher than moving left because moving right gets closer to the cheese.
Summary of Differences#

The state value function $V_\pi(s)$ only considers the cumulative expected return of executing future actions in a certain state according to the policy. It does not focus on which specific action was taken.

The action value function $Q_\pi(s, a)$ is more detailed; it considers not only the return starting from a certain state but also the expected return after executing a specific action in that state.
Bellman Equation#
Returning to the previous example, if the mouse starts from state $St$ and receives a reward of 1 for each step:
This indicates that the cumulative return from state $St$ to the endpoint is 6.
Calculating the cumulative return for each state directly like this is clearly a costly process, as we need to consider every possible path and its rewards, along with a lot of repeated calculations. The Bellman equation provides us with a more concise recursive method to solve this problem.
The core idea of the Bellman equation is that we can express the value of a state as the immediate reward of that state plus the discounted sum of the future state values. The formula is as follows:
This way, we do not need to compute the cumulative return for each state from scratch but can recursively add the immediate reward and the value of the next state.
Let's return to the previous example. If we already know that the value of state $St+1$ is 5, then the value of state $St$ can be recursively calculated using the Bellman equation:
If there is no discounting (i.e., $\gamma = 1$), and we know that $V(St+1) = 5$ and the immediate reward $R_{t+1} = 1$, then:
This is consistent with our previous direct calculation.
The Role of the Discount Factor $\gamma$#
The discount factor $\gamma$ in the Bellman equation determines how much we value future rewards.
 If $\gamma$ is low (e.g., 0.1 or 0), we almost completely ignore future rewards and focus on the current immediate return.
 If $\gamma = 1$, we consider all future rewards, making every future step as important as the current one.
 If $\gamma$ is too large (like a million), it actually loses physical meaning because future values will be infinitely exaggerated, which is unreasonable in practical applications.
Monte Carlo Methods and Temporal Difference Learning#
Monte Carlo and Temporal Difference Learning are two different strategies for training value functions or policy functions, both of which leverage experience to solve reinforcement learning problems.
Monte Carlo: Learning at the End of Episodes#
Monte Carlo methods are a way to estimate the value of states or stateaction pairs through repeated sampling. In this method, we update the value function based on the complete experience of each episode after it ends. Below, I will explain how the Monte Carlo method works through a specific example.

Record state, action, reward, and next state:
 Each time the agent takes an action, it records the current state, the action taken, the reward received, and the next state it transitions to.

Calculate total return $G_t$:
 The agent accumulates all rewards in the episode to calculate the total return $G_t$, which is used to evaluate the current policy's performance.

Update value function $V(s)$:
 Update the value estimate for a certain state $s$ using the formula:
$V(S_t) \leftarrow V(S_t) + \alpha [G_t  V(S_t)]$
 Update the value estimate for a certain state $s$ using the formula:
Where:
 $\alpha$ is the learning rate, indicating the update step size.
 $G_t$ is the actual total return obtained.
 $V(S_t)$ is the current value estimate for state $S_t$.
After updating, the agent will start the next episode with the new state value function and continue to learn and improve its policy through multiple games.
Using the same scenario:
 Suppose at the beginning of the game, the mouse has a value of 0 for each state (i.e., $V(S) = 0$).
 The learning rate $\alpha = 0.1$, and we do not apply discounting (i.e., $\gamma = 1$).
 The mouse receives a series of rewards during its exploration, with an accumulated return of $G_0 = 3$.
Algorithm flow:
 Return $G_0 = 3$.
 Update state value $V(S_0)$:
$V(S_0) = V(S_0) + \alpha [G_0  V(S_0)]$Substituting known values:$V(S_0) = 0 + 0.1 \times (3  0) = 0.3$
Through this episode's update, the mouse's value estimate for state $S_0$ becomes 0.3.
Temporal Difference Learning: Learning at Every Step#
In contrast to the Monte Carlo method, Temporal Difference Learning (TD Learning) does not wait until the end of an entire episode to update. Instead, it gradually adjusts based on onestep experiences. The update is based on the immediate reward $R_{t+1}$ observed from the current state $S_t$ and the estimated value of the next state $S_{t+1}$.
The formula is as follows:
Where:
 $\alpha$ is the learning rate.
 $R_{t+1}$ is the immediate reward.
 $\gamma$ is the discount factor, controlling the weight of future rewards.
This update method is called bootstrapping, as it partially relies on existing value estimates (i.e., $V(S_{t+1})$) for updating rather than complete samples $G_t$.
Continuing with our mouse searching for cheese:

Initialize value function: Initially, we assume the value of each state is 0.

Agent explores: The mouse starts from a certain state $S_0$, randomly chooses an action, and suppose it moves left and eats cheese. At this point, it receives an immediate reward $R_{t+1} = 1$.

Calculate TD target:
The TD target is the immediate reward plus the discounted value of the next state:$TD_{Target} = R_{t+1} + \gamma V(S_{t+1})$In this example, suppose the value of the next state is $V(S_1) = 0$ and $\gamma = 1$, then:
$TD_{Target} = 1 + 1 \times 0 = 1$ 
Update value function:
Use the TD target to update the value of state $S_0$:$V(S_0) = V(S_0) + \alpha [TD_{Target}  V(S_0)]$Substituting values:
$V(S_0) = 0 + 0.1 \times (1  0) = 0.1$
Thus, the mouse updates the value of the current state after each step. As the mouse continues to interact with the environment, the value function will gradually reflect the true values of each state.
Comparison Summary: From the Perspective of Bias and Variance#
 Monte Carlo Method: Waits until the end of an episode to update the value function by calculating the cumulative return. It uses the actual return value $G_t$. Since it updates the value function based on complete episode experience, it can provide relatively accurate value estimates, especially suitable for situations without explicit models or transition probabilities.
 **Unbiased**: The Monte Carlo method directly obtains complete returns from actual experience, making it unbiased, meaning the estimated expected value aligns with the true expected value.
 **High Variance**: Since it calculates the return for the entire episode, involving multiple state transitions and rewards, it leads to higher variance in estimates.
 MC methods are not affected by bias. They have high variance but are unbiased, suitable for evaluating global experiences.
 Temporal Difference Learning: Updates the value function immediately after each step, allowing learning without complete returns, using estimated returns, i.e., $R_{t+1} + \gamma V(S_{t+1})$ instead of $G_t$.
 **Biased**: The TD method uses bootstrapping, meaning the initial values are often arbitrary and not necessarily accurate, leading to initial bias. However, as more real experiences accumulate, the bias gradually diminishes. The presence of bias can lead to serious issues, especially when using offpolicy + function approximators (like neural networks) combined with bias, often resulting in convergence failures. Sutton and Barto's **“deadly triad”** refers to this situation.
 **Low Variance**: Because it only involves singlestep state transitions and rewards, its variance is relatively low.
 Low variance but biased, initial estimates may be inaccurate, but suitable for realtime updates.
QLearning Related Terms#
QLearning is an offpolicy valuebased method that uses TD methods to train its action value function: the Q function. The letter Q stands for "Quality," indicating the value of taking that action in that state.
Given a state and action, the Q function outputs a stateaction value (also known as Q value).
Here, let's review the difference between Value and Reward:
Term  Explanation 

Value of a state  Expected cumulative reward, i.e., the total reward expected when the agent starts from a certain state and acts according to the policy. 
Reward  The immediate feedback received from the environment after performing an action in a certain state. 
Internally, our Q function is encoded by a Q Table, where each cell in the table corresponds to a stateaction pair value. This Q Table can be viewed as the memory of the Q function (Cheat Sheet).
It can be seen that the initial state and the stateaction value for moving up are 0.
Thus, when training is complete, we will obtain an optimal Q function, which means we have an optimal Q Table, equivalent to having an optimal policy, as we can determine the best action to take in each state from the Q Table.
Algorithm Flow#
Scary pseudocode:
Step 1: Initialize Q Table#
Initialize the Q table for each stateaction pair. Most often initialized to 0.
def initialize_q_table(state_space, action_space):
Qtable = np.zeros((state_space, action_space))
return Qtable
Step 2: Choose an action using εgreedy policy#
The εgreedy policy is a strategy that balances exploration and exploitation. By gradually reducing the probability of exploration, the agent tends to explore new actions in the early stages of learning, while later relying more on learned knowledge to choose the best action.
 With probability $1  \epsilon$: Choose the action with the highest Q value in the current Qtable (exploitation).
 With probability $\epsilon$: Randomly choose an action (exploration).
 Initial Stage: When $\epsilon = 1.0$, the agent explores completely, i.e., randomly selects actions.
 Training Process: As training progresses, $\epsilon$ gradually decreases (e.g., exponential decay), and the agent increasingly chooses the currently estimated optimal action, i.e., exploitation.
def greedy_policy(Qtable, state):
# Exploitation: Choose the action with the highest stateaction value
# Classic Qlearning operation: Query the Q table to get the action with the highest reward in the current state.
# Simple, direct, purely "exploitation."
action = np.argmax(Qtable[state][:]) # np.argmax finds the index of the maximum value.
return action
def epsilon_greedy_policy(Qtable, state, epsilon):
random_num = random.uniform(0,1) # Random number
# If random number > epsilon > Exploitation
# This is key: with a probability of 1  epsilon, we choose exploitation,
# greedily selecting the best action in the Q table.
if random_num > epsilon:
# Choose the action with the highest value in the given state
action = greedy_policy(Qtable, state)
# Otherwise > Exploration
# If the random number is less than epsilon, the agent will explore,
# randomly selecting an action to try new possibilities.
else:
action = env.action_space.sample() # Exploration: choose a random action.
return action
Step 3: Execute action $A_t$, receive reward $R_{t+1}$, and transition to next state $S_{t+1}$#
In Qlearning, this step is the core of the algorithm. When the agent executes action $A_t$, it receives feedback from the environment (reward $R_{t+1}$) and transitions to a new state $S_{t+1}$. At this point, the agent needs to consider not only the immediate reward obtained but also evaluate the potential future returns.
This leads us to the Qlearning update formula.
Step 4: Update $Q(S_t, A_t)$#
Qlearning uses a very important update rule, which we have seen in previous temporal difference learning:
 $Q(S_t, A_t)$ is our current estimate of the stateaction pair value.
 $R_{t+1}$ is the immediate reward obtained after executing action $A_t$.
 $\max_a Q(S_{t+1}, a)$ is the Q value of the optimal action in the next state $S_{t+1}$ based on the current policy.
 $\alpha$ is the learning rate, which determines how quickly we update the Q value.
 $\gamma$ is the discount factor, which balances the weight of immediate rewards and future rewards.
Through this method, each update in Qlearning adjusts based on the optimal Q value of the next state.
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
# External training loop: Execute learning for n_training_episodes times
for episode in tqdm(range(n_training_episodes)):
# Epsilon decay: As training progresses, epsilon gradually decreases, reducing the probability of exploration and increasing exploitation
epsilon = min_epsilon + (max_epsilon  min_epsilon)*np.exp(decay_rate*episode)
# At the start of each round, reset the environment and obtain the initial state
state, info = env.reset()
step = 0
terminated = False
truncated = False
# In each round, the agent interacts with the environment for a maximum of max_steps steps
for step in range(max_steps):
# Use epsilon greedy policy to choose action
action = epsilon_greedy_policy(Qtable, state, epsilon)
# Execute action and observe new state and new reward
new_state, reward, terminated, truncated, info = env.step(action)
# Update Q value:
# Q(s,a) = Q(s,a) + learning rate * (reward + discount factor * max(Q(s',a'))  Q(s,a))
Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state])  Qtable[state][action])
# If the agent ends in the current state (success or failure), stop this training round
if terminated or truncated:
break
# Set the agent's next state to the new state
state = new_state
return Qtable
Offpolicy vs Onpolicy#
Now, let's discuss the two types of policies—Offpolicy and Onpolicy.
From the perspective of policy learning, the key difference between the two is:
 Onpolicy: Learns the policy currently being executed, where the agent's actual behavior aligns with the learned policy.
 Offpolicy: Learns a policy different from the one currently being executed; even if the agent is executing an exploration policy, it updates based on the optimal policy.
Offpolicy#
Offpolicy uses two different policies for executing actions (acting policy) and updating Q values (updating policy). In the Qlearning we introduced earlier, the agent uses the epsilongreedy policy for action selection (acting policy), but when updating Q values, it uses the greedy policy (greedy policy) to find the optimal action.
For example, when updating Q values, the agent will choose the optimal action $a$ in the next state, i.e.:
This means that even if our current policy is epsilongreedy, we always use the greedy policy to update Q values, thus accelerating convergence. This is why Qlearning is referred to as an Offpolicy algorithm.
Onpolicy#
Onpolicy, on the other hand, requires the same policy to be used for both action execution and Q value updates. For example, in the SARSA algorithm, the action chosen during the update is also based on the current policy, rather than the greedy policy. This means the agent always uses the epsilongreedy policy for both selecting and updating actions, leading to a learning process that is closer to the actual execution policy.
To understand with an example:

SARSA (Onpolicy): Suppose the agent is now in state $S_t$, it chooses action $A_t$, then transitions to new state $S_{t+1}$ and chooses new action $A_{t+1}$. SARSA uses $A_{t+1}$ (the action chosen based on the current epsilongreedy policy) to update Q values.

Qlearning (Offpolicy): In the same situation, Qlearning also chooses action $A_t$ from $S_t$ and transitions to state $S_{t+1}$. However, when updating Q values, it does not care which action $A_{t+1}$ was actually chosen in $S_{t+1}$, but instead uses the maximum Q value of all actions in $S_{t+1}$ (i.e., the optimal policy) to update Q values.
Let's Dive Deeper!#
If it were just a simple game like "Frozen Lake," building a Q table would clearly allow us to obtain an excellent Q function in a relatively short time.
However, when the environment becomes a game like "Doom," which has a vast state space (millions of different states), the efficiency of creating and updating a Q table will be very low.
This is why we now introduce deep Qlearning, using a neural network to approximate the different Q values for each action given a state.