Exploring the Actor-Critic Method

Variance Problem#

Policy Gradient methods have gained attention for their intuitiveness and effectiveness. We previously explored the REINFORCE algorithm, which performs well in many tasks. However, REINFORCE relies on Monte Carlo sampling to estimate returns, meaning we need data from entire episodes to compute them. This approach brings a key issue: high variance in the policy gradient estimates.

The core of PG estimation lies in finding the direction that maximizes the increase in returns. In other words, we need to update the weights of the policy so that actions that yield high returns are more likely to be chosen in the future. Ideally, such updates would gradually optimize the policy, leading to higher total returns.

However, when estimating returns with Monte Carlo methods, relying on complete episodes to compute the actual return (rather than bootstrapping from a value estimate) produces policy gradient estimates that are unbiased but have high variance. High variance means our gradient estimates are unstable, which can slow training or even prevent convergence. To obtain reliable gradient estimates, we may need a large number of samples, which can be costly in practical applications.

The randomness of the environment and the policy means that the same initial state can yield drastically different returns, so returns starting from the same state vary significantly across episodes. Averaging over a large number of trajectories reduces variance and gives more accurate return estimates, but such large batches hurt sample efficiency, which is why we look for other mechanisms to reduce variance.

Advantage Actor-Critic (A2C)#

Reducing Variance through Actor-Critic Methods#

An intuitive takeaway from the previous sections is that "if we combine Value-Based and Policy-Based methods, we can address both the variance and the training-stability issues." The Actor-Critic method is precisely such a hybrid architecture, specifically:

  • Actor: Responsible for selecting actions, generating action probability distributions based on the current policy.
  • Critic: Estimates the value function under the current policy, providing feedback on action selection.

Imagine you and your friend are both novice players. You are in charge of the actions (Actor), while your friend observes and evaluates (Critic). At first, neither of you understands the game well. You operate haphazardly, while your friend is also figuring out how to assess your performance. Over time, you improve your operational skills (Policy) through practice, while your friend learns to evaluate the quality of each action more accurately (Value).

You help each other progress: your actions provide a basis for your friend's evaluations, while your friend's feedback helps you adjust your strategy.

In other words, we will learn two function approximators (neural networks):

  • The policy function controlling the agent (Actor): $\pi_\theta(s)$
  • The value function measuring the quality of actions to assist policy optimization (Critic): $\hat{q}_w(s,a)$
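
As a rough sketch of what these two approximators can look like in code (the layer sizes and variable names here are illustrative assumptions, not taken from the original post), the Actor maps a state to a distribution over actions and the Critic scores a state-action pair:

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, HIDDEN = 4, 2, 64   # made-up sizes for illustration

# Actor: π_θ(s) — a probability distribution over actions
actor = nn.Sequential(
    nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, ACTION_DIM), nn.Softmax(dim=-1),
)

# Critic: q̂_w(s, a) — a scalar score for a state-action pair
critic = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)

state = torch.randn(1, STATE_DIM)                 # dummy state
probs = actor(state)                              # π_θ(·|s)
action = torch.multinomial(probs, num_samples=1)  # sample an action from the policy
action_one_hot = F.one_hot(action.squeeze(1), ACTION_DIM).float()
q_value = critic(torch.cat([state, action_one_hot], dim=-1))  # q̂_w(s, a)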

Algorithm Process#

  • At each time step $t$, we obtain the current state $S_t$ from the environment, which is passed as input to our Actor and Critic.
  • The Actor outputs an action $A_t$ based on the state.

  • The Critic also takes this action as input and uses $S_t$ and $A_t$ to calculate the value of taking that action in that state: the Q-value.
  • The action $A_t$ is executed in the environment, resulting in a new state $S_{t+1}$ and reward $R_{t+1}$.
  • The Actor updates its policy parameters using the $Q$ value.

  • The Actor uses the updated parameters to generate the next action $A_{t+1}$ given the new state $S_{t+1}$.
  • The Critic then updates its value parameters.

The Critic serves as a baseline for adjusting return estimates, making gradient estimates more stable. The training process is smoother, converges faster, and significantly reduces the number of samples required.
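
Putting the steps above together, one Q Actor-Critic iteration can be sketched as follows: a policy-gradient step for the Actor, $\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{q}_w(s_t, a_t)$, and a TD update for the Critic. The networks, learning rates, and names below are illustrative assumptions, not the post's own code:

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, HIDDEN, GAMMA = 4, 2, 64, 0.99   # made-up values for illustration

actor = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, ACTION_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def one_hot(a):
    return F.one_hot(a, ACTION_DIM).float()

def q_actor_critic_step(s, a, r, s_next, a_next):
    """One iteration: the Critic does a TD(0) update, the Actor does a policy-gradient update."""
    q_sa = critic(torch.cat([s, one_hot(a)], dim=-1))                  # q̂_w(S_t, A_t)
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, one_hot(a_next)], dim=-1))  # q̂_w(S_{t+1}, A_{t+1})
        td_target = r + GAMMA * q_next                                 # R_{t+1} + γ q̂_w(S_{t+1}, A_{t+1})

    # Critic update: move q̂_w(S_t, A_t) toward the TD target
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: increase log π_θ(A_t|S_t) in proportion to the Q value
    log_prob = F.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1))
    actor_loss = -(log_prob * q_sa.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call with dummy tensors
s, s_next = torch.randn(1, STATE_DIM), torch.randn(1, STATE_DIM)
a, a_next, r = torch.tensor([0]), torch.tensor([1]), torch.tensor([[1.0]])
q_actor_critic_step(s, a, r, s_next, a_next)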

Adding Advantage (A2C)#

Learning can be further stabilized by using the Advantage function as the Critic's signal instead of the action-value function.

Advantage: Highlighting Good Actions#

The core idea is to evaluate your actions through two components:

  1. The immediate reward you receive and the value of the next state.
  2. Your expected value in the current state.

Mathematically, we call this advantage:

$$A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This expresses: in state $s_t$, how much better is the action $a_t$ you took compared to your original expectation (represented by the baseline expectation $V(s_t)$)?

If the actual reward plus the discounted value of the next state, $r_{t+1} + \gamma V(s_{t+1})$, exceeds your expectation for the current state, $V(s_t)$, then this action is good; if it falls short, then... you could do better.

This advantage not only tells you whether an action is good or not but also indicates how good or how bad it is (relative to the baseline).
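
As a concrete (made-up) example: suppose $r_{t+1} = 1$, $\gamma = 0.99$, $V(s_{t+1}) = 2$, and $V(s_t) = 2.5$. Then

$$A(s_t, a_t) = 1 + 0.99 \times 2 - 2.5 = 0.48$$

The advantage is positive, so the action turned out somewhat better than the baseline expected, and the policy should make it more likely.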

Policy Update#

When we execute an action, the reward itself is not enough to guide policy improvement. The reward tells us whether an action is good or bad, but it does not tell us how good this action is or how much better it is than expected.
Thus, when improving the policy, rather than blindly chasing rewards, it is better to focus on adjusting actions based on how much they exceed (or fall short of) expectations. This allows for fine-tuning the policy towards actions that consistently perform better than the baseline.

This gives us the update formula:

$$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t)\, A(s_t, a_t)$$

  • $\nabla_\theta \log \pi_\theta(a_t | s_t)$: the gradient of the log-probability of the policy $\pi_\theta$ selecting action $a_t$ at time step $t$. This term tells us how to change the parameters $\theta$ to make selecting $a_t$ more likely under the current policy.
  • $A(s_t, a_t)$: the advantage function of the action, telling us how good or bad this action $a_t$ is in state $s_t$ relative to the baseline.

In simple terms: the gradient of your policy $\pi_\theta$ is adjusted by the advantage $A(s_t, a_t)$. You update your policy not just based on whether this action brought some reward, but based on how much this action exceeded expectations.

Even better: you only need one neural network to predict the value function $V(s)$.
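
In code, this update amounts to weighting the negative log-probability of each action by its advantage; here is a minimal sketch with made-up numbers (not part of the original implementation):

import torch

# Made-up per-step data for illustration
log_probs  = torch.tensor([-0.1, -0.7, -0.3], requires_grad=True)  # log π_θ(a_t | s_t)
advantages = torch.tensor([ 0.5, -0.2,  1.0])                      # A(s_t, a_t), treated as constants

# Minimizing this loss increases the probability of actions with positive advantage
# and decreases the probability of actions with negative advantage
policy_loss = -(log_probs * advantages).mean()
policy_loss.backward()   # in a real agent, an optimizer step on θ would follow
print(log_probs.grad)    # gradient has the opposite sign of the advantage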

Now let's talk about TD Error#

Of course, calculating this advantage function is great, but online learning has a wonderful aspect: you don't have to wait until the end to update the policy. Thus, the Temporal Difference Error (TD Error) comes into play:

$$\delta = r + \gamma V(s') - V(s)$$

The key here is that the TD error is actually an online estimate of the advantage function. It tells you whether your action has made the future state better than you expected at this moment. This error $\delta$ directly reflects the concept of advantage:

  • If $\delta > 0$: "Hey, this action is better than I thought!" (the advantage is positive).
  • If $\delta < 0$: "Hmm, I thought it would be better..." (the advantage is negative).

This allows you to gradually adjust your policy without waiting for an entire episode to end before making changes. This is an excellent strategy for improving efficiency.
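
To make this concrete, here is a minimal sketch of an online actor-critic step that uses $\delta$ as the advantage estimate (the network shapes, learning rate, and names are assumptions for illustration; the full episodic implementation follows in the next section):

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, HIDDEN, GAMMA = 4, 2, 64, 0.99   # made-up values for illustration

actor  = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, ACTION_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))   # V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def online_step(s, a, r, s_next, done):
    """One online update: the TD error δ serves as the advantage estimate."""
    v_s = critic(s)                                      # V(s_t)
    with torch.no_grad():
        v_next = torch.zeros_like(v_s) if done else critic(s_next)
    td_error = r + GAMMA * v_next - v_s                  # δ = r + γ V(s') − V(s)

    log_prob = F.log_softmax(actor(s), dim=-1).gather(1, a.view(1, 1))
    actor_loss  = -(log_prob * td_error.detach())        # δ weights the policy gradient
    critic_loss = td_error.pow(2)                        # regress V(s_t) toward r + γ V(s_{t+1})

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).mean().backward()
    optimizer.step()

# Example call with dummy tensors
online_step(torch.randn(1, STATE_DIM), torch.tensor([0]), torch.tensor([[1.0]]),
            torch.randn(1, STATE_DIM), done=False)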

Code Implementation#

Actor-Critic Network Architecture#

First, we need to build a neural network. This is a dual-head network: one for the Actor (learning the policy to select actions) and another for the Critic (estimating the value of states).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size, learning_rate=3e-4):
        super(ActorCritic, self).__init__()

        # Critic network (value function approximation)
        # This network is used to predict V(s), the value of state s
        self.critic_linear1 = nn.Linear(num_inputs, hidden_size)
        self.critic_linear2 = nn.Linear(hidden_size, 1)  # The value function is a scalar output

        # Actor network (policy function approximation)
        # This network is used to predict π(a|s), the probability of selecting action a in state s
        self.actor_linear1 = nn.Linear(num_inputs, hidden_size)
        self.actor_linear2 = nn.Linear(hidden_size, num_actions)  # Outputs the probability distribution over all actions
    
    def forward(self, state):
        # Convert state to torch tensor and add a dimension to support batch processing
        state = torch.from_numpy(state).float().unsqueeze(0)
        
        # Forward pass through the Critic network
        value = F.relu(self.critic_linear1(state))
        value = self.critic_linear2(value)  # Output the value of the state V(s)
        
        # Forward pass through the Actor network
        policy_dist = F.relu(self.actor_linear1(state))
        policy_dist = F.softmax(self.actor_linear2(policy_dist), dim=1)  # Use softmax to convert raw values to action probability distribution (policy)
        
        return value, policy_dist
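
As a quick sanity check (the 4-dimensional observation and the sizes below are made-up, CartPole-like values), the network can be exercised on a dummy input:

import numpy as np

model = ActorCritic(num_inputs=4, num_actions=2, hidden_size=256)
dummy_state = np.random.randn(4).astype(np.float32)   # fake observation for illustration
value, policy_dist = model.forward(dummy_state)
print(value.shape, policy_dist.shape)   # torch.Size([1, 1]) torch.Size([1, 2])
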
Core Implementation of A2C Algorithm#

Next, we enter the core of A2C: the main loop and update mechanism. In each episode, the agent runs for a certain number of steps in the environment, collecting trajectories of states, actions, and rewards. At the end of each episode, the Actor (policy) and Critic (value function) are updated.
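
The function below relies on a few module-level imports and hyperparameters; the particular values shown here are illustrative assumptions, not taken from the original post:

import sys
import numpy as np
import torch
import torch.optim as optim

# Example hyperparameters (illustrative values)
GAMMA = 0.99          # discount factor
hidden_size = 256     # width of the hidden layers
learning_rate = 3e-4  # Adam learning rate
max_episodes = 3000   # number of training episodes
num_steps = 300       # maximum steps per episode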

def a2c(env):
    # Get input and output dimensions from the environment
    num_inputs = env.observation_space.shape[0]
    num_outputs = env.action_space.n
    
    # Initialize Actor-Critic network
    actor_critic = ActorCritic(num_inputs, num_outputs, hidden_size)
    ac_optimizer = optim.Adam(actor_critic.parameters(), lr=learning_rate)

    # Data containers for tracking performance
    all_lengths = []  # Track the length of each episode
    average_lengths = []  # Track the average length of the last 10 episodes
    all_rewards = []  # Track the cumulative rewards of each episode
    entropy_term = 0  # Incentive for exploration

    # Enter the loop for each episode
    for episode in range(max_episodes):
        log_probs = []  # Store log probabilities of actions
        values = []  # Store Critic's value estimates (V(s))
        rewards = []  # Store the rewards received

        state = env.reset()  # Reset the environment, starting a new episode
        for steps in range(num_steps):
            # Forward pass through the network
            value, policy_dist = actor_critic.forward(state)  # value: Critic's V(s) estimate (kept as a tensor), policy_dist: action probabilities
            dist = policy_dist.detach().numpy()
            
            # Sample an action from the action probability distribution
            action = np.random.choice(num_outputs, p=np.squeeze(dist))
            log_prob = torch.log(policy_dist.squeeze(0)[action])  # Record the log probability of the selected action
            entropy = -np.sum(dist * np.log(dist + 1e-10))  # Entropy of the action distribution, used to encourage exploration
            new_state, reward, done, _ = env.step(action)  # Execute the action, obtaining reward and new state

            # Record the data for this trajectory
            rewards.append(reward)
            values.append(value)  # Keep the value tensor so gradients can flow back to the Critic
            log_probs.append(log_prob)
            entropy_term += entropy
            state = new_state  # Update to the new state
            
            # If the episode ends, record and exit the loop
            if done or steps == num_steps-1:
                Qval, _ = actor_critic.forward(new_state)  # Estimate the value of the last state
                Qval = Qval.detach().numpy()[0,0]
                all_rewards.append(np.sum(rewards))  # Record the total reward for this episode
                all_lengths.append(steps)
                average_lengths.append(np.mean(all_lengths[-10:]))
                if episode % 10 == 0:
                    sys.stdout.write("episode: {}, reward: {}, total length: {}, average length: {} \n".format(
                        episode, np.sum(rewards), steps, average_lengths[-1]))
                break

        # Calculate Q values (the target values for the Critic)
        Qvals = np.zeros(len(rewards))  # Initialize the array of discounted returns
        for t in reversed(range(len(rewards))):
            Qval = rewards[t] + GAMMA * Qval  # Calculate Q values using the Bellman equation
            Qvals[t] = Qval

        Qvals = torch.FloatTensor(Qvals)
        values = torch.cat(values).squeeze(-1)  # Stack the Critic's per-step estimates into a single tensor
        log_probs = torch.stack(log_probs)
        
        # Calculate the advantage function
        advantage = Qvals - values  # How much better did the action perform compared to the Critic's expectation?

        # Loss function
        actor_loss = (-log_probs * advantage.detach()).mean()  # Policy loss (the advantage is treated as a constant here)
        critic_loss = 0.5 * advantage.pow(2).mean()  # Value function loss (minimizing prediction error)
        ac_loss = actor_loss + critic_loss + 0.001 * entropy_term  # Total loss, including entropy incentive

        # Backpropagation and optimization
        ac_optimizer.zero_grad()
        ac_loss.backward()
        ac_optimizer.step()

  • Actor updates are based on policy gradients: we multiply the log probability of the action by the advantage of the action; if the action performed better than expected, the advantage is positive, thus encouraging that action.
  • Critic updates are based on mean squared error: the Critic compares its predicted $V(s_t)$ with the actual return $Q_t$ and minimizes this difference.
  • entropy_term: By introducing entropy, we encourage exploratory behavior, preventing the agent from prematurely converging to certain actions without sufficient exploration.
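
For completeness, a hedged usage sketch: assuming a discrete-action environment from the older gym API (whose `reset` returns only the observation and whose `step` returns four values, matching the code above), training could be started like this:

import gym

if __name__ == "__main__":
    env = gym.make("CartPole-v1")   # any discrete-action env with a flat observation vector works
    a2c(env)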

Summary#

The core of A2C is to guide policy updates through the advantage function, while leveraging TD error for online learning, making the agent's actions smarter and more efficient at each step.

  • Advantage function: Not only tells you whether an action is good or bad but also how good it is relative to the baseline.
  • TD error: A tool for real-time calculation of advantage, helping you quickly and efficiently update the policy.

Asynchronous Advantage Actor-Critic (A3C)#

A3C was proposed in DeepMind's paper "Asynchronous Methods for Deep Reinforcement Learning". Essentially, A3C is a parallelized version of A2C: multiple worker agents run in parallel copies of the environment and asynchronously update a shared global network, hence the term "asynchronous". This is efficient on modern multi-core CPUs.

As seen earlier, A2C is the synchronous version of A3C: it waits for each worker to finish its segment of experience before performing an update that averages across all workers. The advantage is that it can utilize the GPU more efficiently. Moreover, OpenAI Baselines: ACKTR & A2C noted:
Our synchronous A2C implementation outperforms our asynchronous implementation—we have not seen any evidence that the noise introduced by asynchrony provides any performance advantage. When using a single GPU machine, this A2C implementation is more cost-effective than A3C and is faster than A3C implementations that only use CPUs when using larger policies.
