Policy Gradients

Overview

Policy gradient methods are a class of reinforcement learning algorithms that directly optimize the policy by estimating the gradient of the expected return with respect to the policy parameters. Unlike value-based methods such as Q-learning, which derive a policy indirectly from learned value estimates, policy gradient methods learn a parameterized policy that maps states directly to actions (or to distributions over actions).

These methods are particularly useful for:

  • Continuous action spaces where value-based methods struggle
  • Stochastic policies where exploration is part of the policy
  • Problems where the optimal policy is easier to learn than the optimal value function

Core Concepts

  • Policy Gradient Theorem

    The policy gradient theorem provides the theoretical foundation for policy gradient methods; a short derivation and a numerical sketch of the resulting estimator are given at the end of this list. It states that the gradient of the expected return with respect to the policy parameters can be written as:

    \[ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right] \]

    Where:

    • \(\theta\) are the policy parameters
    • \(\pi_\theta\) is the parameterized policy
    • \(\tau\) is a trajectory (sequence of states, actions, rewards)
    • \(R(\tau)\) is the return of the trajectory
  • Advantages Over Value-Based Methods

    • Natural for Continuous Actions: Policy can output continuous action values directly
    • Stochastic Policies: Can learn optimal stochastic policies
    • Better Convergence: Small parameter changes alter the policy smoothly, which often gives more stable convergence than the abrupt policy switches of greedy value-based methods
    • Policy Structure: Can incorporate prior knowledge about the desired policy structure
  • Common Variants

    REINFORCE (Monte Carlo Policy Gradient)

    The basic policy gradient algorithm that uses complete episode returns:

    • Simple but high variance
    • Uses trajectory sampling
    • No bootstrapping

    Actor-Critic Methods

    Combines policy gradients with a learned value function (the critic) used for bootstrapping and variance reduction; a minimal actor-critic sketch follows the REINFORCE implementation below:

    • Lower variance than REINFORCE
    • Uses bootstrapping
    • Can be more sample efficient

    Natural Policy Gradient

    Uses the natural gradient \(\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta)\), where \(F(\theta)\) is the Fisher information matrix of the policy, rather than the ordinary gradient in parameter space:

    • More stable updates
    • Invariant to policy parameterization
    • Computationally more expensive
  • Baseline Subtraction

    To reduce the variance of policy gradient estimates, we often subtract a baseline from the returns (the sketch after this list shows the effect numerically):

    \[ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) (R(\tau) - b(s_t)) \right] \]

    Common choices for baseline \(b(s_t)\):

    • Average return
    • State value function \(V(s_t)\)
    • Time-dependent baseline
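
For completeness, the identity behind the theorem is the log-derivative (likelihood-ratio) trick. Because the environment dynamics do not depend on \(\theta\), \(\nabla_\theta \log \pi_\theta(\tau)\) reduces to the sum of per-step terms used above:

\[ \nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau = E_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right] \]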
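
To make the estimator and the effect of a baseline concrete, here is a minimal sketch (the toy one-step "bandit" setup and all names are illustrative, not from any library). A softmax policy chooses between two actions; the return is 1 for action 0 and 0 for action 1. The Monte Carlo gradient estimate is unbiased with or without the baseline, but subtracting the baseline sharply reduces its variance:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_policy(theta):
        z = theta - theta.max()          # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def grad_log_pi(theta, a):
        # d/d(theta) log softmax(theta)[a] = one_hot(a) - softmax(theta)
        g = -softmax_policy(theta)
        g[a] += 1.0
        return g

    theta = np.zeros(2)                  # uniform initial policy
    baseline = 0.5                       # average return under this policy
    grads_plain, grads_base = [], []

    for _ in range(5000):
        p = softmax_policy(theta)
        a = rng.choice(2, p=p)           # sample an action (a length-1 "trajectory")
        R = 1.0 if a == 0 else 0.0       # return of that trajectory
        g = grad_log_pi(theta, a)
        grads_plain.append(g * R)                 # grad log pi * R
        grads_base.append(g * (R - baseline))     # grad log pi * (R - b)

    print("mean estimate, no baseline:  ", np.mean(grads_plain, axis=0))
    print("mean estimate, with baseline:", np.mean(grads_base, axis=0))
    print("variance, no baseline:  ", np.var(grads_plain, axis=0))
    print("variance, with baseline:", np.var(grads_base, axis=0))

Here the baseline equals the expected return of the single state, so every sampled term is identical and the measured variance collapses to zero; in general, a state-value baseline \(V(s_t)\) lowers variance without biasing the gradient.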

Implementation

  • Code Example

    
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.distributions import Categorical
    
    class PolicyNetwork(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super(PolicyNetwork, self).__init__()
            self.network = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, output_size),
                nn.Softmax(dim=-1)
            )
        
        def forward(self, x):
            return self.network(x)
    
    class REINFORCE:
        def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
            self.policy = PolicyNetwork(input_size, hidden_size, output_size)
            self.optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)
            self.episode_rewards = []
            self.episode_log_probs = []
        
        def select_action(self, state):
            state = torch.FloatTensor(state)
            probs = self.policy(state)
            m = Categorical(probs)
            action = m.sample()
            self.episode_log_probs.append(m.log_prob(action))
            return action.item()
        
        def update(self, gamma=0.99):
            returns = []
            R = 0
            # Calculate returns
            for r in reversed(self.episode_rewards):
                R = r + gamma * R
                returns.insert(0, R)
            returns = torch.FloatTensor(returns)
            
            # Normalize returns
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            
            # Calculate loss
            policy_loss = []
            for log_prob, R in zip(self.episode_log_probs, returns):
                policy_loss.append(-log_prob * R)
            policy_loss = torch.stack(policy_loss).sum()  # log-probs are 0-dim tensors, so stack (cat would fail)
            
            # Update policy
            self.optimizer.zero_grad()
            policy_loss.backward()
            self.optimizer.step()
            
            # Clear episode data
            self.episode_rewards = []
            self.episode_log_probs = []
    
    # Example usage:
    def train(env, agent, num_episodes=1000):
        for episode in range(num_episodes):
            state = env.reset()  # classic Gym (<0.26) API; Gymnasium's reset() returns (obs, info)
            done = False
            
            while not done:
                action = agent.select_action(state)
                # Classic Gym (<0.26) step API; Gymnasium instead returns
                # (obs, reward, terminated, truncated, info).
                next_state, reward, done, _ = env.step(action)
                agent.episode_rewards.append(reward)
                state = next_state
            
            total_reward = sum(agent.episode_rewards)
            agent.update()
            
            if episode % 100 == 0:
                print(f"Episode {episode}, return: {total_reward:.1f}")
    
    # To use:
    # env = gym.make('CartPole-v1')
    # agent = REINFORCE(
    #     input_size=env.observation_space.shape[0],
    #     hidden_size=128,
    #     output_size=env.action_space.n
    # )
    # train(env, agent)
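
    The Common Variants section mentions actor-critic methods; below is a minimal one-step (TD(0)) actor-critic sketch in the same style, reusing the PolicyNetwork above as the actor. It is an illustrative sketch rather than a reference implementation: the ValueNetwork critic, class names, and hyperparameters are assumptions made for this example. The critic's one-step TD error serves as the advantage, replacing the Monte Carlo return used by REINFORCE and lowering the variance of the policy gradient.

    class ValueNetwork(nn.Module):
        """Critic: estimates the state value V(s)."""
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.network = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 1)
            )
        
        def forward(self, x):
            return self.network(x).squeeze(-1)
    
    class ActorCritic:
        def __init__(self, input_size, hidden_size, output_size,
                     lr=1e-3, gamma=0.99):
            self.actor = PolicyNetwork(input_size, hidden_size, output_size)
            self.critic = ValueNetwork(input_size, hidden_size)
            self.actor_opt = optim.Adam(self.actor.parameters(), lr=lr)
            self.critic_opt = optim.Adam(self.critic.parameters(), lr=lr)
            self.gamma = gamma
        
        def select_action(self, state):
            probs = self.actor(torch.as_tensor(state, dtype=torch.float32))
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action)
        
        def update(self, state, log_prob, reward, next_state, done):
            state = torch.as_tensor(state, dtype=torch.float32)
            next_state = torch.as_tensor(next_state, dtype=torch.float32)
            
            # One-step bootstrapped target: r + gamma * V(s'), zero at terminal states
            with torch.no_grad():
                target = reward + self.gamma * self.critic(next_state) * (1.0 - float(done))
            value = self.critic(state)
            advantage = target - value
            
            # Critic: regress V(s) toward the TD target
            critic_loss = advantage.pow(2)
            self.critic_opt.zero_grad()
            critic_loss.backward()
            self.critic_opt.step()
            
            # Actor: policy gradient weighted by the (detached) advantage
            actor_loss = -log_prob * advantage.detach()
            self.actor_opt.zero_grad()
            actor_loss.backward()
            self.actor_opt.step()

    Unlike the REINFORCE agent, this variant updates after every environment step rather than once per episode: call select_action, take the step, then pass (state, log_prob, reward, next_state, done) to update.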
    

Practice Questions

1. How would you implement this in a production environment? (Hard)

Hint: Consider scalability and efficiency

2. What are the practical applications of Policy Gradients? (Medium)

Hint: Consider both academic and industry use cases

3. Explain the core concepts of Policy Gradients. (Easy)

Hint: Think about the fundamental principles