Policy Gradients

Overview

Policy gradient methods are a class of reinforcement learning algorithms that directly optimize the policy by estimating the gradient of the expected return with respect to the policy parameters. Unlike value-based methods such as Q-learning, which derive a policy indirectly from learned value estimates, policy gradient methods learn a parameterized policy that maps states directly to actions (or to distributions over actions).

These methods are particularly useful for:

  • Continuous action spaces where value-based methods struggle
  • Stochastic policies where exploration is part of the policy
  • Problems where the optimal policy is easier to learn than the optimal value function

Core Concepts

  • Policy Gradient Theorem

    The policy gradient theorem provides the theoretical foundation for policy gradient methods; a short derivation and a numerical sketch of the resulting estimator are given at the end of this list. It states that the gradient of the expected return with respect to the policy parameters can be written as:

    \[ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right] \]

    Where:

    • \(\theta\) are the policy parameters
    • \(\pi_\theta\) is the parameterized policy
    • \(\tau\) is a trajectory (sequence of states, actions, rewards)
    • \(R(\tau)\) is the return of the trajectory
  • Advantages Over Value-Based Methods

    • Natural for Continuous Actions: Policy can output continuous action values directly
    • Stochastic Policies: Can learn optimal stochastic policies
    • Better Convergence: Small parameter changes alter the policy smoothly, which often gives more stable convergence than the abrupt policy switches of greedy value-based methods
    • Policy Structure: Can incorporate prior knowledge about the desired policy structure
  • Common Variants

    REINFORCE (Monte Carlo Policy Gradient)

    The basic policy gradient algorithm that uses complete episode returns:

    • Simple but high variance
    • Uses trajectory sampling
    • No bootstrapping

    Actor-Critic Methods

    Combines policy gradients with a learned value function (the critic) used for bootstrapping and variance reduction; a minimal actor-critic sketch follows the REINFORCE implementation below:

    • Lower variance than REINFORCE
    • Uses bootstrapping
    • Can be more sample efficient

    Natural Policy Gradient

    Uses the natural gradient \(\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta)\), where \(F(\theta)\) is the Fisher information matrix of the policy, rather than the ordinary gradient in parameter space:

    • More stable updates
    • Invariant to policy parameterization
    • Computationally more expensive
  • Baseline Subtraction

    To reduce the variance of policy gradient estimates, we often subtract a baseline from the returns (the sketch after this list shows the effect numerically):

    \[ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) (R(\tau) - b(s_t)) \right] \]

    Common choices for baseline \(b(s_t)\):

    • Average return
    • State value function \(V(s_t)\)
    • Time-dependent baseline
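
For completeness, the identity behind the theorem is the log-derivative (likelihood-ratio) trick. Because the environment dynamics do not depend on \(\theta\), \(\nabla_\theta \log \pi_\theta(\tau)\) reduces to the sum of per-step terms used above:

\[ \nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau = E_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \right] \]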
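
To make the estimator and the effect of a baseline concrete, here is a minimal sketch (the toy one-step "bandit" setup and all names are illustrative, not from any library). A softmax policy chooses between two actions; the return is 1 for action 0 and 0 for action 1. The Monte Carlo gradient estimate is unbiased with or without the baseline, but subtracting the baseline sharply reduces its variance:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_policy(theta):
        z = theta - theta.max()          # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def grad_log_pi(theta, a):
        # d/d(theta) log softmax(theta)[a] = one_hot(a) - softmax(theta)
        g = -softmax_policy(theta)
        g[a] += 1.0
        return g

    theta = np.zeros(2)                  # uniform initial policy
    baseline = 0.5                       # average return under this policy
    grads_plain, grads_base = [], []

    for _ in range(5000):
        p = softmax_policy(theta)
        a = rng.choice(2, p=p)           # sample an action (a length-1 "trajectory")
        R = 1.0 if a == 0 else 0.0       # return of that trajectory
        g = grad_log_pi(theta, a)
        grads_plain.append(g * R)                 # grad log pi * R
        grads_base.append(g * (R - baseline))     # grad log pi * (R - b)

    print("mean estimate, no baseline:  ", np.mean(grads_plain, axis=0))
    print("mean estimate, with baseline:", np.mean(grads_base, axis=0))
    print("variance, no baseline:  ", np.var(grads_plain, axis=0))
    print("variance, with baseline:", np.var(grads_base, axis=0))

Here the baseline equals the expected return of the single state, so every sampled term is identical and the measured variance collapses to zero; in general, a state-value baseline \(V(s_t)\) lowers variance without biasing the gradient.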

Implementation

  • Code Example

    
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.distributions import Categorical
    
    class PolicyNetwork(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super(PolicyNetwork, self).__init__()
            self.network = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, output_size),
                nn.Softmax(dim=-1)
            )
        
        def forward(self, x):
            return self.network(x)
    
    class REINFORCE:
        def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
            self.policy = PolicyNetwork(input_size, hidden_size, output_size)
            self.optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)
            self.episode_rewards = []
            self.episode_log_probs = []
        
        def select_action(self, state):
            state = torch.FloatTensor(state)
            probs = self.policy(state)
            m = Categorical(probs)
            action = m.sample()
            self.episode_log_probs.append(m.log_prob(action))
            return action.item()
        
        def update(self, gamma=0.99):
            returns = []
            R = 0
            # Calculate returns
            for r in reversed(self.episode_rewards):
                R = r + gamma * R
                returns.insert(0, R)
            returns = torch.FloatTensor(returns)
            
            # Normalize returns
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            
            # Calculate loss
            policy_loss = []
            for log_prob, R in zip(self.episode_log_probs, returns):
                policy_loss.append(-log_prob * R)
            policy_loss = torch.stack(policy_loss).sum()  # log-probs are 0-dim tensors, so stack (cat would fail)
            
            # Update policy
            self.optimizer.zero_grad()
            policy_loss.backward()
            self.optimizer.step()
            
            # Clear episode data
            self.episode_rewards = []
            self.episode_log_probs = []
    
    # Example usage:
    def train(env, agent, num_episodes=1000):
        for episode in range(num_episodes):
            state = env.reset()  # classic Gym (<0.26) API; Gymnasium's reset() returns (obs, info)
            done = False
            
            while not done:
                action = agent.select_action(state)
                # Classic Gym (<0.26) step API; Gymnasium instead returns
                # (obs, reward, terminated, truncated, info).
                next_state, reward, done, _ = env.step(action)
                agent.episode_rewards.append(reward)
                state = next_state
            
            total_reward = sum(agent.episode_rewards)
            agent.update()
            
            if episode % 100 == 0:
                print(f"Episode {episode}, return: {total_reward:.1f}")
    
    # To use:
    # env = gym.make('CartPole-v1')
    # agent = REINFORCE(
    #     input_size=env.observation_space.shape[0],
    #     hidden_size=128,
    #     output_size=env.action_space.n
    # )
    # train(env, agent)
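
    The Common Variants section mentions actor-critic methods; below is a minimal one-step (TD(0)) actor-critic sketch in the same style, reusing the PolicyNetwork above as the actor. It is an illustrative sketch rather than a reference implementation: the ValueNetwork critic, class names, and hyperparameters are assumptions made for this example. The critic's one-step TD error serves as the advantage, replacing the Monte Carlo return used by REINFORCE and lowering the variance of the policy gradient.

    class ValueNetwork(nn.Module):
        """Critic: estimates the state value V(s)."""
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.network = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 1)
            )
        
        def forward(self, x):
            return self.network(x).squeeze(-1)
    
    class ActorCritic:
        def __init__(self, input_size, hidden_size, output_size,
                     lr=1e-3, gamma=0.99):
            self.actor = PolicyNetwork(input_size, hidden_size, output_size)
            self.critic = ValueNetwork(input_size, hidden_size)
            self.actor_opt = optim.Adam(self.actor.parameters(), lr=lr)
            self.critic_opt = optim.Adam(self.critic.parameters(), lr=lr)
            self.gamma = gamma
        
        def select_action(self, state):
            probs = self.actor(torch.as_tensor(state, dtype=torch.float32))
            dist = Categorical(probs)
            action = dist.sample()
            return action.item(), dist.log_prob(action)
        
        def update(self, state, log_prob, reward, next_state, done):
            state = torch.as_tensor(state, dtype=torch.float32)
            next_state = torch.as_tensor(next_state, dtype=torch.float32)
            
            # One-step bootstrapped target: r + gamma * V(s'), zero at terminal states
            with torch.no_grad():
                target = reward + self.gamma * self.critic(next_state) * (1.0 - float(done))
            value = self.critic(state)
            advantage = target - value
            
            # Critic: regress V(s) toward the TD target
            critic_loss = advantage.pow(2)
            self.critic_opt.zero_grad()
            critic_loss.backward()
            self.critic_opt.step()
            
            # Actor: policy gradient weighted by the (detached) advantage
            actor_loss = -log_prob * advantage.detach()
            self.actor_opt.zero_grad()
            actor_loss.backward()
            self.actor_opt.step()

    Unlike the REINFORCE agent, this variant updates after every environment step rather than once per episode: call select_action, take the step, then pass (state, log_prob, reward, next_state, done) to update.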
    

Practice Questions

1. How would you implement this in a production environment? (Hard)

Hint: Consider scalability and efficiency

2. What are the practical applications of Policy Gradients? (Medium)

Hint: Consider both academic and industry use cases

3. Explain the core concepts of Policy Gradients. (Easy)

Hint: Think about the fundamental principles