Multi-agent Systems

Overview

Multi-agent systems (MAS) in reinforcement learning involve multiple agents learning and interacting simultaneously in a shared environment. Each agent must learn to maximize its own rewards while considering the actions and objectives of other agents, leading to complex dynamics and emergent behaviors.

These systems are particularly important in:

  • Game theory and competitive scenarios
  • Cooperative task solving
  • Distributed control systems
  • Social simulations

Core Concepts

  • Types of Multi-agent Interactions

    Cooperative

    • Agents work together to achieve a common goal
    • Shared reward structure
    • Focus on coordination and communication

    Competitive

    • Agents compete for resources or opposing goals
    • Zero-sum or general-sum games (a small payoff sketch follows the Core Concepts list below)
    • Strategic behavior and opponent modeling

    Mixed

    • Combination of cooperation and competition
    • Team-based scenarios
    • Coalition formation
  • Key Challenges

    Non-Stationarity

    From each agent's perspective, the environment appears non-stationary because the other agents are simultaneously learning and changing their policies:

    • Violates the stationarity and Markov assumptions that single-agent methods rely on
    • Makes convergence harder to achieve
    • Requires adaptive learning strategies

    Scalability

    Issues that arise as the number of agents increases:

    • Exponential growth in joint action space
    • Communication overhead
    • Coordination complexity

    Credit Assignment

    Difficulty in determining each agent's contribution:

    • Global vs. local rewards (a difference-reward sketch follows the Core Concepts list below)
    • Delayed effects of actions
    • Interdependencies between agents
  • Learning Approaches

    Independent Learning

    • Each agent learns independently
    • Treats other agents as part of environment
    • Simple but can be unstable

    Centralized Training with Decentralized Execution

    • Training uses global information
    • Execution only requires local observations
    • Better coordination while maintaining scalability (a minimal actor/centralized-critic sketch appears in the Implementation section)

    Fully Centralized

    • Single controller for all agents
    • Global optimization
    • Limited scalability
  • Communication and Coordination

    Explicit Communication

    • Direct message passing between agents
    • Learned communication protocols
    • Bandwidth constraints

    Implicit Coordination

    • Coordination through observation of others
    • Emergent team strategies
    • Role specialization

    Hierarchical Organization

    • Leader-follower structures
    • Task decomposition
    • Role assignment
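
To make the cooperative and competitive reward structures above concrete, here is a minimal sketch; the payoff values are the standard matching-pennies matrix, and the two-agent setup is an illustrative assumption:

    import numpy as np

    # Cooperative: a single team reward is shared by (copied to) every agent.
    team_reward = 1.0
    cooperative_rewards = np.full(2, team_reward)             # -> [1.0, 1.0]

    # Competitive (zero-sum): matching pennies. Entry [i, j] is the row player's payoff;
    # the column player's payoff is its negation, so the rewards always sum to zero.
    payoff_row = np.array([[+1, -1],
                           [-1, +1]])
    i, j = 0, 1                                               # row plays heads, column plays tails
    competitive_rewards = np.array([payoff_row[i, j], -payoff_row[i, j]])
    print(cooperative_rewards, competitive_rewards, competitive_rewards.sum())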

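A common way to attack the credit-assignment problem above is a difference reward: compare the global reward with what it would have been had agent i instead taken a default action, so each agent is credited only with its own marginal contribution. The global-reward function below is a made-up toy; only the counterfactual structure matters:

    import numpy as np

    def global_reward(joint_action):
        # Toy team objective: the team is paid only if at least one agent picks action 1.
        return 1.0 if any(a == 1 for a in joint_action) else 0.0

    def difference_rewards(joint_action, default_action=0):
        g = global_reward(joint_action)
        credits = []
        for i in range(len(joint_action)):
            counterfactual = list(joint_action)
            counterfactual[i] = default_action                # replace only agent i's action
            credits.append(g - global_reward(counterfactual))
        return np.array(credits)

    print(difference_rewards([1, 0]))   # agent 0 made the difference -> [1. 0.]
    print(difference_rewards([1, 1]))   # either agent alone would suffice -> [0. 0.]
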
Implementation

  • Code Example

    
    import numpy as np
    from collections import defaultdict
    
    class MultiAgentEnvironment:
        def __init__(self, num_agents=2, grid_size=5):
            self.num_agents = num_agents
            self.grid_size = grid_size
            self.reset()
        
        def reset(self):
            # Initialize random positions for agents
            self.agent_positions = np.random.randint(0, self.grid_size, size=(self.num_agents, 2))
            self.food_position = np.random.randint(0, self.grid_size, size=2)
            return self._get_observations()
        
        def _get_observations(self):
            observations = []
            for agent_idx in range(self.num_agents):
                # Observation includes: agent's position, other agents' positions, food position
                obs = np.concatenate([
                    self.agent_positions[agent_idx],
                    self.agent_positions[np.arange(self.num_agents) != agent_idx].flatten(),
                    self.food_position
                ])
                observations.append(obs)
            return observations
        
        def step(self, actions):
            # Actions: 0=up, 1=down, 2=left, 3=right
            rewards = np.zeros(self.num_agents)
            
            # Move agents
            for agent_idx, action in enumerate(actions):
                if action == 0:  # Up
                    self.agent_positions[agent_idx][0] = max(0, self.agent_positions[agent_idx][0] - 1)
                elif action == 1:  # Down
                    self.agent_positions[agent_idx][0] = min(self.grid_size-1, self.agent_positions[agent_idx][0] + 1)
                elif action == 2:  # Left
                    self.agent_positions[agent_idx][1] = max(0, self.agent_positions[agent_idx][1] - 1)
                elif action == 3:  # Right
                    self.agent_positions[agent_idx][1] = min(self.grid_size-1, self.agent_positions[agent_idx][1] + 1)
            
            # Check for food collection (cooperative reward)
            for agent_idx in range(self.num_agents):
                if np.array_equal(self.agent_positions[agent_idx], self.food_position):
                    rewards += 1.0  # All agents get reward when any agent reaches food
                    self.food_position = np.random.randint(0, self.grid_size, size=2)
            
            # Small shaping penalty based on each agent's distance to the food
            for agent_idx in range(self.num_agents):
                distance_to_food = np.linalg.norm(self.agent_positions[agent_idx] - self.food_position)
                rewards[agent_idx] -= 0.1 * distance_to_food
            
            done = False  # In this simple environment, episodes don't end
            return self._get_observations(), rewards, done
    
    class IndependentQLearningAgent:
        def __init__(self, action_size, learning_rate=0.01):
            self.action_size = action_size
            # Tabular Q-values keyed by the observation's byte string, created lazily;
            # a fixed-size table would require enumerating every possible observation
            self.q_table = defaultdict(lambda: np.zeros(action_size))
            self.lr = learning_rate
            self.gamma = 0.95
            self.epsilon = 0.1
        
        def select_action(self, state):
            # Epsilon-greedy over this agent's own Q-values
            if np.random.random() < self.epsilon:
                return np.random.randint(self.action_size)
            return int(np.argmax(self.q_table[state]))
        
        def learn(self, state, action, reward, next_state):
            # Standard Q-learning update; other agents are implicitly treated as part of the environment
            old_value = self.q_table[state][action]
            next_max = np.max(self.q_table[next_state])
            self.q_table[state][action] = (1 - self.lr) * old_value + self.lr * (reward + self.gamma * next_max)
    
    # Example usage:
    def train_independent_q_learning(num_episodes=1000):
        env = MultiAgentEnvironment(num_agents=2, grid_size=5)
        agents = [
            IndependentQLearningAgent(action_size=4)
            for _ in range(env.num_agents)
        ]
        
        for episode in range(num_episodes):
            states = env.reset()
            total_rewards = np.zeros(env.num_agents)
            
            for step in range(100):  # Max steps per episode
                # Select actions
                actions = [agent.select_action(state.tobytes()) for agent, state in zip(agents, states)]
                
                # Environment step
                next_states, rewards, done = env.step(actions)
                
                # Learn
                for agent_idx in range(env.num_agents):
                    agents[agent_idx].learn(
                        states[agent_idx].tobytes(),
                        actions[agent_idx],
                        rewards[agent_idx],
                        next_states[agent_idx].tobytes()
                    )
                
                total_rewards += rewards
                states = next_states
                
                if done:
                    break
            
            if episode % 100 == 0:
                print(f"Episode {episode}, Average Rewards: {total_rewards / 100}")
    
    # To run:
    # train_independent_q_learning()
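
  • Centralized Critic Sketch

    The centralized training with decentralized execution approach listed under Core Concepts can be sketched as one actor per agent that sees only its local observation, plus a critic that is used only during training and scores the joint observation-action pair (in the spirit of methods such as MADDPG). The class names and hidden sizes below are illustrative assumptions; the dimensions match the toy grid environment above (6-dimensional observations, 4 actions, 2 agents).

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Per-agent policy: maps a local observation to logits over discrete actions."""
        def __init__(self, obs_dim, action_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim),
            )

        def forward(self, obs):
            return self.net(obs)

    class CentralizedCritic(nn.Module):
        """Training-time critic: sees every agent's observation and (one-hot) action."""
        def __init__(self, num_agents, obs_dim, action_dim, hidden=64):
            super().__init__()
            joint_dim = num_agents * (obs_dim + action_dim)
            self.net = nn.Sequential(
                nn.Linear(joint_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, joint_obs, joint_actions):
            return self.net(torch.cat([joint_obs, joint_actions], dim=-1))

    # Forward-pass demo with the toy dimensions from the grid environment above.
    num_agents, obs_dim, action_dim = 2, 6, 4
    actors = [Actor(obs_dim, action_dim) for _ in range(num_agents)]
    critic = CentralizedCritic(num_agents, obs_dim, action_dim)

    observations = [torch.randn(obs_dim) for _ in range(num_agents)]   # local observations only
    actions = [int(torch.argmax(actor(obs))) for actor, obs in zip(actors, observations)]
    one_hots = [nn.functional.one_hot(torch.tensor(a), action_dim).float() for a in actions]

    # Execution needs only the actors; the centralized critic is a training-time tool.
    joint_value = critic(torch.cat(observations), torch.cat(one_hots))
    print("actions:", actions, "centralized value estimate:", joint_value.item())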
    

Practice Questions

1. How would you implement this in a production environment? (Hard)

Hint: Consider scalability and efficiency

2. Explain the core concepts of multi-agent systems. (Easy)

Hint: Think about the fundamental principles

3. What are the practical applications of multi-agent systems? (Medium)

Hint: Consider both academic and industry use cases