Activation Functions

Overview

Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns. Without activation functions, neural networks would only be capable of learning linear relationships.
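
To see why non-linearity matters, note that stacking linear layers without an activation collapses into a single linear map. The NumPy sketch below (layer sizes and values are arbitrary, chosen only for illustration) checks this numerically.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two weight matrices with no activation in between (sizes are illustrative).
    W1 = rng.standard_normal((4, 8))
    W2 = rng.standard_normal((8, 3))
    x = rng.standard_normal((5, 4))

    two_linear_layers = x @ W1 @ W2      # a "two-layer" network without activations
    single_linear_layer = x @ (W1 @ W2)  # one equivalent linear layer

    print(np.allclose(two_linear_layers, single_linear_layer))  # True

    # Inserting a non-linearity breaks the collapse:
    with_relu = np.maximum(0, x @ W1) @ W2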

Key properties of activation functions:

  • Non-linearity
  • Differentiability (for backpropagation)
  • Range and monotonicity
  • Computational efficiency

Core Concepts

  • Sigmoid Function

    The sigmoid function squashes input values into the range (0, 1). It was historically popular but has fallen out of favor because of vanishing gradients and its non-zero-centered output.

    Properties:

    • Output range: (0, 1)
    • Smooth gradient
    • Suffers from vanishing gradient for extreme inputs
    • Output not zero-centered
    $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
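
    The vanishing-gradient issue can be read off the derivative, which peaks at 1/4 at x = 0 and decays toward zero for large |x|, so gradients shrink as they pass through many sigmoid layers:

    $$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \leq \tfrac{1}{4}$$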
  • Hyperbolic Tangent (tanh)

    The tanh function is similar to sigmoid but maps values to the range (-1, 1), making it zero-centered. This helps with the training dynamics of neural networks.

    Properties:

    • Output range: (-1, 1)
    • Zero-centered output
    • Still suffers from vanishing gradient for extreme inputs
    $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1$$
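
    Its derivative is larger than sigmoid's near zero (bounded by 1 rather than 1/4), but it still vanishes for extreme inputs:

    $$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) \leq 1$$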
  • Rectified Linear Unit (ReLU)

    ReLU is currently the most widely used activation function due to its computational efficiency and ability to mitigate the vanishing gradient problem.

    Properties:

    • Output range: [0, ∞)
    • Computationally efficient
    • Helps mitigate the vanishing gradient problem
    • Non-differentiable at x=0
    • Suffers from the "dying ReLU" problem: a neuron whose pre-activation becomes negative for all inputs (for example after a large weight update) outputs zero, receives zero gradient, and stops learning
    $$\text{ReLU}(x) = \max(0,x)$$
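
    Both the non-differentiability at zero and the dying-ReLU behavior show up in the derivative, which is exactly zero for all negative inputs (frameworks conventionally assign 0 or 1 at x = 0):

    $$\frac{d}{dx}\,\text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$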
  • Leaky ReLU and Variants

    Leaky ReLU and its variants address the dying ReLU problem by allowing a small, non-zero gradient when the unit is not active.

    Variants include:

    • Leaky ReLU: Uses a small fixed slope for negative inputs
    • Parametric ReLU (PReLU): Learns the slope parameter during training
    • Exponential Linear Unit (ELU): Smooths the function with exponential behavior for negative inputs
    • Scaled Exponential Linear Unit (SELU): Self-normalizes activations
    $$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$
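
    For comparison, ELU replaces the fixed negative slope with a smooth exponential branch (the same form implemented by the elu helper in the code below):

    $$\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha \left(e^{x} - 1\right) & \text{if } x \leq 0 \end{cases}$$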
  • Softmax Function

    The softmax function normalizes an N-dimensional vector of arbitrary real values to a probability distribution. It's commonly used in the output layer of classification networks.

    Properties:

    • Outputs sum to 1, representing a probability distribution
    • Emphasizes the largest values while suppressing significantly smaller ones
    • Applied to the final layer for multi-class classification problems
    $$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$
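
    Softmax is unchanged by adding the same constant to every input, which is why implementations subtract the maximum before exponentiating to avoid overflow (as the softmax helper below does):

    $$\text{softmax}(x_i + c) = \frac{e^{x_i + c}}{\sum_{j=1}^{n} e^{x_j + c}} = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} = \text{softmax}(x_i)$$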

Implementation

  • Common Activation Functions in Python

    This section provides Python implementations of common activation functions using NumPy and shows how to visualize them and their derivatives using Matplotlib. Understanding their shapes and gradients helps explain training behavior such as vanishing gradients and dead units.

    import numpy as np
    import matplotlib.pyplot as plt
    # Note: PyTorch is imported but not used in this specific snippet from the original content.
    # import torch
    # import torch.nn as nn
    # import torch.nn.functional as F
    
    # --- Define activation functions ---
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    def tanh(x):
        return np.tanh(x)
    
    def relu(x):
        return np.maximum(0, x)
    
    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)
    
    def elu(x, alpha=1.0):
        # Ensure x is a NumPy array for np.exp to work element-wise
        x_arr = np.asarray(x)
        return np.where(x_arr > 0, x_arr, alpha * (np.exp(x_arr) - 1))
    
    def softmax(x):
        # Ensure x is a NumPy array
        x_arr = np.asarray(x)
        # Subtract max for numerical stability, crucial for avoiding overflow with exp
        exp_x = np.exp(x_arr - np.max(x_arr, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
    
    # --- Plot activation functions and their derivatives ---
    def plot_activation_functions():
        x = np.linspace(-5, 5, 200) # Reduced points for brevity in example
        
        plt.figure(figsize=(12, 8)) # Adjusted figure size
    
        # Sigmoid
        plt.subplot(2, 3, 1)
        plt.plot(x, sigmoid(x), label='sigmoid(x)')
        plt.plot(x, sigmoid(x) * (1 - sigmoid(x)), label="sigmoid'(x)", linestyle='--')
        plt.title('Sigmoid')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.grid(True)
        plt.legend()
    
        # Tanh
        plt.subplot(2, 3, 2)
        plt.plot(x, tanh(x), label='tanh(x)')
        plt.plot(x, 1 - tanh(x)**2, label="tanh'(x)", linestyle='--')
        plt.title('Tanh')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.grid(True)
        plt.legend()
    
        # ReLU
        plt.subplot(2, 3, 3)
        plt.plot(x, relu(x), label='ReLU(x)')
        plt.plot(x, np.where(x > 0, 1, 0), label="ReLU'(x)", linestyle='--')  # Derivative is 0 for x<0, 1 for x>0
        plt.title('ReLU')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.grid(True)
        plt.legend()
    
        # Leaky ReLU
        plt.subplot(2, 3, 4)
        alpha_leaky = 0.1 # Using a more visible alpha for plotting
        plt.plot(x, leaky_relu(x, alpha=alpha_leaky), label=f'Leaky ReLU(x, α={alpha_leaky})')
        plt.plot(x, np.where(x > 0, 1, alpha_leaky), label=f"Leaky ReLU'(x, α={alpha_leaky})", linestyle='--')
        plt.title('Leaky ReLU')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.grid(True)
        plt.legend()
    
        # ELU
        plt.subplot(2, 3, 5)
        alpha_elu = 1.0
        plt.plot(x, elu(x, alpha=alpha_elu), label=f'ELU(x, α={alpha_elu})')
        plt.plot(x, np.where(x > 0, 1, alpha_elu * np.exp(x)), label=f"ELU'(x, α={alpha_elu})", linestyle='--')
        plt.title('ELU')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.grid(True)
        plt.legend()
        
        # Softmax (visualization is tricky for a 1D input, usually applied to a vector)
        # For demonstration, let's show softmax on a small vector
        softmax_example_input = np.array([1.0, 2.0, 0.5])
        softmax_output = softmax(softmax_example_input)
        # print(f"Softmax example: input {softmax_example_input}, output {softmax_output}, sum {np.sum(softmax_output)}")
        # Plotting softmax directly like others isn't standard as it's a vector function.
        # Instead, we can show its effect on a sample vector.
        plt.subplot(2, 3, 6)
        labels = ['x1', 'x2', 'x3']
        x_indices = np.arange(len(softmax_example_input))
        plt.bar(x_indices - 0.2, softmax_example_input, width=0.4, label='Input values')
        plt.bar(x_indices + 0.2, softmax_output, width=0.4, label='Softmax output')
        plt.xticks(x_indices, labels)
        plt.title('Softmax Example')
        plt.ylabel('Value / Probability')
        plt.legend()
        plt.grid(True)
    
        plt.suptitle("Activation Functions and Their Derivatives", fontsize=16)
        plt.tight_layout(rect=[0, 0, 1, 0.96]) # Adjust layout to make space for suptitle
        # plt.show()  # Blocks execution; typically omitted when generating figures non-interactively
        # To save the plot to a file instead of showing:
        # plt.savefig("activation_functions_plot.png") 
        # plt.close() # Close the figure to free memory
    
    # To generate and potentially save the plot (uncomment if needed):
    # plot_activation_functions()
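
    As a quick usage check (an illustrative addition, not part of the original snippet), the helpers above can be evaluated at a few sample points to confirm the properties listed earlier:

    # --- Sanity checks for the helpers above ---
    x_test = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(x_test))                 # values in (0, 1); sigmoid(0) = 0.5
    print(tanh(x_test))                    # values in (-1, 1); tanh(0) = 0.0
    print(relu(x_test))                    # [0. 0. 2.]
    print(leaky_relu(x_test, alpha=0.01))  # [-0.02  0.    2.  ]
    probs = softmax(np.array([1.0, 2.0, 0.5]))
    print(probs, probs.sum())              # a probability vector summing to 1.0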
    

Interview Examples

Comparing Activation Functions

Explain the trade-offs between different activation functions and when to use each.

# Activation function comparison:
#
# 1. Sigmoid:
#    - Pros: Smooth gradient, output bounded between 0 and 1
#    - Cons: Vanishing gradient, not zero-centered, computationally expensive
#    - Use case: Output layer for binary classification
# 2. Tanh:
#    - Pros: Zero-centered, output bounded between -1 and 1
#    - Cons: Still suffers from vanishing gradient
#    - Use case: Hidden layers when zero-centered output is important
# 3. ReLU:
#    - Pros: Computationally efficient, mitigates vanishing gradient
#    - Cons: "Dying ReLU" problem, not zero-centered
#    - Use case: Default choice for hidden layers in CNNs and many other networks
# 4. Leaky ReLU:
#    - Pros: Fixes dying ReLU problem, all benefits of ReLU
#    - Cons: Additional hyperparameter (leak coefficient)
#    - Use case: When ReLUs are dying
# 5. Softmax:
#    - Pros: Outputs probability distribution
#    - Cons: Used only in the output layer for multi-class classification
#    - Use case: Output layer for multi-class classification

Implementing Custom Activation Functions in PyTorch

How would you implement custom activation functions in PyTorch?

import torch
import torch.nn as nn
import torch.nn.functional as F

# Method 1: Using nn.Module (for activation functions with parameters)
class PReLU(nn.Module):
    def __init__(self, alpha=0.01):
        super(PReLU, self).__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, x):
        return torch.where(x > 0, x, self.alpha * x)

# Method 2: Using the functional API (for simpler functions)
def swish(x):
    return x * torch.sigmoid(x)

# Method 3: Using lambda functions
mish = lambda x: x * torch.tanh(F.softplus(x))

# Using the custom activations in a neural network
class CustomNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(CustomNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.prelu = PReLU()  # Method 1
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.prelu(self.fc1(x))
        x = swish(self.fc2(x))  # Method 2
        x = self.fc3(x)
        x = mish(x)             # Method 3
        return x
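
A quick usage sketch for the network above (the dimensions and batch size are arbitrary, chosen only for illustration):

# Hypothetical dimensions, for demonstration only
model = CustomNN(input_dim=16, hidden_dim=32, output_dim=3)
x = torch.randn(4, 16)   # batch of 4 random inputs
logits = model(x)
print(logits.shape)      # torch.Size([4, 3])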

Practice Questions

1. Implement a simple feedforward neural network using NumPy (Hard)

Hint: Break it down into initialization, forward pass, and backward pass

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # Expects x to already be a sigmoid output, i.e. x = sigmoid(z)
    return x * (1 - x)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.random.rand(4, 1)
        self.y = y
        self.output = np.zeros(y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        d_weights2 = np.dot(self.layer1.T,
                            2 * (self.y - self.output) * sigmoid_derivative(self.output))
        d_weights1 = np.dot(self.input.T,
                            np.dot(2 * (self.y - self.output) * sigmoid_derivative(self.output),
                                   self.weights2.T) * sigmoid_derivative(self.layer1))
        self.weights1 += d_weights1
        self.weights2 += d_weights2
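
A minimal training sketch for the class above, assuming a tiny XOR-style dataset (the data and iteration count are illustrative, not from the original):

# Hypothetical XOR-style data, for illustration only
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

net = NeuralNetwork(X, y)
for _ in range(1500):
    net.feedforward()
    net.backprop()
print(net.output)  # predictions should move toward [0, 1, 1, 0]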

2. How does the vanishing gradient problem affect deep networks? (Hard)

Hint: Consider what happens to gradients in very deep networks with certain activation functions
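
One way to build intuition (an illustrative sketch, not a complete answer): the sigmoid derivative never exceeds 0.25, so the factor that backpropagation multiplies in at each sigmoid layer shrinks the gradient geometrically with depth.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

# Best case for sigmoid: pre-activations at 0, where the derivative peaks at 0.25.
# Even then the chained activation factor decays geometrically with depth
# (weight terms are omitted to isolate the activation's contribution).
for depth in (5, 10, 20):
    print(depth, sigmoid_grad(0.0) ** depth)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13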

3. Explain how backpropagation works in a neural network (Medium)

Hint: Think about the chain rule from calculus
$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial w_{ij}}$$
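
Expanding one step further with a generic activation f and pre-activation z_j (a standard decomposition, not specific to this document):

$$y_j = f(z_j), \quad z_j = \sum_i w_{ij} x_i \;\Rightarrow\; \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_j}\, f'(z_j)\, x_i$$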