Initialization

Overview

Weight initialization is crucial for training deep neural networks effectively. Proper initialization helps achieve faster convergence and better final performance by setting appropriate initial conditions for optimization.

Key aspects:

  • Affects training dynamics
  • Impacts convergence speed
  • Prevents vanishing/exploding gradients (see the variance argument after this list)
  • Layer-specific considerations
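
The vanishing/exploding-gradient point above comes down to how the variance of a layer's pre-activations scales with its fan-in. A standard back-of-the-envelope argument, assuming zero-mean, mutually independent weights and inputs:

$$ z_j = \sum_{i=1}^{n_{in}} W_{ij} x_i \quad \Rightarrow \quad \text{Var}(z_j) = n_{in} \, \text{Var}(W) \, \text{Var}(x) $$

If Var(W) is much smaller than 1/n_in, the signal shrinks geometrically with depth; if it is much larger, the signal blows up. Choosing Var(W) on the order of 1/n_in keeps the signal roughly constant from layer to layer, which is exactly what the Xavier, LeCun, and He formulas in the next section do (He adds a factor of 2 because ReLU zeroes roughly half of the activations).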

Core Concepts

  • Basic Initialization Methods

    Common approaches to weight initialization:

    • Zero/Constant: Generally not recommended (all hidden units start out identical and stay symmetric)
    • Random Normal: Simple Gaussian noise
    • Random Uniform: Uniform distribution
    • Glorot/Xavier: Variance scaling
    $$ \text{Normal}: W \sim \mathcal{N}(0, \sigma^2) $$
    $$ \text{Uniform}: W \sim U(-a, a) $$
    $$ \text{Xavier}: \sigma^2 = \frac{2}{n_{in} + n_{out}} $$
  • Modern Initialization

    Advanced initialization techniques for deep networks:

    • He/Kaiming: ReLU networks
    • LeCun: Normalized initialization
    • Orthogonal: Preserves gradient norm
    • LSUV: Layer-sequential unit-variance initialization
    $$ \text{He}: \sigma^2 = \frac{2}{n_{in}} $$
    $$ \text{LeCun}: \sigma^2 = \frac{1}{n_{in}} $$
    $$ \text{Orthogonal}: W = Q \text{ where } Q^T Q = I $$
  • Layer-Specific Methods

    Initialization strategies for different layer types:

    • Convolutional layers: Fan-in/out
    • Recurrent layers: Identity/orthogonal
    • Attention layers: Scaled initialization
    • Residual blocks: Zero initialization of the last layer in each residual branch, so each block starts as an identity map
    $$ \text{Conv2D}: \sigma^2 = \frac{2}{\text{fan}_{in}} $$
    $$ \text{LSTM}: W_h = I + \epsilon $$
    $$ \text{Attention}: Q, K, V \sim \mathcal{N}(0, \frac{1}{d_{model}}) $$

    A NumPy sketch of the Xavier, He, LeCun, and orthogonal schemes above follows this list.
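
The sketch below draws weights at each scheme's target standard deviation; the helper name init_weight and the sizes used in the final check are illustrative choices rather than part of any library.

    import numpy as np

    def init_weight(n_in, n_out, method="xavier", seed=0):
        """Draw an (n_in, n_out) weight matrix with the named scheme's variance."""
        rng = np.random.default_rng(seed)
        if method == "xavier":        # sigma^2 = 2 / (n_in + n_out)
            return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), (n_in, n_out))
        if method == "he":            # sigma^2 = 2 / n_in (ReLU networks)
            return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out))
        if method == "lecun":         # sigma^2 = 1 / n_in
            return rng.normal(0.0, np.sqrt(1.0 / n_in), (n_in, n_out))
        if method == "orthogonal":    # Q from the QR decomposition of a Gaussian matrix
            a = rng.normal(size=(n_in, n_out))       # assumes n_in >= n_out
            q, r = np.linalg.qr(a)
            return q * np.sign(np.diag(r))           # sign correction used by standard orthogonal initializers
        raise ValueError(f"unknown method: {method}")

    # The empirical standard deviation should match the target formula.
    W = init_weight(256, 128, method="he")
    print(W.std(), np.sqrt(2.0 / 256))               # both close to 0.088

In practice, deep-learning frameworks ship these schemes as built-in initializers; the PyTorch example under Interview Examples uses one of them.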

Implementation

  • Manual Initialization Implementation

    A from-scratch implementation of weight initialization, showing how the schemes are applied in practice:

    • Forward pass computation
    • Weight initialization
    • Training loop integration
    
    import numpy as np

    class SimpleInitializedNetwork:
        """Small fully connected network used to compare initialization schemes."""

        def __init__(self, layer_sizes, method='xavier'):
            self.weights = []
            self.biases = []
            for i in range(len(layer_sizes) - 1):
                if method == 'xavier':
                    # Xavier/Glorot: sigma^2 = 2 / (n_in + n_out)
                    std = np.sqrt(2.0 / (layer_sizes[i] + layer_sizes[i+1]))
                    W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * std
                elif method == 'he':
                    # He/Kaiming: sigma^2 = 2 / n_in, suited to ReLU-like activations
                    std = np.sqrt(2.0 / layer_sizes[i])
                    W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * std
                else:
                    # Fallback: small constant-scale Gaussian noise
                    W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.01
                self.weights.append(W)
                # Biases are conventionally initialized to zero
                self.biases.append(np.zeros((1, layer_sizes[i+1])))

        def sigmoid(self, x):
            return 1 / (1 + np.exp(-x))

        def forward(self, X):
            # Propagate the input through each layer with a sigmoid nonlinearity
            activation = X
            for W, b in zip(self.weights, self.biases):
                z = np.dot(activation, W) + b
                activation = self.sigmoid(z)
            return activation

    # Example usage
    if __name__ == "__main__":
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
        nn = SimpleInitializedNetwork([2, 4, 1], method='xavier')
        output = nn.forward(X)
        print(output)
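
    The class above also shows why the choice matters. The check below is a sketch that reuses SimpleInitializedNetwork: the depth, widths, seed, and the 'small_random' label (any name other than 'xavier'/'he' falls back to the 0.01-scaled noise branch) are illustrative choices. With that fallback, the hidden activations of a deep sigmoid stack collapse into a narrow band around 0.5, while Xavier keeps a much larger spread.

    # Compare the per-layer spread of activations under two schemes
    rng = np.random.default_rng(0)
    X_deep = rng.normal(size=(64, 32))
    for method in ('xavier', 'small_random'):
        net = SimpleInitializedNetwork([32] * 8, method=method)
        act = X_deep
        spreads = []
        for W, b in zip(net.weights, net.biases):
            act = net.sigmoid(act @ W + b)          # same forward step as forward()
            spreads.append(round(float(act.std()), 4))
        print(method, spreads)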
    

Interview Examples

Explaining Initialization

Can you explain how weight initialization affects neural network training?

    # Initialization explanation
    # Key points about initialization:
    # 1. Good initialization prevents vanishing/exploding gradients
    # 2. Different methods for different activations/layers
    # 3. Helps with faster convergence and better performance
    # 4. Poor initialization can stall or destabilize training
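
A concrete way to back up the first and last points is the symmetry problem with constant initialization: every unit in a layer computes the same function and receives the same gradient, so the layer cannot learn distinct features. A minimal NumPy sketch (the shapes and the 0.5 constant are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))

    W_const = np.full((3, 4), 0.5)                             # constant init
    W_rand = rng.normal(0, np.sqrt(2.0 / (3 + 4)), (3, 4))     # Xavier-style init

    h_const = np.tanh(X @ W_const)
    h_rand = np.tanh(X @ W_rand)

    # With constant weights every hidden unit (column) is identical, so the
    # gradients would be identical too and the units never differentiate.
    print(np.allclose(h_const, h_const[:, :1]))   # True
    print(np.allclose(h_rand, h_rand[:, :1]))     # False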

Implement Xavier Initialization in PyTorch

Write a PyTorch function to apply Xavier initialization to a linear layer.

    import torch
    import torch.nn as nn

    def apply_xavier(layer):
        if isinstance(layer, nn.Linear):
            nn.init.xavier_uniform_(layer.weight)
            if layer.bias is not None:
                nn.init.zeros_(layer.bias)

    # Example usage
    model = nn.Sequential(
        nn.Linear(10, 20),
        nn.ReLU(),
        nn.Linear(20, 1)
    )
    model.apply(apply_xavier)
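
A common follow-up: the hidden layer above feeds a ReLU, where He/Kaiming initialization is usually preferred. A sketch of the same pattern with PyTorch's built-in Kaiming initializer (the function name apply_he is illustrative):

    import torch.nn as nn

    def apply_he(layer):
        # He/Kaiming targets sigma^2 = 2 / fan_in, which matches ReLU activations
        if isinstance(layer, nn.Linear):
            nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
            if layer.bias is not None:
                nn.init.zeros_(layer.bias)

    model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
    model.apply(apply_he)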

Practice Questions

1. What are the practical applications of Initialization? Medium

Hint: Consider both academic and industry use cases

2. Explain the core concepts of Initialization. Easy

Hint: Think about the fundamental principles

3. How would you implement this in a production environment? Hard

Hint: Consider scalability and efficiency