Transformer Architecture

Overview

The Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017), revolutionized sequence-to-sequence tasks, particularly in Natural Language Processing (NLP). Unlike previous architectures like Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) that process sequences token by token, Transformers process entire sequences simultaneously using attention mechanisms. This allows for significant parallelization and has led to breakthroughs in machine translation, text summarization, question answering, and the development of large language models (LLMs).

Figure: The Transformer architecture, showing the encoder-decoder structure with attention mechanisms

The core innovation of the Transformer is its reliance on self-attention mechanisms to compute representations of its input and output without using sequence-aligned RNNs or convolution.

For related content:

  • LLM-specific applications and extensions: api/content/modern_ai/llms/transformer_architecture.py
  • Detailed exploration of attention mechanisms: api/content/modern_ai/llms/attention_mechanisms.py

Core Concepts

  • High-Level Architecture

    A Transformer typically consists of two main parts: an Encoder and a Decoder, each composed of a stack of identical layers.

    • Encoder: Maps an input sequence of symbol representations \((x_1, ..., x_n) \) to a sequence of continuous representations \(z = (z_1, ..., z_n) \). Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network.
    • Decoder: Given \(z\), the decoder generates an output sequence of symbol representations \((y_1, ..., y_m) \) one element at a time. Each decoder layer has three sub-layers: a masked multi-head self-attention mechanism (to prevent positions from attending to subsequent positions), a multi-head cross-attention mechanism (to attend over the encoder's output), and a position-wise fully connected feed-forward network.

    Residual connections are employed around each of the sub-layers, followed by layer normalization.
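
    The Implementation section below includes a minimal encoder layer; for symmetry, here is a hedged sketch of a decoder layer with its three sub-layers (masked self-attention, cross-attention over the encoder output, and a position-wise feed-forward network). It reuses the MultiHeadAttention module defined in the Implementation section and follows the post-norm layout described above; the class name and hyperparameters are illustrative, not a definitive implementation.

    class TransformerDecoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
            super().__init__()
            # Sub-layer 1: masked self-attention over the target sequence
            self.self_attn = MultiHeadAttention(d_model, num_heads)
            # Sub-layer 2: cross-attention over the encoder output
            self.cross_attn = MultiHeadAttention(d_model, num_heads)
            # Sub-layer 3: position-wise feed-forward network
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, enc_output, tgt_mask=None, memory_mask=None):
            # Masked self-attention (tgt_mask blocks attention to future positions)
            x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
            # Cross-attention: queries from the decoder, keys/values from the encoder
            x = self.norm2(x + self.dropout(self.cross_attn(x, enc_output, enc_output, memory_mask)))
            # Feed-forward block
            x = self.norm3(x + self.dropout(self.feed_forward(x)))
            return x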

  • Key Components

    Several key components define the Transformer architecture:

    • Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sequence when processing that sequence. For each word, self-attention computes a weighted sum of the value vectors of all words in the sequence, where the weights are determined by the compatibility between the current word's query and the other words' keys.
    • Multi-Head Attention: Instead of performing a single attention function, the Transformer projects the queries, keys, and values \(h\) times with different, learned linear projections. These \(h\) attention outputs are then concatenated and once again projected, resulting in the final values.
    • Scaled Dot-Product Attention: The specific attention mechanism used. The input consists of queries and keys of dimension \(d_k\), and values of dimension \(d_v\).
      $$ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
    • Positional Encoding: Since the Transformer contains no recurrence or convolution, positional encodings are added to the input embeddings to provide information about the relative or absolute position of tokens in the sequence (a sinusoidal scheme is sketched below).
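
      As an illustration, here is a minimal sketch of the sinusoidal positional encoding used in the original paper; the function name and the example max_len are assumptions of this sketch, and learned positional embeddings are a common alternative.

      import torch
      import math

      def sinusoidal_positional_encoding(max_len, d_model):
          # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
          # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
          position = torch.arange(max_len).unsqueeze(1).float()
          div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
          pe = torch.zeros(max_len, d_model)
          pe[:, 0::2] = torch.sin(position * div_term)
          pe[:, 1::2] = torch.cos(position * div_term)
          return pe  # (max_len, d_model), added to the token embeddings

      # Example: encodings for a 50-token sequence with d_model = 512
      pe = sinusoidal_positional_encoding(50, 512)
      print(pe.shape)  # torch.Size([50, 512])
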
  • Modern Variants

    Several variants have been proposed to improve the efficiency of the attention mechanism:

    • Linear Attention: Approximates softmax attention (e.g., via kernel feature maps) to reduce complexity from O(n²) to O(n) in sequence length
    • Local Attention: Restricts attention to a fixed-size window around each token (a toy mask construction is sketched below)
    • Rotary Position Embeddings (RoPE): Rotates query and key vectors by position-dependent angles so that attention scores depend on relative positions
    • ALiBi (Attention with Linear Biases): Adds fixed, head-specific linear penalties to attention scores based on token distance, which extrapolates well to longer sequences
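
    As a small illustration of local attention, the sketch below builds a banded mask in which each token may attend only to neighbors within a window. The window size and the convention that mask == 0 means "blocked" (matching the masked_fill call in the Implementation section below) are assumptions of this example.

    import torch

    def local_attention_mask(seq_len, window):
        # mask[i, j] == 1 if token i may attend to token j (|i - j| <= window), else 0
        positions = torch.arange(seq_len)
        distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs()
        return (distance <= window).long()

    # Example: 6 tokens, each attending only to itself and its immediate neighbors
    print(local_attention_mask(6, 1))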

Implementation

  • Code Example

    
    # Minimal illustrative implementation of multi-head attention and an encoder layer.
    # For production use, prefer tested implementations such as torch.nn.Transformer
    # or the Hugging Face Transformers library.
    
    import torch
    import torch.nn as nn
    import math
    
    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super(MultiHeadAttention, self).__init__()
            assert d_model % num_heads == 0
            self.d_k = d_model // num_heads
            self.num_heads = num_heads
            self.wq = nn.Linear(d_model, d_model)
            self.wk = nn.Linear(d_model, d_model)
            self.wv = nn.Linear(d_model, d_model)
            self.fc = nn.Linear(d_model, d_model)
        
        def scaled_dot_product_attention(self, Q, K, V, mask=None):
            attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
            if mask is not None:
                attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
            attn_probs = torch.softmax(attn_scores, dim=-1)
            output = torch.matmul(attn_probs, V)
            return output
    
        def forward(self, query, key, value, mask=None):
            batch_size = query.size(0)
            
            # Linear projections and reshape for multi-head attention
            Q = self.wq(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            K = self.wk(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            V = self.wv(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            
            # Scaled dot-product attention
            context = self.scaled_dot_product_attention(Q, K, V, mask)
            
            # Concatenate heads and apply final linear layer
            context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
            output = self.fc(context)
            return output
    
    class TransformerEncoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
            super(TransformerEncoderLayer, self).__init__()
            self.self_attn = MultiHeadAttention(d_model, num_heads)
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)
        
        def forward(self, x, mask=None):
            # Self-attention block
            attn_output = self.self_attn(x, x, x, mask)
            x = self.norm1(x + self.dropout(attn_output))
            
            # Feed-forward block
            ff_output = self.feed_forward(x)
            x = self.norm2(x + self.dropout(ff_output))
            return x
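
    A quick smoke test of the layer above; the batch size, sequence length, and model dimensions are illustrative, not prescriptive.

    # Hypothetical usage: batch of 2 sequences, 10 tokens each, d_model = 512
    layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048)
    x = torch.randn(2, 10, 512)
    out = layer(x)
    print(out.shape)  # torch.Size([2, 10, 512])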
    

Interview Examples

What problem does the Transformer architecture solve that RNNs/LSTMs struggled with?

Explain the key advantages of Transformers over recurrent architectures for sequence processing.

Implement scaled dot-product attention in NumPy

Write a function to compute scaled dot-product attention using NumPy. Include a usage example.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        '''
        Q: (batch, seq_len_q, d_k)
        K: (batch, seq_len_k, d_k)
        V: (batch, seq_len_k, d_v)
        mask: (batch, seq_len_q, seq_len_k) or None
        '''
        d_k = Q.shape[-1]
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)
        attn_weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        attn_weights = attn_weights / np.sum(attn_weights, axis=-1, keepdims=True)
        output = np.matmul(attn_weights, V)
        return output, attn_weights

    # Example usage:
    batch, seq_len, d_k, d_v = 1, 3, 4, 5
    np.random.seed(0)
    Q = np.random.randn(batch, seq_len, d_k)
    K = np.random.randn(batch, seq_len, d_k)
    V = np.random.randn(batch, seq_len, d_v)
    output, attn_weights = scaled_dot_product_attention(Q, K, V)
    print("Attention output: ", output)
    print("Attention weights: ", attn_weights)

Practice Questions

1. Explain the multi-head attention mechanism in transformers (Hard)

Hint: Think about why multiple attention heads are better than just one
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

2. Why is layer normalization important in transformer architectures? (Medium)

Hint: Think about training stability and convergence

3. How does positional encoding work in transformers? (Medium)

Hint: Consider how transformers need position information since they have no recurrence or convolution