Transformer Architecture

Overview

The Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017), revolutionized sequence-to-sequence tasks, particularly in Natural Language Processing (NLP). Unlike previous architectures like Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) that process sequences token by token, Transformers process entire sequences simultaneously using attention mechanisms. This allows for significant parallelization and has led to breakthroughs in machine translation, text summarization, question answering, and the development of large language models (LLMs).

Figure: The Transformer architecture, showing the encoder-decoder structure with attention mechanisms

The core innovation of the Transformer is its reliance on self-attention mechanisms to compute representations of its input and output without using sequence-aligned RNNs or convolution.

For related content:

  • LLM-specific applications and extensions: api/content/modern_ai/llms/transformer_architecture.py
  • Detailed exploration of attention mechanisms: api/content/modern_ai/llms/attention_mechanisms.py

Core Concepts

  • High-Level Architecture

    A Transformer typically consists of two main parts: an Encoder and a Decoder, each composed of a stack of identical layers.

    • Encoder: Maps an input sequence of symbol representations \((x_1, ..., x_n) \) to a sequence of continuous representations \(z = (z_1, ..., z_n) \). Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a simple, position-wise fully connected feed-forward network.
    • Decoder: Given \(z\), the decoder generates an output sequence of symbol representations \((y_1, ..., y_m) \) one element at a time. Each decoder layer has three sub-layers: a masked multi-head self-attention mechanism (to prevent positions from attending to subsequent positions), a multi-head cross-attention mechanism (to attend over the encoder's output), and a position-wise fully connected feed-forward network.

    Residual connections are employed around each of the sub-layers, followed by layer normalization.
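
    The Implementation section below includes a minimal encoder layer; for symmetry, here is a hedged sketch of a decoder layer with its three sub-layers (masked self-attention, cross-attention over the encoder output, and a position-wise feed-forward network). It reuses the MultiHeadAttention module defined in the Implementation section and follows the post-norm layout described above; the class name and hyperparameters are illustrative, not a definitive implementation.

    class TransformerDecoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
            super().__init__()
            # Sub-layer 1: masked self-attention over the target sequence
            self.self_attn = MultiHeadAttention(d_model, num_heads)
            # Sub-layer 2: cross-attention over the encoder output
            self.cross_attn = MultiHeadAttention(d_model, num_heads)
            # Sub-layer 3: position-wise feed-forward network
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, enc_output, tgt_mask=None, memory_mask=None):
            # Masked self-attention (tgt_mask blocks attention to future positions)
            x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
            # Cross-attention: queries from the decoder, keys/values from the encoder
            x = self.norm2(x + self.dropout(self.cross_attn(x, enc_output, enc_output, memory_mask)))
            # Feed-forward block
            x = self.norm3(x + self.dropout(self.feed_forward(x)))
            return x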

  • Key Components

    Several key components define the Transformer architecture:

    • Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sequence when processing that sequence. For each word, self-attention computes a weighted sum of the value vectors of all words in the sequence, where the weights are determined by the compatibility between the current word's query and the other words' keys.
    • Multi-Head Attention: Instead of performing a single attention function, the Transformer projects the queries, keys, and values \(h\) times with different, learned linear projections. These \(h\) attention outputs are then concatenated and once again projected, resulting in the final values.
    • Scaled Dot-Product Attention: The specific attention mechanism used. The input consists of queries and keys of dimension \(d_k\), and values of dimension \(d_v\).
      $$ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$
    • Positional Encoding: Since the Transformer contains no recurrence or convolution, positional encodings are added to the input embeddings to provide information about the relative or absolute position of tokens in the sequence (a sinusoidal scheme is sketched below).
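
      As an illustration, here is a minimal sketch of the sinusoidal positional encoding used in the original paper; the function name and the example max_len are assumptions of this sketch, and learned positional embeddings are a common alternative.

      import torch
      import math

      def sinusoidal_positional_encoding(max_len, d_model):
          # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
          # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
          position = torch.arange(max_len).unsqueeze(1).float()
          div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
          pe = torch.zeros(max_len, d_model)
          pe[:, 0::2] = torch.sin(position * div_term)
          pe[:, 1::2] = torch.cos(position * div_term)
          return pe  # (max_len, d_model), added to the token embeddings

      # Example: encodings for a 50-token sequence with d_model = 512
      pe = sinusoidal_positional_encoding(50, 512)
      print(pe.shape)  # torch.Size([50, 512])
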
  • Modern Variants

    Several variants have been proposed to improve the efficiency of the attention mechanism:

    • Linear Attention: Approximates softmax attention (e.g., via kernel feature maps) to reduce complexity from O(n²) to O(n) in sequence length
    • Local Attention: Restricts attention to a fixed-size window around each token (a toy mask construction is sketched below)
    • Rotary Position Embeddings (RoPE): Rotates query and key vectors by position-dependent angles so that attention scores depend on relative positions
    • ALiBi (Attention with Linear Biases): Adds fixed, head-specific linear penalties to attention scores based on token distance, which extrapolates well to longer sequences
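
    As a small illustration of local attention, the sketch below builds a banded mask in which each token may attend only to neighbors within a window. The window size and the convention that mask == 0 means "blocked" (matching the masked_fill call in the Implementation section below) are assumptions of this example.

    import torch

    def local_attention_mask(seq_len, window):
        # mask[i, j] == 1 if token i may attend to token j (|i - j| <= window), else 0
        positions = torch.arange(seq_len)
        distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs()
        return (distance <= window).long()

    # Example: 6 tokens, each attending only to itself and its immediate neighbors
    print(local_attention_mask(6, 1))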

Implementation

  • Code Example

    
    # Minimal illustrative implementation of multi-head attention and an encoder layer.
    # For production use, prefer tested implementations such as torch.nn.Transformer
    # or the Hugging Face Transformers library.
    
    import torch
    import torch.nn as nn
    import math
    
    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super(MultiHeadAttention, self).__init__()
            assert d_model % num_heads == 0
            self.d_k = d_model // num_heads
            self.num_heads = num_heads
            self.wq = nn.Linear(d_model, d_model)
            self.wk = nn.Linear(d_model, d_model)
            self.wv = nn.Linear(d_model, d_model)
            self.fc = nn.Linear(d_model, d_model)
        
        def scaled_dot_product_attention(self, Q, K, V, mask=None):
            attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
            if mask is not None:
                attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
            attn_probs = torch.softmax(attn_scores, dim=-1)
            output = torch.matmul(attn_probs, V)
            return output
    
        def forward(self, query, key, value, mask=None):
            batch_size = query.size(0)
            
            # Linear projections and reshape for multi-head attention
            Q = self.wq(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            K = self.wk(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            V = self.wv(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
            
            # Scaled dot-product attention
            context = self.scaled_dot_product_attention(Q, K, V, mask)
            
            # Concatenate heads and apply final linear layer
            context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
            output = self.fc(context)
            return output
    
    class TransformerEncoderLayer(nn.Module):
        def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
            super(TransformerEncoderLayer, self).__init__()
            self.self_attn = MultiHeadAttention(d_model, num_heads)
            self.feed_forward = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)
        
        def forward(self, x, mask=None):
            # Self-attention block
            attn_output = self.self_attn(x, x, x, mask)
            x = self.norm1(x + self.dropout(attn_output))
            
            # Feed-forward block
            ff_output = self.feed_forward(x)
            x = self.norm2(x + self.dropout(ff_output))
            return x
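
    A quick smoke test of the layer above; the batch size, sequence length, and model dimensions are illustrative, not prescriptive.

    # Hypothetical usage: batch of 2 sequences, 10 tokens each, d_model = 512
    layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048)
    x = torch.randn(2, 10, 512)
    out = layer(x)
    print(out.shape)  # torch.Size([2, 10, 512])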
    

Interview Examples

What problem does the Transformer architecture solve that RNNs/LSTMs struggled with?

Explain the key advantages of Transformers over recurrent architectures for sequence processing.

Implement scaled dot-product attention in NumPy

Write a function to compute scaled dot-product attention using NumPy. Include a usage example.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        '''
        Q: (batch, seq_len_q, d_k)
        K: (batch, seq_len_k, d_k)
        V: (batch, seq_len_k, d_v)
        mask: (batch, seq_len_q, seq_len_k) or None
        '''
        d_k = Q.shape[-1]
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)
        attn_weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        attn_weights = attn_weights / np.sum(attn_weights, axis=-1, keepdims=True)
        output = np.matmul(attn_weights, V)
        return output, attn_weights

    # Example usage:
    batch, seq_len, d_k, d_v = 1, 3, 4, 5
    np.random.seed(0)
    Q = np.random.randn(batch, seq_len, d_k)
    K = np.random.randn(batch, seq_len, d_k)
    V = np.random.randn(batch, seq_len, d_v)
    output, attn_weights = scaled_dot_product_attention(Q, K, V)
    print("Attention output: ", output)
    print("Attention weights: ", attn_weights)

Practice Questions

1. Explain the multi-head attention mechanism in transformers (Hard)

Hint: Think about why multiple attention heads are better than just one
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

2. Why is layer normalization important in transformer architectures? (Medium)

Hint: Think about training stability and convergence

3. How does positional encoding work in transformers? (Medium)

Hint: Consider how transformers need position information since they have no recurrence or convolution