Transformers & Attention

Lecture 4 · The architecture powering nearly every modern LLM

Self-Attention Mechanism

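Given an input sequence, self-attention projects each token into Query (Q), Key (K) and Value (V) matrices using learned weights. Each token's output is a weighted average of the value vectors, with the weights set by query-key similarity: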

Attention(Q, K, V) = softmax( QKᵀ / √dₖ ) · V

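Here dₖ is the dimension of the key vectors. Dividing by √dₖ keeps the dot products from growing with that dimension; without it, large scores push softmax into a saturated regime where gradients become very small.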
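A quick numerical check of the scaling argument. This sketch assumes the entries of the query and key vectors are independent with zero mean and unit variance (an assumption made here for illustration, not something the lecture states). Under that assumption the dot product of two dₖ-dimensional vectors has variance dₖ, so dividing by √dₖ brings its standard deviation back to roughly 1. The helper name `score_std` is hypothetical.

```python
import math
import torch

torch.manual_seed(0)

def score_std(d_k, n=10_000):
    """Empirical std of raw and scaled dot products for random unit-variance q, k."""
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    raw = (q * k).sum(dim=-1)            # one dot product per row
    scaled = raw / math.sqrt(d_k)        # the scaling used in the attention formula
    return raw.std().item(), scaled.std().item()

for d_k in (4, 64, 512):
    raw_std, scaled_std = score_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

The raw standard deviation should grow like √dₖ while the scaled one stays near 1, which is what keeps the softmax inputs in a well-behaved range.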

Multi-Head Attention

Run h attention heads in parallel (each learning different relationships), then concatenate and project:

MultiHead(Q,K,V) = Concat(head₁,…,headₕ) Wᴼ
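The formula leaves headᵢ implicit. In the standard formulation each head applies attention to its own learned projections, headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V). Below is a minimal PyTorch sketch of the self-attention case, assuming d_model is divisible by h and that the per-head projections are fused into single linear layers (a common implementation choice, not something the lecture specifies). The class name `MultiHeadSelfAttention` and all sizes are illustrative.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention (hypothetical class name)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One fused projection per role; equivalent to h separate per-head matrices.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # W^O in the formula

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        # Scaled dot-product attention, run for all heads at once.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                                  # (batch, heads, seq, d_head)
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n, -1)     # Concat(head_1, ..., head_h)
        return self.w_o(concat)

x = torch.randn(2, 5, 16)          # batch=2, seq_len=5, d_model=16
mha = MultiHeadSelfAttention(d_model=16, num_heads=4)
out = mha(x)
print(out.shape)
```

The output has the same shape as the input, so the module can be dropped into a residual connection without any reshaping.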

Python · Minimal Attention

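import math

import torch


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q Kᵀ / sqrt(d_k)) V; positions where mask == 0 are blocked."""
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # A large negative score becomes ~0 weight after softmax.
        scores = scores.masked_fill(mask == 0, -1e9)
    # Normalise over the key dimension so each query's weights sum to 1.
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)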
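To see what the mask argument does, the sketch below builds a lower-triangular (causal) mask and inspects the resulting attention weights. The attention recipe is repeated in compact form so the snippet runs on its own; the helper name `attention_weights` and the tensor sizes are illustrative choices, not part of the lecture code.

```python
import math
import torch

def attention_weights(Q, K, mask=None):
    """Softmax-normalised attention weights (illustrative helper)."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

# 1 = may attend, 0 = blocked; position i sees only positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
weights = attention_weights(Q, K, causal_mask)
print(weights)
```

Every row of the printed matrix sums to 1, and the entries above the diagonal are zero, so no position draws on tokens that come after it.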

Transformer Block Components

👁️

Multi-Head Attention

Attends to different positions simultaneously, capturing varied semantic relationships.

🔢

Feed-Forward Layer

Two linear layers with a GELU activation in between. The first expands and the second compresses: d_model → 4×d_model → d_model.

📊

Layer Normalisation

Normalises each token's activations across the feature dimension to zero mean and unit variance, then applies a learned scale and shift, stabilising training of deep networks.

➕

Residual Connections

x + Sublayer(x): the identity path lets gradients flow directly through very deep stacks (100+ layers).
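The four components above can be wired into a single block. The sketch below assumes a pre-layer-norm arrangement (normalisation applied before each sublayer), which is one common variant; the lecture does not say which ordering it has in mind. It relies on PyTorch's built-in `nn.MultiheadAttention`, `nn.LayerNorm` and `nn.GELU`; the class name `TransformerBlock` and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative pre-LN Transformer block (hypothetical class name)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward: d_model -> 4*d_model -> d_model with GELU in between.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Residual connection around self-attention: x + Sublayer(x).
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual connection around the feed-forward layer.
        x = x + self.ffn(self.norm2(x))
        return x

block = TransformerBlock(d_model=32, num_heads=4)
x = torch.randn(2, 10, 32)     # batch=2, seq_len=10, d_model=32
y = block(x)
print(y.shape)
```

Because each sublayer preserves the input shape, blocks like this can be stacked by feeding one block's output straight into the next.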

💡 Why Transformers Won

Self-attention lets every token attend to every other token, at O(n²) cost in the sequence length n. Unlike an RNN, which must process tokens one after another, it computes all positions in parallel, which makes training on GPUs far faster.
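The contrast can be made concrete with a small sketch. It simplifies on purpose: the attention scores use the raw inputs as both queries and keys (no learned projections), and the sizes are arbitrary. The point is only the shape of the computation: a loop whose steps depend on each other versus a single matrix product.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 6, 8                      # sequence length and feature size (arbitrary)
x = torch.randn(n, d)

# Recurrent update: step t needs the hidden state from step t-1,
# so the n steps must run one after another.
cell = nn.RNNCell(d, d)
h = torch.zeros(1, d)
for t in range(n):
    h = cell(x[t].unsqueeze(0), h)

# Self-attention scores: all n*n pairwise interactions come from one
# matrix product, which a GPU can evaluate in parallel.
scores = x @ x.T / d ** 0.5
print(scores.shape)              # n rows by n columns: quadratic in n
```

The score matrix holds n² entries, which is where the quadratic cost comes from, but none of those entries depends on another, so they can all be computed at once.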