Transformers & Attention
Lecture 4 · The architecture powering nearly every modern LLM
Self-Attention Mechanism
Given an input sequence, self-attention projects it into Query (Q), Key (K) and Value (V) matrices and produces context-weighted representations:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
The √dₖ scaling keeps the dot products from growing with the key dimension dₖ. Without it, large scores push softmax into saturation, where gradients become vanishingly small.
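The effect of the scaling can be checked numerically. The sketch below is illustrative only: it assumes query and key entries are independent with unit variance, and the helper name `dot_product_std` is made up for this example.

```python
import math
import torch

torch.manual_seed(0)  # fixed seed so repeated runs give the same numbers

def dot_product_std(d_k, n_samples=10_000):
    """Std of q.k for random unit-variance q and k, before and after scaling."""
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    raw = (q * k).sum(dim=-1)        # unscaled dot products
    scaled = raw / math.sqrt(d_k)    # the scaling used in attention
    return raw.std().item(), scaled.std().item()

for d_k in (16, 64, 256):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

Under these assumptions the raw standard deviation grows roughly like √dₖ, while the scaled one stays close to 1 for every dₖ.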
Multi-Head Attention
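Run h attention heads in parallel, each with its own learned projections so that it can capture different relationships, then concatenate the outputs and project:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)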
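import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    return torch.matmul(weights, V)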
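The multi-head step (split into h heads, attend, concatenate, project) could be sketched roughly as follows. This is a minimal sketch, not a reference implementation: the class name `MultiHeadSelfAttention`, the fused d_model → d_model projections and the example sizes are all assumptions made for illustration, and the attention computation is repeated inline so the block runs on its own.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention; names and sizes are illustrative."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One fused projection each for Q, K, V (equivalent to h per-head projections)
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection W^O

    def _split_heads(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x, mask=None):
        q = self._split_heads(self.w_q(x))
        k = self._split_heads(self.w_k(x))
        v = self._split_heads(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, heads, seq, d_k)
        b, _, n, _ = heads.shape
        # Concatenate the heads back into d_model features, then project
        concat = heads.transpose(1, 2).reshape(b, n, self.num_heads * self.d_k)
        return self.w_o(concat)

x = torch.randn(2, 5, 32)              # (batch, seq_len, d_model)
causal = torch.tril(torch.ones(5, 5))  # lower-triangular mask: no attending to later tokens
mha = MultiHeadSelfAttention(d_model=32, num_heads=4)
out = mha(x, mask=causal)
print(out.shape)  # torch.Size([2, 5, 32])
```

With the causal mask, changing the last token leaves the outputs at earlier positions unchanged, which is a quick sanity check that the masking works as intended.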
Transformer Block Components
Multi-Head Attention
Attends to different positions simultaneously, capturing varied semantic relationships.
Feed-Forward Layer
Two linear layers with a GELU activation in between, applied to each position independently. Expands, then compresses: d_model → 4×d_model → d_model.
Layer Normalisation
Normalises each token's activations across the feature dimension, stabilising training of deep networks.
Residual Connections
x + Sublayer(x): the identity path lets gradients flow directly through very deep stacks (100+ layers).
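The four components above can be assembled into a single block. The sketch below is one possible arrangement, not the only one: it assumes the pre-norm ordering x + Sublayer(LayerNorm(x)), which the text does not specify, and the class name `TransformerBlock` and the sizes (d_model=64, 4 heads) are illustrative choices. It uses PyTorch's built-in `nn.MultiheadAttention`, `nn.LayerNorm` and `nn.GELU`.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block (an assumption): x + Sublayer(LayerNorm(x)) for each sublayer."""

    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward layer: d_model -> 4*d_model -> d_model with GELU in between
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # self-attention: Q = K = V = h
        x = x + attn_out                  # residual connection around attention
        x = x + self.ffn(self.norm2(x))   # residual connection around the feed-forward layer
        return x

block = TransformerBlock()
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
y = block(x)
print(y.shape)  # same shape as the input, so blocks can be stacked
```

Because the output shape matches the input shape, any number of these blocks can be stacked, and the residual additions give every block a direct path back to the input.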
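Unlike RNNs, which process tokens one step at a time, self-attention lets every token attend to every other token directly. This costs O(n²) time and memory in the sequence length n, but all positions are computed in parallel, which makes training on GPUs far faster than a sequential recurrence.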
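The quadratic cost and the parallelism both show up in the shape of the score matrix. The snippet below is a simplified illustration: it uses the raw inputs as both queries and keys (no learned projections), and the helper name `attention_scores` is hypothetical.

```python
import torch

def attention_scores(x):
    # All n x n pairwise scores come from one matrix multiply, with no loop over positions
    return x @ x.transpose(-2, -1) / x.size(-1) ** 0.5

for n in (128, 256, 512):
    x = torch.randn(n, 64)  # n tokens, 64 features each
    scores = attention_scores(x)
    print(n, tuple(scores.shape), scores.numel())
```

Doubling n quadruples the number of score entries (16384, 65536, 262144 here), which is why long sequences become expensive even though each step is a single parallel operation.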