Large Language Models
Lecture 5 · How LLMs are trained, scaled, and prompted
Pre-Training Objective
Decoder-only LLMs (GPT family) train on causal language modelling — predicting the next token given all previous tokens:
ℒ = − Σᵢ log P(xᵢ | x₁, …, xᵢ₋₁; θ)
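The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the logits are hypothetical hand-written scores rather than the output of a real model, the names `log_softmax` and `causal_lm_loss` are made up for this example, and the first token is treated as given context, so only positions 2 onward are scored.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def causal_lm_loss(logits_per_position, token_ids):
    """Sum of -log P(x_i | x_1..x_{i-1}) over the predicted positions.

    logits_per_position[i] holds the (hypothetical) model scores for the
    token that follows token_ids[i], so it is scored against token_ids[i + 1].
    """
    loss = 0.0
    for i in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[i])
        loss -= log_probs[token_ids[i + 1]]
    return loss

# Toy vocabulary of 3 tokens; uniform logits give P = 1/3 for every token.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = causal_lm_loss(uniform_logits, token_ids)
print(loss)  # two predictions, each costing log(3)
```

A model that has learned nothing pays log(V) per token for a vocabulary of size V. Training lowers the loss by moving probability mass onto the token that actually comes next.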
Scaling Laws
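Kaplan et al. (2020) showed that model loss falls as a power law in compute, dataset size, and parameter count, provided the other two are not the bottleneck. The Chinchilla result (Hoffmann et al., 2022) found that compute-optimal training uses roughly 20 tokens per parameter, which implies that earlier models such as GPT-3 were under-trained for their size.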
| Model | Params | Training Tokens | Release |
|---|---|---|---|
| GPT-3 | 175 B | 300 B | 2020 |
| Chinchilla | 70 B | 1.4 T | 2022 |
| LLaMA-3 | 70 B | 15 T | 2024 |
| GPT-4 (est.) | >1 T (MoE) | ~13 T | 2023 |
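As a quick check of the ~20 tokens-per-parameter rule against the table, the sketch below computes the ratio for each row with firm numbers. The GPT-4 row is left out because its parameter count is only a lower-bound estimate. The compute estimate uses the common approximation C ≈ 6·N·D (FLOPs ≈ 6 × parameters × tokens), which is an assumption brought in for illustration and is not stated in the lecture.

```python
# (parameters, training tokens) copied from the table
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-3": (70e9, 15e12),
}

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

def approx_train_flops(params, tokens):
    """Rule-of-thumb training compute, C ~ 6 * N * D (an assumption here)."""
    return 6 * params * tokens

ratios = {name: tokens_per_param(n, d) for name, (n, d) in models.items()}
for name, (n, d) in models.items():
    flops = approx_train_flops(n, d)
    print(f"{name:<11} {ratios[name]:7.1f} tokens/param   ~{flops:.2e} FLOPs")
```

GPT-3 lands well below 20 tokens per parameter, Chinchilla sits on it, and LLaMA-3 goes far beyond it. Training past the compute-optimal point costs more up front but yields a smaller model for a given loss, which is cheaper to serve.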
Tokenisation
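Text is split into sub-word tokens using Byte-Pair Encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. Common words typically map to a single token, while rare words are split into several.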
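Python · Tiktoken

    import tiktoken

    # cl100k_base is one of the BPE encodings shipped with tiktoken
    enc = tiktoken.get_encoding("cl100k_base")

    tokens = enc.encode("Transformers are powerful.")
    print(tokens)       # a list of integer token IDs, not strings
    print(len(tokens))  # number of tokens in the sentence

    # Decode each ID on its own to see the sub-word pieces
    print([enc.decode([t]) for t in tokens])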
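A toy version of the BPE training loop can make the merging procedure concrete. This is a simplified sketch under stated assumptions: it works on characters rather than bytes, uses a made-up four-word corpus, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical. It is not how tiktoken is implemented.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency; each word starts as a tuple of characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After a few merges the frequent suffix "est" has become a single symbol, while less frequent character sequences are still split. The same mechanism is why common words end up as one token and rare words as several.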
Temperature & Sampling
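Temperature τ controls randomness by dividing the logits by τ before the softmax. As τ → 0, decoding becomes greedy (it picks the argmax). At τ = 1, sampling follows the trained distribution unchanged. τ > 1 flattens the distribution, which increases diversity but risks incoherence.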
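The effect of τ is easy to see on a small example. The sketch below assumes the usual formulation, in which the logits are divided by τ before the softmax. The three logits are hypothetical scores for a 3-token vocabulary, and `softmax_with_temperature` and `sample_token` are illustrative names rather than any library's API.

```python
import math
import random

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau; tau must be > 0."""
    scaled = [v / tau for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, tau, rng=random):
    """Greedy argmax when tau == 0, otherwise sample from the tempered softmax."""
    if tau == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
for tau in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

Lower τ concentrates probability on the top-scoring token, and higher τ spreads it across the alternatives. Treating τ = 0 as a separate greedy branch avoids dividing by zero.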