Large Language Models

Lecture 5 · How LLMs are trained, scaled, and prompted

Pre-Training Objective

Decoder-only LLMs (GPT family) train on causal language modelling — predicting the next token given all previous tokens:

ℒ = − Σᵢ log P(xᵢ | x₁, …, xᵢ₋₁; θ)
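The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: `logits` stands in for the output of a hypothetical model, where `logits[i]` holds the vocabulary scores for position i computed from the tokens before i only, and `targets[i]` is the true token id xᵢ. The function names are made up for this example.

```python
import math

def log_softmax(scores):
    """Log-probabilities for one position, computed in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def causal_lm_loss(logits, targets):
    """Negative log-likelihood of the true tokens, summed over positions.

    logits[i]  : scores over the vocabulary for position i (assumed to come
                 from a model that has seen only tokens before i)
    targets[i] : the true token id at position i
    """
    return -sum(log_softmax(row)[t] for row, t in zip(logits, targets))

# Toy check: uniform scores over a 4-token vocabulary, three positions.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
targets = [0, 2, 1]
loss = causal_lm_loss(logits, targets)
print(loss)  # equals 3 * log(4): each position contributes log(4)
```

With uniform scores the model assigns probability 1/4 to every token, so each position costs log 4 regardless of the target. In practice the sum is usually averaged over tokens, which rescales the loss without changing its minimiser.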

Scaling Laws

Kaplan et al. (2020) showed that model loss falls as a power law in compute, dataset size, and parameter count. The Chinchilla result (Hoffmann et al., 2022) found that, for a fixed compute budget, training is roughly optimal at about 20 tokens per parameter.

Model	Params	Training Tokens	Release
GPT-3	175 B	300 B	2020
Chinchilla	70 B	1.4 T	2022
LLaMA-3	70 B	15 T	2024
GPT-4 (est.)	>1 T (MoE)	~13 T	2023
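To relate the table to the ~20 tokens-per-parameter rule of thumb, the ratio can be computed directly. The sketch below only does arithmetic on the table's own figures; the GPT-4 row is left out because its numbers are estimates, and the helper names are illustrative.

```python
# (parameters, training tokens) copied from the table above.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-3": (70e9, 15e12),
}

TOKENS_PER_PARAM = 20  # the Chinchilla rule of thumb quoted in the text

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

def rule_of_thumb_tokens(params, ratio=TOKENS_PER_PARAM):
    """Token budget suggested by the ~20 tokens/parameter heuristic."""
    return ratio * params

ratios = {name: tokens_per_param(p, t) for name, (p, t) in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1f} tokens per parameter")
```

By this measure GPT-3 sits well below the heuristic (about 1.7 tokens per parameter), Chinchilla sits on it, and LLaMA-3 goes far beyond it (about 214).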

Tokenisation

Text is split into sub-word tokens using Byte-Pair Encoding (BPE). Common words usually map to a single token, while rare words are split into several.

Python · Tiktoken

import tiktoken

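# Load the cl100k_base BPE vocabulary.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Transformers are powerful.")  # list of integer token IDs
print(tokens)                                # integer IDs, not text pieces
print([enc.decode([t]) for t in tokens])     # the text piece behind each ID
print(len(tokens))                           # number of tokens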
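To show why common words end up as single tokens, here is a toy sketch of how BPE merges are learned. It is a simplified character-level version on a made-up three-word corpus; real tokenisers such as tiktoken work on bytes and ship with merges already trained, so this illustrates the idea rather than that library's implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> count. Each word starts as single characters.
corpus = {"the": 10, "then": 4, "zebra": 1}
words = {tuple(w): c for w, c in corpus.items()}

# Learn two merges, each time fusing the most frequent adjacent pair.
for _ in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)

print(words)
```

After two merges the frequent word "the" has collapsed into one symbol, "then" into two, and the rare "zebra" is still five separate characters.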

Temperature & Sampling

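Temperature τ controls randomness by dividing the logits by τ before the softmax. As τ → 0, sampling becomes greedy (the argmax token is always picked). At τ = 1, sampling uses the model's trained distribution unchanged. τ > 1 flattens the distribution, which increases diversity but risks incoherence.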
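The effect of τ is easy to see numerically. The sketch below assumes the common formulation in which logits are divided by τ before the softmax; the logits are made-up values and the function names are illustrative. τ = 0 is handled as a special case (greedy argmax) because dividing by zero is undefined.

```python
import math
import random

def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau, computed in a numerically stable way."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, tau, rng=random):
    """Draw one token index; tau == 0 falls back to greedy argmax."""
    if tau == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up logits for a 3-token vocabulary.
logits = [2.0, 1.0, 0.0]
for tau in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

At τ = 0.1 almost all probability mass sits on the highest-scoring token, while at τ = 2 the three probabilities move closer together, so lower-ranked tokens are sampled more often.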