Generative AI — Lecture Series

Large Language Models

Lecture 5 · How LLMs are trained, scaled, and prompted

Pre-Training Objective

Decoder-only LLMs (GPT family) train on causal language modelling — predicting the next token given all previous tokens:

ℒ = − Σᵢ log P(xᵢ | x₁, …, xᵢ₋₁; θ)
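
As a rough illustration of the objective above, the following sketch computes the summed negative log-likelihood for a toy sequence. The names `softmax` and `causal_lm_loss`, the three-token vocabulary, and the logit values are all made up for this example. A real model would produce the logits from the preceding tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def causal_lm_loss(step_logits, targets):
    """Sum of -log P(x_i | x_1..x_{i-1}) over all positions.

    step_logits[i] holds the (hypothetical) model scores for position i,
    already conditioned on the previous tokens; targets[i] is the true token id.
    """
    loss = 0.0
    for logits, target in zip(step_logits, targets):
        probs = softmax(logits)
        loss -= math.log(probs[target])
    return loss

# Toy data: vocabulary of 3 tokens, sequence of 2 positions (illustrative values).
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = causal_lm_loss(step_logits, targets)
print(f"loss = {loss:.4f}")
```

The first position contributes little loss because the model is confident in the correct token. The second contributes log 3 because its distribution is uniform over three tokens.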

Scaling Laws

Kaplan et al. (2020) showed that model loss follows power laws in compute, dataset size, and parameter count. The Chinchilla result (Hoffmann et al., 2022) found that compute-optimal training uses roughly 20 tokens per parameter.

Model          Params        Training Tokens   Release
GPT-3          175 B         300 B             2020
Chinchilla     70 B          1.4 T             2022
LLaMA-3        70 B          15 T              2024
GPT-4 (est.)   >1 T (MoE)    ~13 T             2023
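
To make the 20-tokens-per-parameter rule of thumb concrete, the sketch below computes the tokens-per-parameter ratio for the first three rows of the table and the token budget the rule would suggest. The helper names are hypothetical. The GPT-4 row is left out because its figures are estimates.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted in the text

def compute_optimal_tokens(n_params):
    """Token budget suggested by the ~20 tokens/parameter heuristic."""
    return TOKENS_PER_PARAM * n_params

def tokens_per_param(n_params, n_tokens):
    """Ratio of training tokens actually used to parameter count."""
    return n_tokens / n_params

B = 10**9  # one billion

# (parameters, training tokens) taken from the table above
models = {
    "GPT-3": (175 * B, 300 * B),
    "Chinchilla": (70 * B, 1_400 * B),
    "LLaMA-3": (70 * B, 15_000 * B),
}

for name, (n_params, n_tokens) in models.items():
    ratio = tokens_per_param(n_params, n_tokens)
    suggested = compute_optimal_tokens(n_params) / B
    print(f"{name}: {ratio:.1f} tokens/param, heuristic suggests {suggested:.0f} B tokens")
```

By this measure GPT-3 sits well below the heuristic and LLaMA-3 well above it. The rule targets compute-optimal training, so a model may deliberately be trained on far more tokens.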

Tokenisation

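Text is split into sub-word tokens using Byte-Pair Encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent symbol pairs. Common words typically map to a single token, while rare words are split into several.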
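
The following is a minimal, character-level sketch of how BPE merges could be learned, meant only to show why frequent words collapse into one token. Production tokenisers work on bytes and add pre-tokenisation rules. The function names and the tiny corpus are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy corpus: "the" is frequent, "zyx" is rare (made-up counts).
corpus = {"the": 10, "then": 3, "zyx": 1}
merges, words = learn_bpe(corpus, num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "the" has become a single symbol, while the rare "zyx" is still three separate characters.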

Python · Tiktoken
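import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Transformers are powerful.")
print(tokens)        # a list of integer token IDs, not strings
print(len(tokens))   # number of tokens in the sentence

# Decode each ID on its own to see the sub-word pieces
print([enc.decode([t]) for t in tokens])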

Temperature & Sampling

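Temperature τ controls randomness by dividing the logits by τ before the softmax. As τ → 0, decoding becomes greedy (the model picks the argmax). At τ = 1, sampling matches the model's learned distribution. τ > 1 flattens the distribution, which increases diversity but risks incoherence.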
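
A small sketch of temperature sampling, assuming the common formulation in which logits are divided by τ before the softmax. The function names and example logits are illustrative, and τ ≤ 0 is treated as greedy decoding by convention here.

```python
import math
import random

def softmax_with_temperature(logits, tau):
    """Softmax over logits / tau (shifted by the max for stability)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, tau, rng=random):
    """Return a token index; tau <= 0 falls back to greedy argmax."""
    if tau <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
for tau in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])

print("greedy pick:", sample_token(logits, tau=0))
```

At low τ nearly all probability mass sits on the top-scoring token. At higher τ the distribution flattens, so lower-scoring tokens are drawn more often.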