Sequence of Innovations for LLMs

Yesterday, I was listening to the Modern Wisdom podcast interview with Eliezer Yudkowsky about his new book (which I’ve read).

The podcast:

#1011 – Eliezer Yudkowsky – Why Superhuman AI Would Kill Us All

The book:

If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All, Eliezer Yudkowsky & Nate Soares, 2025.

Yudkowsky always gives great analogies, this interview was no different. The climbing the ladder analogy hit for me:

… and with artificial intelligence, what we’ve got is a ladder where every time you climb another step on the ladder, you get five times as much money. But one of those steps of the ladder destroys the world.

Reminds me of Nick Bostrom’s urn analogy (The Vulnerable World Hypothesis), something like (via Claude):

In this thought experiment, Bostrom imagines that technological discoveries are like drawing balls from a giant urn. Most balls are white (beneficial or neutral technologies), some are grey (technologies with mixed effects), but there might be a black ball in the urn - a technology that’s so destructive and easy to access that it would be essentially impossible for civilization to survive its discovery.

Anyway, at one point Yudkowsky was enumerating some of the innovations that got us to modern systems, I think as making a point about how we can’t predict what/when for the next innovation.

It got me thinking. It would be fun to build/read a reference that walks through innovations to current systems.

Perhaps not all, but some or most. This is probably what a modern University course on LLMs would do.

Something like starting at backprop and ending up at RL on reasoning or something. I suspect the early stuff is simpler, more tractable to build with numpy in a notebook, and more fun. The later stuff is probably too big and heavy and would just be “ideas to know”.

Anyway, I asked GPT5 to think on this, and here’s what it gave me:

Here’s a compact, CS-focused lineage of the core innovations that made late-2025 LLMs like ChatGPT/Claude possible—each item lists the key idea and a representative paper.

Backpropagation & gradient training for deep nets — general learning recipe for multi-layer networks. (Gwern)
ReLU activations — efficient, sparse, non-saturating nonlinearity. (Proceedings of Machine Learning Research)
Batch Normalization — stable/fast training via internal covariate shift reduction. (Proceedings of Machine Learning Research)
Dropout — stochastic regularization to prevent overfitting. (Journal of Machine Learning Research)
GPU deep learning at scale (e.g., AlexNet) — practical, high-throughput training. (NeurIPS Papers)
Mixed-precision training (FP16/bfloat16) — halves memory, speeds up training. (arXiv)
Adam / AdamW-style optimizers — adaptive, first-order optimization at scale. (arXiv)
Layer Normalization — normalization that works in non-batch/sequence settings. (arXiv)
Attention mechanisms → Transformers — parallel sequence modeling w/ self-attention (“Attention Is All You Need”). (arXiv)
Tokenization via subwords — BPE/SentencePiece enable open-vocabulary LMs. (arXiv)
Decoding strategies — nucleus (top-p) sampling reduces degeneration vs. beam search. (arXiv)
Position handling for long contexts — RoPE, ALiBi, Transformer-XL. (arXiv)
Multi-/Grouped-Query Attention — faster decoder inference via shared KV heads. (arXiv)
Pretraining → instruction-following evolution
- GPT-style generative pretraining (GPT-1/2) — zero-/few-shot behaviors emerge. (OpenAI)
- BERT — bidirectional masked-LM pretraining for encoders. (arXiv)
- T5 — unify tasks as text-to-text; large-scale transfer learning. (arXiv)
Scaling laws & data/compute optimality — guides model/data sizing and budgets. (arXiv)
Chinchilla compute-optimal training — “more data per parameter” rule-of-thumb. (arXiv)
Massive data corpora (open) — The Pile (text), LAION-5B (image-text). (arXiv)
Alignment from human preferences
- RLHF concept — learn reward models from human comparisons. (arXiv)
- InstructGPT — PPO-based fine-tuning to follow instructions. (NeurIPS Proceedings)
- PPO algorithm (RL backbone). (arXiv)
- DPO — preference optimization without explicit RL/reward model. (NeurIPS Proceedings)
Reasoning & tool-use prompting
- Chain-of-Thought; Self-Consistency — elicit multi-step reasoning. (arXiv)
- ReAct — interleave reasoning with actions/tool calls. (arXiv)
- Toolformer — self-learned API/tool use. (arXiv)
Systems-level scaling for training
- Model/parallel training: Megatron-LM; ZeRO (DeepSpeed). (arXiv)
- Sparse Mixture-of-Experts: MoE layer; GShard; Switch Transformer. (OpenReview)
- High-throughput attention kernels: FlashAttention (and successors). (arXiv)
Inference efficiency at scale
- KV-cache memory management: vLLM / PagedAttention. (arXiv)
- Speculative decoding — exact-distribution speedups using draft models. (arXiv)
Parameter-efficient adaptation & quantization
- LoRA — low-rank adapters for cheap fine-tuning. (arXiv)
- QLoRA — 4-bit quantized finetuning with NF4/paged optimizers. (arXiv)
Multimodality & generative media foundations
- CLIP — contrastive image–text pretraining enabling zero-shot transfer. (arXiv)
- Diffusion models (DDPM) & Latent Diffusion — practical, high-fidelity image generation (basis for modern multimodal LLMs). (NeurIPS Proceedings)