AI Safety

A reading series on the case for taking AI risk seriously — scaling laws, superintelligence, instrumental convergence, and what they imply for how we build increasingly capable systems.

10 posts


  1. Chapter 1: Scaling Laws, Superintelligence, and Instrumental Convergence

    Bigger models keep getting predictably better. That single empirical fact — combined with the logic of instrumental convergence — turns AI safety from a science-fiction concern into a present engineering problem.

  2. Chapter 2: Outer Alignment — Specification Gaming and Learning from Human Preferences

    You get what you measure, not what you mean. Outer alignment is the engineering problem of writing down a goal that, when optimized hard, still produces the behavior you actually wanted.

  3. Chapter 3: Deception, Inner Alignment, and Mechanistic Interpretability

    Even a perfect training objective can produce a model that learns the wrong goal — and behaves well only while it's being watched. Inner alignment is what's left of the alignment problem after outer alignment is solved, and it's the part we can't address with reward tuning alone.

  4. Chapter 4: AI Security — Jailbreaks, Adversarial Examples, and Model Theft

    Alignment is what you want the model to do. Security is what an adversary can make it do anyway. This chapter walks through the attack surface — model weights, API endpoints, training pipelines, and the model itself — and the security disciplines that already know how to defend each one.

  5. Chapter 5: AI Governance — Approval Regulation, Technical Levers, and the Coordination Problem

    Alignment is what you build into the model. Governance is the institutional scaffolding that decides which models get built, who gets to run them, and what evidence we demand before they ship. This chapter walks through the technical AI governance toolkit and the FDA-style approval-regulation proposal.

  6. Chapter 6: Critiques and Counter-Arguments — Steelmanning the Skeptics

    If a safety case is unfalsifiable, it isn't a safety case. This chapter takes the strongest critiques of the AI-safety program — accelerationist, infohazard-based, and interpretability-skeptical — and engages them on their own terms before deciding which to update on.

  7. Chapter 7: A Field Map of Alignment Approaches — Agent Foundations, Goal Misgeneralisation, Superposition, and Activation Steering

    There is no single AI alignment program — there's a portfolio. This chapter is the map: what each major research direction is trying to do, what its theory of impact is, what its current track record supports, and where it fits in the broader bet.

  8. Chapter 8: AI Evaluations — The Science of Knowing What Models Can and Will Do

    If governance is the institutional layer (Ch. 5) and security is the adversarial layer (Ch. 4), evaluations are the evidentiary layer underneath both. This chapter walks through the science of evals, the major lab safety frameworks, and the domain-specific benchmarks that decide whether a frontier model ships.

  9. Chapter 9: AI Control — Safety Without Trusting the Model

    Alignment tries to make models that won't betray you. Control assumes the model might, and engineers the deployment so that safety doesn't depend on the answer. This chapter walks through I/O filtering, Constitutional Classifiers, trusted/untrusted model splits, and the broader control paradigm.

  10. Chapter 10: Contributing to AI Safety — Paths, Skills, and Getting Started

    The hardest part of contributing to AI safety isn't picking the right research direction — it's picking a route into the field that matches your background and timeline. This chapter is the field map for that decision.