AI Safety

A reading series on the case for taking AI risk seriously — scaling laws, superintelligence, instrumental convergence, and what they imply for how we build increasingly capable systems.

10 posts


  1. Chapter 1: Scaling Laws, Superintelligence, and Instrumental Convergence

    Bigger models keep getting predictably better. That single empirical fact — combined with the logic of instrumental convergence — turns AI safety from a science-fiction concern into a present engineering problem.

  2. Chapter 2: Outer Alignment — Specification Gaming and Learning from Human Preferences

    You get what you measure, not what you mean. Outer alignment is the engineering problem of writing down a goal that, when optimized hard, still produces the behavior you actually wanted.

  3. Chapter 3: Deception, Inner Alignment, and Mechanistic Interpretability

    Even a perfect training objective can produce a model that learns the wrong goal — and behaves well only while it's being watched. Inner alignment is what's left of the alignment problem after outer alignment is solved, and it's the part we can't address with reward tuning alone.

  4. Chapter 4: AI Security — Jailbreaks, Adversarial Examples, and Model Theft

    Alignment is what you want the model to do. Security is what an adversary can make it do anyway. This chapter walks through the attack surface — model weights, API endpoints, training pipelines, and the model itself — and the security disciplines that already know how to defend each one.

  5. Chapter 5: AI Governance — Approval Regulation, Technical Levers, and the Coordination Problem

    Alignment is what you build into the model. Governance is the institutional scaffolding that decides which models get built, who gets to run them, and what evidence we demand before they ship. This chapter walks through the technical AI governance toolkit and the FDA-style approval-regulation proposal.

  6. Chapter 6: Critiques and Counter-Arguments — Steelmanning the Skeptics

    If a safety case is unfalsifiable, it isn't a safety case. This chapter takes the strongest critiques of the AI-safety program — accelerationist, infohazard-based, and interpretability-skeptical — and engages them on their own terms before deciding which to update on.

  7. Chapter 7: A Field Map of Alignment Approaches — Agent Foundations, Goal Misgeneralisation, Superposition, and Activation Steering

    There is no single AI alignment program — there's a portfolio. This chapter is the map: what each major research direction is trying to do, what its theory of impact is, what its current track record supports, and where it fits in the broader bet.

  8. Chapter 8: AI Evaluations — The Science of Knowing What Models Can and Will Do

    If governance is the institutional layer (Ch. 5) and security is the adversarial layer (Ch. 4), evaluations are the evidentiary layer underneath both. This chapter walks through the science of evals, the major lab safety frameworks, and the domain-specific benchmarks that decide whether a frontier model ships.

  9. Chapter 9: AI Control — Safety Without Trusting the Model

    Alignment tries to make models that won't betray you. Control assumes the model might, and engineers the deployment so that safety doesn't depend on the answer. This chapter walks through I/O filtering, Constitutional Classifiers, trusted/untrusted model splits, and the broader control paradigm.

  10. Chapter 10: Contributing to AI Safety — Paths, Skills, and Getting Started

    The hardest part of contributing to AI safety isn't picking the right research direction — it's picking a route into the field that matches your background and timeline. This chapter is the field map for that decision.