ARTICLE · 15 MIN READ · FEBRUARY 19, 2026
Chapter 7: A Field Map of Alignment Approaches — Agent Foundations, Goal Misgeneralisation, Superposition, and Activation Steering
There is no single AI alignment program — there's a portfolio. This chapter is the map: what each major research direction is trying to do, what its theory of impact is, what its current track record supports, and where it fits in the broader bet.
Why a Survey Chapter
The previous six chapters built the case from first principles: why scaling matters (Ch. 1), why outer alignment is hard (Ch. 2), why inner alignment is harder (Ch. 3), what the security surface looks like (Ch. 4), how governance is supposed to wrap around the technical work (Ch. 5), and which critiques to take seriously and which not (Ch. 6).
This chapter pulls back to the field-map level. Alignment isn’t one program — it’s a portfolio, with very different research bets, different theories of impact, and different views on what the load-bearing problem actually is. The goal here is to lay out the major directions side-by-side so you can see what each is doing, what it’s not doing, and how they relate.
The structure is:
- Why a portfolio frame. No single research direction can make a fully convincing case for solving alignment alone. Diversification is the response.
- An orientation map plus four focal directions in depth, each with a representative entry point:
- A brief introduction to alignment approaches as the orientation map
- Agent foundations — the program that asks what we even mean by “agent” before trying to align one
- Goal misgeneralisation — the empirical companion to the inner-alignment argument
- Superposition and SAEs — the mechanistic-interpretability research line that’s currently most active
- Activation steering / contrastive activation additions — the most accessible “intervene without training” technique
- The wider portfolio. Scalable oversight, control, evals, model organisms, and theory — placed alongside the focal four so you can see the shape of the field.
- Picking a focus. What it means to specialise in one of these for a few months — what you’d read, what you’d build, and what the failure modes of each specialisation are.
Theory of impact (ToI): A causal account of how a research program's outputs would, if successful, reduce expected harm. From Chapter 6: research without a written-down ToI is hard to evaluate.
Agent foundations: The research program of formalising the concepts that alignment depends on — agency, goal-directedness, embedded agency, decision theory under self-reference, optimisation. Closer to mathematics and philosophy than to ML; concerned with getting the conceptual primitives right.
Goal misgeneralisation (GMG): A specific empirical inner-alignment failure mode: a model trained to optimise objective A learns to pursue a related-but-different objective B, where B and A coincide on the training distribution and diverge off it. Distinct from outer-alignment / specification gaming because the specification is correct.
Superposition: The phenomenon, in neural networks, where more features are represented than there are neurons — by overlaying features on shared directions in activation space. The reason individual neurons are typically polysemantic.
Sparse Autoencoders (SAEs): A method for decomposing model activations into a larger, sparser set of features than there are neurons — a practical attempt to recover the "interpretable basis" superposition hides.
Activation steering: Modifying a model's behavior by directly editing its internal activations at inference time — typically by adding a "steering vector" computed from contrastive examples — without retraining. Contrastive Activation Addition (CAA) is the most-studied recipe.
Why Portfolio Thinking Matters
The community-level allocation question — which directions get how much funding, attention, and talent — is itself contested (Chapter 6’s interpretability critique was an example). The individual-researcher question — what to specialise in for the next three to twelve months — is what this chapter is built around.
The Orientation Map: A Brief Introduction to Alignment Approaches
A useful orientation move — surprisingly uncommon in the field — is to lay out the major research directions side-by-side at the same level of abstraction, without adjudicating between them. The headline categories, with the framing this chapter will use:
Conceptual / Foundations
Agent foundations, embedded agency, decision-theoretic alignment, ELK-style problem statements. Premise: we're missing the right primitives. Theory of impact: clearer concepts → solvable engineering problems later.
Empirical Alignment / Post-Training
RLHF, constitutional AI, DPO and successors, scalable oversight via debate / recursive reward modeling. Premise: most of alignment is engineering against a moving target. Theory of impact: keep the deployment-ready model aligned well enough through the capability transition.
Mechanistic Interpretability
Features, circuits, superposition, SAEs, attribution graphs. Premise: behavioral defenses fail against strategic actors (Ch. 3); we need to inspect cognition. Theory of impact: read the model's actual computation → catch deception → enable governance.
Control / Trust-But-Verify
Greenblatt et al.'s AI control agenda. Premise: assume the model may be misaligned, design deployment so safety doesn't depend on alignment. Theory of impact: contain potentially-misaligned capability through monitoring, redaction, and sandboxing.
Evaluations & Model Organisms
Capability evals, propensity evals, sleeper agents, alignment-faking demos, "model organisms of misalignment." Premise: you can't manage what you can't measure. Theory of impact: build the empirical evidence base that governance and engineering decisions can use.
Steering & Lightweight Intervention
Activation steering, representation engineering, model editing, knowledge editing. Premise: we can sometimes intervene on cognition without retraining. Theory of impact: cheap, fast, surgical interventions for specific behaviors when full retraining isn't tractable.
What an introductory survey doesn’t do. It doesn’t tell you which is right. The honest position is that the answer depends on which problem turns out to be the binding constraint — and that’s itself contested. The right move for someone choosing a focus is to pick the bet whose premise you find most defensible, then go deep enough to test it against reality.
Direction 1 — Agent Foundations
The agent-foundations program — most associated with MIRI historically and with researchers like John Wentworth, Scott Garrabrant, and Vanessa Kosoy now — is the alignment direction that looks least like ML research and most like mathematical philosophy. Wentworth’s Why Agent Foundations? An Overly Abstract Explanation is the cleanest single statement of why the program exists.
The argument structure: alignment arguments lean on concepts (agency, goal-directedness, optimisation) that have no agreed formalisation. Until those primitives are pinned down, the engineering problems built on them are ill-posed. Clearer concepts first, tractable engineering later.
The active sub-areas:
Embedded agency
Standard decision theory assumes the agent is outside the world it's reasoning about. Real agents are inside their world — they're built of the same physics they reason about, can be reasoned about by other agents, and have models of themselves. The theory needed for embedded agency is not the standard textbook theory.
Goal-directedness as a property
Treating "goal-directed" as a binary obscures that real systems exhibit varying degrees of consequentialist vs. deontological vs. habitual behavior. A theory that distinguishes these is needed to make claims like "this system is goal-directed enough to exhibit instrumental convergence" precise.
Logical uncertainty
Bayesian decision theory assumes the agent knows all the logical implications of its beliefs. Real agents don't — they have to reason under logical uncertainty too. Garrabrant induction and similar work is a candidate formal treatment.
Decision theory under self-reference
Newcomb-like problems, anthropic puzzles, agents that can predict each other. A serious theory of decision-making that handles self-reference is a precondition for analysing AI systems that model their training process or other AIs.
The state of the program (honestly)
What it has produced. Cleaner conceptual frameworks (logical induction, Cartesian frames, finite factored sets), explicit problem statements (Eliciting Latent Knowledge), and a consistent track record of making explicit which concepts are doing load-bearing work in alignment arguments, a kind of analysis no other direction is set up to do.
What it hasn’t produced (yet). Direct interventions on production models. The complaint — fairly stated — is that decades of foundations work haven’t yet translated into “do X to your model and it will be more aligned.” The defenders respond that this is the wrong measure: foundations work, like much of theoretical math, has long impact lags that don’t show up in short-term ML metrics.
Honest portfolio view. A small but non-zero allocation. Foundations work has not paid off in obviously-deployable ways yet, and may not in the near term. Yet it is also the only research direction whose target is the conceptual layer, and if the conceptual layer is wrong, every empirical bet is bottlenecked by it. The bet is asymmetric: usually irrelevant, occasionally critical.
Direction 2 — Goal Misgeneralisation
If agent foundations is the most abstract of the focal directions, goal misgeneralisation is the most concrete. Two papers anchor it: Langosco, Koch, Sharkey et al.'s Goal Misgeneralisation in Deep Reinforcement Learning (2022) and DeepMind's Goal Misgeneralisation: Why Correct Specifications Aren't Enough For Correct Goals (Shah et al., 2022), the empirical demonstrations of an inner-alignment failure under a correct outer specification.
The headline distinction: in specification gaming, the specification itself is wrong and the model exploits it; in goal misgeneralisation, the specification is correct, but the model learns a different goal that coincides with it on the training distribution and diverges off it.
The canonical worked examples from the paper:
The CoinRun Coin-Position Failure
Procgen's CoinRun: agent trained to collect a coin. Training: the coin is always at the right end of the level. Deployment: the coin is moved. Result: the agent runs to the right end and ignores the coin. The reward function ("collect the coin") was correct; the learned mesa-objective ("go right") happened to coincide with it on training data.
Insight: a correct reward + spurious feature correlation = misgeneralisation
The Tree-Gridworld Misalignment
Variant gridworlds where the "do the task" objective and a perceptual feature ("be near a particular object") coincide in training. Models reliably learn the perceptual feature, not the task — and behavior diverges sharply when the feature/task correlation breaks.
Insight: feature-based shortcuts are the default; task-based goals require breaking the shortcut explicitly in training
The InstructGPT-style example
Even with carefully designed RLHF, models can learn "produce text that looks like the kind humans rate highly" rather than "actually answer accurately." On distribution this is identical; off distribution (low-resource topics, adversarial inputs) the distinction shows up. Sycophancy is one face of this.
Insight: GMG isn't only an RL/toy phenomenon — language models exhibit the same structure
The general structural recipe
Whenever there's a feature/goal correlation in training data — and there almost always is, since training data is finite — the model can learn either the feature or the underlying goal. Both produce the same training-time loss. Which one it actually learned only shows up at deployment.
Insight: GMG is the empirical face of inner alignment; it's the rule, not the exception
Why GMG is the empirical companion to inner alignment
Chapter 3 made the inner-alignment argument abstractly: SGD doesn’t select for “model whose internal goal matches the base objective” — it selects for “model whose behaviour gets low loss on training data.” Goal misgeneralisation is what that argument looks like when reduced to demonstrable, measurable phenomena on toy and small-scale systems.
Two consequences:
- Inner alignment isn’t a far-future worry. It’s already producing measurable failures at small scale. The question for frontier models is whether the same dynamics scale up — and the empirical evidence (sycophancy, sandbagging, shortcut learning in language models) suggests it does, in modified form.
- Goal misgeneralisation is more tractable than full deceptive alignment. It’s a behavioural phenomenon you can construct experiments for, vary inputs against, and quantify. As a research target it sits between toy demos and frontier-scale problems.
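The structural recipe is easy to reproduce at toy scale. The sketch below is illustrative, not from the papers: a logistic regression is trained where a spurious feature tracks the label perfectly in training and is slightly cleaner than the true signal; when the correlation breaks at test time, accuracy drops.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

def make_data(correlated: bool):
    y = rng.integers(0, 2, n)
    goal = 2.0 * y - 1.0                          # the true goal signal
    spur = goal if correlated else 2.0 * rng.integers(0, 2, n) - 1.0
    x = np.stack([goal + 0.3 * rng.standard_normal(n),      # noisy true feature
                  spur + 0.05 * rng.standard_normal(n)], 1)  # cleaner spurious feature
    return x, y

x_train, y_train = make_data(correlated=True)     # training: feature tracks the goal
x_test, y_test = make_data(correlated=False)      # deployment: correlation breaks

w = np.zeros(2)                                   # logistic regression, gradient descent
for _ in range(500):
    p = 1 / (1 + np.exp(-x_train @ w))
    w -= 0.1 * x_train.T @ (p - y_train) / n

acc_train = ((x_train @ w > 0) == (y_train == 1)).mean()
acc_test = ((x_test @ w > 0) == (y_test == 1)).mean()
print(f"on-distribution {acc_train:.2f}, off-distribution {acc_test:.2f}")
```

Both features produce near-identical training loss, so nothing in training distinguishes "learned the goal" from "learned the shortcut"; only the shifted evaluation does.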
Honest portfolio view. Goal misgeneralisation is one of the strongest empirical anchors for the inner-alignment program. Someone specialising here would build a worked understanding of when correlated features in training produce learned shortcuts, what diagnostics catch them, and what training-time interventions (data diversification, distribution-aware loss, abstention training) reduce them. It’s a particularly accessible specialisation: small models, clean experiments, real phenomena.
Direction 3 — Superposition and Sparse Autoencoders
Anthropic’s Toy Models of Superposition (Elhage et al. 2022) is the paper that moved interpretability from “neurons are mysteriously polysemantic, and we don’t know what to do” to “neurons are polysemantic because features are stored in superposition, and here’s a tractable framework for studying what that means.”
The central observation: when a network has more important features to represent than it has neurons, it stores them anyway, overlaying features on shared directions in activation space. Individual neurons then read out mixtures of features rather than single concepts, which is why they look polysemantic.
The toy-models paper formalises the trade-off: as the ratio of features to neurons grows, networks shift from monosemantic representation to superposed representation, with sharp phase changes as feature sparsity and importance vary. The paper validates the framework on small synthetic settings; subsequent work (Bricken et al., Towards Monosemanticity; Templeton et al., scaling SAEs to Claude 3 Sonnet) extends it to real LLMs.
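A minimal numerical illustration (a sketch, not the paper's toy model): assign each of 512 features a random unit direction in a 128-dimensional space, activate two of them, and read every feature back out by dot product. The active features read out near 1, but every inactive feature picks up non-zero interference, which is what polysemanticity looks like from the readout side.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 512, 128                      # 512 features, only 128 dimensions

# One random unit direction per feature: the features must share the space.
W = rng.standard_normal((m, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only two of the 512 features are active.
f = np.zeros(m)
f[[3, 11]] = 1.0

x = f @ W                # the superposed activation, lives in 128 dims
f_hat = x @ W.T          # naive readout: dot product with each feature direction

active = f_hat[[3, 11]]                          # reads out near 1
interference = np.abs(np.delete(f_hat, [3, 11]))  # non-zero noise on the rest
print(active, interference.mean())
```

The interference stays small only while inputs are sparse, which is exactly the regime the toy-models paper identifies as favouring superposition.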
Sparse Autoencoders — the practical follow-up
If features are stored in superposition, the natural decoding tool is one that learns a sparse, overcomplete basis for activations:
The SAE recipe
Train an autoencoder on a layer's activations, with the hidden dimension larger than the activation dimension and an L1 penalty enforcing sparsity. The encoded features are an attempt to recover the underlying "interpretable basis" superposition hides.
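The recipe fits in a short script. The sketch below trains a tiny SAE with hand-written gradients on synthetic "activations" generated from sparse ground-truth features; all dimensions and hyperparameters here are illustrative, and real SAE work uses an autodiff framework and far larger models.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 1024, 16, 64               # n samples, d activation dims, h SAE features

# Synthetic "layer activations": sparse ground-truth features mixed into d dims.
F = rng.standard_normal((h, d)) / np.sqrt(d)
codes = rng.random((n, h)) * (rng.random((n, h)) < 0.05)   # ~5% of features active
X = codes @ F

# SAE: f = relu(x W_e + b), reconstruction f W_d; L2 recon loss + L1 sparsity.
W_e = 0.1 * rng.standard_normal((d, h))
W_d = 0.1 * rng.standard_normal((h, d))
b = np.zeros(h)
lam, lr = 1e-3, 0.02

def sae_loss():
    f = np.maximum(X @ W_e + b, 0.0)
    return ((f @ W_d - X) ** 2).mean() + lam * np.abs(f).mean()

loss_before = sae_loss()
for _ in range(500):
    f = np.maximum(X @ W_e + b, 0.0)
    err = 2 * (f @ W_d - X) / X.size                          # grad of the MSE term
    g_f = (err @ W_d.T + lam * np.sign(f) / f.size) * (f > 0)  # relu gate
    W_d -= lr * f.T @ err
    W_e -= lr * X.T @ g_f
    b -= lr * g_f.sum(0)
loss_after = sae_loss()
print(loss_before, loss_after)
```

The key structural choices are visible even at this scale: the hidden dimension is larger than the activation dimension (overcomplete), and the L1 term pushes most features to zero on any given input.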
What SAEs surface
Specific features — "this fires on Python list comprehensions," "this fires on French inflection patterns," "this fires on emotional valence in the input." Some features cleanly correspond to human-interpretable concepts; others don't, and the proportion that's interpretable is itself a research question.
Scaling progress
From toy two-layer transformers, to GPT-2-small, to Pythia, to Claude 3 Sonnet (2024). Templeton et al.'s "Scaling Monosemanticity" was the first published demonstration that the technique can be applied to a frontier model, with millions of features extracted.
What SAEs enable downstream
Feature-level intervention (clamp a feature to zero, see what changes), feature-level monitoring (alarm on suspicious feature activation), feature-level model comparison. Emerging applications include detecting deceptive features, identifying learned shortcuts, and probing specific cognitive operations.
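The simplest of these interventions, clamping a feature to zero, is a few lines once you have encoder and decoder weights. The sketch below uses random weights with a pseudoinverse as a stand-in decoder, purely to show the mechanics: edit in feature space, decode the difference, patch it back into the activation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 16, 64

# Stand-in SAE weights; in practice these come from a trained SAE.
W_e = rng.standard_normal((d, h)) / np.sqrt(d)   # encoder
W_d = np.linalg.pinv(W_e)                        # pseudoinverse as a toy decoder

x = rng.standard_normal(d)                       # a model activation at some layer
f = np.maximum(x @ W_e, 0.0)                     # SAE feature activations

target = int(f.argmax())                         # the feature to suppress
f_edit = f.copy()
f_edit[target] = 0.0                             # clamp it to zero

# Decode the edit and patch the difference back into the activation.
x_edit = x + (f_edit - f) @ W_d
print(f[target], np.maximum(x_edit @ W_e, 0.0)[target])
```

The edited activation `x_edit` then replaces `x` in the forward pass; re-encoding it shows the target feature's activation has dropped while the rest of the vector is mostly untouched.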
The honest assessment
SAEs are the most-active interpretability research line in 2026. Progress has been faster than the skeptical position from Chapter 6 expected. There are also genuine open questions: how complete the recovered basis is, whether scaling to truly-frontier models with full coverage is feasible, and how to validate that recovered features mean what we think they mean (reconstruction loss + interpretability are not the same thing).
Honest portfolio view. This is the interpretability research line with the strongest current trajectory. Someone specialising here would learn the SAE training recipe, develop intuition for sparsity-quality trade-offs, build feature-evaluation pipelines, and understand the gap between “we found N features” and “we have a faithful account of the model.” The work is concrete, buildable in a notebook, and increasingly relevant to production decisions.
Direction 4 — Activation Steering / Contrastive Activation Addition
The Rimsky et al. paper Steering Llama 2 via Contrastive Activation Addition (CAA) is the most accessible single demonstration that you can change a model's behaviour without retraining, by directly editing its activations at inference time.
The recipe is simple enough to describe in three lines:
- Collect contrastive prompt pairs for the target behaviour (e.g. a sycophantic completion vs. a matched non-sycophantic one).
- At a chosen layer, average the difference in activations between the two sides of each pair; that average difference is the steering vector.
- At inference, add a scaled copy of the vector to the activations at that layer: positive scale amplifies the behaviour, negative scale suppresses it.
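On synthetic activations, the whole recipe is a few lines of vector arithmetic. In this sketch the "behaviour direction" is planted by construction; in the real method the activations come from a model's residual stream at a chosen layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# A hidden "behaviour direction", planted for the demo.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# Step 1: activations for contrastive pairs; the positive side exhibits the behaviour.
pos = rng.standard_normal((32, d)) + 3.0 * direction
neg = rng.standard_normal((32, d))

# Step 2: the steering vector is the mean activation difference.
steer = pos.mean(0) - neg.mean(0)

# Step 3: at inference, add a scaled copy; negative alpha suppresses the behaviour.
alpha = -1.0
h = rng.standard_normal(d) + 3.0 * direction     # an activation showing the behaviour
h_steered = h + alpha * steer

proj_before, proj_after = h @ direction, h_steered @ direction
print(proj_before, proj_after)
```

The steered activation's projection onto the behaviour direction drops toward zero, which is the mechanism behind "move the behaviour along a single direction" in the findings below.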
What CAA actually demonstrates:
Behaviours move along a single direction
For many behaviours — sycophancy, refusal/compliance, AI-takeover-related concepts, hallucination — there's a single linear direction in activation space whose magnitude controls the behaviour. This is striking; it's not what you'd naively expect from a deeply non-linear network.
No retraining, no weight changes
The intervention happens at inference. The weights are untouched. This makes CAA a natural complement to fine-tuning — cheap, fast, surgical, and reversible. You can deploy a single set of weights and apply different steering vectors per use case.
It's also evidence about representation
CAA's effectiveness is itself a piece of interpretability evidence. The fact that "be more refusing" can be encoded as a single activation-space direction tells you something about how the model represents that behaviour internally. CAA and SAE are complementary tools for this.
Caveats are real
Steering vectors don't generalise perfectly out of their construction distribution. Strong steering can degrade general capability. The technique works on some behaviours and not others. And — most importantly — steering can't enforce behaviour on adversarial inputs that you didn't include in the contrastive set.
CAA in the broader picture
Activation steering occupies a useful niche: cheaper than fine-tuning, more surgical than prompting, more interpretable than opaque post-training. It’s not a substitute for the other directions — it’s a complementary tool. The strongest deployments will likely combine activation-level interventions with feature-level monitoring (SAEs), behavioural evals (model organisms), and training-time alignment (RLHF / constitutional methods).
Honest portfolio view. This is the most accessible focal direction for someone with limited compute. You can replicate CAA on a 7B model on a single GPU. The experimental loop is fast; the literature is small enough to read end-to-end; the open questions (what behaviours steer linearly? when does steering generalise? does it stack with other methods?) are concrete and unsolved. Excellent on-ramp into both interpretability and post-training intervention.
The Wider Portfolio — Brief Tour
Beyond the four focal directions, several other research lines belong on the field map.
Scalable Oversight
Debate, recursive reward modelling, market-based oversight, AI-assisted critique. Premise: humans can't evaluate frontier-model output directly; help them with structured AI assistance. Active programs at OpenAI, Anthropic, GDM. Theory of impact: preserve outer-alignment leverage as capability passes human level.
AI Control
Greenblatt et al. (Redwood). Premise: assume the model may be misaligned; design deployment so safety doesn't depend on alignment. Tools: monitoring, redaction, untrusted-action detection, capability-restricted permissions. Theory of impact: contain potentially-misaligned capability through engineering.
Model Organisms / Evals
Sleeper Agents, Alignment Faking, capability evals, propensity evals, autonomous-replication evals. Premise: build measurable, reproducible empirical evidence of alignment-relevant phenomena so research and governance have grounded claims to argue from.
Representation Engineering
Zou et al. (2023). Generalises CAA-style steering into a broader programme of probing, intervening on, and reasoning about high-level representations. Often paired with concept-erasure and probe-based monitoring.
Eliciting Latent Knowledge (ELK)
The ARC problem statement. How do you train a model to report what it actually believes, when behaviorally indistinguishable models can have very different latent beliefs? Closer to foundations than to immediate engineering.
Corrigibility Engineering
Building (and maintaining across training) the property that a system accepts being modified, shut down, and overridden. Connects to RLHF, RSPs, and control. The lever that breaks the deceptive-alignment equilibrium (Ch. 3).
Picking a Focus — Practical Notes
If the goal is to specialise in one direction over the next few months, here's how the four focal directions, plus two neighbours from the wider portfolio, stack up on the things that actually matter for that decision:
Agent Foundations
Background needed: Mathematical maturity, comfort with formal reasoning. Less ML / coding required. Day-to-day: Reading and writing proofs, refining concept definitions, problem framing. Failure mode: Decoupling from anything empirically testable — a real trap in this area. Best fit: Researchers who want to work on the conceptual layer and accept long impact lags.
Goal Misgeneralisation
Background needed: Standard ML. RL helps. Day-to-day: Building toy environments, training small models, evaluating on shifted distributions. Failure mode: Toys that don't generalise to LLM-scale phenomena. Best fit: Researchers who want concrete experiments with clean inner-alignment connections, and are happy on small-to-medium models.
Superposition / SAEs
Background needed: Solid ML. Familiarity with autoencoders. Some compute budget for non-toy work. Day-to-day: Training SAEs, evaluating feature interpretability, scaling to larger models. Failure mode: Beautiful feature visualisations that don't downstream into safety claims. Best fit: Researchers who want to be at the active frontier of mechanistic interpretability.
Activation Steering / CAA
Background needed: Standard ML. Single GPU is enough for a lot of useful work. Day-to-day: Building contrastive pair datasets, computing and validating steering vectors, evaluating side effects. Failure mode: Treating steering as if it generalises; over-claiming downstream safety relevance. Best fit: Researchers who want a fast experimental loop with concrete behavioural outputs.
Scalable Oversight
Background needed: ML, plus comfort designing experimental protocols and human-eval setups. Day-to-day: Designing debate / decomposition protocols, running them with humans + models, evaluating quality. Failure mode: Protocols that work on toy tasks but don't scale. Best fit: Researchers interested in the human-AI evaluation interface.
AI Control / Evals
Background needed: ML + security mindset. Day-to-day: Designing red-teams, monitoring schemes, evaluation suites. Failure mode: Treating control as a substitute for alignment instead of as a complement. Best fit: Researchers with a security-engineering bent.
The general advice for picking. Spend a week reading across the field map. Then pick the one whose premise you find most personally defensible — not the one that’s most fashionable. You’ll do better work where you actually agree with the underlying bet, because the daily research decisions are easier to make. And rotate every six to twelve months if you want to build breadth across the field map; specialisation is for the current project, not the career.
What This Implies for Practice
Hold the field map in your head
Don't conflate one research direction with "alignment." Each is a bet with a specific premise. Knowing the map keeps you from being overcommitted to a sub-bet that might be losing while another is winning.
Specify the theory of impact
For whichever direction you pick, write the causal story from "this succeeds" to "expected harm decreases" in plain English. If you can't, you have a research aesthetic, not a research programme.
Build empirical anchors
Theory without empirical contact drifts. Empirical work without conceptual framing produces noise. The strongest projects in any direction sit where conceptual claims meet measurable predictions — goal misgeneralisation experiments, SAE feature evaluations, CAA generalisation studies.
Combine, don't isolate
Real defence-in-depth uses several directions together: SAEs to monitor, CAA to intervene, evals to measure, control to deploy safely, RLHF as the baseline. Specialise in one for the project, but understand how it stacks with the others for the deployment.
Test theories against critiques
Each direction has its own strongest critics (Ch. 6). Read them, take the steelmanned form seriously, and let the critique sharpen your own work — both what you do, and how you write up its limits.
Plan for impact-lag mismatch
Foundations work has decade-scale lags. Empirical interpretability has multi-year lags. CAA-style work can have months-scale lags. Mismatching your direction's natural impact timeline against the time you have produces frustration and bad research. Pick the one whose lag matches your horizon.
At a Glance
The alignment field is a portfolio: agent foundations, empirical alignment, mechanistic interpretability, control, evals, and steering, each pursuing a different bet about which problem is binding. This chapter laid out four focal directions in depth — agent foundations, goal misgeneralisation, superposition / SAEs, and activation steering — and placed them alongside the wider portfolio to give a workable field map.
No single direction has a track record strong enough to bet the field on. Diversification across premises and failure modes is the response. For an individual researcher, depth in one direction beats breadth across all — but the depth choice should be informed by knowing what else exists, what each direction is and isn't trying to do, and whose theory of impact you actually find defensible.
Hold the field map in your head; pick a focus whose premise you defend personally; write the theory of impact down; build at least one empirical anchor; combine methods at the deployment layer even when specialising at the project layer; and engage your direction's strongest critic on the merits, not in spirit.
Key Takeaways
- Alignment is a portfolio, not a single program. Each major direction — foundations, empirical alignment, interpretability, control, evals, steering — encodes a different bet about what the binding constraint is. Treating them as a single thing obscures the actual disagreements that drive the field.
- Agent foundations targets the conceptual layer. It’s the slowest direction in terms of obvious deployment impact and the fastest in terms of impact when a load-bearing concept is missing. A small but non-zero allocation makes sense; betting the whole portfolio on it doesn’t, and ignoring it doesn’t either.
- Goal misgeneralisation is the empirical face of inner alignment. A correct reward function is not enough — when training distribution leaves multiple goals consistent with low loss, the model can pick the wrong one. The CoinRun example is small but the structural recipe applies up to LLM scale.
- Superposition explains polysemanticity. Sparse autoencoders are the active practical tool for recovering interpretable features. Progress has been faster than the skeptical position from Chapter 6 expected; production-frontier coverage is still open. This is the most-active interpretability research line in 2026.
- Activation steering is the most accessible intervention method. CAA-style steering vectors compute cheaply, work without retraining, and reveal how surprisingly linear some behaviour representations are. Useful as a tool, useful as evidence about representation, dangerous if treated as a panacea.
- The wider portfolio matters for context. Scalable oversight extends outer alignment past human evaluator capability. Control assumes alignment may be unverifiable and engineers around it. Evals and model organisms are the empirical foundation under everything else. Each direction belongs on the field map.
- Specialisation requires alignment with the bet. Pick a direction whose premise you defend personally, not the one that’s most fashionable. The daily research decisions are easier when you agree with the foundation; that’s both an aesthetic and a productivity claim.
- Theories of impact deserve writing down. Across every direction, the discipline from Chapter 6 applies: write the causal story from “this succeeds” to “expected harm decreases” in plain English. If you can’t, you have a research aesthetic, not a research program. If you can, you have something a critic can attack on the merits — which is exactly what makes the program improvable.
Further Reading
- Wentworth, “Why Agent Foundations? An Overly Abstract Explanation” — the cleanest single-page case for the foundations programme.
- Langosco, Koch, Sharkey et al., “Goal Misgeneralisation in Deep Reinforcement Learning” (2022); Shah et al., “Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals” — the empirical anchor for inner-alignment failures.
- Elhage et al., “Toy Models of Superposition” (Anthropic, 2022) — the framework underlying current SAE work.
- Bricken et al., “Towards Monosemanticity” (Anthropic, 2023); Templeton et al., “Scaling Monosemanticity” (Anthropic, 2024) — the SAE-on-real-LLMs progression.
- Rimsky et al., “Steering Llama 2 via Contrastive Activation Addition” (2024) — the canonical CAA paper.
- Zou et al., “Representation Engineering: A Top-Down Approach to AI Transparency” (2023) — the broader context for activation-level interventions.
- Greenblatt, Shlegeris et al., “AI Control: Improving Safety Despite Intentional Subversion” (Redwood, 2023) — the canonical control statement.
- ARC, “Eliciting Latent Knowledge” — the cleanest formal problem statement for an inner-alignment subproblem.