ARTICLE · 13 MIN READ · APRIL 15, 2026
Chapter 10: Contributing to AI Safety — Paths, Skills, and Getting Started
The hardest part of contributing to AI safety isn't picking the right research direction — it's picking a route into the field that matches your background and timeline. This chapter is the field map for that decision.
Why a Career-Path Chapter
The previous nine chapters laid out what AI safety is — the technical case, the alignment portfolio, the security surface, the governance architecture, the evaluation discipline, the control paradigm, and the strongest critiques. This chapter answers a different question:
Given all of that, how do you actually contribute?
The answer isn’t a single path. AI safety is a young field with several distinct routes in — research, engineering, policy, evaluation, independent research, advocacy. Each route has different background requirements, different day-to-day work, different organisations to apply to, and different ways of making real impact.
The structure of this chapter:
- Roles and routes — the live categories of work in AI safety as of 2026, with what each actually involves day-to-day.
- Where the work happens — frontier labs, AI Safety Institutes, evaluation orgs, governance organisations, independent research, academia.
- Fellowships and structured programs — the on-ramps that let you trial-run a direction before committing.
- Skill stacks by route — what to build if you’re targeting research vs. engineering vs. policy.
- The 1-pager / project proposal — a standard exercise that forces you to make a concrete plan.
- Common failure modes — the ways people new to the field accidentally waste their first six months.
This is more practical and less theoretical than previous chapters. The intent is for you to leave with a concrete next-step list, not a philosophical disposition.
Frontier lab: The major commercial developers of frontier AI — Anthropic, OpenAI, Google DeepMind, xAI, Meta, plus a handful of others. Each has internal safety, alignment, and security teams.
AI Safety Institute (AISI): Government bodies established to evaluate frontier AI for public-interest safety. UK AISI (founded 2023), US AISI (2024), with counterparts in Japan, Singapore, Korea, Canada, and the EU. Hire from a mix of ML, policy, and security backgrounds.
Evaluation organisation: Non-profits and small labs whose primary work is independent AI capability and safety evaluation. METR, Apollo Research, Redwood Research, MITRE ATLAS, and others.
Governance organisation: Think tanks and policy shops working on AI policy: GovAI, RAND, CSET, IAPS, AI Now Institute, plus academic centres at Oxford, Cambridge, Stanford, etc.
Independent / Distillation researcher: Individuals working outside formal institutions, supported by grants (Open Phil, Survival & Flourishing Fund, LTFF) or fellowships. Often write on Alignment Forum, LessWrong, or in self-published reports.
MATS: ML Alignment & Theory Scholars program — ~3-month research fellowship pairing scholars with active alignment researchers as mentors. One of the most-respected on-ramps for technical research.
ARENA: Alignment Research Engineer Accelerator — ~6-week intensive program for ML engineering with an alignment focus. Practical, project-driven.
1-pager: A standard exercise used by many fellowship and research programs: write a single-page document describing a specific contribution you intend to make — the problem, your approach, the artifact, the impact. Forces concreteness.
The Roles — What People Actually Do
There are six distinct kinds of work, with different day-to-day rhythms and different paths in.
1. Alignment Research
Frontier research on the alignment problem — interpretability, scalable oversight, agent foundations, RLHF and successors, control. Typical day: experiments on a model, paper writing, internal review, occasional external collaboration. Output: papers, internal research notes, sometimes deployable techniques.
2. Research Engineering
Building the infrastructure that makes research possible — eval harnesses, training pipelines, interpretability tooling, agent scaffolding. Often the bottleneck for research progress. Day: ML engineering, with safety as the application domain. Output: codebases, deployed systems, internal tooling.
3. Evaluations / Red-Teaming
Designing and running evals on frontier models. The science-of-evals craft from Ch. 8 in practice. Day: building threat models, writing eval scripts, analysing results, working with developers on findings. Output: eval suites, evaluation reports, system-card contributions.
4. Policy / Governance
Translating technical AI realities into policy proposals, legal frameworks, and standards. Day: writing policy briefs, engaging regulators, technical advising for legislation, comparative analysis of regimes. Output: policy papers, regulatory comments, draft standards, briefings.
5. AI Security / Control Engineering
Building the operational defenses around deployed AI — model-weight security, classifier-based filtering, deployment pipelines, incident response. Day: classical security engineering applied to AI artifacts and APIs. Output: hardened systems, incident playbooks, security postures.
6. Distillation / Communication
Writing, teaching, course design, blog posts that translate technical AI safety for broader audiences — fellow researchers in adjacent fields, policymakers, general public. Day: reading widely, writing carefully, often combining with one of the other roles. Output: explanatory writing, course material, well-curated resource collections.
The honest observation about role boundaries. Most senior people in the field do two or more of these. A research engineer who writes a Distill-style piece becomes a distillation contributor. A policy person who runs an eval becomes an evaluator. The roles describe modes of work; careers describe combinations over time.
Where the Work Happens — A Field Map
What changed in 2024–2026. The number of AISIs went from one (UK, late 2023) to roughly a dozen by 2026. Government-backed evaluation became a real career path. The frontier labs’ safety teams roughly doubled. Independent funding (Open Phil’s safety-focused tracks) became more selective and more competitive. The field hires more people than it did three years ago, and demands more of candidates to clear the bar.
Fellowships and Structured Programs — The On-Ramps
The fastest way to validate that a route is right for you, without a multi-year commitment, is one of the structured programs. The major ones:
MATS — ML Alignment & Theory Scholars
~3 months, research-mentor-paired. Cohort-based. Selects scholars matched to specific senior researchers for a focused research project. One of the most respected on-ramps; alumni regularly transition to research roles at frontier labs and eval orgs.
ARENA — Alignment Research Engineer Accelerator
~6 weeks, intensive ML engineering with alignment focus. Project-heavy. Designed for engineers who want to work on alignment but need to ramp on alignment-specific tooling and concepts.
GovAI Fellowship
Multi-month policy research fellowship at the Centre for the Governance of AI (Oxford). Produces a substantive policy paper. Strong alumni record at AISIs, government, and policy think tanks.
Introductory reading-group programs
~6-8 week structured curricula offered by several AI-safety education non-profits. Reading-discussion based; cohort-driven. The most accessible introductory programs; many participants then go on to MATS, ARENA, or direct hires.
SERI MATS / SPAR / similar
A growing ecosystem of cohort programs with similar structures: applicants paired with mentors, time-bounded research project, demonstrated output. Programs vary in selectivity and focus area.
Direct lab residencies / internships
Anthropic, OpenAI, GDM all have residency programs aimed at ML researchers and engineers. Higher bar than open-application fellowships; full-time pay; routinely convert to permanent roles.
The general advice on fellowships. Apply to multiple. Treat the structured program less as a job and more as a six-week to three-month investment in figuring out whether the route works for you. If it does, you have demonstrated output to point at when applying for permanent roles. If it doesn’t, you’ve ruled out a path at far lower sunk cost than a full-time job switch would have carried.
Skill Stacks — What to Build, by Route
The honest answer to “what should I learn?” depends on which route you’re targeting. The general-purpose advice “learn ML and read alignment papers” is correct but unhelpful. More specifically:
For Alignment Research
ML fundamentals + transformer internals
You should be able to implement a transformer from scratch in PyTorch and explain why each piece exists. Karpathy's "Let's build GPT" is the benchmark exercise. If you can't build GPT-2 in a notebook, you're not ready for alignment research yet.
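If the Karpathy exercise feels distant, here is its core in miniature: a single-head causal self-attention forward pass, written in NumPy so the arithmetic stays visible (the PyTorch version is structurally identical). The shapes and random weights are purely illustrative, not from any real model.

```python
import numpy as np

def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray,
                   w_v: np.ndarray) -> np.ndarray:
    """Single-head causal self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project to queries/keys/values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # scaled dot-product similarity
    mask = np.triu(np.ones_like(scores), k=1)  # forbid attending to the future
    scores = np.where(mask == 1, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                         # weighted sum of value vectors

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
w_q, w_k, w_v = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, w_q, w_k, w_v)
assert out.shape == (T, d)
```

If you can explain why each line exists (the scaling, the mask, the softmax normalisation), you have the conceptual core; the rest of the Karpathy exercise is stacking this into blocks and training it.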
Read 30+ alignment papers carefully
Not survey-skim. Carefully — being able to summarise the contribution, methodology, and weaknesses. The MATS reading list and curated lists from major alignment educators are good starting points; this playlist's "further reading" sections are denser still.
Replicate a small alignment experiment
Pick one paper, replicate the smallest meaningful experiment from it. CAA on Llama-2 (Ch. 7), a small SAE on Pythia, a goal-misgeneralisation toy environment. Producing a working replication is more useful than reading 10 more papers without one.
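To show what "smallest meaningful experiment" can mean, the arithmetic at the heart of contrastive activation addition (CAA) fits in a few lines. The sketch below uses random arrays as stand-ins for residual-stream activations; a real replication would capture them from Llama-2 with forward hooks.

```python
import numpy as np

# Toy CAA sketch: the "activations" are synthetic stand-ins for
# residual-stream captures at one transformer layer.
rng = np.random.default_rng(0)
d_model = 16

pos_acts = rng.normal(loc=0.5, size=(20, d_model))   # e.g. behaviour-present prompts
neg_acts = rng.normal(loc=-0.5, size=(20, d_model))  # matched behaviour-absent prompts

# CAA steering vector: difference of mean activations across contrastive pairs
steer = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steered(activation: np.ndarray, alpha: float) -> np.ndarray:
    """At inference, add the scaled steering vector to the layer's activations."""
    return activation + alpha * steer

h = rng.normal(size=d_model)
assert np.allclose(steered(h, 0.0), h)  # alpha=0 leaves activations unchanged
```

The replication work is in the capture and evaluation plumbing around this, which is precisely why doing it teaches more than reading about it.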
Mathematical maturity
Linear algebra, probability, optimisation, basic functional analysis. For agent foundations and interpretability theory, more — measure theory, category theory, information theory at non-trivial depth. Not strictly required for empirical work, load-bearing for theoretical work.
For Research Engineering
Production ML engineering
Distributed training, GPU-cluster operations, training-pipeline reliability, JAX/PyTorch at scale. The skills that make you productive at an actual lab. Often the binding constraint, not the theory.
Tooling fluency
Weights & Biases, Neptune, or similar experiment tracking. Hugging Face transformers, datasets, accelerate. Common eval harnesses (Inspect AI, EleutherAI's lm-evaluation-harness). The infrastructure that ML labs run on.
Code quality discipline
Type-checked, tested, well-structured. Research engineering at frontier labs is the difference between a notebook that works once and a system that 30 researchers use daily. Care about correctness, latency, and maintainability.
Cross-functional collaboration
Research engineers sit between researchers and infrastructure. The technical fluency is necessary but not sufficient — you need to translate, prioritise, and unblock. Soft skills are part of the role.
For Evaluations / Red-Teaming
Threat modeling
The eval-design question is "what specific risk are we trying to measure?" Strong evaluators come with crisp threat models — a story about who the adversary is, what they want, and how the model could enable them.
Empirical methodology
The Hobbhahn science-of-evals discipline (Ch. 8) — elicitation discipline, statistical sufficiency, adversarial robustness, calibration. Most evals fail at methodology before they fail at concept.
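On statistical sufficiency specifically, one habit worth internalising is putting a confidence interval on every pass rate before drawing conclusions. A minimal sketch using the Wilson score interval (a standard choice, not specific to any eval framework):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial pass rate —
    a minimum-viable answer to 'did we run enough samples?'"""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half

lo, hi = wilson_interval(7, 10)      # 7/10 passes: wide interval, weak evidence
lo2, hi2 = wilson_interval(70, 100)  # same rate, 10x samples: much tighter
assert hi - lo > hi2 - lo2
```

A "70% pass rate" on ten samples is consistent with anything from roughly 40% to 90%; the eval that reports the interval, not just the point estimate, is the one doing methodology.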
Domain knowledge
Cyber evals need cyber knowledge. Bio evals need bio knowledge. Persuasion evals need behavioural-science knowledge. Generic ML credentials are not enough; the eval's quality depends on the evaluator understanding the threat domain.
Writing and reporting
Eval reports are the artifact, not the evaluation. A good evaluator writes results that policymakers, researchers, and developers can actually use. Eval work that doesn't communicate clearly never lands.
For Policy / Governance
Technical literacy at policy depth
You don't need to train models, but you need to read AI papers without panic. Be able to translate "the model passed the eval" or "the RSP was triggered" into something a regulator can act on. The technical-translator role is rare and high-leverage.
Regulatory analogues
Understand how aviation, pharma, finance, and nuclear regulation actually work. The good policy proposals borrow from these; the bad ones reinvent. Your value-add is often comparative analysis between regulatory traditions.
International coordination
Treaties, export controls, mutual recognition, AISI networks. The Ch. 5 territory. International law / international relations background helps; not strictly required, but the senior people you'll work with all have some.
Policy writing
Briefs, white papers, regulatory comments, draft text. The policy product is writing. Strong policy work compounds into influence; weak writing dissipates regardless of underlying analysis quality.
For AI Security / Control
Classical security foundations
Threat modeling, OWASP-tier knowledge of common attacks, hardware-rooted trust, supply-chain security. Ch. 4's argument: AI security is computer security, with new specifics. Bring the classical foundation.
ML-specific attack surfaces
Jailbreaks (GCG, persona, encoding), prompt injection, model extraction, adversarial examples, glitch tokens, backdoored checkpoints. Ch. 4 and Ch. 9 territory; build hands-on familiarity, not just literature awareness.
Defense engineering
Constitutional Classifiers, sandbox design, capability-permissioned tools, audit logging. Building defences that are deployable at scale, not just demonstrable in a notebook.
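As a sketch of the classifier-gating pattern (the control flow only, not any lab's actual implementation), the shape of the work looks like this; `toy_score` is a hypothetical stand-in for a trained harm classifier:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateDecision:
    allowed: bool
    score: float
    reason: str

def classifier_gate(text: str,
                    score_fn: Callable[[str], float],
                    threshold: float = 0.5) -> GateDecision:
    """Block-at-threshold gate around a harm classifier. score_fn is any
    callable returning a harm score in [0, 1]; in a real deployment it is
    a trained model, and the decision is audit-logged."""
    score = score_fn(text)
    if score >= threshold:
        return GateDecision(False, score, "classifier score above threshold")
    return GateDecision(True, score, "below threshold")

# Hypothetical stand-in scorer; a production system scores with a model.
toy_score = lambda t: 0.9 if "forbidden" in t else 0.1

assert classifier_gate("benign request", toy_score).allowed
assert not classifier_gate("forbidden request", toy_score).allowed
```

The engineering challenge is everything around this ten-line core: keeping latency acceptable at scale, calibrating the threshold, and logging decisions so incident response has a trail.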
Incident response posture
Runbooks, monitoring, post-incident review. The work doesn't stop at deployment; the operational layer is where most real safety value lives.
The 1-Pager — Forcing Concreteness
The single most useful exercise that GovAI and similar programs assign: write a one-page document describing a specific contribution you intend to make.
Why the exercise works. Most “I want to work on AI safety” plans dissolve at step 2 — you can’t specify the approach because you haven’t picked a narrow enough problem. The 1-pager forces narrowing. Forces theory of impact. Forces an actual deliverable. Most people who write their first 1-pager realise they’ve been thinking at the wrong grain; the exercise is doing its job when that happens.
The secondary use: the 1-pager is your application material. MATS, ARENA, GovAI, and direct lab applications all evaluate something close to this. If you have a strong 1-pager and demonstrated execution toward it, you’re a far more compelling candidate than someone with strong general credentials and no concrete plan.
Common Failure Modes for People New to the Field
The patterns that waste the first six months for people getting into AI safety:
Permanent reading mode
Reading is necessary; reading without producing an artifact is decorative. After 4-6 weeks of reading, you should be replicating a paper or drafting a 1-pager. People who've been "reading up on alignment" for a year without output are not actually closer to contributing than they were at three months.
Fix: produce something — a notebook, a write-up, a 1-pager — by month 2
Picking the wrong grain
"I want to solve alignment" and "I want to improve the calibration of cyber-uplift evals on Anthropic's pre-deployment suite" are the same career question at different grain. The first is unactionable; the second can be done in a fellowship. Narrow until you have something tractable.
Fix: keep narrowing until your project fits in a 1-pager
Optimising for prestige over fit
"I want to work at Anthropic" is a destination, not a plan. The route depends on your background and interests; sometimes the more direct route is METR or an AISI rather than a frontier lab. Pick the right organisation for the work, not the brand.
Fix: pick the route that fits your skills, not the one that sounds best at parties
Skipping the ML fundamentals
For technical roles, the failure mode is wanting to do interpretability without being able to implement a transformer, wanting to do RLHF without understanding PPO, wanting to do agent foundations without the math. The fundamentals are not optional; the alignment-specific layer goes on top of them.
Fix: build the ML stack first; alignment-specific work is layer 2
Direction-jumping every two months
Interpretability one month, governance the next, control the third. Each direction has a learning curve; jumping resets it. Pick a direction, give it 3-6 months of real engagement before deciding it's not for you.
Fix: commit to one direction for at least one fellowship cycle
Working in isolation
Alignment is a small field. Most progress happens through conversations, mentor relationships, code review, and Alignment Forum discussion. Working alone in a notebook for 6 months without external feedback usually produces worse output than 4 months with structured feedback.
Fix: get into a fellowship cohort or an explicit mentor relationship
A Concrete Three-Month Plan for Someone Just Starting
For a hypothetical reader at the end of this playlist who wants to convert reading into contribution. Three months, calibrated to be ambitious-but-achievable for someone with a CS undergrad and curiosity.
Month 1 — Fundamentals + orientation
Karpathy's "Let's build GPT" end-to-end. ARENA-style transformer-from-scratch implementation. Read 10 papers from the further-reading lists across Chapters 1, 3, 7, 8 — choose those that match your interest. Write 1-paragraph summaries of each. Apply to MATS / ARENA / GovAI / an introductory reading-group program for the next cohort.
Month 2 — Pick a direction; replicate
From Chapter 7's portfolio, pick one focal direction that genuinely interests you. Replicate the smallest meaningful experiment from a key paper in that direction (CAA on Llama-2; a small SAE on Pythia; a goal-misgeneralisation environment from the DeepMind paper). Write up the replication as a public notebook. Start drafting your 1-pager.
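To make the SAE option concrete, here is a toy sparse autoencoder trained on synthetic data in NumPy. Everything here is a stand-in: a real replication would capture residual-stream activations from Pythia and tune the sparsity coefficient; this sketch exists only to show the training loop's shape.

```python
import numpy as np

# Toy sparse autoencoder (SAE): random "activations" stand in for
# residual-stream captures from a real model such as Pythia.
rng = np.random.default_rng(0)
d_model, d_hidden, n = 8, 32, 256
acts = rng.normal(size=(n, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_enc = np.zeros(d_hidden)
l1, lr = 1e-3, 1e-2          # sparsity coefficient and learning rate

def forward(x: np.ndarray):
    h = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations
    return h, h @ W_dec                      # features, reconstruction

losses = []
for _ in range(200):
    h, recon = forward(acts)
    err = recon - acts
    losses.append((err ** 2).mean() + l1 * np.abs(h).mean())
    # Manual gradients of mean squared error + L1 sparsity penalty
    gW_dec = h.T @ err * (2 / err.size)
    gh = (err @ W_dec.T * (2 / err.size) + l1 * np.sign(h) / h.size) * (h > 0)
    W_enc -= lr * (acts.T @ gh)
    b_enc -= lr * gh.sum(axis=0)
    W_dec -= lr * gW_dec

assert losses[-1] < losses[0]   # training reduces the combined loss
```

Getting from this toy to a working replication on real activations is exactly the month-2 exercise: the loop is simple; the data plumbing and the interpretation of the learned features are where the learning happens.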
Month 3 — Original contribution + visibility
Take your replication one step further — a small variation, a new evaluation, a comparison to a different model. Get feedback from someone in the field (Alignment Forum post; mentor conversation; submission to a workshop). Refine the 1-pager based on what you learned. Apply to permanent roles or further fellowships armed with concrete output.
Adjust as you learn
This plan is calibrated for the technical-research route. If you're targeting policy or governance, replace "replicate the paper" with "write a policy brief on [specific RSP/regulatory question]." Same shape — read, narrow, produce, refine, apply — different artifacts.
The goal of the three-month plan. Not to “solve” anything. To convert the abstract intent of “I want to contribute to AI safety” into a concrete artifact that demonstrates you can. The artifact is what unlocks the rest of the field — fellowships, mentorships, hires, the next-step opportunities that compound from there.
What This Implies for Practice
Pick the route, not the destination
"AI safety" is six different jobs. Pick which one matches your background, interests, and the way you actually like to work day-to-day. The destination follows from the route.
Use fellowships to validate fit cheaply
Three months of MATS or ARENA or GovAI tells you whether the route works for you with much less downside than a multi-year commitment. If the fellowship goes well, you have output to leverage; if it doesn't, you've narrowed your search at low cost.
Convert reading to artifact early
Permanent reading mode is the most common first-year failure. Before month 3, you should have produced something — a replication, a write-up, a 1-pager. Reading without output isn't progress.
Narrow until your project fits a 1-pager
If you can't fit your project into the 1-pager structure, it's still too broad. Narrow ruthlessly. The 1-pager is the diagnostic for whether you're at the right grain.
Get feedback structurally, not occasionally
Mentor relationships, fellowship cohorts, Alignment Forum posts that solicit comments. The structural feedback loops are what produce good work. Working in isolation almost always produces worse output, slower.
Build the fundamentals you'll always need
For technical routes: ML, transformer internals, eval harnesses, training infrastructure. For policy routes: regulatory analogues, technical literacy, policy writing. For governance/security: classical security plus ML-specific. The fundamentals compound; the alignment-specific layer goes on top.
At a Glance
Contributing to AI safety means picking one of six distinct kinds of work — research, engineering, evaluations, policy, security, distillation — and a route through it that matches your background. Where the work happens: frontier labs, AI Safety Institutes, evaluation orgs, governance organisations, independent research, academia. The on-ramps are structured fellowships (MATS, ARENA, GovAI, intro reading-group programs, lab residencies); the diagnostic is the 1-pager exercise.
The previous nine chapters laid out what AI safety is. None of it matters if no one builds it. The field hires more people than it did three years ago and demands more to clear the bar; the route in is to validate fit through fellowships, build artifacts that demonstrate you can do the work, and apply with concrete output rather than general credentials.
Pick a route that matches how you like to work, not the destination that sounds best. Use a fellowship to validate the fit. Convert reading to artifact by month 3. Narrow your project until it fits a 1-pager. Get structured feedback. Build the fundamentals first; the alignment-specific layer goes on top.
Key Takeaways
- AI safety is six distinct kinds of work, not one. Research, engineering, evaluations, policy, security, distillation. Pick the route that matches your background and how you like to work — the destination follows.
- The organisational landscape is broader than “frontier labs.” AISIs, evaluation orgs (METR, Apollo, Redwood, ARC), governance shops (GovAI, RAND, CSET, IAPS), independent researchers, and academia all hire across roles. Optimise for fit with the work, not for prestige of the brand.
- Fellowships are calibrated risk. MATS, ARENA, GovAI, intro reading-group programs, lab residencies — all let you test a route in three to six months with much less commitment than a full job switch. Apply to multiple; treat the program as a fit-validation device.
- Skill stacks differ by route, but the fundamentals always come first. ML and transformer internals for technical work. Regulatory analogues and policy writing for governance. Classical security plus ML-specific for control. Build layer 1 first; alignment-specific work is layer 2.
- The 1-pager is the diagnostic. Problem, approach, artifact, theory of impact, timeline, resources. If your project doesn’t fit the structure, it’s not narrow enough yet. Most “I want to work on AI safety” plans dissolve at the approach step; the 1-pager forces resolution.
- Permanent reading mode is the most common first-year failure. After 4–6 weeks of reading, you should be producing artifacts — replicated experiments, write-ups, 1-pagers. Reading without output isn’t progress, no matter how much you read.
- Direction-jumping resets the learning curve. Each direction has 3–6 months of ramp before contribution is plausible. Commit to one for at least a fellowship cycle before deciding it’s wrong.
- Feedback infrastructure produces better work, faster. Mentor relationships, fellowship cohorts, Alignment Forum posts soliciting comments. The structural feedback loops are non-negotiable; isolation almost always produces worse output than collaborative work in the same time.
- Convert intent into artifact, then artifacts into roles. The transition from “I want to contribute” to “I am contributing” is mediated by demonstrable output. Replication notebooks, written 1-pagers, completed fellowship projects — these are what unlock the next-step opportunities that compound from there.
Further Reading
- 80,000 Hours, AI safety technical research career guide — the canonical generalist guide for technical AI safety careers.
- Alignment Forum, “How to pursue a career in technical AI alignment” — community-curated advice and examples.
- ML Alignment & Theory Scholars (MATS) program — application materials and alumni outputs are public; reading them gives you the concrete output bar.
- ARENA program materials — public, including detailed week-by-week curriculum.
- Centre for the Governance of AI (GovAI) Fellowship — application materials and prior fellow outputs.
- Andrej Karpathy, “Let’s build GPT: from scratch, in code, spelled out” — the canonical exercise for transformer fluency.
- Anthropic, OpenAI, GDM Careers pages — the active hiring landscape; reading current postings is the best signal of what skills the field is currently demanding.
- Open Philanthropy, AI Safety grants — the funding landscape for independent research.