ARTICLE · 13 MIN READ · APRIL 15, 2026
Chapter 10: Contributing to AI Safety — Paths, Skills, and Getting Started
The hardest part of contributing to AI safety isn't picking the right research direction — it's picking a route into the field that matches your background and timeline. This chapter is the field map for that decision.
Why a Career-Path Chapter
The previous nine chapters laid out what AI safety is — the technical case, the alignment portfolio, the security surface, the governance architecture, the evaluation discipline, the control paradigm, and the strongest critiques. This chapter answers a different question:
Given all of that, how do you actually contribute?
The answer isn’t a single path. AI safety is a young field with several distinct routes in — research, engineering, policy, evaluation, independent research, advocacy. Each route has different background requirements, different day-to-day work, different organisations to apply to, and different ways of making real impact.
The structure of this chapter:
- Roles and routes — the live categories of work in AI safety as of 2026, with what each actually involves day-to-day.
- Where the work happens — frontier labs, AI Safety Institutes, evaluation orgs, governance organisations, independent research, academia.
- Fellowships and structured programs — the on-ramps that let you trial-run a direction before committing.
- Skill stacks by route — what to build if you’re targeting research vs. engineering vs. policy.
- The 1-pager / project proposal — a standard exercise that forces you to make a concrete plan.
- Common failure modes — the ways people new to the field accidentally waste their first six months.
This is more practical and less theoretical than previous chapters. The intent is for you to leave with a concrete next-step list, not a philosophical disposition.
Frontier lab: The major commercial developers of frontier AI — Anthropic, OpenAI, Google DeepMind, xAI, Meta, plus a handful of others. Each has internal safety, alignment, and security teams.
AI Safety Institute (AISI): Government bodies established to evaluate frontier AI for public-interest safety. UK AISI (founded 2023), US AISI (2024), with counterparts in Japan, Singapore, Korea, Canada, and the EU. Hire from a mix of ML, policy, and security backgrounds.
Evaluation organisation: Non-profits and small labs whose primary work is independent AI capability and safety evaluation. METR, Apollo Research, Redwood Research, MITRE ATLAS, and others.
Governance organisation: Think tanks and policy shops working on AI policy: GovAI, RAND, CSET, IAPS, AI Now Institute, plus academic centres at Oxford, Cambridge, Stanford, etc.
Independent / Distillation researcher: Individuals working outside formal institutions, supported by grants (Open Phil, Survival & Flourishing Fund, LTFF) or fellowships. Often write on Alignment Forum, LessWrong, or in self-published reports.
MATS: ML Alignment & Theory Scholars program — ~3-month research fellowship pairing scholars with active alignment researchers as mentors. One of the most-respected on-ramps for technical research.
ARENA: Alignment Research Engineer Accelerator — ~6-week intensive program for ML engineering with an alignment focus. Practical, project-driven.
1-pager: A standard exercise used by many fellowship and research programs: write a single-page document describing a specific contribution you intend to make — the problem, your approach, the artifact, the impact. Forces concreteness.
The Roles — What People Actually Do
There are six distinct kinds of work, with different day-to-day rhythms and different paths in.
1. Alignment Research
Frontier research on the alignment problem — interpretability, scalable oversight, agent foundations, RLHF and successors, control. Typical day: experiments on a model, paper writing, internal review, occasional external collaboration. Output: papers, internal research notes, sometimes deployable techniques.
2. Research Engineering
Building the infrastructure that makes research possible — eval harnesses, training pipelines, interpretability tooling, agent scaffolding. Often the bottleneck for research progress. Day: ML engineering, with safety as the application domain. Output: codebases, deployed systems, internal tooling.
3. Evaluations / Red-Teaming
Designing and running evals on frontier models. The science-of-evals craft from Ch. 8 in practice. Day: building threat models, writing eval scripts, analysing results, working with developers on findings. Output: eval suites, evaluation reports, system-card contributions.
4. Policy / Governance
Translating technical AI realities into policy proposals, legal frameworks, and standards. Day: writing policy briefs, engaging regulators, technical advising for legislation, comparative analysis of regimes. Output: policy papers, regulatory comments, draft standards, briefings.
5. AI Security / Control Engineering
Building the operational defenses around deployed AI — model-weight security, classifier-based filtering, deployment pipelines, incident response. Day: classical security engineering applied to AI artifacts and APIs. Output: hardened systems, incident playbooks, security postures.
6. Distillation / Communication
Writing, teaching, course design, blog posts that translate technical AI safety for broader audiences — fellow researchers in adjacent fields, policymakers, general public. Day: reading widely, writing carefully, often combining with one of the other roles. Output: explanatory writing, course material, well-curated resource collections.
The honest observation about role boundaries. Most senior people in the field do two or more of these. A research engineer who writes a Distill-style piece becomes a distillation contributor. A policy person who runs an eval becomes an evaluator. The roles describe modes of work; careers describe combinations over time.
Where the Work Happens — A Field Map
What changed in 2024–2026. The number of AISIs went from one (UK, late 2023) to roughly a dozen by 2026. Government-backed evaluation became a real career path. The frontier labs’ safety teams roughly doubled. Independent funding (Open Phil’s safety-focused tracks) became more selective and more competitive. The field hires more people than it did three years ago, and demands more of candidates to clear the bar.
Fellowships and Structured Programs — The On-Ramps
The fastest way to validate that a route is right for you, without a multi-year commitment, is one of the structured programs. The major ones:
MATS — ML Alignment & Theory Scholars
~3 months, research-mentor-paired. Cohort-based. Selects scholars matched to specific senior researchers for a focused research project. One of the most respected on-ramps; alumni regularly transition to research roles at frontier labs and eval orgs.
ARENA — Alignment Research Engineer Accelerator
~6 weeks, intensive ML engineering with alignment focus. Project-heavy. Designed for engineers who want to work on alignment but need to ramp on alignment-specific tooling and concepts.
GovAI Fellowship
Multi-month policy research fellowship at the Centre for the Governance of AI (Oxford). Produces a substantive policy paper. Strong alumni record at AISIs, government, and policy think tanks.
Introductory reading-group programs
~6-8 week structured curricula offered by several AI-safety education non-profits. Reading-discussion based; cohort-driven. The most accessible introductory programs; many participants then go on to MATS, ARENA, or direct hires.
SERI MATS / SPAR / similar
A growing ecosystem of cohort programs with similar structures: applicants paired with mentors, time-bounded research project, demonstrated output. Programs vary in selectivity and focus area.
Direct lab residencies / internships
Anthropic, OpenAI, GDM all have residency programs aimed at ML researchers and engineers. Higher bar than open-application fellowships; full-time pay; routinely convert to permanent roles.
The general advice on fellowships. Apply to multiple. Treat the structured program less as a job and more as a six-week to three-month investment in figuring out whether the route works for you. If it does, you have demonstrated output to point at when applying for permanent roles. If it doesn’t, you’ve ruled out a path at far lower sunk cost than a full-time job switch would have carried.
Skill Stacks — What to Build, by Route
The honest answer to “what should I learn?” depends on which route you’re targeting. The general-purpose advice “learn ML and read alignment papers” is correct but unhelpful. More specifically:
For Alignment Research
ML fundamentals + transformer internals
You should be able to implement a transformer from scratch in PyTorch and explain why each piece exists. Karpathy's "Let's build GPT" is the benchmark exercise. If you can't build GPT-2 in a notebook, you're not ready for alignment research yet.
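If the Karpathy exercise feels distant, here is its core in miniature: a single-head causal self-attention forward pass, written in NumPy so the arithmetic stays visible (the PyTorch version is structurally identical). The shapes and random weights are purely illustrative, not from any real model.

```python
import numpy as np

def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray,
                   w_v: np.ndarray) -> np.ndarray:
    """Single-head causal self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project to queries/keys/values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # scaled dot-product similarity
    mask = np.triu(np.ones_like(scores), k=1)  # forbid attending to the future
    scores = np.where(mask == 1, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                         # weighted sum of value vectors

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
w_q, w_k, w_v = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, w_q, w_k, w_v)
assert out.shape == (T, d)
```

If you can explain why each line exists (the scaling, the mask, the softmax normalisation), you have the conceptual core; the rest of the Karpathy exercise is stacking this into blocks and training it.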
Read 30+ alignment papers carefully
Not survey-skim. Carefully — being able to summarise the contribution, methodology, and weaknesses. The MATS reading list and curated lists from major alignment educators are good starting points; this playlist's "further reading" sections are denser still.
Replicate a small alignment experiment
Pick one paper, replicate the smallest meaningful experiment from it. CAA on Llama-2 (Ch. 7), a small SAE on Pythia, a goal-misgeneralisation toy environment. Producing a working replication is more useful than reading 10 more papers without one.
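To show what "smallest meaningful experiment" can mean, the arithmetic at the heart of contrastive activation addition (CAA) fits in a few lines. The sketch below uses random arrays as stand-ins for residual-stream activations; a real replication would capture them from Llama-2 with forward hooks.

```python
import numpy as np

# Toy CAA sketch: the "activations" are synthetic stand-ins for
# residual-stream captures at one transformer layer.
rng = np.random.default_rng(0)
d_model = 16

pos_acts = rng.normal(loc=0.5, size=(20, d_model))   # e.g. behaviour-present prompts
neg_acts = rng.normal(loc=-0.5, size=(20, d_model))  # matched behaviour-absent prompts

# CAA steering vector: difference of mean activations across contrastive pairs
steer = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steered(activation: np.ndarray, alpha: float) -> np.ndarray:
    """At inference, add the scaled steering vector to the layer's activations."""
    return activation + alpha * steer

h = rng.normal(size=d_model)
assert np.allclose(steered(h, 0.0), h)  # alpha=0 leaves activations unchanged
```

The replication work is in the capture and evaluation plumbing around this, which is precisely why doing it teaches more than reading about it.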
Mathematical maturity
Linear algebra, probability, optimisation, basic functional analysis. For agent foundations and interpretability theory, more — measure theory, category theory, information theory at non-trivial depth. Not strictly required for empirical work, load-bearing for theoretical work.
For Research Engineering
Production ML engineering
Distributed training, GPU-cluster operations, training-pipeline reliability, JAX/PyTorch at scale. The skills that make you productive at an actual lab. Often the binding constraint, not the theory.
Tooling fluency
Weights & Biases, Neptune, or similar experiment tracking. Hugging Face transformers, datasets, accelerate. Common eval harnesses (Inspect AI, EleutherAI's lm-evaluation-harness). The infrastructure that ML labs run on.
Code quality discipline
Type-checked, tested, well-structured. Research engineering at frontier labs is the difference between a notebook that works once and a system that 30 researchers use daily. Care about correctness, latency, and maintainability.
Cross-functional collaboration
Research engineers sit between researchers and infrastructure. The technical fluency is necessary but not sufficient — you need to translate, prioritise, and unblock. Soft skills are part of the role.
For Evaluations / Red-Teaming
Threat modeling
The eval-design question is "what specific risk are we trying to measure?" Strong evaluators come with crisp threat models — a story about who the adversary is, what they want, and how the model could enable them.
Empirical methodology
The Hobbhahn science-of-evals discipline (Ch. 8) — elicitation discipline, statistical sufficiency, adversarial robustness, calibration. Most evals fail at methodology before they fail at concept.
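On statistical sufficiency specifically, one habit worth internalising is putting a confidence interval on every pass rate before drawing conclusions. A minimal sketch using the Wilson score interval (a standard choice, not specific to any eval framework):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial pass rate —
    a minimum-viable answer to 'did we run enough samples?'"""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half

lo, hi = wilson_interval(7, 10)      # 7/10 passes: wide interval, weak evidence
lo2, hi2 = wilson_interval(70, 100)  # same rate, 10x samples: much tighter
assert hi - lo > hi2 - lo2
```

A "70% pass rate" on ten samples is consistent with anything from roughly 40% to 90%; the eval that reports the interval, not just the point estimate, is the one doing methodology.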
Domain knowledge
Cyber evals need cyber knowledge. Bio evals need bio knowledge. Persuasion evals need behavioural-science knowledge. Generic ML credentials are not enough; the eval's quality depends on the evaluator understanding the threat domain.
Writing and reporting
Eval reports are the artifact, not the evaluation. A good evaluator writes results that policymakers, researchers, and developers can actually use. Eval work that doesn't communicate clearly never lands.
For Policy / Governance
Technical literacy at policy depth
You don't need to train models, but you need to read AI papers without panic. Be able to translate "the model passed the eval" or "the RSP was triggered" into something a regulator can act on. The technical-translator role is rare and high-leverage.
Regulatory analogues
Understand how aviation, pharma, finance, and nuclear regulation actually work. The good policy proposals borrow from these; the bad ones reinvent. Your value-add is often comparative analysis between regulatory traditions.
International coordination
Treaties, export controls, mutual recognition, AISI networks. The Ch. 5 territory. International law / international relations background helps; not strictly required, but the senior people you'll work with all have some.
Policy writing
Briefs, white papers, regulatory comments, draft text. The policy product is writing. Strong policy work compounds into influence; weak writing dissipates regardless of underlying analysis quality.
For AI Security / Control
Classical security foundations
Threat modeling, OWASP-tier knowledge of common attacks, hardware-rooted trust, supply-chain security. Ch. 4's argument: AI security is computer security, with new specifics. Bring the classical foundation.
ML-specific attack surfaces
Jailbreaks (GCG, persona, encoding), prompt injection, model extraction, adversarial examples, glitch tokens, backdoored checkpoints. Ch. 4 and Ch. 9 territory; build hands-on familiarity, not just literature awareness.
Defense engineering
Constitutional Classifiers, sandbox design, capability-permissioned tools, audit logging. Building defences that are deployable at scale, not just demonstrable in a notebook.
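As a sketch of the classifier-gating pattern (the control flow only, not any lab's actual implementation), the shape of the work looks like this; `toy_score` is a hypothetical stand-in for a trained harm classifier:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateDecision:
    allowed: bool
    score: float
    reason: str

def classifier_gate(text: str,
                    score_fn: Callable[[str], float],
                    threshold: float = 0.5) -> GateDecision:
    """Block-at-threshold gate around a harm classifier. score_fn is any
    callable returning a harm score in [0, 1]; in a real deployment it is
    a trained model, and the decision is audit-logged."""
    score = score_fn(text)
    if score >= threshold:
        return GateDecision(False, score, "classifier score above threshold")
    return GateDecision(True, score, "below threshold")

# Hypothetical stand-in scorer; a production system scores with a model.
toy_score = lambda t: 0.9 if "forbidden" in t else 0.1

assert classifier_gate("benign request", toy_score).allowed
assert not classifier_gate("forbidden request", toy_score).allowed
```

The engineering challenge is everything around this ten-line core: keeping latency acceptable at scale, calibrating the threshold, and logging decisions so incident response has a trail.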
Incident response posture
Runbooks, monitoring, post-incident review. The work doesn't stop at deployment; the operational layer is where most real safety value lives.
The 1-Pager — Forcing Concreteness
The single most useful exercise that GovAI and similar programs assign: write a one-page document describing a specific contribution you intend to make.
Why the exercise works. Most “I want to work on AI safety” plans dissolve at step 2 — you can’t specify the approach because you haven’t picked a narrow enough problem. The 1-pager forces narrowing. Forces theory of impact. Forces an actual deliverable. Most people who write their first 1-pager realise they’ve been thinking at the wrong grain; the exercise is doing its job when that happens.
The secondary use: the 1-pager is your application material. MATS, ARENA, GovAI, and direct lab applications all evaluate something close to this. If you have a strong 1-pager and demonstrated execution toward it, you’re a far more compelling candidate than someone with strong general credentials and no concrete plan.
Common Failure Modes for People New to the Field
The patterns that waste the first six months for people getting into AI safety:
Permanent reading mode
Reading is necessary; reading without producing an artifact is decorative. After 4-6 weeks of reading, you should be replicating a paper or drafting a 1-pager. People who've been "reading up on alignment" for a year without output are not actually closer to contributing than they were at three months.
Fix: produce something — a notebook, a write-up, a 1-pager — by month 2
Picking the wrong grain
"I want to solve alignment" and "I want to improve the calibration of cyber-uplift evals on Anthropic's pre-deployment suite" are the same career question at different grain. The first is unactionable; the second can be done in a fellowship. Narrow until you have something tractable.
Fix: keep narrowing until your project fits in a 1-pager
Optimising for prestige over fit
"I want to work at Anthropic" is a destination, not a plan. The route depends on your background and interests; sometimes the more direct route is METR or an AISI rather than a frontier lab. Pick the right organisation for the work, not the brand.
Fix: pick the route that fits your skills, not the one that sounds best at parties
Skipping the ML fundamentals
For technical roles, the failure mode is wanting to do interpretability without being able to implement a transformer, wanting to do RLHF without understanding PPO, wanting to do agent foundations without the math. The fundamentals are not optional; the alignment-specific layer goes on top of them.
Fix: build the ML stack first; alignment-specific work is layer 2
Direction-jumping every two months
Interpretability one month, governance the next, control the third. Each direction has a learning curve; jumping resets it. Pick a direction, give it 3-6 months of real engagement before deciding it's not for you.
Fix: commit to one direction for at least one fellowship cycle
Working in isolation
Alignment is a small field. Most progress happens through conversations, mentor relationships, code review, and Alignment Forum discussion. Working alone in a notebook for 6 months without external feedback usually produces worse output than 4 months with structured feedback.
Fix: get into a fellowship cohort or an explicit mentor relationship
A Concrete Three-Month Plan for Someone Just Starting
For a hypothetical reader at the end of this playlist who wants to convert reading into contribution. Three months, calibrated to be ambitious-but-achievable for someone with a CS undergrad and curiosity.
Month 1 — Fundamentals + orientation
Karpathy's "Let's build GPT" end-to-end. ARENA-style transformer-from-scratch implementation. Read 10 papers from the further-reading lists across Chapters 1, 3, 7, 8 — choose those that match your interest. Write 1-paragraph summaries of each. Apply to MATS / ARENA / GovAI / an introductory reading-group program for the next cohort.
Month 2 — Pick a direction; replicate
From Chapter 7's portfolio, pick one focal direction that genuinely interests you. Replicate the smallest meaningful experiment from a key paper in that direction (CAA on Llama-2; a small SAE on Pythia; a goal-misgeneralisation environment from the DeepMind paper). Write up the replication as a public notebook. Start drafting your 1-pager.
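To make the SAE option concrete, here is a toy sparse autoencoder trained on synthetic data in NumPy. Everything here is a stand-in: a real replication would capture residual-stream activations from Pythia and tune the sparsity coefficient; this sketch exists only to show the training loop's shape.

```python
import numpy as np

# Toy sparse autoencoder (SAE): random "activations" stand in for
# residual-stream captures from a real model such as Pythia.
rng = np.random.default_rng(0)
d_model, d_hidden, n = 8, 32, 256
acts = rng.normal(size=(n, d_model))

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_enc = np.zeros(d_hidden)
l1, lr = 1e-3, 1e-2          # sparsity coefficient and learning rate

def forward(x: np.ndarray):
    h = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations
    return h, h @ W_dec                      # features, reconstruction

losses = []
for _ in range(200):
    h, recon = forward(acts)
    err = recon - acts
    losses.append((err ** 2).mean() + l1 * np.abs(h).mean())
    # Manual gradients of mean squared error + L1 sparsity penalty
    gW_dec = h.T @ err * (2 / err.size)
    gh = (err @ W_dec.T * (2 / err.size) + l1 * np.sign(h) / h.size) * (h > 0)
    W_enc -= lr * (acts.T @ gh)
    b_enc -= lr * gh.sum(axis=0)
    W_dec -= lr * gW_dec

assert losses[-1] < losses[0]   # training reduces the combined loss
```

Getting from this toy to a working replication on real activations is exactly the month-2 exercise: the loop is simple; the data plumbing and the interpretation of the learned features are where the learning happens.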
Month 3 — Original contribution + visibility
Take your replication one step further — a small variation, a new evaluation, a comparison to a different model. Get feedback from someone in the field (Alignment Forum post; mentor conversation; submission to a workshop). Refine the 1-pager based on what you learned. Apply to permanent roles or further fellowships armed with concrete output.
Adjust as you learn
This plan is calibrated for the technical-research route. If you're targeting policy or governance, replace "replicate the paper" with "write a policy brief on [specific RSP/regulatory question]." Same shape — read, narrow, produce, refine, apply — different artifacts.
The goal of the three-month plan. Not to “solve” anything. To convert the abstract intent of “I want to contribute to AI safety” into a concrete artifact that demonstrates you can. The artifact is what unlocks the rest of the field — fellowships, mentorships, hires, the next-step opportunities that compound from there.
What This Implies for Practice
Pick the route, not the destination
"AI safety" is six different jobs. Pick which one matches your background, interests, and the way you actually like to work day-to-day. The destination follows from the route.
Use fellowships to validate fit cheaply
Three months of MATS or ARENA or GovAI tells you whether the route works for you with much less downside than a multi-year commitment. If the fellowship goes well, you have output to leverage; if it doesn't, you've narrowed your search at low cost.
Convert reading to artifact early
Permanent reading mode is the most common first-year failure. Before month 3, you should have produced something — a replication, a write-up, a 1-pager. Reading without output isn't progress.
Narrow until your project fits a 1-pager
If you can't fit your project into the 1-pager structure, it's still too broad. Narrow ruthlessly. The 1-pager is the diagnostic for whether you're at the right grain.
Get feedback structurally, not occasionally
Mentor relationships, fellowship cohorts, Alignment Forum posts that solicit comments. The structural feedback loops are what produce good work. Working in isolation almost always produces worse output, slower.
Build the fundamentals you'll always need
For technical routes: ML, transformer internals, eval harnesses, training infrastructure. For policy routes: regulatory analogues, technical literacy, policy writing. For governance/security: classical security plus ML-specific. The fundamentals compound; the alignment-specific layer goes on top.
At a Glance
Contributing to AI safety means picking one of six distinct kinds of work — research, engineering, evaluations, policy, security, distillation — and a route through it that matches your background. Where the work happens: frontier labs, AI Safety Institutes, evaluation orgs, governance organisations, independent research, academia. The on-ramps are structured fellowships (MATS, ARENA, GovAI, intro reading-group programs, lab residencies); the diagnostic is the 1-pager exercise.
The previous nine chapters laid out what AI safety is. None of it matters if no one builds it. The field hires more people than it did three years ago and demands more to clear the bar; the route in is to validate fit through fellowships, build artifacts that demonstrate you can do the work, and apply with concrete output rather than general credentials.
Pick a route that matches how you like to work, not the destination that sounds best. Use a fellowship to validate the fit. Convert reading to artifact by month 3. Narrow your project until it fits a 1-pager. Get structured feedback. Build the fundamentals first; the alignment-specific layer goes on top.
Key Takeaways
- AI safety is six distinct kinds of work, not one. Research, engineering, evaluations, policy, security, distillation. Pick the route that matches your background and how you like to work — the destination follows.
- The organisational landscape is broader than “frontier labs.” AISIs, evaluation orgs (METR, Apollo, Redwood, ARC), governance shops (GovAI, RAND, CSET, IAPS), independent researchers, and academia all hire across roles. Optimise for fit with the work, not for prestige of the brand.
- Fellowships are calibrated risk. MATS, ARENA, GovAI, intro reading-group programs, lab residencies — all let you test a route in three to six months with much less commitment than a full job switch. Apply to multiple; treat the program as a fit-validation device.
- Skill stacks differ by route, but the fundamentals always come first. ML and transformer internals for technical work. Regulatory analogues and policy writing for governance. Classical security plus ML-specific for control. Build layer 1 first; alignment-specific work is layer 2.
- The 1-pager is the diagnostic. Problem, approach, artifact, theory of impact, timeline, resources. If your project doesn’t fit the structure, it’s not narrow enough yet. Most “I want to work on AI safety” plans dissolve at the approach step; the 1-pager forces resolution.
- Permanent reading mode is the most common first-year failure. After 4–6 weeks of reading, you should be producing artifacts — replicated experiments, write-ups, 1-pagers. Reading without output isn’t progress, no matter how much you read.
- Direction-jumping resets the learning curve. Each direction has 3–6 months of ramp before contribution is plausible. Commit to one for at least a fellowship cycle before deciding it’s wrong.
- Feedback infrastructure produces better work, faster. Mentor relationships, fellowship cohorts, Alignment Forum posts soliciting comments. The structural feedback loops are non-negotiable; isolation almost always produces worse output than collaborative work in the same time.
- Convert intent into artifact, then artifacts into roles. The transition from “I want to contribute” to “I am contributing” is mediated by demonstrable output. Replication notebooks, written 1-pagers, completed fellowship projects — these are what unlock the next-step opportunities that compound from there.
Further Reading
- 80,000 Hours, AI safety technical research career guide — the canonical generalist guide for technical AI safety careers.
- Alignment Forum, “How to pursue a career in technical AI alignment” — community-curated advice and examples.
- ML Alignment & Theory Scholars (MATS) program — application materials and alumni outputs are public; reading them gives you the concrete output bar.
- ARENA program materials — public, including detailed week-by-week curriculum.
- Centre for the Governance of AI (GovAI) Fellowship — application materials and prior fellow outputs.
- Andrej Karpathy, “Let’s build GPT: from scratch, in code, spelled out” — the canonical exercise for transformer fluency.
- Anthropic, OpenAI, GDM Careers pages — the active hiring landscape; reading current postings is the best signal of what skills the field is currently demanding.
- Open Philanthropy, AI Safety grants — the funding landscape for independent research.