ARTICLE · 18 MIN READ · FEBRUARY 06, 2026
Chapter 9: Learning and Adaptation
Every pattern so far assumes the agent stays the same. Learning and adaptation break that assumption — agents that improve through experience, rewrite their own code, and discover better algorithms than humans designed.
The Problem with Static Agents
Training vs Inference: "Training" is when an AI model learns from data — this is the expensive, time-consuming phase where the model's parameters (weights) are updated. "Inference" is when a trained model is used to answer questions — this is fast and cheap. Most deployed agents only do inference. Learning and adaptation are about bringing some form of training into the deployed system.
Parameters/Weights: The numbers inside a neural network that determine how it responds to inputs. GPT-4 has hundreds of billions of parameters. Training changes these numbers. Inference uses them without changing them.
Policy: In reinforcement learning, the "policy" is the agent's strategy — the rule that maps situations (states) to actions. "When I see X, I do Y" is a policy. Learning in RL means finding a better policy.
Reward function: A mathematical signal that tells an RL agent whether its action was good (+reward) or bad (-penalty). The agent learns to maximize cumulative reward over time. Designing the reward function well is one of the hardest parts of RL.
Benchmark: A standardized test suite used to measure an AI system's performance. Examples: HumanEval (coding), MMLU (knowledge), SWE-bench (software engineering). Benchmarks provide comparable, reproducible measurements of capability.
Fine-tuning: Taking a pre-trained model (like GPT-4) and continuing to train it on a smaller, specific dataset to specialize its behavior. Like taking a medical school graduate and doing a residency in cardiology — the base knowledge stays, domain-specific patterns are reinforced.
Every pattern in this series so far — chaining, routing, parallelization, reflection, tool use, planning, multi-agent, memory — shares a fundamental assumption: the agent’s core behavior stays constant. You design the system, deploy it, and it runs. If it makes errors, you as the developer must find them, diagnose them, and manually update the prompts or code.
This is called a static agent — one that executes without improving.
Static agents fail in three important ways:
They can’t handle distribution shift. When the environment changes — user behavior shifts, new topics emerge, the API they rely on changes its output format — static agents degrade silently. They keep running the old strategy in a world that no longer matches the world they were designed for.
They can’t personalize. Every user gets the same agent behavior. A trading bot that works well for a risk-tolerant investor behaves identically for a risk-averse retiree. A customer service agent that works well for technical questions handles casual questions with the same approach. Without learning, there is no adaptation to individual users.
They can’t improve from their own mistakes. A static agent that makes the same error 1,000 times will make it 1,001 times. There is no mechanism to recognize the pattern of failure and correct it.
Learning and adaptation are the patterns that solve all three of these problems. An agent that learns from experience — whether that means updating its parameters, refining its prompts, modifying its code, or building up a richer knowledge base — can improve autonomously without constant manual intervention.
This chapter covers the complete spectrum: from classical machine learning approaches (reinforcement learning, supervised learning) to the frontier of agents that rewrite their own code and discover algorithms better than anything humans have designed.
The Six Ways Agents Learn
Not all learning is the same. Depending on what data is available, what the agent is trying to optimize, and how it interacts with its environment, different learning paradigms apply. Here are the six most relevant for agentic AI systems:
PPO: The Algorithm Behind Agent Training
Proximal Policy Optimization (PPO) is the reinforcement learning algorithm that trained ChatGPT’s conversational behavior, powers OpenAI Five (the Dota-playing AI), and underlies most modern RL-trained AI systems. Understanding it is essential for understanding how agents learn from feedback.
What Problem PPO Solves
Before PPO, the standard RL approach was simple: try an action, observe the reward, update the policy to make that action more likely if the reward was positive. Repeat millions of times.
The problem: large, unstable updates. If you update the policy too aggressively based on one batch of experience, you can “fall off a cliff” — the policy changes so dramatically that the agent now performs worse than before, and it’s very hard to recover. This is often called policy collapse.
Imagine teaching a dog a trick. If you reward it so enthusiastically that it changes its entire behavior pattern at once — not just “sit” but its whole approach to interacting with you — you might accidentally teach it something wrong. Small, consistent corrections work better than dramatic overhauls.
PPO solves this with one elegant idea: clip the policy update so it can’t change too much in a single step.
The PPO Algorithm, Step by Step
The clipping mechanism. The ratio π_new(a|s) / π_old(a|s) tells you how much the probability of taking action a in state s has changed. If the new policy makes action a twice as likely as the old policy, the ratio is 2.0. With ε = 0.2, the clip restricts this ratio to [0.8, 1.2] — the policy is only allowed to change an action’s probability by ±20% per update. Any larger change is simply cut off.
This is like a speed limiter on a car. No matter how hard you press the accelerator, the car won’t exceed 120 km/h. PPO’s clip is a policy change limiter — no matter how good an action seemed, the policy update is constrained.
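As a concrete illustration, here is a minimal sketch of PPO's clipped objective in PyTorch. The tensor names (new_log_probs, advantages, and so on) are illustrative; a real trainer would also collect rollouts, estimate advantages, and add value and entropy terms.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective from PPO, written as a loss to minimize."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space for stability
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped objective: shift probability in proportion to the advantage
    unclipped = ratio * advantages

    # Clipped objective: the ratio is not allowed to leave [1 - eps, 1 + eps]
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Take the more pessimistic of the two, negate so gradient descent improves the policy
    return -torch.min(unclipped, clipped).mean()

# Example: three actions whose probabilities grew 10%, grew 80%, and shrank 30%
new_lp = torch.log(torch.tensor([0.22, 0.36, 0.14]))
old_lp = torch.log(torch.tensor([0.20, 0.20, 0.20]))
adv = torch.tensor([1.0, 1.0, -1.0])
print(ppo_clip_loss(new_lp, old_lp, adv))
```

The 80% jump is cut off at +20% by the clip, which is exactly the “speed limiter” behavior described above.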
Where PPO is used in practice. PPO was used in the RLHF (Reinforcement Learning from Human Feedback) phase of training ChatGPT and Claude. After initial pre-training on text data, human raters compared pairs of responses and indicated which was better. This preference data trained a reward model. PPO was then used to fine-tune the LLM to generate responses that scored highly on the reward model — effectively teaching it to be more helpful, harmless, and honest.
DPO: The Simpler Path to Human Alignment
Direct Preference Optimization (DPO) is a more recent algorithm that achieves the same goal as PPO-based RLHF — aligning LLMs with human preferences — with dramatically less complexity.
Why PPO Is Hard for LLM Alignment
The traditional RLHF pipeline using PPO requires two separate training phases:
1. Train a reward model: Collect thousands of (prompt, response_A, response_B, human_preference) tuples. Train a separate neural network (the reward model) to predict which response a human would prefer.
2. Fine-tune the LLM with PPO: Run PPO using the reward model as the “critic” that scores the LLM’s outputs. The LLM learns to generate responses that get high scores.
This two-phase approach has serious practical problems. The reward model is imperfect — it’s an approximation of human judgment. The LLM, being optimized by PPO, can discover “reward hacking” — generating responses that score highly on the reward model but are actually bad (e.g., by exploiting weaknesses in how the reward model was trained). The separate reward model also requires additional GPU memory, additional training time, and careful hyperparameter tuning.
How DPO Works
DPO (Rafailov et al., 2023) starts from the insight that you don’t need a reward model at all. You can derive the optimal policy directly from the preference data, without the intermediary.
The mathematical intuition. DPO uses a mathematical property: the optimal policy under a reward model can be expressed directly in terms of the likelihood ratio between the policy and a reference policy. This means you can skip the reward model entirely and directly train the LLM to:
- Increase the log-probability of generating responses labeled as “preferred”
- Decrease the log-probability of generating responses labeled as “rejected”
The loss function looks like: L_DPO = -E[log σ(β · (log π(y_w|x) - log π_ref(y_w|x)) - β · (log π(y_l|x) - log π_ref(y_l|x)))]
Don’t worry about the formula — the intuition is: for each (preferred, rejected) pair, adjust the model to make the preferred response more likely and the rejected response less likely, relative to where you started (the reference policy π_ref). That’s it.
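To make that intuition concrete, here is a minimal sketch of the loss in PyTorch. It assumes you have already computed the total log-probability of each preferred (y_w) and rejected (y_l) response under both the policy being trained and the frozen reference model; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of (preferred, rejected) response pairs."""
    # How much more likely the policy makes each response than the frozen reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Push the margin between preferred and rejected responses upward
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with a batch of two preference pairs (log-probabilities are illustrative)
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -9.0]),
    ref_chosen_logps=torch.tensor([-13.0, -10.0]),
    ref_rejected_logps=torch.tensor([-13.5, -9.2]),
)
print(loss)
```

Note that this is an ordinary supervised training loop — no environment, no rollouts, no separate reward network.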
Why DPO is gaining adoption. DPO achieves comparable alignment quality to PPO-based RLHF but requires only one training phase instead of two, no separate reward model, standard supervised learning training loops (no RL), and significantly less GPU memory and compute. Several recent frontier models, including variants of Llama and Mistral, use DPO for their alignment phase.
| Aspect | PPO-RLHF | DPO |
|---|---|---|
| Phases | 2 (reward model + RL fine-tune) | 1 (direct fine-tune) |
| Reward model needed? | Yes — separate training | No |
| Training stability | Can be unstable | Generally stable |
| Risk of reward hacking | Yes | No |
| Implementation complexity | High | Low |
| Performance | Slightly better on some benchmarks | Competitive on most tasks |
When Agents Learn Without Retraining: Few-Shot and In-Context Adaptation
One of the most practically important forms of learning for deployed LLM agents doesn’t involve changing any model weights at all. In-context learning is when the LLM adapts its behavior based on examples or instructions provided in the current prompt.
This is both the most accessible form of adaptation (no training required) and the most temporary (no persistence between sessions unless you explicitly manage it).
Few-shot learning example. Suppose you want your agent to classify customer support tickets into categories. Rather than fine-tuning a model, you include examples in the prompt:
Classify the following support ticket:
Example 1: "My order hasn't arrived after 2 weeks" → SHIPPING
Example 2: "The item broke after one day of use" → QUALITY
Example 3: "How do I return something?" → RETURN
Now classify: "My package says delivered but I haven't received it"
The model learns the classification scheme from those three examples and applies it to the new ticket — without any gradient updates, without any training, just from the examples in the prompt.
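In code, this form of adaptation is nothing more than prompt construction. The sketch below assembles the same few-shot prompt and sends it to a chat model; the OpenAI client and the model name are illustrative assumptions, and any LLM API would work the same way.

```python
from openai import OpenAI  # any chat-completion client works; OpenAI is only an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = [
    ("My order hasn't arrived after 2 weeks", "SHIPPING"),
    ("The item broke after one day of use", "QUALITY"),
    ("How do I return something?", "RETURN"),
]

def classify_ticket(ticket: str) -> str:
    # Build the few-shot prompt: task description, examples, then the new ticket
    lines = ["Classify the following support ticket into one category."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f'Example: "{text}" -> {label}')
    lines.append(f'Now classify: "{ticket}"')
    lines.append("Answer with the category name only.")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name; substitute your own
        messages=[{"role": "user", "content": "\n".join(lines)}],
    )
    return response.choices[0].message.content.strip()

print(classify_ticket("My package says delivered but I haven't received it"))
```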
Zero-shot learning. Even more powerfully, you can describe a task and the model can often do it with no examples at all: “Classify this support ticket into one of these categories: SHIPPING, QUALITY, RETURN, BILLING, OTHER. Ticket: …” The model generalizes from the category names and descriptions alone.
Why this works. During pre-training on trillions of tokens of text, the LLM has seen countless examples of few-shot task demonstrations, classification systems, and task descriptions. It has learned the format of task specification itself. When you provide a few examples, you’re activating that learned pattern-matching capability — you’re not teaching it a new skill, you’re reminding it of one it already has.
The limitation. Few-shot and zero-shot adaptation exists only within the context window. When the session ends, the adaptation is gone. For persistent adaptation — the agent consistently handling a new task type better over time — you need either fine-tuning (updating the weights) or memory-based approaches (storing successful interaction patterns for retrieval in future sessions).
Case Study: SICA — The Self-Improving Coding Agent
The Self-Improving Coding Agent (SICA), developed by Maxime Robeyns, Laurence Aitchison, and Martin Szummer, represents one of the most striking demonstrations of agent-based learning: an agent that improves by modifying its own source code.
This is fundamentally different from everything we’ve discussed so far. In reinforcement learning, the model’s weights change. In fine-tuning, the model’s weights change. In SICA, the model’s code changes — the actual Python files that implement the agent’s tools and behavior are rewritten by the agent itself.
The Architecture
SICA operates as an iterative self-improvement loop: in each iteration it attempts a set of benchmark coding tasks, computes a utility score from the results (a weighted combination of success rate, time, and compute cost), and then edits its own tool and agent code in an attempt to raise that score in the next iteration. Two supporting components make this safe and stable: an asynchronous overseer that monitors the agent, and a Docker container that isolates it from the host machine. Both are described below.
What SICA Actually Improved
What makes SICA remarkable isn’t just that it can modify code — it’s what it chose to modify, without being told. Starting from a simple file-overwriting approach (replace the entire file to make any change), SICA independently invented increasingly sophisticated tools:
Generation 1 → 2: The Smart Editor. SICA recognized that overwriting entire files was wasteful — most of the file stays the same when you make a small change. It invented a more intelligent editing tool that could make targeted modifications to specific sections of code without touching the rest. This was entirely self-generated.
Generation 2 → 3: The Diff-Enhanced Smart Editor. SICA discovered that describing changes as “diffs” (the Unix diff format — showing only what changed) was more efficient than showing the entire modified file. It added diff-based editing to its toolkit, reducing context length and processing time.
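Python's standard library is enough to see why diffs are cheaper than whole files. The sketch below (an illustration of the idea, not SICA's actual tool) produces a unified diff that contains only the changed lines plus a little context:

```python
import difflib

old_code = """def greet(name):
    print("Hello " + name)
""".splitlines(keepends=True)

new_code = """def greet(name: str) -> None:
    print(f"Hello {name}")
""".splitlines(keepends=True)

# A unified diff contains only the changed lines plus a little context,
# so it is much shorter to store and to feed back into a model than the whole file.
diff = difflib.unified_diff(old_code, new_code, fromfile="agent.py", tofile="agent.py")
print("".join(diff))
```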
Navigation innovation: The AST Symbol Locator. To edit code intelligently, you need to find the right location quickly. SICA independently invented a tool that parsed the code’s Abstract Syntax Tree (AST) — a structured representation of code that shows its logical structure rather than raw text — to navigate directly to function or class definitions.
What is an Abstract Syntax Tree (AST)? When Python (or any programming language) parses code, it builds a tree structure representing the program’s logical organization. For a function definition `def calculate(x, y): return x + y`, the AST would show: “FunctionDef named ‘calculate’ with parameters [‘x’, ‘y’] and a body containing Return(Add(Name(‘x’), Name(‘y’)))”. The AST is language-structure-aware, unlike simple text search, which is just pattern matching on characters. Using an AST to navigate code is like using a map with labeled roads rather than aerial photography — you find what you’re looking for based on its logical type and name, not its visual appearance.
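A toy symbol locator fits in a dozen lines using Python's built-in ast module; this illustrates the idea rather than reproducing SICA's implementation:

```python
import ast

def find_symbol(source: str, name: str):
    """Return the (start_line, end_line) of a function or class definition, if present."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == name:
                return node.lineno, node.end_lineno
    return None

source = "def calculate(x, y):\n    return x + y\n"
print(find_symbol(source, "calculate"))  # -> (1, 2)
```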
The Asynchronous Overseer
A critical architectural component in SICA is the asynchronous overseer — a separate LLM that runs concurrently with the main agent and monitors its behavior.
The overseer receives a detailed view of SICA’s actions: a callgraph (which functions called which), a full log of LLM prompts and responses, and tool call records. It periodically evaluates whether SICA is:
- Looping — repeating the same sequence of actions without progress
- Stagnating — making attempts but not improving
- Deviating — taking actions outside expected patterns that might indicate confusion or misalignment
If it detects problems, the overseer can send a correction message to SICA or halt execution entirely.
This is the Reflection and Multi-Agent Collaboration patterns combined and applied to the meta-level: the agent’s learning process itself is supervised by another agent.
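As a simplified illustration of one such check, the sketch below flags an agent that keeps repeating the same short sequence of tool calls. SICA's actual overseer is an LLM reviewing the callgraph and full logs, so treat this only as a stand-in for the “looping” detector:

```python
from collections import Counter

def is_looping(tool_calls: list[str], window: int = 3, repeats: int = 3) -> bool:
    """Flag the agent when the same short sequence of tool calls keeps recurring."""
    if len(tool_calls) < window * repeats:
        return False
    recent = tool_calls[-window * repeats:]
    # Split the recent history into consecutive windows and count identical ones
    chunks = [tuple(recent[i:i + window]) for i in range(0, len(recent), window)]
    most_common = Counter(chunks).most_common(1)[0][1]
    return most_common >= repeats

calls = ["read_file", "edit_file", "run_tests"] * 3
print(is_looping(calls))  # -> True: the same 3-step sequence occurred 3 times in a row
```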
Safety: Docker Containerization
An agent that can execute shell commands and edit its own code is potentially dangerous. If it makes a mistake, it could delete files, run infinite loops, or consume all system resources.
SICA addresses this with Docker containerization: the agent runs inside a Docker container — an isolated environment that is completely separated from the host machine’s file system, network, and processes. The agent can only affect files and processes within the container. Even if it runs rm -rf / inside the container, the host machine is untouched.
This is the same principle behind browser sandboxing (websites can’t access your file system) and cloud function isolation (each invocation gets its own isolated environment). Containerization is the industry-standard safety mechanism for running untrusted or potentially dangerous code.
SICA Performance Results
SICA’s utility score (a weighted combination of success rate, time, and compute cost) increased significantly across iterations, with key jumps corresponding to the invention of specific tools:
- Introducing the Smart Editor produced a large improvement in code editing efficiency
- The AST Symbol Locator enabled faster and more accurate navigation
- The Hybrid Symbol Locator (combining fast text search with AST verification) pushed performance higher still
- Context-Sensitive Diff Minimization reduced context window usage, allowing more iterations before hitting limits
The full performance trajectory shows the characteristic pattern of AI improvement: step-function jumps at innovation points, with gradual improvement between them as the new capability is refined.
AlphaEvolve: Discovering Algorithms Better Than Humans
AlphaEvolve, developed by Google DeepMind in 2025, takes a different approach to learning: rather than improving an agent’s behavior, it uses AI to discover new algorithms — mathematical procedures for solving problems.
The Architecture
AlphaEvolve combines three components:
1. LLM Ensemble (Gemini Flash + Pro). Gemini Flash generates a large, diverse set of candidate algorithm modifications — exploring widely across the solution space. Gemini Pro then analyzes the most promising candidates in depth, providing more careful refinement. The combination of breadth (Flash) and depth (Pro) mirrors how human research teams work: brainstormers and analysts.
2. Automated Evaluation. Every proposed algorithm modification is automatically tested against predefined criteria — speed, correctness, resource usage. This provides immediate, objective feedback without requiring human review of each proposal. The evaluation is deterministic: given the same algorithm, the score is always the same.
3. Evolutionary Selection. The best-performing algorithm variants are kept. Worse variants are discarded. The surviving variants are used as the “parents” for the next generation of modifications. This is evolutionary computing — the same mechanism that evolution uses to improve biological organisms over generations, applied to algorithms.
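Putting the three components together, the core loop looks roughly like the sketch below. The propose_variant and evaluate functions are stand-ins (in AlphaEvolve they are the Gemini ensemble and the automated evaluator); the loop structure is the part that matters.

```python
import random

def propose_variant(parent: str) -> str:
    """Stand-in for the LLM ensemble: in practice an LLM rewrites part of the program."""
    return parent + f"  # variant {random.randint(0, 999)}"

def evaluate(program: str) -> float:
    """Stand-in for the automated evaluator: in practice it compiles, runs, and scores."""
    return random.random()

def evolve(seed_program: str, generations: int = 10,
           population: int = 20, survivors: int = 5) -> str:
    """Core evolutionary loop: generate -> evaluate -> select -> repeat."""
    parents = [seed_program]
    for _ in range(generations):
        # Breadth: propose many candidate modifications of randomly chosen parents
        candidates = [propose_variant(random.choice(parents)) for _ in range(population)]
        # Objective scoring: every candidate gets a deterministic fitness value
        ranked = sorted(candidates, key=evaluate, reverse=True)
        # Selection: only the best-scoring variants parent the next generation
        parents = ranked[:survivors]
    return parents[0]

best = evolve("def schedule(jobs): ...")
```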
What AlphaEvolve Discovered
The results from AlphaEvolve are extraordinary:
Data center scheduling: AlphaEvolve discovered a scheduling algorithm that reduced Google’s global compute resource usage by 0.7%. On Google’s scale, this represents enormous savings — equivalent to taking hundreds of thousands of servers offline permanently.
Matrix multiplication: AlphaEvolve found a method for multiplying 4×4 complex-valued matrices using 48 scalar multiplications — fewer than the previously best-known algorithm (which required 49). Matrix multiplication is fundamental to virtually all neural network computation. Even a 2% improvement at this level has enormous downstream impact.
TPU design: AlphaEvolve suggested optimizations to Verilog code (the hardware description language) for Google’s next generation of Tensor Processing Units — AI chips. This is an AI system improving the hardware that will run future AI systems.
Gemini architecture: AlphaEvolve achieved a 23% speed improvement in a core computational kernel within the Gemini model itself — a recursive self-improvement at the infrastructure level.
Why this matters beyond the specific results. AlphaEvolve demonstrates that AI systems can now participate in the discovery of scientific knowledge — not just applying what humans have already figured out, but finding genuinely new solutions that human mathematicians and engineers had not found. This is a qualitative shift in what AI can do.
OpenEvolve: The Open Source Version
OpenEvolve is an open-source evolutionary coding agent that applies the same principles as AlphaEvolve to general code optimization tasks. It is designed to evolve entire code files (not just individual functions) across multiple programming languages.
Architecture
```python
import asyncio
from openevolve import OpenEvolve

async def main():
    # Initialize the evolutionary system
    evolve = OpenEvolve(
        initial_program_path="path/to/initial_program.py",
        evaluation_file="path/to/evaluator.py",
        config_path="path/to/config.yaml",
    )

    # Run 1,000 generations of evolution
    best_program = await evolve.run(iterations=1000)

    # Inspect what the system found
    print("Best program metrics:")
    for name, value in best_program.metrics.items():
        print(f"  {name}: {value:.4f}")

asyncio.run(main())
```
initial_program_path: The starting point — a working but possibly unoptimized Python program. This is like the “seed” organism in biological evolution.
evaluation_file: A Python script that takes a program and returns scores. This is the fitness function — it defines what “better” means for your specific optimization goal. For a sorting algorithm, the evaluator might measure speed and correctness. For a compression algorithm, it might measure compression ratio and decompression speed.
config_path: Settings for the evolution — which LLM to use, how many candidates to evaluate per generation, the selection pressure, mutation strategies.
evolve.run(iterations=1000): Runs 1,000 generations of the evolutionary loop: generate variants → evaluate all → select survivors → repeat. Returns the highest-scoring program found across all generations.
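As an example of what an evaluation_file might contain, here is a sketch of an evaluator that scores a candidate sorting program on correctness and speed. The evaluate(program_path) signature, the metric names, and the sort_numbers function under test are assumptions of this sketch rather than OpenEvolve's exact contract.

```python
import importlib.util
import random
import time

def evaluate(program_path: str) -> dict:
    """Score a candidate program on correctness and speed (higher is better)."""
    # Load the candidate program from the path supplied by the framework
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    candidate = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(candidate)

    data = [random.randint(0, 10_000) for _ in range(5_000)]

    start = time.perf_counter()
    result = candidate.sort_numbers(list(data))  # hypothetical evolved function under test
    elapsed = time.perf_counter() - start

    return {
        "correctness": 1.0 if result == sorted(data) else 0.0,
        # Reward faster runs; the cap keeps pathological timings from dominating
        "speed": min(1.0, 0.01 / max(elapsed, 1e-6)),
    }
```

Whatever you measure here is what the evolution optimizes, so the evaluator effectively defines the goal of the whole run.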
The Four Core Components (OpenEvolve’s Architecture)
OpenEvolve’s controller orchestrates four components:
Program Database: Stores all programs ever generated, along with their evaluation metrics. Programs that score well are more likely to be selected as “parents” for the next generation. The database also tracks program lineage — which programs were derived from which.
LLM Ensemble: One or more LLMs that generate code modifications. Given a parent program and the evaluation scores, the LLM proposes changes: “The loop on line 42 could be vectorized,” “Replace the dictionary with a more efficient data structure,” “This recursive function could be converted to iterative with a stack.”
Evaluator Pool: Multiple parallel evaluation workers that run candidate programs concurrently. Since evaluation can be computationally expensive, running evaluations in parallel (the Parallelization pattern from Chapter 3) significantly speeds up each generation.
Prompt Sampler: Constructs the prompts sent to the LLM ensemble. It selects representative programs from the database, formats them with their scores, and asks the LLM to generate improvements. The quality of these prompts directly affects the quality of the generated modifications.
Where OpenEvolve Succeeds
OpenEvolve has demonstrated impressive results on competitive programming problems, algorithm optimization benchmarks, and numerical computing tasks. It is particularly effective when:
- There is a clear, automatable evaluation metric — you must be able to measure “better” without human judgment per candidate.
- The solution space is large and complex — problems where human intuition has natural limits but LLMs can propose creative modifications.
- Starting from a working solution — evolution refines; it doesn’t create from nothing. You need a baseline that already works.
Practical Applications
Personalized Assistants
Agents that learn individual user preferences over time — communication style, expertise level, preferred formats, recurring topics — and continuously refine their interaction to match each person.
Learning approach: Memory-Based + Online Learning
Trading Bots
Adaptive decision-making algorithms that adjust model parameters based on high-resolution real-time market data, changing strategy as market conditions evolve.
Learning approach: Reinforcement Learning (PPO)
Fraud Detection
Agents that continuously refine anomaly detection by learning from newly identified fraud patterns — the system gets better at catching fraud as new attack patterns emerge.
Learning approach: Online Learning + Supervised
Game AI
Dynamic opponents that adapt to each player's strategy in real time, providing an appropriate challenge level rather than a fixed difficulty preset.
Learning approach: Reinforcement Learning (PPO/DQN)
Robotic Navigation
Autonomous systems that improve navigation and obstacle avoidance through accumulated sensor data and historical action analysis across different environmental conditions.
Learning approach: RL + Memory-Based Learning
Algorithm Discovery
Systems like AlphaEvolve that discover novel algorithms for matrix operations, scheduling, and hardware design — surpassing solutions that human experts developed over decades.
Learning approach: Evolutionary + LLM Ensemble
The Spectrum of Learning: From Prompts to Self-Modification
It’s worth stepping back to see the full spectrum of how agents can learn and adapt, ordered from simplest to most powerful:
| Level | Mechanism | Persistence | Compute cost | Example |
|---|---|---|---|---|
| 0 | Prompt tuning | None (session only) | Zero | Few-shot examples in prompt |
| 1 | Memory retrieval | Sessions | Low | RAG with user preference store |
| 2 | Fine-tuning (supervised) | Permanent | Medium | Domain adaptation on labeled data |
| 3 | RLHF with PPO | Permanent | High | ChatGPT alignment training |
| 4 | DPO | Permanent | Medium | Llama alignment training |
| 5 | Online learning | Continuous updates | Ongoing | Trading bot parameter updates |
| 6 | Code self-modification | Persistent agent improvement | High | SICA meta-improvement |
| 7 | Evolutionary algorithm discovery | Permanent (new algorithms) | Very high | AlphaEvolve |
Each level up represents more autonomous adaptation, more compute investment, and more powerful outcomes. Most production agents operate at levels 0-2 because those are practical and cost-effective. Levels 3-7 represent the frontier of research and are increasingly entering production for organizations with the resources to invest in them.
At a Glance
Static agents execute without improving. Learning and adaptation are the mechanisms that allow agents to autonomously get better — changing their weights, prompts, code, or knowledge base based on experience and feedback.
Real environments change. Users are individuals. Novel situations arise that weren't anticipated at design time. An agent that cannot learn cannot maintain performance in a dynamic world — it degrades until a human manually intervenes to update it.
Use learning mechanisms when: the environment changes unpredictably, personalization is required, or the cost of manual updates outweighs the cost of automated adaptation. Start at the simplest level (prompts/memory) before investing in fine-tuning or RL.
Key Takeaways
- Learning means changing — either the model’s weights (training), the agent’s code (SICA-style), the agent’s knowledge base (memory-based), or the agent’s behavior within a session (few-shot). Each has different persistence, cost, and power characteristics.
- PPO is the workhorse RL algorithm. Its core insight — clip updates to prevent catastrophic policy changes — makes it stable enough to train complex agents. It’s behind ChatGPT’s alignment, OpenAI Five, and countless production RL systems.
- DPO is PPO’s simpler cousin for LLM alignment. By skipping the reward model entirely and directly optimizing on preference pairs, DPO achieves comparable alignment with less complexity, less compute, and less risk of reward hacking.
- Few-shot in-context learning is the most accessible form. No training, no compute investment — just well-crafted examples in the prompt. The limitation is that it’s session-scoped: the adaptation disappears when the context ends.
- SICA demonstrates self-modification is possible. An agent that edits its own code can invent tools (AST Symbol Locator, Smart Editor) that human designers didn’t build. The key enabling factor is an automated evaluation loop that provides objective feedback. Without measurable feedback, self-improvement is directionless.
- Safety requires containerization. Any agent that can execute code or modify files must run in an isolated environment (Docker container) that prevents it from affecting the host system. This is non-negotiable for production self-modifying agents.
- AlphaEvolve shows AI can participate in scientific discovery. The combination of LLM creativity (proposing modifications) with automated evaluation (objective scoring) and evolutionary selection (survival of the fittest) produces algorithms that surpass what human experts designed — including in matrix multiplication, chip design, and data center optimization.
- OpenEvolve makes evolutionary code optimization accessible. The four-component architecture (prompt sampler → LLM ensemble → evaluator pool → program database) is open source and generalizable to any programming task where you can define an automated evaluation metric.
- The key condition for all learning systems: automated evaluation. Without a fast, reliable way to measure whether a change is an improvement, learning is impossible. Designing the evaluation function is often harder than building the learning system itself.
Next up — Chapter 10: Model Context Protocol (MCP), where agents gain standardized ways to connect to any tool, server, or data source through a universal interface.