
Chapter 13: Human-in-the-Loop

Full autonomy sounds ideal — but in high-stakes domains, the cost of a single AI error is too high. Human-in-the-Loop is the pattern that keeps humans in control of the decisions that matter most.


The Autonomy Paradox

Before You Start — Key Terms Explained

Autonomy: The ability to act independently without human intervention. A fully autonomous agent makes all decisions on its own. A fully human-controlled system requires a human to approve every action. Most production AI systems live somewhere in between.

Escalation: The process of handing a task from an AI agent to a human when the agent reaches the limits of its reliable capability. Escalation policies define: what triggers escalation (complexity, uncertainty, risk level), what information is passed to the human, and what the human is expected to decide.

Human-in-the-Loop (HITL): A design pattern where human judgment is deliberately integrated into the AI's workflow at specific decision points. The human is not just monitoring — they are an active participant in the decision-making process for certain actions.

Human-on-the-Loop (HOTL): A related pattern where a human sets the rules and monitors for exceptions, but the AI handles day-to-day execution autonomously. The human is informed of what happened and can intervene — but doesn't approve each action in advance. Think: setting investment policy and letting the AI trade within that policy.

Alignment: The property of an AI system acting in accordance with human values, intentions, and goals. A system that's technically capable but misaligned might achieve the stated objective while violating implicit constraints — like a customer service bot that resolves complaints by simply closing tickets without actually addressing the issue.

Scalability (in this context): The ability to handle increasing volume. HITL has an inherent scalability limit — humans can't review a million transactions per day. This is the fundamental tension HITL must navigate: depth of oversight vs. breadth of coverage.

AI agents are getting remarkably capable. They can write code, analyze documents, execute complex multi-step workflows, and reason through nuanced problems. The natural next question is: why not just let them run autonomously?

The answer isn’t “because the technology isn’t ready” — in many narrow domains, the technology is perfectly capable of autonomous operation. The answer is more fundamental: in certain decisions, the consequences of errors are so severe that the cost of human oversight is worth paying, regardless of how good the AI is.

Consider:

  • A loan approval system denies credit to 10,000 applicants. The AI was 95% accurate. That’s still 500 people incorrectly denied credit — potentially ruining financial futures.
  • A medical diagnosis system recommends treatment. It’s wrong 2% of the time. That’s millions of misdiagnoses per year if deployed at scale.
  • A legal AI recommends a prison sentence. Even a 1% error rate means hundreds of people jailed incorrectly.

In these domains, “pretty good” is not good enough. The accountability, ethics, and legal requirements demand that a human be responsible for the decision — not because the AI can’t do it, but because the stakes require human judgment, human accountability, and human context.

Human-in-the-Loop (HITL) is the pattern that navigates this reality. It doesn’t reject AI autonomy — it directs autonomy to where it’s appropriate, and maintains human control where it’s necessary.


The Spectrum of Human Involvement

There is no single correct amount of human involvement in an AI system. The right level depends on the risk of errors, the volume of decisions, the expertise required to evaluate them, and the regulatory environment. The spectrum below shows the range:

HUMAN INVOLVEMENT SPECTRUM

Level 1 — Full Automation: AI acts without any human involvement.
The AI makes and executes decisions autonomously. No human reviews outputs or approves actions. Humans may monitor aggregate metrics but don't intervene on individual decisions.
When appropriate: decisions with very low risk, very high volume, and well-defined success criteria. Spam filtering, autocomplete suggestions, recommendation ranking, ad targeting.
Example: a spam filter automatically moves 10,000 emails to the spam folder. No human reviews each one. If the filter makes an occasional mistake, the cost is minor (a missed newsletter) and easily corrected.

Level 2 — Human-on-the-Loop (HOTL): human sets policy; AI executes autonomously within it.
A human expert defines the rules, boundaries, and policies. The AI then handles all decisions that fall within those policies autonomously and at high speed. Humans monitor for anomalies and update policies when needed.
When appropriate: high-volume decisions with well-defined policies, where the AI must act faster than humans can review. Algorithmic trading, call routing, network management.
Example: a financial manager defines: "Buy AAPL if RSI < 30, sell if RSI > 70, never invest more than 5% in any single stock." The AI executes hundreds of trades per day within these rules. The manager reviews a daily summary and updates the policy monthly.

Level 3 — Human-in-the-Loop (HITL): human reviews and approves specific high-stakes decisions.
The AI handles routine cases autonomously. For cases that are high-risk, ambiguous, or outside the AI's confident range, it pauses and escalates to a human. The human's decision is final and is recorded for accountability.
When appropriate: mixed-risk environments where most decisions are routine (AI can handle) but some require human judgment (AI should escalate). Fraud detection, content moderation, medical triage.
Example: a fraud detection AI automatically blocks 95% of clearly fraudulent transactions. For transactions with medium confidence scores, it flags them for human analyst review. The analyst investigates and makes the final call.

Level 4 — AI-Assisted Decision Making: human decides; AI provides analysis and recommendations.
The human always makes the decision. The AI's role is to assist — gathering relevant information, analyzing options, surfacing risks, generating recommendations. The human is fully in control but better informed.
When appropriate: complex, high-stakes decisions requiring human judgment, where AI can add value through analysis but shouldn't make the final call. Legal judgments, medical diagnoses, strategic business decisions.
Example: a doctor reviews an AI analysis of a patient's MRI scan: "AI detects pattern consistent with early-stage condition X (confidence: 78%). Similar patterns in 15 past cases: 12 confirmed, 3 not." The doctor makes the diagnosis. The AI informed the decision; it didn't make it.

Level 5 — Full Human Control: human makes all decisions; AI only provides tools or information.
The human is fully in charge. AI may provide data, formatting, or convenience — but no autonomous decision-making, no recommendations, no automation of consequential actions. Required in domains with strict regulatory or ethical constraints.
When appropriate: legal sentencing, criminal investigations, core civil liberties decisions, creative authorship, diplomatic negotiations.
Example: a judge uses AI only to access a case law database and format documents. All judicial reasoning, interpretation of law, and sentencing decisions are made entirely by the judge. No AI recommendations on outcomes.

How HITL Works: The Pattern Architecture

HUMAN-IN-THE-LOOP PATTERN — escalation and review flow

1. Request arrives. A user query, system event, or incoming data that the agent must process.
2. Agent processes and assesses confidence. It analyzes the request and takes action or generates a recommendation. Simultaneously, it estimates: How confident am I? Is this within my reliable capability? What is the risk if I'm wrong?
3. Escalation threshold check. High risk, low confidence, complex context, or a flagged category?
   • No — handle autonomously: execute the action; log the decision, action, and confidence score for the audit trail.
   • Yes — escalate to human: place the case in a human review queue. The agent presents the request, its analysis, the recommended action, its confidence level, and the reason for escalation.
4. Human decision. Approve the agent's recommendation · modify and approve · override with a different action · request more information.
5. Execute and learn. Execute the human-approved action. Store the (situation, human_decision) pair as training data for improving future confidence calibration.
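
In code, this loop stays small. A minimal sketch, assuming a hypothetical agent.process() that returns a recommendation with a calibrated confidence score, plus hypothetical execute(), review_queue, and audit_log collaborators:

CONFIDENCE_THRESHOLD = 0.80                  # tune to reviewer capacity
FLAGGED_CATEGORIES   = {"medical", "legal"}  # always escalate (illustrative)

def handle_request(request, agent, review_queue, audit_log):
    """Sketch of the flow above. All collaborators are hypothetical."""
    result = agent.process(request)  # recommendation + calibrated confidence

    needs_human = (
        result.confidence < CONFIDENCE_THRESHOLD
        or request.category in FLAGGED_CATEGORIES
    )

    if not needs_human:
        execute(result.action)  # hypothetical dispatcher
        audit_log.append({
            "request": request.id,
            "action": result.action,
            "confidence": result.confidence,
            "actor": "agent",
        })
    else:
        # Give the reviewer everything they need to decide.
        review_queue.put({
            "request": request,
            "analysis": result.analysis,
            "recommended_action": result.action,
            "confidence": result.confidence,
            "escalation_reason": (
                "flagged category"
                if request.category in FLAGGED_CATEGORIES
                else "confidence below threshold"
            ),
        })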

The four key design decisions in any HITL implementation:

  1. Escalation trigger: What conditions send a case to human review? (Confidence below threshold, specific risk categories, dollar amounts above $X, regulatory requirements)
  2. Information presented to human: What does the human see? The agent’s analysis should be complete but not overwhelming — include the key facts, the recommended action, and the specific reason for escalation.
  3. Response options: What can the human do? (Approve, modify, override, defer, flag for training)
  4. Feedback loop: How does the human’s decision improve the agent? Every escalated decision is a labeled training example — use it.
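
These four decisions can live in one explicit, reviewable object rather than being scattered through code. A sketch, with illustrative field names and defaults:

from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationPolicy:
    """The four HITL design decisions as configuration (illustrative)."""
    # 1. Escalation triggers
    confidence_threshold: float = 0.80
    max_impact_usd: float = 10_000
    flagged_categories: frozenset = frozenset({"medical", "legal"})
    # 2. Information presented to the human reviewer
    review_fields: tuple = ("request", "analysis", "recommended_action",
                            "confidence", "escalation_reason")
    # 3. Response options the reviewer can choose from
    response_options: tuple = ("approve", "modify", "override", "request_info")
    # 4. Feedback loop: where (situation, human_decision) pairs are stored
    feedback_store: str = "hitl_decisions.jsonl"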

Designing Escalation Policies

An escalation policy is a formal specification of when and how the agent hands off to a human. Without an explicit policy, escalation is ad hoc — inconsistent, unpredictable, and impossible to improve systematically.

What Triggers Escalation

📊 Confidence Threshold
Escalate when the agent's confidence in its output falls below a defined threshold. "If classification confidence < 80%, escalate." Requires the agent to produce calibrated confidence scores.
Example: content moderation — posts with borderline confidence scores go to human reviewers.

💰 Risk / Impact Threshold
Escalate when the consequence of an error exceeds a defined limit. "Any payment above $10,000 requires human approval." Independent of confidence — even a confident agent should get human sign-off on high-impact actions.
Example: financial systems — transactions above a threshold always go to human review.

🏷️ Category Rules
Escalate when the request falls into a pre-defined category that always requires human judgment. "All medical advice, legal decisions, and politically sensitive content → human review."
Example: customer support — any mention of "lawsuit", "discrimination", or "harassment" routes to a senior human agent.

🔄 Iteration Limit
Escalate when the agent has tried and failed to resolve a situation within N attempts. Prevents infinite retry loops that never resolve and ensures stuck cases get human attention.
Example: technical support — if the troubleshooting tool hasn't resolved the issue after 3 attempts, escalate to a human specialist.

😤 Sentiment / Urgency Signal
Escalate when user signals indicate emotional distress, urgent need, or an explicit request for a human. "If customer frustration score > 8/10, transfer to human agent." Respects user autonomy and emotional context.
Example: call centers — tone analysis triggers transfer to an empathetic human agent for distressed callers.

🚨 Novelty Detection
Escalate when the request is significantly different from anything the agent has seen in training. "Out-of-distribution input detected → escalate rather than confidently guess." The agent knows what it doesn't know.
Example: medical AI — an unusual symptom combination outside the training distribution is flagged for physician review.
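
The triggers compose naturally: check each in turn and escalate on the first match. A sketch in which the case fields (confidence, impact_usd, attempts, frustration_score, novelty_score) are assumed to be computed upstream:

from typing import Optional

def escalation_reason(case, policy) -> Optional[str]:
    """Return the first matching trigger, or None to proceed autonomously.
    Field names and thresholds are illustrative."""
    if case.confidence < policy.confidence_threshold:
        return "confidence below threshold"
    if case.impact_usd > policy.max_impact_usd:
        return "impact above risk threshold"
    if case.category in policy.flagged_categories:
        return "flagged category"
    if case.attempts >= 3:
        return "iteration limit reached"
    if case.frustration_score > 8:
        return "user distress or explicit human request"
    if case.novelty_score > 0.9:
        return "out-of-distribution input"
    return None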

The Code: Technical Support Agent with HITL

The ADK code example demonstrates HITL through an escalation tool — an agent that handles routine support cases autonomously but calls a human specialist for complex ones.

from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext
from google.adk.models import LlmRequest, LlmResponse
from google.adk.tools.tool_context import ToolContext
from typing import Optional

Why these imports? Agent is the core ADK agent class. ToolContext gives tools access to session state. CallbackContext, LlmRequest, and LlmResponse support the personalization callback — a hook that runs before every LLM call.

Defining the Tool Suite

def troubleshoot_issue(issue: str) -> dict:
    """
    Analyzes a technical issue and provides step-by-step troubleshooting guidance.
    Use this for standard technical problems like connectivity issues, software errors,
    or device configuration problems.

    Args:
        issue: Description of the technical problem.
    Returns:
        dict with 'status' and 'report' containing troubleshooting steps.
    """
    return {"status": "success", "report": f"Troubleshooting steps for {issue}."}


def create_ticket(issue_type: str, details: str) -> dict:
    """
    Creates a support ticket to log an unresolved issue for follow-up.
    Use this when basic troubleshooting hasn't resolved the issue.

    Args:
        issue_type: Category of the issue (hardware, software, connectivity, etc.)
        details: Full description of the problem and steps attempted so far.
    Returns:
        dict with 'status' and 'ticket_id' for tracking.
    """
    return {"status": "success", "ticket_id": "TICKET123"}


def escalate_to_human(issue_type: str) -> dict:
    """
    Escalates the customer's issue to a human specialist.
    Use this ONLY when:
    1. The issue is too complex for basic troubleshooting
    2. The customer is highly frustrated or explicitly requests a human
    3. The issue involves safety, legal, or financial implications
    4. Basic troubleshooting has been exhausted without resolution

    Args:
        issue_type: The category of issue requiring human expertise.
    Returns:
        dict confirming escalation with expected response time.
    """
    return {
        "status": "success",
        "message": f"Transferred to human specialist for {issue_type}. Expected wait: 5 minutes."
    }

Why is escalate_to_human a tool? This is the key design insight. Escalation is treated as an action the agent can choose to take — just like using any other tool. The LLM decides when to call it based on the docstring criteria. This means escalation logic is expressed in natural language (the docstring) rather than hardcoded if/else rules. The agent reads: “Use this ONLY when…” and applies judgment to decide if those conditions are met.

Why such a detailed docstring for escalate_to_human? The LLM will only call this tool when its training suggests the description matches the current situation. A vague description like “transfers to a human” gives the agent no guidance on when escalation is appropriate. The four specific conditions in the docstring make escalation selective and purposeful — the agent escalates when it genuinely should, not reflexively.

The Agent Configuration

technical_support_agent = Agent(
    name        = "technical_support_specialist",
    model       = "gemini-2.0-flash-exp",
    instruction = """
You are a technical support specialist for an electronics company.

FIRST, check if the user has a support history in state["customer_info"]["support_history"].
If they do, reference this history in your responses — don't ask them to repeat information
they've already provided in past interactions.

For technical issues, follow this workflow IN ORDER:
1. Use the troubleshoot_issue tool to analyze the problem.
2. Walk the user through the troubleshooting steps clearly, one at a time.
3. If the issue persists after troubleshooting, use create_ticket to log it for follow-up.

For complex issues beyond basic troubleshooting — or if the user is frustrated,
has tried multiple fixes already, or explicitly asks for a human — use escalate_to_human
to transfer them to a specialist who can help more effectively.

Maintain a professional but empathetic tone. Acknowledge that technical problems are
frustrating. Explain what you're doing at each step so the user understands the process.
""",
    tools = [troubleshoot_issue, create_ticket, escalate_to_human]
)

The ordered workflow in the instruction matters. “Follow this workflow IN ORDER” prevents the agent from jumping straight to escalation for routine problems (over-escalation) or from endlessly troubleshooting without ever creating a ticket (under-escalation). The explicit ordering creates a sensible default workflow while still allowing judgment for exceptional cases.

Why “if the user is frustrated… use escalate_to_human”? This sentiment-based escalation rule reflects a real support design principle: when a customer is emotionally distressed, technical competence is secondary to empathy. A human agent can provide the emotional connection that an AI cannot. This condition in the instruction teaches the agent to recognize and respond to frustration signals.

Personalization via Callback

def personalization_callback(
    callback_context: CallbackContext,
    llm_request: LlmRequest
) -> Optional[LlmResponse]:
    """
    Injects customer-specific context before every LLM call.

    This callback runs automatically before each LLM request.
    It reads customer data from session state and appends it to the
    request's system instruction so the agent has personalized context.

    Returns None to continue with the (modified) request.
    Returning an LlmResponse instead would skip the model call entirely
    and use that response as the agent's output.
    """
    customer_info = callback_context.state.get("customer_info")
    if not customer_info:
        return None  # No personalization data — continue without modification

    # Extract relevant fields with safe defaults
    customer_name     = customer_info.get("name",             "valued customer")
    customer_tier     = customer_info.get("tier",             "standard")
    recent_purchases  = customer_info.get("recent_purchases", [])
    support_history   = customer_info.get("support_history",  [])

    # Build a structured personalization note
    personalization_note = (
        f"\nCUSTOMER CONTEXT:\n"
        f"  Name: {customer_name}\n"
        f"  Support Tier: {customer_tier}\n"
    )

    if recent_purchases:
        personalization_note += f"  Recent Purchases: {', '.join(recent_purchases)}\n"

    if support_history:
        personalization_note += f"  Open Issues: {len(support_history)} past support cases\n"
        # Include the most recent case for immediate context
        personalization_note += f"  Most Recent Case: {support_history[-1]}\n"

    # Inject into this request's system instruction. (Gemini contents only
    # accept "user" and "model" roles, so appending to the system
    # instruction is the supported way to add per-request system context.)
    llm_request.append_instructions([personalization_note])

    return None  # Signal: continue with the modified request

What is a callback? A callback is a function that the framework calls automatically at specific points in the agent’s lifecycle. The personalization_callback is called before every LLM request — this is the “before-LLM” hook. Other callback types include after-LLM (to postprocess output), before-tool-call (to validate or modify tool calls), and after-tool-call (to process tool results). Callbacks let you inject cross-cutting logic without modifying the agent’s core instructions.
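
As a concrete example of another hook: ADK's before-tool-call callback can short-circuit a tool call by returning a result dict, which makes it a natural place for hard risk gates that do not depend on the LLM's judgment. A sketch, assuming a hypothetical issue_refund tool and dollar threshold:

from typing import Any, Optional
from google.adk.tools.base_tool import BaseTool
from google.adk.tools.tool_context import ToolContext

def refund_guard(
    tool: BaseTool, args: dict[str, Any], tool_context: ToolContext
) -> Optional[dict]:
    """Runs before every tool call. Returning a dict skips the tool and
    uses that dict as its result; returning None lets the call proceed."""
    if tool.name == "issue_refund" and args.get("amount", 0) > 500:
        # Hard gate: large refunds always need human sign-off, no matter
        # how confident the LLM is. (Tool and threshold are illustrative.)
        return {"status": "pending_approval",
                "message": "Refunds over $500 require human approval."}
    return None

Registered via before_tool_callback on the agent, this enforces the risk/impact trigger in code, where the LLM cannot talk its way around it.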

Why inject personalization through a callback rather than in the instruction? The instruction is static — it’s the same for every conversation. The personalization callback is dynamic — it reads the current session’s customer data and injects it fresh for each request. If you put customer-specific information in the instruction, you’d need a different agent instance per customer. The callback allows one agent definition to serve all customers with personalized context.

What is callback_context.state? This is the session state for the current conversation — the same session.state dictionary from Chapter 8 (Memory). The customer’s name, tier, and purchase history were stored there when the session was initialized (likely by a separate customer authentication step). The callback reads from it to enrich every LLM call.
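
How that data typically gets there, sketched with ADK's in-memory session service (the IDs and field values are illustrative, and create_session is async in recent ADK releases):

from google.adk.sessions import InMemorySessionService

session_service = InMemorySessionService()

async def seed_session():
    # Store customer data at authentication time; the callback reads it later.
    return await session_service.create_session(
        app_name="support_app",
        user_id="user_123",
        session_id="session_001",
        state={
            "customer_info": {
                "name": "Ada",
                "tier": "gold",
                "recent_purchases": ["X200 router"],
                "support_history": ["2026-01-12: intermittent Wi-Fi drops"],
            }
        },
    )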

Why return None? The callback return value controls flow: None means “continue with the (possibly modified) request.” Returning an LlmResponse object instead would skip the model call entirely and use that response as the agent’s output. In most personalization cases, None is correct — you want to add context to the existing request, not bypass the model.
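
One detail the walkthrough leaves implicit: the callback takes effect only once it is registered on the agent, via the before_model_callback parameter. Re-stating the earlier configuration, with the instruction text factored into a SUPPORT_INSTRUCTION constant for brevity:

technical_support_agent = Agent(
    name        = "technical_support_specialist",
    model       = "gemini-2.0-flash-exp",
    instruction = SUPPORT_INSTRUCTION,  # the instruction string shown earlier
    tools       = [troubleshoot_issue, create_ticket, escalate_to_human],
    before_model_callback = personalization_callback,
)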


The Human-on-the-Loop Variation

HITL requires a human to be present and responsive for every escalated case. This doesn’t scale to millisecond-latency systems. The Human-on-the-Loop (HOTL) pattern addresses this by letting humans operate at the policy level rather than the decision level.

HUMAN-ON-THE-LOOP (HOTL) — policy-driven, AI-executed

1. Human expert defines policy. Investment rules, routing criteria, response templates, escalation thresholds. Happens infrequently — weekly or monthly. No real-time involvement needed.
2. AI execution loop — runs continuously, at scale, within policy boundaries and without per-case human approval. Each incoming event, transaction, or request is monitored and classified, then checked against the policy:
   • Within policy boundaries — execute autonomously: act immediately and log for audit.
   • Outside policy — flag and queue for human review (not urgent).
3. Human reviews aggregates and exceptions. Daily summaries, exception reports, and flagged edge cases. Updates policy based on patterns. Provides feedback on AI decisions.

Two concrete examples of HOTL:

Automated trading: A fund manager writes: “Maintain 70% tech stocks, 30% bonds. No more than 5% in any single company. Auto-sell any stock that drops 10% below purchase price.” The AI executes hundreds of trades per day within these constraints, at speeds no human could match. The manager reviews end-of-day summaries and adjusts strategy monthly.

Call center routing: A manager sets rules: “Route ‘service outage’ mentions to technical specialists. If tone analysis detects high frustration, offer immediate human transfer. Upsell opportunities only to Gold-tier customers.” The AI handles initial interaction with all customers simultaneously, applying these rules in real time. The manager reviews daily reports, adjusts routing rules weekly based on data.
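
The structural difference from HITL shows up in code: the human appears as data (a policy object consulted on every decision), not as a runtime approval step. A sketch of the trading example, with illustrative names; portfolio.weight_after() is a hypothetical helper:

from dataclasses import dataclass

@dataclass
class TradingPolicy:
    max_single_position: float = 0.05  # never more than 5% in one company
    stop_loss_pct: float = 0.10        # auto-sell 10% below purchase price

def evaluate_trade(trade, portfolio, policy):
    """Execute autonomously if within policy; otherwise flag for the
    (non-urgent) human review queue. No per-trade human approval."""
    projected = portfolio.weight_after(trade)  # hypothetical helper
    if projected[trade.symbol] > policy.max_single_position:
        return ("flag", "would exceed single-position limit")
    return ("execute", None)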


Eight Practical Applications

01 — Content Moderation (Trust & Safety · Social platforms)
AI handles clear-cut violations (obvious spam, known illegal content) automatically. Borderline cases — satire vs. hate speech, context-dependent content — escalate to human moderators for nuanced judgment.

02 — Financial Fraud Detection (FinTech · Banking · Insurance)
AI flags suspicious transactions by pattern. High-risk or ambiguous alerts go to human analysts who investigate further, contact customers, and make the final determination — preserving accountability.

03 — Medical Diagnosis Support (Healthcare · Radiology · Pathology)
AI analyzes imaging, lab results, and patient history to surface possible diagnoses with confidence scores. Physicians review the AI analysis and make the clinical decision — informed, not replaced.

04 — Autonomous Driving (Transportation · AV systems)
Self-driving systems handle normal driving conditions autonomously. Extreme weather, unusual road conditions, and ambiguous edge cases trigger handover to the human driver — the system knows its limits.

05 — Customer Support Escalation (Customer Success · E-commerce · SaaS)
A chatbot handles routine queries. Complex issues, emotional distress, or explicit human requests trigger seamless handoff to a human support agent — with full conversation context transferred automatically.

06 — Legal Document Review (Legal · Compliance · M&A)
AI scans thousands of documents to identify relevant clauses, flag inconsistencies, and surface key passages. Legal professionals review AI findings for accuracy, context, and legal implications.

07 — Data Labeling & Annotation (ML training · Computer vision · NLP)
Human annotators provide ground truth labels for training data. AI pre-labels obvious cases; humans review and correct. The human feedback directly shapes the model's future behavior.

08 — Generative AI Refinement (Marketing · Creative · Engineering)
AI generates creative content (marketing copy, design concepts, code). Human editors and designers review, refine, and approve final output — ensuring brand alignment, quality, and appropriateness.

The Caveats: What HITL Cannot Solve

HITL is powerful, but treating it as a universal solution ignores serious structural limitations that must be addressed honestly.

Caveat 1: Fundamental scalability ceiling. A human reviewer can process perhaps 100-500 decisions per day with careful attention. An AI system can process millions. If 2% of all decisions require human review and you’re handling a million cases per day, you need 20,000 human reviews daily — infeasible for most organizations. HITL works when the escalation rate is low enough that humans can genuinely engage with each case rather than rubber-stamping them to clear the queue. If reviewers are overwhelmed, HITL’s quality guarantee collapses.
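
The arithmetic behind this caveat, restated as a sketch using the numbers above:

daily_volume    = 1_000_000  # decisions per day
escalation_rate = 0.02       # 2% of cases escalated
needed_reviews  = daily_volume * escalation_rate  # 20,000 reviews per day

# At 100-500 careful reviews per person per day (the estimate above),
# that is a standing team of 40-200 full-time reviewers:
min_team = needed_reviews / 500  # 40 reviewers
max_team = needed_reviews / 100  # 200 reviewers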

Caveat 2: Reviewer expertise is not guaranteed. HITL produces quality outcomes only when human reviewers are genuinely more qualified than the AI for the escalated cases. A non-specialist approving medical diagnoses, a junior analyst making fraud determinations, or an untrained moderator ruling on complex speech cases can produce outcomes worse than fully autonomous AI. The expertise required for meaningful oversight must be explicitly resourced.

Caveat 3: Annotation quality is not automatic. When HITL is used to generate training data (humans label outputs to train the model), the quality of that training data depends entirely on the quality of the human annotation. Annotators may have inconsistent standards, implicit biases, or misunderstand the task. Training data is only as good as the annotation guidelines and the annotator training — both require significant investment.

Caveat 4: Privacy constraints complicate oversight. For human reviewers to exercise meaningful judgment, they need full context — which may include sensitive personal information. Medical records, financial history, private communications. Before a case can reach a human reviewer, sensitive data must often be anonymized or redacted — which may remove exactly the context needed to make a good decision. Privacy and effective oversight are in tension, and there is no easy solution.

Caveat 5: Reviewer bias affects model learning. Human reviewers have biases — cultural, demographic, professional. When their decisions become training data, those biases are learned by the model. HITL doesn’t eliminate bias; it changes whose bias shapes the system. Diverse reviewer panels, explicit bias auditing, and disagreement resolution protocols help — but the fundamental challenge remains.


Common Mistakes in HITL Design

Mistake 1: Escalating too much. If 40% of cases are escalated, reviewers face burnout, decisions become rushed, and the human oversight becomes rubber-stamping rather than genuine review. Calibrate escalation thresholds to match reviewer capacity — if you have 100 reviewer-hours per day, your escalation rate should produce fewer than 100 meaningful cases to review.

Mistake 2: Escalating too little. If borderline cases that should go to human review are processed autonomously, the value of HITL is lost. The edge cases are exactly where human judgment matters most. Monitor the cases that were handled autonomously and then had poor outcomes — these might have been better escalated.

Mistake 3: Not closing the feedback loop. Every escalated human decision is valuable training data. If you don’t capture it and use it to retrain or re-calibrate the agent, you’re throwing away the most valuable signal in your system. Every human correction should update either the model, the escalation threshold, or the agent’s instructions.
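
Closing the loop can start as simply as persisting every escalated decision as a labeled example. A minimal sketch; the JSONL schema is an assumption, not a standard:

import json

def record_human_decision(case, agent_recommendation, human_decision,
                          path="hitl_decisions.jsonl"):
    """Append one (situation, human_decision) pair for later retraining
    or escalation-threshold re-calibration."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "input": case,
            "agent_recommendation": agent_recommendation,
            "human_decision": human_decision,
            "agent_was_right": agent_recommendation == human_decision,
        }) + "\n")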

Mistake 4: Poor information presentation to reviewers. The human reviewer sees only what the system presents. If the escalation interface shows only the raw AI output without the reasoning, context, and specific escalation reason, the reviewer can’t make an informed decision — they’re guessing. Invest in the reviewer interface as much as the agent logic.

Mistake 5: No audit trail. In high-stakes domains (medical, legal, financial), every decision must be traceable: who made it, when, based on what information, with what justification. If a human overrides an AI recommendation, that override and its rationale must be logged. Audit trails are not optional features — they’re core infrastructure for accountable AI.
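
What a minimal audit record can look like; the fields are illustrative, not a compliance specification:

import json
from datetime import datetime, timezone

def log_decision(actor, action, basis, rationale, path="audit.jsonl"):
    """Append-only record: who decided, when, based on what, and why.
    Overrides should log both the AI recommendation and the human rationale."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,        # "agent" or a reviewer ID
            "action": action,
            "basis": basis,        # information the decision relied on
            "rationale": rationale,
        }) + "\n")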


At a Glance

WHAT

A pattern that deliberately integrates human judgment into AI workflows at high-stakes decision points — through escalation, review, feedback, and oversight — rather than operating fully autonomously or requiring humans to do everything manually.

WHY

In domains with high error costs (medical, legal, financial, safety-critical), technical AI capability is insufficient — accountability, ethics, and regulatory requirements demand human judgment and human responsibility for consequential decisions.

RULE OF THUMB

Use HITL when errors have significant safety, ethical, or financial consequences. Use Human-on-the-Loop when volume requires full automation but policy oversight is sufficient. Always design the feedback loop — human decisions are your best training data.


Key Takeaways

  • HITL is not about AI capability — it’s about accountability. Even if an AI can make a correct decision 99% of the time, certain decisions require a human to be accountable for them. Loan rejections, medical treatments, prison sentences — the person affected has a right to know a human was responsible.

  • The spectrum from full automation to full human control is continuous. Most systems should live somewhere in the middle, with the dial position determined by risk level, volume, expertise availability, and regulatory requirements — not by ideology.

  • Escalation triggers must be explicit and calibrated. Vague escalation logic (“escalate when unsure”) produces inconsistent results. Define specific thresholds: confidence scores, dollar amounts, category rules, iteration limits. Calibrate escalation rates to match reviewer capacity.

  • Escalation is a tool the agent calls. In ADK, escalate_to_human is a function with a docstring that tells the LLM when to call it. This means escalation logic is expressed in the same natural language interface as tool use — making it adjustable without code changes.

  • The personalization callback injects context before every LLM call. This is how the agent knows customer history, tier, and past cases without those details being hardcoded in every conversation. Callbacks are hooks that let you inject cross-cutting logic without modifying the agent’s core definition.

  • Human-on-the-Loop scales better than Human-in-the-Loop. HOTL lets humans set policy once and review aggregates periodically, rather than approving every individual decision. Use HOTL when decisions must happen at machine speed but human policy guidance is essential.

  • Every human decision is a training signal — capture it. The cases escalated to human review are the hardest cases, and the human’s decision is the highest-quality label you can get. If you’re not feeding these back into the model, you’re leaving the most valuable improvement signal unused.

  • HITL’s primary risk is scalability. A poorly calibrated HITL system where 30% of cases escalate creates overwhelmed reviewers who rubber-stamp decisions rather than genuinely evaluate them — worse than no oversight at all. The escalation rate must be manageable.


This concludes the core agentic AI patterns series. From prompt chaining (Chapter 1) to human-in-the-loop (Chapter 13), you now have the complete toolkit for designing, building, and deploying intelligent AI agent systems — with the depth to understand not just how each pattern works, but why it exists and when to use it.



