ARTICLE · 19 MIN READ · FEBRUARY 18, 2026
Chapter 12: Exception Handling and Recovery
Real-world agents fail. Networks time out, APIs return errors, databases go down, data arrives malformed. This chapter is about building agents that handle failure gracefully — detecting it, recovering where possible, and keeping the user informed throughout.
Why Agents Break in the Real World
Exception: An unexpected event that disrupts the normal flow of a program. In Python, exceptions are objects that represent errors — ConnectionError means the network failed, ValueError means the data had the wrong format, TimeoutError means a request took too long. When an exception occurs and isn't handled, the program crashes.
try/except (Python's error handling): A code structure that lets you attempt an operation and gracefully handle any errors it raises. Inside try you write the risky code. Inside except you write what to do if it fails. The program doesn't crash — it follows your fallback path instead.
Exponential backoff: A retry strategy where each retry waits longer than the previous one. Wait 1 second, then 2, then 4, then 8. This prevents "thundering herd" — where thousands of clients all retry simultaneously, overwhelming an already-struggling server.
Graceful degradation: Providing reduced functionality instead of complete failure. A chatbot that can't access the customer database might still answer general questions — reduced capability, but still useful. Better than refusing to respond at all.
Idempotency: A property of operations where calling them multiple times has the same effect as calling them once. GET requests are idempotent (reading data doesn't change it). POST requests often aren't (calling "send_email" twice sends two emails). Important for retry logic — only retry idempotent operations automatically.
Circuit breaker: A pattern that automatically stops calling a failing service after a threshold of failures. Like an electrical circuit breaker that trips when it detects an overload. Prevents cascading failures where one broken service causes the entire system to hang waiting for responses that will never come.
Fallback: An alternative approach that activates when the primary approach fails. If the precise GPS lookup fails, fall back to city-level location data. If the payment processor is down, fall back to a backup processor. Fallbacks ensure some functionality is preserved even when components fail.
Every pattern in this series has operated in idealized conditions: tools work, APIs respond, data arrives in the expected format. But deploy an agent in production and reality is messier:
- The weather API returns a 503 Service Unavailable during peak traffic
- The database query times out because another process is holding a lock
- The LLM returns JSON with a missing field your code expects
- The email service rejects the request because the recipient’s inbox is full
- A third-party service changes its response format without notice
- The user provides malformed input that breaks a tool’s parameter validation
- A network packet drops mid-request, leaving the connection in an ambiguous state
None of these are bugs in your agent’s logic. They’re the normal chaos of distributed systems. And a well-designed agent handles all of them — not by pretending they won’t happen, but by anticipating them and planning responses.
The Exception Handling and Recovery pattern is about building agents that are resilient — capable of detecting failures, responding to them appropriately, and restoring operation — rather than fragile agents that crash on first contact with an unexpected input.
This distinction matters enormously for production deployment. An agent that handles failures gracefully is trustworthy. An agent that crashes unpredictably is a liability, regardless of how intelligent its core reasoning is.
The Three Phases of Exception Management
Exception handling in agents follows a clear three-phase structure:
Phase 1: Error Detection
You can’t handle what you don’t know about. Error detection is the surveillance layer — continuously monitoring for signals that indicate something has gone wrong.
Types of Errors Agents Encounter
- Network errors: ConnectionError · TimeoutError · SSLError
- API errors: HTTPError · status codes 4xx/5xx
- Data format errors: JSONDecodeError · KeyError · ValueError
- LLM errors: ContextLengthError · RateLimitError
Phase 2: Error Handling Strategies
Once an error is detected, you need a playbook — a structured set of responses based on the error type and severity. The five strategies below cover the full range from “this will resolve itself” to “this needs a human.”
Strategy 1: Retry with Exponential Backoff
The most common strategy for transient errors. The operation failed temporarily — wait a moment and try again.
Why exponential backoff? If your request failed because the server is overloaded and 10,000 other clients all retry at exactly the same time (1 second later), you’ve just made the overload worse. Exponential backoff spreads retries out: different clients wait different amounts before retrying, reducing the “retry storm” problem.
Here’s how to implement retry with exponential backoff in Python:
import time
import random

def retry_with_backoff(func, max_retries=4, base_delay=1.0, max_delay=60.0):
    """
    Execute `func` with exponential backoff on failure.

    Args:
        func: The callable to retry. Should raise an exception on failure.
        max_retries: Maximum number of retry attempts (not counting first try).
        base_delay: Initial wait time in seconds.
        max_delay: Maximum wait time cap in seconds.

    Returns:
        The result of func() if it eventually succeeds.

    Raises:
        The last exception raised by func if all retries are exhausted.
    """
    last_exception = None
    for attempt in range(max_retries + 1):  # +1 for the initial try
        try:
            return func()  # Try the operation
        except (ConnectionError, TimeoutError) as e:
            # These are transient — worth retrying
            last_exception = e
            if attempt == max_retries:
                break  # No more retries — fall through to raise
            # Exponential backoff with jitter
            # jitter = random noise to prevent synchronized retries from multiple clients
            wait_time = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, wait_time * 0.1)  # up to 10% extra randomness
            actual_wait = wait_time + jitter
            print(f"Attempt {attempt + 1} failed: {e}. Waiting {actual_wait:.1f}s before retry...")
            time.sleep(actual_wait)
        except Exception:
            # Non-transient errors (auth failures, invalid requests) — don't retry
            raise  # Re-raise immediately, no retry
    raise last_exception  # All retries exhausted — propagate final exception
Why separate ConnectionError/TimeoutError from the general Exception? Retrying makes sense for transient failures — the server is temporarily busy, the network had a hiccup. It makes no sense for permanent failures — a 401 Unauthorized error won’t become authorized just because you waited and tried again. Using separate exception types for different retry policies prevents wasting retries on errors that will never resolve themselves.
What is “jitter”? Without jitter, if 1,000 clients all hit a rate limit at the same moment and all back off for exactly 1 second, they all retry at exactly the same moment — still overwhelming the server. Adding random noise (±10% of the wait time) spreads their retries across a window, significantly reducing the “synchronized retry storm” problem.
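A quick way to see the helper in action is to wrap a deliberately flaky function. This simulation is a sketch, not part of the original example:
attempts = {"count": 0}

def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network blip")
    return {"status": "ok", "succeeded_on_attempt": attempts["count"]}

result = retry_with_backoff(flaky_lookup, max_retries=4, base_delay=0.5)
print(result)  # succeeds on the third call, after two backoff waits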
Strategy 2: Fallback Mechanisms
When retries are exhausted, activate an alternative approach. Fallbacks are pre-planned alternatives that provide some value when the primary approach is unavailable.
def get_location_info(address: str) -> dict:
    """
    Attempt precise location lookup, fall back to city-level if unavailable.
    """
    # Primary: precise geocoding API
    try:
        result = precise_location_api.lookup(address)
        return {"precision": "precise", "data": result}
    except (APIError, TimeoutError) as e:
        print(f"Precise lookup failed: {e}. Activating fallback...")
        # Fallback 1: city-level lookup from a different provider
        try:
            city = extract_city_from_address(address)
            result = city_lookup_api.get(city)
            return {"precision": "city-level", "data": result, "degraded": True}
        except Exception as fallback_error:
            print(f"City-level fallback also failed: {fallback_error}")
            # Fallback 2: cached/static data
            cached = location_cache.get(address)
            if cached:
                return {"precision": "cached", "data": cached, "degraded": True, "stale": True}
            # No fallback available
            return {"precision": "none", "error": str(e), "degraded": True}
The fallback hierarchy. Good fallback design has multiple levels: (1) primary approach, (2) alternative service with full data, (3) service with reduced data, (4) cached data, (5) graceful failure message. Each level provides less value but more availability. The agent always has a path forward.
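One way to make that hierarchy explicit is a small helper that walks an ordered list of (label, callable) pairs and returns the first result it can obtain. This is a sketch layered on the same hypothetical lookup services used above, not a library API:
def first_available(sources):
    """Try each (label, fetch) pair in order; return the first successful result."""
    errors = []
    for label, fetch in sources:
        try:
            data = fetch()
            return {"source": label, "data": data, "degraded": label != sources[0][0]}
        except Exception as e:
            errors.append(f"{label}: {e}")
    # Every level failed: return a graceful failure record instead of raising
    return {"source": "none", "data": None, "degraded": True, "errors": errors}

# Ordered from most valuable to most available:
result = first_available([
    ("precise", lambda: precise_location_api.lookup(address)),
    ("city",    lambda: city_lookup_api.get(extract_city_from_address(address))),
    ("cached",  lambda: location_cache[address]),
])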
Mark degraded results clearly. The "degraded": True flag in the return value tells downstream code and the LLM that this result is less reliable than normal. The LLM can then communicate appropriately to the user: “I found city-level location data — detailed address lookup was temporarily unavailable.”
Strategy 3: Graceful Degradation
Provide partial functionality rather than complete failure. When one component fails, other components continue working.
def process_user_request(user_query: str, user_id: str) -> dict:
    """
    Process request with graceful degradation of individual components.
    """
    response = {"query": user_query, "components_used": []}

    # Attempt 1: Personalization (nice-to-have, not critical)
    try:
        preferences = user_preference_db.get(user_id)
        response["personalization"] = preferences
        response["components_used"].append("personalization")
    except Exception as e:
        # Personalization failed — continue without it
        log_warning(f"Personalization unavailable for {user_id}: {e}")
        response["personalization"] = None  # Use defaults

    # Attempt 2: Real-time data (important, but not blocking)
    try:
        live_data = market_data_api.get_latest()
        response["market_data"] = live_data
        response["components_used"].append("real_time_data")
    except Exception as e:
        # Fall back to cached data from last successful fetch
        cached = cache.get("market_data", max_age_seconds=300)
        if cached:
            response["market_data"] = cached
            response["data_freshness"] = "cached (may be up to 5min old)"
            response["components_used"].append("cached_data")
        else:
            response["market_data"] = None
            response["data_freshness"] = "unavailable"

    # Core functionality — this must succeed
    response["answer"] = llm.generate(
        query=user_query,
        context=response
    )
    response["components_used"].append("llm_core")
    return response
Why not catch everything at the top level? A single try/except around the entire function would either catch everything (and lose detail about what failed) or re-raise on first failure (no degradation). Component-level try/except gives you surgical control: some components are optional (personalization), some have fallbacks (real-time data → cache), and some are required (the LLM call). Each component’s failure is handled exactly as its criticality demands.
Strategy 4: Circuit Breaker
The circuit breaker pattern prevents cascading failures by automatically stopping calls to a consistently failing service.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation — calls pass through
    OPEN = "open"            # Failing — calls blocked immediately
    HALF_OPEN = "half_open"  # Testing — one call allowed to check recovery

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5   # Open after this many consecutive failures
    recovery_timeout: int = 30   # Seconds to wait before half-open
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: datetime = None

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # Check if recovery timeout has elapsed
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.recovery_timeout):
                self.state = CircuitState.HALF_OPEN
                print("Circuit HALF-OPEN: testing if service recovered...")
            else:
                raise Exception("Circuit OPEN: service unavailable. Skipping call to prevent cascade.")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            print("Circuit CLOSED: service has recovered.")
            self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            print(f"Circuit OPENED after {self.failure_count} failures. Will retry in {self.recovery_timeout}s.")
Why is the circuit breaker needed when you already have retry logic? Without a circuit breaker, every request retries independently — even if you know the service has been failing for the past 10 minutes. The circuit breaker creates shared knowledge: once enough requests have failed, all subsequent requests immediately fail fast rather than waiting for a timeout. This frees up resources and prevents the cascade where one slow service causes the entire system to pile up with waiting threads.
The three states explained: CLOSED = normal (all calls pass through). OPEN = failed (all calls immediately rejected, no actual call made). HALF_OPEN = recovery probe (one call allowed — if it succeeds, CLOSED; if it fails, back to OPEN with refreshed timeout).
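A usage sketch: one shared breaker instance guards every call to the (hypothetical) market data client, and callers fall back to the cache when the breaker is open or the call fails:
market_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def get_market_data():
    try:
        # All callers share market_breaker, so they share knowledge of the outage
        return market_breaker.call(market_data_api.get_latest)
    except Exception:
        # Breaker is OPEN, or the call itself failed: degrade to cached data (Strategies 2 and 3)
        return cache.get("market_data", max_age_seconds=300)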
Strategy 5: Logging and Observability
You cannot improve what you cannot see. Comprehensive logging transforms failures from mysteries into actionable data.
import logging
import traceback
import json
from datetime import datetime

# Configure structured logging
logger = logging.getLogger(__name__)

def log_agent_error(
    error: Exception,
    agent_name: str,
    action: str,
    inputs: dict,
    attempt_number: int,
    context: dict = None
):
    """
    Log a structured error record for debugging and monitoring.
    """
    error_record = {
        "timestamp": datetime.utcnow().isoformat(),
        "agent": agent_name,
        "action": action,
        "attempt": attempt_number,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "error_traceback": traceback.format_exc(),
        "inputs": inputs,    # What was the agent trying to do?
        "context": context,  # What was the system state?
    }
    # JSON format enables querying with log analytics tools (Splunk, Datadog, etc.)
    logger.error(json.dumps(error_record))
Why structured JSON logging instead of plain text? Plain text logs like "Error: connection failed at 14:32" are readable but not queryable. JSON logs like {"error_type": "ConnectionError", "agent": "billing_agent", "attempt": 3} can be filtered, aggregated, and alerted on by log management tools. You can ask: “Show me all ConnectionError events from billing_agent in the last hour with more than 2 retries” — impossible with plain text, trivial with structured JSON.
Phase 3: Recovery Strategies
Handling an error keeps the system running. Recovery restores it to full health.
State Rollback
When an agent performs multiple steps and fails partway through, incomplete actions can leave the system in a corrupted state. State rollback reverses completed steps.
class TransactionManager:
    """
    Manages multi-step agent operations with rollback on failure.
    Like a database transaction — either everything succeeds, or everything rolls back.
    """

    def __init__(self):
        self.completed_steps = []  # Stack of (step_name, undo_function) pairs

    def execute_step(self, step_name: str, action, undo_action):
        """Execute one step and register its undo operation."""
        try:
            result = action()
            self.completed_steps.append((step_name, undo_action))
            print(f"✓ Step completed: {step_name}")
            return result
        except Exception as e:
            print(f"✗ Step failed: {step_name} — {e}")
            self.rollback_all()
            raise

    def rollback_all(self):
        """Undo all completed steps in reverse order."""
        print(f"Rolling back {len(self.completed_steps)} completed steps...")
        while self.completed_steps:
            step_name, undo_fn = self.completed_steps.pop()
            try:
                undo_fn()
                print(f"  ↩ Rolled back: {step_name}")
            except Exception as e:
                # Log rollback failure but continue trying other rollbacks
                print(f"  ⚠ Rollback failed for {step_name}: {e}")

# Usage example: booking a travel package
def book_travel_package(flight_id, hotel_id, car_id):
    tx = TransactionManager()
    try:
        # Each step: (action, undo_action)
        tx.execute_step(
            "book_flight",
            action=lambda: flight_api.book(flight_id),
            undo_action=lambda: flight_api.cancel(flight_id)
        )
        tx.execute_step(
            "book_hotel",
            action=lambda: hotel_api.book(hotel_id),
            undo_action=lambda: hotel_api.cancel(hotel_id)
        )
        tx.execute_step(
            "book_car",
            action=lambda: car_api.book(car_id),
            undo_action=lambda: car_api.cancel(car_id)
        )
        print("✅ All bookings successful!")
    except Exception as e:
        print(f"❌ Booking failed: {e}. All completed bookings have been cancelled.")
        raise
Why reverse order for rollback? If you completed steps A → B → C and step D fails, you need to undo C before B, and B before A. Undoing in original order could leave dependencies in place that make the undo impossible. Think of building with Lego: you take apart the last piece first, not the first piece first.
Why continue rolling back even if a rollback step fails? Because the goal of rollback is to restore the most complete clean state possible. If hotel rollback fails but car rollback succeeds, you’ve at least recovered the car booking cost. Stopping at the first rollback failure would leave more resources locked.
The ADK Implementation: SequentialAgent with Fallback
Google ADK’s SequentialAgent provides a natural structure for implementing the primary → fallback → response pattern using session state as the coordination mechanism.
from google.adk.agents import Agent, SequentialAgent
Why use a SequentialAgent for exception handling? Each “handler” in the exception handling pipeline is a distinct responsibility: the primary handler tries the best approach, the fallback handler activates if needed, and the response handler presents whatever result was obtained. SequentialAgent guarantees these run in order and share state through session.state — making the coordination explicit and debuggable.
# Agent 1: Attempts the primary approach — high precision
primary_handler = Agent(
    name = "primary_handler",
    model = "gemini-2.0-flash-exp",
    instruction = """
    Your job is to get precise location information.
    Use the get_precise_location_info tool with the user's provided address.
    If the tool succeeds, store the result in state["location_result"].
    If the tool fails for any reason, store True in state["primary_location_failed"].
    Always set state["primary_location_failed"] to either True or False.
    """,
    tools = [get_precise_location_info]
)
Why store the failure signal in state["primary_location_failed"]? The secondary agent needs to know whether the primary succeeded or failed. Session state is the shared communication channel between sequential agents in ADK — it’s the equivalent of a shared variable that both agents can read and write. Without this explicit signal, the fallback agent would have no way to know whether to activate.
Why “Always set state['primary_location_failed'] to either True or False”? This is defensive instruction design. Without it, the agent might only set the flag when it fails, leaving it unset (and thus None) when it succeeds. The fallback agent then can’t reliably distinguish “succeeded” from “flag not set.” Explicit True/False handling is more reliable than absence/presence.
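Conceptually, after the primary handler finishes, the shared state has one of two shapes (illustrative values only, not actual ADK output):
# Primary lookup succeeded: the fallback handler will do nothing
{"location_result": {"lat": 37.42, "lon": -122.08}, "primary_location_failed": False}

# Primary lookup failed: the fallback handler will activate
{"primary_location_failed": True}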
# Agent 2: Conditional fallback — only activates if primary failed
fallback_handler = Agent(
    name = "fallback_handler",
    model = "gemini-2.0-flash-exp",
    instruction = """
    Check the value of state["primary_location_failed"].
    If it is True:
      - Extract the city name from the user's original query
      - Use the get_general_area_info tool with that city name
      - Store the result in state["location_result"]
      - Store "city-level (degraded)" in state["data_precision"]
    If it is False:
      - Do nothing. The primary handler already succeeded.
    """,
    tools = [get_general_area_info]
)
Why check state in the instruction rather than in code? The conditional logic (“if failed, activate; if not, do nothing”) is expressed in natural language because the agent’s LLM reads it and decides what to do. This is the ADK way — the LLM is the control flow mechanism for agent behavior. Alternative: pre-check in Python and skip the agent if not needed, which is also valid and more efficient for deterministic conditions.
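For those deterministic conditions, the pre-check alternative looks roughly like this in plain Python. It is a sketch, not an ADK API; run_fallback_lookup is a hypothetical helper wrapping the fallback tool:
# Decide in code whether the fallback step is needed at all,
# instead of letting the LLM read the flag.
if session_state.get("primary_location_failed"):
    run_fallback_lookup(session_state)
# else: skip the fallback entirely; no LLM call, no token cost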
# Agent 3: Presents results regardless of which path succeeded
response_agent = Agent(
    name = "response_agent",
    model = "gemini-2.0-flash-exp",
    instruction = """
    Review the location information in state["location_result"].
    Present this information clearly to the user.
    If state["data_precision"] is "city-level (degraded)", mention that
    you're showing city-level data because detailed lookup was temporarily unavailable.
    If state["location_result"] is empty or does not exist, apologize that
    you could not retrieve location information and suggest trying again later.
    """,
    tools = []  # This agent only reasons — no tool calls needed
)
Why does the response agent have tools=[]? The response agent’s job is purely to interpret and communicate the state — it doesn’t need to call any tools. Giving it an empty tool list makes this explicit and prevents the LLM from attempting unnecessary tool calls. It’s also more efficient — no tool schema is sent to the model.
Why have a separate response agent at all? The response format shouldn’t be the responsibility of either the primary or fallback handler — they’re focused on data retrieval. Separating presentation from retrieval means: if you want to change the response format (add more context, translate to another language, format as HTML), you only change the response agent without touching the retrieval logic.
# Assemble: SequentialAgent ensures guaranteed execution order
robust_location_agent = SequentialAgent(
    name = "robust_location_agent",
    sub_agents = [primary_handler, fallback_handler, response_agent]
)
What happens if even the fallback fails? The response agent is designed to handle this: "If state['location_result'] is empty or does not exist, apologize..." This is the final safety net — no matter what happens upstream, the response agent always runs and always produces a response that makes sense to the user. The user never sees an unhandled exception; they always get a coherent message.
The Complete Flow Visualized
graph TD
U([User: "Find info for 123 Main St"]) --> PA[primary_handler]
PA -->|tool call| GPS[get_precise_location_info]
GPS -->|success| S1[state: location_result = precise data<br>primary_location_failed = False]
GPS -->|error: 503| S2[state: primary_location_failed = True]
S1 --> FB[fallback_handler]
S2 --> FB
FB -->|failed=False| SKIP[Do nothing — primary succeeded]
FB -->|failed=True| CITY[get_general_area_info tool]
CITY --> S3[state: location_result = city data<br>data_precision = city-level]
SKIP --> RA[response_agent]
S3 --> RA
RA --> OUT([User sees: location info with appropriate context])
style PA fill:#141b2d,stroke:#2698ba,color:#e0e0e0
style FB fill:#141b2d,stroke:#e6a817,color:#e0e0e0
style RA fill:#141b2d,stroke:#4fc97e,color:#e0e0e0
style OUT fill:#141b2d,stroke:#4fc97e,color:#e0e0e0
Common Mistakes in Exception Handling
Mistake 1: Catching everything with a blanket except Exception. A single catch-all handler loses all information about what went wrong. Different errors need different responses — a 401 Unauthorized needs credentials fixed; a 503 needs a retry. Catch specific exception types and handle each appropriately.
Mistake 2: Retrying non-idempotent operations. If send_email() fails partway, should you retry it? Only if the email service guarantees idempotency (same message ID = only sent once). Otherwise, retrying sends duplicate emails. Always ask: “Is it safe to call this twice?” before adding retry logic.
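One common way to make a non-idempotent call retry-safe is a client-generated idempotency key the provider can use to deduplicate. This sketch assumes the (hypothetical) email client accepts such a key:
import uuid

def send_email_once(to: str, subject: str, body: str):
    key = str(uuid.uuid4())  # generated once, reused across every retry of this same send
    return retry_with_backoff(
        lambda: email_api.send(to=to, subject=subject, body=body, idempotency_key=key)
    )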
Mistake 3: Swallowing exceptions silently. except Exception: pass is one of the most dangerous patterns in programming. The error is hidden from logs, from monitoring, from the developer, and from the user. The system appears to be running correctly while actually failing silently. Always log exceptions, even if you handle them gracefully.
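The safer version of the same shortcut keeps the operation non-fatal but leaves a trace (send_usage_metrics is a stand-in for any best-effort call):
# Dangerous: the failure disappears without a trace
try:
    send_usage_metrics(event)
except Exception:
    pass

# Better: still non-fatal, but visible to logs and monitoring
try:
    send_usage_metrics(event)
except Exception as e:
    logger.warning("Metrics emission failed: %s", e, exc_info=True)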
Mistake 4: Logging after a failed rollback, not the original error. If your rollback fails, the error you log is the rollback failure — but the root cause is the original operation failure. Log the original error first, then attempt rollback, then log any rollback failures separately.
Mistake 5: No circuit breaker for external dependencies. Without circuit breakers, a slow or failed external service causes your agent to pile up with threads waiting for timeouts. Each request waits 10 seconds before failing — with 100 concurrent users, you suddenly have 1,000 seconds of accumulated wait time. Circuit breakers fail fast, freeing resources immediately.
Mistake 6: Not testing failure paths. Error handling code is only tested when things go wrong — which means in production, when you least want surprises. Deliberately inject failures in testing: mock the API to return 503, send malformed JSON, trigger timeouts. Ensure your error handling actually works before deploying.
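A minimal failure-injection test using unittest.mock. The module path is hypothetical, and it assumes get_location_info's fallback providers are reachable (or also mocked) in the test environment:
from unittest.mock import patch

def test_location_lookup_survives_precise_api_outage():
    # Make the primary dependency fail the way a real timeout or 503 would
    with patch("agent.tools.precise_location_api.lookup", side_effect=TimeoutError("simulated outage")):
        result = get_location_info("123 Main St")
    assert result["degraded"] is True  # we got a degraded answer, not a crash
    assert "precision" in result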
At a Glance
A structured approach to detecting, handling, and recovering from operational failures in AI agents — covering network errors, API failures, data format errors, LLM output failures, and logic errors, with specific strategies for each.
Real-world systems fail constantly. An agent without exception handling is fragile — one API error crashes the entire conversation. An agent with exception handling is resilient — it detects problems, recovers where possible, degrades gracefully where not, and always leaves the user informed.
Use this pattern for any production agent that interacts with external systems — which means every production agent. The implementation complexity scales with the criticality of the application: more retries, more fallbacks, and more human escalation paths for higher-stakes systems.
Key Takeaways
- Expect failures, design for them. Production agents encounter network timeouts, API errors, malformed data, LLM output failures, and resource exhaustion constantly. These are not edge cases — they are normal operating conditions. Design for them explicitly, not as an afterthought.
- Classify errors before handling them. Transient errors (network glitches, rate limits, server overload) warrant retries. Permanent errors (authentication failures, invalid requests, missing resources) should fail fast — retrying wastes time and resources. Know the difference.
- Exponential backoff with jitter is the right default retry strategy. Linear backoff creates retry storms. Exponential backoff spreads load. Jitter (randomization) prevents synchronized retries from multiple clients. The formula: wait = w + random(0, 0.1 × w), where w = min(base × 2^attempt, max_delay).
- Fallback hierarchies provide resilience in depth. Primary approach fails? Try alternative service. That fails? Use cached data. That fails? Return a graceful degradation message. Each level provides less value but more availability. Design the full hierarchy before deployment.
- Circuit breakers prevent cascading failures. When a service is consistently failing, stop calling it immediately rather than making every request wait for a timeout. Fail fast, preserve resources, check periodically whether the service has recovered.
- Log everything, swallow nothing. except Exception: pass is production debt. Every caught exception should be logged with structured context (agent name, action, inputs, error type, traceback). This data is essential for diagnosing production issues.
- State rollback is critical for multi-step operations. If a 5-step workflow fails at step 3, steps 1 and 2 need to be undone — otherwise you have partial, corrupted state. Use the transaction pattern: register undo operations at each step, roll back in reverse order on failure.
- In ADK, use SequentialAgent + session state for primary/fallback flows. The primary handler sets state["primary_location_failed"] = True/False. The fallback handler reads this signal and activates only if needed. The response agent presents results regardless of which path was taken. Explicit state signaling makes the coordination debuggable.
Next up — Chapter 13: Human-in-the-Loop, where agents pause at critical decision points, present their reasoning, and request human approval before taking irreversible actions.