ARTICLE · 16 MIN READ · MARCH 18, 2026
Chapter 19: Evaluation and Monitoring
Building an agent is the beginning. Knowing whether it's actually working — accurately, efficiently, safely, and reliably — is the ongoing challenge. This chapter covers the complete framework for measuring and maintaining agent performance in production.
Why Traditional Testing Fails for Agents
Metric: A quantifiable measure of performance. Response latency (seconds), token cost (dollars per query), accuracy (% correct answers), BLEU score (text similarity) — these are all metrics. Metrics make "the agent is performing well" into a measurable, comparable claim.
Concept drift: When the statistical distribution of real-world inputs changes over time, causing a model trained on old data to perform worse on new data. A financial agent trained on pre-pandemic market patterns may drift significantly after economic shocks. Monitoring detects when drift is degrading performance.
A/B testing: Running two versions of something (agent version A and version B) simultaneously on different portions of real traffic to compare their performance on the same metric. The only way to know if a change is actually better — not just apparently better on your test cases.
Trajectory: The complete sequence of steps an agent takes to accomplish a task — which tools it called, in what order, with what parameters, and what decisions it made between steps. Trajectory evaluation asks: did the agent take the right path, not just reach the right destination?
LLM-as-a-Judge: Using a separate LLM to evaluate the quality of another LLM's (or agent's) output. The evaluator LLM is given a rubric, the original question, and the agent's response, and produces a structured quality assessment. Scales better than human evaluation but inherits the judge LLM's biases.
Token counting: Measuring how many tokens were consumed in an LLM call. LLM APIs charge per token — tracking token consumption is essential for cost management. Input tokens (your prompt) and output tokens (the model's response) are typically priced differently.
Precision vs Recall (in trajectory evaluation): Precision = of the steps the agent took, what fraction were correct and necessary? Recall = of all the necessary steps, what fraction did the agent actually take? High precision means few wasted steps. High recall means no critical steps were missed.
Evalset: A curated dataset of test scenarios for evaluating an agent. Each scenario specifies an input, the expected tool calls (trajectory), and the expected final response. Used for systematic regression testing — ensuring new agent versions don't break what already worked.
When a traditional software function is wrong, you know it: the output doesn’t match the expected value. The test fails. You fix the bug. The test passes. Deterministic behavior makes testing straightforward.
AI agents don’t work this way. The same query might produce slightly different answers on different runs due to temperature settings. The “right” answer to “What are the pros and cons of microservices?” depends on context, audience, and intent. There’s no single correct trajectory — an agent might take a longer path that’s still correct, or a shorter path that misses critical detail. And an agent that works perfectly today might drift as its environment, user base, or underlying model changes.
Traditional unit tests catch code bugs. Agent evaluation catches behavioral drift, quality degradation, cost overruns, safety violations, and goal misalignment — a completely different class of problems requiring a completely different class of tools.
This chapter builds the complete evaluation and monitoring framework: what to measure, how to measure it, and what to do when the measurements reveal problems.
The Four Dimensions of Agent Performance
Before choosing metrics, define what “good” means across four dimensions:
Effectiveness
Does the agent achieve its goal? Is the output accurate, complete, and aligned with user intent? Effectiveness metrics measure the *quality* of what the agent produces.
Efficiency
Does the agent achieve its goal with minimal resource consumption? Efficiency metrics measure *how much* it costs — in time, tokens, API calls, and money — to produce results.
Safety & Compliance
Does the agent stay within ethical, legal, and operational boundaries? Safety metrics measure whether the agent's behavior is acceptable — even when technically effective.
Reliability
Does the agent perform consistently over time, under load, and in novel situations? Reliability metrics measure whether performance degrades, drifts, or fails under real-world conditions.
The Three Evaluation Methods
The production strategy: Use all three in a tiered system. Automated metrics run on 100% of production traffic (zero cost, real-time). LLM-as-a-Judge runs on a sample (5-10%) for qualitative monitoring. Human evaluation runs periodically or on triggered samples (flagged by automated metrics or LLM-as-Judge for deeper investigation).
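The tiered routing described above can be sketched in a few lines. This is an illustrative sketch — the function and tier names are hypothetical, not part of any framework:

```python
import random

def route_for_evaluation(trace: dict,
                         judge_sample_rate: float = 0.05,
                         flagged: bool = False) -> list[str]:
    """Decide which evaluation tiers a production trace goes through.
    Automated metrics always run; LLM-as-a-Judge runs on a random sample;
    human review is triggered only when a cheaper tier flags the trace."""
    tiers = ["automated_metrics"]           # 100% of traffic, near-zero cost
    if random.random() < judge_sample_rate:
        tiers.append("llm_judge")           # qualitative sample (5-10%)
    if flagged:
        tiers.append("human_review")        # triggered deep investigation
    return tiers
```

A flagged trace always reaches human review; an unflagged one reaches the judge only by sampling, which keeps qualitative evaluation costs bounded.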
Token Usage Monitoring
For LLM-based agents, token consumption is both a cost metric and a performance signal. An agent that suddenly starts consuming 5× more tokens per request has either changed behavior or hit a new class of requests — both worth investigating.
class LLMInteractionMonitor:
    """
    Tracks token consumption across all LLM calls made by an agent.
    In production, this would hook into the LLM API's token counter
    rather than estimating from string splitting.
    """
    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.interaction_count = 0
        self.interactions = []  # for per-interaction analysis

    def record_interaction(self, prompt: str, response: str,
                           actual_input_tokens: int = None,
                           actual_output_tokens: int = None):
        """
        Record one LLM call. Uses actual token counts from the API
        if available; falls back to word-based estimation.
        """
        # Use actual token counts from the API response if available
        # (OpenAI: response.usage.prompt_tokens, Google: response.usage_metadata)
        if actual_input_tokens is not None and actual_output_tokens is not None:
            in_tokens = actual_input_tokens
            out_tokens = actual_output_tokens
        else:
            # Rough estimate: ~1.3 tokens per English word
            in_tokens = len(prompt.split()) * 1.3
            out_tokens = len(response.split()) * 1.3
        self.total_input_tokens += in_tokens
        self.total_output_tokens += out_tokens
        self.interaction_count += 1
        self.interactions.append({
            "prompt_preview": prompt[:100],
            "input_tokens": in_tokens,
            "output_tokens": out_tokens,
            # Cost formula: GPT-4o-mini pricing example
            "estimated_cost_usd": (in_tokens * 0.00015 + out_tokens * 0.0006) / 1000
        })

    def get_total_tokens(self):
        return self.total_input_tokens, self.total_output_tokens

    def get_cost_estimate_usd(self, model="gpt-4o-mini"):
        """Estimate total cost based on standard pricing."""
        pricing = {
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},  # per 1M tokens
            "gpt-4o": {"input": 2.50, "output": 10.00},
            "gemini-flash": {"input": 0.075, "output": 0.30},
        }
        p = pricing.get(model, pricing["gpt-4o-mini"])
        return ((self.total_input_tokens * p["input"] +
                 self.total_output_tokens * p["output"]) / 1_000_000)

    def get_summary(self):
        if self.interaction_count == 0:
            return "No interactions recorded"
        avg_in = self.total_input_tokens / self.interaction_count
        avg_out = self.total_output_tokens / self.interaction_count
        return (f"Total calls: {self.interaction_count} | "
                f"Avg input: {avg_in:.0f} tokens | "
                f"Avg output: {avg_out:.0f} tokens | "
                f"Est. cost: ${self.get_cost_estimate_usd():.4f}")

# Usage
monitor = LLMInteractionMonitor()
monitor.record_interaction(
    prompt="Tell me a joke.",
    response="Why don't scientists trust atoms? Because they make up everything!",
    actual_input_tokens=8,   # from API response
    actual_output_tokens=16,
)
print(monitor.get_summary())
# → "Total calls: 1 | Avg input: 8 tokens | Avg output: 16 tokens | Est. cost: $0.0000"
Why track tokens per interaction, not just totals? Averages hide outliers. If your agent averages 500 tokens per call but one call consumed 50,000 tokens (maybe it got stuck in a reasoning loop or received an unusually long document), the average looks fine but you have a serious anomaly. Per-interaction logging enables outlier detection, which is often the most valuable signal.
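Per-interaction logs make this outlier check straightforward. A minimal sketch over records shaped like the monitor's `interactions` list above, using a median-based threshold (the 10× multiplier is an illustrative choice):

```python
import statistics

def flag_token_outliers(interactions: list[dict],
                        multiplier: float = 10.0) -> list[dict]:
    """Return interactions whose total tokens exceed `multiplier` x the median.
    The median (unlike the mean) is robust to the very outliers we want to catch."""
    totals = [i["input_tokens"] + i["output_tokens"] for i in interactions]
    median = statistics.median(totals)
    return [i for i, t in zip(interactions, totals) if t > multiplier * median]

# Twenty normal calls plus one runaway reasoning loop
calls = [{"input_tokens": 400, "output_tokens": 100}] * 20
calls.append({"input_tokens": 45_000, "output_tokens": 5_000})
print(len(flag_token_outliers(calls)))  # → 1
```

The average here is about 2,860 tokens per call — unremarkable — while the flagged call is a 100× anomaly that the median-based check surfaces immediately.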
Why estimate cost alongside token count? Tokens are an engineering metric; dollars are a business metric. When you alert your manager that “total tokens increased 40% this week,” you’ll need to immediately translate that into “that’s an additional $X per day.” Build the cost calculation into your monitoring system from the start, not as an afterthought.
LLM-as-a-Judge: Implementation
Here’s how to build a robust LLM-based evaluator with a structured rubric:
import google.generativeai as genai
import json, logging
from pydantic import BaseModel, Field
from typing import List, Optional

# Define the structured output schema for the judge
class SurveyEvaluation(BaseModel):
    overall_score: int = Field(ge=1, le=5, description="Holistic quality score 1-5")
    rationale: str = Field(description="Summary of key strengths and weaknesses")
    detailed_feedback: List[str] = Field(description="Bullet points per criterion")
    concerns: List[str] = Field(description="Specific issues identified")
    recommended_action: str = Field(description="Next step: 'Approve as is', 'Revise', etc.")
Why define the output schema as a Pydantic model? The judge LLM might return “I think this is a 4 out of 5” as prose, or a JSON object with the wrong field names, or perfectly valid JSON with a score of 7 (outside the 1-5 range). Pydantic validates all of these failure cases and raises clear errors rather than silently passing malformed data downstream.
Field(ge=1, le=5) means “greater-than-or-equal to 1, less-than-or-equal to 5” — Pydantic enforces this constraint.
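A quick sanity check of that constraint — a minimal model mirroring the schema above shows Pydantic rejecting an out-of-range score instead of passing it downstream:

```python
from pydantic import BaseModel, Field, ValidationError

class Score(BaseModel):
    overall_score: int = Field(ge=1, le=5)

print(Score(overall_score=4).overall_score)  # → 4

try:
    Score(overall_score=7)  # judge hallucinated a 7 — out of range
except ValidationError:
    print("rejected: overall_score must be between 1 and 5")
```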
LEGAL_SURVEY_RUBRIC = """
You are an expert legal survey methodologist. Evaluate the quality of
the provided legal survey question against five criteria.
Criteria (each scored 1-5):
1. Clarity & Precision — Is the question unambiguous? Is legal terminology precise?
1=Extremely vague, 3=Moderately clear, 5=Perfectly precise and unambiguous
2. Neutrality & Bias — Does the question lead the respondent toward a particular answer?
1=Highly leading/biased, 3=Slightly suggestive, 5=Completely neutral and objective
3. Relevance & Focus — Is the question directly relevant to the survey's objectives?
1=Irrelevant, 3=Loosely related, 5=Directly relevant and tightly focused
4. Completeness — Does it provide sufficient context to answer accurately?
1=Critical information missing, 3=Mostly complete, 5=All necessary context provided
5. Audience Appropriateness — Is the language calibrated for the target legal audience?
1=Inaccessible jargon or oversimplified, 3=Generally appropriate, 5=Perfectly calibrated
Respond ONLY with valid JSON conforming to this schema:
{
"overall_score": <integer 1-5>,
"rationale": "<concise summary>",
"detailed_feedback": ["<criterion 1 feedback>", ..., "<criterion 5 feedback>"],
"concerns": ["<concern 1>", ...],
"recommended_action": "<Approve as is | Revise for neutrality | Clarify scope | ...>"
}
"""
class LLMJudgeForLegalSurvey:
    def __init__(self, model_name: str = 'gemini-1.5-flash-latest',
                 temperature: float = 0.0):  # temperature=0 for consistent evaluation
        self.model = genai.GenerativeModel(model_name)
        self.temperature = temperature

    def judge_survey_question(self, survey_question: str) -> Optional[dict]:
        full_prompt = f"{LEGAL_SURVEY_RUBRIC}\n\n---\nQUESTION TO EVALUATE:\n{survey_question}\n---"
        try:
            response = self.model.generate_content(
                full_prompt,
                generation_config=genai.types.GenerationConfig(
                    temperature=self.temperature,
                    response_mime_type="application/json"  # forces structured JSON output
                )
            )
            return json.loads(response.text)
        except json.JSONDecodeError as e:
            logging.error(f"Judge LLM returned invalid JSON: {e}")
            return None
        except Exception as e:
            logging.error(f"Judge LLM call failed: {e}")
            return None
response_mime_type = "application/json": This Gemini configuration parameter instructs the model to produce only valid JSON in its response — no prose, no markdown, no explanation outside the JSON structure. It’s the equivalent of temperature=0 for output format: it makes the response reliably machine-parseable. Not all LLM providers support this; OpenAI’s equivalent is response_format={"type": "json_object"}.
temperature = 0.0 for the judge. An evaluator must be consistent — the same question evaluated twice should get the same score (or very close). Non-zero temperature introduces randomness: a question might score 3 today and 4 tomorrow for no meaningful reason. For evaluation systems, consistency is more important than creativity. temperature=0 makes evaluation reproducible.
Testing the judge on three quality levels:
judge = LLMJudgeForLegalSurvey()

# Example 1: Well-formed question — expect high score
good_question = """
To what extent do you agree that current IP laws in Switzerland adequately
protect AI-generated content, assuming the content meets originality criteria
established by the Federal Supreme Court?
(Select one: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)
"""
# Expected: overall_score ~4-5, "Approve as is" or minor revisions

# Example 2: Leading/biased question — expect low score
biased_question = """
Don't you agree that overly restrictive data privacy laws like the FADP are
hindering essential technological innovation and economic growth?
(Select one: Yes, No)
"""
# Expected: overall_score ~1-2, "Revise for neutrality"

# Example 3: Vague question — expect low score
vague_question = "What are your thoughts on legal tech?"
# Expected: overall_score ~1, "Clarify scope" and "Revise for completeness"

for label, question in [("Good", good_question),
                        ("Biased", biased_question),
                        ("Vague", vague_question)]:
    result = judge.judge_survey_question(question)
    if result:
        print(f"\n{label}: score={result['overall_score']}/5 | {result['recommended_action']}")
        print(f"  Rationale: {result['rationale'][:100]}...")
Trajectory Evaluation
For tool-using agents, the quality of the path matters as much as the quality of the destination. An agent that arrives at the right answer by calling the wrong tools in the wrong order is inefficient, potentially dangerous, and fragile.
Trajectory evaluation compares the agent’s actual sequence of actions against a “ground truth” trajectory that represents the ideal approach.
Choosing the right trajectory metric:
| Scenario | Best metric | Reason |
|---|---|---|
| High-stakes (medical, financial) | Exact match | Deviations from protocol are unacceptable |
| Complex tasks with valid alternatives | In-order match | Allows flexibility while preserving logical order |
| Flexible workflows | Any-order match | Results matter more than sequence |
| Minimizing wasted API calls | Precision | Penalizes unnecessary steps (cost optimization) |
| Safety-critical steps | Recall | Ensures critical steps are never skipped |
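To make the table concrete, here is a minimal sketch of these metrics computed over tool-name sequences. The function names and example trajectories are illustrative — real frameworks also compare parameters, not just tool names:

```python
def exact_match(actual: list[str], expected: list[str]) -> bool:
    return actual == expected

def in_order_match(actual: list[str], expected: list[str]) -> bool:
    """Expected steps appear in order; extra steps in between are allowed."""
    it = iter(actual)
    return all(step in it for step in expected)  # membership consumes the iterator

def any_order_match(actual: list[str], expected: list[str]) -> bool:
    return set(expected) <= set(actual)

def precision(actual: list[str], expected: list[str]) -> float:
    """Of the steps the agent took, what fraction were necessary?"""
    return sum(a in expected for a in actual) / len(actual) if actual else 0.0

def recall(actual: list[str], expected: list[str]) -> float:
    """Of all the necessary steps, what fraction did the agent take?"""
    return sum(e in actual for e in expected) / len(expected) if expected else 1.0

expected = ["search_flights", "check_visa", "book_flight"]
actual = ["search_flights", "search_hotels", "check_visa", "book_flight"]
print(exact_match(actual, expected))     # → False
print(in_order_match(actual, expected))  # → True (one extra, but order preserved)
print(precision(actual, expected))       # → 0.75 (one wasted step)
print(recall(actual, expected))          # → 1.0 (no critical step missed)
```

Note how the same trajectory scores differently under each metric — which is exactly why the choice of metric should follow the stakes of the task.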
ADK Evaluation Framework
Google’s ADK provides three built-in evaluation modes:
graph LR
A[ADK Evaluation] --> B[Web UI — adk web]
A --> C[Pytest Integration]
A --> D[CLI — adk eval]
B --> B1[Interactive session creation\nSave to evalsets\nReal-time status display]
C --> C1[AgentEvaluator.evaluate\nCI/CD pipeline integration\nAutomated regression testing]
D --> D1[adk eval agent_path evalset.json\nAutomated builds\nBatch evaluation]
style A fill:#141b2d,stroke:#2698ba,color:#e0e0e0
style B fill:#141b2d,stroke:#c97af2,color:#e0e0e0
style C fill:#141b2d,stroke:#4fc97e,color:#e0e0e0
style D fill:#141b2d,stroke:#e6a817,color:#e0e0e0
Test File Format (Unit Testing)
{
  "eval_set_id": "smart_home_unit_tests",
  "turns": [
    {
      "user_query": "Turn off device_2 in the Bedroom.",
      "expected_tool_use": [
        {
          "tool_name": "set_device_info",
          "tool_input": {
            "location": "Bedroom",
            "device_id": "device_2",
            "status": "OFF"
          }
        }
      ],
      "expected_intermediate_agent_responses": [],
      "expected_final_response": "I have set the device_2 status to off."
    }
  ]
}
What each field validates:
expected_tool_use checks whether the agent called the right tool with the right parameters (trajectory). expected_intermediate_agent_responses can check what the agent said between tool calls (useful for multi-step reasoning agents). expected_final_response checks the user-facing output quality. The ADK evaluator runs the actual agent, captures its behavior, and compares against all three expected values.
Why define expected_tool_use at the parameter level? Simply checking “did the agent call set_device_info?” isn’t sufficient. An agent that calls set_device_info(location="Living Room") when asked about the Bedroom has failed even though it used the “right” tool. Parameter-level matching catches this. For high-stakes actions (database writes, API calls, financial transactions), parameter validation is critical.
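A minimal sketch of such a parameter-level check, reusing the field names from the test file format above (the comparison logic is illustrative, not ADK’s implementation):

```python
def tool_call_matches(actual: dict, expected: dict) -> bool:
    """True only if the tool name AND every expected parameter match exactly."""
    if actual["tool_name"] != expected["tool_name"]:
        return False
    return all(actual["tool_input"].get(k) == v
               for k, v in expected["tool_input"].items())

expected = {"tool_name": "set_device_info",
            "tool_input": {"location": "Bedroom", "device_id": "device_2",
                           "status": "OFF"}}
# Right tool, wrong room — name-only matching would miss this failure
wrong_room = {"tool_name": "set_device_info",
              "tool_input": {"location": "Living Room", "device_id": "device_2",
                             "status": "OFF"}}
print(tool_call_matches(expected, expected))    # → True
print(tool_call_matches(wrong_room, expected))  # → False
```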
Evalset Format (Integration Testing)
{
  "eval_set_id": "math_assistant_integration",
  "evals": [
    {
      "eval_id": "dice_and_prime",
      "conversation": [
        {
          "invocation_id": "turn_1",
          "user_query": "What can you do?",
          "expected_final_response": "I can roll dice, check prime numbers, and perform mathematical operations."
        },
        {
          "invocation_id": "turn_2",
          "user_query": "Roll a 10-sided dice twice and then check if 9 is prime.",
          "expected_tool_use": [
            {"tool_name": "roll_die", "tool_input": {"sides": 10}},
            {"tool_name": "roll_die", "tool_input": {"sides": 10}},
            {"tool_name": "check_prime", "tool_input": {"number": 9}}
          ],
          "expected_final_response": "I rolled a 10-sided die twice..."
        }
      ]
    }
  ]
}
Test file vs evalset — what’s the difference? Test files contain a single session with one or more turns. They’re analogous to unit tests — fast, focused, testing specific behaviors. Evalsets contain multiple sessions (multiple “evals”), each with potentially many turns. They’re analogous to integration tests — they test complex, multi-turn conversations that simulate real user workflows. Use test files during active development; use evalsets for pre-deployment regression testing.
Running Evaluations
# Web UI — interactive evaluation and dataset creation
adk web
# CLI — automated evaluation for CI/CD
adk eval ./my_agent/ ./evalsets/production_test.json \
--config ./eval_config.json \
--print_detailed_results
# Run specific evals from a larger evalset
adk eval ./my_agent/ ./evalsets/full_suite.json \
eval_id_1,eval_id_2,eval_id_3 # comma-separated, no spaces
# Pytest integration — include in your test suite
# (in your test_agent.py file):
from google.adk.evaluation import AgentEvaluator

def test_agent_smoke():
    AgentEvaluator.evaluate(
        agent_module="my_agent.agent",
        eval_dataset="./evalsets/smoke_test.json",
        num_runs=1,
    )
Why three evaluation modes? Each serves a different workflow. The web UI is for building evalsets interactively — you have a real conversation with the agent and save good examples as test cases. The CLI is for automation — run it in your CI/CD pipeline on every pull request to catch regressions before deployment. Pytest integration is for developers who want agent evaluation alongside their existing unit tests in one pytest run.
From Agents to Contractors: The Evaluation-Accountability Link
A profound insight from recent research (Gulli et al., 2025): the harder it is to evaluate an agent, the less you can trust it. Evaluation difficulty is a proxy for accountability deficit.
The contractor model directly addresses this by making every agent interaction formally evaluated against explicit, pre-specified criteria:
The four pillars of contractor-style agents:
1. Formalized Contract. Instead of a prompt like “analyze last quarter’s sales,” a contract specifies: “Deliver a 20-page PDF analyzing European market sales from Q1 2025, including five data visualizations, comparative analysis against Q1 2024, and a risk assessment. Acceptable data sources: [listed]. Maximum compute cost: $50. Completion time: 2 hours.” Every output criterion is objectively verifiable.
2. Negotiation Phase. Before execution, the agent can flag issues: “The specified XYZ database is inaccessible. Please provide credentials or approve alternative sources.” This resolves misunderstandings before they become failures — exactly what a human contractor would do before starting a project.
3. Quality-Focused Iterative Execution. For a code contract, the agent generates multiple implementations, runs them against the contract’s unit tests, scores each on performance/security/readability, and only delivers the version that passes all criteria. Internal self-validation before delivery.
4. Hierarchical Decomposition via Subcontracts. A master contract to “build an e-commerce mobile app” generates subcontracts: “Design UI/UX,” “Develop authentication module,” “Create database schema,” “Integrate payment gateway.” Each subcontract is a complete, independent, evaluable unit — enabling both specialization and accountability at every level.
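One way to make the contract idea concrete is to represent acceptance criteria as data plus machine-checkable predicates. A hedged sketch — the `Contract` class and its fields are hypothetical illustrations, not an API from the paper:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Contract:
    """A deliverable spec where every criterion is objectively verifiable."""
    description: str
    max_cost_usd: float
    max_duration_hours: float
    acceptance_checks: list[Callable[[dict], bool]] = field(default_factory=list)

    def evaluate(self, deliverable: dict) -> bool:
        """Evaluation becomes trivial: the deliverable meets every criterion or it doesn't."""
        within_budget = deliverable.get("cost_usd", 0) <= self.max_cost_usd
        on_time = deliverable.get("duration_hours", 0) <= self.max_duration_hours
        return within_budget and on_time and all(
            check(deliverable) for check in self.acceptance_checks)

contract = Contract(
    description="Q1 2025 European sales analysis, 20-page PDF",
    max_cost_usd=50.0,
    max_duration_hours=2.0,
    acceptance_checks=[lambda d: d.get("pages", 0) >= 20,
                       lambda d: d.get("visualizations", 0) >= 5],
)
print(contract.evaluate({"cost_usd": 31.20, "duration_hours": 1.5,
                         "pages": 22, "visualizations": 5}))  # → True
```

Because every criterion is a predicate, passing or failing the contract is a binary, auditable fact — the evaluation-accountability link in code form.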
Continuous Monitoring in Production
Performance Tracking
Monitor accuracy, latency, and resource consumption continuously. Set up dashboards with alerting thresholds. A response latency spike at 2am should wake someone up — or at least log an alert — before users start complaining.
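A minimal sketch of p50/p95/p99 computation with an alerting check, using only the standard library (the 2-second SLO threshold is illustrative):

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict:
    """p50/p95/p99 from raw per-request latencies (needs at least 2 samples)."""
    cuts = statistics.quantiles(latencies_s, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def p95_alert(latencies_s: list[float], threshold_s: float = 2.0) -> bool:
    """Fire when tail latency crosses the (illustrative) SLO threshold."""
    return latency_percentiles(latencies_s)["p95"] > threshold_s

# 95 fast requests and 5 slow ones: the mean looks fine, the p95 does not
sample = [0.5] * 95 + [3.0] * 5
print(p95_alert(sample))  # → True
```

Percentiles rather than averages are the point here: five slow requests barely move the mean but push the p95 well past the threshold.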
p50/p95/p99 latency · error_rate · tokens_per_query · cost_per_day
A/B Testing
Split production traffic between agent version A and version B. Measure the same metrics on both. The only way to know whether a change is actually better in production — not just apparently better on your test cases. Control for confounders (time of day, user segments).
statistical_significance · lift_in_primary_metric · guardrail_metric_compliance
Drift Detection
Monitor input distribution (are queries changing?), output quality trends (is accuracy declining?), and tool call patterns (is the agent using different tools than it used to?). Drift often manifests gradually — regular sampling and trend analysis catches it before users complain.
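One common way to quantify input-distribution shift is the Population Stability Index over query-category frequencies. A minimal sketch — the category names are hypothetical, and the 0.1/0.25 cut-offs are the conventional rule of thumb, not hard limits:

```python
import math

def population_stability_index(baseline: dict, current: dict) -> float:
    """PSI over category counts. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    eps = 1e-6  # avoid log(0) for categories absent from one window
    cats = set(baseline) | set(current)
    b_total = sum(baseline.values()) or 1
    c_total = sum(current.values()) or 1
    psi = 0.0
    for cat in cats:
        b = baseline.get(cat, 0) / b_total + eps
        c = current.get(cat, 0) / c_total + eps
        psi += (c - b) * math.log(c / b)
    return psi

baseline = {"billing": 500, "tech_support": 400, "returns": 100}
shifted  = {"billing": 200, "tech_support": 300, "returns": 500}
print(round(population_stability_index(baseline, baseline), 4))  # → 0.0
print(population_stability_index(baseline, shifted) > 0.25)      # → True
```

Run this weekly against a fixed baseline window and alert on the threshold — gradual drift shows up as a slowly climbing PSI long before users notice quality loss.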
query_distribution_shift · quality_score_trend · tool_call_distribution
Anomaly Detection
Identify unusual patterns: sudden spike in guardrail triggers (possible attack), unexpected drop in task completion rate, specific tool call timing out repeatedly, unusual concentration of queries from a single user. Most anomalies are either attacks or bugs — both need rapid response.
guardrail_trigger_spike · tool_error_rate · completion_rate_drop
Compliance Auditing
Generate automated reports showing the agent's adherence to ethical guidelines, regulatory requirements, and safety protocols. These reports need to be human-readable for auditors and machine-queryable for automated monitoring. Log everything — "if it didn't get logged, it didn't happen."
policy_violation_rate · escalation_rate · audit_trail_completeness
Learning Progress
For agents that learn or adapt (Chapter 9 pattern), track whether learning is actually improving performance. Plot accuracy, cost efficiency, and task completion over time. Ensure improvements generalize (don't just overfit to the evaluation set).
accuracy_over_time · generalization_score · learning_curve_slope
Key Takeaways
- Traditional testing is insufficient for agents. Code tests catch bugs. Agent evaluation catches behavioral drift, quality degradation, cost overruns, safety violations, and goal misalignment — completely different problem classes requiring different tools.
- Use all three evaluation methods in a tiered system. Automated metrics on 100% of traffic (free, real-time). LLM-as-a-Judge on 5-10% (scalable qualitative assessment). Human evaluation periodically or on triggered samples (gold standard for calibration). Each layer catches what the others miss.
- Track tokens per interaction, not just totals. Per-interaction logging reveals outliers that averages hide. A single 50,000-token call (reasoning loop gone wrong) looks fine in an average but is a serious anomaly that needs investigation.
- temperature=0 is mandatory for LLM-as-a-Judge. Evaluation consistency requires determinism. Non-zero temperature means the same response gets different scores on different days — your evaluation system becomes unreliable.
- Trajectory evaluation is as important as output evaluation. The right answer via the wrong path is inefficient, brittle, and potentially dangerous. Evaluate both what the agent produced and how it got there. Choose your trajectory metric based on the stakes: exact match for safety-critical steps, recall when no critical step can be missed.
- ADK provides three evaluation modes for different workflows. Web UI for interactive evalset building. Pytest for developer CI/CD integration. CLI for automated deployment pipelines. Use all three at different stages of development.
- The contractor model embeds evaluation into the contract itself. When deliverables are explicitly specified with verifiable criteria upfront, evaluation becomes trivial — either the agent met the contract or it didn’t. This is the future of accountable AI deployment in mission-critical domains.
- Build monitoring before you need it. Instrumentation added post-deployment has gaps. Build token tracking, error logging, latency measurement, and quality sampling into the agent from day one. The cost of instrumentation is tiny; the cost of debugging un-instrumented production failures is enormous.