ARTICLE  ·  16 MIN READ  ·  MARCH 18, 2026

Chapter 19: Evaluation and Monitoring

Building an agent is the beginning. Knowing whether it's actually working — accurately, efficiently, safely, and reliably — is the ongoing challenge. This chapter covers the complete framework for measuring and maintaining agent performance in production.


Why Traditional Testing Fails for Agents

Before You Start — Key Terms Explained

Metric: A quantifiable measure of performance. Response latency (seconds), token cost (dollars per query), accuracy (% correct answers), BLEU score (text similarity) — these are all metrics. Metrics make "the agent is performing well" into a measurable, comparable claim.

Concept drift: When the statistical distribution of real-world inputs changes over time, causing a model trained on old data to perform worse on new data. A financial agent trained on pre-pandemic market patterns may drift significantly after economic shocks. Monitoring detects when drift is degrading performance.

A/B testing: Running two versions of something (agent version A and version B) simultaneously on different portions of real traffic to compare their performance on the same metric. The only way to know if a change is actually better — not just apparently better on your test cases.

Trajectory: The complete sequence of steps an agent takes to accomplish a task — which tools it called, in what order, with what parameters, and what decisions it made between steps. Trajectory evaluation asks: did the agent take the right path, not just reach the right destination?

LLM-as-a-Judge: Using a separate LLM to evaluate the quality of another LLM's (or agent's) output. The evaluator LLM is given a rubric, the original question, and the agent's response, and produces a structured quality assessment. Scales better than human evaluation but inherits the judge LLM's biases.

Token counting: Measuring how many tokens were consumed in an LLM call. LLM APIs charge per token — tracking token consumption is essential for cost management. Input tokens (your prompt) and output tokens (the model's response) are typically priced differently.

Precision vs Recall (in trajectory evaluation): Precision = of the steps the agent took, what fraction were correct and necessary? Recall = of all the necessary steps, what fraction did the agent actually take? High precision means few wasted steps. High recall means no critical steps were missed.

Evalset: A curated dataset of test scenarios for evaluating an agent. Each scenario specifies an input, the expected tool calls (trajectory), and the expected final response. Used for systematic regression testing — ensuring new agent versions don't break what already worked.

When a traditional software function is wrong, you know it: the output doesn’t match the expected value. The test fails. You fix the bug. The test passes. Deterministic behavior makes testing straightforward.

AI agents don’t work this way. The same query might produce slightly different answers on different runs due to temperature settings. The “right” answer to “What are the pros and cons of microservices?” depends on context, audience, and intent. There’s no single correct trajectory — an agent might take a longer path that’s still correct, or a shorter path that misses critical detail. And an agent that works perfectly today might drift as its environment, user base, or underlying model changes.

Traditional unit tests catch code bugs. Agent evaluation catches behavioral drift, quality degradation, cost overruns, safety violations, and goal misalignment — a completely different class of problems requiring a completely different class of tools.

This chapter builds the complete evaluation and monitoring framework: what to measure, how to measure it, and what to do when the measurements reveal problems.


The Four Dimensions of Agent Performance

Before choosing metrics, define what “good” means across four dimensions:

Effectiveness

Does the agent achieve its goal? Is the output accurate, complete, and aligned with user intent? Effectiveness metrics measure the *quality* of what the agent produces.

Answer accuracy · Task completion rate · User satisfaction · Helpfulness score

Efficiency

Does the agent achieve its goal with minimal resource consumption? Efficiency metrics measure *how much* it costs — in time, tokens, API calls, and money — to produce results.

Response latency · Token cost per query · Tool calls per task · Compute cost

Safety & Compliance

Does the agent stay within ethical, legal, and operational boundaries? Safety metrics measure whether the agent's behavior is acceptable — even when technically effective.

Guardrail trigger rate · Policy violation rate · PII exposure incidents · Audit compliance

Reliability

Does the agent perform consistently over time, under load, and in novel situations? Reliability metrics measure whether performance degrades, drifts, or fails under real-world conditions.

Error rate · Drift metrics · Uptime/availability · Edge case handling

The Three Evaluation Methods

Human Evaluation — the gold standard
Human raters review agent responses and score them on quality dimensions. The most reliable method for capturing nuanced, subjective qualities like tone, helpfulness, and appropriateness — things that automated metrics miss entirely.
When to use: Establishing ground truth for new tasks. Calibrating automated evaluation systems (humans provide the "correct" scores that teach LLM-as-a-Judge). Periodic audits of production quality. Any time you need the highest-quality assessment of a small sample.
✓ Captures subtle behaviors (tone, empathy, nuance)
✓ Can evaluate any quality dimension you define
✓ No model capability ceiling — humans catch what LLMs miss
✗ Expensive: $0.50-5.00 per evaluation
✗ Slow: hours to days for large samples
✗ Inconsistent: rater disagreement, fatigue, bias
✗ Not scalable to production monitoring

LLM-as-a-Judge — scalable qualitative evaluation
A separate LLM receives a rubric, the original query, and the agent's response, and produces a structured quality assessment. Scales to thousands of evaluations per hour at a fraction of human evaluation cost. Most effective when the judge LLM is more capable than the agent being judged.
When to use: Evaluating subjective qualities at scale (helpfulness, clarity, tone). Continuous monitoring of production quality without hiring human raters. A/B testing agent versions. Any qualitative dimension that resists reduction to a simple metric.
✓ Scalable: 1,000+ evaluations per minute
✓ Consistent: same rubric applied identically every time
✓ Nuanced: captures qualitative dimensions better than simple metrics
✗ Biased: judge LLM has its own biases and blind spots
✗ Self-serving: same model family tends to rate itself favorably
✗ Limited by judge capability: can't catch what the judge can't detect

Automated Metrics — fast, cheap, objective
Compute deterministic scores from the agent's output without any human or LLM involvement. Fast (milliseconds), cheap (no API calls for evaluation), and objective (same input always produces same metric value). Best for quantity-based and structural metrics.
When to use: Real-time production monitoring (latency, error rates, token usage). Regression testing in CI/CD pipelines. When quality can be expressed as a verifiable criterion (does the output contain X, is it under N words, did it use the correct tool). Cost tracking and budget enforcement.
✓ Real-time: evaluate every single production request
✓ Objective: no human or LLM subjectivity
✓ Free: no additional API calls
✗ Shallow: can't capture nuance, tone, or subtle quality
✗ Metric gaming: agents optimized for a metric can game it
✗ Limited scope: not all qualities reduce to a metric

The production strategy: Use all three in a tiered system. Automated metrics run on 100% of production traffic (zero cost, real-time). LLM-as-a-Judge runs on a sample (5-10%) for qualitative monitoring. Human evaluation runs periodically or on triggered samples (flagged by automated metrics or LLM-as-Judge for deeper investigation).
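
A minimal sketch of that tiered routing. The helpers run_automated_metrics, llm_judge_score, and queue_for_human_review are hypothetical placeholders for whatever metric, judge, and review tooling you actually use:

import random

def evaluate_production_request(query: str, response: str, trace: dict) -> dict:
    """Tier 1 on every request; tier 2 on a small sample plus anything flagged; tier 3 when the judge objects."""
    metrics = run_automated_metrics(response, trace)         # tier 1: latency, errors, token counts (every request)

    flagged = metrics["error_occurred"] or metrics["latency_s"] > 10.0

    if flagged or random.random() < 0.07:                    # tier 2: ~7% random sample plus flagged requests
        verdict = llm_judge_score(query, response)           # LLM-as-a-Judge with a fixed rubric
        if verdict is not None and verdict["overall_score"] <= 2:
            queue_for_human_review(query, response, trace)   # tier 3: triggered human audit

    return metrics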


Token Usage Monitoring

For LLM-based agents, token consumption is both a cost metric and a performance signal. An agent that suddenly starts consuming 5× more tokens per request has either changed behavior or hit a new class of requests — both worth investigating.

class LLMInteractionMonitor:
    """
    Tracks token consumption across all LLM calls made by an agent.
    In production, this would hook into the LLM API's token counter
    rather than estimating from string splitting.
    """
    def __init__(self):
        self.total_input_tokens  = 0
        self.total_output_tokens = 0
        self.interaction_count   = 0
        self.interactions        = []  # for per-interaction analysis

    def record_interaction(self, prompt: str, response: str,
                           actual_input_tokens: int = None,
                           actual_output_tokens: int = None):
        """
        Record one LLM call. Uses actual token counts from API
        if available; falls back to word-based estimation.
        """
        # Use actual token counts from the API response if available
        # (OpenAI: response.usage.prompt_tokens, Google: response.usage_metadata)
        if actual_input_tokens is not None and actual_output_tokens is not None:
            in_tokens  = actual_input_tokens
            out_tokens = actual_output_tokens
        else:
            # Rough estimate: ~1.3 tokens per word for English text
            in_tokens  = len(prompt.split()) * 1.3
            out_tokens = len(response.split()) * 1.3

        self.total_input_tokens  += in_tokens
        self.total_output_tokens += out_tokens
        self.interaction_count   += 1
        self.interactions.append({
            "prompt_preview":   prompt[:100],
            "input_tokens":     in_tokens,
            "output_tokens":    out_tokens,
            "estimated_cost_usd": (in_tokens * 0.00015 + out_tokens * 0.0006) / 1000
            # Cost formula: GPT-4o-mini pricing example
        })

    def get_total_tokens(self):
        return self.total_input_tokens, self.total_output_tokens

    def get_cost_estimate_usd(self, model="gpt-4o-mini"):
        """Estimate total cost based on standard pricing."""
        pricing = {
            "gpt-4o-mini": {"input": 0.15, "output": 0.60},  # per 1M tokens
            "gpt-4o":      {"input": 2.50, "output": 10.00},
            "gemini-flash": {"input": 0.075, "output": 0.30},
        }
        p = pricing.get(model, pricing["gpt-4o-mini"])
        return ((self.total_input_tokens  * p["input"] +
                 self.total_output_tokens * p["output"]) / 1_000_000)

    def get_summary(self):
        if self.interaction_count == 0:
            return "No interactions recorded"
        avg_in  = self.total_input_tokens  / self.interaction_count
        avg_out = self.total_output_tokens / self.interaction_count
        return (f"Total calls: {self.interaction_count} | "
                f"Avg input: {avg_in:.0f} tokens | "
                f"Avg output: {avg_out:.0f} tokens | "
                f"Est. cost: ${self.get_cost_estimate_usd():.4f}")

# Usage
monitor = LLMInteractionMonitor()
monitor.record_interaction(
    prompt   = "Tell me a joke.",
    response = "Why don't scientists trust atoms? Because they make up everything!",
    actual_input_tokens  = 8,   # from API response
    actual_output_tokens = 16
)
print(monitor.get_summary())
# → "Total calls: 1 | Avg input: 8 tokens | Avg output: 16 tokens | Est. cost: $0.0000"

Why track tokens per interaction, not just totals? Averages hide outliers. If your agent averages 500 tokens per call but one call consumed 50,000 tokens (maybe it got stuck in a reasoning loop or received an unusually long document), the average looks fine but you have a serious anomaly. Per-interaction logging enables outlier detection, which is often the most valuable signal.
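
A sketch of that outlier detection, built on the LLMInteractionMonitor above (the 10× threshold is an arbitrary example; tune it to your own traffic):

def find_token_outliers(monitor: LLMInteractionMonitor, multiplier: float = 10.0) -> list:
    """Return interactions whose total token count exceeds multiplier × the running average."""
    if monitor.interaction_count == 0:
        return []
    avg_total = (monitor.total_input_tokens + monitor.total_output_tokens) / monitor.interaction_count
    return [i for i in monitor.interactions
            if i["input_tokens"] + i["output_tokens"] > multiplier * avg_total]

for outlier in find_token_outliers(monitor):
    print(f"Anomalous call ({outlier['input_tokens'] + outlier['output_tokens']:.0f} tokens): "
          f"{outlier['prompt_preview']!r}")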

Why estimate cost alongside token count? Tokens are an engineering metric; dollars are a business metric. When you alert your manager that “total tokens increased 40% this week,” you’ll need to immediately translate that into “that’s an additional $X per day.” Build the cost calculation into your monitoring system from the start, not as an afterthought.


LLM-as-a-Judge: Implementation

Here’s how to build a robust LLM-based evaluator with a structured rubric:

import google.generativeai as genai   # assumes genai.configure(api_key=...) is called during setup
import json, logging
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional

# Define the structured output schema for the judge
class SurveyEvaluation(BaseModel):
    overall_score:     int         = Field(ge=1, le=5, description="Holistic quality score 1-5")
    rationale:         str         = Field(description="Summary of key strengths and weaknesses")
    detailed_feedback: List[str]   = Field(description="Bullet points per criterion")
    concerns:          List[str]   = Field(description="Specific issues identified")
    recommended_action: str        = Field(description="Next step: 'Approve as is', 'Revise', etc.")

Why define the output schema as a Pydantic model? The judge LLM might return “I think this is a 4 out of 5” as prose, or a JSON object with the wrong field names, or perfectly valid JSON with a score of 7 (outside the 1-5 range). Pydantic validates all of these failure cases and raises clear errors rather than silently passing malformed data downstream. Field(ge=1, le=5) means “greater-than-or-equal to 1, less-than-or-equal to 5” — Pydantic enforces this constraint.
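
A quick illustration of that validation, using a deliberately malformed judge output (the score of 7 is outside the allowed range):

from pydantic import ValidationError

malformed = {
    "overall_score": 7,                        # violates Field(ge=1, le=5)
    "rationale": "Clear and neutral question.",
    "detailed_feedback": [],
    "concerns": [],
    "recommended_action": "Approve as is",
}
try:
    SurveyEvaluation(**malformed)
except ValidationError as e:
    print(e)   # → overall_score: Input should be less than or equal to 5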

LEGAL_SURVEY_RUBRIC = """
You are an expert legal survey methodologist. Evaluate the quality of
the provided legal survey question against five criteria.

Criteria (each scored 1-5):
1. Clarity & Precision — Is the question unambiguous? Is legal terminology precise?
   1=Extremely vague, 3=Moderately clear, 5=Perfectly precise and unambiguous

2. Neutrality & Bias — Does the question lead the respondent toward a particular answer?
   1=Highly leading/biased, 3=Slightly suggestive, 5=Completely neutral and objective

3. Relevance & Focus — Is the question directly relevant to the survey's objectives?
   1=Irrelevant, 3=Loosely related, 5=Directly relevant and tightly focused

4. Completeness — Does it provide sufficient context to answer accurately?
   1=Critical information missing, 3=Mostly complete, 5=All necessary context provided

5. Audience Appropriateness — Is the language calibrated for the target legal audience?
   1=Inaccessible jargon or oversimplified, 3=Generally appropriate, 5=Perfectly calibrated

Respond ONLY with valid JSON conforming to this schema:
{
  "overall_score": <integer 1-5>,
  "rationale": "<concise summary>",
  "detailed_feedback": ["<criterion 1 feedback>", ..., "<criterion 5 feedback>"],
  "concerns": ["<concern 1>", ...],
  "recommended_action": "<Approve as is | Revise for neutrality | Clarify scope | ...>"
}
"""

class LLMJudgeForLegalSurvey:
    def __init__(self, model_name: str = 'gemini-1.5-flash-latest',
                 temperature: float = 0.0):  # temperature=0 for consistent evaluation
        self.model       = genai.GenerativeModel(model_name)
        self.temperature = temperature

    def judge_survey_question(self, survey_question: str) -> Optional[dict]:
        full_prompt = f"{LEGAL_SURVEY_RUBRIC}\n\n---\nQUESTION TO EVALUATE:\n{survey_question}\n---"
        try:
            response = self.model.generate_content(
                full_prompt,
                generation_config = genai.types.GenerationConfig(
                    temperature        = self.temperature,
                    response_mime_type = "application/json"  # forces structured JSON output
                )
            )
            data = json.loads(response.text)
            SurveyEvaluation(**data)   # validate against the schema; raises ValidationError on malformed output
            return data
        except (json.JSONDecodeError, ValidationError) as e:
            logging.error(f"Judge LLM returned invalid or malformed JSON: {e}")
            return None
        except Exception as e:
            logging.error(f"Judge LLM call failed: {e}")
            return None

response_mime_type = "application/json": This Gemini configuration parameter instructs the model to produce only valid JSON in its response — no prose, no markdown, no explanation outside the JSON structure. It’s the equivalent of temperature=0 for output format: it makes the response reliably machine-parseable. Not all LLM providers support this; OpenAI’s equivalent is response_format={"type": "json_object"}.

temperature = 0.0 for the judge. An evaluator must be consistent — the same question evaluated twice should get the same score (or very close). Non-zero temperature introduces randomness: a question might score 3 today and 4 tomorrow for no meaningful reason. For evaluation systems, consistency is more important than creativity. temperature=0 makes evaluation reproducible.

Testing the judge on three quality levels:

judge = LLMJudgeForLegalSurvey()

# Example 1: Well-formed question — expect high score
good_question = """
To what extent do you agree that current IP laws in Switzerland adequately
protect AI-generated content, assuming the content meets originality criteria
established by the Federal Supreme Court?
(Select one: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)
"""
# Expected: overall_score ~4-5, "Approve as is" or minor revisions

# Example 2: Leading/biased question — expect low score
biased_question = """
Don't you agree that overly restrictive data privacy laws like the FADP are
hindering essential technological innovation and economic growth?
(Select one: Yes, No)
"""
# Expected: overall_score ~1-2, "Revise for neutrality"

# Example 3: Vague question — expect low score
vague_question = "What are your thoughts on legal tech?"
# Expected: overall_score ~1, "Clarify scope" and "Revise for completeness"

for label, question in [("Good", good_question),
                         ("Biased", biased_question),
                         ("Vague", vague_question)]:
    result = judge.judge_survey_question(question)
    if result:
        print(f"\n{label}: score={result['overall_score']}/5 | {result['recommended_action']}")
        print(f"  Rationale: {result['rationale'][:100]}...")

Trajectory Evaluation

For tool-using agents, the quality of the path matters as much as the quality of the destination. An agent that arrives at the right answer by calling the wrong tools in the wrong order is inefficient, potentially dangerous, and fragile.

Trajectory evaluation compares the agent’s actual sequence of actions against a “ground truth” trajectory that represents the ideal approach.

TRAJECTORY MATCHING METHODS — how different strategies score the same agent

Scenario: a customer asks "What's the current price of AAPL and should I buy?"

  Ideal trajectory (ground truth):  get_stock_price(AAPL) → get_analyst_ratings(AAPL) → add_disclaimer()
  Agent trajectory (actual):        search_news(AAPL) → get_stock_price(AAPL) → get_analyst_ratings(AAPL)  [⚠ add_disclaimer missing]

Choosing the right trajectory metric:

  • High-stakes tasks (medical, financial): Exact match. Deviations from protocol are unacceptable.
  • Complex tasks with valid alternatives: In-order match. Allows flexibility while preserving logical order.
  • Flexible workflows: Any-order match. Results matter more than sequence.
  • Minimizing wasted API calls: Precision. Penalizes unnecessary steps (cost optimization).
  • Safety-critical steps: Recall. Ensures critical steps are never skipped.
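
To make these strategies concrete, here is a minimal sketch (illustrative, not the ADK's implementation) scoring the AAPL trajectories above under each matching method:

ideal  = ["get_stock_price", "get_analyst_ratings", "add_disclaimer"]   # ground truth
actual = ["search_news", "get_stock_price", "get_analyst_ratings"]      # what the agent did

def trajectory_metrics(ideal: list, actual: list) -> dict:
    it = iter(actual)
    return {
        "exact_match":     ideal == actual,
        "in_order_match":  all(step in it for step in ideal),    # ideal appears as a subsequence of actual
        "any_order_match": set(ideal).issubset(actual),
        "precision":       sum(s in ideal for s in actual) / len(actual),
        "recall":          sum(s in actual for s in ideal) / len(ideal),
    }

print(trajectory_metrics(ideal, actual))
# → all three match variants are False; precision ≈ 0.67 (search_news was unnecessary);
#   recall ≈ 0.67 (add_disclaimer was skipped)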

ADK Evaluation Framework

Google’s ADK provides three built-in evaluation modes:

graph LR
    A[ADK Evaluation] --> B[Web UI — adk web]
    A --> C[Pytest Integration]
    A --> D[CLI — adk eval]
    B --> B1[Interactive session creation\nSave to evalsets\nReal-time status display]
    C --> C1[AgentEvaluator.evaluate\nCI/CD pipeline integration\nAutomated regression testing]
    D --> D1[adk eval agent_path evalset.json\nAutomated builds\nBatch evaluation]
    style A fill:#141b2d,stroke:#2698ba,color:#e0e0e0
    style B fill:#141b2d,stroke:#c97af2,color:#e0e0e0
    style C fill:#141b2d,stroke:#4fc97e,color:#e0e0e0
    style D fill:#141b2d,stroke:#e6a817,color:#e0e0e0

Test File Format (Unit Testing)

{
  "eval_set_id": "smart_home_unit_tests",
  "turns": [
    {
      "user_query": "Turn off device_2 in the Bedroom.",
      "expected_tool_use": [
        {
          "tool_name": "set_device_info",
          "tool_input": {
            "location": "Bedroom",
            "device_id": "device_2",
            "status": "OFF"
          }
        }
      ],
      "expected_intermediate_agent_responses": [],
      "expected_final_response": "I have set the device_2 status to off."
    }
  ]
}

What each field validates: expected_tool_use checks whether the agent called the right tool with the right parameters (trajectory). expected_intermediate_agent_responses can check what the agent said between tool calls (useful for multi-step reasoning agents). expected_final_response checks the user-facing output quality. The ADK evaluator runs the actual agent, captures its behavior, and compares against all three expected values.

Why define expected_tool_use at the parameter level? Simply checking “did the agent call set_device_info?” isn’t sufficient. An agent that calls set_device_info(location="Living Room") when asked about the Bedroom has failed even though it used the “right” tool. Parameter-level matching catches this. For high-stakes actions (database writes, API calls, financial transactions), parameter validation is critical.

Evalset Format (Integration Testing)

{
  "eval_set_id": "math_assistant_integration",
  "evals": [
    {
      "eval_id": "dice_and_prime",
      "conversation": [
        {
          "invocation_id": "turn_1",
          "user_query": "What can you do?",
          "expected_final_response": "I can roll dice, check prime numbers, and perform mathematical operations."
        },
        {
          "invocation_id": "turn_2",
          "user_query": "Roll a 10-sided dice twice and then check if 9 is prime.",
          "expected_tool_use": [
            {"tool_name": "roll_die", "tool_input": {"sides": 10}},
            {"tool_name": "roll_die", "tool_input": {"sides": 10}},
            {"tool_name": "check_prime", "tool_input": {"number": 9}}
          ],
          "expected_final_response": "I rolled a 10-sided die twice..."
        }
      ]
    }
  ]
}

Test file vs evalset — what’s the difference? Test files contain a single session with one or more turns. They’re analogous to unit tests — fast, focused, testing specific behaviors. Evalsets contain multiple sessions (multiple “evals”), each with potentially many turns. They’re analogous to integration tests — they test complex, multi-turn conversations that simulate real user workflows. Use test files during active development; use evalsets for pre-deployment regression testing.

Running Evaluations

# Web UI — interactive evaluation and dataset creation
adk web

# CLI — automated evaluation for CI/CD
adk eval ./my_agent/ ./evalsets/production_test.json \
    --config ./eval_config.json \
    --print_detailed_results

# Run specific evals from a larger evalset
adk eval ./my_agent/ ./evalsets/full_suite.json \
    eval_id_1,eval_id_2,eval_id_3  # comma-separated, no spaces

# Pytest integration — include in your test suite
# (in your test_agent.py file):
from google.adk.evaluation import AgentEvaluator

def test_agent_smoke():
    AgentEvaluator.evaluate(
        agent_module    = "my_agent.agent",
        eval_dataset    = "./evalsets/smoke_test.json",
        num_runs        = 1,
    )

Why three evaluation modes? Each serves a different workflow. The web UI is for building evalsets interactively — you have a real conversation with the agent and save good examples as test cases. The CLI is for automation — run it in your CI/CD pipeline on every pull request to catch regressions before deployment. Pytest integration is for developers who want agent evaluation alongside their existing unit tests in one pytest run.


The Contractor Model

A profound insight from recent research (Gulli et al., 2025): the harder it is to evaluate an agent, the less you can trust it. Evaluation difficulty is a proxy for accountability deficit.

The contractor model directly addresses this by ensuring every agent interaction is formally evaluated against explicit, pre-specified criteria:

CONTRACTOR LIFECYCLE — formalized evaluation at every stage

1. Contract Submitted. Precise specification: deliverables, format, data sources, scope, expected cost and duration. Everything objectively verifiable.

2. Contract Assessment. The agent evaluates feasibility, cost estimates, and ambiguities, and can request clarification before committing. This prevents failures caused by underspecified requirements.

3. Accepted or Revised? If revision is needed, the agent flags ambiguities, cost overruns, or missing data, and negotiation happens before execution. Once accepted, execution begins.

4. Contract Execution. Generates a plan → executes tasks → self-validates → generates subcontracts for complex subtasks.

5. Contract Deliverables. Verifiable against the contract specifications. Evaluation is built into the contract — no post-hoc interpretation of "was this good enough?"

The four pillars of contractor-style agents:

1. Formalized Contract. Instead of a prompt like “analyze last quarter’s sales,” a contract specifies: “Deliver a 20-page PDF analyzing European market sales from Q1 2025, including five data visualizations, comparative analysis against Q1 2024, and a risk assessment. Acceptable data sources: [listed]. Maximum compute cost: $50. Completion time: 2 hours.” Every output criterion is objectively verifiable. (A sketch of such a contract as a structured document follows this list.)

2. Negotiation Phase. Before execution, the agent can flag issues: “The specified XYZ database is inaccessible. Please provide credentials or approve alternative sources.” This resolves misunderstandings before they become failures — exactly what a human contractor would do before starting a project.

3. Quality-Focused Iterative Execution. For a code contract, the agent generates multiple implementations, runs them against the contract’s unit tests, scores each on performance/security/readability, and only delivers the version that passes all criteria. Internal self-validation before delivery.

4. Hierarchical Decomposition via Subcontracts. A master contract to “build an e-commerce mobile app” generates subcontracts: “Design UI/UX,” “Develop authentication module,” “Create database schema,” “Integrate payment gateway.” Each subcontract is a complete, independent, evaluable unit — enabling both specialization and accountability at every level.
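
To make the first pillar concrete, a formalized contract might be represented as a structured document along these lines (the field names are illustrative, not a standard schema):

{
  "contract_id": "sales-analysis-q1-2025",
  "deliverable": "20-page PDF analyzing European market sales for Q1 2025",
  "required_contents": [
    "five data visualizations",
    "comparative analysis against Q1 2024",
    "risk assessment"
  ],
  "approved_data_sources": ["internal_sales_db", "eurostat_quarterly"],
  "max_compute_cost_usd": 50,
  "max_duration_hours": 2,
  "acceptance_criteria": [
    "page_count >= 20",
    "visualization_count >= 5",
    "every figure traceable to an approved data source"
  ]
}

Every field maps to a check that either passes or fails, which is what makes contract-level evaluation objective.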


Continuous Monitoring in Production

📊

Performance Tracking

Monitor accuracy, latency, and resource consumption continuously. Set up dashboards with alerting thresholds. A response latency spike at 2am should wake someone up — or at least log an alert — before users start complaining.

p50/p95/p99 latency · error_rate · tokens_per_query · cost_per_day
🔀

A/B Testing

Split production traffic between agent version A and version B. Measure the same metrics on both. The only way to know whether a change is actually better in production — not just apparently better on your test cases. Control for confounders (time of day, user segments).

statistical_significance · lift_in_primary_metric · guardrail_metric_compliance
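
A back-of-the-envelope version of that significance check, using a two-proportion z-test on task completion rate (the counts below are made up):

from math import sqrt
from statistics import NormalDist

def ab_significance(successes_a: int, total_a: int, successes_b: int, total_b: int):
    """Two-proportion z-test: returns (lift of B over A, two-sided p-value)."""
    p_a, p_b = successes_a / total_a, successes_b / total_b
    p_pool   = (successes_a + successes_b) / (total_a + total_b)
    se       = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z        = (p_b - p_a) / se
    return p_b - p_a, 2 * (1 - NormalDist().cdf(abs(z)))

lift, p_value = ab_significance(840, 1000, 872, 1000)    # completion rate: A 84.0%, B 87.2%
print(f"Lift: {lift:+.1%}, p = {p_value:.3f}")            # ship B only if lift is positive, p < 0.05, and guardrail metrics hold
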
🌊

Drift Detection

Monitor input distribution (are queries changing?), output quality trends (is accuracy declining?), and tool call patterns (is the agent using different tools than it used to?). Drift often manifests gradually — regular sampling and trend analysis catches it before users complain.

query_distribution_shift · quality_score_trend · tool_call_distribution
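
One common way to quantify query-distribution shift is the population stability index (PSI); a minimal sketch, assuming queries are already bucketed into categories (the shares below are made up):

from math import log

def population_stability_index(baseline: list, current: list) -> float:
    """PSI between two categorical distributions.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift."""
    return sum((c - b) * log(c / b) for b, c in zip(baseline, current) if b > 0 and c > 0)

baseline = [0.50, 0.30, 0.20]   # last month's share of query categories
current  = [0.35, 0.30, 0.35]   # this week's share
print(f"PSI = {population_stability_index(baseline, current):.3f}")   # ≈ 0.137 → moderate shift, worth investigating
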
🚨

Anomaly Detection

Identify unusual patterns: sudden spike in guardrail triggers (possible attack), unexpected drop in task completion rate, specific tool call timing out repeatedly, unusual concentration of queries from a single user. Most anomalies are either attacks or bugs — both need rapid response.

guardrail_trigger_spike · tool_error_rate · completion_rate_drop
📋

Compliance Auditing

Generate automated reports showing the agent's adherence to ethical guidelines, regulatory requirements, and safety protocols. These reports need to be human-readable for auditors and machine-queryable for automated monitoring. Log everything — "if it didn't get logged, it didn't happen."

policy_violation_rate · escalation_rate · audit_trail_completeness
📚

Learning Progress

For agents that learn or adapt (Chapter 9 pattern), track whether learning is actually improving performance. Plot accuracy, cost efficiency, and task completion over time. Ensure improvements generalize (don't just overfit to the evaluation set).

accuracy_over_time · generalization_score · learning_curve_slope

Key Takeaways

  • Traditional testing is insufficient for agents. Code tests catch bugs. Agent evaluation catches behavioral drift, quality degradation, cost overruns, safety violations, and goal misalignment — completely different problem classes requiring different tools.

  • Use all three evaluation methods in a tiered system. Automated metrics on 100% of traffic (free, real-time). LLM-as-a-Judge on 5-10% (scalable qualitative assessment). Human evaluation periodically or on triggered samples (gold standard for calibration). Each layer catches what the others miss.

  • Track tokens per interaction, not just totals. Per-interaction logging reveals outliers that averages hide. A single 50,000-token call (reasoning loop gone wrong) looks fine in an average but is a serious anomaly that needs investigation.

  • temperature=0 is mandatory for LLM-as-a-Judge. Evaluation consistency requires determinism. Non-zero temperature means the same response gets different scores on different days — your evaluation system becomes unreliable.

  • Trajectory evaluation is as important as output evaluation. The right answer via the wrong path is inefficient, brittle, and potentially dangerous. Evaluate both what the agent produced and how it got there. Choose your trajectory metric based on the stakes: exact match for safety-critical steps, recall when no critical step can be missed.

  • ADK provides three evaluation modes for different workflows. Web UI for interactive evalset building. Pytest for developer CI/CD integration. CLI for automated deployment pipelines. Use all three at different stages of development.

  • The contractor model embeds evaluation into the contract itself. When deliverables are explicitly specified with verifiable criteria upfront, evaluation becomes trivial — either the agent met the contract or it didn’t. This is the future of accountable AI deployment in mission-critical domains.

  • Build monitoring before you need it. Instrumentation added post-deployment has gaps. Build token tracking, error logging, latency measurement, and quality sampling into the agent from day one. The cost of instrumentation is tiny; the cost of debugging un-instrumented production failures is enormous.



