ARTICLE  ·  15 MIN READ  ·  MARCH 06, 2026

Chapter 16: Resource-Aware Optimization

Not every question needs a supercomputer. Resource-aware optimization routes simple queries to cheap, fast models and reserves expensive, powerful ones for genuinely hard problems — saving cost without sacrificing quality.


The Overspending Problem

Before You Start — Key Terms Explained

Token: The unit LLMs use for pricing and context length measurement. Roughly 4 characters of text = 1 token. "What is the capital of France?" is about 8 tokens. A 10-page document is roughly 2,500 tokens. Every LLM API call charges you for both input tokens (your prompt) and output tokens (the model's response).

Model tiers: LLM providers offer multiple models at different price/capability points. "Flash" or "Mini" models are fast and cheap, designed for high-volume simple tasks. "Pro" or "Opus" models are powerful and expensive, designed for complex reasoning. The price difference can be 10–100× between tiers.

Latency: The time between sending a request and receiving the full response. Cheaper, smaller models typically respond in 0.5–2 seconds. Larger, more capable models may take 5–30 seconds for complex reasoning. In real-time applications, latency directly affects user experience.

Router agent: An agent whose sole job is to classify incoming requests and forward them to the appropriate downstream handler. The router itself typically uses a cheap, fast model — the routing decision must not cost more than the savings it generates.

Critique agent: An agent that evaluates the quality of another agent's output. Used for quality assurance, self-correction loops, and identifying systematic failures in routing decisions (e.g., if the cheap model keeps producing bad results for certain query types).

OpenRouter: A third-party service that provides a unified API endpoint for hundreds of AI models from different providers (OpenAI, Anthropic, Google, Meta, etc.). Instead of integrating each provider separately, you send all requests to OpenRouter with a model name, and it handles routing, billing, and failover.

Graceful degradation: When a preferred system is unavailable, falling back to a less capable but still functional alternative rather than failing completely. (Covered in depth in Chapter 12.)

Imagine you run a customer support system powered by GPT-4 Pro. It handles 100,000 queries per day. Many of those queries are:

  • “What are your business hours?” (trivial lookup)
  • “How do I reset my password?” (template response)
  • “Is my order shipped?” (database lookup)
  • “What’s the return policy?” (static FAQ)

These questions do not require sophisticated reasoning. GPT-4o-mini handles them perfectly. But if you’re routing everything to GPT-4 Pro, you’re paying 15–20× more per query than necessary — for zero quality improvement on simple questions.

At 100,000 queries/day, 80% of which are simple:

  • Without optimization: 100,000 × $0.003/query = $300/day = $9,000/month
  • With optimization: 80,000 × $0.0003 + 20,000 × $0.003 = $24 + $60 = $84/day = $2,520/month

That’s a 70% cost reduction — same quality, same user experience, same output for the queries that matter.
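
The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch, using the per-query prices and the 80/20 traffic split assumed above:

# Back-of-envelope savings estimate; prices and traffic split are the assumptions above.
QUERIES_PER_DAY = 100_000
SIMPLE_SHARE    = 0.80      # fraction of traffic the cheap model handles well
PRO_COST        = 0.003     # $ per query on the expensive tier
FLASH_COST      = 0.0003    # $ per query on the cheap tier

baseline  = QUERIES_PER_DAY * PRO_COST
optimized = (QUERIES_PER_DAY * SIMPLE_SHARE * FLASH_COST
             + QUERIES_PER_DAY * (1 - SIMPLE_SHARE) * PRO_COST)

print(f"Baseline:  ${baseline:,.0f}/day (${baseline * 30:,.0f}/month)")
print(f"Optimized: ${optimized:,.0f}/day (${optimized * 30:,.0f}/month)")
print(f"Savings:   {1 - optimized / baseline:.0%}")   # roughly 72% with these numbers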

Resource-aware optimization is the pattern that makes this happen systematically. It’s not just about cost — it’s about matching the computational resource to the task’s actual requirements across three dimensions: cost (API pricing), latency (response time), and quality (output accuracy and sophistication).


The Model Tier Landscape

Before building a resource-aware system, you need to understand the trade-offs across model tiers:

MODEL TIER COMPARISON
Relative values — exact pricing varies by provider and date.

  • Flash / Mini (fast, cheap)
  • Standard / Balanced
  • Pro / Opus (powerful, expensive)

The chart shows the fundamental trade-off: cost and speed move in opposite directions from quality. A Flash model is 5× faster and much cheaper, but scores lower on complex reasoning. A Pro model excels at reasoning but is slower and expensive. Resource-aware optimization exploits this: use the cheap model where it’s good enough, reserve the expensive one for where it’s actually needed.


The Three-Agent Architecture

Resource-aware optimization is typically implemented as a three-agent system:

RESOURCE-AWARE OPTIMIZATION ARCHITECTURE

  • Incoming Query: user question, task, or request; complexity unknown at arrival time.
  • Router Agent (Flash model): classifies complexity as simple / reasoning / internet_search. Uses a cheap model; the routing cost must be much less than the savings it generates.
  • Flash Model: simple queries · fast · low cost · high volume capacity.
  • Pro Model: complex reasoning · thorough · high cost · reserved for hard problems.
  • Search + Model: live data needed · retrieve first · synthesize second.
  • Critique Agent (optional): evaluates output quality and feeds back into router logic; if Flash keeps failing on certain query types, the router learns to route those to Pro instead.
  • Final Response: correct answer · delivered within budget · appropriate latency for the query type.

Interactive: Query Router Demo

The interactive demo shows how the router classifies sample queries (for example, "What is the capital of Australia?") and routes each one to the right model.

The Code: Four Implementations

Implementation 1: ADK with Model Tiers

from google.adk.agents import Agent

# Tier 1: Fast, cheap — for simple queries, routing, classification
gemini_flash_agent = Agent(
    name        = "GeminiFlashAgent",
    model       = "gemini-2.5-flash",
    description = "A fast, cost-efficient agent for simple, well-defined queries.",
    instruction = "Answer concisely. For factual questions, give the direct answer without elaboration. Be fast."
)

# Tier 2: Powerful, expensive — for complex reasoning
gemini_pro_agent = Agent(
    name        = "GeminiProAgent",
    model       = "gemini-2.5-pro",
    description = "A highly capable agent for complex analytical and reasoning tasks.",
    instruction = "Take your time to reason carefully. Show your thought process. Prioritize accuracy over brevity."
)

Why different instructions for different tiers? The instruction shapes the model’s behavior beyond just its capability. The Flash agent is told to be concise and fast — it shouldn’t pad its responses. The Pro agent is told to reason carefully and show work — it should use its full capability, even if that means longer responses. The instruction activates the tier’s strengths.

Implementation 2: Query Router Agent

from google.adk.agents import BaseAgent
from google.adk.events import Event
from google.adk.agents.invocation_context import InvocationContext
from typing import AsyncGenerator

class QueryRouterAgent(BaseAgent):
    """Routes incoming queries to the appropriate model tier based on complexity."""
    name:        str = "QueryRouter"
    description: str = "Routes queries to Flash (simple) or Pro (complex) based on analysis."

    async def _run_async_impl(
        self, context: InvocationContext
    ) -> AsyncGenerator[Event, None]:
        user_query    = context.current_message.text
        query_words   = len(user_query.split())
        query_lower   = user_query.lower()

        # Complexity signals
        is_long_query      = query_words > 20
        needs_reasoning    = any(kw in query_lower for kw in
                                ['why', 'how', 'explain', 'compare', 'analyze',
                                 'evaluate', 'design', 'recommend', 'trade-off'])
        needs_current_data = any(kw in query_lower for kw in
                                ['today', 'this week', 'latest', 'current',
                                 'recent', 'now', '2025', '2026'])
        needs_math         = any(kw in query_lower for kw in
                                ['calculate', 'solve', 'equation', 'probability',
                                 'percent', 'if x then'])

        # Routing decision
        if needs_current_data:
            route = "search_and_synthesize"
            model_used = "google_search + gemini-2.5-flash"
        elif needs_reasoning or needs_math or is_long_query:
            route = "complex"
            model_used = "gemini-2.5-pro"
        else:
            route = "simple"
            model_used = "gemini-2.5-flash"

        yield Event(
            author  = self.name,
            content = f"Routing '{user_query[:50]}...'{route} (model: {model_used})"
        )

Rule-based vs LLM-based routing. This implementation uses keyword matching — fast and predictable, but brittle. It will misclassify edge cases (e.g., “why is the sky blue?” is classified as complex because of “why” but is actually simple). A more sophisticated router would use an LLM to classify queries based on semantic understanding rather than keywords. The trade-off: LLM routing is ~$0.0001 per classification call but far more accurate. Use keyword routing for high-volume, cost-critical systems; use LLM routing when accuracy matters more than classification cost.
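
One middle ground is a hybrid router: keyword rules handle the unambiguous cases for free, and only the uncertain middle pays for an LLM classification call. A minimal sketch; classify_with_llm is a hypothetical helper here (Implementation 4 below shows one concrete LLM classifier):

OBVIOUS_SIMPLE  = ('what is', 'when is', 'where is', 'who is')
OBVIOUS_COMPLEX = ('compare', 'analyze', 'design', 'trade-off')

def route(query: str) -> str:
    """Keyword fast path; escalate to an LLM classifier only when unsure."""
    q = query.lower().strip()
    if q.startswith(OBVIOUS_SIMPLE) and len(q.split()) <= 10:
        return "simple"                    # free and deterministic
    if any(kw in q for kw in OBVIOUS_COMPLEX):
        return "reasoning"                 # free and deterministic
    # classify_with_llm: hypothetical helper (~$0.0001 per call), used only for ambiguous queries
    return classify_with_llm(query)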

Why check needs_current_data first? The routing conditions are evaluated in priority order. If a query needs current data, that overrides complexity — even a simple question about today’s news requires search. By checking this first, you ensure real-time queries always get search capability, not just a model upgrade.

Implementation 3: The Critique Agent

CRITIC_SYSTEM_PROMPT = """
You are the Critique Agent — the quality assurance layer of our multi-model system.
Your function: evaluate responses from other models and identify systematic failures.

For each response you review, assess:
1. ACCURACY: Is the factual content correct?
2. COMPLETENESS: Does it fully address the question?
3. APPROPRIATE DEPTH: Is the detail level right for the question's complexity?
4. ROUTING SIGNAL: Was this the right model for this query type?

Return structured feedback:
- verdict: "correct" | "needs_refinement" | "wrong_model"
- if wrong_model: suggest "should_use_flash" | "should_use_pro"
- specific_issues: list of concrete problems found
- confidence: 0.0 to 1.0

Be constructive. Flag systematic misrouting — if Flash keeps struggling with
a category, that's a signal to update the router's classification rules.
"""

Why “wrong_model” is a routing signal, not just a quality signal. When the Critique Agent says “wrong_model: should_use_flash,” it’s identifying a case where the router sent a simple query to the Pro model unnecessarily — overspending. When it says “wrong_model: should_use_pro,” it’s identifying a case where the router sent a complex query to Flash — producing a poor result. Both are actionable: the first wastes money, the second wastes quality. Tracking these misrouting events over time reveals systematic biases in the router’s classification logic that can be corrected.
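
A lightweight way to act on this signal is to tally wrong_model verdicts per route and review the counts periodically. A sketch, assuming the Critique Agent's structured feedback arrives as a dict shaped like the prompt above (the field names are assumptions):

from collections import Counter

misroutes = Counter()

def record_critique(route: str, critique: dict) -> None:
    """Count misrouting verdicts so systematic router bias becomes visible."""
    if critique.get("verdict") == "wrong_model":
        # e.g. ("simple", "should_use_pro") means queries labeled simple keep
        # needing the Pro model: a router rule worth tightening.
        misroutes[(route, critique.get("suggestion"))] += 1

# Periodically: inspect misroutes.most_common(10) and update the router's classification rules.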


OpenRouter: Multi-Provider Optimization

OpenRouter is a third-party service that simplifies resource-aware optimization by providing:

  • A single API endpoint for 200+ models from all major providers
  • Automatic model selection (openrouter/auto)
  • Sequential fallback chains
  • Real-time pricing and latency data

import os, requests, json

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]   # read the API key from the environment

# Simple request with automatic model selection
response = requests.post(
    url     = "https://openrouter.ai/api/v1/chat/completions",
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "X-Title":       "My Agent App",
    },
    data = json.dumps({
        "model": "openrouter/auto",     # ← OpenRouter picks the best model for this query
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    })
)

"model": "openrouter/auto": OpenRouter analyzes the prompt and selects the most cost-effective model that can handle it well. For a simple factual question, it might choose a Flash-tier model. For a complex reasoning question, it upgrades automatically. The selection considers: prompt complexity, available model performance data, current pricing, and latency requirements.

Sequential Fallback Chain

# Fallback chain: try Claude Sonnet 4.5 first, fall back to a cheaper model if it fails
response = requests.post(
    url  = "https://openrouter.ai/api/v1/chat/completions",
    headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
    data = json.dumps({
        "models": [
            "anthropic/claude-sonnet-4-5",  # Try first — preferred quality
            "openai/gpt-4o-mini",           # Fallback 1 — if Claude is unavailable
            "google/gemini-flash-1.5",      # Fallback 2 — cheapest option
        ],
        "messages": [{"role": "user", "content": user_query}]
    })
)

# The response includes which model was actually used
model_used = response.json()["model"]
print(f"Served by: {model_used}")

Why the fallback chain is valuable in production. LLM APIs have outages, rate limits, and regional availability issues. If you’re hard-coded to one model and it goes down, your entire application fails. The fallback chain provides automatic resilience: Claude is down → seamlessly serve from GPT-4o-mini → users never notice. This is graceful degradation (Chapter 12) applied at the model layer.

The response always tells you which model was used. This is critical for cost tracking, quality auditing, and identifying when fallbacks fire more than expected. If you notice you’re serving from the fallback 30% of the time, that’s a signal your primary model has reliability issues you should address.
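
A sketch of that kind of monitoring, assuming each OpenRouter response is recorded as in the snippet above:

from collections import Counter

PRIMARY   = "anthropic/claude-sonnet-4-5"   # the first entry in the fallback chain
served_by = Counter()

def record(response_json: dict) -> None:
    """Tally which model actually served each request."""
    served_by[response_json["model"]] += 1

def fallback_rate() -> float:
    """Fraction of requests not served by the primary model."""
    total = sum(served_by.values())
    return 0.0 if total == 0 else 1 - served_by[PRIMARY] / total

# Alert if fallback_rate() stays high (say, above 10%) for a sustained window.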


Implementation 4: Full Three-Tier System with OpenAI

import os, json, requests
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def classify_prompt(prompt: str) -> dict:
    """Use a cheap model to classify query complexity — the router itself must be cheap."""
    response = client.chat.completions.create(
        model           = "gpt-4o-mini",             # Cheapest capable classifier
        temperature     = 0,                         # Deterministic — routing must be consistent
        response_format = {"type": "json_object"},   # Ensure the reply parses as JSON
        messages        = [
            {
                "role": "system",
                "content": (
                    "Classify the user prompt into exactly one category: "
                    "simple, reasoning, or internet_search.\n\n"
                    "- simple: direct factual questions answerable from training data\n"
                    "- reasoning: logic, math, multi-step analysis, trade-off evaluation\n"
                    "- internet_search: requires current data, real-time events, recent news\n\n"
                    "Respond ONLY with JSON: {\"classification\": \"<category>\"}"
                )
            },
            {"role": "user", "content": prompt}
        ]
    )
    return json.loads(response.choices[0].message.content)

temperature=0 for the classifier. The classification decision must be deterministic and consistent. If the same query is classified differently on each call, the system becomes unpredictable — the same query might get routed to Flash one time and Pro the next, producing different costs and quality. temperature=0 ensures the same input always produces the same classification.

Why use gpt-4o-mini for classification, not a cheaper model? The classifier needs to reliably distinguish “simple,” “reasoning,” and “internet_search” — this requires semantic understanding beyond keyword matching. Using a model that’s too cheap might produce unreliable classifications, routing complex queries to Flash (wrong quality) or simple queries to Pro (wrong cost). The classification model should be the cheapest model that classifies accurately for your query distribution.
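
"Cheapest model that classifies accurately" is measurable: hand-label a sample of real queries, run each candidate classifier over it, and compare accuracy against price. A sketch, assuming classify_prompt is extended with a model parameter (the labeled examples are illustrative):

LABELED_QUERIES = [
    ("What are your business hours?",                  "simple"),
    ("Compare annual vs monthly billing for our plan", "reasoning"),
    ("What did the company announce today?",           "internet_search"),
    # ... ideally 100+ examples drawn from real production traffic
]

def classifier_accuracy(model: str) -> float:
    """Fraction of labeled queries a given classifier model gets right."""
    correct = 0
    for query, expected in LABELED_QUERIES:
        result = classify_prompt(query, model=model)   # assumes classify_prompt accepts a model argument
        correct += result["classification"] == expected
    return correct / len(LABELED_QUERIES)

# Pick the cheapest model whose accuracy is acceptable for your query distribution.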

def generate_response(prompt: str, classification: str, search_results=None) -> tuple:
    """Route to the appropriate model based on classification."""
    if classification == "simple":
        model      = "gpt-4o-mini"    # $0.15/1M input, $0.60/1M output
        full_prompt = prompt

    elif classification == "reasoning":
        model       = "o4-mini"       # Reasoning model with chain-of-thought
        full_prompt = prompt

    elif classification == "internet_search":
        model = "gpt-4o"              # Standard model for synthesis with search context
        if search_results:
            search_context = "\n".join([
                f"Title: {r['title']}\nSnippet: {r['snippet']}\nURL: {r['link']}"
                for r in search_results
            ])
            full_prompt = f"""Use ONLY these web results to answer. Cite sources.\n\n{search_context}\n\nQuestion: {prompt}"""
        else:
            full_prompt = f"Note: web search returned no results. Answer from training data with a caveat.\n\n{prompt}"

    response = client.chat.completions.create(
        model    = model,
        messages = [{"role": "user", "content": full_prompt}],
    )
    return response.choices[0].message.content, model

Why different models for different categories? Each model was chosen for the category it dominates:

  • gpt-4o-mini for “simple”: cheapest model that handles factual recall with high reliability
  • o4-mini for “reasoning”: the o-series models have explicit chain-of-thought reasoning built in, dramatically improving performance on math and multi-step logic vs standard models
  • gpt-4o for “search”: synthesis tasks require reading and integrating retrieved text, a task standard models handle well; the full gpt-4o is better than mini for this use case

def handle_prompt(prompt: str) -> dict:
    """Orchestrate the full resource-aware pipeline."""
    # Step 1: Classify (cheap)
    classification = classify_prompt(prompt)["classification"]

    # Step 2: Retrieve if needed (medium cost)
    search_results = None
    if classification == "internet_search":
        search_results = google_search(prompt)   # assumes a google_search() helper is defined elsewhere

    # Step 3: Generate (model cost depends on classification)
    answer, model_used = generate_response(prompt, classification, search_results)

    return {
        "classification": classification,
        "model_used":     model_used,
        "response":       answer,
    }

The pipeline’s cost structure. Every query pays the classification cost (cheap, ~$0.0001). Then the routing kicks in:

  • Simple queries: pay cheap generation cost (~$0.0003) → total ~$0.0004
  • Reasoning queries: pay Pro generation cost (~$0.015) → total ~$0.0151
  • Search queries: pay search API cost (~$0.005) + standard generation (~$0.008) → total ~$0.0131

The classification cost is negligible. The savings come from routing ~70-80% of queries to cheap models instead of expensive ones.
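
If you know (or can estimate) your traffic mix, the same figures give a blended cost projection. A small sketch using the illustrative per-query costs above (not provider quotes):

# Illustrative per-category costs from the breakdown above (classification included).
COST_PER_QUERY = {"simple": 0.0004, "reasoning": 0.0151, "internet_search": 0.0131}

def blended_daily_cost(daily_queries: int, mix: dict[str, float]) -> float:
    """mix maps category -> traffic share; shares should sum to 1.0."""
    return daily_queries * sum(COST_PER_QUERY[cat] * share for cat, share in mix.items())

# Example: 100k queries/day, 75% simple, 15% reasoning, 10% search
mix = {"simple": 0.75, "reasoning": 0.15, "internet_search": 0.10}
print(f"${blended_daily_cost(100_000, mix):,.2f}/day")   # roughly $387.50/day with these assumptions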


The Nine Optimization Techniques

Resource-aware optimization extends beyond model switching. Here’s the full spectrum:

  1. Dynamic Model Switching: Route to different model tiers based on query complexity. The pattern we've been building throughout this chapter.

  2. Adaptive Tool Selection: Choose between tools based on cost and speed. Use a cached database lookup before an expensive live API call. Prefer a local index before a web search.

  3. Contextual Pruning: Summarize or truncate long conversation history before sending it to the LLM. Fewer input tokens = lower cost. Prioritize the most recent and most relevant context. (A minimal sketch follows this list.)

  4. Proactive Resource Prediction: Forecast expected query volume and provision resources in advance. Avoids cold-start latency and capacity shortfalls during peak traffic.

  5. Cost-Sensitive Multi-Agent Coordination: Optimize communication costs between agents in addition to computation. Batch small messages, compress serialized state, avoid redundant cross-agent calls.

  6. Energy-Efficient Deployment: For edge devices (IoT, mobile), minimize inference frequency and prefer quantized, compressed models. Important for battery-constrained environments.

  7. Parallelization Awareness: When multiple independent sub-tasks exist, parallelize them (Chapter 3 pattern). Total time = max(T_tasks), not sum(T_tasks). Coordinate resource allocation across parallel branches.

  8. Learned Resource Allocation: Use the Critique Agent's feedback over time to retrain the router. If Flash consistently fails on certain query patterns, update the classifier. The system improves from its own production traffic.

  9. Graceful Degradation + Fallback: When primary resources are exhausted or unavailable, automatically fall back to cheaper/simpler alternatives. Serve a reduced-quality response rather than failing completely. (Chapter 12 pattern applied at the resource layer.)
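
Several of these techniques come down to a few lines of code. As an example of Contextual Pruning (technique 3), here is a minimal sketch that keeps the system message plus the most recent turns that fit a token budget, using the rough 4-characters-per-token estimate from the key terms section; swap in a real tokenizer for production use:

def prune_history(messages: list[dict], max_tokens: int = 4_000) -> list[dict]:
    """Keep the system message plus the newest turns that fit the token budget."""
    system   = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(len(m["content"]) // 4 for m in system)
    for msg in reversed(dialogue):                  # walk from newest to oldest
        cost = len(msg["content"]) // 4             # ~4 characters per token
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))            # restore chronological order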


At a Glance

WHAT

A pattern for dynamically matching computational resources to task requirements. Simple queries route to cheap, fast models. Complex queries route to powerful, expensive ones. A critique agent monitors quality and improves routing over time.

WHY

Routing all queries to the most capable (and most expensive) model wastes 70-90% of API costs on queries that a much cheaper model handles equally well. Resource-aware optimization preserves quality where it matters while eliminating waste where it doesn't.

RULE OF THUMB

Start by measuring your query distribution: what fraction is simple, complex, or search-requiring? If >50% is simple, this pattern pays for itself immediately. Use OpenRouter for easy multi-model fallback; build custom routers when query distribution requires specialized classification.


Key Takeaways

  • Over-provisioning is the default failure mode. Without resource optimization, developers pick one capable model and route everything to it. This works but wastes 70-90% of API budget on queries that didn’t need that capability.

  • The routing cost must be much less than the savings. If classifying a query costs $0.001 and the savings from routing to a cheaper model is $0.0005, routing costs more than it saves. Use the cheapest reliable classifier — gpt-4o-mini, Gemini Flash, or keyword matching.

  • temperature=0 for all routing and classification calls. The routing decision must be deterministic. Non-zero temperature introduces randomness into which model serves a query, making cost and quality unpredictable and unreproducible.

  • The three-tier classification (simple / reasoning / internet_search) covers 95% of cases. These three categories map cleanly to different computational requirements: factual recall, chain-of-thought reasoning, and real-time retrieval. Most production systems can start with just these three.

  • OpenRouter provides zero-code fallback chains. Instead of implementing retry/fallback logic yourself, the "models": [list] parameter in OpenRouter handles it automatically. The simplest resilient multi-model system is a one-line configuration change.

  • The Critique Agent converts production traffic into training data. Every query where the Critique Agent says “wrong_model” is a labeled example of router misclassification. Collect these, analyze the patterns, and update the router’s classification logic. The system improves itself from its own errors.

  • Model tiers are not fixed — the landscape changes fast. What’s “Pro” today becomes “Flash” next year. Gemini 2.5 Flash matches the capability of last year’s Pro models. Build your resource-aware system with abstraction between the routing logic and the specific model names — use configuration or constants, not hardcoded strings, so updating model choices requires changing one line, not twenty. (A minimal sketch follows.)
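
A sketch of that abstraction, using the model names from this chapter as placeholders: keep the tier-to-model mapping in one configuration object so an upgrade is a single edit.

# One place to update when the model landscape shifts.
MODEL_TIERS = {
    "simple":          "gpt-4o-mini",
    "reasoning":       "o4-mini",
    "internet_search": "gpt-4o",
}

def model_for(classification: str) -> str:
    """Resolve a routing decision to a concrete model name."""
    return MODEL_TIERS.get(classification, MODEL_TIERS["simple"])   # safe cheap default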



