ARTICLE  ·  16 MIN READ  ·  FEBRUARY 14, 2026

Chapter 11: Goal Setting and Monitoring

Without goals, agents react. With goals, agents pursue. This chapter shows how to give AI agents specific objectives, measurable success criteria, and the feedback loops that keep them on track.


Why Goals Transform Agents

Before You Start — Key Terms Explained

Goal state: The desired end condition — what "done" or "success" looks like. "The customer's billing issue is resolved" is a goal state. "The agent has replied to the customer" is not — that's just an action, not a goal.

Initial state: Where you start from. The current situation that the agent must change. Understanding the initial state is as important as understanding the goal — the plan is a path from one to the other.

SMART goals: A framework for writing well-defined goals: Specific (clear about what), Measurable (you can determine if it's achieved), Achievable (within the agent's capabilities), Relevant (connected to actual user needs), Time-bound (has a defined deadline or iteration limit). A vague goal like "help the user" is not SMART. "Generate Python code that passes all provided unit tests within 5 iterations" is SMART.

Feedback loop: A cycle where the output of a process is used as input to the same process in the next iteration. In goal monitoring, the agent's progress toward the goal is measured, and that measurement is fed back to influence the agent's next action. This is how agents self-correct.

Stopping condition: The rule that determines when to exit the feedback loop. Without a stopping condition, a goal-driven agent can loop forever. Good stopping conditions include: goal achieved (success), maximum iterations reached (timeout), and no improvement detected over N iterations (stagnation).

Self-evaluation: When the agent uses the same LLM (or a separate one) to judge whether its own output meets the stated goals. This is related to the Reflection pattern (Chapter 4), but specifically oriented around goal achievement rather than general quality.

Every agent in this series so far reacts to inputs. You send a message; it responds. You make a request; it executes. The input determines the output, and the process is complete. These are reactive systems — they have no persistent purpose beyond the current request.

But many of the most valuable things we want AI agents to do are not single-turn reactions. They are sustained pursuits of outcomes:

  • “Resolve this customer’s billing issue” — might require multiple tool calls, a database lookup, an email, and a confirmation check
  • “Write code that passes all these tests” — requires iteration: write, test, fix, retest
  • “Keep this project on track” — requires continuous monitoring of task statuses and deadlines
  • “Maximize portfolio returns within risk tolerance” — requires ongoing evaluation of market conditions

These require agents that don’t just respond to the current input, but maintain a goal state across multiple steps, monitor their own progress toward that state, and adapt when they’re not making sufficient progress.

That’s the Goal Setting and Monitoring pattern.

The analogy: planning a trip. You don’t just spontaneously appear at your destination. You define where you want to go (the goal state), assess where you currently are (the initial state), plan the steps (book tickets, pack, travel), and continuously monitor your progress (check departure board, track your flight, navigate to the hotel). If something goes wrong — flight delayed, hotel overbooked — you don’t abandon the goal; you replan the path to the same destination.


The Anatomy of a Well-Defined Agent Goal

Not all goals are equal. A poorly specified goal produces an agent that achieves the wrong thing confidently. A well-specified goal produces an agent that achieves the right thing reliably. The difference is the SMART framework — a goal-writing discipline from project management that applies directly to AI agents.

SMART GOALS FOR AI AGENTS
Specific — The goal must be unambiguous
A specific goal tells the agent *exactly* what it needs to achieve. Vague goals lead to vague results — the agent will pursue the path of least resistance, which is rarely what you actually wanted.
✗ Vague: "Help users with their code problems." — What kind of help? Debugging? Explanation? Rewriting? What qualifies as "help"? The agent has no way to know when it's done.
✓ Specific: "Generate a working Python function that solves the given problem, includes a docstring, and handles the specified edge cases." — Clear what's needed, clear what format, clear scope.
Measurable — Success must be detectable
The monitoring component requires that you can determine, at any point, whether the goal has been achieved. Without measurability, monitoring is impossible — you can't track progress toward something you can't measure.
✗ Unmeasurable: "Write good code." — "Good" is subjective. How would the agent know when it's achieved "good"? How would you know?
✓ Measurable: "Write code that passes all 5 provided unit tests and has no functions longer than 20 lines." — Both criteria can be objectively checked by running the tests and measuring line counts.
Achievable — Within the agent's actual capabilities
A goal must be achievable given the agent's available tools, knowledge, and authority. Setting an unachievable goal creates an agent that loops forever without progress, consuming resources without producing results.
✗ Unachievable: "Solve any customer complaint instantly." — "Any" is too broad. "Instantly" may conflict with API latency. Some complaints require human judgment the agent cannot provide.
✓ Achievable: "Resolve billing complaints that match known error patterns using the billing_update tool. Escalate to human agents for complex cases." — Bounded scope + clear escalation path.
Relevant — Connected to what actually matters
The goal must be connected to the real objective — optimizing for the wrong metric is one of the most dangerous failure modes in AI systems. An agent that maximizes its stated metric while missing the actual intent is worse than no agent at all.
✗ Misaligned: "Maximize the number of customer support tickets closed per hour." — The agent might close tickets prematurely without actually resolving issues, just to inflate the metric.
✓ Relevant: "Resolve customer issues such that customers confirm resolution and don't reopen the ticket within 48 hours." — Measures actual resolution, not just ticket closure velocity.
Time-bound — With a defined stopping condition
Every goal-driven agent loop needs a stopping condition. Without one, the agent either runs forever (burning API credits) or waits indefinitely for a condition that may never come. Time-bound goals can be calendar-bounded ("by Friday") or iteration-bounded ("within 5 attempts").
✗ Unbounded: "Keep refining the code until it's perfect." — "Perfect" is never achieved. The loop runs until you manually kill the process or exhaust your budget.
✓ Time-bound: "Refine the code for up to 5 iterations. Stop early if all goals are met. After 5 iterations, return the best version achieved so far with a note on remaining gaps." — Clear exit conditions for both success and timeout.
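Tying the five letters together: a SMART goal can be represented as a small data structure pairing a Specific description with a Measurable predicate and a Time-bound iteration budget (Achievable and Relevant live in how the goal itself is worded). A minimal sketch; the `SmartGoal` and `is_met` names are illustrative, not from the chapter's code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SmartGoal:
    description: str                  # Specific: unambiguous statement of "done"
    is_met: Callable[[str], bool]     # Measurable: a predicate you can actually run
    max_iterations: int = 5           # Time-bound: explicit iteration budget

# Example: the goal is met when the output defines a factorial function
# with a docstring (a deliberately simple, checkable criterion).
goal = SmartGoal(
    description="Python factorial with docstring",
    is_met=lambda code: "def factorial" in code and '"""' in code,
)
print(goal.is_met('def factorial(n):\n    """Compute n!."""\n    ...'))  # True
```

Making the success criterion an executable predicate, rather than prose, is what later lets the monitoring loop check it programmatically.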

The Monitoring Feedback Loop

Goals without monitoring are just aspirations. The monitoring component is what makes the goal operational — it continuously checks: “Are we there yet? Are we making progress? Do we need to change course?”

GOAL SETTING AND MONITORING PATTERN

  1. Define Goal + Success Criteria. SMART goal: specific, measurable, achievable, relevant, time-bound. Define what "done" looks like in concrete, checkable terms.
  2. Execute Action Step. The agent takes the next action toward the goal: generates output, calls a tool, makes a decision, updates a state value.
  3. Monitor: Evaluate Progress. Check current state against goal criteria. Run tests, ask the LLM to judge output, query a metric, compare to a threshold. Answer: "Have we met the goal? Are we making progress?"
  4. Goal Met (or max iterations reached)? If not yet met, adapt strategy: use monitoring feedback to adjust (refine output, try a different tool, replan approach, request more information), then return to step 2.
  5. Deliver Result. Once the goal is achieved, return the final output. Log achievement. Update any persistent state.

The difference from Reflection (Chapter 4). The Reflection pattern evaluates output quality and improves it. The Goal Setting and Monitoring pattern evaluates progress toward a specific, predefined objective. The distinction:

  • Reflection: “Is this output good?” → improve quality
  • Goal Monitoring: “Has this output met the stated goal criteria?” → achieve the specific target

In practice, both patterns are often combined: the agent uses reflection to improve individual outputs and goal monitoring to determine when those outputs finally satisfy the predefined success criteria.


The Goal Loop in Action

Consider a concrete run of a goal-driven code agent. Goal: write a Python factorial function that is simple, correct, handles edge cases (negative, float), and includes a docstring. Maximum iterations: 5.

A typical run converges in three iterations. Notice:

  • Iteration 1: Classic off-by-one bug + no docstring + no edge cases → False
  • Iteration 2: Bug fixed, docstring added, negative handled → but float inputs and docstring completeness still missing → False
  • Iteration 3: All criteria met → True → loop exits

This is the goal-setting and monitoring loop in action: generate → judge → refine → judge → success.


The Code: A Goal-Driven Code Generation Agent

Now let’s look at how this pattern is implemented in Python with LangChain and OpenAI.

import os
import random
import re
from pathlib import Path
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

load_dotenv()  # loads OPENAI_API_KEY from .env file
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

Why temperature=0.1 instead of 0? For code generation, you want deterministic, focused output — not creative random variation. But temperature=0 can sometimes be too rigid, producing identical outputs when asked to improve. 0.1 adds just enough randomness to explore slightly different approaches on each iteration while remaining focused.

Setting Up the Goal

def run_code_agent(use_case: str, goals_input: str, max_iterations: int = 5) -> str:
    # Parse goals from a comma-separated string into a list
    goals = [g.strip() for g in goals_input.split(",")]

    print(f"\n🎯 Use Case: {use_case}")
    print("🎯 Goals:")
    for g in goals:
        print(f"  - {g}")

Why parse goals from a comma-separated string? This makes the function easy to call from a command line or web UI: goals_input = "simple, tested, handles edge cases". Each goal becomes a separate item in the list. The agent then checks each one individually during evaluation.

The goals list is the core of the pattern. Everything else — the code generation, the critique, the iteration — serves the purpose of achieving every item on this list. If you have 5 goals, the loop continues until all 5 are met or max_iterations is exhausted.

Generating Code

def generate_prompt(use_case, goals, previous_code, feedback):
    goal_str = "\n".join(f"- {g}" for g in goals)

    if not previous_code:
        # First iteration: generate from scratch
        return f"""Write a Python function that solves the following problem:

Problem: {use_case}

Your code must meet ALL of these goals:
{goal_str}

Respond with ONLY the Python code. No explanations, no markdown fences, just the code."""
    else:
        # Subsequent iterations: refine based on critique
        return f"""Here is a previous attempt at solving this problem:

Problem: {use_case}

Goals to meet:
{goal_str}

Previous code:
{previous_code}

Critique of previous code:
{feedback}

Write an improved version that addresses ALL critique points. Respond with ONLY the Python code."""

Why two different prompts (first vs subsequent iterations)? The first iteration has no prior context — we simply specify the goal and ask for a solution. Subsequent iterations have crucial additional context: the previous attempt and the specific critique of why it failed. Giving the LLM this context dramatically improves the refinement quality — it’s not just told “try again,” it’s told exactly what was wrong and why.

“Respond with ONLY the Python code” — this is critical. Without this instruction, the LLM might respond with: “Sure! Here’s the code: python def factorial...”. Then your code extraction logic has to parse markdown fences, prose, and potentially multiple code blocks. The explicit instruction eliminates this parsing complexity.
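The main loop shown later calls a clean_code_block helper that the chapter doesn't define. A minimal sketch of one possible implementation, assuming the model occasionally wraps its output in a markdown fence despite the instruction:

```python
import re

def clean_code_block(text: str) -> str:
    """Strip a surrounding markdown fence (```python ... ```) if one is present."""
    match = re.match(r"^```(?:\w+)?\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()
```

Even with the "ONLY the Python code" instruction, a defensive strip like this is cheap insurance: it is a no-op on clean output and rescues the occasional fenced reply.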

The Critique (Monitoring) Step

def get_code_feedback(code: str, goals: list) -> object:
    goal_str = "\n".join(f"- {g}" for g in goals)
    critique_prompt = f"""You are an expert code reviewer.

Review the following code against these goals:
{goal_str}

Code to review:
{code}

For each goal, state whether it is met and why.
Then provide specific, actionable feedback on any unmet goals.
Be precise — point to exact line numbers and specific issues."""

    return llm.invoke(critique_prompt)

This is the monitoring step. The same LLM (or a different one) evaluates the generated code against the stated goals. The critique prompt forces the evaluator to go through each goal individually — this is important because a general “is this code good?” question would produce vague feedback. Goal-by-goal evaluation produces specific, actionable critiques that the generator can act on.

The Stopping Condition (Goals Met Check)

def goals_met(feedback_text: str, goals: list) -> bool:
    # Ask the LLM to make a binary judgment: are ALL goals met?
    check_prompt = f"""Given this code review:
{feedback_text}

And these goals:
{', '.join(goals)}

Answer with a single word: True if ALL goals are fully met, False if any goal is not fully met."""

    response = llm.invoke(check_prompt)
    return "true" in response.content.strip().lower()

Why ask the LLM for a binary True/False judgment? Machine-readable output enables programmatic control. The loop condition is if goals_met(...). If you asked the LLM to describe whether goals are met in prose, you’d have to parse sentiment and intent from a paragraph — much harder and more error-prone. The explicit True/False instruction makes the stopping condition reliable.

Why .strip().lower()? The LLM might output "True", "true", "TRUE", "True." (with a period), or even "True, all goals are met." (with a sentence). "true" in response.content.strip().lower() handles all of these variants. One caveat: substring matching also fires on responses like "Not true", so the explicit single-word instruction in the prompt is doing real work here. A stricter parser would anchor the check to the first word of the reply.
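A sketch of that stricter alternative: instead of searching the whole response for the substring "true", check only the first word. The `parse_verdict` name is illustrative, not from the chapter's code:

```python
def parse_verdict(response_text: str) -> bool:
    """Parse a True/False verdict, anchored to the first word of the reply."""
    words = response_text.strip().split()
    return bool(words) and words[0].strip(".,!").lower() == "true"

print(parse_verdict("True, all goals are met."))    # True
print(parse_verdict("Not true: goal 3 is unmet."))  # False
```

This stays robust to punctuation and trailing prose while rejecting negated phrasings that the plain substring check would misread.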

The limitation of self-evaluation. When the same LLM that generated the code also judges whether it meets goals, there’s a risk of the judge being too lenient on the generator’s own work. The model might rationalize why a buggy solution actually meets the goals. This is the same cognitive bias problem discussed in Chapter 4 (Reflection). The solution, discussed below, is to use a separate agent for the judging role.

The Main Loop

    previous_code = ""
    feedback = ""

    for i in range(max_iterations):
        print(f"\n=== 🔁 Iteration {i + 1} of {max_iterations} ===")

        # STEP 1: Generate (or refine) code
        prompt = generate_prompt(use_case, goals, previous_code,
                                 feedback if isinstance(feedback, str) else feedback.content)
        code_response = llm.invoke(prompt)
        code = clean_code_block(code_response.content.strip())

        # STEP 2: Evaluate (monitoring)
        feedback = get_code_feedback(code, goals)
        feedback_text = feedback.content.strip()

        # STEP 3: Check stopping condition
        if goals_met(feedback_text, goals):
            print("✅ All goals met. Stopping.")
            break

        # STEP 4: Prepare for next iteration (strategy adaptation)
        previous_code = code

    # Return the final result, save to file
    final_code = add_comment_header(code, use_case)
    return save_code_to_file(final_code, use_case)

The four-step loop maps directly to the monitoring pattern: Generate (execute action) → Get feedback (monitor) → Check if goals met (evaluate) → Update previous_code (adapt strategy). This is the fundamental goal monitoring cycle implemented in Python.

previous_code = code at the end of each iteration. The next iteration’s generator receives the previous iteration’s output. Each refinement builds on the last — it’s not starting from scratch each time, it’s improving an existing solution. This is more efficient and produces better results than generating independently each time.

What if max_iterations is reached without meeting goals? The loop exits with break (goal met) or naturally when range(max_iterations) is exhausted. In both cases, the final code variable contains the last generated version. This is saved and returned — the best attempt, even if goals weren’t fully met, is still returned rather than nothing.

Saving the Result

def save_code_to_file(code: str, use_case: str) -> str:
    # Generate a short filename from the use case description
    summary_prompt = f"Summarize this use case in one lowercase word or phrase, max 10 chars, suitable for a Python filename:\n\n{use_case}"
    raw_summary = llm.invoke(summary_prompt).content.strip()
    short_name = re.sub(r"[^a-zA-Z0-9_]", "", raw_summary.replace(" ", "_").lower())[:10]

    # Add random suffix to avoid filename collisions
    random_suffix = str(random.randint(1000, 9999))
    filename = f"{short_name}_{random_suffix}.py"
    filepath = Path.cwd() / filename

    with open(filepath, "w") as f:
        f.write(code)

    return str(filepath)

Why generate the filename with an LLM? The use case description might be long and contain special characters. Asking the LLM to summarize it into a short, filename-friendly string is convenient. The re.sub(r"[^a-zA-Z0-9_]", "", ...) call then strips any remaining invalid characters as a safety net.

Why add random_suffix? If you run the agent multiple times with the same use case, you don’t want the second run to overwrite the first. The random 4-digit suffix ensures unique filenames per run.


The Critical Limitation: One LLM Both Writes and Judges

The implementation above has an important structural weakness: the same LLM generates the code and evaluates whether it meets the goals. This creates a subtle but significant problem.

When a language model generates code, it has already committed to an internal representation of what the code should look like. When that same model then evaluates whether the code meets goals, it reads the code through the lens of its prior commitment. It tends to be more lenient on its own work — recognizing its own intentions and reading them into the code even when they’re not fully implemented.

This is exactly the same cognitive bias that makes human code authors poor reviewers of their own code — and exactly why Chapter 4 (Reflection) recommends using a separate Critic agent with a different system prompt.

The Multi-Agent Solution

A more robust architecture uses specialized agents, each with a dedicated role:

MULTI-AGENT GOAL MONITORING — separation of concerns
  • Goal + Use Case Input: SMART goal definition with specific, measurable success criteria.
  • Peer Programmer: generates code. Focuses entirely on solving the problem. Does not evaluate its own output.
  • Code Reviewer: evaluates code against the stated goals. Returns structured feedback plus a True/False verdict. Completely separate from the generator.
  • Test Writer: generates unit tests for the code. Provides objective, executable validation that complements the LLM reviewer's judgment.
  • Documenter: ensures docstrings, comments, and README are complete. Checks documentation goals independently.
  • All goals met? The Reviewer and Test Writer must both confirm. If not, feedback flows back to the Peer Programmer; once every goal is met, the final output is delivered.

Why this is better. Each agent has a single, focused responsibility. The Code Reviewer’s system prompt is entirely dedicated to finding flaws — it has no stake in defending the code it’s reviewing (it didn’t write it). The Test Writer generates executable tests, providing an objective, deterministic validation layer that doesn’t rely on LLM judgment at all. The separation of concerns produces higher-quality, more objective evaluation.
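The separation of concerns can be sketched with role-specific system prompts. In the sketch below, `generate` and `review` are stubs standing in for the two LLM calls (so the control flow is runnable without an API key); in production each would invoke a model with its own dedicated system prompt:

```python
# Role-specific system prompts keep the generator and reviewer separate.
GENERATOR_SYSTEM = "You are a peer programmer. Solve the problem. Never review your own work."
REVIEWER_SYSTEM = "You are a code reviewer. Judge the code against the goals. You did not write it."

def generate(problem: str, feedback: str = "") -> str:
    # Production version: an LLM call using GENERATOR_SYSTEM plus any feedback.
    # Stub: the second attempt (with feedback) adds the missing docstring.
    if feedback:
        return 'def add(a, b):\n    """Add two numbers."""\n    return a + b'
    return "def add(a, b):\n    return a + b"

def review(code: str, goals: list[str]) -> list[str]:
    # Production version: a *separate* LLM call using REVIEWER_SYSTEM.
    # Stub: a goal named "docstring" is unmet if no docstring appears.
    return [g for g in goals if g == "docstring" and '"""' not in code]

goals = ["docstring"]
code, gaps = generate("add two numbers"), []
for _ in range(3):                       # time-bound: at most 3 review cycles
    gaps = review(code, goals)
    if not gaps:                         # reviewer signs off: goal met
        break
    code = generate("add two numbers", feedback=f"unmet: {gaps}")
```

The key structural point survives the stubbing: the reviewer never sees its own authorship, only the code and the goals.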

This architecture naturally maps to how senior engineering teams work: a developer writes code, a separate reviewer evaluates it against requirements, a QA engineer runs tests. The goal (ship working, documented, tested code) is monitored by multiple independent validators.


Practical Applications

  1. Customer Support Automation (Customer Success · SaaS platforms). Goal: "Resolve customer's billing inquiry." Monitor: verify billing change in database, confirm user acknowledgment. Escalate if goal not achievable within 3 tool calls.

  2. Personalized Learning (EdTech · Tutoring systems). Goal: "Student achieves 80%+ accuracy on algebra exercises." Monitor: track quiz scores per topic. Adapt teaching materials when performance falls below threshold.

  3. Project Management (Enterprise · Agile teams). Goal: "Milestone X complete by date Y." Monitor: task completion status, team velocity, open blockers. Flag at-risk milestones before they miss deadlines.

  4. Automated Trading (FinTech · Algorithmic trading). Goal: "Maximize portfolio returns within defined risk tolerance." Monitor: portfolio value, volatility metrics, drawdown percentage. Halt trading if risk thresholds breached.

  5. Autonomous Vehicles (Robotics · AV systems). Goal: "Transport passengers from A to B safely." Monitor: environment (obstacles, signals), vehicle state (speed, fuel), route progress. Replan on deviation or hazard.

  6. Content Moderation (Trust & Safety · Social platforms). Goal: "Identify and remove harmful content with <2% false positive rate." Monitor: classification confidence, human reviewer override rate. Adjust thresholds to maintain goal metrics.

Common Mistakes When Setting Agent Goals

Mistake 1: Metric misalignment — optimizing for the proxy, not the goal. You set the goal as “close 50 support tickets per day.” The agent closes tickets by providing generic responses and immediately marking them resolved. Ticket count is high; actual resolution rate is low. Always define goals in terms of the outcome you care about, not the metric that’s easy to measure.

Mistake 2: No stopping condition. The classic infinite loop. An agent tasked with “refine the report until it’s perfect” has no stopping condition. Perfect is never reached. Use explicit bounds: “up to 5 iterations” or “until no improvement is detected over 2 consecutive iterations.”

Mistake 3: Self-evaluation by the same LLM that generated the output. As discussed, the generator is biased toward approving its own work. Use a separate agent with a distinct reviewer persona, or better yet, use an automated test suite that doesn’t involve the LLM at all for the evaluation step.

Mistake 4: Too many simultaneous goals. An agent with 15 goals has a 15-item checklist it must satisfy simultaneously. Each additional goal makes it less likely all are met in any given iteration, and harder to identify which specific goal caused failure. Start with 3-5 well-defined goals. Add more only as you observe the agent consistently meeting the baseline set.

Mistake 5: Goals that conflict. “Maximize response speed” and “maximize response completeness” are in tension. “Minimize API calls” and “gather comprehensive information” conflict. When goals conflict, the agent can’t satisfy all of them simultaneously — it will make arbitrary trade-offs. Explicitly rank goals by priority, or resolve conflicts before handing them to the agent.

Mistake 6: No escalation path. If the goal isn’t achievable (missing tool access, ambiguous requirements, edge case outside training data), the agent loops until max_iterations. Include an explicit escalation: “If goal is not met within 3 iterations, return current best attempt with a description of remaining gaps and what human intervention is needed.”
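Such an escalation path can be sketched as a wrapper around the iteration loop. Everything here is illustrative (the `run_with_escalation` name and the hypothetical `run_iteration` callback, which is assumed to return a `(code, score, gaps)` tuple each pass); the point is the shape, not the API:

```python
def run_with_escalation(run_iteration, max_iterations: int = 3) -> dict:
    """Loop with an explicit escalation path instead of silent failure."""
    best = {"code": "", "score": -1.0, "gaps": []}
    for i in range(max_iterations):
        code, score, gaps = run_iteration(i)
        if score > best["score"]:                 # keep the best attempt so far
            best = {"code": code, "score": score, "gaps": gaps}
        if not gaps:                              # all goals met: no escalation needed
            return {"status": "success", **best}
    return {"status": "escalate",                 # budget exhausted: hand off to a human
            "note": f"Unmet after {max_iterations} iterations: {best['gaps']}",
            **best}
```

Note that the escalation case still returns the best attempt and names the remaining gaps, which is exactly the contract Mistake 6 asks for.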


At a Glance

WHAT

Equipping agents with explicit, measurable objectives and feedback loops that track progress toward those objectives. The agent doesn't just execute actions — it pursues specific outcomes and self-corrects when off course.

WHY

Reactive agents can't handle multi-step tasks that require sustained pursuit of an outcome. Goal setting and monitoring transforms agents from "answer this question" to "achieve this outcome" — enabling genuinely autonomous operation.

RULE OF THUMB

Use when the agent must execute a multi-step process, adapt to dynamic conditions, and reliably achieve a specific high-level objective without constant human intervention. Always define SMART goals and include an explicit stopping condition.


Key Takeaways

  • Goals transform reactive agents into purposeful ones. Without goals, agents answer questions. With goals, agents pursue outcomes — a qualitative difference that enables genuine autonomy on complex, multi-step tasks.

  • SMART goals prevent the most common failure modes. Specific eliminates ambiguity. Measurable enables monitoring. Achievable prevents infinite loops on impossible objectives. Relevant ensures you’re optimizing for what actually matters. Time-bound prevents runaway resource consumption.

  • The monitoring feedback loop is the operational core. Execute → Evaluate → Adapt is the cycle. Every iteration makes measurable progress toward the goal or reveals that the approach needs to change. Without monitoring, you just have an agent that runs to completion without knowing if it succeeded.

  • Self-evaluation by the generator is biased. The same LLM that generated an output will be more lenient when judging that output. Use a separate agent with a distinct critic persona, or better, use automated tests that don’t involve the LLM for objective evaluation.

  • True/False stopping conditions are more reliable than prose. When you need a programmatic stopping condition, extract a binary signal from the LLM rather than parsing sentiment from a paragraph. "true" in response.lower() is simple, robust, and predictable.

  • Every goal system needs an escalation path. When the goal isn’t achievable within the iteration budget — due to missing tools, ambiguous requirements, or out-of-distribution inputs — the agent should return its best attempt with an explanation of what’s missing, not silently fail or loop forever.

  • Multi-agent architectures produce more objective monitoring. Separating the Generator (Peer Programmer), the Evaluator (Code Reviewer), and the Validator (Test Writer) into independent agents with distinct system prompts produces higher-quality, less biased evaluation than self-assessment.


Next up — Chapter 12: Safety and Guardrails, where we examine how to build agents that are not just capable, but reliably safe — preventing harmful outputs, unauthorized actions, and runaway behavior.



