ARTICLE · 16 MIN READ · FEBRUARY 14, 2026
Chapter 11: Goal Setting and Monitoring
Without goals, agents react. With goals, agents pursue. This chapter shows how to give AI agents specific objectives, measurable success criteria, and the feedback loops that keep them on track.
Why Goals Transform Agents
Goal state: The desired end condition — what "done" or "success" looks like. "The customer's billing issue is resolved" is a goal state. "The agent has replied to the customer" is not — that's just an action, not a goal.
Initial state: Where you start from. The current situation that the agent must change. Understanding the initial state is as important as understanding the goal — the plan is a path from one to the other.
SMART goals: A framework for writing well-defined goals: Specific (clear about what), Measurable (you can determine if it's achieved), Achievable (within the agent's capabilities), Relevant (connected to actual user needs), Time-bound (has a defined deadline or iteration limit). A vague goal like "help the user" is not SMART. "Generate Python code that passes all provided unit tests within 5 iterations" is SMART.
Feedback loop: A cycle where the output of a process is used as input to the same process in the next iteration. In goal monitoring, the agent's progress toward the goal is measured, and that measurement is fed back to influence the agent's next action. This is how agents self-correct.
Stopping condition: The rule that determines when to exit the feedback loop. Without a stopping condition, a goal-driven agent can loop forever. Good stopping conditions include: goal achieved (success), maximum iterations reached (timeout), and no improvement detected over N iterations (stagnation).
Self-evaluation: When the agent uses the same LLM (or a separate one) to judge whether its own output meets the stated goals. This is related to the Reflection pattern (Chapter 4), but specifically oriented around goal achievement rather than general quality.
Every agent in this series so far reacts to inputs. You send a message; it responds. You make a request; it executes. The input determines the output, and the process is complete. These are reactive systems — they have no persistent purpose beyond the current request.
But many of the most valuable things we want AI agents to do are not single-turn reactions. They are sustained pursuits of outcomes:
- “Resolve this customer’s billing issue” — might require multiple tool calls, a database lookup, an email, and a confirmation check
- “Write code that passes all these tests” — requires iteration: write, test, fix, retest
- “Keep this project on track” — requires continuous monitoring of task statuses and deadlines
- “Maximize portfolio returns within risk tolerance” — requires ongoing evaluation of market conditions
These require agents that don’t just respond to the current input, but maintain a goal state across multiple steps, monitor their own progress toward that state, and adapt when they’re not making sufficient progress.
That’s the Goal Setting and Monitoring pattern.
The analogy: planning a trip. You don’t just spontaneously appear at your destination. You define where you want to go (the goal state), assess where you currently are (the initial state), plan the steps (book tickets, pack, travel), and continuously monitor your progress (check departure board, track your flight, navigate to the hotel). If something goes wrong — flight delayed, hotel overbooked — you don’t abandon the goal; you replan the path to the same destination.
The Anatomy of a Well-Defined Agent Goal
Not all goals are equal. A poorly specified goal produces an agent that achieves the wrong thing confidently. A well-specified goal produces an agent that achieves the right thing reliably. The difference is the SMART framework — a goal-writing discipline from project management that applies directly to AI agents.
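To make this concrete, a goal can be pinned down as data instead of prose. A minimal sketch, not from the chapter's code (the field names and the example values are illustrative):

from dataclasses import dataclass

@dataclass
class SmartGoal:
    """One way to express a SMART goal as data (illustrative, not the chapter's implementation)."""
    objective: str        # Specific: exactly what must be true when the agent is done
    success_check: str    # Measurable: how the monitor decides the objective is met
    max_iterations: int   # Time-bound: hard cap on refinement attempts
    # Achievable and Relevant are judgment calls made when you write the goal,
    # not fields the agent can check at runtime.

code_goal = SmartGoal(
    objective="Generate a Python function that computes factorial",
    success_check="All provided unit tests pass",
    max_iterations=5,
)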
The Monitoring Feedback Loop
Goals without monitoring are just aspirations. The monitoring component is what makes the goal operational — it continuously checks: “Are we there yet? Are we making progress? Do we need to change course?”
The difference from Reflection (Chapter 4). The Reflection pattern evaluates output quality and improves it. The Goal Setting and Monitoring pattern evaluates progress toward a specific, predefined objective. The distinction:
- Reflection: “Is this output good?” → improve quality
- Goal Monitoring: “Has this output met the stated goal criteria?” → achieve the specific target
In practice, both patterns are often combined: the agent uses reflection to improve individual outputs and goal monitoring to determine when those outputs finally satisfy the predefined success criteria.
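The combined loop has a very regular shape. A hedged sketch (generate, critique, and meets_goals stand in for whatever generation and evaluation calls your agent uses; they are not functions from this chapter):

def pursue_goal(task, goals, generate, critique, meets_goals, max_iterations=5):
    """Illustrative outline: goal monitoring wrapped around reflection."""
    output, feedback = None, None
    for _ in range(max_iterations):
        output = generate(task, goals, output, feedback)   # act, or refine the current attempt
        feedback = critique(output, goals)                  # Reflection: what is wrong with it?
        if meets_goals(feedback, goals):                    # Goal monitoring: criteria satisfied?
            break                                           # stopping condition: goal achieved
    return output  # best attempt, even if the iteration budget ran out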
Watch the Goal Loop in Action
The demo shows three iterations. Notice:
- Iteration 1: Classic off-by-one bug + no docstring + no edge cases → False
- Iteration 2: Bug fixed, docstring added, negative handled → but float inputs and docstring completeness still missing → False
- Iteration 3: All criteria met → True → loop exits
This is the goal-setting and monitoring loop in action: generate → judge → refine → judge → success.
The Code: A Goal-Driven Code Generation Agent
Now let’s look at how this pattern is implemented in Python with LangChain and OpenAI.
import os
import random
import re
from pathlib import Path
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
load_dotenv() # loads OPENAI_API_KEY from .env file
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
Why temperature=0.1 instead of 0? For code generation, you want deterministic, focused output — not creative random variation. But temperature=0 can sometimes be too rigid, producing identical outputs when asked to improve. 0.1 adds just enough randomness to explore slightly different approaches on each iteration while remaining focused.
Setting Up the Goal
def run_code_agent(use_case: str, goals_input: str, max_iterations: int = 5) -> str:
    # Parse goals from a comma-separated string into a list
    goals = [g.strip() for g in goals_input.split(",")]

    print(f"\n🎯 Use Case: {use_case}")
    print("🎯 Goals:")
    for g in goals:
        print(f"  - {g}")
Why parse goals from a comma-separated string? This makes the function easy to call from a command line or web UI: goals_input = "simple, tested, handles edge cases". Each goal becomes a separate item in the list. The agent then checks each one individually during evaluation.
The goals list is the core of the pattern. Everything else — the code generation, the critique, the iteration — serves the purpose of achieving every item on this list. If you have 5 goals, the loop continues until all 5 are met or max_iterations is exhausted.
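For example, a thin command-line wrapper is straightforward with the standard library. A sketch, assuming the run_code_agent function defined above (the argument names are illustrative):

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Goal-driven code generation agent")
    parser.add_argument("--use-case", required=True, help="What the generated code should do")
    parser.add_argument("--goals", default="simple, tested, handles edge cases",
                        help="Comma-separated list of goals")
    parser.add_argument("--max-iterations", type=int, default=5)
    args = parser.parse_args()

    path = run_code_agent(args.use_case, args.goals, args.max_iterations)
    print(f"\n💾 Final code saved to: {path}")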
Generating Code
def generate_prompt(use_case, goals, previous_code, feedback):
    goal_str = "\n".join(f"- {g}" for g in goals)

    if not previous_code:
        # First iteration: generate from scratch
        return f"""Write a Python function that solves the following problem:
Problem: {use_case}
Your code must meet ALL of these goals:
{goal_str}
Respond with ONLY the Python code. No explanations, no markdown fences, just the code."""
    else:
        # Subsequent iterations: refine based on critique
        return f"""Here is a previous attempt at solving this problem:
Problem: {use_case}
Goals to meet:
{goal_str}
Previous code:
{previous_code}
Critique of previous code:
{feedback}
Write an improved version that addresses ALL critique points. Respond with ONLY the Python code."""
Why two different prompts (first vs subsequent iterations)? The first iteration has no prior context — we simply specify the goal and ask for a solution. Subsequent iterations have crucial additional context: the previous attempt and the specific critique of why it failed. Giving the LLM this context dramatically improves the refinement quality — it’s not just told “try again,” it’s told exactly what was wrong and why.
“Respond with ONLY the Python code” — this is critical. Without this instruction, the LLM might respond with “Sure! Here’s the code:” followed by a markdown-fenced python block around def factorial.... Then your code extraction logic has to parse markdown fences, prose, and potentially multiple code blocks. The explicit instruction eliminates this parsing complexity.
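Even with that instruction, models occasionally wrap the answer in fences anyway, which is why the main loop below calls a clean_code_block helper. The chapter doesn't show its body; a minimal defensive sketch might look like this:

def clean_code_block(text: str) -> str:
    """Strip markdown fences if the model added them despite instructions (illustrative)."""
    lines = text.strip().splitlines()
    if lines and lines[0].strip().startswith("```"):
        lines = lines[1:]                  # drop opening fence, e.g. ```python
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]                 # drop closing fence
    return "\n".join(lines).strip()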
The Critique (Monitoring) Step
def get_code_feedback(code: str, goals: list) -> object:
    goal_str = "\n".join(f"- {g}" for g in goals)
    critique_prompt = f"""You are an expert code reviewer.
Review the following code against these goals:
{goal_str}
Code to review:
{code}
For each goal, state whether it is met and why.
Then provide specific, actionable feedback on any unmet goals.
Be precise — point to exact line numbers and specific issues."""
    return llm.invoke(critique_prompt)
This is the monitoring step. The same LLM (or a different one) evaluates the generated code against the stated goals. The critique prompt forces the evaluator to go through each goal individually — this is important because a general “is this code good?” question would produce vague feedback. Goal-by-goal evaluation produces specific, actionable critiques that the generator can act on.
The Stopping Condition (Goals Met Check)
def goals_met(feedback_text: str, goals: list) -> bool:
    # Ask the LLM to make a binary judgment: are ALL goals met?
    check_prompt = f"""Given this code review:
{feedback_text}
And these goals:
{', '.join(goals)}
Answer with a single word: True if ALL goals are fully met, False if any goal is not fully met."""
    response = llm.invoke(check_prompt)
    return "true" in response.content.strip().lower()
Why ask the LLM for a binary True/False judgment? Machine-readable output enables programmatic control. The loop condition is if goals_met(...). If you asked the LLM to describe whether goals are met in prose, you’d have to parse sentiment and intent from a paragraph — much harder and more error-prone. The explicit True/False instruction makes the stopping condition reliable.
Why .strip().lower()? The LLM might output "True", "true", "TRUE", "True." (with a period), or even "True, all goals are met." (with a sentence). "true" in response.content.strip().lower() handles all of these variants safely — it checks whether the string "true" appears anywhere in the lowercased response.
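If you want something stricter than substring matching, many chat models (including ChatOpenAI via LangChain's with_structured_output) can be asked to return a typed verdict instead of free text. A hedged sketch, assuming the same llm as above and a recent langchain-openai version (the schema and function names are illustrative):

from pydantic import BaseModel, Field

class GoalVerdict(BaseModel):
    """Structured judgment the evaluator must return (illustrative schema)."""
    all_goals_met: bool = Field(description="True only if every goal is fully met")
    unmet_goals: list[str] = Field(default_factory=list, description="Goals that are not yet met")

def goals_met_structured(feedback_text: str, goals: list) -> bool:
    judge = llm.with_structured_output(GoalVerdict)
    verdict = judge.invoke(
        f"Code review:\n{feedback_text}\n\nGoals:\n{', '.join(goals)}\n\n"
        "Decide whether ALL goals are fully met."
    )
    return verdict.all_goals_met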
The limitation of self-evaluation. When the same LLM that generated the code also judges whether it meets goals, there’s a risk of the judge being too lenient on the generator’s own work. The model might rationalize why a buggy solution actually meets the goals. This is the same cognitive bias problem discussed in Chapter 4 (Reflection). The solution, discussed below, is to use a separate agent for the judging role.
The Main Loop
previous_code = ""
feedback = ""
for i in range(max_iterations):
print(f"\n=== 🔁 Iteration {i + 1} of {max_iterations} ===")
# STEP 1: Generate (or refine) code
prompt = generate_prompt(use_case, goals, previous_code,
feedback if isinstance(feedback, str) else feedback.content)
code_response = llm.invoke(prompt)
code = clean_code_block(code_response.content.strip())
# STEP 2: Evaluate (monitoring)
feedback = get_code_feedback(code, goals)
feedback_text = feedback.content.strip()
# STEP 3: Check stopping condition
if goals_met(feedback_text, goals):
print("✅ All goals met. Stopping.")
break
# STEP 4: Prepare for next iteration (strategy adaptation)
previous_code = code
# Return the final result, save to file
final_code = add_comment_header(code, use_case)
return save_code_to_file(final_code, use_case)
The four-step loop maps directly to the monitoring pattern: Generate (execute action) → Get feedback (monitor) → Check if goals met (evaluate) → Update previous_code (adapt strategy). This is the fundamental goal monitoring cycle implemented in Python.
previous_code = code at the end of each iteration. The next iteration’s generator receives the previous iteration’s output. Each refinement builds on the last — it’s not starting from scratch each time, it’s improving an existing solution. This is more efficient and produces better results than generating independently each time.
What if max_iterations is reached without meeting goals? The loop exits with break (goal met) or naturally when range(max_iterations) is exhausted. In both cases, the final code variable contains the last generated version. This is saved and returned — the best attempt, even if goals weren’t fully met, is still returned rather than nothing.
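The loop also relies on an add_comment_header helper whose body isn't shown in the chapter. A plausible minimal version simply documents what the file was generated for:

def add_comment_header(code: str, use_case: str) -> str:
    """Prepend a comment describing the use case to the generated code (illustrative)."""
    header = f"# Generated by the goal-driven code agent\n# Use case: {use_case}\n\n"
    return header + code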
Saving the Result
def save_code_to_file(code: str, use_case: str) -> str:
    # Generate a short filename from the use case description
    summary_prompt = f"Summarize this use case in one lowercase word or phrase, max 10 chars, suitable for a Python filename:\n\n{use_case}"
    raw_summary = llm.invoke(summary_prompt).content.strip()
    short_name = re.sub(r"[^a-zA-Z0-9_]", "", raw_summary.replace(" ", "_").lower())[:10]

    # Add random suffix to avoid filename collisions
    random_suffix = str(random.randint(1000, 9999))
    filename = f"{short_name}_{random_suffix}.py"
    filepath = Path.cwd() / filename

    with open(filepath, "w") as f:
        f.write(code)
    return str(filepath)
Why generate the filename with an LLM? The use case description might be long and contain special characters. Asking the LLM to summarize it into a valid Python filename identifier is convenient. The re.sub(r"[^a-zA-Z0-9_]", "", ...) call then strips any remaining invalid characters as a safety net.
Why add random_suffix? If you run the agent multiple times with the same use case, you don’t want the second run to overwrite the first. The random 4-digit suffix ensures unique filenames per run.
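If you’d rather not spend an LLM call on naming, a deterministic slug plus a timestamp gives the same collision safety. A sketch, not from the chapter:

import re
import time
from pathlib import Path

def save_code_deterministic(code: str, use_case: str) -> str:
    # Slug: keep alphanumerics/underscores from the use case, truncate to 10 chars
    slug = re.sub(r"[^a-zA-Z0-9_]", "", use_case.replace(" ", "_").lower())[:10] or "agent_code"
    filepath = Path.cwd() / f"{slug}_{int(time.time())}.py"
    filepath.write_text(code)
    return str(filepath)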
The Critical Limitation: One LLM Both Writes and Judges
The implementation above has an important structural weakness: the same LLM generates the code and evaluates whether it meets the goals. This creates a subtle but significant problem.
When a language model generates code, it has already committed to an internal representation of what the code should look like. When that same model then evaluates whether the code meets goals, it reads the code through the lens of its prior commitment. It tends to be more lenient on its own work — recognizing its own intentions and reading them into the code even when they’re not fully implemented.
This is exactly the same cognitive bias that makes human code authors poor reviewers of their own code — and exactly why Chapter 4 (Reflection) recommends using a separate Critic agent with a different system prompt.
The Multi-Agent Solution
A more robust architecture uses specialized agents, each with a dedicated role: a Peer Programmer that generates the code, a Code Reviewer that critiques it against the stated goals, and a Test Writer that produces executable tests to validate it.
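A minimal sketch of those three roles, assuming the same ChatOpenAI client as above (the prompts and function names are illustrative, not the chapter's exact implementation):

from langchain_core.messages import SystemMessage, HumanMessage

def peer_programmer(use_case: str, goals: list, previous: str = "", critique: str = "") -> str:
    # Generator: writes (or refines) the code; knows nothing about reviewing
    msgs = [SystemMessage(content="You are a careful Python developer. Output ONLY code."),
            HumanMessage(content=f"Problem: {use_case}\nGoals:\n" + "\n".join(f"- {g}" for g in goals)
                         + (f"\nPrevious attempt:\n{previous}\nCritique:\n{critique}" if previous else ""))]
    return llm.invoke(msgs).content

def code_reviewer(code: str, goals: list) -> str:
    # Evaluator: a distinct persona with no stake in defending the code
    msgs = [SystemMessage(content="You are a strict code reviewer. You did NOT write this code. Find every flaw."),
            HumanMessage(content="Review against these goals:\n" + "\n".join(f"- {g}" for g in goals)
                         + f"\n\nCode:\n{code}")]
    return llm.invoke(msgs).content

def test_writer(use_case: str, code: str) -> str:
    # Validator: produces executable tests that don't rely on LLM judgment to run
    msgs = [SystemMessage(content="You write executable pytest tests. Output ONLY test code."),
            HumanMessage(content=f"Write tests for this requirement:\n{use_case}\n\nImplementation:\n{code}")]
    return llm.invoke(msgs).content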
Why this is better. Each agent has a single, focused responsibility. The Code Reviewer’s system prompt is entirely dedicated to finding flaws — it has no stake in defending the code it’s reviewing (it didn’t write it). The Test Writer generates executable tests, providing an objective, deterministic validation layer that doesn’t rely on LLM judgment at all. The separation of concerns produces higher-quality, more objective evaluation.
This architecture naturally maps to how senior engineering teams work: a developer writes code, a separate reviewer evaluates it against requirements, a QA engineer runs tests. The goal (ship working, documented, tested code) is monitored by multiple independent validators.
Practical Applications
Customer Support Automation
Goal: "Resolve customer's billing inquiry." Monitor: verify billing change in database, confirm user acknowledgment. Escalate if goal not achievable within 3 tool calls.
Customer Success · SaaS platforms
Personalized Learning
Goal: "Student achieves 80%+ accuracy on algebra exercises." Monitor: track quiz scores per topic. Adapt teaching materials when performance falls below threshold.
EdTech · Tutoring systems
Project Management
Goal: "Milestone X complete by date Y." Monitor: task completion status, team velocity, open blockers. Flag at-risk milestones before they miss deadlines.
Enterprise · Agile teams
Automated Trading
Goal: "Maximize portfolio returns within defined risk tolerance." Monitor: portfolio value, volatility metrics, drawdown percentage. Halt trading if risk thresholds breached.
FinTech · Algorithmic trading
Autonomous Vehicles
Goal: "Transport passengers from A to B safely." Monitor: environment (obstacles, signals), vehicle state (speed, fuel), route progress. Replan on deviation or hazard.
Robotics · AV systems
Content Moderation
Goal: "Identify and remove harmful content with <2% false positive rate." Monitor: classification confidence, human reviewer override rate. Adjust thresholds to maintain goal metrics.
Trust & Safety · Social platformsCommon Mistakes When Setting Agent Goals
Mistake 1: Metric misalignment — optimizing for the proxy, not the goal. You set the goal as “close 50 support tickets per day.” The agent closes tickets by providing generic responses and immediately marking them resolved. Ticket count is high; actual resolution rate is low. Always define goals in terms of the outcome you care about, not the metric that’s easy to measure.
Mistake 2: No stopping condition. The classic infinite loop. An agent tasked with “refine the report until it’s perfect” has no stopping condition. Perfect is never reached. Use explicit bounds: “up to 5 iterations” or “until no improvement is detected over 2 consecutive iterations.”
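The three stopping conditions named in the glossary (success, timeout, stagnation) can all be made explicit in the loop. A hedged sketch, where generate and evaluate are caller-supplied callables and score is whatever progress metric your monitor produces:

def refine_until_done(generate, evaluate, max_iterations=5, patience=2):
    """Illustrative loop with three explicit exits: success, timeout, stagnation."""
    best_score, stale_rounds, output = float("-inf"), 0, None
    for _ in range(max_iterations):                    # timeout: bounded iterations
        output = generate(output)
        done, score = evaluate(output)                 # monitor returns (goal_met, progress score)
        if done:
            return output                              # success: goal achieved
        if score <= best_score:
            stale_rounds += 1
            if stale_rounds >= patience:               # stagnation: no improvement over N rounds
                return output
        else:
            best_score, stale_rounds = score, 0
    return output                                      # timeout reached: return the best attempt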
Mistake 3: Self-evaluation by the same LLM that generated the output. As discussed, the generator is biased toward approving its own work. Use a separate agent with a distinct reviewer persona, or better yet, use an automated test suite that doesn’t involve the LLM at all for the evaluation step.
Mistake 4: Too many simultaneous goals. An agent with 15 goals has a 15-item checklist it must satisfy simultaneously. Each additional goal makes it less likely all are met in any given iteration, and harder to identify which specific goal caused failure. Start with 3-5 well-defined goals. Add more only as you observe the agent consistently meeting the baseline set.
Mistake 5: Goals that conflict. “Maximize response speed” and “maximize response completeness” are in tension. “Minimize API calls” and “gather comprehensive information” conflict. When goals conflict, the agent can’t satisfy all of them simultaneously — it will make arbitrary trade-offs. Explicitly rank goals by priority, or resolve conflicts before handing them to the agent.
Mistake 6: No escalation path. If the goal isn’t achievable (missing tool access, ambiguous requirements, edge case outside training data), the agent loops until max_iterations. Include an explicit escalation: “If goal is not met within 3 iterations, return current best attempt with a description of remaining gaps and what human intervention is needed.”
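In code, that escalation is just a structured return value rather than a bare string. A sketch under stated assumptions (the field names are illustrative; last_feedback is whatever critique text your monitor produced on the final iteration):

from dataclasses import dataclass

@dataclass
class AgentResult:
    """What the agent hands back when it stops, whether or not it succeeded (illustrative)."""
    success: bool
    best_attempt: str
    remaining_gaps: str     # what is still unmet, for a human to act on
    iterations_used: int

def escalate(best_attempt: str, last_feedback: str, iterations_used: int) -> AgentResult:
    # Goal not met within budget: return the best attempt plus what a human needs to fix
    return AgentResult(
        success=False,
        best_attempt=best_attempt,
        remaining_gaps=last_feedback,
        iterations_used=iterations_used,
    )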
At a Glance
What: Equipping agents with explicit, measurable objectives and feedback loops that track progress toward those objectives. The agent doesn't just execute actions — it pursues specific outcomes and self-corrects when off course.
Why: Reactive agents can't handle multi-step tasks that require sustained pursuit of an outcome. Goal setting and monitoring transforms agents from "answer this question" to "achieve this outcome" — enabling genuinely autonomous operation.
When to use: When the agent must execute a multi-step process, adapt to dynamic conditions, and reliably achieve a specific high-level objective without constant human intervention. Always define SMART goals and include an explicit stopping condition.
Key Takeaways
- Goals transform reactive agents into purposeful ones. Without goals, agents answer questions. With goals, agents pursue outcomes — a qualitative difference that enables genuine autonomy on complex, multi-step tasks.
- SMART goals prevent the most common failure modes. Specific eliminates ambiguity. Measurable enables monitoring. Achievable prevents infinite loops on impossible objectives. Relevant ensures you’re optimizing for what actually matters. Time-bound prevents runaway resource consumption.
- The monitoring feedback loop is the operational core. Execute → Evaluate → Adapt is the cycle. Every iteration makes measurable progress toward the goal or reveals that the approach needs to change. Without monitoring, you just have an agent that runs to completion without knowing if it succeeded.
- Self-evaluation by the generator is biased. The same LLM that generated an output will be more lenient when judging that output. Use a separate agent with a distinct critic persona, or better, use automated tests that don’t involve the LLM for objective evaluation.
- True/False stopping conditions are more reliable than prose. When you need a programmatic stopping condition, extract a binary signal from the LLM rather than parsing sentiment from a paragraph. "true" in response.lower() is simple, robust, and predictable.
- Every goal system needs an escalation path. When the goal isn’t achievable within the iteration budget — due to missing tools, ambiguous requirements, or out-of-distribution inputs — the agent should return its best attempt with an explanation of what’s missing, not silently fail or loop forever.
- Multi-agent architectures produce more objective monitoring. Separating the Generator (Peer Programmer), the Evaluator (Code Reviewer), and the Validator (Test Writer) into independent agents with distinct system prompts produces higher-quality, less biased evaluation than self-assessment.
Next up — Chapter 12: Safety and Guardrails, where we examine how to build agents that are not just capable, but reliably safe — preventing harmful outputs, unauthorized actions, and runaway behavior.