ARTICLE · 42 MIN READ · JANUARY 05, 2026
Chapter 1: Prompt Chaining
LLMs choke on complex tasks. Prompt chaining fixes that — by breaking one hard problem into a sequence of simple ones, each feeding the next.
The Problem with One Big Prompt
LLM (Large Language Model): A type of AI trained on enormous amounts of text (books, websites, code) that can generate human-like text in response to prompts. GPT-4, Gemini, Claude — these are all LLMs.
Prompt: The text you send to an LLM. Think of it as your question or instruction. The quality of the prompt directly affects the quality of the response.
Context window: The LLM's "working memory" — the maximum amount of text it can see at once. If you send a very long conversation, the model may "forget" things from early on because they fell outside the window.
Hallucination: When an LLM confidently states something false. It doesn't know that it's wrong — it's just predicting plausible-sounding text based on patterns it learned. This is a core reliability problem that prompt chaining helps solve.
Temperature: A setting (typically 0 to 2) that controls how "creative" vs "predictable" the output is. Temperature 0 = always picks the most likely next word (deterministic). Higher values = more random and creative. For factual tasks, use 0. For brainstorming, use higher values.
API (Application Programming Interface): A way for programs to talk to each other. When you use LangChain to call GPT-4, it's sending your prompt to OpenAI's API over the internet and getting a response back.
Imagine asking a single intern to — in one go — read a 50-page report, extract key data, spot trends, write a summary, and draft an email to the CEO. All at once. No notes allowed.
They’d probably do a few things okay and quietly drop the rest.
That’s exactly what happens when you give a large language model (LLM) one massive, complicated instruction. The model gets overwhelmed. It forgets parts of the prompt. It loses track of what it was doing. It makes stuff up to fill in the gaps.
This has a name: cognitive overload. And it’s the #1 reason LLM outputs go wrong on complex tasks.
Here’s what failure looks like in practice:
| What goes wrong | What it means |
|---|---|
| Instruction neglect | The model quietly ignores parts of your prompt |
| Contextual drift | It forgets the original goal halfway through |
| Error propagation | One early mistake snowballs into a broken answer |
| Hallucination | When unsure, it confidently makes things up |
Why LLMs Struggle with Complex Instructions: A Deep Dive
To really understand why prompt chaining is necessary, you need to understand a bit about how LLMs actually work under the hood. You do not need to know any mathematics here — just the intuitions.
What is a token?
Before an LLM can process any text, it first converts that text into tokens. A token is the model’s basic unit of language — roughly speaking, it is a short piece of a word. The word "chaining" might be one token. The word "unbelievable" might be broken into three tokens: "un", "believ", and "able". A rule of thumb that works well in practice: one token is approximately four characters of English text, so 100 tokens is roughly 75 words.
Why does this matter? Because every part of the LLM’s processing — from its limits, to its costs, to its attention capacity — is measured in tokens, not words or sentences. When you see that GPT-4o has a “128,000-token context window,” that translates to roughly 96,000 words, or about the length of a full novel.
Tokens also directly determine cost. OpenAI and other API providers charge per thousand tokens of input and output. A 1,000-token prompt plus a 500-token response costs you 1,500 tokens. Understanding this helps you write efficient prompts: every unnecessary word you include in a prompt costs money and consumes precious context space.
What is a context window, really?
The context window is the total amount of text — measured in tokens — that an LLM can see at one time. Think of it as a spotlight: the model can only pay attention to whatever falls within the spotlight’s beam. Everything outside the beam is invisible to the model.
If you write a 200-token prompt, the model can see all 200 tokens simultaneously. If you write a 5,000-token prompt with a crucial instruction at the very beginning, the model technically still “sees” it — but here is where attention comes in.
It is worth pausing on the phrase “everything outside the beam is invisible.” This is not a metaphor — it is literally true. If a conversation gets long enough that early messages exceed the context window, the model has absolutely no access to them. It cannot even tell you that it has forgotten — it simply has no record that those messages ever existed. This is why very long chat sessions with AI assistants sometimes feel like the AI “forgot” something important you said an hour ago. It literally did. The conversation history that your client software was maintaining got trimmed to fit within the context window.
How attention works (no math, I promise)
Modern LLMs use a mechanism called attention. The idea is that when generating the next word of an output, the model does not treat every previous word as equally important. It looks back at all the tokens in the context window and assigns each one a weight — a score that says “how much does this token matter for predicting what comes next?”
Attention is powerful because it lets the model connect distant parts of the text. For example, if a pronoun appears at position 3,000 in a document and its antecedent (the noun it refers to) appeared at position 100, attention can still link them. The model does not have to process tokens in strict sequence — it can “look back” anywhere in the context simultaneously.
The critical problem is this: attention is not perfect, and it is not free. As context windows get longer, the amount of computation required to pay attention to everything grows rapidly. The model has a limited budget of attention to spread across all tokens in the context. When you write a very long, complex prompt, the attention available to each individual instruction gets diluted. Critical instructions near the beginning or buried in the middle of a long prompt get assigned lower attention weights — which means the model effectively pays less attention to them.
This is sometimes called the lost-in-the-middle problem: instructions at the very start or very end of a long prompt get the most attention; instructions in the middle get the least. Research from Stanford (Liu et al., 2023) confirmed this experimentally: models perform significantly worse at tasks that require using information positioned in the middle of long contexts, compared to the same information positioned at the beginning or end. If your most important constraint is instruction 4 of 8 in a long prompt, there is a real chance the model will under-weight it.
Tokens, attention, and instruction neglect
Put this together and you get a clear picture of why “one big prompt” fails:
- You write a prompt with 10 distinct instructions.
- The LLM tokenizes it and runs attention across all 10 instructions simultaneously.
- The model has a fixed capacity for what it can “hold in mind” at once.
- Generating the output is a sequential, word-by-word process. As the model writes each word, it pays attention back to the prompt — but as the output grows, the prompt tokens must “compete” with the growing output tokens for attention.
- Instructions that were already receiving diluted attention get diluted further as the output grows.
- The model “forgets” or ignores some of your instructions — not because it is broken, but because this is the mathematically expected behavior of attention under resource constraints.
The model is not being lazy when it ignores your instructions. It is doing exactly what it was designed to do — predicting the most likely next token given everything in its context window. The problem is that your carefully-crafted instructions have become a small part of a very large context, and the attention mechanism cannot give every instruction equal weight.
What hallucination actually is
A hallucination occurs when the LLM generates text that is factually incorrect or fabricated. This happens for a specific reason: the LLM is, at its core, a very sophisticated text completion machine. It does not have access to a “fact database” it can look up. It learned patterns from training data, and when it encounters a situation where it does not have a confident prediction, it generates the most statistically plausible-sounding next token — even if that token represents something that never happened.
Think of it like this: imagine you were asked to complete the sentence “The capital of France is __” after reading ten million documents that mention Paris. You would confidently say “Paris.” Now imagine you were asked to complete “The obscure 14th-century French mathematician who first proved __” after reading far fewer documents that mention this specific mathematician. You might confabulate a name — not maliciously, but because your pattern-matching training leads you to produce something plausible-sounding. This is exactly what LLMs do.
Hallucinations are especially common in complex prompts because complex prompts push the model into territory where it is less confident. When a model is handling 10 instructions simultaneously and reaches a point where it is uncertain about instruction 6, it “fills in the gap” with the most plausible-sounding continuation of the text. The model cannot tell you “I’m not sure about this part” — it just produces text, confident-sounding text, even when it is wrong.
Prompt chaining dramatically reduces hallucinations because each step is simpler, better-defined, and more squarely within the model’s confident domain. A model that is only asked to summarize a given paragraph — not to simultaneously extract, format, analyze, and report — stays within a tight, well-defined task and produces far fewer hallucinations.
The compounding failure problem
One last failure mode worth naming: error propagation. In a complex prompt, if the model makes one early mistake, all subsequent reasoning is built on a flawed foundation. The model cannot “backtrack” mid-generation to fix an error — it only moves forward, each token conditioning on all previous tokens. A small wrong assumption at step 1 of a 10-step reasoning process can produce a completely wrong final answer even if steps 2–10 were “correct” given step 1.
Imagine a row of dominos. If the first domino falls in the wrong direction, every subsequent domino falls the wrong way too, and the final state of the row is completely different from what you intended. There is no mid-sequence correction. Error propagation in LLM reasoning is identical: the first flawed inference tilts every subsequent inference, and the final answer may be far from correct even though each individual reasoning step looked locally plausible. Prompt chaining breaks the domino chain into isolated segments with validation checkpoints between them. If one domino falls wrong within a segment, the damage is contained to that segment — it cannot reach the next one.
The Fix: Chain Your Prompts
Prompt chaining solves this by doing what any good manager does — breaking the big task into smaller steps, where each step has exactly one job.
Step 1 does its job. Passes its output to Step 2. Step 2 does its job. Passes to Step 3. And so on.
Think of it like an assembly line. Each station handles one thing and does it well. The car doesn’t get assembled all at once by one person.
A restaurant kitchen is not one person doing everything. It is a brigade: the prep cook handles raw ingredients, the sauce cook handles sauces, the grill cook handles proteins, the expeditor checks every plate before it leaves. Each person has one domain of expertise and receives a clear, specific input. The output of one station becomes the input of the next. A prompt chain is structurally identical. Step 1 is the prep cook — it cleans and organises the raw input. Step 2 is the line cook — it transforms the prepped input into an intermediate product. Step 3 is the finishing station — it produces the final output. When every station is focused on exactly one job, quality is consistent and errors are caught early rather than being plated and served to the customer.
The mechanical explanation of why this works connects directly to everything we said about attention: when each prompt contains only one instruction, the attention mechanism has only one job to focus on. There is no dilution, no competition between instructions, no opportunity for the lost-in-the-middle problem to occur. Every token in the prompt is pointing at the same goal, so the model’s entire attention budget is directed toward exactly what you want.
Here’s another way to think about it: instead of sending the LLM a 10-point brief and hoping it handles all 10 points well, you are sending it a 1-point brief ten times in a row, with each point having the full, undivided attention of the model. The total number of LLM calls is higher, and there is more Python code involved — but the reliability improvement is so dramatic that this trade-off is almost always worth it.
Reading the Demo: What Is Actually Happening at Each Step
If you just clicked “Run Chain” and watched the demo, you saw text appearing step by step in three boxes. Let’s talk through exactly what is happening at each transition, because this is the heart of prompt chaining.
The raw input is an unstructured sentence: “Market report: AI adoption in enterprise grew 340% in 2024. Healthcare led at 67%, finance at 52%. Top blockers: data privacy (43%) and integration costs (38%).” This is the kind of messy, human-written text you encounter in the real world. It has no fixed format. The numbers are embedded in prose. The structure is implicit, not explicit. If you tried to extract the healthcare adoption percentage with a simple string search, you would have to write brittle code that breaks the moment someone rephrases the sentence slightly.
Step 1 — Summarise. The first prompt receives this raw sentence and has exactly one job: condense it into a cleaner summary. Notice that the output is still plain text — but it is tighter, more clearly structured prose. The word “surged” replaces the dry “grew.” The sentence breaks make the key facts easier to scan. This step is not yet converting anything to data; it is just cleaning and organizing the language. The model works well here because it is doing what it was trained on billions of examples to do: rewrite text to be clearer. There is no ambiguity about what “summarise” means, and the model’s full attention is on that one task.
Step 2 — Extract trends as JSON. Now the output of Step 1 — that clean summary — becomes the input of a completely new prompt. This second prompt instructs the model to identify the top trends and emit them as JSON. JSON, which stands for JavaScript Object Notation, is a structured data format that uses {} for objects and [] for arrays. Here is what each symbol means in the output you see in the demo:
- { opens a JSON object — a collection of key-value pairs.
- "trends" is a key — a named slot that holds a value.
- : [ means the value of “trends” is an array (a list of items).
- { inside the array opens one item object with its own key-value pairs.
- "trend": "Healthcare AI adoption" is one key-value pair — the name of the trend.
- "data": "67% of enterprise AI deployments" is another key-value pair — the supporting statistic.
Crucially, JSON is not just human-readable — it is machine-readable. A Python program can parse it instantly using json.loads(). This is what makes the chain technically composable: you are not just passing text around, you are passing structured data that code can act on. The model excels at Step 2 because it is not being asked to summarize or write anything creative — it is being asked to slot the already-clean facts from Step 1 into a pre-defined structure. The task is narrow, the input is clean, and the expected output format is explicit.
Step 3 — Draft email. The JSON output of Step 2 becomes the input of a third prompt. This prompt is the creative step: take these structured facts and produce a polished, professional email. Notice that this step receives structured data, not the messy original paragraph. Because the facts have already been cleaned and extracted by Steps 1 and 2, Step 3 can focus entirely on tone, structure, and clarity. The model is not guessing which facts are important — they were already selected. It is not guessing which numbers are accurate — they were already validated. It just writes. And because writing is what language models do most naturally, Step 3 produces excellent results.
Why the connectors light up. In the demo, you see the vertical pipes change color as each step activates. This is a visual metaphor for data flow — the idea that information is moving from one node to the next. In the real code you will see later, this data flow is handled by LangChain’s pipe operator (|), which does the same thing: takes the output of the left component and passes it as input to the right component. The color change signals completion — “this step is done, its output is now flowing to the next.”
The key insight: each step only has one job. The summariser doesn’t worry about emails. The trend-extractor doesn’t worry about formatting. When every step is focused, every step does well.
How to Make the Chain Reliable
There’s one thing that can break a chain: messy data between steps.
If Step 1 outputs a rambling paragraph and Step 2 is trying to parse it as structured data — it fails. The output needs to be clean and predictable.
The fix is structured output. You tell the LLM: “output as JSON, nothing else.” Like this:
{
"trends": [
{
"trend": "Healthcare AI adoption",
"data": "67% of enterprise deployments"
},
{
"trend": "Finance AI adoption",
"data": "52% of enterprise deployments"
}
]
}
JSON is machine-readable. It can be passed directly into the next prompt without any ambiguity. This is what makes chains robust at scale.
Why JSON specifically? You could, in theory, pass plain text between steps. In simple chains, this works fine. But consider what happens when Step 2 in your chain is not another LLM prompt — it is a Python function, or a database query, or an API call. Plain text is not directly usable by code. JSON, on the other hand, can be parsed into a Python dictionary with a single line: data = json.loads(response). From there, you can access specific fields (data["trends"][0]["trend"]), validate that required fields exist, count items, or do any other programmatic operation. Structured output turns an LLM from a text-in-text-out black box into a proper node in a software pipeline.
How to enforce JSON output. The prompt instruction "No extra text, no explanation — JSON only" is the first line of defense. Most capable models (GPT-4, Claude, Gemini) will follow this instruction reliably when the task is well-defined. For extra reliability, newer LLM APIs support a feature called structured outputs or function calling, where you define a schema (a blueprint describing exactly what fields you expect) and the API guarantees the response matches that schema. LangChain exposes this via the .with_structured_output() method. This is the gold standard for production chains where a malformed response would break downstream code.
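Here is a minimal sketch of that schema-first approach, assuming a made-up Trend/TrendReport schema for the market-report example from the demo (the class and field names are illustrative, not part of any standard):

```python
# A sketch of schema-enforced output via .with_structured_output().
# The Trend / TrendReport classes are hypothetical; define whatever fields
# your own step is supposed to return.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class Trend(BaseModel):
    trend: str = Field(description="Name of the trend")
    data: str = Field(description="Supporting statistic")

class TrendReport(BaseModel):
    trends: list[Trend]

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(TrendReport)

# The result is a validated TrendReport object rather than a raw string:
# no json.loads(), no code-fence stripping, no malformed-JSON surprises.
report = structured_llm.invoke(
    "AI adoption grew 340% in 2024. Healthcare led at 67%, finance at 52%."
)
print(report.trends[0].trend)
```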
What happens when the JSON is malformed? This is real: sometimes the model will output almost-valid JSON with a trailing comma, or wrap the JSON in a Markdown code fence (a line of three backticks, optionally tagged json, before and after the object). The fix is to add a post-processing step — a Python function that strips common issues before passing to the next step. For example:
import re, json
def clean_json(raw: str) -> dict:
# Remove Markdown code fences if present
cleaned = re.sub(r'^```(?:json)?\s*', '', raw.strip())
cleaned = re.sub(r'\s*```$', '', cleaned)
return json.loads(cleaned)
This is an example of inserting regular Python code between LLM calls — one of the core powers of prompt chaining that we will return to in depth.
Assign a Role at Each Step
One more trick: give the model a role at each step. It anchors the model’s tone and focus.
- Step 1 prompt starts with: “You are a Market Analyst. Your sole task is to summarise…”
- Step 2 prompt starts with: “You are a Data Analyst. Extract trends from…”
- Step 3 prompt starts with: “You are a Business Writer. Draft a clear email…”
It sounds silly, but it works. The model “picks up” the persona and stays consistent within that step.
Why does role assignment work? When the model was trained, it saw enormous amounts of text written by people in different professional roles — analysts, lawyers, engineers, writers. The statistical patterns of how a “market analyst” writes are different from how a “business writer” writes. By telling the model which role to adopt, you are cueing it to use the patterns associated with that role. You are not programming a persona in a literal sense — you are activating a cluster of learned writing patterns that are appropriate for the task. The effect is real and measurable: a prompt that begins with a relevant role description consistently produces better outputs than the same prompt without a role.
System prompts vs. user prompts. In most LLM APIs, there are two types of prompt: the system prompt (sometimes called the system message), which is set at the beginning of a conversation and defines the model’s overall behavior, and the user prompt, which is the actual instruction for a given step. Role assignment belongs in the system prompt. Task-specific instructions (like “here is the text, extract the trends”) belong in the user prompt. This separation keeps your chains clean and easy to modify independently.
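In LangChain, this separation is expressed with ChatPromptTemplate.from_messages, which takes a list of (role, content) pairs. A minimal sketch using the Step 2 persona from above (the exact wording is illustrative):

```python
# A sketch of separating the role (system message) from the task (user message).
from langchain_core.prompts import ChatPromptTemplate

prompt_step2 = ChatPromptTemplate.from_messages([
    ("system", "You are a Data Analyst. You extract trends from market summaries "
               "and always respond with valid JSON."),
    ("user", "Extract the top trends from this summary:\n\n{summary}"),
])
```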
Where Prompt Chaining Applies
This pattern shows up everywhere once you recognise it.
Information Processing
Raw data → clean → extract entities → search database → generate report. Each arrow is a prompt.
A concrete example: imagine you receive PDF invoices from vendors. A one-step prompt to “read this invoice and give me the total, vendor name, and date as JSON” sounds simple — but PDFs are messy. Text extraction from PDFs produces garbled formatting. Numbers are written in different ways (“one thousand and fifty” vs “$1,050.00” vs “1050”). Abbreviations vary by vendor.
A chained approach handles this gracefully:
- Normalisation prompt: “Here is raw text extracted from a PDF invoice. Clean up the formatting and standardise any numbers — write out dollar amounts as plain integers.”
- Extraction prompt: “Here is a cleaned invoice. Extract vendor name, invoice date, and total amount as JSON with keys vendor, date, total_usd.”
- Validation step (regular Python code): Check that total_usd is a number, date is a parseable date, and vendor is a non-empty string. If any check fails, trigger a retry prompt specifically targeting the missing field (a sketch of this check appears below).
- Record-insertion step: Pass the validated JSON to a database insert function.
None of these steps is individually difficult. But combining them into one prompt would produce unreliable outputs. Chaining them produces a pipeline you can trust in production.
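To make the validation step concrete, here is a rough sketch in plain Python, assuming the extraction step has already returned a parsed dictionary with the three expected keys (the exact checks will depend on your data):

```python
# A sketch of the validation step: return the names of missing or invalid
# fields so a targeted retry prompt can ask for exactly those fields.
from datetime import date

def validate_invoice(record: dict) -> list[str]:
    """Return a list of field names that failed validation (empty list = valid)."""
    problems = []
    if not str(record.get("vendor", "")).strip():
        problems.append("vendor")
    try:
        date.fromisoformat(str(record.get("date", "")))
    except ValueError:
        problems.append("date")
    if not isinstance(record.get("total_usd"), (int, float)):
        problems.append("total_usd")
    return problems
```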
Complex Questions
“What caused the 1929 crash and how did government respond?”
You can’t answer that well in one go. You split it:
- Step 1: Summarise the economic causes of the 1929 crash.
- Step 2: Summarise the government’s policy responses.
- Step 3: Synthesise both summaries into one coherent answer.
Steps 1 and 2 can even run in parallel — two LLM calls at the same time — then synthesise at the end. This is faster and more reliable than one giant prompt.
Parallel chains are an important pattern that the simple linear model does not capture. In Python with LangChain, you can use RunnableParallel to run two chains simultaneously and collect both results. For a question like “What caused the 1929 crash and how did government respond?” you could run:
- Chain A: “Summarise the economic causes of the 1929 crash.”
- Chain B: “Summarise the government’s policy responses to the 1929 crash.”
Both chains execute at the same time (two separate API calls happening concurrently, like two tabs loading in your browser simultaneously). When both complete, a synthesis prompt receives both outputs: “Given these two summaries — economic causes and government responses — write a single coherent analysis.” The final answer is better than any single prompt could produce, and it took less wall-clock time than running A and B sequentially.
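A rough sketch of this pattern with LangChain's RunnableParallel (the prompt wording and model choice are illustrative):

```python
# A sketch of two chains running in parallel, then a synthesis step.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain_a = (
    ChatPromptTemplate.from_template("Summarise the economic causes of: {question}")
    | llm | StrOutputParser()
)
chain_b = (
    ChatPromptTemplate.from_template("Summarise the government's policy responses to: {question}")
    | llm | StrOutputParser()
)
synthesis = ChatPromptTemplate.from_template(
    "Economic causes:\n{causes}\n\nGovernment responses:\n{responses}\n\n"
    "Write a single coherent analysis combining both."
)

# RunnableParallel runs both sub-chains concurrently and returns a dict
# {"causes": ..., "responses": ...}, exactly what the synthesis prompt expects.
full_chain = RunnableParallel(causes=chain_a, responses=chain_b) | synthesis | llm | StrOutputParser()

answer = full_chain.invoke({"question": "What caused the 1929 crash and how did government respond?"})
print(answer)
```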
Data Extraction (with Retry Logic)
Prompt chaining isn’t always linear. You can add conditional steps:
- Try to extract fields from an invoice
- Check: did we get everything?
- If not → run another prompt specifically targeting the missing fields
- Repeat until valid
This is how you build reliable OCR pipelines. The LLM extracts text, normalises it (“one thousand and fifty” → 1050), then hands arithmetic to an external calculator. Each step does what it’s best at.
The retry logic pattern is worth spelling out in full because it is not obvious at first. Imagine your extraction prompt is supposed to return three fields: vendor, date, and total_usd. After Step 1, you parse the JSON in Python. You notice that total_usd is missing — the model returned null or omitted the field entirely. Instead of failing the entire pipeline, you run a targeted retry prompt: “The following invoice text was partially processed. We are specifically missing the total amount. Please extract only the total_usd field from this text.” This targeted prompt, because it has an extremely narrow focus, has a much higher success rate than re-running the full extraction from scratch. If this second attempt also fails, you can escalate — either to a human reviewer, or to a more powerful (and more expensive) model like GPT-4o. This is graceful degradation: the chain handles its own failures rather than crashing.
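A rough sketch of that loop in plain Python, assuming extraction_chain and retry_chain are LCEL chains that return JSON strings and that the field names match the invoice example above:

```python
# A sketch of targeted retries: re-ask only for the fields that came back
# missing, then escalate if the record is still incomplete.
import json

REQUIRED_FIELDS = ("vendor", "date", "total_usd")

def extract_with_retry(invoice_text, extraction_chain, retry_chain, max_retries=2):
    record = json.loads(extraction_chain.invoke({"invoice": invoice_text}))
    for _ in range(max_retries):
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if not missing:
            return record                      # everything present, we are done
        patch = json.loads(retry_chain.invoke({
            "invoice": invoice_text,
            "fields": ", ".join(missing),      # e.g. "total_usd"
        }))
        record.update({k: v for k, v in patch.items() if v})
    # Still incomplete after retries: hand off to a human or a stronger model.
    return record
```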
Content Generation
Blog posts, reports, documentation — all follow a natural chain:
- Generate topic ideas
- Pick one, generate an outline
- Write each section (using previous section as context)
- Final review pass
The reason section-by-section generation works better than one-shot generation is the same attention argument from earlier, applied to output length. A model asked to write a 5,000-word article in one shot must maintain coherence, avoid repetition, track its own argument, and manage transitions — all simultaneously. As the output grows longer, the early parts of the article start falling outside the zone of high attention. Contradictions creep in. Points made in section 1 are re-stated in section 4 as if for the first time.
Section-by-section generation solves this by giving each section the model’s full attention, while also feeding the previous section as context so the model knows where the argument has been. The human (or automated pipeline) handles the global structure via the outline, and the model handles the local structure of each section. Divide and conquer.
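A rough sketch of that loop, feeding the previous section back in as context (the outline and prompt wording are illustrative):

```python
# A sketch of section-by-section generation: each call writes one section
# and sees only its outline slot plus the most recent section for continuity.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)

section_prompt = ChatPromptTemplate.from_template(
    "You are writing an article titled '{title}'.\n"
    "Previous section (for continuity, do not repeat it):\n{previous}\n\n"
    "Now write only this section: {section}"
)
section_chain = section_prompt | llm | StrOutputParser()

outline = ["Introduction", "Why one big prompt fails", "How chaining works", "Conclusion"]
previous, sections = "(none yet)", []
for heading in outline:
    text = section_chain.invoke({"title": "Prompt Chaining", "previous": previous, "section": heading})
    sections.append(text)
    previous = text

article = "\n\n".join(sections)
```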
Conversations with Memory
Each conversation turn becomes a prompt that includes the previous exchange. The chain is the memory. This is how conversational AI maintains context across a long back-and-forth.
This is worth pausing on because it reveals something fundamental: LLMs have no built-in memory. Every API call is completely stateless. The model does not “remember” what you said in a previous message unless you include that previous message in the current prompt. The “conversation history” feature you see in ChatGPT is implemented by your client software, not the model — the software is assembling a prompt that includes all previous turns and sending the whole thing to the model each time.
This means that long conversations have a fundamental limitation: eventually, the conversation history exceeds the context window. The fix is a memory summarisation chain: when the conversation gets too long, an intermediate prompt summarises the oldest turns (“Summarise the key points from these earlier messages in two sentences”), and the summary replaces the full history. The model gets a compact representation of the past that fits within its context window. This is a chain within a chain — a meta-level prompt managing the context of another prompt.
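A rough sketch of such a summarisation chain (the turn threshold and prompt wording are illustrative):

```python
# A sketch of history compaction: once the conversation exceeds a threshold,
# the oldest turns are replaced by a two-sentence summary.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

summarise_chain = (
    ChatPromptTemplate.from_template(
        "Summarise the key points from these earlier messages in two sentences:\n\n{old_turns}"
    )
    | llm | StrOutputParser()
)

def compact_history(history: list[str], max_turns: int = 10) -> list[str]:
    """Keep the most recent turns verbatim; compress everything older."""
    if len(history) <= max_turns:
        return history
    old, recent = history[:-max_turns], history[-max_turns:]
    summary = summarise_chain.invoke({"old_turns": "\n".join(old)})
    return ["(Summary of earlier conversation: " + summary + ")"] + recent
```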
Code Generation
- Understand the request → write pseudocode
- Generate code from pseudocode
- Run linter/tests (this is a tool call, not an LLM call)
- If errors → fix them with another prompt
- Add docs
The critical thing here: you can insert regular code between LLM calls. Run a linter. Call an API. Do a database lookup. The chain isn’t just LLM calls — it’s a proper pipeline.
Tool calls deserve a full explanation because they are one of the most powerful concepts in agentic AI. A tool call (also called a function call in some APIs) is when the LLM generates a structured request to run a piece of external code, rather than generating prose. For example, instead of writing “I would run the tests now,” the model might output: {"tool": "run_tests", "args": {"file": "test_main.py"}}. Your Python code intercepts this output, actually runs the tests, captures the output, and feeds the test results back into the next prompt. The LLM never “runs” anything itself — it only generates requests. Your Python code is the executor. This separation of LLM-as-reasoner and Python-as-executor is fundamental to building reliable AI systems. It means that any failure in the external tool is clearly isolated in the code layer, not buried inside an LLM’s output. You can handle tool failures, retries, and fallbacks with standard Python exception handling.
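A rough sketch of that split, hand-parsing the JSON shape from the example above (a production system would normally use the API's native function-calling support rather than parsing raw model output like this):

```python
# A sketch of the executor side: the model only emits a request like
# {"tool": "run_tests", "args": {"file": "test_main.py"}}; Python runs it.
import json
import subprocess

def run_tests(file: str) -> str:
    """Actually run pytest on the given file and return its combined output."""
    proc = subprocess.run(["pytest", file], capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"run_tests": run_tests}

def execute_tool_call(raw_llm_output: str) -> str:
    """Parse the model's tool request, run the matching function, return the result as text."""
    request = json.loads(raw_llm_output)
    tool_fn = TOOLS[request["tool"]]
    try:
        return tool_fn(**request["args"])
    except Exception as exc:
        # Tool failures stay in the code layer, where normal exception
        # handling, retries, and fallbacks apply.
        return "Tool error: " + str(exc)

# The returned string is fed into the next prompt so the model can react to it.
```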
The Code
We’ll build a two-step LangChain pipeline that:
- Reads a laptop spec in plain English
- Extracts the specs from it
- Formats them as a clean JSON object
Think of each step as one worker doing one job, then passing a note to the next worker.
Install
pip install langchain langchain-openai python-dotenv
Let’s break down what each of these packages actually does and why you need each one.
langchain is the core framework. It provides the abstractions that make chaining possible — things like ChatPromptTemplate, StrOutputParser, RunnablePassthrough, and the | pipe operator. If LangChain were a toolbox, this package is the box itself, plus the most commonly used tools.
langchain-openai is a plugin (also called an integration or provider). LangChain is designed to be model-agnostic — you can swap out OpenAI for Anthropic’s Claude, Google’s Gemini, or a local model running on your laptop. The integration packages handle the specifics of each provider’s API. langchain-openai knows how to format prompts in the structure that OpenAI’s API expects, how to handle authentication with your API key, how to handle API errors and retries, and how to parse the response format that OpenAI returns. You install it separately from langchain because different teams use different models, and you do not want to install code for every possible model provider — just the one you are using.
python-dotenv solves a very specific problem: how do you give your Python script access to your API key without typing it directly into your code? If you hardcode api_key = "sk-abc123..." in your script and then push your code to GitHub, your secret key is now public and anyone can use it (and you will get a bill). The solution is a .env file — a plain text file that lives on your computer but is never committed to version control. python-dotenv reads this file and loads its contents into the program’s environment variables — a set of key-value pairs that any program running on your computer can access. LangChain’s OpenAI integration automatically looks for OPENAI_API_KEY in the environment, so once load_dotenv() is called, you never need to mention the key again in your code.
What happens if you skip the install step? Python raises an ImportError at the top of your script: "ModuleNotFoundError: No module named 'langchain'". The fix is always to run the pip install command in the same environment (virtual environment or system Python) that you are using to run the script.
Set Up Your API Key
# .env file (never commit this)
OPENAI_API_KEY=sk-...
from dotenv import load_dotenv
load_dotenv() # reads .env and puts OPENAI_API_KEY into the environment
Line-by-line breakdown:
The .env file is not Python — it is just a simple list of KEY=VALUE pairs, one per line. The format is deliberately minimal. There are no quotes around the value (though most implementations accept them). There are no spaces around the =. This file should live in the root directory of your project (the same folder as your Python scripts).
from dotenv import load_dotenv imports the load_dotenv function from the python-dotenv package. The from ... import ... syntax means “go to this package, find this specific thing, and make it available in the current file.” After this line, you can call load_dotenv as if you defined it yourself.
load_dotenv() runs the function with no arguments. It finds the .env file by looking in the current directory and each parent directory in turn, reads it line by line, and for each KEY=VALUE pair, calls the equivalent of os.environ["KEY"] = "VALUE". After this line runs, your API key is accessible anywhere in your program via os.environ.get("OPENAI_API_KEY"), and LangChain will find it automatically when it needs to authenticate.
What happens if you skip load_dotenv()? You get an AuthenticationError when you first try to call the model, usually with a message like “No API key provided.” The error appears at runtime, not when you import the library, because the library only needs the key when it makes an actual API call — not when you import it.
How do you keep the .env file out of version control? Create a .gitignore file in your project root with .env on a line by itself. This tells git to ignore the file entirely. As a safety net, GitHub’s secret-scanning feature will alert you if you accidentally push a file that looks like it contains an API key, but prevention is far better than detection.
Never, ever hardcode your API key directly in your Python code as a string literal. Not even temporarily. Not even "just for testing." The risk of accidentally committing it to a git repository is too high, and the habit of hardcoding secrets is too easy to carry forward into production systems. Always use environment variables loaded from a .env file that is listed in your .gitignore.
Initialise the Model
from langchain_openai import ChatOpenAI
# temperature=0 means deterministic — the model picks the most likely token
# every time, no randomness. Good for extraction tasks.
# temperature=1 = more creative/varied. Bad for structured data extraction.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
Line-by-line breakdown:
from langchain_openai import ChatOpenAI imports the ChatOpenAI class. A class in Python is a blueprint for creating objects. Think of it like a car blueprint: the blueprint is not a car, but you can use it to manufacture as many cars as you want, each with different specifications. ChatOpenAI is the blueprint for an object that knows how to talk to OpenAI’s chat API.
ChatOpenAI(model="gpt-4o-mini", temperature=0) creates an instance of that class — an actual object you can use, like manufacturing one specific car from the blueprint. The parentheses call the class’s __init__ method (its constructor) with the specified arguments. Let’s look at each argument in detail:
model="gpt-4o-mini" specifies which OpenAI model to use. This is a keyword argument — model is the name of the parameter, and "gpt-4o-mini" is the value. OpenAI offers multiple models with different capability/cost trade-offs:
- "gpt-4o-mini" — fast, cheap, excellent for straightforward tasks like extraction and formatting. The right choice for learning and simple pipelines.
- "gpt-4o" — the full, more capable and more expensive version. Use for complex reasoning tasks.
- The string value must exactly match one of OpenAI’s model IDs. You can find the current list on platform.openai.com/docs/models.
temperature=0 is the most important parameter to understand deeply. Temperature controls the randomness of the model’s output. Technically, it is a parameter that modifies the probability distribution over possible next tokens before the model samples from that distribution.
Here is an intuition for what that means. After processing your prompt, the LLM computes a probability for every possible next token — maybe "CPU" has probability 40%, "processor" has probability 35%, "chip" has probability 20%, and everything else shares the remaining 5%. Before sampling, these probabilities are modified by the temperature value:
- At temperature=0: the distribution is made infinitely “sharp” — the model always picks the single most-probable token ("CPU" in this example). The output is completely deterministic: the same prompt always produces the same output.
- At temperature=1: the model samples from the distribution as-is. "CPU" comes up 40% of the time, "processor" 35% of the time, and so on. Output varies between runs.
- At temperature=2: the distribution is flattened, making low-probability tokens more likely. Output becomes more surprising and creative, but also potentially more incoherent.
Why use temperature=0 for extraction? Because extraction is not creative. There is exactly one correct answer: the CPU speed, RAM amount, and storage capacity that appear in the input text. You want the model to reliably give you that answer every time, not a different phrasing each run. temperature=0 locks the model into its “most confident” response.
Why use higher temperature for creative tasks? When you ask “give me 10 blog post ideas about machine learning,” you want variety. Running the same prompt at temperature=0.8 three times will give you three different sets of ideas — which is exactly what you want for brainstorming.
Notice that just writing llm = ChatOpenAI(...) does not send any request to OpenAI. You are just configuring the object. The object stores your model preference and temperature setting, and it knows where to find your API key. Nothing happens over the network until you actually call the model with .invoke().
What happens if you use temperature=1 for extraction? The model’s output will vary between runs. Sometimes it will correctly output "3.5 GHz octa-core". Sometimes it might write "octa-core processor running at 3.5 GHz". These are semantically equivalent but structurally different — and if the next step in your chain is programmatically parsing a specific format, that variation can break your pipeline.
What happens if you misspell the model name? You get an error from the API when you first call it, something like "The model 'gpt-4o-mni' does not exist." The error appears at runtime (not at import time) and is informative enough that you can fix it immediately.
A very common mistake is to set temperature=0 on every step, including creative steps like "draft an email." For Step 3 in our demo chain, you might want temperature=0.3 or 0.5 to get natural-sounding prose rather than the most statistically average sentence the model has ever seen. Match your temperature to your task: 0 for structured extraction, 0.3–0.7 for balanced prose, 0.7–1.0 for brainstorming. In a chain, you can create multiple ChatOpenAI instances with different temperatures and use the right one for each step.
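A minimal sketch of that per-step pattern (same model, two configurations):

```python
# One configured model object per kind of task; both can live in the same chain.
from langchain_openai import ChatOpenAI

extraction_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)    # structured extraction
writing_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)     # natural-sounding prose
```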
Define the Prompts
from langchain_core.prompts import ChatPromptTemplate
# Prompt 1: extract specs from the raw text
# {text_input} is a placeholder — gets replaced with actual text at runtime
prompt_extract = ChatPromptTemplate.from_template("""
You are a hardware spec extractor. Your only job is to pull out technical
specifications from the text below. Be precise. Output only what's in the text.
Text: {text_input}
""")
# Prompt 2: take the extracted specs and structure them as JSON
# {specifications} gets filled with the OUTPUT of prompt 1
prompt_transform = ChatPromptTemplate.from_template("""
You are a data formatter. Take the hardware specifications below and output
a JSON object with exactly three keys: "cpu", "memory", "storage".
No extra text, no explanation — JSON only.
Specifications: {specifications}
""")
Line-by-line breakdown:
from langchain_core.prompts import ChatPromptTemplate imports the ChatPromptTemplate class from LangChain’s core library. The langchain_core package contains the fundamental building blocks that work regardless of which LLM provider you use — this is the model-agnostic layer that sits above the provider-specific integrations like langchain-openai.
ChatPromptTemplate.from_template(...) is a class method — a method you call directly on the class itself, not on an instance. It is like asking the factory to produce a product for you without first creating an instance of the factory. It takes a string (your prompt template) and returns a fully configured ChatPromptTemplate object that knows how to format itself into the message structure that OpenAI’s API expects.
The key feature of ChatPromptTemplate is that it supports placeholders: text wrapped in curly braces like {text_input} that gets replaced with actual values at runtime. This is called a template literal — a string with named slots that can be filled in dynamically. If you have ever used Python’s f-strings (f"Hello, {name}!"), placeholders work the same way conceptually, except the filling-in happens when you call .invoke() rather than at the point of the string definition.
Let’s look at the first template carefully:
You are a hardware spec extractor. Your only job is to pull out technical
specifications from the text below. Be precise. Output only what's in the text.
Text: {text_input}
"You are a hardware spec extractor." is the role assignment described earlier. It tells the model which cluster of learned writing patterns to activate.
"Your only job is to pull out technical specifications" is a scope constraint. It explicitly tells the model not to do anything else — not to interpret, not to add context, not to make inferences about what the specifications might mean. This reduces hallucinations because it forbids the model from inventing information that was not in the input.
"Be precise. Output only what's in the text." is a fidelity constraint. It reinforces that the model should not summarize, rephrase, or infer — just extract verbatim. The word “only” is important here: it is doing real work in limiting the model’s behaviour.
Text: {text_input} is where the actual input data goes. The {text_input} placeholder is replaced with the actual laptop description when .invoke() is called. The label Text: before the placeholder is good practice — it clearly signals to the model where the data begins, distinguishing it from the instructions above. This kind of clear visual separation between instructions and data helps the model attend to each part appropriately.
The second template has the same structure, but the placeholder is {specifications} — and critically, this placeholder gets filled not with user input but with the output of the first chain step. This is the core mechanic of chaining: the output variable of one step is the input variable of the next.
"No extra text, no explanation — JSON only." in the second template is a particularly strong constraint. It explicitly bans the two most common forms of unwanted output: preamble (“Here are the extracted specs:”) and postamble (“I hope that helps!”). These two phrases, harmless to a human reader, would make json.loads() raise a JSONDecodeError immediately.
Why does from_template() return an object rather than a string? Because a ChatPromptTemplate object is more than a string — it knows how to format itself into the messages array that OpenAI’s API expects (a list of role-tagged messages: [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}]). When you use the | operator to connect it to an LLM, LangChain automatically calls the template’s formatting logic before sending the request. You never have to manually construct the message format.
What is the triple-quoted string (the """...""")? In Python, a triple-quoted string can span multiple lines without requiring explicit newline characters. It preserves all whitespace and line breaks exactly as written. This is useful for prompt templates because prompts often need to be multi-line and clearly formatted. The blank line between the instructions and the Text: {text_input} line is intentional — it creates visual separation in the prompt that helps the model distinguish instructions from data.
What happens if you use a placeholder name that does not match what you pass in .invoke()? LangChain raises a KeyError with a message telling you which key was expected but not found. For example, if your template has {text_input} but you call .invoke({"input_text": "..."}), you get KeyError: 'text_input'. The names must match exactly, including capitalisation.
A subtle but common mistake is to write instructions that are too vague. Compare "Extract the specs" with "Pull out the CPU speed, RAM amount, and storage capacity from the text below. If any of these three fields are not mentioned in the text, write 'not found' for that field." The second version is far more reliable because it tells the model exactly what fields to look for and what to do when data is absent. Vague prompts produce vague outputs. Precise prompts produce precise outputs. Every word in a prompt is a token that costs money — make every token count.
Wire the Chain with LCEL
This is the key step. LangChain’s LCEL (LangChain Expression Language) uses the | pipe operator — same as Unix pipes — to connect components.
from langchain_core.output_parsers import StrOutputParser
# StrOutputParser converts the model's response object → plain string
# Without it, you'd get an AIMessage object, not a string
# --- Chain 1: extract specs ---
# Flow: prompt_extract fills {text_input}
# → sends to llm → gets AI response
# → StrOutputParser strips it to plain text
extraction_chain = prompt_extract | llm | StrOutputParser()
# --- Chain 2: full pipeline ---
# {"specifications": extraction_chain} runs extraction_chain first,
# then puts the result into {specifications} of prompt_transform
full_chain = (
{"specifications": extraction_chain} # ← runs chain 1, stores result
| prompt_transform # ← fills {specifications}
| llm # ← sends to model
| StrOutputParser() # ← strips to plain string
)
Line-by-line breakdown — this is the most important section:
from langchain_core.output_parsers import StrOutputParser imports the StrOutputParser class. Let’s understand what problem this solves.
When you call an OpenAI-compatible model through LangChain, the raw response you get back is not a plain Python string. It is an AIMessage object — a Python object that contains the text content, plus metadata like token counts, model name, finish reason, and more. This metadata is useful for advanced use cases, but for our purposes, we just want the text. StrOutputParser is a component that extracts the .content attribute (the text content) from the AIMessage and returns it as a plain Python str. Without it, passing the LLM output to the next prompt would fail because ChatPromptTemplate expects a string to fill its {specifications} placeholder — not an AIMessage object.
Now let’s look at extraction_chain = prompt_extract | llm | StrOutputParser().
The | pipe operator in LangChain is the same character used in Unix shell pipelines (as in cat file.txt | grep "error" | wc -l). Both use the same concept: the output of the left side becomes the input of the right side. In LangChain, this works through a mechanism called the Runnable interface. Every component in LangChain — prompts, LLMs, output parsers, chains — implements this interface, which means every one of them has an .invoke(input) method that accepts an input, processes it, and returns an output.
When you write A | B, LangChain creates a new object — a RunnableSequence — that, when .invoke(x) is called on it, does: B.invoke(A.invoke(x)). So prompt_extract | llm | StrOutputParser() creates a pipeline where:
- prompt_extract.invoke({"text_input": "..."}) formats the template with the provided data, producing a list of messages in the format OpenAI expects.
- llm.invoke(messages) sends those messages to OpenAI’s API over the internet and returns an AIMessage object.
- StrOutputParser().invoke(ai_message) extracts the .content text and returns a plain Python string.
This is called declarative composition — you are declaring what you want to happen (the data flow), not how to execute it step by step (imperative code). The framework handles the execution details, error handling, and connection between components.
Why is the chain “lazy”? After writing extraction_chain = prompt_extract | llm | StrOutputParser(), nothing has happened. No API calls have been made. No data has flowed. No money has been spent. The | operator does not execute anything — it creates a description of what should happen — a pipeline object — and stores it in the variable extraction_chain. Execution only happens when you call .invoke(). This design is intentional and powerful: it lets you define, modify, and compose chains before running them, test individual components, and share pipeline definitions across your codebase without triggering API calls.
Now let’s look at the more complex full_chain definition:
full_chain = (
{"specifications": extraction_chain}
| prompt_transform
| llm
| StrOutputParser()
)
{"specifications": extraction_chain} is a Python dictionary where the value is an entire chain (extraction_chain), not a static string. LangChain recognises this pattern: when it sees a dictionary with runnable values in a pipe expression, it automatically wraps it in a RunnableParallel object. When this step executes, it runs each runnable value, collects the outputs, and returns a dictionary where each key maps to the corresponding output. So {"specifications": extraction_chain} produces {"specifications": "<the extracted text>"} at runtime — which is exactly the format that prompt_transform needs to fill its {specifications} placeholder.
The line | prompt_transform takes that dictionary and passes it to ChatPromptTemplate.invoke({"specifications": "..."}). The template fills in the placeholder and returns the formatted prompt messages.
| llm sends the formatted messages to OpenAI and returns an AIMessage.
| StrOutputParser() extracts the text content and returns the final plain string — the JSON-formatted laptop specs.
What is RunnablePassthrough and why does it exist? RunnablePassthrough is a special runnable that passes its input through to its output completely unchanged. This sounds useless at first — why would you want a component that does nothing? — but it becomes essential in specific chain patterns. Imagine you want to pass both the original user input and a processed version of it to a synthesis step. Without RunnablePassthrough, you lose the original input as soon as the first transformation runs. With it, you can write:
from langchain_core.runnables import RunnablePassthrough
chain = (
{"original": RunnablePassthrough(), "processed": processing_chain}
| synthesis_prompt
| llm
| StrOutputParser()
)
RunnablePassthrough() receives the original input and passes it through unchanged, while processing_chain transforms it. Both are available to synthesis_prompt. Think of RunnablePassthrough as a fork in the pipeline that lets you preserve a copy of the data at a particular point in the flow.
What happens under the hood when you call .invoke()? LangChain constructs a directed acyclic graph (DAG) of the components — nodes are components, edges are data dependencies. It performs a topological sort to determine the correct execution order. For our full_chain, the execution order is: extraction_chain → prompt_transform → llm → StrOutputParser. If any step raises an exception, the exception propagates up and you see the error in your Python terminal, along with the LangChain stack trace showing which component failed.
A very common mistake when first using LCEL is to forget | StrOutputParser() at the end of a sub-chain whose output needs to be used as a string. Your code runs without errors up to the point where the AIMessage object reaches something that expects a string — then you get a confusing error like AttributeError: 'AIMessage' object has no attribute 'format'. Always add | StrOutputParser() at the end of any sub-chain whose output needs to fill a prompt placeholder. If you intentionally want the full AIMessage object (for its token count metadata, for example), omit it — but know what you are doing and handle the object type explicitly in the next step.
Run It
result = full_chain.invoke({
"text_input": "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD."
})
print(result)
.invoke() triggers the whole pipeline. Under the hood, this is what happens:
1. prompt_extract receives: text_input = "New laptop: 3.5 GHz..."
2. llm receives the filled prompt → returns: "CPU: 3.5 GHz octa-core..."
3. StrOutputParser strips it to a plain string
4. prompt_transform receives: specifications = "CPU: 3.5 GHz octa-core..."
5. llm receives the filled prompt → returns: {"cpu": "3.5 GHz octa-core"...}
6. StrOutputParser strips it → final output
Output:
{
"cpu": "3.5 GHz octa-core",
"memory": "16GB",
"storage": "1TB NVMe SSD"
}
Line-by-line breakdown of the .invoke() call:
full_chain.invoke({"text_input": "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD."}) starts the pipeline. The argument to .invoke() is a Python dictionary. The key "text_input" must exactly match the placeholder name in prompt_extract’s template string. If your chain had multiple external inputs (values not computed by intermediate steps), you would include all of them in this dictionary.
The input string "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD." is deliberately simple to make the demo clear. In a real application, this input might come from a web form, a CSV file, a database query, a PDF extraction library, or any other data source. The chain does not care where the input came from — it just needs the string.
result = ... captures the final output of the full chain, which is a Python string containing a JSON-formatted object. Notice that result is a str (a string), not a Python dictionary. If you want to programmatically access individual fields, you need to parse it: data = json.loads(result). After parsing, data["cpu"] gives you "3.5 GHz octa-core", data["memory"] gives you "16GB", and so on.
print(result) displays the string to your terminal. The output will look like valid JSON with indentation because the model typically includes formatting when generating JSON.
What happens if the LLM wraps its JSON in Markdown code fences? Some models, especially when prompted to output JSON, will wrap it in a Markdown code block like:
```json
{
"cpu": "3.5 GHz octa-core",
...
}
```
This is valid Markdown but invalid JSON. json.loads(result) would raise a json.JSONDecodeError. The fix is a post-processing step to strip the code fence markers before parsing. Alternatively, use LangChain’s JsonOutputParser class, which handles JSON extraction robustly including stripping code fences.
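A rough sketch of the JsonOutputParser option, reusing the components defined earlier:

```python
# A sketch of swapping StrOutputParser for JsonOutputParser on the final step,
# so the chain returns a Python dict directly (code fences are handled for you).
from langchain_core.output_parsers import JsonOutputParser

json_chain = (
    {"specifications": extraction_chain}
    | prompt_transform
    | llm
    | JsonOutputParser()
)

data = json_chain.invoke({
    "text_input": "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD."
})
print(data["cpu"])   # already a dict, no json.loads() needed
```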
How long does .invoke() take? A round trip to OpenAI’s API typically takes between 0.5 and 5 seconds, depending on network conditions, server load, model choice, and the length of the generated response. Our full_chain makes two API calls (one for each LLM step), so total time is roughly the sum of both. For a long-running pipeline with many steps, latency accumulates — which is a strong reason to run independent steps in parallel when possible.
What happens if the API is down or returns an error? LangChain surfaces the underlying openai library exception — typically an openai.APIConnectionError, openai.RateLimitError, or openai.APIStatusError. In production, you wrap .invoke() in a try-except block and implement retry logic with exponential backoff. LangChain also has built-in retry support via the .with_retry() method.
Mistake 1: Passing the wrong key name to .invoke(). If your template uses {text_input} but you call .invoke({"input": "..."}), you get a KeyError. Check that every key in your .invoke() dictionary exactly matches a placeholder in your template.
Mistake 2: Trying to use result as a dictionary without parsing it. result is a Python string, not a dictionary. Accessing result["cpu"] raises a TypeError. Parse first: data = json.loads(result), then data["cpu"].
Mistake 3: Not handling API errors. API calls fail. Networks drop. Rate limits hit. Always wrap .invoke() in a try-except block in any code that runs in production or is important enough to notice when it breaks.
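A rough sketch of defensive invocation, combining LangChain's built-in .with_retry() with a try-except (the retry count is illustrative):

```python
# A sketch of a defensive .invoke(): retry transient failures, then catch
# whatever still goes wrong so the pipeline fails gracefully.
import json

robust_chain = full_chain.with_retry(stop_after_attempt=3)

try:
    result = robust_chain.invoke({
        "text_input": "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD."
    })
    data = json.loads(result)              # parse before using fields
except json.JSONDecodeError:
    print("Model returned malformed JSON; add a cleaning step or retry.")
except Exception as exc:                   # network errors, rate limits, auth failures
    print("Chain failed:", exc)
```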
What You Get at Each Step
To see intermediate outputs (helpful for debugging), run each chain separately:
# See what extraction_chain produces before it hits the formatter
raw_extraction = extraction_chain.invoke({
"text_input": "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD."
})
print("After step 1:", raw_extraction)
# → "CPU: 3.5 GHz octa-core\nRAM: 16GB\nStorage: 1TB NVMe SSD"
# Now run the full thing
final = full_chain.invoke({"text_input": "..."})
print("After step 2:", final)
# → {"cpu": "3.5 GHz octa-core", "memory": "16GB", "storage": "1TB NVMe SSD"}
This is one of the best things about LCEL — any sub-chain is itself a runnable. You can test each step independently before connecting them.
Line-by-line breakdown of debugging:
raw_extraction = extraction_chain.invoke(...) runs only the first chain — the extraction step — without the formatting step. This is powerful for debugging because it lets you verify that Step 1 is working correctly before worrying about Step 2. If raw_extraction contains garbage output, you know the problem is in the extraction prompt, not the formatting prompt. If raw_extraction is clean, you know the problem (if any) is in the transition or in Step 2’s prompt.
print("After step 1:", raw_extraction) should give you something like:
After step 1: CPU: 3.5 GHz octa-core
RAM: 16GB
Storage: 1TB NVMe SSD
This is plain text — not JSON. It is loosely structured, but a human can read it and verify it is correct. The extraction chain’s job is done: it pulled the right facts out of the input sentence. Now the formatting chain’s job is to convert those facts into JSON.
This incremental testing approach is one of the most important practices in building LLM pipelines. Never build a 10-step chain and test it all at once. Build Step 1, test it. Build Step 2, test it with the known-good output of Step 1. Add Step 3, test it in isolation. Only connect steps once you have verified each one individually. If something breaks in the connected chain, you immediately know which connection is the problem.
What if Step 1’s output is messy? For example, suppose raw_extraction comes back as:
Here are the technical specifications I found in the text:
- CPU: 3.5 GHz octa-core processor
- Memory: 16 gigabytes of RAM
- Storage: 1 terabyte NVMe SSD
I hope that helps!
The model added preamble and postamble despite your instruction to “output only what’s in the text.” The fix is to make the constraint more explicit in the prompt: "Output the specifications directly, one per line, with no preamble, no postamble, and no explanation." The pattern is always the same: observe the failure mode, diagnose which instruction is being violated, strengthen that instruction, and test again.
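Concretely, the strengthened extraction prompt might look like this. The exact wording is illustrative; tune it against your own failure cases:
```python
from langchain_core.prompts import ChatPromptTemplate

prompt_extract_v2 = ChatPromptTemplate.from_template(
    "Extract the technical specifications from the following text. "
    "Output the specifications directly, one per line, "
    "with no preamble, no postamble, and no explanation.\n\n"
    "Text: {text_input}"
)
```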
Mistake 1: Testing the full chain before testing individual steps. When something goes wrong in a multi-step chain, you will not know which step caused it without testing steps individually. Always verify each sub-chain produces correct output before wiring it to the next step.
Mistake 2: Forgetting that extraction_chain.invoke() costs API credits. Every call to .invoke() on any chain that includes an llm makes a real API call and costs real money (fractions of a cent per call, but it adds up during development). Use short test inputs to keep development costs low. Use LangSmith or similar observability tools to track costs across your testing sessions.
Mistake 3: Assuming that passing a step once means it will always pass. LLMs are probabilistic. Even with temperature=0, edge cases in input phrasing can trigger unexpected outputs. Test with a diverse set of inputs that cover the range of real-world variations you expect to encounter.
Why This Beats One Big Prompt
Here’s simulated data from running both approaches 60 times on 60 different laptop descriptions. The Y-axis is output quality (0 = completely broken JSON or wrong fields, 1 = perfect extraction).
Look across the 60 experiments and notice:
- Chain (cyan): consistently above the 0.70 threshold. Reliable.
- Single-prompt (red): scattered. Many fall below acceptable quality.
This gap is why chaining exists. It’s not magic — it’s just focus at each step.
Reading the Chart: What the Data Is Telling You
Let’s make sure you understand every element of this chart before moving on, because the visual tells a story that is worth reading carefully.
The X-axis (horizontal) represents individual experimental runs — 60 different laptop descriptions, each with slightly different phrasing, different levels of detail, and different potential for ambiguity. Experiment 1 might be a clean, simple sentence like our example. Experiment 20 might describe the RAM in gigabytes using a different abbreviation, or mention storage in terabytes, or use an unusual brand name for the processor.
The Y-axis (vertical) is the quality score, ranging from 0.30 (at the bottom) to 0.90 (at the top). A score of 1.0 means the output was perfectly formatted, correctly extracted JSON with all three keys present and correct. A score of 0.0 means the output was completely unusable. The chart focuses on the range 0.30–0.90 because scores outside this range are rare in practice.
The dashed yellow line at 0.70 is the quality threshold — the minimum acceptable score for the output to be usable in a downstream application. An output below this line has either wrong values, missing keys, or invalid JSON syntax — all of which would break any code trying to use it programmatically. 0.70 is a common industry threshold for “acceptable reliability” in automated pipelines.
The cyan dots represent the chained approach. Notice how they cluster tightly above 0.70, with only a small number of individual experiments dipping below. The chained approach is consistent — it produces reliable results across a wide variety of inputs, including the unusual ones.
The red dots represent the single-prompt approach. Notice how they are scattered across the full vertical range. Some experiments produced excellent results (above 0.70), but many — roughly half — fell below the quality threshold. This inconsistency is the critical problem with the one-shot approach. A pipeline that works 50% of the time is not a pipeline you can rely on.
Why does the single-prompt approach have such high variance? Because different laptop descriptions trigger different failure modes in the same prompt. A very simple description might work fine — the model has seen millions of similar examples and handles it confidently. A description with an unusual phrasing might cause the model to produce malformed JSON or include extra commentary. A description with multiple numbers might cause the model to mix up which number belongs to which field. The single-prompt approach has no intermediate checkpoint where these problems can be caught and corrected before they reach the output.
The chained approach absorbs these variations because each step handles only one type of challenge. The extraction step’s only concern is getting the right values out of ambiguous text. The formatting step’s only concern is converting clean text to JSON. Each step has a narrow, well-defined job, which means each step is robust across a much wider range of inputs.
Why Chaining Works: The Deep Dive
We’ve seen the empirical evidence — the chart shows chaining outperforms single-prompt approaches. Now let’s understand the theory — the fundamental reasons why step-by-step processing produces better results than one-shot generation.
Reason 1: Attention Concentration
As established in the opening section, an LLM’s attention mechanism gives each token in the context a weight. In a single complex prompt with many instructions, each instruction receives a fraction of the total attention budget. In a chained prompt where each step has one instruction, that instruction receives virtually all of the attention budget for that step. The model does a better job not because it is any smarter in one approach than the other, but because the focused prompt lets it apply its existing intelligence more effectively.
A single-prompt approach is like illuminating a room with a floodlight: everything is visible at low brightness, and nothing stands out. A chained approach is like illuminating one specific detail at a time with a focused flashlight: each item of interest is brightly lit in its turn, and you can see it clearly. The amount of light is the same (the model's capacity is the same), but the result is dramatically different. The model's attention is the beam — narrow it down and what it illuminates comes into sharp focus.
Reason 2: Error Isolation and Correction
In a chain, you can validate the output of each step before passing it to the next. This creates natural checkpoints — points in the pipeline where a Python function examines the data and decides whether it is safe to continue. If Step 2 produces malformed JSON, you can catch this with json.loads() in a try-except block and trigger a retry prompt — without re-running Step 1 or losing its results. In a single-prompt approach, there are no checkpoints. If the model makes a wrong assumption midway through its generation, the error propagates silently into the final output. You only discover the problem at the very end, by which point it is too late to pinpoint where things went wrong.
Think of checkpoints as error firewalls: each step can only pass well-formed data to the next step. A bad output at Step 2 never gets to corrupt Step 3’s output, because Python code catches the bad output before it reaches Step 3.
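A minimal sketch of such a checkpoint, assuming the step's output is supposed to be JSON (the function name and error handling are illustrative):
```python
import json

def checkpoint_json(raw: str) -> dict:
    """Error firewall between steps: only well-formed JSON gets through."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # In a real pipeline you would retry only the failing step here,
        # without re-running the earlier steps or discarding their results.
        raise ValueError(f"Step produced invalid JSON: {err}") from err

raw_output = full_chain.invoke(
    {"text_input": "New laptop: 3.5 GHz octa-core CPU, 16GB RAM, 1TB NVMe SSD."}
)
specs = checkpoint_json(raw_output)   # raises before any downstream step runs
```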
Reason 3: Specialisation Per Step
Different tasks have different optimal settings. Extraction tasks benefit from temperature=0 (deterministic, precise). Creative tasks benefit from higher temperature (varied, expressive). Role assignments vary — a summarizer needs a different persona than an email writer. In a single prompt, you choose one set of settings for the entire task. In a chain, you optimise each step independently. Step 1 might use temperature=0, model="gpt-4o-mini". Step 3 might use temperature=0.5, model="gpt-4o" for higher quality creative output. This per-step optimisation is simply impossible in a single-prompt approach.
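Sketched in code, where the model names and temperatures are illustrative choices rather than recommendations:
```python
from langchain_openai import ChatOpenAI

# Step 1: extraction wants deterministic output; a small, cheap model is usually enough.
extract_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Step 3: a creative rewrite can tolerate variation and benefit from a stronger model.
creative_llm = ChatOpenAI(model="gpt-4o", temperature=0.5)

# Each step in the chain is then built with its own, separately tuned model.
```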
Reason 4: The Opportunity to Insert Code
This is the most underrated advantage of chaining. Between any two LLM calls, you can execute arbitrary Python code. This means:
- Validation: Run json.loads() to verify JSON is syntactically valid. Check that all required fields are present. Verify numeric values are within expected ranges.
- Transformation: Convert units ("1TB" → 1024), normalise strings (uppercase → lowercase), deduplicate values, sort lists.
- External lookups: Query a SQL database, call a REST API, read a file, search a vector store for similar documents.
- Business logic: Apply company-specific rules that would be extremely difficult to express reliably in a prompt (“if the vendor is in our approved list and the total is under $5,000, auto-approve”).
- Branching: Decide which prompt to run next based on the content of the previous output — route to different specialists based on the category of the request.
None of these are possible in a single prompt because a single prompt is just text — it has no execution environment, no access to databases, no ability to call code. A chain is a Python program with LLM calls embedded in it, which makes it infinitely more powerful than a collection of prompts.
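For example, a normalisation step can be dropped between LLM calls as a plain Python function. This is a sketch: the unit-conversion rule is illustrative, and full_chain is assumed to be the string-producing chain built earlier.
```python
import json
from langchain_core.runnables import RunnableLambda

def normalise_storage(specs: dict) -> dict:
    """Plain Python between LLM calls: add a numeric storage field in GB."""
    if specs.get("storage", "").upper().startswith("1TB"):
        specs["storage_gb"] = 1024
    return specs

# LLM steps, then code steps; the code steps never touch the API.
pipeline = full_chain | RunnableLambda(json.loads) | RunnableLambda(normalise_storage)
```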
Reason 5: Debuggability and Iteration Speed
When a single-prompt system fails, the only debugging tools you have are the final (broken) output and the original prompt. There are no intermediate states to inspect. When a chain fails, you can test each sub-chain independently, add print() statements between steps to inspect intermediate outputs, and immediately identify which step produced the bad output. This makes iteration much faster: you fix the broken step, test only that step in isolation, verify the fix, and reconnect.
This is not a minor convenience. In practice, 80% of building a reliable LLM system is the debugging and iteration cycle. A system that is easy to debug gets to reliability 10x faster than one that hides its internal state from you. Chaining forces good modularity, and good modularity makes debugging fast.
Context Engineering
Here’s something most tutorials skip: what you put around the prompt matters as much as the prompt itself.
There’s a discipline called Context Engineering — the art of giving the model a rich, complete picture before it generates a response.
Prompt engineering asks: “how do I phrase this question?”
Context engineering asks: “what does the model need to know before I even ask?”
A basic LLM answers from its training data. A context-engineered agent:
- Checks your calendar before scheduling something
- Reads your previous emails before drafting a new one
- Fetches the latest docs before writing code
The difference is the environment you build around the model — not the model itself. This is why two teams using the same LLM can get wildly different results.
Breaking Down the Four Context Types
The diagram shows four distinct types of context that feed into the model before it generates a response. Let’s examine each one deeply.
1. System Prompt — Who the Model Is
The system prompt is a special message that appears at the very beginning of every conversation, before any user input. It is invisible to end users in most applications, but it is the most powerful lever an application developer has over the model’s behavior.
A well-crafted system prompt tells the model:
- Its role: "You are a customer support agent for Acme Corporation."
- Its constraints: "You only discuss topics related to Acme products. You never give medical or legal advice."
- Its format preferences: "Always respond in three sentences or fewer. Use bullet points for lists of steps."
- Its knowledge boundaries: "You do not have access to information after January 2025. For anything more recent, tell the user to check the company website."
- Its tone: "Be warm, professional, and empathetic. Avoid corporate jargon and never use passive voice."
Every word in the system prompt consumes tokens from the context window, so system prompts must be precise and dense. Padding them with vague instructions like "be a helpful AI" wastes tokens on low-value content that the model already tries to do by default. The best system prompts are specific, boundary-setting, and role-defining.
In LangChain, you can include a system prompt in your ChatPromptTemplate using the from_messages method:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
("system", "You are a hardware spec extractor. Output only extracted specs, one per line."),
("human", "Text: {text_input}")
])
This creates a two-message prompt: a system message (the model’s governing instructions) and a human message (the current request). The ChatOpenAI object knows how to format these into the messages array that OpenAI’s API expects — a list like [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}].
2. Retrieved Docs — Facts from Your Knowledge Base
A base LLM knows only what was in its training data, which has a knowledge cutoff — it knows nothing about events after a certain date, and it knows nothing about your company’s internal documents, private data, or proprietary processes.
Retrieval-Augmented Generation (RAG) is itself a form of prompt chaining that solves this problem. The pipeline looks like:
- User asks a question (e.g., “What is our company’s refund policy for digital products?”).
- The question is converted into a vector — a list of numbers that mathematically represents the meaning of the question in a high-dimensional space, where similar meanings produce similar vectors. This process is called embedding.
- The vector is compared against a database of pre-processed document embeddings (your knowledge base — perhaps thousands of support articles, policies, and documentation pages).
- The most similar documents are retrieved from the database.
- Those retrieved documents are included in the prompt as context: "Answer the user's question using only the information in the following documents: [document 1] [document 2] [document 3]..."
- The model generates an answer grounded in those specific documents.
The beauty of RAG is that the model answers from your provided documents rather than from its training data. This means the answer can be current (you update the knowledge base, not the model), accurate (the source documents are authoritative), and specific to your domain. Hallucinations are reduced dramatically because the model is explicitly told to use the provided information. If the answer is not in the documents, a well-designed RAG system will say so rather than fabricating an answer.
RAG is a three-step chain: retrieve → inject into prompt → generate. The retrieval step is a regular code step (a vector database query), not an LLM call. The injection step constructs the prompt by filling a template with the retrieved documents. The generation step is an LLM call. Three steps, one job each.
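A minimal LCEL sketch of that three-step chain, assuming you already have a retriever object (for example, one built from a vector store) and the llm from earlier; the prompt wording is a placeholder:
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_prompt = ChatPromptTemplate.from_template(
    "Answer the user's question using only the information in the following "
    "documents. If the answer is not in the documents, say so.\n\n"
    "Documents:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Turn the retrieval step's documents into a plain text block for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}  # retrieve
    | rag_prompt                                                             # inject
    | llm                                                                    # generate
    | StrOutputParser()
)

answer = rag_chain.invoke("What is our company's refund policy for digital products?")
```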
3. Tool Outputs — API and Calendar Results
Some questions require real-time data that no amount of training can provide. “What is the current weather in Tokyo?” “What meetings do I have tomorrow?” “What is the current inventory level for SKU 12345?” These questions require accessing live data sources that exist outside the model’s training data.
Tool outputs are the results of external API calls or function calls that are included in the prompt context before the model generates its response. In a calendar-aware scheduling agent:
- User asks: "Schedule a one-hour meeting with Sarah for next Tuesday morning."
- Chain Step 1 (Python code): Call the Calendar API with Sarah’s ID and next Tuesday’s date. Get back a list of her available time slots.
- Chain Step 2 (LLM): "Given these available time slots: [API response], draft a professional meeting invitation for the user to review before sending. Include the time, date, and a suggested meeting agenda."
- Chain Step 3 (Python code): Display the draft to the user, wait for confirmation, then call the Calendar API again to actually create the event.
The LLM never “calls” the Calendar API directly. It cannot — it has no network access or execution environment. Your Python code calls the API, gets the response, and includes that response in the next LLM prompt. The LLM’s job is to reason about and present the data, not to fetch it. This separation of LLM-as-reasoner and code-as-executor is fundamental to building reliable AI systems, because it means that any failure in the external API is clearly isolated in the code layer and can be handled with standard Python exception handling.
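A sketch of that separation in code. Note that get_available_slots is an entirely hypothetical placeholder for whatever calendar library you actually use, and llm is assumed from earlier:
```python
from langchain_core.prompts import ChatPromptTemplate

def get_available_slots(person_id: str, date: str) -> list[str]:
    """Hypothetical wrapper around your real calendar API."""
    # e.g. return calendar_client.free_slots(person_id, date)
    return ["9:00-10:00 AM", "10:30-11:30 AM"]

draft_prompt = ChatPromptTemplate.from_template(
    "Given these available time slots: {slots}, draft a professional meeting "
    "invitation for the user to review before sending. Include the time, date, "
    "and a suggested meeting agenda."
)

# Step 1 (code): fetch live data the model cannot know.
slots = get_available_slots("sarah", "next Tuesday")

# Step 2 (LLM): reason about and present that data.
draft = (draft_prompt | llm).invoke({"slots": ", ".join(slots)})

# Step 3 (code): show the draft, wait for confirmation, then create the event.
```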
4. User History — What They Said Before
As discussed in the “Conversations with Memory” section, LLMs are completely stateless — every API call is an isolated event. The conversational continuity you see in products like ChatGPT is implemented entirely by the client software, which assembles a prompt that includes the full conversation history and sends it to the model each time.
This means that long conversations have a fundamental limitation: eventually, the conversation history exceeds the context window and the oldest messages must be dropped. The fix is a memory management chain: when the conversation gets too long, a separate LLM call summarises the oldest messages into a compact paragraph, and that summary replaces the full text of those messages. The model gets a compressed representation of the past that fits within its context window.
Strategies for managing user history in production systems include:
- Full history (simple): Include every message up to the context window limit. Simple to implement, but uses many tokens and hits the limit quickly for long conversations.
- Summary + recent (balanced): Keep a rolling summary of messages older than N turns, and include full text of only the most recent N turns. Efficient and scalable.
- Selective recall (advanced): Use a retrieval mechanism (like RAG) to find the most relevant past exchanges for the current question. If the user asked about billing three weeks ago and is now asking about billing again, retrieve that specific exchange. This is the most token-efficient approach but requires a vector database.
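Here is a rough sketch of the "summary + recent" strategy. The summarisation prompt, the default of six recent turns, and the message representation as plain strings are all illustrative; llm is assumed from earlier:
```python
from langchain_core.prompts import ChatPromptTemplate

summarise_prompt = ChatPromptTemplate.from_template(
    "Summarise the following conversation in one compact paragraph, "
    "preserving names, decisions, and open questions:\n\n{messages}"
)

def compress_history(messages: list[str], keep_recent: int = 6) -> list[str]:
    """Replace everything older than the last keep_recent turns with a summary."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = (summarise_prompt | llm).invoke({"messages": "\n".join(old)}).content
    return [f"Summary of earlier conversation: {summary}"] + recent
```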
Context Engineering vs. Prompt Engineering: Why the Distinction Matters
Prompt engineering is a narrow discipline focused on phrasing. Given that you are going to make one LLM call, how do you word the input to get the best output? It is valuable, but it has hard limits: a perfectly engineered prompt cannot give the model today’s stock price. It cannot make the model remember conversations it was not part of. It cannot make the model access your company’s internal database.
Context engineering is a systems discipline focused on information architecture. What is the complete information environment you need to construct so the model has everything it needs to answer correctly? It encompasses prompt engineering as one sub-problem, but also includes retrieval system design, memory management strategy, tool integration architecture, and pipeline design.
The implication is profound: the model is not the product. The context engineering system you build around the model is the product. Two teams using identical models with different context engineering approaches can get results that differ by an order of magnitude in quality, accuracy, and reliability. This is why some AI products feel dramatically better than others even when both are “powered by GPT-4.”
Think of the LLM as a highly capable consultant who has been in an information blackout for a year — no phone, no internet, no news. They are brilliant, and they remember everything they knew before the blackout. But they only know what you bring them from the outside world. Context engineering is the discipline of deciding what information to bring into that meeting room, in what format, and in what order, before the consultant starts working. The quality of the briefing determines the quality of the advice.
Key Takeaways
The core idea. One complex prompt = one overwhelmed LLM. A chain of focused prompts = reliable, controllable output.
The mechanism. Each step outputs structured data (usually JSON) that feeds directly into the next step’s prompt. Clean handoffs between steps are everything.
The superpower. Between LLM calls, you can run real code — validators, APIs, calculators, databases. The chain is a pipeline, not just a list of prompts.
When to use it. Whenever your task has more than two stages, requires external data, or would benefit from independent verification at each step.
If your task would make an intern ask “which part do you want me to do first?” — it needs a chain.
The Mental Model to Carry Forward
At the highest level, prompt chaining teaches you to think differently about what an LLM is and how to use it. An LLM is not a magic box you throw hard problems at and hope for the best. It is a powerful, specialised component that performs best when given narrow, well-defined tasks with clean inputs and clear output expectations.
Every complex AI task you will ever build can be decomposed into a sequence of narrow tasks. Each task has one job. Each job produces a structured output. Each output is validated before it flows to the next step. Real code lives between the LLM calls, handling the parts that LLMs are bad at — arithmetic, external lookups, format enforcement, business logic. The whole system is testable step by step, debuggable at each checkpoint, and improvable one step at a time.
This is the mental model of agentic AI: not a single powerful AI doing everything, but a well-designed pipeline of focused, reliable components — some powered by LLMs, some by regular code — working together to solve problems that neither could solve alone.
In the next chapter, we will look at how this pipeline model extends to agents — systems where an LLM can decide which tool to use next, rather than following a fixed sequence, turning a rigid chain into a dynamic, self-directing reasoning loop.
References
- LangChain Expression Language — python.langchain.com
- LangGraph — langchain-ai.github.io/langgraph
- Prompt Chaining Guide — promptingguide.ai
- Crew AI — docs.crewai.com
- Google Vertex AI Prompt Optimizer — cloud.google.com