Table of Contents
- Why Token Budgets Matter in Agent Workflows
- Understanding Token Counting Across Hops
- Architecture for Budget Tracking
- Implementing Per-Hop Budget Enforcement
- Budget Propagation and Context Carryover
- Monitoring and Observability
- Cost Reclamation and Optimisation
- Real-World Implementation Patterns
- Common Pitfalls and How to Avoid Them
- Summary and Next Steps
Why Token Budgets Matter in Agent Workflows
When you deploy agentic AI systems in production, token consumption can quickly become one of your largest operational expenses after infrastructure. A single multi-hop agent workflow (where an agent calls another agent, which calls a tool, which returns context back up the chain) can easily consume 10x the tokens of a simple request-response call.
Without explicit token budgets, you face three critical risks: runaway costs that surprise you mid-quarter, degraded user experience when budgets or context windows run out and agents fail silently, and compliance gaps when you can’t trace which workflows consumed what resources.
At PADISO, we’ve seen founders ship agentic AI systems that worked brilliantly in staging but cost $50K/month in production because no one enforced token budgets at the agent level. The fix isn’t complicated, but it requires discipline: you need to define, track, and enforce token budgets at every hop in your workflow, not just at the API call level.
Token budgets are especially critical if you’re building on OpenAI’s API, Anthropic’s Claude, Google’s Gemini, or any other LLM provider. Each provider counts tokens differently, and context length varies by model. A workflow that fits comfortably in Claude 3 Opus’s 200K context window might overflow GPT-4 Turbo’s 128K window if you’re not careful about token propagation.
This guide walks you through the patterns, tools, and guardrails that production teams use to keep multi-hop agent workflows within budget while maintaining quality and speed.
Understanding Token Counting Across Hops
What Tokens Are and Why They Matter
Tokens are the atomic unit of cost and context in LLM APIs. A token is roughly 4 characters of English text, but the exact count depends on the tokenizer. When you call an LLM API, you pay for two separate token counts: input tokens (your prompt plus context) and output tokens (the model’s response).
In a single-hop workflow—user sends a prompt, LLM responds—token counting is straightforward. You measure input and output tokens once, multiply by the provider’s per-token rate, and move on.
In a multi-hop agent workflow, token counting becomes much harder, because context gathered at one hop is billed again as input at every later hop that receives it. Here’s why:
Hop 1: Agent A receives a user query (100 tokens). It needs context, so it calls a retrieval tool that returns 500 tokens of documents. Agent A now has 600 tokens of context and generates a response (200 tokens). Cost: 600 input tokens + 200 output tokens.
Hop 2: Agent A’s response (200 tokens) plus the original query (100 tokens) plus the retrieved documents (500 tokens) get passed to Agent B. Agent B adds its own system prompt (150 tokens) and calls a different tool, which returns another 300 tokens. Agent B now has 1,250 tokens of context and generates a response (150 tokens). Cost: 1,250 input tokens + 150 output tokens.
Hop 3: Agent B’s response (150 tokens) gets passed back to Agent A, which now has 1,250 tokens of context (original query + first response + second response + context from both hops). Agent A generates a final response (100 tokens). Cost: 1,250 input tokens + 100 output tokens.
Total tokens consumed: (600 + 1,250 + 1,250) input + (200 + 150 + 100) output = 3,100 + 450 = 3,550 tokens. If you’re on GPT-4 Turbo at $0.01 per 1K input tokens and $0.03 per 1K output tokens, that single workflow costs about $0.045. Multiply that by 10,000 daily workflows, and you’re at roughly $445/day or about $13,350/month, just for token costs, before infrastructure.
Without budget enforcement, a poorly designed workflow can easily double or triple that cost. An agent that recursively calls itself to refine answers, or that retrieves too much context at each hop, or that doesn’t truncate responses before passing them downstream, will blow through your budget in weeks.
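The per-workflow figure is easy to check in code. The sketch below adds up the components listed for each hop in the example and applies the quoted GPT-4 Turbo prices; the `hops` structure and the `workflow_cost` helper are illustrative names, not part of any library.

```python
# Token counts per hop, built from the components listed in the example above.
# Prices are the GPT-4 Turbo rates quoted in the text (USD per 1K tokens).
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03

hops = [
    # query + retrieved documents
    {"input": 100 + 500, "output": 200},
    # A's response + query + documents + B's system prompt + B's tool result
    {"input": 200 + 100 + 500 + 150 + 300, "output": 150},
    # query + both responses + context from both hops
    {"input": 100 + 200 + 150 + 500 + 300, "output": 100},
]


def workflow_cost(hops):
    """Return (input_tokens, output_tokens, cost_in_usd) for a list of hops."""
    total_in = sum(h["input"] for h in hops)
    total_out = sum(h["output"] for h in hops)
    cost = (total_in * INPUT_PRICE_PER_1K + total_out * OUTPUT_PRICE_PER_1K) / 1000
    return total_in, total_out, cost


total_in, total_out, cost = workflow_cost(hops)
print(f"{total_in} input + {total_out} output tokens, ${cost:.4f} per workflow")
```

Keeping the components separate like this makes it obvious which pieces of context are being paid for more than once.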
Token Counting Across Providers
Different providers count tokens differently, and their documentation is your source of truth.
OpenAI’s tokenisation guide is the most widely used reference. OpenAI uses a consistent tokenizer across GPT-3.5, GPT-4, and GPT-4 Turbo. You can estimate tokens using their online tokenizer or the tiktoken Python library. The rule of thumb is 1 token ≈ 4 characters, but special tokens (function calls, JSON delimiters) add overhead.
Anthropic’s token counting documentation is more recent and detailed. Anthropic provides a count_tokens() method in their SDK; its documentation describes the result as an estimate that can differ slightly from the usage reported on the real request. Claude’s tokenizer is different from OpenAI’s, so the same prompt might be 150 tokens in Claude and 120 tokens in GPT-4. Anthropic also documents that system prompts, tool definitions, and XML tags all consume tokens, so you can’t ignore them in your budget calculations.
Google’s Gemini tokenisation guide covers token counting for Gemini 1.5 Pro and Flash. Gemini’s context window is much larger (1M tokens), but token counting is still essential for cost control. Google’s tokenizer also differs from OpenAI’s and Anthropic’s.
Cohere’s tokenisation documentation and Mistral’s tokenisation guide provide similar detail for their respective models. The key takeaway: you can’t assume token counts are portable across providers. If you migrate from OpenAI to Anthropic, you must re-measure token consumption across your entire workflow.
Context Carryover and Token Explosion
The biggest source of token bloat in multi-hop workflows is context carryover. When Agent A calls Agent B, should Agent B see the entire conversation history up to that point? The answer is “it depends,” but most teams naively pass the entire context, which causes token counts to explode exponentially.
Consider a three-hop workflow where each hop doubles the context:
- Hop 1: 500 tokens input (query + context)
- Hop 2: 1,000 tokens input (Hop 1 context + new context)
- Hop 3: 2,000 tokens input (Hop 1 + Hop 2 context + new context)
Total: 3,500 input tokens. Add a fourth hop (4,000 tokens) and the cumulative total is 7,500. A fifth hop (8,000 tokens) brings it to 15,500. A six-hop workflow with this pattern consumes 31,500 input tokens in total, and that still excludes the model’s output tokens.
This is why context management techniques are essential. You need to decide, at each hop, what context is actually necessary for the downstream agent to do its job. Often, you don’t need the full conversation history—you need a summary, or just the last N messages, or a structured extraction of the key facts.
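One of the simplest techniques mentioned above is keeping only the last N messages that fit a token allowance. The following is a minimal sketch of that idea; `keep_recent` and `rough_token_count` are hypothetical helpers, and the 4-characters-per-token estimate is only the rule of thumb from earlier, so swap in your provider's tokenizer for real budgets.

```python
def rough_token_count(text):
    """Crude estimate: about 4 characters per token (rule of thumb only)."""
    return max(1, len(text) // 4)


def keep_recent(messages, max_tokens, count_tokens=rough_token_count):
    """Return the most recent messages whose combined size fits max_tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first one
    # that would push the total over the allowance.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, used


history = ["a" * 400, "b" * 800, "c" * 200, "d" * 120]
trimmed, used = keep_recent(history, max_tokens=100)
print(len(trimmed), used)  # 2 80
```

A window like this caps the input size of every hop, so the total grows linearly with the number of hops instead of compounding.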
Architecture for Budget Tracking
Define Budget Tiers
Start by defining budget tiers for different workflow types. This isn’t a one-size-fits-all problem. A simple retrieval-augmented generation (RAG) query might have a 5,000-token budget. A complex multi-step research workflow might have a 50,000-token budget. A real-time customer support agent might have a 2,000-token budget because you need sub-second response times.
At PADISO, when we’re building AI & Agents Automation systems for clients, we always start with a budget tier matrix:
| Workflow Type | Budget (Tokens) | Max Hops | Use Case |
|---|---|---|---|
| Simple retrieval | 5,000 | 1–2 | FAQ lookup, document search |
| Moderate reasoning | 15,000 | 2–4 | Customer support, content generation |
| Complex research | 50,000 | 4–6 | Data analysis, report generation |
| Recursive refinement | 100,000 | 6+ | Multi-stage problem solving |
These are starting points. You’ll adjust them based on your actual usage patterns and cost tolerance. The key is to be explicit: every workflow type gets a budget, and that budget is enforced at runtime.
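To make the matrix enforceable at runtime, it helps to keep it in code rather than in a document. This is a sketch under assumptions: the key names and the `tier_for` helper are illustrative, and the "6+" entry is represented as `None` because the table gives no fixed upper bound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BudgetTier:
    token_budget: int
    max_hops: Optional[int]  # None means the table gives no fixed upper bound


# Values copied from the tier matrix above; the key names are illustrative.
BUDGET_TIERS = {
    "simple_retrieval": BudgetTier(token_budget=5_000, max_hops=2),
    "moderate_reasoning": BudgetTier(token_budget=15_000, max_hops=4),
    "complex_research": BudgetTier(token_budget=50_000, max_hops=6),
    "recursive_refinement": BudgetTier(token_budget=100_000, max_hops=None),
}


def tier_for(workflow_type):
    """Look up a tier, refusing to run workflows that have no explicit budget."""
    try:
        return BUDGET_TIERS[workflow_type]
    except KeyError:
        raise ValueError(f"No budget tier defined for {workflow_type!r}") from None


print(tier_for("complex_research").token_budget)  # 50000
```

Raising on an unknown workflow type is a deliberate choice here: a workflow without a tier should fail loudly rather than run unbudgeted.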
Design a Budget Propagation Model
Decide how budgets flow through your workflow. The two most common models are:
Model 1: Parent Allocates to Children
The root agent (the one called by the user) receives the full budget. When it calls a child agent, it allocates a portion of its remaining budget to the child. For example:
- Root agent receives 50,000 tokens.
- It allocates 20,000 tokens to Child Agent A and 25,000 tokens to Child Agent B, keeping 5,000 tokens for itself.
- Child Agent A, when it calls a tool or sub-agent, allocates from its 20,000-token budget.
- If Child Agent A exceeds 20,000 tokens, it fails gracefully and returns an error to the root agent.
This model is simple and gives the root agent full control, but it requires the root agent to know ahead of time how to allocate budgets fairly.
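A minimal sketch of this allocation model, using the numbers from the example. The `Budget` class and `BudgetError` are hypothetical names introduced for illustration.

```python
class BudgetError(Exception):
    """Raised when an allocation or a spend does not fit the remaining budget."""


class Budget:
    def __init__(self, name, total):
        self.name = name
        self.remaining = total

    def allocate(self, child_name, amount):
        """Carve a child budget out of this one."""
        if amount > self.remaining:
            raise BudgetError(
                f"{self.name} cannot allocate {amount} tokens, "
                f"only {self.remaining} remain"
            )
        self.remaining -= amount
        return Budget(child_name, amount)

    def spend(self, tokens):
        """Charge tokens used by this agent's own LLM calls."""
        if tokens > self.remaining:
            raise BudgetError(
                f"{self.name} needs {tokens} tokens, only {self.remaining} remain"
            )
        self.remaining -= tokens


root = Budget("root", 50_000)
child_a = root.allocate("child_a", 20_000)
child_b = root.allocate("child_b", 25_000)
child_a.spend(12_000)
print(root.remaining, child_a.remaining, child_b.remaining)  # 5000 8000 25000
```

Because each child holds its own `Budget`, a child that overspends raises inside its own subtree and the parent decides what to do with the error.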
Model 2: Global Budget with Per-Hop Limits
All agents in a workflow share a single global budget pool. Each hop has a maximum token allowance (e.g., no single hop can consume more than 10,000 tokens), but the total across all hops is capped at a global limit (e.g., 50,000 tokens). If any hop hits its per-hop limit, it fails. If the global pool is exhausted, all remaining hops fail.
This model is more flexible and doesn’t require pre-allocation, but it’s harder to implement because you need a central budget manager that all agents consult.
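The central budget manager can be quite small. The sketch below assumes agents report usage after each hop and may do so from different threads; `GlobalBudgetPool` and both exception names are hypothetical.

```python
import threading


class HopLimitExceeded(Exception):
    pass


class GlobalBudgetExhausted(Exception):
    pass


class GlobalBudgetPool:
    """Central budget manager shared by every agent in one workflow."""

    def __init__(self, global_limit, per_hop_limit):
        self.global_limit = global_limit
        self.per_hop_limit = per_hop_limit
        self.used = 0
        self._lock = threading.Lock()  # agents may report usage concurrently

    def charge(self, hop_name, tokens):
        """Record a hop's usage and return the tokens left in the pool."""
        if tokens > self.per_hop_limit:
            raise HopLimitExceeded(
                f"{hop_name} used {tokens} tokens, per-hop limit is {self.per_hop_limit}"
            )
        with self._lock:
            if self.used + tokens > self.global_limit:
                raise GlobalBudgetExhausted(
                    f"{hop_name} needs {tokens} tokens, "
                    f"{self.global_limit - self.used} left in the pool"
                )
            self.used += tokens
            return self.global_limit - self.used


pool = GlobalBudgetPool(global_limit=50_000, per_hop_limit=10_000)
print(pool.charge("retrieval", 8_000))  # 42000
```

In a distributed deployment the same interface would sit in front of a shared store rather than an in-process counter, which is where most of the extra implementation effort goes.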
For most teams, Model 1 (parent allocates to children) is easier to implement and reason about. You start with a clear budget, divide it among your agents, and let each agent manage its own subtree. If a subtree runs out of budget, it fails gracefully, and the parent can decide how to handle it (retry with a larger budget, skip that branch, etc.).
Instrument Your Agent Framework
You need to instrument your agent framework—whether that’s LangChain, LlamaIndex, or a custom framework—to track token usage at every step.
LangChain’s token usage tracking documentation shows how to wrap LLM calls to capture input and output tokens. Here’s the pattern:
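# In newer LangChain releases this helper lives in langchain_community.callbacks.
from langchain.callbacks import get_openai_callback

# Every OpenAI call made inside the block is counted by the callback.
with get_openai_callback() as cb:
    response = agent.run(query)

print(f"Tokens used: {cb.total_tokens}")
print(f"Cost: ${cb.total_cost}")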
This gives you a single aggregate total for everything that runs inside the with block, which isn’t enough for multi-hop workflows: you also need to attribute token usage to each hop. A better pattern is to create a custom callback handler that logs token usage for each LLM call, tool call, and agent step:
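from langchain.callbacks.base import BaseCallbackHandler


class BudgetExceededError(Exception):
    """Minimal placeholder; a richer version appears later in this guide."""


class BudgetTrackingCallback(BaseCallbackHandler):
    # LangChain logs and swallows handler exceptions unless raise_error is True.
    raise_error = True

    def __init__(self, budget_limit):
        self.budget_limit = budget_limit
        self.tokens_used = 0
        self.hops = []

    def on_llm_end(self, response, **kwargs):
        # llm_output can be None for some providers, so fall back to an empty dict.
        tokens = (response.llm_output or {}).get('token_usage', {})
        input_tokens = tokens.get('prompt_tokens', 0)
        output_tokens = tokens.get('completion_tokens', 0)
        total = input_tokens + output_tokens
        self.tokens_used += total
        # One entry per LLM call, so usage can be attributed to each step.
        self.hops.append({
            'step': len(self.hops),
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'total': total,
            'cumulative': self.tokens_used
        })
        if self.tokens_used > self.budget_limit:
            raise BudgetExceededError(
                f"Budget exceeded: {self.tokens_used} > {self.budget_limit}"
            )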
LlamaIndex’s observability documentation provides similar patterns for tracing token usage across complex workflows. The key insight is that you need to instrument at the LLM call level, not the agent level, so you can see exactly where tokens are being spent.
Implementing Per-Hop Budget Enforcement
Set Hard Limits on Individual Hops
Even if your global budget is 50,000 tokens, you should set hard limits on individual hops. This prevents a single misbehaving agent from consuming your entire budget.
A reasonable per-hop limit depends on your workflow, but here’s a rule of thumb:
- Simple retrieval or tool calls: 2,000–5,000 tokens
- Reasoning or summarisation: 5,000–10,000 tokens
- Complex analysis: 10,000–20,000 tokens
- Never exceed 30,000 tokens per hop unless you have a very good reason
The logic is that if a single hop needs more than 30,000 tokens to complete, you’ve probably designed the workflow wrong. Break it into smaller steps, summarise context more aggressively, or use a different approach entirely.
To enforce per-hop limits, add a check before each LLM call:
# estimate_tokens, summarise_context and the two error classes are
# application-defined helpers. response.token_count is assumed to hold
# the number of output tokens reported by the provider.
def call_agent(agent, query, remaining_budget, hop_limit=5000):
    if remaining_budget < 1000:
        raise BudgetExhaustedError("Insufficient budget for next hop")

    # Estimate input tokens
    input_tokens = estimate_tokens(query, agent.system_prompt)
    if input_tokens > hop_limit:
        # Truncate context or summarise, leaving 1,000 tokens for the output
        query = summarise_context(query, hop_limit - 1000)
        input_tokens = estimate_tokens(query, agent.system_prompt)

    response = agent.run(query)
    output_tokens = response.token_count
    total_used = input_tokens + output_tokens

    # Check actual usage after the call, since estimates can be off
    if total_used > hop_limit:
        raise HopBudgetExceededError(
            f"Hop consumed {total_used} tokens, limit is {hop_limit}"
        )
    return response, total_used
The key is the estimation step. Before you call the LLM, estimate how many tokens your input will consume. If it’s above the hop limit, summarise or truncate the context. This prevents most budget overruns before they happen.
Estimate Tokens Before Calling the LLM
Token estimation is critical. You can’t enforce budgets if you don’t know, before the call, how many tokens you’re about to consume.
For OpenAI models, use the tiktoken library:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
num_tokens = len(encoding.encode(text))
For Anthropic Claude, use the official SDK:
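import anthropic

client = anthropic.Anthropic()
count = client.messages.count_tokens(
    model="claude-3-opus-20240229",
    messages=[{"role": "user", "content": text}],
)
# The call returns a response object; the number itself is in input_tokens.
token_count = count.input_tokens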
For other providers, check their documentation. The point is: always estimate before you call. This gives you a chance to adjust your input (truncate context, summarise, simplify the prompt) before you commit tokens.
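The enforcement snippets in this guide call an `estimate_tokens(query, system_prompt)` helper without defining it. One possible shape is sketched below, assuming you plug in a provider-specific counter; the 4-characters-per-token fallback and the `overhead_per_part` allowance are placeholder assumptions, not provider figures, and should be tuned against the usage your provider actually reports.

```python
import math


def heuristic_count(text):
    """Fallback estimate: roughly 4 characters per token (rule of thumb only)."""
    return math.ceil(len(text) / 4)


def estimate_tokens(*parts, count=heuristic_count, overhead_per_part=4):
    """Estimate the input tokens for several prompt parts.

    parts: system prompt, query, carried-over context, and so on.
    count: callable mapping text to a token count (plug in a real tokenizer).
    overhead_per_part: assumed allowance for message framing; tune it
        against the usage numbers your provider reports.
    """
    return sum(count(part) + overhead_per_part for part in parts if part)


system_prompt = "You are a research assistant." * 10
query = "Summarise the attached findings."
print(estimate_tokens(query, system_prompt))
```

With tiktoken installed, you could pass `count=lambda t: len(encoding.encode(t))`, reusing the `encoding` object from the OpenAI snippet above, to replace the heuristic with real token counts.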
Handle Budget Exhaustion Gracefully
When a hop runs out of budget, don’t just fail silently. Return a clear error that indicates why the workflow stopped and what the user can do about it.
class BudgetExceededError(Exception):
    def __init__(self, hop_index, tokens_used, budget_limit, remaining_global):
        self.hop_index = hop_index
        self.tokens_used = tokens_used
        self.budget_limit = budget_limit
        self.remaining_global = remaining_global
        message = (
            f"Hop {hop_index} exceeded budget. "
            f"Used {tokens_used} tokens, limit is {budget_limit}. "
            f"Global budget remaining: {remaining_global} tokens. "
            f"Try simplifying your query or increasing the budget."
        )
        super().__init__(message)
When you catch this error in your application, you have a few options:
- Retry with a larger budget. If the user is willing to wait and pay more, re-run the workflow with a higher budget.
- Simplify the workflow. Remove unnecessary hops, reduce context, or use a simpler model.
- Return a partial result. If you’ve completed some hops successfully, return what you have and explain what couldn’t be completed.
- Queue for later. If this is a background job, queue it with a larger budget and retry when resources are available.
The worst thing you can do is fail silently and return an empty result. The user won’t know if the workflow succeeded or ran out of budget.
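The "return a partial result" option can be sketched as follows. This assumes each hop is a callable that takes the remaining budget and returns `(output, tokens_used)`; `BudgetStop`, `WorkflowResult`, `run_hops` and `make_hop` are all hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


class BudgetStop(Exception):
    """Raised by a hop that cannot run within the remaining budget."""


@dataclass
class WorkflowResult:
    completed: list = field(default_factory=list)
    stopped_at: Optional[int] = None  # index of the hop that could not run
    reason: str = ""
    tokens_remaining: int = 0


def run_hops(hops, budget):
    """Run hops in order; on budget exhaustion keep what already succeeded."""
    result = WorkflowResult(tokens_remaining=budget)
    for index, hop in enumerate(hops):
        try:
            output, tokens = hop(result.tokens_remaining)
        except BudgetStop as exc:
            result.stopped_at = index
            result.reason = str(exc)
            break
        result.tokens_remaining -= tokens
        result.completed.append(output)
    return result


def make_hop(name, cost):
    """Build a stand-in hop with a fixed token cost (for demonstration)."""
    def hop(remaining):
        if cost > remaining:
            raise BudgetStop(f"{name} needs {cost} tokens, only {remaining} left")
        return f"{name} done", cost
    return hop


result = run_hops(
    [make_hop("retrieve", 3_000), make_hop("synthesise", 4_000),
     make_hop("fact-check", 5_000)],
    budget=10_000,
)
print(result.completed, result.stopped_at, result.reason)
```

The caller always gets back the completed outputs plus an explicit reason, so the application can show the user what was finished and why the workflow stopped.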
Budget Propagation and Context Carryover
Decide What Context to Carry Forward
At each hop, you have a choice: pass the entire conversation history forward, or pass only the essential context. Passing everything is easy but expensive. Passing only what’s necessary requires discipline but saves tokens.
Here’s a decision tree:
Should Agent B see the original user query?
- Yes, if Agent B needs to understand the user’s intent.
- No, if Agent B is just executing a specific task that Agent A has already interpreted.
Should Agent B see Agent A’s intermediate reasoning?
- Yes, if Agent B needs to understand why Agent A made certain decisions.
- No, if Agent B just needs the facts that Agent A extracted.
Should Agent B see the results of previous tool calls?
- Yes, if Agent B needs to build on those results.
- No, if Agent B is calling different tools that don’t depend on previous results.
For example, in a research workflow:
- Hop 1: Agent A retrieves documents about a topic (500 tokens of context).
- Hop 2: Agent B needs to synthesise those documents. It should see the documents (500 tokens), but not Agent A’s internal reasoning (200 tokens). Pass 500 tokens, not 700.
- Hop 3: Agent C needs to fact-check the synthesis. It should see the original documents (500 tokens) and Agent B’s synthesis (150 tokens), but not Agent A’s retrieval logs. Pass 650 tokens, not 850.
This kind of selective context propagation can reduce token usage by 30–50% without sacrificing quality.
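The decision tree above can be expressed as a small context builder. This is a sketch under assumptions: it presumes each hop's output has already been split into facts and reasoning, and `HopOutput` and `build_context` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HopOutput:
    agent: str
    facts: str      # results a downstream agent can build on
    reasoning: str  # intermediate reasoning or logs, usually safe to drop


def build_context(query, outputs, include_query, include_reasoning, facts_from):
    """Assemble only the context a downstream agent has been granted."""
    parts = []
    if include_query:
        parts.append(f"User query: {query}")
    for out in outputs:
        if out.agent not in facts_from:
            continue  # this agent's results are not needed downstream
        parts.append(f"[{out.agent}] {out.facts}")
        if include_reasoning:
            parts.append(f"[{out.agent} reasoning] {out.reasoning}")
    return "\n".join(parts)


outputs = [
    HopOutput("retriever", "Doc 1 ... Doc 2 ...", "Ranked 40 candidates, kept 2."),
    HopOutput("synthesiser", "Both documents agree on X.", "Compared claims pairwise."),
]

# The fact-checker sees the documents and the synthesis, but no reasoning or logs.
fact_checker_context = build_context(
    "Is X true?", outputs,
    include_query=True, include_reasoning=False,
    facts_from={"retriever", "synthesiser"},
)
print(fact_checker_context)
```

Making the policy explicit per downstream agent turns "what context do we pass?" into a reviewable setting rather than a side effect of how messages happen to be concatenated.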
Implement Context Summarisation
When context is too large to pass forward efficiently, summarise it. Summarisation trades a small amount of token cost (the summarisation step) for large savings downstream (smaller context for future hops).
For example:
def summarise_if_needed(context, max_tokens=5000):
    context_tokens = estimate_tokens(context)
    if context_tokens <= max_tokens:
        return context, context_tokens

    # Summarise
    summary_prompt = f"Summarise the following in under 1000 tokens:\n\n{context}"
    summary = llm.generate(summary_prompt)
    summary_tokens = estimate_tokens(summary)

    # Cost: summarisation overhead + smaller context for future hops
    # Benefit: reduced tokens for all downstream hops
    return summary, summary_tokens
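Summarisation is especially valuable when you’re passing context across many hops. A 10,000-token document summarised to 1,000 tokens saves 9,000 tokens per downstream hop. If you have 5 downstream hops, that’s 45,000 tokens saved. The summarisation call itself costs roughly 11,000 tokens (10,000 to read the document plus 1,000 to write the summary), so the net saving is about 34,000 tokens, and the step pays for itself from the second downstream hop onwards.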
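Whether summarisation pays off depends on how many hops sit downstream, so it is worth computing before adding the step. The sketch below treats the cost of the summarisation call as an explicit input; the example assumes the summariser reads the full document once and writes the summary, and both function names are hypothetical.

```python
import math


def summarisation_net_savings(original_tokens, summary_tokens,
                              downstream_hops, summarisation_cost):
    """Tokens saved downstream minus the tokens spent producing the summary."""
    saved = (original_tokens - summary_tokens) * downstream_hops
    return saved - summarisation_cost


def break_even_hops(original_tokens, summary_tokens, summarisation_cost):
    """Smallest number of downstream hops at which summarising pays off."""
    saved_per_hop = original_tokens - summary_tokens
    if saved_per_hop <= 0:
        return None  # the summary is not smaller, so it never pays off
    return math.ceil(summarisation_cost / saved_per_hop)


# Assumption: the summariser reads the 10,000-token document (input)
# and writes a 1,000-token summary (output).
cost = 10_000 + 1_000
print(summarisation_net_savings(10_000, 1_000, 5, cost))  # 34000
print(break_even_hops(10_000, 1_000, cost))               # 2
```

If a context is only consumed by one more hop, the same calculation shows that summarising it first usually costs more than it saves.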
Use Structured Extraction Instead of Full Context
When possible, extract structured data instead of passing full text. For example, if Agent A retrieves a research paper, don’t pass the entire paper to Agent B. Extract the key findings, methodology, and conclusions into a structured format:
{
  "title": "...",
  "key_findings": ["...", "..."],
  "methodology": "...",
  "limitations": ["...", "..."]
}
This structured format is often 80% smaller than the full text and contains all the information Agent B needs. The extraction step costs some tokens, but the savings downstream more than compensate.
Monitoring and Observability
Log Token Usage at Every Step
Set up logging that captures token usage for every LLM call, tool invocation, and agent hop. This is your primary tool for understanding where tokens are being spent and identifying optimisation opportunities.
At minimum, log:
- Timestamp of the call
- Agent/hop identifier (which agent made the call)
- LLM model used
- Input tokens consumed
- Output tokens generated
- Total tokens and cumulative total
- Budget remaining after the call
- Latency of the call
Example log entry:
{
  "timestamp": "2024-01-15T10:23:45Z",
  "workflow_id": "user-123-research-456",
  "hop": 2,
  "agent": "synthesis_agent",
  "model": "gpt-4-turbo",
  "input_tokens": 3200,
  "output_tokens": 450,
  "total_tokens": 3650,
  "cumulative_tokens": 8900,
  "budget_limit": 50000,
  "budget_remaining": 41100,
  "latency_ms": 2340
}
Store these logs in a structured format (JSON, Parquet, or a database) so you can query them later. This is essential for debugging, cost analysis, and optimisation.
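A record like the one above can be produced with the standard library alone. This is a minimal sketch: `build_usage_record` and `log_usage` are hypothetical helpers, and the derived fields (total, cumulative, remaining) are computed in one place so they cannot drift apart.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("token_usage")


def build_usage_record(workflow_id, hop, agent, model, input_tokens,
                       output_tokens, cumulative_before, budget_limit, latency_ms):
    """Build one structured log entry for a single LLM call."""
    total = input_tokens + output_tokens
    cumulative = cumulative_before + total
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "workflow_id": workflow_id,
        "hop": hop,
        "agent": agent,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": total,
        "cumulative_tokens": cumulative,
        "budget_limit": budget_limit,
        "budget_remaining": budget_limit - cumulative,
        "latency_ms": latency_ms,
    }


def log_usage(record):
    """Emit one JSON object per line so the log can be queried later."""
    logger.info(json.dumps(record))


record = build_usage_record(
    "user-123-research-456", 2, "synthesis_agent", "gpt-4-turbo",
    input_tokens=3200, output_tokens=450, cumulative_before=5250,
    budget_limit=50_000, latency_ms=2340,
)
log_usage(record)
```

One JSON object per line loads directly into most log pipelines and into tools that read newline-delimited JSON.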
Build a Token Usage Dashboard
Create a dashboard that shows:
- Total tokens consumed (daily, weekly, monthly) vs. budget
- Tokens per workflow type (retrieval, reasoning, etc.)
- Average tokens per hop (identify which agents are token-hungry)
- Budget exhaustion rate (percentage of workflows that hit budget limits)
- Cost per workflow (if you want to charge users for API usage)
Tools like Grafana, Datadog, or even a simple Google Sheets dashboard can work. The key is visibility: if you can’t see token usage, you can’t optimise it.
Set Up Alerts for Budget Anomalies
Configure alerts that trigger when:
- A single workflow consumes more than 2x its expected budget
- A hop consumes more than its per-hop limit
- Daily token consumption exceeds a threshold (e.g., 10M tokens)
- A specific agent consistently exceeds its budget
These alerts help you catch runaway workflows before they drain your entire budget.
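The first three rules can be evaluated directly against the structured log records. The sketch below assumes records shaped like the example log entry and a single expected budget for all workflows in the batch; `find_alerts` and the alert `kind` labels are hypothetical.

```python
from collections import defaultdict


def find_alerts(records, expected_workflow_tokens, per_hop_limit,
                daily_token_threshold):
    """Scan one day's usage records and return a list of alert dicts."""
    alerts = []
    per_workflow = defaultdict(int)

    for r in records:
        per_workflow[r["workflow_id"]] += r["total_tokens"]
        # Rule: a hop consumes more than its per-hop limit
        if r["total_tokens"] > per_hop_limit:
            alerts.append({"kind": "hop_over_limit",
                           "workflow_id": r["workflow_id"],
                           "hop": r["hop"],
                           "tokens": r["total_tokens"]})

    # Rule: a workflow consumes more than 2x its expected budget
    for workflow_id, total in per_workflow.items():
        if total > 2 * expected_workflow_tokens:
            alerts.append({"kind": "workflow_over_2x",
                           "workflow_id": workflow_id,
                           "tokens": total})

    # Rule: daily consumption exceeds a threshold
    daily_total = sum(per_workflow.values())
    if daily_total > daily_token_threshold:
        alerts.append({"kind": "daily_threshold", "tokens": daily_total})

    return alerts


records = [
    {"workflow_id": "wf-1", "hop": 0, "total_tokens": 4_000},
    {"workflow_id": "wf-1", "hop": 1, "total_tokens": 12_000},
    {"workflow_id": "wf-2", "hop": 0, "total_tokens": 3_000},
]
alerts = find_alerts(records, expected_workflow_tokens=5_000,
                     per_hop_limit=10_000, daily_token_threshold=1_000_000)
print([a["kind"] for a in alerts])  # ['hop_over_limit', 'workflow_over_2x']
```

The fourth rule (an agent that consistently exceeds its budget) needs history across days, so it fits better as a query over stored logs than as a per-batch check.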
Cost Reclamation and Optimisation
Identify and Eliminate Redundant Context
Review your logs and look for workflows where the same context is being passed through multiple hops without being used. This is pure waste.
For example, if you’re passing a 3,000-token document through 5 hops but only the first hop actually uses it, you’re wasting 12,000 tokens. Instead, have the first hop extract what the downstream hops need and pass only that.
Reduce Retrieval Context
RAG (Retrieval-Augmented Generation) workflows often retrieve too much context. Instead of retrieving the top 10 documents (which might be 5,000 tokens), retrieve the top 3–5 (roughly 1,500–2,500 tokens at the same document size). The quality difference is usually small, but the token savings are substantial.
If your retrieval system exposes a limit on returned context (shown here as a max_tokens parameter; your library may call it something else, such as top_k), tune it aggressively:
# Before: retrieve up to 5000 tokens
results = retriever.retrieve(query, max_tokens=5000)
# After: retrieve up to 1500 tokens
results = retriever.retrieve(query, max_tokens=1500)
Test with your actual queries and measure quality impact. Often, you’ll find that 1,500 tokens of context gives you 95% of the quality of 5,000 tokens.
Simplify Prompts
Every character in your system prompt consumes tokens. If your system prompt is 500 tokens, that’s 500 tokens per call, even if the user’s query is tiny.
Audit your system prompts and remove anything unnecessary:
- Don’t repeat instructions. “You are a helpful assistant” appears in many prompts, but it’s often redundant.
- Use examples sparingly. Few-shot examples are valuable, but each example consumes tokens. Use 1–2 examples instead of 5.
- Be specific, not verbose. “Return JSON” is better than “Please return the results in JSON format, which should be valid JSON that can be parsed by standard JSON parsers.”
A 200-token system prompt that’s optimised to 100 tokens saves 100 tokens per call. Across 10,000 daily calls, that’s 1M tokens saved per day.
Use Smaller Models When Possible
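GPT-4 is powerful but expensive. GPT-3.5 Turbo costs at least an order of magnitude less per token (check current pricing for the exact ratio). Smaller open-source models (Mistral, Llama 2) can be cheaper still, depending on how you host them.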
For each workflow, ask: “Do I really need GPT-4, or would GPT-3.5 Turbo be sufficient?” For many tasks—summarisation, extraction, simple reasoning—GPT-3.5 is excellent. Reserve GPT-4 for complex reasoning, creative tasks, or high-stakes decisions.
If you’re building production AI systems, work with a partner who understands model selection. At PADISO, our AI Strategy & Readiness service includes a detailed audit of which models are right for each workflow, which typically reduces token costs by 20–30% without sacrificing quality.
Batch Process When Possible
If you’re processing many similar requests, batch them together. Instead of calling the LLM 100 times with individual requests, call it once with 100 requests in a structured format.
For example, instead of:
keywords = []
for item in items:
    response = llm.generate(f"Extract keywords from: {item}")
    keywords.append(response)
Do:
prompt = "Extract keywords from each of the following:\n\n"
for item in items:
    prompt += f"- {item}\n"
response = llm.generate(prompt)
keywords = parse_response(response)
Batching reduces overhead and often improves consistency. The token cost is similar, but you get better throughput and lower latency.
Real-World Implementation Patterns
Pattern 1: Sequential Agents with Shared Budget
You have Agent A → Agent B → Agent C, each performing a different task. They share a global budget of 50,000 tokens.
class WorkflowExecutor:
    def __init__(self, global_budget=50000):
        self.global_budget = global_budget
        self.tokens_used = 0
        self.agents = []

    def add_agent(self, agent, hop_limit=10000):
        self.agents.append({"agent": agent, "hop_limit": hop_limit})

    def execute(self, initial_query):
        context = initial_query
        results = []
        for i, agent_config in enumerate(self.agents):
            agent = agent_config["agent"]
            hop_limit = agent_config["hop_limit"]
            remaining = self.global_budget - self.tokens_used
            if remaining < 1000:
                raise BudgetExhaustedError("Global budget exhausted")

            # Estimate tokens
            input_tokens = estimate_tokens(context, agent.system_prompt)
            if input_tokens > hop_limit:
                context = summarise_context(context, hop_limit - 1000)
                input_tokens = estimate_tokens(context, agent.system_prompt)

            # Execute agent
            response = agent.run(context)
            output_tokens = response.token_count
            hop_tokens = input_tokens + output_tokens
            if hop_tokens > hop_limit:
                raise HopBudgetExceededError(f"Hop {i} exceeded {hop_limit} tokens")

            self.tokens_used += hop_tokens
            results.append(response)
            context = response.text  # Pass response to next agent
        return results, self.tokens_used
This pattern is simple and works well for linear workflows. Each agent gets a hop limit, and the global budget is enforced across all hops.
Pattern 2: Branching Agents with Budget Allocation
Agent A calls two child agents (B and C) in parallel, each with its own budget allocation:
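from concurrent.futures import ThreadPoolExecutor


class ParallelWorkflow:
    def __init__(self, agent_a, agent_b, agent_c, global_budget=50000):
        self.agent_a = agent_a
        self.agent_b = agent_b
        self.agent_c = agent_c
        self.global_budget = global_budget

    def execute(self, query):
        # Allocate budgets
        agent_a_budget = 5000
        agent_b_budget = 20000
        agent_c_budget = 20000
        reserve = 5000  # Keep some for final synthesis
        total_allocated = agent_a_budget + agent_b_budget + agent_c_budget + reserve
        if total_allocated > self.global_budget:
            raise ValueError("Allocations exceed the global budget")

        # Run Agent A
        response_a = self.run_with_budget(self.agent_a, query, agent_a_budget)

        # Run Agents B and C in parallel (threads suit I/O-bound LLM calls)
        with ThreadPoolExecutor(max_workers=2) as pool:
            future_b = pool.submit(
                self.run_with_budget, self.agent_b, response_a.text, agent_b_budget
            )
            future_c = pool.submit(
                self.run_with_budget, self.agent_c, response_a.text, agent_c_budget
            )
            response_b = future_b.result()
            response_c = future_c.result()

        # Synthesise results
        synthesis_prompt = f"Synthesise:\n{response_b.text}\n{response_c.text}"
        return self.run_with_budget(self.agent_a, synthesis_prompt, reserve)

    def run_with_budget(self, agent, query, budget):
        # Truncate context if needed, keeping 20% of the budget for output
        input_tokens = estimate_tokens(query, agent.system_prompt)
        max_input = int(budget * 0.8)
        if input_tokens > max_input:
            # Crude character-level cut, scaled to the size of the overshoot
            query = query[:int(len(query) * max_input / input_tokens)]
        response = agent.run(query)
        # Raise instead of assert: asserts are stripped when Python runs with -O
        if response.token_count > budget:
            raise HopBudgetExceededError(
                f"Agent used {response.token_count} tokens, budget is {budget}"
            )
        return response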
This pattern works well when you have independent subtasks that can be parallelised. The key is pre-allocating budgets fairly across branches.
Pattern 3: Recursive Agents with Depth Limits
An agent calls itself recursively to refine answers. You need both a token budget and a depth limit to prevent infinite recursion:
class RecursiveAgent:
    def __init__(self, base_agent, max_depth=3, token_budget=10000):
        self.base_agent = base_agent
        self.max_depth = max_depth
        # Cap on the whole refinement chain: tokens_used accumulates across levels
        self.token_budget = token_budget

    def refine(self, query, depth=0, tokens_used=0):
        if depth >= self.max_depth:
            return query, tokens_used
        remaining_budget = self.token_budget - tokens_used
        if remaining_budget < 1000:
            return query, tokens_used

        # Generate refinement
        refinement_prompt = f"Refine this answer:\n{query}\n\nProvide a better version."
        response = self.base_agent.run(refinement_prompt)
        tokens_used += response.token_count
        if tokens_used > self.token_budget:
            return query, tokens_used  # Return original if budget exceeded

        # Recursively refine
        return self.refine(response.text, depth + 1, tokens_used)
This pattern is useful for iterative refinement, but be careful: recursion can quickly blow through budgets. Always have a depth limit and a cumulative token cap for the whole chain.
Common Pitfalls and How to Avoid Them
Pitfall 1: Not Estimating Tokens Before Calling the LLM
Problem: You call the LLM without estimating tokens first, assuming you have enough budget. The call exceeds your hop limit, and you fail ungracefully.
Solution: Always estimate tokens before calling the LLM. Use the provider’s official tokeniser (tiktoken for OpenAI, Anthropic’s SDK for Claude, etc.). If the estimated tokens exceed your limit, truncate or summarise the context before calling.
Pitfall 2: Passing the Entire Conversation History Through Every Hop
Problem: Each hop receives the full conversation history, causing token counts to explode exponentially. A 3-hop workflow that should cost 10,000 tokens ends up costing 50,000 tokens.
Solution: Be selective about what context you pass forward. Ask yourself: “Does the next agent actually need this information?” Often, the answer is no. Pass only the essential facts, summaries, or extracted data.
Pitfall 3: Not Monitoring Token Usage
Problem: You deploy a workflow without logging token usage. A month later, you get a surprise bill for $50,000 because a workflow is running out of control.
Solution: Instrument your agents to log token usage for every call. Set up a dashboard to monitor daily consumption. Configure alerts for anomalies. If you can’t see it, you can’t control it.
Pitfall 4: Setting Budgets Too Tight
Problem: You set a 5,000-token budget per hop, but legitimate workflows need 7,000 tokens. They fail repeatedly, and users get frustrated.
Solution: Start with generous budgets (e.g., 20,000 tokens per hop) and gradually reduce them based on actual usage data. Look at your logs: what’s the 95th percentile of tokens consumed per hop? Set your budget slightly above that.
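Deriving a hop budget from logged usage takes only a few lines. The sketch below uses a nearest-rank percentile and an assumed 10% headroom; `percentile` and `suggest_hop_budget` are hypothetical helpers, and the headroom value is a placeholder to adjust for your own tolerance.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_hop_budget(hop_token_counts, pct=95, headroom_pct=10):
    """Set the budget slightly above the chosen percentile of observed usage."""
    observed = percentile(hop_token_counts, pct)
    return math.ceil(observed * (100 + headroom_pct) / 100)


# Example: 100 logged hops using 100, 200, ..., 10,000 tokens
observed_usage = list(range(100, 10_100, 100))
print(percentile(observed_usage, 95))      # 9500
print(suggest_hop_budget(observed_usage))  # 10450
```

Re-running this against fresh logs each month keeps budgets tied to real usage rather than to the initial guess.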
Pitfall 5: Using the Same Budget for All Workflow Types
Problem: You set a global 50,000-token budget for all workflows. Simple retrieval queries finish in 5,000 tokens, but complex research workflows need 80,000 tokens. The complex workflows always fail.
Solution: Define different budget tiers for different workflow types. Simple queries get 10,000 tokens. Complex workflows get 100,000 tokens. Allocate budgets based on the actual requirements of each workflow type.
Pitfall 6: Ignoring System Prompts and Tool Definitions
Problem: You estimate tokens for the user’s query but forget to include the system prompt and tool definitions. The actual token count is 50% higher than you estimated, and you exceed your budget.
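Solution: When estimating tokens, include everything: the system prompt, the user’s query, any tool definitions, and any context being passed from previous hops. Where the provider’s official token counter accepts these as inputs (Anthropic’s count_tokens takes system and tools parameters, for example), use it. With a plain tokeniser such as tiktoken, count each component yourself and add them up.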
Summary and Next Steps
Token budget management across multi-hop agent workflows is non-negotiable for production AI systems. Without it, you face runaway costs, degraded user experience, and unpredictable failures. With it, you ship faster, control costs, and build reliable systems.
Here’s what you need to do:
- Define budget tiers for your workflow types. Start with the matrix in this guide and adjust based on your actual usage.
- Instrument your agents to track token usage at every step. Use the provider’s official tokeniser to estimate before you call and measure after you call.
- Implement per-hop limits to prevent individual agents from consuming your entire budget. Hard limits are better than soft recommendations.
- Be selective about context propagation. Don’t pass the entire conversation history through every hop. Summarise, extract, or truncate as needed.
- Monitor and alert. Set up logging and dashboards so you can see where tokens are being spent. Configure alerts for anomalies.
- Optimise continuously. Review your logs monthly. Identify which workflows are token-hungry. Simplify prompts, reduce retrieval context, or use smaller models.
If you’re building production AI systems and need help with architecture, cost control, or compliance, PADISO can help. Our AI & Agents Automation service includes token budget design and implementation. Our AI Strategy & Readiness audit identifies cost optimisation opportunities specific to your workflows. And our fractional CTO leadership can embed token budget discipline into your engineering culture.
Start small: pick one critical workflow, implement token budget tracking, and measure the impact. Once you see the cost savings and reliability improvements, roll it out across your entire system. The effort is minimal, but the payoff is huge.
For a deeper dive into your specific architecture, book a call with our team at PADISO. We’ll help you design token budgets that scale with your business.