
Token Counting in Production: Pre-Flight Checks That Save Money

Master token counting in production AI systems. Learn three critical failure modes and pre-flight checks that prevent runaway costs and audit failures.

The PADISO Team · 2026-05-23

Table of Contents

  1. Why Token Counting Matters in Production
  2. The Three Failure Modes Every Team Hits
  3. Building Your Pre-Flight Check Framework
  4. Token Counting Tools and Implementation
  5. Real-World Cost Blowout Scenarios
  6. Integrating Token Counting into Your Deployment Pipeline
  7. Monitoring and Alerting for Token Overruns
  8. Compliance and Audit Readiness
  9. Practical Implementation: From Theory to Production
  10. Why This Matters for Padiso Clients
  11. Next Steps: From Theory to Production

Why Token Counting Matters in Production

Token counting isn’t abstract infrastructure work—it’s the difference between shipping an AI agent that costs $50 per request and one that costs $5,000. Every Padiso client agent runs a token count before submission because we’ve watched teams burn through budgets in hours, not weeks.

When you deploy an agentic AI system into production, tokens are your unit of currency. Whether you’re using OpenAI’s GPT-4, Anthropic’s Claude, or Google’s Vertex AI, every input token and every output token costs money. More critically, every token consumed counts against your rate limits, your quota, and your compliance footprint. A single runaway loop can generate millions of tokens in minutes, turning a profitable product into a financial disaster.

Token counting in production isn’t just cost control—it’s operational hygiene. It’s the pre-flight check that catches problems before they bill. When you understand how many tokens your agent will consume before it runs, you can:

  • Predict costs accurately before deploying to production
  • Set hard limits that prevent runaway expenses
  • Optimise prompts to reduce token waste without sacrificing quality
  • Pass compliance audits by demonstrating cost controls and resource governance
  • Debug performance bottlenecks by correlating token consumption with latency

At Padiso, we’ve built token counting checks into every deployment pipeline. Our fractional CTO approach means we treat your AI infrastructure like operators, not consultants. That means measuring, validating, and optimizing before anything touches production.


The Three Failure Modes Every Team Hits

We’ve seen hundreds of agentic AI deployments. The teams that fail fast and cheap are the ones that catch these three failure modes before production:

Failure Mode 1: The Silent Prompt Bloat

Your prompt starts clean. “You are a customer support agent. Answer questions concisely.” About a dozen tokens. Then you add context. Product documentation. Customer history. Company policies. System instructions for error handling. Fallback prompts for edge cases. Suddenly your base prompt is 8,000 tokens before the user even types a question.

This is the most insidious failure mode because it’s invisible. Every single request now starts with an 8,000-token overhead. If you’re running 1,000 requests per day, that’s 8 million tokens of pure waste. At roughly $0.03 per 1K input tokens, that’s about $240 per day in wasted context (around $7,200 per month), and your product doesn’t work any better.

We caught this at a Sydney fintech startup last year. Their AI agent was supposed to help with loan applications. The prompt included the entire product manual, regulatory compliance text, and a 20-example few-shot learning block. Base prompt: 12,000 tokens. They were processing 500 applications per day. That’s 6 million tokens daily in prompt overhead alone, costing them $180 per day just to load context.

The pre-flight check caught it in staging. We ran the prompt through Tiktoken, counted the tokens, and immediately saw the problem. Solution: compress the prompt, move static context to retrieval-augmented generation (RAG), and use prompt templates instead of monolithic instructions. Result: 2,000-token base prompt, same quality, 80% cost reduction.

Failure Mode 2: The Hallucinated Tool Call Loop

Your agent has access to tools—APIs, databases, external services. It’s supposed to call them intelligently. But sometimes the model hallucinates. It invents tool names that don’t exist. It calls the same tool in a loop because it misunderstands the response. It tries to pass invalid parameters, gets an error, and tries again with a slightly different invalid parameter.

This is catastrophic for token counting. A single hallucinated loop can consume 50,000 tokens in seconds. The agent tries to call a tool, gets an error response (tokens consumed), tries again, gets another error, and repeats. Each iteration consumes the full conversation history plus the error message plus the new attempt. By iteration 20, you’ve consumed more tokens than your entire daily budget.

We documented this in detail in our Agentic AI Production Horror Stories guide, where we broke down a real case: a customer service agent that got stuck in a tool-calling loop for 3 minutes before hitting the rate limit. Total token consumption: 2.4 million tokens. Cost: $36. The agent was supposed to answer a simple question in 50 tokens.

The pre-flight check for this is brutal but essential: instrument your agent with token counting at every step. Count tokens before the API call, count tokens after the response, and set a hard circuit breaker if token consumption exceeds a threshold (e.g., 10,000 tokens for a single request). This catches the loop before it balloons.
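That per-step check can be sketched as a small accumulator that every tool-call round trip reports into. The 10,000-token threshold matches the example above; the class and exception names are illustrative:

```python
class CircuitBreakerTripped(Exception):
    pass

class TokenCircuitBreaker:
    """Accumulates per-step token counts for a single request and
    hard-stops the agent once the threshold is crossed."""

    def __init__(self, max_tokens=10_000):
        self.max_tokens = max_tokens
        self.consumed = 0

    def record(self, step_tokens):
        # Call this after every model/tool round trip with its token count.
        self.consumed += step_tokens
        if self.consumed > self.max_tokens:
            raise CircuitBreakerTripped(
                f"{self.consumed} tokens consumed, limit {self.max_tokens}")

breaker = TokenCircuitBreaker(max_tokens=10_000)
breaker.record(4_000)   # first tool-call round trip: fine
breaker.record(4_000)   # second: fine
try:
    breaker.record(4_000)  # third pushes past 10,000 and trips the breaker
except CircuitBreakerTripped as e:
    print(e)
```

A hallucinated loop now dies on its third iteration instead of its twentieth.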

Failure Mode 3: The Uncontrolled Context Window Expansion

Your agent maintains conversation history. That’s good for context. But if you’re not actively pruning or summarizing old messages, the context window grows with every turn. By turn 50 of a conversation, you’re including all 50 previous turns in your API request. Per-request tokens grow linearly with turn count, and the total billed across the conversation grows quadratically.

Here’s the math: a typical conversation turn is 200-500 tokens. By turn 50, you’re including 10,000-25,000 tokens of historical context. By turn 100, you’re at 20,000-50,000 tokens. If you’re running multi-turn conversations at scale, this becomes a runaway cost driver.
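A few lines of arithmetic make that growth concrete. The 1,000-token first turn and 400 tokens of history per turn are illustrative assumptions inside the ranges above:

```python
def turn_tokens(turn, base_prompt=1000, per_turn=400):
    # Tokens billed for the request at a given turn: the base prompt plus
    # every previous turn re-sent as history, plus the new turn itself.
    return base_prompt + per_turn * turn

def conversation_tokens(turns, **kwargs):
    # Total billed across the conversation. Because each request re-sends
    # all prior history, the cumulative total grows quadratically.
    return sum(turn_tokens(t, **kwargs) for t in range(1, turns + 1))

print(turn_tokens(50))          # 21,000 tokens for the turn-50 request alone
print(conversation_tokens(50))  # 560,000 tokens billed across all 50 turns
```

One 50-turn conversation bills over half a million tokens under these assumptions, most of it re-sent history.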

One of our portfolio companies—a venture studio client building an AI research assistant—hit this in staging. Users would start a research session and run 30-40 turns of questions and answers. The agent was including the full history in every request. By turn 30, the context window was so large that the model’s response quality actually degraded (the model gets confused by too much context). They were paying premium rates for worse outputs.

The pre-flight check: implement context management. Summarize or prune conversation history after every N turns. Use sliding windows instead of full history. Count tokens before and after context management to validate the optimization. We reduced their per-conversation token cost by 65% by implementing smart summarization.
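A minimal sketch of that sliding-window approach, assuming chat-style message dicts; in practice the `summarize` callable would be a cheap LLM call rather than the placeholder used here:

```python
def manage_context(history, keep_last=10, summarize=None):
    # Keep the last N turns verbatim; collapse everything older into a
    # single summary message so the context window stays bounded.
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    text = summarize(old) if summarize else f"[Summary of {len(old)} earlier turns]"
    return [{"role": "system", "content": text}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(40)]
pruned = manage_context(history)
print(len(pruned))  # 11: one summary message plus the last 10 turns
```

Counting tokens on `history` versus `pruned` before and after this step is exactly the validation described above.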


Building Your Pre-Flight Check Framework

A pre-flight check is a validation gate that runs before any code touches production. It answers one question: “Will this agent consume tokens as expected?” Here’s the framework we use at Padiso:

Step 1: Establish Your Token Budget

Before you write a single line of prompt code, define your token budget. This is the maximum number of tokens you’re willing to consume per request, per user, per day.

Start with a business constraint: What’s your maximum acceptable cost per request? If you’re running a customer support agent and you can afford $0.10 per interaction, work backwards. At current OpenAI pricing (roughly $0.03 per 1K input tokens, $0.06 per 1K output tokens), that’s roughly 2,000-3,000 tokens per request.

Now define tiers:

  • Input token budget: Prompts, context, user queries. This should be 60-70% of your total budget.
  • Output token budget: Model responses. This should be 30-40% of your total budget.
  • Safety margin: Add 20% on top as a buffer for edge cases.

For our fintech example above, they set a budget of 3,000 tokens per request (input + output combined). That meant their prompt had to stay under 2,000 tokens, leaving 1,000 for user query and response.
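Working backwards from a cost target is easy to script. This sketch defaults to the GPT-4-era prices quoted above; the 65% input share and 20% safety margin are assumptions you should tune to your own traffic:

```python
def token_budget(cost_per_request, input_price_per_1k=0.03,
                 output_price_per_1k=0.06, input_share=0.65,
                 safety_margin=0.20):
    # Blend input and output prices by the expected input share, divide
    # the cost target by the blended rate, then hold back a safety buffer.
    blended = (input_share * input_price_per_1k
               + (1 - input_share) * output_price_per_1k)
    raw_budget = cost_per_request / blended * 1000
    return int(raw_budget / (1 + safety_margin))

print(token_budget(0.10))  # roughly 2,000 tokens for a $0.10 cost target
```

Pin the resulting number in your config and fail any pre-flight check that exceeds it.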

Step 2: Count Your Baseline Prompt

Take your system prompt and count the tokens. Use the official tools: Tiktoken for OpenAI models, Anthropic’s token counting API for Claude, and Hugging Face tokenizers for open-weight models.

Don’t estimate. Don’t guess. Count.

Here’s a simple Python snippet for OpenAI:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
system_prompt = """You are a customer support agent..."""
token_count = len(encoding.encode(system_prompt))
print(f"System prompt tokens: {token_count}")

Run this in your CI/CD pipeline. If the token count exceeds your baseline, fail the build. This catches prompt bloat before it reaches staging.

Step 3: Simulate User Queries and Count Total Consumption

Your prompt is 2,000 tokens. But what about the user query? What about context you’re injecting from RAG or a database lookup?

Create a test suite of realistic user queries. For each one, simulate the full request:

  1. System prompt (2,000 tokens)
  2. User query (estimate 50-200 tokens)
  3. Context from RAG (estimate 500-2,000 tokens)
  4. Few-shot examples if using them (estimate 200-500 tokens)
  5. Tool definitions if using function calling (estimate 200-500 tokens)

Count the total for each test case. If any exceeds your budget, you’ve found a failure mode before production.
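Step 3 reduces to summing component counts per test case. The sketch below takes the tokenizer as a parameter; the crude word-count stand-in is for illustration only, and in practice you would pass tiktoken’s `encode` (wrapped in `len`) instead:

```python
def simulate_request(counter, system_prompt, query, rag_context="",
                     examples="", tools=""):
    # Sum token counts across every component that will be sent in
    # the request: prompt, query, RAG context, few-shot examples, tools.
    parts = [system_prompt, query, rag_context, examples, tools]
    return sum(counter(p) for p in parts)

# Crude stand-in tokenizer for illustration only:
rough = lambda text: len(text.split())
total = simulate_request(rough, "You are a support agent.", "Where is my order?")
print(total)  # 9 with the crude counter
```

Loop this over your 50-query test suite and assert each total stays under budget.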

Step 4: Set Circuit Breakers and Hard Limits

In production, implement hard limits at the agent level:

import tiktoken

class TokenBudgetExceeded(Exception):
    pass

MAX_TOKENS_PER_REQUEST = 3000
encoding = tiktoken.encoding_for_model("gpt-4")

def enforce_budget(messages):
    # Count the tokens in the full prompt before calling the model
    tokens_so_far = sum(len(encoding.encode(m["content"])) for m in messages)
    if tokens_so_far > MAX_TOKENS_PER_REQUEST:
        raise TokenBudgetExceeded(
            f"Request would consume {tokens_so_far} tokens, "
            f"limit is {MAX_TOKENS_PER_REQUEST}")

This is not a soft warning. This is a hard stop. If you’re about to exceed your token budget, fail the request and alert the user. Better to say “I couldn’t process that” than to burn $1,000 on a single request.


Token Counting Tools and Implementation

You don’t need to build token counting from scratch. The ecosystem has solid tools. Here’s what we use at Padiso:

LangChain Token Counting

LangChain’s token counting is production-grade. It integrates with your agent code and tracks tokens across the entire chain:

from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = agent.run("What's my account balance?")
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Total cost: ${cb.total_cost}")

This gives you real-time visibility into token consumption. Wrap it around every agent execution in production.

Anthropic’s Native Token Counting

If you’re using Claude, Anthropic’s token counting API is built in:

from anthropic import Anthropic

client = Anthropic()

message = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)
print(f"Input tokens: {message.input_tokens}")

Run this before every production request. It’s a single API call and it’s free.

Monitoring and Dashboards

Don’t just count tokens—monitor them. We use Datadog and custom dashboards to track:

  • Average tokens per request: Should be stable and predictable
  • 95th percentile token consumption: Catches outliers
  • Daily token spend: Should align with your budget
  • Token consumption by agent type: Which agents are most expensive?
  • Context window growth over time: Are conversations bloating?

Set alerts: if average tokens per request spike by 20%, alert the team. If daily spend exceeds budget by 10%, trigger an incident.


Real-World Cost Blowout Scenarios

Let’s ground this in actual numbers. Here are three scenarios we’ve seen at Padiso clients:

Scenario 1: The Unoptimised RAG Pipeline

Company: E-commerce platform (Series A, Sydney-based)

Agent: Product recommendation engine using RAG

Problem: The RAG pipeline was retrieving the top 50 product documents for every query, passing all 50 into the context. Each product document was 1,000 tokens (description, specs, reviews). That’s 50,000 tokens of context per request.

Token consumption:

  • System prompt: 500 tokens
  • RAG context (50 documents): 50,000 tokens
  • User query: 100 tokens
  • Model response: 200 tokens
  • Total: 50,800 tokens per request

At $0.03 per 1K input tokens and $0.06 per 1K output tokens, that’s roughly $1.53 per request. Running 1,000 requests per day: $1,530 per day, or $45,900 per month.

Pre-flight check: We counted tokens in staging and immediately saw the problem. Solution: retrieve only the top 5 documents, re-rank them with a smaller model, and pass only the top 2 into the main agent. New token count: 2,500 tokens per request. Cost: $0.08 per request, $80 per day, $2,400 per month.

Savings: $43,500 per month. The agent’s recommendation quality actually improved because it was less confused by too much context.
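The retrieve-then-re-rank fix generalises. In this sketch the toy `retriever` and word-overlap `reranker` stand in for a real vector store and a cheaper cross-encoder model:

```python
def build_rag_context(query, retriever, reranker, k_retrieve=5, k_keep=2):
    # Stage 1: pull a small candidate set from the vector store.
    # Stage 2: re-score with a cheaper model and keep only the best few,
    # so the main agent never sees 50 documents of context.
    candidates = retriever(query, k_retrieve)
    ranked = sorted(candidates, key=lambda doc: reranker(query, doc),
                    reverse=True)
    return "\n\n".join(ranked[:k_keep])

# Toy stand-ins for illustration:
docs = ["laptop specs", "laptop reviews", "phone specs", "tv specs",
        "phone reviews"]
retriever = lambda q, k: docs[:k]
reranker = lambda q, doc: sum(word in doc for word in q.split())
print(build_rag_context("laptop specs", retriever, reranker))
```

Only the top `k_keep` documents reach the expensive model, which is where the 50,000-token context collapsed to 2,500.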

Scenario 2: The Runaway Few-Shot Loop

Company: Fintech startup (Seed stage, building loan origination AI)

Agent: Loan application classifier

Problem: The prompt included 50 examples of loan applications (few-shot learning). Each example was 400 tokens. That’s 20,000 tokens of examples in the base prompt. Plus, the agent was using function calling, and when the model got confused about which function to call, it would retry. Each retry included the full context plus the error message.

Token consumption (worst case):

  • System prompt: 500 tokens
  • Few-shot examples: 20,000 tokens
  • User query: 200 tokens
  • First function call attempt: 500 tokens
  • Error response: 200 tokens
  • Retry 1: 500 tokens
  • Retry 2: 500 tokens
  • Retry 3: 500 tokens (hits circuit breaker)
  • Total: 22,900 tokens for a single request

Cost: $0.70 per request. Processing 200 applications per day: $140 per day, or $4,200 per month. And 30% of requests hit the retry loop, so actual cost was closer to $5,000 per month.

Pre-flight check: We instrumented the agent with per-step token counting. We immediately saw that few-shot examples were consuming 85% of tokens. Solution: move examples to a retrieval system (store them in a vector database, retrieve only the most relevant 3 examples per request). New token count: 2,000 tokens per request. Cost: $0.06 per request, $12 per day, $360 per month.

Savings: $4,640 per month. Plus, the agent’s accuracy improved because it was using more relevant examples instead of a fixed set.
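Moving few-shot examples into retrieval looks roughly like this; the hand-rolled cosine similarity and toy two-dimensional vectors stand in for a real embedding model and vector database:

```python
def select_examples(query_vec, example_bank, k=3):
    # Rank stored examples by cosine similarity to the query embedding
    # and keep only the k most relevant ones for the prompt.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5)
        return dot / norm
    ranked = sorted(example_bank,
                    key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return [ex["text"] for ex in ranked[:k]]

# Toy example bank with made-up embeddings:
bank = [
    {"text": "approved home loan", "vec": [1.0, 0.0]},
    {"text": "rejected car loan", "vec": [0.0, 1.0]},
    {"text": "approved personal loan", "vec": [0.9, 0.1]},
    {"text": "incomplete application", "vec": [0.5, 0.5]},
]
print(select_examples([1.0, 0.0], bank, k=3))  # most relevant examples first
```

Three targeted examples in the prompt replace fifty fixed ones, which is where the 20,000-token block disappeared.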

Scenario 3: The Uncontrolled Context Window

Company: B2B SaaS platform (Series B, enterprise customer support)

Agent: Multi-turn customer support bot

Problem: The agent maintained full conversation history. Users would run 50-100 turn conversations. By turn 50, the context window was 25,000 tokens. By turn 100, it was 50,000 tokens.

Token consumption (by conversation turn):

  • Turn 1: 1,000 tokens (prompt + query)
  • Turn 5: 3,000 tokens (prompt + 4 turns of history + query)
  • Turn 10: 6,000 tokens (prompt + 9 turns of history + query)
  • Turn 50: 30,000 tokens (prompt + 49 turns of history + query)

Average conversation: 30 turns. Average tokens per turn: 15,000. Cost per conversation: $0.45. Processing 500 conversations per day: $225 per day, or $6,750 per month.

Pre-flight check: We tracked token consumption over conversation length and immediately saw the runaway growth. Solution: implement sliding window context management. Keep only the last 10 turns in context, summarise older turns into a single summary message. New token consumption: stable at 3,000 tokens per turn. Cost per conversation: $0.09. Processing 500 conversations per day: $45 per day, or $1,350 per month.

Savings: $5,400 per month. Plus, response quality improved because the model wasn’t drowning in irrelevant context.


Integrating Token Counting into Your Deployment Pipeline

Token counting isn’t a one-time check—it’s a continuous process. Here’s how we integrate it into every Padiso deployment:

Pre-Commit Hooks

Before code is even committed, run token counts on any changed prompts:

#!/bin/bash
# .git/hooks/pre-commit

for prompt_file in $(git diff --cached --name-only | grep -E '\.prompt\.txt$'); do
    token_count=$(python count_tokens.py "$prompt_file")
    if [ "$token_count" -gt 2000 ]; then
        echo "Error: Prompt exceeds token limit ($token_count > 2000)"
        exit 1
    fi
done

This catches prompt bloat before it reaches code review.
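The hook above assumes a `count_tokens.py` helper. A minimal sketch might look like this, with a rough words-based fallback (about 4 tokens per 3 words, a common rule of thumb) for environments where tiktoken isn’t installed:

```python
# count_tokens.py -- prints the token count of a prompt file for the hook.
import sys

def count_tokens(text, encoder=None):
    # Use a real tokenizer when provided (e.g. tiktoken's encode);
    # otherwise fall back to a rough ~4/3 tokens-per-word estimate.
    if encoder is not None:
        return len(encoder(text))
    return len(text.split()) * 4 // 3

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        print(count_tokens(f.read()))
```

The hook compares the printed number against the limit and blocks the commit on overrun.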

CI/CD Pipeline Validation

In your CI/CD pipeline (GitHub Actions, GitLab CI, etc.), add a token counting step:

- name: Validate Token Counts
  run: |
    pytest tests/test_token_counts.py::test_system_prompt_under_limit
    pytest tests/test_token_counts.py::test_rag_context_under_limit
    pytest tests/test_token_counts.py::test_tool_definitions_under_limit

If any test fails, the build fails. No exceptions.
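The test file referenced in that pipeline step might look like the sketch below. The prompt text, sample context, limits, and the rough word-based counter are placeholders; in real CI you would count with tiktoken and load the actual prompt files:

```python
# tests/test_token_counts.py (sketch)

def rough_count(text):
    # Placeholder estimate (~4/3 tokens per word); use tiktoken in real CI.
    return len(text.split()) * 4 // 3

SYSTEM_PROMPT = "You are a customer support agent. Answer questions concisely."
RAG_CONTEXT_SAMPLE = "Retrieved product documentation goes here..."
TOOL_DEFINITIONS = '{"name": "create_ticket", "parameters": {}}'

def test_system_prompt_under_limit():
    assert rough_count(SYSTEM_PROMPT) <= 2000

def test_rag_context_under_limit():
    assert rough_count(RAG_CONTEXT_SAMPLE) <= 2000

def test_tool_definitions_under_limit():
    assert rough_count(TOOL_DEFINITIONS) <= 500
```

When a prompt change pushes a count past its limit, the corresponding test fails and the build stops there.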

Staging Environment Instrumentation

Before production, run your agent against a realistic test suite in staging. Instrument every request with token counting:

from datetime import datetime
import tiktoken

class TokenCountingMiddleware:
    """Wraps an agent and logs approximate token consumption per request."""

    def __init__(self, agent, budget_per_request=3000, model="gpt-4"):
        self.agent = agent
        self.budget = budget_per_request
        self.encoding = tiktoken.encoding_for_model(model)
        self.logs = []

    def count_tokens(self, text):
        return len(self.encoding.encode(text))

    def run(self, query):
        result = self.agent.run(query)
        # Approximation: query in plus response out. In production, prefer
        # the exact usage numbers returned by the provider's API.
        tokens_consumed = self.count_tokens(query) + self.count_tokens(result)
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "query": query,
            "tokens_consumed": tokens_consumed,
            "budget_exceeded": tokens_consumed > self.budget,
            "result": result,
        }
        self.logs.append(log_entry)

        if tokens_consumed > self.budget:
            print(f"WARNING: Request consumed {tokens_consumed} tokens (budget: {self.budget})")

        return result

    def report(self):
        total_tokens = sum(log["tokens_consumed"] for log in self.logs)
        avg_tokens = total_tokens / len(self.logs)
        max_tokens = max(log["tokens_consumed"] for log in self.logs)
        budget_violations = sum(1 for log in self.logs if log["budget_exceeded"])

        print(f"Total requests: {len(self.logs)}")
        print(f"Total tokens: {total_tokens}")
        print(f"Average tokens per request: {avg_tokens:.0f}")
        print(f"Max tokens in a single request: {max_tokens}")
        print(f"Budget violations: {budget_violations}")

Run this against 1,000 realistic queries in staging. If you see budget violations, fix the agent before production.

Production Monitoring

Once in production, continuous monitoring is essential. We use structured logging and dashboards:

import structlog
import tiktoken

logger = structlog.get_logger()

class ProductionTokenCounter:
    def __init__(self, agent, budget=3000, alert_threshold=0.8, model="gpt-4"):
        self.agent = agent
        self.budget = budget
        self.alert_threshold = alert_threshold
        self.encoding = tiktoken.encoding_for_model(model)

    def get_token_estimate(self, text):
        # Estimate with the model's tokenizer; prefer the exact usage
        # numbers from the provider's API response where available.
        return len(self.encoding.encode(text))

    def run(self, user_id, query):
        tokens_in = self.get_token_estimate(query)
        result = self.agent.run(query)
        tokens_out = self.get_token_estimate(result)
        total_tokens = tokens_in + tokens_out

        logger.info(
            "agent_execution",
            user_id=user_id,
            query_length=len(query),
            tokens_consumed=total_tokens,
            budget=self.budget,
            budget_utilisation=total_tokens / self.budget,
            result_length=len(result)
        )

        if total_tokens > self.budget * self.alert_threshold:
            logger.warning(
                "token_budget_warning",
                user_id=user_id,
                tokens_consumed=total_tokens,
                budget=self.budget
            )

        if total_tokens > self.budget:
            logger.error(
                "token_budget_exceeded",
                user_id=user_id,
                tokens_consumed=total_tokens,
                budget=self.budget
            )

        return result

This logs every execution with token consumption. You can now query your logs to find:

  • Which users are driving high token consumption?
  • Which queries are most expensive?
  • Are token counts increasing over time (sign of context bloat)?
  • Are we hitting budget violations regularly?

Monitoring and Alerting for Token Overruns

Token counting is only useful if you act on the data. Here’s how we set up monitoring at Padiso:

Key Metrics to Track

  1. Average tokens per request: Should be stable. If it spikes, something changed.
  2. 95th percentile token consumption: Catches outliers. If this is 2x your mean, you have a tail risk.
  3. Daily token spend: Should be predictable. If it jumps 30%, investigate.
  4. Budget utilisation rate: Track what percentage of your token budget you’re using daily. Aim for 60-80%.
  5. Request failure rate due to token limits: Should be near zero. If it’s >1%, your budget is too tight or your agent is broken.

Alert Rules

Set up alerts in your monitoring tool (Datadog, New Relic, CloudWatch, etc.):

  • Alert 1: If average tokens per request increases by >15% in a 1-hour window, page the team.
  • Alert 2: If 95th percentile token consumption exceeds 80% of your budget, page the team.
  • Alert 3: If daily token spend exceeds your daily budget by >10%, page the team.
  • Alert 4: If request failure rate due to token limits exceeds 1%, page the team.
  • Alert 5: If a single request consumes >90% of your per-request budget, log it for review.

These aren’t optional. These are the guardrails that prevent financial disasters.

Incident Response

When an alert fires, have a playbook:

  1. Immediate: Check if a deployment happened in the last hour. If so, roll back.
  2. Diagnosis: Query your token logs. Which queries are expensive? Is it a specific user? A specific query type?
  3. Remediation: If it’s prompt bloat, trim the prompt. If it’s context bloat, implement summarization. If it’s a bug, fix it.
  4. Validation: After remediation, re-run your pre-flight checks in staging before re-deploying.
  5. Postmortem: Why did this slip through? Update your pre-flight checks to catch it next time.

We’ve seen teams skip incident response and just pay the bill. That’s how you burn through budgets. Treat every token overrun like a production incident.


Compliance and Audit Readiness

Token counting isn’t just about cost—it’s about compliance. When you’re pursuing SOC 2 compliance or ISO 27001 certification, regulators care about resource governance.

Why Auditors Care About Token Counting

Auditors (and compliance tools like Vanta) look for evidence that you:

  1. Control resource consumption: Can you prove you’re not letting AI agents run wild?
  2. Monitor costs: Can you show that you’re tracking and alerting on unusual spend?
  3. Have incident response: When something goes wrong, can you demonstrate you caught it and fixed it?
  4. Test before production: Can you show that you validated token consumption in staging before deploying?

Token counting gives you all of this.

Documentation for Audits

Keep records of:

  1. Token budgets: Document your per-request, per-user, per-day token budgets and why you set them.
  2. Pre-flight checks: Document what pre-flight checks you run before every deployment.
  3. Monitoring and alerting: Document your alert rules and why they’re set at those thresholds.
  4. Incident logs: Document every incident where token consumption exceeded expectations, what you did about it, and what you changed to prevent it happening again.
  5. Test results: Document your staging test results showing token consumption stayed within budget.

This becomes your evidence that you have “adequate controls over AI resource consumption.” Auditors love this.

Vanta Integration

If you’re using Vanta for compliance automation, integrate your token counting logs. Vanta can ingest your token consumption data and demonstrate to auditors that you’re monitoring and controlling resource usage.

We help Padiso clients integrate token counting into their Vanta compliance dashboards as part of our Security Audit and Compliance services. This turns raw logs into compliance evidence.


Practical Implementation: From Theory to Production

Let’s walk through a complete example. You’re building a customer support agent using agentic AI. Here’s how token counting fits into your workflow:

Week 1: Design Phase

  1. Define token budget: You can afford $0.05 per request. That’s roughly 1,500 tokens per request.
  2. Allocate budget: System prompt 800 tokens, context 500 tokens, query 100 tokens, response 100 tokens.
  3. Document assumptions: “We assume average customer query is 100 tokens. We assume RAG context is 500 tokens. We assume response is 100 tokens.”

Week 2: Development Phase

  1. Write system prompt: “You are a customer support agent…” Count tokens: 450 tokens. ✓ Under budget.
  2. Implement RAG: Retrieve top 3 documents per query. Estimate context: 400 tokens. ✓ Under budget.
  3. Add function calling: Define tools (create ticket, lookup order, etc.). Estimate tool definitions: 200 tokens. ✓ Under budget.
  4. Write test suite: Create 50 realistic customer queries. Count tokens for each. Max: 1,200 tokens. ✓ Under budget.

Week 3: Staging Phase

  1. Deploy to staging: Run your test suite against the agent in staging.
  2. Instrument with token counting: Log tokens for every request.
  3. Analyze results: Average 800 tokens per request. 95th percentile: 1,100 tokens. ✓ Under budget.
  4. Load test: Run 100 concurrent requests. Average 850 tokens per request. ✓ Stable.
  5. Generate report: “Token consumption validated. All requests under budget. Ready for production.”

Week 4: Production Phase

  1. Deploy with monitoring: Agent goes live with token counting instrumentation.
  2. Set alerts: Alert if average tokens exceed 900, or if any request exceeds 1,400 tokens.
  3. Daily reviews: Check token consumption dashboard every morning.
  4. Incident response: When alert fires (and it will), follow your playbook.

Ongoing: Optimization Phase

  1. Monthly analysis: Look at token consumption trends. Is it stable? Increasing?
  2. Identify optimizations: Which queries are most expensive? Can you optimize the prompt?
  3. A/B test: Test a new prompt version in staging. Does it reduce tokens without hurting quality?
  4. Deploy improvements: Roll out optimizations to production.
  5. Measure impact: Did token consumption decrease? Did quality stay the same or improve?

At Padiso, we’ve seen teams reduce token consumption by 40-60% through this process without sacrificing quality. The key is treating token counting as an ongoing operational discipline, not a one-time check.


Why This Matters for Padiso Clients

Every Padiso client agent runs a token count before submission because we’ve learned the hard way that token counting is the difference between a sustainable AI product and a financial disaster.

When you work with us on CTO as a Service or AI & Agents Automation, token counting is built into our deployment process. We don’t just build agents—we build agents that we can afford to run.

Our approach is grounded in operational reality. We’ve helped Sydney startups and enterprise teams ship agentic AI at scale. We’ve caught runaway loops before they cost $50,000. We’ve optimised prompts to cut costs by 80%. We’ve built monitoring systems that catch problems before they become incidents.

If you’re building agentic AI and you’re not counting tokens, you’re flying blind. You don’t know if your agent costs $0.01 per request or $10 per request. You don’t know if you’re trending toward profitability or bankruptcy. You don’t know if a deployment broke your economics.

Token counting is the operational discipline that separates teams that ship sustainable AI from teams that ship expensive mistakes.


Next Steps: From Theory to Production

Here’s your action plan:

This Week

  1. Pick a model: OpenAI, Anthropic, Google, or Hugging Face. Choose one.
  2. Install a token counter: Use Tiktoken for OpenAI, Anthropic’s API for Claude, or Hugging Face tokenizers for open models.
  3. Count your current prompt: How many tokens is your system prompt right now? If you don’t have one, write a basic one and count it.
  4. Set a token budget: What’s your maximum acceptable cost per request? Work backwards from that to a token budget.

Next 2 Weeks

  1. Build a test suite: Create 50-100 realistic queries for your use case.
  2. Count tokens for each test case: Run your prompt + context + query through your token counter for each test case.
  3. Identify outliers: Which queries consume the most tokens? Why?
  4. Optimise: Trim your prompt, implement RAG to reduce context, or use other techniques to bring token consumption under budget.
  5. Validate in staging: Deploy your agent to a staging environment and run your test suite against it with token counting instrumentation.

Next Month

  1. Deploy with monitoring: Add token counting to production. Log every request.
  2. Set alerts: Configure alerts for token budget violations.
  3. Review daily: Check your token consumption dashboard every morning.
  4. Incident response: When alerts fire, follow your playbook.
  5. Optimise: Identify the most expensive queries and optimise them.

Ongoing

  1. Monthly reviews: Analyse token consumption trends. Are you trending up or down?
  2. Continuous optimisation: Test new prompts, new retrieval strategies, new context management approaches.
  3. Compliance: Maintain documentation of your token budgets, pre-flight checks, and incident response for audits.

If you’re building agentic AI at scale and you want to avoid the three failure modes we’ve documented, token counting is non-negotiable. It’s the pre-flight check that saves money, prevents incidents, and keeps your AI product sustainable.

At Padiso, we integrate token counting into every deployment. We’ve seen it catch problems before they become disasters. We’ve seen it reduce costs by 60-80%. We’ve seen it turn unsustainable AI products into profitable, scalable systems.

If you’re serious about shipping agentic AI that works, start counting tokens today. If you want help building a token counting framework into your deployment pipeline, we can help. Our AI & Agents Automation service includes everything from pre-flight checks to production monitoring. We’ve helped Sydney startups and enterprise teams ship AI at scale. We can help you too.

The teams that win aren’t the ones with the fanciest prompts or the most sophisticated agents. They’re the ones that measure, monitor, and optimise relentlessly. Token counting is where that discipline starts.