PADISO.ai: AI Agent Orchestration Platform - Launching May 2026

Claude Token Economics: The 2026 Cost Lever You Are Underusing

Master Claude token economics to cut AI costs by 30–60%. Real benchmarks, code patterns, and implementation strategies for 2026.

The PADISO Team · 2026-06-03

Table of Contents

  1. Why Token Economics Matter Now
  2. Understanding Claude’s Token Pricing Model
  3. The Real Cost Breakdown: Input vs. Output Tokens
  4. Prompt Compression and Caching Strategies
  5. Batch Processing: The 50% Margin Play
  6. Context Window Optimisation for Long-Form Workflows
  7. Code Patterns to Implement This Week
  8. Real Benchmarks: Where Teams Are Saving
  9. Avoiding Common Token Waste
  10. Building a Token Budget and Monitoring
  11. Next Steps: Your 90-Day Roadmap

Why Token Economics Matter Now

If you’re shipping AI applications in 2026, token economics are no longer a nice-to-have optimisation—they’re a margin lever that separates profitable AI businesses from cash-burning ones. The difference between a carelessly built Claude integration and one optimised for token efficiency can be 30–60% of your total API spend, compounding across millions of requests.

The stakes are higher than they were even 12 months ago. As the report “Anthropic doubles estimate for Claude Code token spend” suggests, enterprise teams have been underestimating token consumption by roughly 2x. Teams that don’t actively manage token spend find themselves with API bills that grow faster than revenue, and by the time they notice, architectural changes are painful and expensive.

This isn’t theoretical. At PADISO, we’ve worked with founders and operators building agentic AI systems, multi-tenant SaaS platforms, and AI-heavy workflows across financial services, insurance, and enterprise software. The teams winning aren’t the ones with the smartest models—they’re the ones who treat tokens like lines of code. Every token has a cost. Every request should be audited. Every pattern should be measured.

This guide walks you through the concrete, implementable tactics to optimise Claude token spend. You’ll learn the actual cost structure, the code patterns that work, the benchmarks from real production systems, and the 90-day roadmap to lock in savings.


Understanding Claude’s Token Pricing Model

Before you can optimise, you need to understand what you’re paying for. Claude’s pricing model is deceptively simple on the surface but has several layers that most teams miss.

The Input vs. Output Token Split

Claude charges separately for input tokens and output tokens. At the rates quoted in this guide, input tokens cost about one-fifth the price of output tokens. This asymmetry is critical: it means that a 100,000-token prompt followed by a 500-token response costs dramatically less than a 500-token prompt followed by a 100,000-token response.

According to Anthropic’s official pricing documentation, Claude 3.5 Sonnet (the workhorse model for most teams) charges approximately $3 per million input tokens and $15 per million output tokens. Claude 3 Opus is more expensive and slower, but stronger on complex reasoning. Claude 3 Haiku is cheaper but less capable on nuanced tasks.

This pricing structure incentivises a specific architectural pattern: front-load your context (the input), minimise your output requests, and batch where possible. Teams that ignore this pattern end up with output token spend that’s 3–5x higher than it needs to be.
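
The asymmetry is easy to check with a few lines of arithmetic. The sketch below uses the approximate Sonnet rates quoted above ($3 and $15 per million tokens); the function name `request_cost` and the constants are illustrative and not part of any SDK.

```python
# Illustrative rates taken from the text above (USD per million tokens).
INPUT_RATE_PER_MTOK = 3.00
OUTPUT_RATE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate USD cost of a single request."""
    input_cost = input_tokens * INPUT_RATE_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_RATE_PER_MTOK / 1_000_000
    return input_cost + output_cost


# Long prompt, short answer.
heavy_input = request_cost(input_tokens=100_000, output_tokens=500)
# Short prompt, long answer.
heavy_output = request_cost(input_tokens=500, output_tokens=100_000)

print(f"100k in / 500 out: ${heavy_input:.4f}")
print(f"500 in / 100k out: ${heavy_output:.4f}")
```

With these rates the output-heavy request costs roughly five times as much as the input-heavy one, which is why output length deserves as much scrutiny as prompt length.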

Batch Processing Discounts

One of the most underutilised levers in Claude’s pricing is the Batch API, which offers a 50% discount on both input and output tokens. If you’re processing requests that don’t need real-time responses, batching is non-negotiable. Every request you move to the Batch API costs half as much, so the overall reduction depends on how much of your traffic can tolerate the delay.

The catch: batching requires asynchronous processing. Your system needs to queue requests, submit them in batches, poll for results, and handle the latency. For real-time applications like chatbots or live dashboards, batching doesn’t apply. But for any background job—document processing, code review, analysis, report generation, data transformation—batching is free margin.

Long-Context Pricing and Caching

Claude supports context windows up to 200,000 tokens (and experimental support for even longer). The pricing model here is nuanced: longer context windows don’t cost more per token, but they do increase the total token count per request. However, Claude offers prompt caching, which bills cached input tokens at 10% of the normal input price (a 90% discount) on subsequent requests that reuse the same cached prefix.

This is a game-changer for workflows where you’re repeatedly processing documents, codebases, or knowledge bases against different queries. The first request pays roughly full price (cache writes carry a small surcharge); subsequent requests pay only 10% of the input token cost for the cached portion.


The Real Cost Breakdown: Input vs. Output Tokens

Let’s ground this in concrete numbers. Assume you’re building an AI-powered code review system for a 50-person engineering team. Each day, you process 200 pull requests, each averaging 2,000 lines of code (roughly 8,000 tokens when including context).

Naive approach (no optimisation):

  • 200 requests × 8,000 input tokens = 1.6M input tokens/day
  • 200 requests × 1,500 output tokens (detailed review) = 300K output tokens/day
  • Daily cost: (1.6M × $3/1M) + (300K × $15/1M) = $4.80 + $4.50 = $9.30/day
  • Monthly cost: $279/month (assuming 30 days of activity)

Optimised approach (batching + caching + compression):

  • Use the Batch API (50% discount on input and output): 1.6M input tokens × $1.50/1M = $2.40
  • Compress output requests (structured JSON instead of prose): 300K × 0.6 (compression factor) = 180K output tokens × $7.50/1M = $1.35
  • Use prompt caching for repository context shared across the 200 reviews: cached reads are billed at 10% of the base input rate, so this trims the input line further in proportion to how much of each prompt is shared. It is left out of the total here to keep the estimate conservative.
  • Daily cost: $2.40 + $1.35 = $3.75/day
  • Monthly cost: $112.50/month

That’s a 60% reduction. Now scale it: across 1,000 pull requests per day (typical for a 200-person engineering org), the naive bill is $1,395/month and the optimised bill is $562.50/month, a saving of about $830/month. And that’s before you factor in architectural changes like using cheaper models (Haiku) for triage and reserving Sonnet for complex decisions.

Where Output Tokens Really Hurt

Output tokens are the silent killer. Teams often focus on minimising input (“let’s trim the context”) but ignore output bloat. A single poorly designed prompt can generate 5,000-token responses when a structured format would yield 500 tokens.

Example: asking Claude to “write a detailed analysis” of a dataset vs. asking Claude to “return a JSON object with fields: anomalies (array), severity (enum), recommendation (string)” can reduce output tokens by 80–90%. The structured approach forces Claude to be concise and gives your downstream system a predictable format to parse.

This is where the principles from Designing Machine Learning Systems apply: every output should be instrumented, measured, and audited. If your average output is growing, something in your prompt design is degrading.


Prompt Compression and Caching Strategies

Prompt compression sounds like a dark art, but it’s a systematic practice. The goal is to reduce the token count of your input without losing semantic meaning or task clarity.

Technique 1: Instruction Compression

Many teams write verbose system prompts. Example (naive):

You are an expert software engineer with 20 years of experience in building scalable systems. 
Your job is to review pull requests and identify potential bugs, performance issues, and 
architectural concerns. Be thorough but concise in your feedback. Focus on:
- Security vulnerabilities
- Performance bottlenecks
- Code style and readability
- Testing coverage

Compressed version:

Role: Code reviewer. Output JSON with keys: security, performance, style, testing.
Be concise. Flag only actionable issues.

The compressed version is 70% shorter and Claude understands it just as well. The key: Claude doesn’t need flattery or lengthy context-setting. It responds to clear, structured instructions.
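
To sanity-check the size difference, a rough comparison is enough. The sketch below counts whitespace-separated words as a crude stand-in for tokens (an assumption; real token counts come from the model’s tokenizer and will differ somewhat). The helper names are illustrative.

```python
NAIVE_PROMPT = """You are an expert software engineer with 20 years of experience in building scalable systems.
Your job is to review pull requests and identify potential bugs, performance issues, and
architectural concerns. Be thorough but concise in your feedback. Focus on:
- Security vulnerabilities
- Performance bottlenecks
- Code style and readability
- Testing coverage"""

COMPRESSED_PROMPT = """Role: Code reviewer. Output JSON with keys: security, performance, style, testing.
Be concise. Flag only actionable issues."""


def rough_size(text: str) -> int:
    """Whitespace-separated word count, used only as a proxy for token count."""
    return len(text.split())


def reduction(before: str, after: str) -> float:
    """Fraction by which `after` is shorter than `before`."""
    return 1 - rough_size(after) / rough_size(before)


saving = reduction(NAIVE_PROMPT, COMPRESSED_PROMPT)
print(f"{rough_size(NAIVE_PROMPT)} -> {rough_size(COMPRESSED_PROMPT)} words ({saving:.0%} shorter)")
```

For billing-grade numbers, measure with the provider’s token counting rather than a word count.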

Technique 2: Example Pruning

Few-shot prompts (showing examples) are powerful but expensive. Instead of including 5–10 examples, include 2–3 high-quality examples. Claude is smart enough to generalise from minimal examples, and you save 60–70% of example tokens.

If you must include many examples, use prompt caching. The examples are cached, so subsequent requests pay only 10% of their token cost.

Technique 3: Dynamic Context Loading

Don’t load all context upfront. Load only what’s relevant. If you’re processing a document with 50 sections, don’t send all 50 sections to Claude. Send the query, let Claude identify which sections are relevant, then send only those sections. This two-stage approach often reduces total token spend by 30–40%.

At PADISO, we’ve implemented this pattern across document processing systems for financial services and insurance clients. The first pass (“which sections are relevant?”) is cheap (Haiku, about 500 tokens, a fraction of a cent per document). The second pass (detailed analysis of relevant sections) is focused and doesn’t waste tokens on irrelevant context.

Technique 4: Leveraging Prompt Caching

Prompt caching is underutilised because it requires architectural thinking. Here’s how it works:

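  1. You mark a content block in the request with a cache_control field; the prompt prefix up to and including that block becomes cacheable.
  2. Claude caches those input tokens and charges 10% of the normal rate for cached tokens on subsequent requests that reuse the same prefix.
  3. The cache persists for 5 minutes by default, and each cache hit refreshes that window. Longer-lived caching is available at a higher write price.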

Ideal use cases:

  • Knowledge base queries: cache the knowledge base, vary only the query.
  • Codebase analysis: cache the entire codebase, vary the analysis task.
  • Document review: cache the document, vary the review criteria.

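Example: at $3 per million input tokens, a 100,000-token knowledge base costs about $0.30 on the first request (ignoring the small cache-write surcharge). On each later request that hits the cache, it costs only $0.03. Over 100 requests in a day, provided they arrive often enough to keep the cache warm, you pay about $3.27 instead of $30, a saving of roughly $27.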
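
The knowledge-base arithmetic can be worked through in code. This is a simplified sketch: it uses the $3 per million input rate from this guide, assumes every request after the first hits a warm cache, and ignores any cache-write surcharge. The function names are illustrative.

```python
INPUT_RATE_PER_MTOK = 3.00    # base input rate used in this guide (USD)
CACHE_READ_MULTIPLIER = 0.10  # cached reads billed at 10% of the base rate


def cost_without_cache(context_tokens: int, requests: int) -> float:
    """Every request pays the full input rate for the shared context."""
    return requests * context_tokens * INPUT_RATE_PER_MTOK / 1_000_000


def cost_with_cache(context_tokens: int, requests: int) -> float:
    """First request pays the base rate; later ones pay the cached-read rate.

    Simplification: assumes a warm cache for all later requests and
    ignores the write surcharge on the first one.
    """
    if requests <= 0:
        return 0.0
    per_request = context_tokens * INPUT_RATE_PER_MTOK / 1_000_000
    return per_request + (requests - 1) * per_request * CACHE_READ_MULTIPLIER


uncached = cost_without_cache(100_000, 100)
cached = cost_with_cache(100_000, 100)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}, saved: ${uncached - cached:.2f}")
```

If requests are spaced further apart than the cache lifetime, each one pays the write price again and the saving disappears, so request cadence matters as much as context size.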


Batch Processing: The 50% Margin Play

The Batch API is the single biggest lever for cost reduction, and most teams don’t use it. Here’s why you should.

How Batching Works

Instead of sending individual requests to the Claude API and paying full price, you accumulate requests in a batch, submit it, and wait for results (typically within 1 hour, sometimes up to 24 hours). Anthropic charges 50% of the normal price for batched requests, on both input and output tokens. The figures in this section show the input side only.

For a system processing 1,000 requests per day:

  • Non-batched: 1,000 × 8,000 input tokens × $3/1M = $24/day
  • Batched: 1,000 × 8,000 input tokens × $1.50/1M = $12/day
  • Savings: $12/day = $360/month

The latency trade-off is the only barrier. If your application requires real-time responses (chatbots, live dashboards), batching doesn’t work. But for background jobs, overnight processing, or asynchronous workflows, batching is non-negotiable.

Ideal Batching Workflows

  • Document processing: Ingest 1,000 documents overnight, batch-process them, results ready by morning.
  • Code review: Collect pull requests throughout the day, batch-review them at 10 PM, engineers see feedback by 9 AM.
  • Data analysis: Batch-process monthly analytics, generate reports asynchronously.
  • Content generation: Batch-generate blog posts, emails, or product descriptions during off-peak hours.
  • Compliance and audit: Batch-scan logs or transactions for anomalies, batch-generate audit reports.

Implementation Path

Batching requires three changes to your system:

  1. Queue layer: Instead of calling Claude immediately, queue the request (in a database or message queue like SQS).
  2. Batch builder: Periodically (every hour, or when you hit 10,000 requests), format queued requests into the Batch API JSON format and submit.
  3. Results poller: Poll the Batch API for completed results, write them back to your database, trigger downstream processing.

This is a weekend project for a competent engineer. The ROI is immediate and compounds.
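
The queue and batch-builder pieces can be sketched without touching the API. The class below is an in-memory stand-in for a real queue; `BatchQueue`, `max_size`, and the injected `submit` callable are all hypothetical names, and in a real system `submit` would wrap the Batch API call while a results poller runs separately.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BatchQueue:
    """In-memory stand-in for a real queue (database table, SQS, etc.)."""

    submit: Callable[[list[dict]], str]  # hypothetical: sends a batch, returns its ID
    max_size: int = 10_000
    pending: list[dict] = field(default_factory=list)
    submitted_ids: list[str] = field(default_factory=list)

    def enqueue(self, custom_id: str, prompt: str) -> None:
        """Queue a request instead of calling the model immediately."""
        self.pending.append({"custom_id": custom_id, "prompt": prompt})
        if len(self.pending) >= self.max_size:
            self.flush()

    def flush(self) -> None:
        """Submit whatever is pending; call this on a timer as well."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.submitted_ids.append(self.submit(batch))


def fake_submit(batch: list[dict]) -> str:
    """Placeholder for the real submission call."""
    return f"batch-of-{len(batch)}"


queue = BatchQueue(submit=fake_submit, max_size=3)
for i in range(7):
    queue.enqueue(f"request-{i}", f"prompt {i}")
queue.flush()  # the hourly flush would pick up the remainder
print(queue.submitted_ids)
```

Injecting `submit` keeps the queueing logic testable on its own; a durable queue is still needed in production so that pending requests survive a restart.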


Context Window Optimisation for Long-Form Workflows

Claude’s 200,000-token context window is powerful, but it’s easy to waste. Longer context doesn’t mean better outputs—it means more tokens, more cost, and sometimes slower responses.

The Context Window Paradox

Intuition says: “More context = better answers.” Reality is more nuanced. Claude performs best when context is relevant and dense. A 50,000-token context window with highly relevant information often outperforms a 200,000-token window padded with noise.

The cost implication: don’t use the full context window just because it’s available. Use only what’s necessary.

Chunking and Retrieval

Instead of loading an entire 100,000-token document, use a retrieval strategy:

  1. Index the document (split into 1,000-token chunks, embedded with a vector database).
  2. Retrieve relevant chunks based on the query (semantic search).
  3. Send only retrieved chunks to Claude (typically 5,000–20,000 tokens).
  4. Claude answers based on the focused context.

This pattern reduces token spend by 80–90% compared to sending the entire document. It also improves answer quality because Claude isn’t drowning in irrelevant information.

At PADISO, we’ve implemented this for financial services and insurance clients processing regulatory documents, policy manuals, and claims histories. The retrieval layer is built once, then reused across hundreds of queries.
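
The four retrieval steps can be illustrated with a toy example. Note the substitution: this sketch scores chunks by keyword overlap instead of vector embeddings, purely so it runs without external services, and it chunks by words rather than tokens. The function names are illustrative.

```python
import re


def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks: list[str], query: str, top_k: int = 2) -> list[str]:
    """Return up to `top_k` chunks sharing the most terms with the query."""
    query_terms = terms(query)
    scored = [(len(query_terms & terms(chunk)), idx, chunk) for idx, chunk in enumerate(chunks)]
    # Highest overlap first; ties keep document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for score, _, chunk in scored[:top_k] if score > 0]


document = (
    "Claims must be lodged within 30 days of the incident. "
    "Premium payments are due on the first of each month. "
    "Flood damage is excluded unless the flood rider is purchased. "
    "Contact support for changes to your mailing address."
)
chunks = chunk_words(document, chunk_size=10)
context = retrieve(chunks, "Is flood damage covered?")
print(context)  # only these chunks would be sent to the model
```

Swapping `retrieve` for an embedding search changes the scoring, not the shape of the pipeline: the model still sees only the selected chunks.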

Multi-Turn Conversations and Context Accumulation

In multi-turn conversations (chatbot-style interactions), context accumulates with each turn. A 10-turn conversation can easily reach 50,000 tokens if you’re including the full conversation history.

Optimisation strategies:

  • Summarise old turns: After 5 turns, summarise the conversation and discard the original turns. Keep only the summary (1/10th the tokens).
  • Selective history: Include only the last 3 turns, not the entire conversation.
  • Separate context: Store conversation metadata (user intent, key decisions) separately, and inject only relevant metadata into each turn.
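
The first two strategies combine naturally. In the sketch below, `trim_history` keeps the most recent turns verbatim and hands the older ones to a summariser. The `summarise` callable is hypothetical (in practice it might be a call to a cheaper model), and a turn is assumed to be one user message plus the assistant reply that follows.

```python
from typing import Callable


def trim_history(
    messages: list[dict],
    keep_last_turns: int,
    summarise: Callable[[list[dict]], str],
) -> tuple[str, list[dict]]:
    """Split history into a summary of old turns and the recent turns.

    Returns (summary_text, recent_messages). The summary can be placed in
    the system prompt; the recent messages are sent unchanged.
    """
    keep = keep_last_turns * 2  # one user + one assistant message per turn
    split = max(len(messages) - keep, 0)
    older, recent = messages[:split], messages[split:]
    summary = summarise(older) if older else ""
    return summary, recent


def placeholder_summary(older: list[dict]) -> str:
    """Stand-in for a real summarisation call."""
    return f"{len(older)} earlier messages"


history = []
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

summary, recent = trim_history(history, keep_last_turns=3, summarise=placeholder_summary)
print(summary, len(recent))
```

Summarisation itself costs tokens, so it pays off only once the discarded history is substantially longer than the summary that replaces it.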

Code Patterns to Implement This Week

Here are production-ready patterns you can implement immediately.

Pattern 1: Batching with Python

import anthropic
import json
import time

client = anthropic.Anthropic()

def create_batch_request(requests):
    """Convert list of requests to Batch API format."""
    batch_requests = []
    for idx, req in enumerate(requests):
        batch_requests.append({
            "custom_id": f"request-{idx}",
            "params": {
                "model": "claude-3-5-sonnet-20241022",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": req["prompt"]}
                ]
            }
        })
    return batch_requests

def submit_batch(requests):
    """Submit batch and return batch ID."""
    batch_data = create_batch_request(requests)
    batch = client.messages.batches.create(
        requests=batch_data
    )
    return batch.id

def poll_batch(batch_id):
    """Poll for batch completion."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        time.sleep(30)  # Poll every 30 seconds

def get_batch_results(batch_id):
    """Retrieve results from a completed batch, keyed by custom_id."""
    results = {}
    # results() yields one entry per request; entries can arrive in any order
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            results[entry.custom_id] = entry.result.message.content[0].text
        else:
            # errored, canceled, or expired: record the status instead of text
            results[entry.custom_id] = f"[{entry.result.type}]"
    return results

# Usage
requests = [
    {"prompt": "Summarise this code: def foo(): pass"},
    {"prompt": "What's the capital of Australia?"},
]
batch_id = submit_batch(requests)
print(f"Batch submitted: {batch_id}")
batch = poll_batch(batch_id)
results = get_batch_results(batch_id)
for custom_id, content in results.items():
    print(f"{custom_id}: {content}")

Pattern 2: Prompt Caching

import anthropic

client = anthropic.Anthropic()

def query_with_cache(knowledge_base, query):
    """Query a knowledge base with prompt caching."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a helpful assistant. Answer questions based on the knowledge base provided."
            },
            {
                "type": "text",
                "text": f"Knowledge base:\n{knowledge_base}",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {
                "role": "user",
                "content": query
            }
        ]
    )
    return response.content[0].text

# Knowledge base (cached after first request)
kb = """Python is a programming language...
[100,000 tokens of documentation]
"""

# First request: full price
result1 = query_with_cache(kb, "What is Python?")
print(f"Result 1: {result1}")

# Second request: 10% price for cached knowledge base
result2 = query_with_cache(kb, "How do I install Python?")
print(f"Result 2: {result2}")

Pattern 3: Dynamic Context Loading

import json

import anthropic

client = anthropic.Anthropic()

def identify_relevant_sections(document, query):
    """Identify which sections of a document are relevant to a query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # Cheaper model for triage
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": f"Document has sections: {list(document.keys())}\nQuery: {query}\nReturn only a JSON object with a 'relevant_sections' array."
            }
        ]
    )
    # Assumes the model returns bare JSON; add error handling for production use
    result = json.loads(response.content[0].text)
    return result["relevant_sections"]

def analyze_document(document, query):
    """Analyze a document by first identifying relevant sections."""
    # Step 1: Identify relevant sections (cheap)
    relevant = identify_relevant_sections(document, query)
    
    # Step 2: Build context from relevant sections only
    context = "\n".join([document[section] for section in relevant if section in document])
    
    # Step 3: Analyze with full context (expensive model, focused input)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # Full model for analysis
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuery: {query}"
            }
        ]
    )
    return response.content[0].text

# Usage
document = {
    "section_1": "[5000 tokens about sales]",
    "section_2": "[5000 tokens about engineering]",
    "section_3": "[5000 tokens about finance]",
}
query = "What are our engineering challenges?"
result = analyze_document(document, query)
print(result)

Pattern 4: Structured Output to Reduce Tokens

import anthropic
import json

client = anthropic.Anthropic()

def extract_structured(text, schema):
    """Extract structured data from text, reducing output tokens."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"Extract data from this text and return as JSON matching this schema: {json.dumps(schema)}\n\nText: {text}"
            }
        ]
    )
    return json.loads(response.content[0].text)

# Schema (forces concise output)
schema = {
    "name": "string",
    "age": "integer",
    "city": "string",
    "occupation": "string"
}

text = "My name is John Smith, I'm 35 years old, I live in Sydney, and I work as a software engineer."
result = extract_structured(text, schema)
print(result)  # Expected (model output can vary): {'name': 'John Smith', 'age': 35, 'city': 'Sydney', 'occupation': 'software engineer'}

Real Benchmarks: Where Teams Are Saving

These numbers come from production systems we’ve deployed at PADISO.

Financial Services: Document Processing

Client: A wealth management firm processing 500 client documents per month (prospectuses, policy documents, regulatory filings).

Naive approach:

  • Load entire document (avg 80,000 tokens)
  • Ask Claude to extract key information
  • Total: 500 × 80,000 input tokens + 500 × 5,000 output tokens
  • Cost: (40M × $3/1M) + (2.5M × $15/1M) = $120 + $37.50 = $157.50/month

Optimised approach:

  • Use retrieval to identify relevant sections (Haiku, 500 tokens per request)
  • Load only relevant sections (avg 15,000 tokens)
  • Use batching (50% discount on input and output)
  • Total: (500 × 500 × $1.50/1M) + (500 × 15,000 × $1.50/1M) + (500 × 2,000 × $7.50/1M)
  • Cost: $0.375 + $11.25 + $7.50 = $19.125/month

Savings: 88% reduction ($138.375/month)

Insurance: Claims Triage

Client: A general insurer processing 10,000 claims per month, routing to appropriate handlers.

Naive approach:

  • Full claim context (avg 5,000 tokens per claim)
  • Real-time routing (can’t use batching)
  • Cost: (10,000 × 5,000 × $3/1M) + (10,000 × 500 × $15/1M) = $150 + $75 = $225/month

Optimised approach:

  • Use Haiku for triage (cheaper, sufficient for routing): (10,000 × 3,000 × $0.80/1M) + (10,000 × 300 × $4/1M) = $24 + $12 = $36/month
  • Use batching for detailed analysis (50% discount on input and output, async): (10,000 × 5,000 × $1.50/1M) + (10,000 × 1,000 × $7.50/1M) = $75 + $75 = $150/month
  • Total: $36 + $150 = $186/month

Savings: 17% reduction ($39/month)

SaaS: Code Review Automation

Client: A 100-person engineering team, 500 pull requests per week.

Naive approach:

  • Full code context (avg 10,000 tokens per PR)
  • Real-time feedback (no batching)
  • Cost: (500 × 10,000 × $3/1M) + (500 × 2,000 × $15/1M) = $15 + $15 = $30/week = $120/month

Optimised approach:

  • Batch processing overnight (50% discount on input)
  • Prompt caching for repository context (cached across all 500 PRs)
  • Structured output (reduce output tokens by 60%)
  • Cost: (1 × 50,000 × $3/1M) + (499 × 50,000 × $0.30/1M) + (500 × 800 × $15/1M) = $0.15 + $7.49 + $6 = $13.64/week = $54.56/month

Savings: 55% reduction ($65.44/month)

Scaled across a 1,000-person company with 5,000 PRs/week, that’s $1,200/month → about $546/month, or roughly $7,850 in annual savings. For a startup, that’s runway. For an enterprise, it’s a rounding error, but the pattern applies to every AI-heavy workflow.


Avoiding Common Token Waste

These are the patterns we see repeatedly that destroy token budgets.

Waste 1: Repetitive System Prompts

Every request includes a 500-token system prompt explaining the task. Across 10,000 requests per day, that’s 5M tokens/day of redundant input.

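Fix: Use prompt caching for system prompts. Cache once, pay 10% on subsequent requests. A very short prompt can fall below the minimum cacheable length (around 1,024 tokens or more, depending on the model), in which case cache it together with other shared context or compress it instead.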

Waste 2: Full Conversation History in Multi-Turn

A 20-turn conversation accumulates 50,000+ tokens of old turns. Most older turns are irrelevant to the current question.

Fix: Keep only the last 3–5 turns. Summarise older turns into a 500-token summary.

Waste 3: Verbose Output Requests

“Give me a detailed analysis” generates 5,000-token responses. “Return JSON with: summary (50 words), anomalies (list), recommendation (string)” generates 500-token responses.

Fix: Always specify output format and length constraints.

Waste 4: No Model Tiering

Using Sonnet (expensive) for every task, including simple classification or triage that Haiku (cheap) could handle.

Fix: Use Haiku for triage, classification, and simple tasks. Reserve Sonnet for complex reasoning, code, and nuanced analysis.
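
Model tiering can start as a simple lookup. The sketch below routes by a task label; the label names and the `pick_model` helper are hypothetical, and the model IDs are examples to replace with whichever Haiku and Sonnet versions you actually run.

```python
CHEAP_MODEL = "claude-3-5-haiku-20241022"    # example ID; substitute your own
STRONG_MODEL = "claude-3-5-sonnet-20241022"  # example ID; substitute your own

# Task labels considered simple enough for the cheaper model (assumed labels).
SIMPLE_TASKS = {"triage", "classification"}


def pick_model(task_type: str) -> str:
    """Return the cheaper model for simple tasks, the stronger one otherwise."""
    if task_type.lower() in SIMPLE_TASKS:
        return CHEAP_MODEL
    return STRONG_MODEL


for task in ("triage", "classification", "code_review"):
    print(f"{task}: {pick_model(task)}")
```

Defaulting unknown task types to the stronger model errs on the side of quality; flip the default only after measuring that the cheaper model holds up.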

Waste 5: Processing Entire Datasets When Sampling Suffices

Analysing 100,000 customer records to find patterns. You don’t need all 100,000—a 1,000-record sample often suffices.

Fix: Sample strategically. Process full dataset only when necessary.

Waste 6: Real-Time Requests That Could Be Batched

Processing requests immediately (paying full price) when they could wait 1 hour (50% discount).

Fix: Audit your workflows. Anything asynchronous should be batched.


Building a Token Budget and Monitoring

Token economics only matter if you measure them. Here’s how to build a monitoring system.

Step 1: Establish a Baseline

Run your current system for 2 weeks without optimisation. Track:

  • Total requests per day
  • Average input tokens per request
  • Average output tokens per request
  • Total cost per day
  • Cost per business outcome (cost per document processed, cost per PR reviewed, cost per analysis)

Example dashboard:

Date       | Requests | Avg Input | Avg Output | Total Cost | Cost/Outcome
2026-01-01 | 1,000    | 8,000     | 1,500      | $46.50     | $0.0465
2026-01-02 | 1,050    | 8,200     | 1,600      | $51.03     | $0.0486
2026-01-03 | 980      | 7,900     | 1,450      | $44.54     | $0.0455

Step 2: Set Targets

Based on your baseline, set 90-day targets:

  • 20% reduction in cost per outcome (via prompt compression, caching, batching)
  • 30% reduction in average input tokens (via retrieval, context pruning)
  • 40% reduction in average output tokens (via structured output)

Step 3: Implement Monitoring

Build a simple logging layer:

import logging

logger = logging.getLogger("claude_usage")

def log_token_usage(request_id, model, input_tokens, output_tokens, cost, outcome):
    logger.info(f"request_id={request_id}, model={model}, input={input_tokens}, output={output_tokens}, cost={cost}, outcome={outcome}")

# Parse logs into a database for analysis

Then query your logs daily:

SELECT 
  DATE(timestamp),
  COUNT(*) as requests,
  AVG(input_tokens) as avg_input,
  AVG(output_tokens) as avg_output,
  SUM(cost) as daily_cost,
  SUM(cost) / COUNT(*) as cost_per_request
FROM claude_usage
GROUP BY DATE(timestamp)
ORDER BY DATE(timestamp) DESC;
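
To try the query before wiring up a real database, it can be run against an in-memory SQLite table. The schema below is an assumption inferred from the query’s column names, and the sample rows are made-up illustrations costed at $3 and $15 per million tokens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE claude_usage (
        timestamp TEXT,
        model TEXT,
        input_tokens INTEGER,
        output_tokens INTEGER,
        cost REAL
    )
    """
)

# Illustrative rows only; in practice these come from the logging layer.
rows = [
    ("2026-01-01 09:00:00", "sonnet", 8000, 1500, 0.0465),
    ("2026-01-01 17:30:00", "sonnet", 6000, 500, 0.0255),
    ("2026-01-02 10:15:00", "sonnet", 10000, 2000, 0.0600),
]
conn.executemany("INSERT INTO claude_usage VALUES (?, ?, ?, ?, ?)", rows)

daily = conn.execute(
    """
    SELECT
      DATE(timestamp) AS day,
      COUNT(*) AS requests,
      AVG(input_tokens) AS avg_input,
      AVG(output_tokens) AS avg_output,
      SUM(cost) AS daily_cost,
      SUM(cost) / COUNT(*) AS cost_per_request
    FROM claude_usage
    GROUP BY DATE(timestamp)
    ORDER BY DATE(timestamp) DESC
    """
).fetchall()

for day, requests, avg_in, avg_out, cost, per_request in daily:
    print(f"{day}: {requests} requests, avg {avg_in:.0f} in / {avg_out:.0f} out, ${cost:.4f} total")
```

The same query should carry over to most SQL databases, though date functions vary slightly between engines.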

Step 4: Weekly Review and Adjustment

Every Friday, review the metrics:

  • Are we on track to hit our 20% cost reduction target?
  • Which requests are outliers (unusually high token count)?
  • Are there new patterns we can optimise?

Adjust tactics weekly. If prompt compression isn’t yielding results, move to batching. If batching is maxed out, implement caching.
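
The outlier check in the weekly review can be automated. A minimal sketch, assuming usage records are dictionaries with `request_id` and `input_tokens` keys (hypothetical field names) and treating anything above three times the median as an outlier (an arbitrary threshold to tune):

```python
from statistics import median


def find_outliers(usage: list[dict], factor: float = 3.0) -> list[dict]:
    """Return records whose input token count exceeds `factor` times the median."""
    if not usage:
        return []
    typical = median(record["input_tokens"] for record in usage)
    return [record for record in usage if record["input_tokens"] > factor * typical]


usage = [
    {"request_id": "req-1", "input_tokens": 8000},
    {"request_id": "req-2", "input_tokens": 7900},
    {"request_id": "req-3", "input_tokens": 8200},
    {"request_id": "req-4", "input_tokens": 60000},
    {"request_id": "req-5", "input_tokens": 8100},
]

outliers = find_outliers(usage)
print([record["request_id"] for record in outliers])
```

The median is used rather than the mean so that a handful of very large requests do not raise the baseline they are being compared against.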


Next Steps: Your 90-Day Roadmap

Here’s a concrete plan to lock in token savings over the next quarter.

Week 1–2: Baseline and Quick Wins

  1. Audit current spend: Run your system for 1 week, capture baseline metrics.
  2. Identify lowest-hanging fruit: Which workflows could be batched? Which prompts are verbose? Which could use cheaper models (Haiku)?
  3. Implement prompt compression: Reduce system prompts by 50% using the techniques above. Expected impact: 5–10% cost reduction.
  4. Set up monitoring: Build the logging and dashboard above.

Week 3–4: Caching and Model Tiering

  1. Implement prompt caching for any workflow with repeated context (knowledge bases, documents, codebases). Expected impact: 15–30% cost reduction for affected workflows.
  2. Implement model tiering: Use Haiku for triage and classification, Sonnet for complex reasoning. Expected impact: 20–40% cost reduction for triage-heavy workflows.
  3. Review week 2 metrics: Are you on track?

Week 5–8: Batching and Retrieval

  1. Implement batching for any asynchronous workflow. Expected impact: 50% cost reduction on batched requests.
  2. Implement retrieval for document-heavy workflows. Expected impact: 30–80% cost reduction for document processing.
  3. Optimise output formats: Move from prose to structured JSON. Expected impact: 40–80% reduction in output tokens.

Week 9–12: Advanced Optimisations

  1. Context window optimisation: Implement dynamic context loading for multi-section documents.
  2. Batch size optimisation: Experiment with batch sizes (larger batches = higher throughput, but longer latency).
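  3. Model fine-tuning: If you have 1M+ tokens of proprietary data, consider fine-tuning where it is offered for your model and platform (availability for Claude models is limited, so check before planning around it). Expected impact: 20–50% cost reduction on specific tasks.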

Expected Outcomes

By the end of 90 days:

  • Cost reduction: 30–60% reduction in cost per outcome
  • Latency improvement: 20–40% faster responses (via model tiering, caching)
  • Reliability: Retrieval and structured output keep prompts focused, which reduces hallucinations and improves consistency
  • Scalability: Your system can now handle 2–3x more requests at the same cost

For teams at PADISO, this is where we focus when optimising AI systems. If you’re building agentic AI, multi-tenant SaaS, or AI-heavy workflows and want a structured approach to token economics, our AI Strategy & Readiness team can help. We’ve also built tools and frameworks for token monitoring and cost forecasting that we use across platform engineering and custom software development projects.


Summary

Token economics are not a future concern—they’re a present margin lever. The difference between a carelessly built Claude integration and one optimised for token efficiency is 30–60% of your API spend, compounding across millions of requests.

The tactics are straightforward:

  1. Understand pricing: Input tokens cost about one-fifth the price of output tokens. Batched requests cost 50% of the normal price.
  2. Compress prompts: Reduce system prompts and examples by 50–70%.
  3. Implement caching: Cache repeated context (knowledge bases, documents, code) at 10% cost.
  4. Batch asynchronous requests: Use the Batch API for any non-real-time workflow. 50% savings on batched requests.
  5. Use retrieval: Load only relevant context, not entire documents.
  6. Tier models: Use Haiku for triage, Sonnet for complex reasoning.
  7. Structure output: Force Claude to return JSON instead of prose. Reduce output tokens by 60–80%.
  8. Monitor relentlessly: Track cost per outcome, not just total cost.

Implement these patterns over 90 days, and you’ll cut your AI infrastructure costs by 30–60%. For a 50-person startup running AI-heavy workflows, that’s $3,000–$10,000 per month in runway. For an enterprise, it’s millions in annual savings.

Start this week. Pick one workflow. Measure it. Optimise it. Scale the pattern. The margin is there—you’re just not seeing it yet.

If you need help auditing your AI spend or building an optimisation roadmap, PADISO’s AI Quickstart Audit is a fixed-fee, 2-week engagement that tells you exactly where you are, what to ship first, and what 90 days could unlock. We’ve helped founders and operators at seed-to-Series-B startups and enterprises cut AI costs while improving performance. Book a call to discuss your specific workflows.
