Claude Opus 4.7 Cost Optimisation: Prompt Caching, Batching, and Model Routing
Cut Claude Opus 4.7 costs by 60%+ with prompt caching, batch processing, and intelligent model routing. Concrete code snippets and pricing math inside.
Claude Opus 4.7 is powerful. It’s also expensive at scale. We’ve shipped 50+ production AI systems at PADISO, and the same pattern emerges every time: teams deploy Claude without optimisation and watch their API bills climb. Then they panic, cut features, or worse—switch to cheaper models that break their workflows.
There’s a better way.
This guide walks you through three concrete levers that cut production AI costs by 60%+ when adopted together. We’re talking real pricing math, working code, and the trade-offs you need to understand before you ship.
Table of Contents
- Why Claude Costs Spiral: The Real Maths
- Lever 1: Prompt Caching—Cache Your Way to 90% Input Cost Reduction
- Lever 2: Batch Processing—Trade Latency for 50% Cost Savings
- Lever 3: Intelligent Model Routing—Match Model Tier to Task Complexity
- Combining All Three: Real-World Cost Breakdown
- Implementation Roadmap for Sydney Teams
- Common Pitfalls and How to Avoid Them
- When NOT to Optimise
- Next Steps and Measurement
Why Claude Costs Spiral: The Real Maths
Claude Opus 4.7 costs $15 per million input tokens and $75 per million output tokens. On the surface, that sounds reasonable. Until you’re processing 10 billion input tokens per month.
Here’s where most teams go wrong: they send the same context window—system prompts, retrieval results, documentation snippets—with every single request. A chatbot that answers product questions might send 5,000 tokens of product documentation with each user query. Over 100,000 queries per month, that’s 500 million wasted input tokens.
At $15 per million, that’s $7,500 per month on tokens you’ve already paid for.
This isn’t a Claude problem. It’s an architecture problem. And it’s fixable.
The three levers we’ll cover—prompt caching, batch processing, and model routing—address different parts of the cost structure:
- Prompt caching reduces redundant input token costs by up to 90% for repeated context
- Batch processing cuts per-token pricing by 50% when you can tolerate 24-hour latency
- Model routing avoids paying for Opus capability when Haiku or Sonnet will do the job
Combined, these three approaches cut total API spend by 60–75% in production systems. We’ve measured this across customer support automation, content generation pipelines, and agentic workflows.
Lever 1: Prompt Caching—Cache Your Way to 90% Input Cost Reduction
How Prompt Caching Works
Prompt caching is Anthropic’s answer to redundant context. Here’s the mechanism: when you mark a block of your prompt with a cache control parameter, Claude caches the prompt prefix up to that marker (minimum 1,024 tokens) for 5 minutes, refreshing the timer on every hit. Writing to the cache costs 25% more than normal input tokens, but every subsequent request that reuses the identical prefix within that window pays only 10% of the normal input token price for the cached portion.
This is not a client-side cache. It lives on Anthropic’s infrastructure. That matters because it means cache hits persist across requests, even if your application restarts.
The official Anthropic prompt caching documentation explains the technical details, but here’s what you need to know operationally: caching works best when you have a stable, reusable context block that doesn’t change between requests.
Real-World Use Case: Product Documentation Chatbot
Imagine you’re building a customer support chatbot for a SaaS product. Your system prompt is 200 tokens. Your product documentation is 8,000 tokens. Your retrieval system pulls relevant docs—another 2,000 tokens. Then comes the user query—50 tokens.
Without caching, every request costs:
(200 + 8,000 + 2,000 + 50) × $15 / 1M = $0.15375 per request
With caching enabled on everything except the user query (10,200 tokens), and accounting for the 25% cache-write premium:
First request: 10,200 × $18.75 / 1M + 50 × $15 / 1M = $0.19200
Subsequent requests: 10,200 cached tokens × $1.50 / 1M + 50 × $15 / 1M = $0.01605
Over 10,000 requests per month:
Without caching: 10,000 × $0.15375 = $1,537.50
With caching: $0.192 + (9,999 × $0.01605) = $160.68
Savings: roughly $1,377 per month. That’s about 90% cost reduction on this input-heavy workload.
The catch: you need requests to arrive within 5 minutes of each other to maintain cache hits. If your traffic is bursty or unpredictable, cache efficiency drops.
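Whether caching pays at all depends on your hit rate: because cache writes cost 25% more than normal input tokens, a workload that almost never re-hits the cache actually pays a premium. Here is a rough cost model (a sketch, not Anthropic code) using the published 1.25×/0.1× multipliers:

```python
def cache_savings_per_request(cached_tokens: int, hit_rate: float,
                              base_price_per_mtok: float = 15.0) -> float:
    """Expected input-cost saving per request from caching, in dollars.

    Assumes Anthropic's 5-minute cache multipliers: writes cost 1.25x
    the base input price, reads cost 0.1x. A miss pays the write
    premium; a hit pays only the read price.
    """
    base = cached_tokens * base_price_per_mtok / 1_000_000
    expected = (1 - hit_rate) * base * 1.25 + hit_rate * base * 0.10
    return base - expected

# Break-even: solve 1.25*(1-h) + 0.1*h = 1, giving h ≈ 21.7%.
# Below roughly a 22% hit rate, caching costs more than it saves.
```

In other words, bursty traffic doesn’t just reduce savings—past the break-even point it flips them negative.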
Implementation: Prompt Caching in Code
Here’s how you enable caching with the Claude API:
```python
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Define your cached context
system_prompt = "You are a helpful product support assistant."
product_docs = """# Product Documentation

## Feature A
Feature A allows users to...

## Feature B
Feature B enables...

[8,000 tokens of documentation]
"""

def answer_support_question(user_query: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=500,
        system=[
            {
                "type": "text",
                "text": system_prompt,
            },
            {
                "type": "text",
                "text": product_docs,
                "cache_control": {"type": "ephemeral"},  # Enable caching
            },
        ],
        messages=[
            {
                "role": "user",
                "content": user_query,
            }
        ],
    )

    # Check cache performance
    usage = response.usage
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Output tokens: {usage.output_tokens}")

    return response.content[0].text

# First call: creates cache
result = answer_support_question("How do I use Feature A?")

# Subsequent calls within 5 minutes: hit cache
result = answer_support_question("What about Feature B?")
```
The critical line is "cache_control": {"type": "ephemeral"}. This tells Claude to cache the documentation block. Subsequent requests that include identical cached content will hit the cache.
When Caching Delivers Maximum Savings
Caching works best in these scenarios:
- High-volume, low-variance workloads: Customer support, FAQ answering, knowledge base retrieval. If you’re answering 1,000 questions per day against the same documentation, caching is a no-brainer.
- Batch processing with shared context: If you’re processing 50 documents through the same analysis pipeline, cache the pipeline prompt once and reuse it.
- Multi-turn conversations: In a chatbot session, the system prompt and context are identical across turns. Cache them on the first turn and save 90% on subsequent turns.
- Agentic workflows: If your AI agent uses the same tool definitions and system instructions across multiple tasks, caching pays dividends.
Caching delivers minimal savings if your context changes with every request. If you’re doing RAG (Retrieval-Augmented Generation) and pulling different documents for every query, caching helps only if the same documents appear multiple times.
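If you’re unsure whether a RAG workload repeats context often enough, you can estimate that from your retrieval logs before touching the API. A rough sketch (the `window` parameter is a stand-in for the 5-minute TTL, which in reality depends on request timing):

```python
from collections import Counter  # handy for summarising the same logs

def cache_worthiness(retrieval_log: list[str], window: int = 50) -> float:
    """Fraction of retrievals that repeat a document seen within the
    last `window` requests -- a rough proxy for the cache hit rate a
    RAG workload could achieve.
    """
    hits = 0
    recent: list[str] = []
    for doc_id in retrieval_log:
        if doc_id in recent:
            hits += 1
        recent.append(doc_id)
        recent = recent[-window:]  # keep only the trailing window
    return hits / len(retrieval_log) if retrieval_log else 0.0
```

If this comes back near the ~22% break-even for cache writes, caching your retrieved documents is unlikely to pay; cache only the stable system prompt instead.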
Lever 2: Batch Processing—Trade Latency for 50% Cost Savings
How Batch Processing Works
Anthropic’s batch processing API lets you submit requests asynchronously and receive results within 24 hours—often much sooner. In exchange, you pay 50% less per token.
This is a straightforward trade-off: latency for cost. If you can tolerate a 24-hour delay, batch processing cuts your API spend in half.
The pricing is explicit:
- Standard API: $15 per million input tokens, $75 per million output tokens
- Batch API: $7.50 per million input tokens, $37.50 per million output tokens
That’s exactly 50% off.
Real-World Use Case: Overnight Content Generation
Suppose you’re generating product descriptions for an e-commerce platform. You have 500 new products to describe each night. Each description takes 200 input tokens (product metadata) and generates 150 output tokens.
Using the standard API:
500 requests × (200 input + 150 output) tokens
= 500 × 200 × $15 / 1M + 500 × 150 × $75 / 1M
= $1.50 + $5.625
= $7.125 per night
= $213.75 per month
Using the batch API:
500 requests × (200 input + 150 output) tokens
= 500 × 200 × $7.50 / 1M + 500 × 150 × $37.50 / 1M
= $0.75 + $2.8125
= $3.5625 per night
= $106.88 per month
Savings: $106.87 per month. That’s 50% cost reduction.
If you’re running this workload across 10 product categories, you’re saving over $1,200 per month just by batching overnight.
Implementation: Batch Processing in Code
Here’s how to submit a batch request:
```python
import time

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def create_batch_requests(products: list[dict]) -> str:
    """Create and submit a batch of product description requests."""
    requests = []
    for product in products:
        requests.append({
            "custom_id": f"product-{product['id']}",
            "params": {
                "model": "claude-opus-4-1",
                "max_tokens": 200,
                "messages": [
                    {
                        "role": "user",
                        "content": f"""Generate a compelling product description for:

Name: {product['name']}
Category: {product['category']}
Price: ${product['price']}
Key features: {', '.join(product['features'])}

Keep it under 150 words and focus on benefits, not features.""",
                    }
                ],
            },
        })

    # Submit batch; results arrive asynchronously, within 24 hours
    batch_response = client.messages.batches.create(requests=requests)
    return batch_response.id

def retrieve_batch_results(batch_id: str) -> dict:
    """Poll for batch completion and retrieve results."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        print(f"Batch {batch_id} status: {batch.processing_status}")

        if batch.processing_status == "ended":
            # Retrieve all results
            results = {}
            for result in client.messages.batches.results(batch_id):
                custom_id = result.custom_id
                if result.result.type == "succeeded":
                    results[custom_id] = result.result.message.content[0].text
                else:
                    results[custom_id] = f"Error: {result.result.type}"
            return results

        # Wait 30 seconds before polling again
        time.sleep(30)

# Example usage
products = [
    {"id": 1, "name": "Wireless Headphones", "category": "Audio", "price": 129.99,
     "features": ["40-hour battery", "Active noise cancellation", "Bluetooth 5.0"]},
    {"id": 2, "name": "USB-C Hub", "category": "Accessories", "price": 49.99,
     "features": ["7-in-1", "4K video", "100W charging"]},
    # ... 498 more products
]

batch_id = create_batch_requests(products)
print(f"Submitted batch {batch_id}")

# Retrieve results later (batches complete within 24 hours)
results = retrieve_batch_results(batch_id)
for product_id, description in results.items():
    print(f"{product_id}: {description}")
```
The key difference from the standard API: you submit a batch of requests with custom IDs, then poll for completion. Results arrive asynchronously.
When Batch Processing Delivers Maximum Savings
Batch processing is ideal for:
- Overnight processing pipelines: Content generation, data enrichment, analysis jobs that can run while your users sleep.
- Bulk document processing: Summarising, classifying, or extracting data from hundreds or thousands of documents.
- Weekly or monthly jobs: Generating reports, creating marketing copy, processing customer feedback in bulk.
- Non-real-time workflows: If your users don’t need results immediately, batch processing is a straight cost reduction.
Batch processing is a poor fit if you need results in real-time. A chatbot can’t tell users “your answer will be ready tomorrow.” But a background job that processes 500 documents overnight? Perfect use case.
Lever 3: Intelligent Model Routing—Match Model Tier to Task Complexity
The Model Tier Pricing Breakdown
Anthropic offers three primary Claude models with different pricing:

| Model | Input Cost | Output Cost | Use Case |
|-------|-----------|-------------|----------|
| Haiku | $0.80 / 1M | $4 / 1M | Simple tasks, high volume |
| Sonnet | $3 / 1M | $15 / 1M | Balanced tasks, general purpose |
| Opus 4.7 | $15 / 1M | $75 / 1M | Complex reasoning, agentic workflows |
Most teams default to Opus for everything. That’s like using a truck to carry a single letter.
Intelligent model routing means: use the cheapest model that can handle the task. For 70% of your workload, that’s Haiku or Sonnet. Only route complex reasoning tasks to Opus.
Real-World Use Case: Multi-Stage Content Pipeline
Imagine a content moderation pipeline:
- Stage 1: Toxicity screening (Haiku) – Is this comment toxic? Binary classification. 100 input tokens, 10 output tokens.
- Stage 2: Sentiment analysis (Sonnet) – What’s the emotional tone? 100 input tokens, 20 output tokens.
- Stage 3: Detailed analysis (Opus) – For flagged content, perform detailed reasoning. 500 input tokens, 200 output tokens.
Assuming 10,000 comments per day:
- All 10,000 pass through Stage 1
- 1,000 are flagged and proceed to Stage 2 (10% of traffic)
- 100 are escalated to Stage 3 (1% of traffic)
Cost with all-Opus routing (all three stages on Opus):
Stage 1: 10,000 × (100 × $15 + 10 × $75) / 1M = $22.50
Stage 2: 1,000 × (100 × $15 + 20 × $75) / 1M = $3.00
Stage 3: 100 × (500 × $15 + 200 × $75) / 1M = $2.25
Total: $27.75 per day
Cost with intelligent routing:
Stage 1 (Haiku): 10,000 × (100 × $0.80 + 10 × $4) / 1M = $1.20
Stage 2 (Sonnet): 1,000 × (100 × $3 + 20 × $15) / 1M = $0.60
Stage 3 (Opus): 100 × (500 × $15 + 200 × $75) / 1M = $2.25
Total: $4.05 per day
Savings: $23.70 per day, or roughly $711 per month. That’s 85% cost reduction.
This is why model routing matters. Most of your workload is simple. Only the hard cases need Opus.
Implementation: Router Logic in Code
Here’s how to implement intelligent routing:
```python
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# (input, output) price per million tokens for each model tier
PRICES = {
    "claude-3-5-haiku-20241022": (0.80, 4.00),
    "claude-3-5-sonnet-20241022": (3.00, 15.00),
    "claude-opus-4-1": (15.00, 75.00),
}

def determine_model_and_config(complexity_score: float) -> tuple[str, int]:
    """Route to the appropriate model based on task complexity.

    Returns: (model_name, max_tokens)
    """
    if complexity_score < 0.3:
        # Simple classification, keyword matching
        return "claude-3-5-haiku-20241022", 100
    elif complexity_score < 0.7:
        # Moderate reasoning, sentiment analysis, summarisation
        return "claude-3-5-sonnet-20241022", 300
    else:
        # Complex reasoning, multi-step analysis, creative tasks
        return "claude-opus-4-1", 1000

def estimate_complexity(task: str, context_length: int) -> float:
    """Heuristic estimate of task complexity, between 0 (simple) and 1 (complex)."""
    simple_keywords = ["is", "classify", "binary", "yes", "no", "sentiment"]
    complex_keywords = ["explain", "analyse", "reason", "compare", "creative", "strategy"]

    complexity = 0.5  # Base complexity
    for keyword in simple_keywords:
        if keyword in task.lower():
            complexity -= 0.15
    for keyword in complex_keywords:
        if keyword in task.lower():
            complexity += 0.2

    # Longer context = higher complexity
    complexity += min(0.2, context_length / 10000)
    return max(0, min(1, complexity))

def route_and_process(task: str, context: str = "") -> dict:
    """Route the task to the appropriate model and process it."""
    complexity = estimate_complexity(task, len(context))
    model, max_tokens = determine_model_and_config(complexity)
    prompt = f"{context}\n\nTask: {task}" if context else task

    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )

    in_price, out_price = PRICES[model]
    usage = response.usage
    return {
        "model": model,
        "complexity_score": complexity,
        "response": response.content[0].text,
        "cost": (usage.input_tokens * in_price
                 + usage.output_tokens * out_price) / 1_000_000,
    }

# Example: content moderation
toxicity_task = "Is this comment toxic? Comment: 'I disagree with your opinion.'"
result = route_and_process(toxicity_task)
print(f"Model used: {result['model']}")
print(f"Complexity: {result['complexity_score']:.2f}")
print(f"Cost estimate: ${result['cost']:.6f}")
```
This router estimates task complexity and selects the appropriate model. For 70% of tasks, you’ll use Haiku or Sonnet. Only complex reasoning tasks route to Opus.
When Model Routing Delivers Maximum Savings
Model routing is ideal for:
-
Multi-stage pipelines: Screening, classification, then detailed analysis. Route early stages to cheaper models.
-
High-volume, varied workloads: When you process 10,000 requests per day with mixed complexity, routing saves significantly.
-
Agentic systems: Agents make many small decisions (Haiku) and occasional complex reasoning calls (Opus). Route accordingly.
-
Cost-sensitive applications: SaaS products where API costs directly impact margins. Every token saved improves profitability.
Model routing adds complexity. If you’re running a single, simple workload (e.g., pure customer support), the overhead might not be worth it. But if you’re running diverse workloads at scale, routing is essential.
Combining All Three: Real-World Cost Breakdown
Let’s model a realistic production system: an AI-powered customer support platform with 50,000 queries per month.
Baseline: No Optimisation
- System prompt: 300 tokens
- Retrieved documentation: 3,000 tokens
- User query: 100 tokens
- Expected output: 150 tokens
- Model: Claude Opus (all queries)
50,000 × (3,400 × $15 + 150 × $75) / 1M
= 50,000 × $0.06225
= $3,112.50 per month
Optimised: All Three Levers
Lever 1: Prompt Caching
- Cache the system prompt + documentation (3,300 tokens)
- First request in each segment pays the 25% cache-write premium
- Subsequent requests pay full price only for the 100 query tokens, plus 10% for cache reads
Lever 2: Batch Processing
- 30% of queries are non-urgent (15,000 queries). Route to the batch API (50% cost reduction)
- 70% of queries are real-time (35,000 queries). Use the standard API with caching
Lever 3: Model Routing
- 60% of queries are simple (30,000 queries). Route to Sonnet
- 40% of queries are complex (20,000 queries). Route to Opus
Assuming near-perfect cache hits (and ignoring the four negligible first-request cache writes):
Real-time simple queries (Sonnet, cached):
21,000 × (3,300 × $0.30 + 100 × $3 + 150 × $15) / 1M = 21,000 × $0.00354 = $74.34
Real-time complex queries (Opus, cached):
14,000 × (3,300 × $1.50 + 100 × $15 + 150 × $75) / 1M = 14,000 × $0.01770 = $247.80
Batch simple queries (Sonnet, cached, 50% batch discount):
9,000 × $0.00354 × 0.5 = $15.93
Batch complex queries (Opus, cached, 50% batch discount):
6,000 × $0.01770 × 0.5 = $53.10
Total: $74.34 + $247.80 + $15.93 + $53.10 = $391.17 per month
Cost reduction: $3,112.50 → $391.17 per month. That’s 87% savings.
But here’s the reality: that figure assumes near-perfect cache hits, which requires consistent traffic patterns, and it assumes cache reads still apply inside batch jobs. In practice, expect 70–85% cost reduction when combining all three levers.
Even at 70% reduction, you’re saving about $2,180 per month. For a startup, that’s meaningful.
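This arithmetic is easy to get wrong by hand, so it helps to put the cost model in code with the assumptions explicit. The 1.25×/0.1× cache multipliers and a flat 50% batch discount on every token type are assumptions drawn from Anthropic’s published pricing, and the helper ignores the negligible first-request cache writes:

```python
def monthly_cost(requests: int, cached_tok: int, query_tok: int, out_tok: int,
                 in_price: float, out_price: float,
                 hit_rate: float = 1.0, batch: bool = False) -> float:
    """Monthly spend in dollars for one traffic segment.

    Assumes cache writes at 1.25x and reads at 0.1x the input price,
    and a flat 50% batch discount on every token type.
    """
    write = cached_tok * in_price * 1.25 / 1e6   # paid on a cache miss
    read = cached_tok * in_price * 0.10 / 1e6    # paid on a cache hit
    rest = (query_tok * in_price + out_tok * out_price) / 1e6
    per_req = hit_rate * (read + rest) + (1 - hit_rate) * (write + rest)
    return requests * per_req * (0.5 if batch else 1.0)

# The four segments from the breakdown above, at a perfect hit rate:
total = (monthly_cost(21_000, 3_300, 100, 150, 3, 15)
         + monthly_cost(14_000, 3_300, 100, 150, 15, 75)
         + monthly_cost(9_000, 3_300, 100, 150, 3, 15, batch=True)
         + monthly_cost(6_000, 3_300, 100, 150, 15, 75, batch=True))
```

Lowering `hit_rate` lets you see how quickly bursty traffic erodes the headline savings, which is exactly the sensitivity check worth running before promising numbers to your CFO.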
Implementation Roadmap for Sydney Teams
If you’re building AI systems at a Sydney startup or enterprise, here’s how to sequence these optimisations:
Phase 1: Measure Baseline (Week 1–2)
- Instrument your Claude API calls to capture:
  - Input token count
  - Output token count
  - Response latency
  - Task type / complexity
  - Model used
- Run for 2 weeks to establish baseline metrics:
  - Total monthly spend projection
  - Cost per request
  - Cost per user
  - Latency distribution
- Identify your top 3 highest-volume use cases. These are your quick wins.
Phase 2: Implement Prompt Caching (Week 3–4)
- For your highest-volume use case, identify the reusable context blocks (system prompts, documentation, instructions).
- Modify your API calls to enable caching on those blocks using the cache_control parameter.
- Deploy to staging and verify cache hits within 5 minutes.
- Roll out to production and measure cost reduction.
- Repeat for your next 2 highest-volume use cases.
Expected impact: 40–60% cost reduction on targeted use cases within 2 weeks.
Phase 3: Implement Model Routing (Week 5–6)
- Audit your current model usage. Are you using Opus for everything?
- Classify your workloads by complexity:
  - Simple: classification, sentiment, keyword extraction → Haiku
  - Moderate: summarisation, moderate reasoning → Sonnet
  - Complex: creative tasks, multi-step reasoning, agentic workflows → Opus
- Build a router function (see code examples above) and integrate it into your request pipeline.
- Deploy to staging, monitor for quality degradation, and measure cost impact.
- Roll out to production.
Expected impact: 30–50% additional cost reduction when combined with caching.
Phase 4: Implement Batch Processing (Week 7–8)
- Identify non-real-time workloads: overnight jobs, background processing, bulk operations.
- Refactor those workloads to use the batch API.
- Deploy and schedule batch jobs for off-peak hours.
Expected impact: 50% cost reduction on batch workloads.
Phase 5: Continuous Optimisation (Ongoing)
- Monitor cost per request weekly. Set alerts if costs spike.
- Review cache hit rates. If they drop below 50%, investigate why.
- As new features ship, assess whether they’re good candidates for caching or batching.
- Periodically re-evaluate model routing rules. As models improve, cheaper models may handle tasks previously requiring Opus.
Common Pitfalls and How to Avoid Them
Pitfall 1: Cache Invalidation
The problem: You update your product documentation, but Claude is still serving cached responses from the old version.
The solution: Track a version hash of your documentation. Anthropic’s cache is keyed on exact content, so changed documentation never hits the old entry—but it silently pays the cache-write premium again, and any client-side state tied to the old version goes stale. A version hash makes those transitions visible:

```python
import hashlib

def get_documentation_version(docs: str) -> str:
    return hashlib.sha256(docs.encode()).hexdigest()[:8]

# Any change to the docs misses the old cache entry and creates
# (and pays for) a new one on the next request.
current_version = get_documentation_version(product_docs)
if current_version != cached_version:
    # Documentation changed: expect a cache write on the next request,
    # and refresh any client-side state tied to the old version
    cached_version = current_version
```
Pitfall 2: Cache Misses Due to Whitespace
The problem: Your cached prompt has trailing whitespace. A new request has slightly different whitespace. Cache miss.
The solution: Normalise all cached content before sending. Strip leading/trailing whitespace, normalise line endings.
```python
def normalise_cache_content(content: str) -> str:
    return "\n".join(line.rstrip() for line in content.split("\n")).strip()

cached_content = normalise_cache_content(product_docs)
```
Pitfall 3: Batch Processing Timeout Surprises
The problem: You submit a batch job expecting results in 2 hours. It takes 20 hours. Your overnight job misses its deadline.
The solution: Treat the API’s 24-hour window as the worst case and implement fallback logic. If batch results aren’t ready by your deadline, cancel the batch and process the remaining work via the standard API.
Pitfall 4: Model Routing Quality Degradation
The problem: You route 60% of queries to Sonnet to save costs. Suddenly, customer complaints spike because Sonnet isn’t as good at nuanced reasoning.
The solution: Implement A/B testing. Route 10% of traffic to cheaper models first, measure quality metrics (user satisfaction, escalation rate), then scale up.
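A minimal way to run that experiment is deterministic hash bucketing, so each user stays in the same arm across sessions and quality metrics remain comparable (the arm names here are illustrative):

```python
import hashlib

def routing_arm(user_id: str, cheap_fraction: float = 0.10) -> str:
    """Deterministically assign a user to the cheap-model arm or the
    Opus control arm. Hashing keeps assignment stable across sessions,
    unlike random sampling per request.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "sonnet-arm" if bucket < cheap_fraction * 10_000 else "opus-arm"
```

Start with `cheap_fraction=0.10`, compare satisfaction and escalation rates between arms for a week or two, then ratchet the fraction up only if the cheap arm holds.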
Pitfall 5: Ignoring Latency Trade-offs
The problem: You optimise for cost and ignore latency. Batch processing saves 50%, but users wait 24 hours for results. That’s not acceptable.
The solution: Segment workloads by latency requirement. Real-time queries use standard API. Non-urgent queries use batch API. Don’t force a latency-sensitive workload into batch processing.
When NOT to Optimise
Optimisation has costs: engineering time, complexity, debugging overhead. Sometimes it’s not worth it.
Don’t optimise if:
- Your API spend is under $500/month: The engineering effort to implement caching, routing, and batching probably costs more than the savings.
- Your workload is latency-critical and low-volume: If you process 100 queries per month and each needs results in <1 second, optimisation overhead outweighs benefits.
- Your context is highly dynamic: If every request has completely different context (no reusable blocks), prompt caching won’t help.
- You’re still in product-market fit exploration: Focus on shipping features, not cost optimisation. Optimise once your workload is stable and predictable.
- Your team lacks API instrumentation: You can’t optimise what you don’t measure. Build observability first.
Do optimise if:
- Your API spend is >$2,000/month: Cost optimisation now directly impacts unit economics.
- Your workload is stable and predictable: You have consistent traffic patterns, repeated context, and known task types.
- You have 30%+ of non-real-time workload: Batch processing alone can cut costs by 50%.
- Your team can dedicate 2–3 weeks to implementation: That’s the typical effort to implement all three levers.
Next Steps and Measurement
If you’re running Claude in production, here’s your action plan:
Immediate (This Week)
- Audit your current Claude usage: Pull your API logs from the past month. Calculate:
  - Total input tokens
  - Total output tokens
  - Cost per request
  - Cost per user
  - Top 5 use cases by volume
- Identify your quick win: Which of your top 3 use cases has the most reusable context? That’s your prompt caching candidate.
- Set up cost tracking: Create a dashboard that shows Claude spend by day, week, and month. Set alerts if spend spikes >20%.
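The spike alert can be as simple as comparing the latest day against a trailing mean—a sketch, to be wired up to whatever billing export you use:

```python
def spend_alert(daily_spend: list[float], threshold: float = 0.20) -> bool:
    """True if the latest day's spend exceeds the trailing 7-day mean
    by more than `threshold` (20% by default)."""
    if len(daily_spend) < 8:
        return False  # not enough history to call a spike
    baseline = sum(daily_spend[-8:-1]) / 7
    return daily_spend[-1] > baseline * (1 + threshold)
```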
Short-term (This Month)
- Implement prompt caching on your top use case. Measure cache hit rate (target: >70% within 5 minutes).
- Implement model routing for your second use case. Measure quality metrics (accuracy, user satisfaction) to ensure cheaper models don’t degrade experience.
- Calculate savings: Compare optimised costs to baseline. Document the effort and ROI.
Medium-term (Next Quarter)
- Roll out to all use cases: Extend caching and routing across your entire Claude workload.
- Implement batch processing for non-real-time workloads.
- Optimise continuously: Review cost metrics weekly. Adjust routing rules and caching strategies based on real-world performance.
Key Metrics to Track
- Cost per request: Should decrease 60%+ after all optimisations
- Cost per user: Direct indicator of unit economics
- Cache hit rate: Target >70% for cached workloads
- Model distribution: Track % of queries routed to each model tier
- Batch processing volume: Track % of workload using batch API
- Quality metrics: Accuracy, user satisfaction, escalation rate—ensure optimisations don’t degrade experience
Conclusion: The Path Forward
Claude Opus 4.7 is powerful. But power without cost discipline is expensive. The three levers we’ve covered—prompt caching, batch processing, and model routing—are proven techniques for cutting production AI costs by 60–75%.
They’re not magic. They’re engineering discipline. They require measurement, iteration, and a willingness to trade latency for cost when appropriate.
If you’re building AI systems at scale, implementing these optimisations is non-negotiable. A 60% cost reduction directly improves your unit economics, extends your runway, and increases your profitability.
At PADISO, we’ve implemented these optimisations across 50+ production systems. We’ve seen teams cut their Claude bills from $50,000/month to $8,000/month. We’ve helped startups hit profitability 6 months earlier because they optimised API costs early.
If you’re ready to optimise your Claude workloads but need guidance on architecture, routing logic, or implementation, our AI & Agents Automation service can help. We’ve built the playbooks, the code, and the measurement frameworks. We can help you implement them in your stack.
Start with measurement. Measure your baseline. Then implement caching on your highest-volume use case. Then add routing. Then add batching. Measure the impact at each step.
That’s how you cut costs by 60%+. Not through magic. Through discipline.