Table of Contents
- Why Batch Processing Matters for Sonnet 4.6
- Understanding Sonnet 4.6 Batch Architecture
- Core Batch Processing Patterns
- Prompt Design for Batch Workflows
- Output Validation and Error Handling
- Cost Optimisation Strategies
- Common Failure Modes and Solutions
- Real-World Implementation Examples
- Scaling Beyond the Basics
- Summary and Next Steps
Why Batch Processing Matters for Sonnet 4.6 {#why-batch-processing-matters}
Sonnet 4.6 is a capable mid-tier language model that balances cost and performance. When you’re processing thousands or millions of tasks (document classification, data extraction, content generation, or quality scoring), running them one at a time through the standard API wastes money and time.
Batch processing lets you submit hundreds or thousands of requests in a single operation, receive results asynchronously, and pay significantly less per token. For teams at PADISO working with startups and enterprises on AI & Agents Automation, batch processing is often the difference between a proof-of-concept that costs $50,000 and one that costs $5,000.
The trade-off is latency. You don’t get results back in 2 seconds; you get them back in minutes to hours. But for non-real-time workloads—overnight data pipelines, weekly report generation, monthly compliance checks—that’s perfectly acceptable. In fact, it’s often preferred because it aligns with business rhythm.
Sonnet 4.6 is particularly well-suited to batch because it’s fast enough to process large volumes without becoming a bottleneck, yet smart enough to handle nuanced tasks like entity extraction, sentiment analysis, or code review. It’s the workhorse model for production batch pipelines.
Understanding Sonnet 4.6 Batch Architecture {#understanding-sonnet-architecture}
Before you write code, you need to understand how Anthropic’s batch system works. The official Anthropic batch processing documentation and the Claude API batch processing guide are your canonical references.
Here’s the flow:
- You prepare a list of requests. Each request is a JSON object with a unique custom ID and a params object holding the model, the messages you want to send, and other model parameters.
- You submit the list to Anthropic’s Message Batches API in a single create call. There is no separate file-upload step.
- The system queues and processes your requests asynchronously. Each request is handled independently of the others in the batch.
- Results come back in a JSONL file you can download, keyed by the custom IDs you assigned.
- You parse and use the results in your pipeline.
The key architectural insight: you give up real-time guarantees, and in exchange batch usage is billed at a discount (50% off standard token prices at the time of writing). Anthropic’s documentation says most batches finish in under an hour, but a batch can run for up to 24 hours, after which any requests that have not been processed expire.
Custom IDs are critical. They’re how you map results back to your original inputs. If you’re processing 10,000 documents, your custom IDs might be doc_12345, doc_12346, and so on. Without them, you’ll have no way to know which result corresponds to which input.
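The batch API does not remove the need for retry logic, though. Every request in the results ends in one of four states: succeeded, errored, canceled, or expired. Your pipeline still has to collect the errored and expired requests and resubmit them. The advantage over synchronous calls is that one failing request does not stop the rest of the batch, and the failures arrive in one place, keyed by custom ID.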
Core Batch Processing Patterns {#core-patterns}
The Standard Request-Response Pattern
The simplest pattern is straightforward: prepare requests, submit them, wait for results, process them. Here’s the conceptual flow:
Input Data → Request Preparation → Submit → Queue → Processing → Download → Parse Results
Each request in the batch looks like this:
{
  "custom_id": "request-1",
  "params": {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": "Classify this text as positive, negative, or neutral: 'I love this product'"
      }
    ]
  }
}
You repeat this for every item in your dataset. If you have 10,000 documents, you generate 10,000 such requests. A single batch is limited to 100,000 requests or 256 MB in total size, whichever is reached first, so long prompts or embedded images may force smaller batches.
Once submitted, you poll the batch status endpoint until processing ends. Then you download the results JSONL file and parse it.
The Streaming Results Pattern
For very large batches, downloading a single 500 MB results file and parsing it all at once can strain memory. Instead, stream the results file line by line, process each result, and immediately commit it to your database or downstream system.
This pattern is especially useful when you’re writing results directly into a data warehouse. You don’t accumulate results in memory; you stream them through.
The Chained Batch Pattern
Sometimes you need to process data in stages. For example:
- Stage 1: Extract entities from documents (5,000 documents).
- Stage 2: Enrich extracted entities with external data.
- Stage 3: Score or rank the enriched entities.
You can chain batches: submit Stage 1, wait for results, use those results to generate Stage 2 requests, submit Stage 2, and so on. This is more complex to orchestrate but allows you to build sophisticated multi-step workflows.
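The hand-off between stages can be sketched as follows, assuming the Stage 1 results have already been parsed into a dict keyed by custom ID. The function name `build_stage2_requests`, the `s1_`/`s2_` ID prefixes, the `entities` field, the prompt wording, and the model ID are all illustrative assumptions rather than anything the API prescribes.

```python
def build_stage2_requests(stage1_outputs, model="claude-sonnet-4-6", max_tokens=200):
    """Turn parsed Stage 1 outputs into Stage 2 batch requests.

    stage1_outputs maps a Stage 1 custom ID (e.g. "s1_doc_1") to a dict
    with an "entities" list. Items with no entities are skipped.
    """
    requests = []
    for custom_id, output in sorted(stage1_outputs.items()):
        entities = output.get("entities") or []
        if not entities:
            continue  # nothing to score, so don't pay for a request
        doc_key = custom_id.removeprefix("s1_")  # Python 3.9+
        requests.append({
            "custom_id": f"s2_{doc_key}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": "Score the relevance of each entity from 0 to 100. "
                               "Respond in JSON.\nEntities: " + ", ".join(entities),
                }],
            },
        })
    return requests


# Hypothetical Stage 1 outputs: one document with entities, one without.
stage1 = {
    "s1_doc_1": {"entities": ["Acme Corp", "Sydney"]},
    "s1_doc_2": {"entities": []},
}
stage2 = build_stage2_requests(stage1)
```

Carrying the document key through each stage’s custom ID keeps every result traceable to its original input, and skipping empty items means later stages only pay for work that can produce something.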
The Hybrid Pattern: Batch + Synchronous
Not everything needs to be batch. You might use batch for bulk processing but keep synchronous API calls for interactive or low-latency tasks. For example, a customer-facing chatbot uses synchronous calls (you need a response in 2 seconds), but your overnight data pipeline uses batch (you can wait 4 hours).
The key is choosing the right tool for the job. Batch is for throughput; synchronous is for latency.
Prompt Design for Batch Workflows {#prompt-design}
Batch processing doesn’t change the fundamentals of prompt design, but it does change the economics and constraints. Here’s what you need to know.
Consistency Over Creativity
In batch, you’re usually processing hundreds or thousands of similar items. You want consistent, predictable outputs. Avoid high-temperature settings (temperature > 0.5) unless you have a specific reason. Temperature 0.0 gives you the most repeatable outputs, which is what you want for classification, extraction, or scoring tasks. Note that even at 0.0, outputs are not guaranteed to be fully deterministic, so don’t build logic that depends on identical reruns.
If you’re generating creative content (marketing copy, story variations), you might use temperature 0.7–1.0, but be prepared for higher variance in output quality. You’ll need more rigorous validation downstream.
Structured Output
Always ask for structured output when possible. Instead of:
Summarise this article.
Use:
Summarise this article in JSON format:
{
  "title": "...",
  "summary": "...",
  "key_points": ["...", "..."],
  "sentiment": "positive|neutral|negative"
}
Structured output makes downstream parsing and validation easier. It also tends to improve model performance because you’re being explicit about what you want.
Token Budget Awareness
In batch, you pay for every token. If you’re processing 100,000 items and each one generates 2,000 output tokens unnecessarily, you’re wasting money. Be ruthless about token budgets.
Set max_tokens conservatively. If you’re classifying sentiment, 50 tokens is plenty. If you’re extracting entities from a long document, 500 might be right. If you’re generating a summary, 1,000 might be necessary. Test with a small sample first to understand your actual token usage, then set the budget slightly above that.
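One way to turn “slightly above actual usage” into a number is sketched below. It assumes you have recorded the output token counts from a pilot run (for example from each response’s usage data). The helper name `suggest_max_tokens`, the 20% headroom, and the floor of 16 are illustrative choices, not recommendations from Anthropic.

```python
import math


def suggest_max_tokens(sample_output_tokens, headroom_pct=20, floor=16):
    """Suggest a max_tokens value from observed output token counts.

    Takes the largest observed count and adds headroom_pct percent on top,
    never going below `floor`.
    """
    if not sample_output_tokens:
        raise ValueError("need at least one observed token count")
    largest = max(sample_output_tokens)
    return max(floor, math.ceil(largest * (100 + headroom_pct) / 100))


# Hypothetical output token counts from a five-item pilot.
observed = [31, 28, 40, 35, 33]
budget = suggest_max_tokens(observed)
```

Basing the budget on the largest observed output avoids truncating legitimate responses. If your pilot contains a few extreme outliers, a high percentile may be a better base than the maximum.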
Prompt Versioning
When you’re processing large batches, you can’t easily change your prompt mid-way. If you discover a bug in your prompt after submitting 50,000 requests, you have to wait for them to complete, then reprocess.
Version your prompts explicitly. Store them in a database or version control. Tag each batch with the prompt version. This way, if you need to reprocess, you know exactly which prompt was used.
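A minimal sketch of one way to tag batches, assuming you derive the version from the prompt text itself so that any edit produces a new tag. The helper names `prompt_version` and `batch_record`, the `p_` prefix, and the record fields are hypothetical.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Derive a short, stable version tag from the prompt text."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return f"p_{digest[:12]}"


def batch_record(batch_id: str, template: str, item_count: int) -> dict:
    """Build the metadata row to store alongside a submitted batch."""
    return {
        "batch_id": batch_id,
        "prompt_version": prompt_version(template),
        "item_count": item_count,
    }


TEMPLATE_V1 = "Classify the sentiment of this text: {text}"
TEMPLATE_V2 = "Classify the sentiment of this review: {text}"

v1 = prompt_version(TEMPLATE_V1)
v2 = prompt_version(TEMPLATE_V2)
record = batch_record("batch_example", TEMPLATE_V1, 5000)
```

A content hash cannot drift out of sync with the prompt the way a hand-maintained version number can. Store the full template next to the tag so the tag can be resolved later.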
Few-Shot Examples
For complex tasks, include a few examples in your prompt. Few-shot learning improves accuracy and consistency. But be aware: each example adds tokens to every request. If you have 10,000 requests and each example adds 100 tokens, that’s 1 million extra tokens across the batch.
Balance accuracy against cost. Sometimes a simpler prompt with less accuracy is acceptable if it saves 30% on token costs.
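The arithmetic above is easy to script before committing to a prompt. This is a sketch only: the function names are hypothetical, and the price per million input tokens is a placeholder you should replace with the current figure for your model.

```python
def few_shot_overhead_tokens(num_requests, tokens_per_example, num_examples=1):
    """Extra input tokens added across a batch by few-shot examples."""
    return num_requests * tokens_per_example * num_examples


def overhead_cost(extra_tokens, price_per_million_input_tokens):
    """Cost of the extra tokens at a given (placeholder) price."""
    return extra_tokens / 1_000_000 * price_per_million_input_tokens


extra = few_shot_overhead_tokens(10_000, 100)
# 1.50 is a made-up price used only to show the calculation.
cost = overhead_cost(extra, 1.50)
```

Running the same calculation with one, two, and three examples gives a concrete price for each accuracy improvement you measure in a pilot.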
Output Validation and Error Handling {#output-validation}
Batch processing introduces new failure modes. The API might return a valid response, but the content might be invalid, incomplete, or nonsensical. You need robust validation.
Structured Output Validation
If you asked for JSON output, validate that each response is actually valid JSON before you try to parse it. Use a JSON schema validator to ensure the structure matches what you expected.
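import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "confidence"],
}

# results_file is an open handle on the downloaded results JSONL;
# log_error and mark_for_reprocessing are your own helpers.
for line in results_file:
    result = json.loads(line)
    custom_id = result["custom_id"]
    # Errored, canceled, and expired requests carry no message.
    if result["result"]["type"] != "succeeded":
        mark_for_reprocessing(custom_id)
        continue
    try:
        # The model's reply is the text of the first content block.
        output = json.loads(result["result"]["message"]["content"][0]["text"])
        validate(instance=output, schema=schema)
    except (json.JSONDecodeError, ValidationError, KeyError, IndexError) as e:
        # Handle invalid output
        log_error(custom_id, e)
        mark_for_reprocessing(custom_id)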
Semantic Validation
Beyond structure, validate semantics. If you asked for a classification, did you get one of the expected classes? If you asked for a number between 0 and 100, is the result actually in that range?
Semantic validation catches cases where the model returns valid JSON but nonsensical content. For example, it might classify a positive review as “maybe” instead of “positive”, or score something as 150 when you asked for 0–100.
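A semantic check for the two cases just described might look like the sketch below. The field names `sentiment` and `score`, the allowed class set, and the function name `semantic_errors` are assumptions for illustration.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def semantic_errors(output: dict) -> list:
    """Return a list of problems; an empty list means the output passed."""
    errors = []
    sentiment = output.get("sentiment")
    if sentiment not in ALLOWED_SENTIMENTS:
        errors.append(f"unexpected sentiment: {sentiment!r}")
    score = output.get("score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        errors.append(f"score is not a number: {score!r}")
    elif not 0 <= score <= 100:
        errors.append(f"score out of range 0-100: {score}")
    return errors


good = semantic_errors({"sentiment": "positive", "score": 87})
bad = semantic_errors({"sentiment": "maybe", "score": 150})
```

Returning every problem instead of raising on the first one makes the failure log more useful when you later decide whether the prompt or the data is at fault.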
Fallback and Reprocessing
When validation fails, you have options:
- Fallback to a default. If sentiment classification fails, default to “neutral” and move on.
- Reprocess with a different prompt. Maybe the original prompt was ambiguous. Try a clearer version.
- Escalate to human review. For high-stakes tasks, flag failures for a human to review.
- Reject and skip. If the data is corrupted or the task is impossible, skip it.
Choose based on the cost of error. For internal analytics, a fallback might be fine. For customer-facing features or compliance tasks, human review might be necessary.
Error Rate Monitoring
Track your validation failure rate. If 5% of outputs fail validation, that’s concerning and suggests your prompt might be unclear or your task might be too ambiguous for the model. If 0.1% fail, that’s probably acceptable.
Set a threshold. If your error rate exceeds it, pause and investigate before reprocessing the entire batch.
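A threshold gate can be as small as the sketch below. The function name `should_pause` and the 1% default are illustrative; choose a threshold that reflects the cost of errors in your own pipeline.

```python
def should_pause(failed: int, total: int, threshold: float = 0.01) -> bool:
    """Return True when the validation failure rate exceeds the threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total > threshold


# 50 failures in 1,000 results is 5%, well above a 1% threshold.
pause_high = should_pause(failed=50, total=1000)
# 1 failure in 1,000 results is 0.1%, below the threshold.
pause_low = should_pause(failed=1, total=1000)
```

Calling this after each batch, before any automatic reprocessing, stops a broken prompt from being resubmitted at full scale.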
Cost Optimisation Strategies {#cost-optimisation}
Batch processing is cheaper than synchronous, but you can optimise further. Here are concrete strategies that PADISO uses when building AI & Agents Automation systems for clients.
Right-Sizing Token Usage
Every unnecessary token costs money. Audit your prompts:
- Remove verbose instructions. Instead of “Please carefully analyse this text and provide a detailed summary”, use “Summarise this text”.
- Eliminate redundant examples. If one example is enough, don’t include three.
- Set max_tokens appropriately. If you’re classifying sentiment, 50 tokens is enough. Don’t set it to 1,000.
- Keep system prompts lean. System prompt tokens are billed as input tokens just like user messages, and they are sent with every request in the batch. Only include what’s necessary.
A 10% reduction in average tokens per request translates directly to 10% savings across your entire batch. For a batch processing 100,000 requests, that could be hundreds of dollars.
Model Selection
Sonnet 4.6 is the right choice for most batch workloads. It’s faster and cheaper than Opus, yet more capable than Haiku. But for very simple tasks (basic classification, format conversion), Haiku might be sufficient at a fraction of the price. Check Anthropic’s current pricing page for the exact ratio, because it varies between model generations.
Run a small pilot with both models on a representative sample of your data. Measure accuracy and cost. If Haiku gets 95% accuracy and Sonnet gets 98%, is the 3% accuracy improvement worth paying several times more per token? Probably not for internal analytics, but maybe for customer-facing features.
Batch Size Optimisation
Anthropic charges per token, not per request, so batch size does not change your bill. But very small batches multiply your submission, polling, and bookkeeping overhead, and very large batches take longer to finish, so you wait longer before any results arrive.
The sweet spot is usually 5,000–20,000 requests per batch. This is large enough to get good queue efficiency but small enough that you’re not waiting hours for results.
If you have 100,000 items to process, submit five batches of 20,000 rather than one batch of 100,000. You’ll get results faster (the first batch completes while later ones are processing) and can start downstream processing sooner.
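Splitting a request list into batches is a few lines of code. In this sketch the function name `chunk_requests` and the 20,000 default are assumptions taken from the guideline above, and the request dicts are stubs.

```python
def chunk_requests(requests, max_per_batch=20_000):
    """Yield consecutive slices of at most max_per_batch requests."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    for start in range(0, len(requests), max_per_batch):
        yield requests[start:start + max_per_batch]


# Stub requests standing in for real batch request dicts.
items = [{"custom_id": f"doc_{i}"} for i in range(100_000)]
batches = list(chunk_requests(items))
```

If your requests are large (long documents or images), chunk by serialised size as well as by count so that each batch stays under the API’s size limit.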
Caching and Deduplication
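If many requests share the same long prefix, such as a system prompt or a block of reference context, use prompt caching. Cache reads are billed at a fraction of the normal input price (about 10% at the time of writing), so the cached portion of each request can cost up to 90% less. In a batch, cache hits are best-effort because requests are processed concurrently and in no guaranteed order.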
Also, deduplicate your input data. If you’re processing 10,000 documents and 500 are duplicates, process only the unique 9,500 and copy results for duplicates. That’s 5% savings with no quality loss.
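Deduplication and copying results back can be sketched as below, assuming documents are `(doc_id, text)` pairs and that exact text matches (after trimming whitespace) count as duplicates. The function names `dedupe` and `fan_out` are hypothetical.

```python
import hashlib


def dedupe(documents):
    """Split documents into unique items and a map of duplicates.

    Returns (unique, aliases): `unique` is the list of (doc_id, text) to
    submit, `aliases` maps each duplicate doc_id to the doc_id that
    represents it.
    """
    first_seen = {}
    unique, aliases = [], {}
    for doc_id, text in documents:
        key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if key in first_seen:
            aliases[doc_id] = first_seen[key]
        else:
            first_seen[key] = doc_id
            unique.append((doc_id, text))
    return unique, aliases


def fan_out(results, aliases):
    """Copy each representative's result to its duplicates."""
    full = dict(results)
    for dup_id, rep_id in aliases.items():
        if rep_id in results:
            full[dup_id] = results[rep_id]
    return full


docs = [("doc_1", "Great product"), ("doc_2", "Terrible"), ("doc_3", "Great product ")]
unique, aliases = dedupe(docs)
# Pretend these came back from the batch for the two unique documents.
results = fan_out({"doc_1": "positive", "doc_2": "negative"}, aliases)
```

Hashing the text keeps memory use flat even when documents are long. Near-duplicates (different whitespace inside the text, changed casing) would need a normalisation step that this sketch does not attempt.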
Monitoring and Alerting
Set up monitoring on your batch costs. Track cost per batch, cost per item, and total monthly spend. Set alerts if costs spike unexpectedly. This catches bugs early—for example, if you accidentally set max_tokens to 10,000 instead of 1,000, you’ll notice immediately.
Common Failure Modes and Solutions {#failure-modes}
Engineering teams hit these problems repeatedly when scaling batch processing. Here’s how to avoid or fix them.
Failure Mode 1: Custom ID Collisions
Problem: You generate custom IDs without ensuring uniqueness. Two requests end up with the same ID. Results are ambiguous.
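Solution: Use a deterministic, collision-free ID scheme. UUID4 is safe but tells you nothing about the input. A more traceable approach: combine the item ID with a hash of the input data and a sequence number. Keep IDs short and plain: Anthropic limits custom IDs to 64 characters drawn from letters, digits, underscores, and hyphens.
import hashlib

def generate_custom_id(item_id, content, sequence):
    # Hash the input text so a changed document yields a different ID.
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{item_id}_{content_hash[:8]}_{sequence}"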
Or, if you’re processing database records, use the database primary key plus a batch timestamp: doc_12345_20240115_batch1.
Failure Mode 2: Lost Results
Problem: You submit a batch, it completes, but you don’t download the results before they expire (Anthropic keeps results for 29 days after the batch is created), or you forget which batch ID corresponds to which data.
Solution: Download results as soon as a batch ends. Store the batch ID and results file path in your database. Log everything.
from anthropic import Anthropic

client = Anthropic()
batch = client.messages.batches.retrieve(batch_id)
# A batch reports "ended" once every request has finished, whatever the outcome.
if batch.processing_status == "ended":
    results_file = f"results_{batch_id}.jsonl"
    with open(results_file, "w") as f:
        # results() yields one typed result object per request.
        for result in client.messages.batches.results(batch_id):
            f.write(result.model_dump_json() + "\n")
    # db is your own persistence layer.
    db.insert_batch_record(batch_id, results_file, status="downloaded")
Failure Mode 3: Timeout Waiting for Results
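Problem: You submit a batch and wait synchronously for it to complete. If it takes 4 hours, your application hangs.
Solution: Poll instead of blocking your application. Submit the batch, store the batch ID, and check status periodically (every 5 minutes) from a background worker or scheduled job so the rest of your application keeps running.
import time
from anthropic import Anthropic

client = Anthropic()
# requests is the list of request dicts prepared earlier;
# download_results and handle_failure are your own helpers.
batch = client.messages.batches.create(requests=requests)
batch_id = batch.id
while True:
    status = client.messages.batches.retrieve(batch_id)
    if status.processing_status == "ended":
        # "ended" covers every outcome, so check the per-request counts.
        counts = status.request_counts
        if counts.errored or counts.expired:
            handle_failure(batch_id)
        download_results(batch_id)
        break
    print(f"Batch {batch_id} is {status.processing_status}...")
    time.sleep(300)  # Check every 5 minutes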
Failure Mode 4: Invalid JSON in Results
Problem: You asked for JSON output, but the model returned malformed JSON (missing quotes, trailing commas, etc.). Your parser crashes.
Solution: Use a lenient JSON parser or implement recovery logic.
import json
import re

def parse_json_lenient(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Try to fix common issues
        text = re.sub(r',\s*}', '}', text)  # Remove trailing commas before }
        text = re.sub(r',\s*]', ']', text)  # Remove trailing commas before ]
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            return None  # Fallback: return None and handle downstream
Or use a library like json5 which is more lenient.
Failure Mode 5: Rate Limit Exceeded
Problem: You submit too many batches too quickly and hit Anthropic’s rate limits.
Solution: Implement exponential backoff and respect rate limits.
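import time
from anthropic import Anthropic, RateLimitError

client = Anthropic()

def submit_with_backoff(requests, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.batches.create(requests=requests)
        except RateLimitError:
            wait_time = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
            print(f"Rate limited. Waiting {wait_time}s before retry...")
            time.sleep(wait_time)
    # Every attempt was rate limited: surface the failure to the caller.
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")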
Also, stagger your batch submissions. Don’t submit 100 batches in parallel; submit them one or two at a time.
Failure Mode 6: Hallucination and Nonsense Outputs
Problem: The model generates plausible-looking but completely wrong outputs. A document classifier returns “purple” when asked for positive/negative/neutral. An entity extractor invents entities that don’t exist in the text.
Solution: This is hard to catch automatically, but you can reduce it:
- Use lower temperature (0.0–0.3). Reduces randomness.
- Use constrained output formats. Ask for JSON with specific enum values, not free text.
- Include negative examples. Show what NOT to do.
- Sample and manually review. For the first batch, manually review 100 random results. If error rate is >5%, investigate your prompt.
For critical tasks, consider reviewing Anthropic’s news on Claude 4 capabilities, or running a smaller sample through a more capable model first to validate your approach.
Real-World Implementation Examples {#implementation-examples}
Here are concrete examples you can adapt.
Example 1: Document Classification Pipeline
You have 50,000 support tickets. You want to classify each as “bug”, “feature request”, or “question”.
import json
import time
from anthropic import Anthropic

client = Anthropic()

# tickets is your own list of dicts with 'id' and 'text'; db is your own data layer.

# 1. Prepare requests
requests = []
for ticket in tickets:
    requests.append({
        "custom_id": f"ticket_{ticket['id']}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 50,
            "messages": [
                {
                    "role": "user",
                    "content": f"""Classify this support ticket as 'bug', 'feature_request', or 'question'.
Respond in JSON: {{"classification": "...", "confidence": 0.0-1.0}}
Ticket: {ticket['text']}"""
                }
            ]
        }
    })

# 2. Keep a JSONL copy of what was sent (useful for audit and reprocessing)
with open('batch_requests.jsonl', 'w') as f:
    for req in requests:
        f.write(json.dumps(req) + '\n')

# 3. Submit the batch (the API takes the request list directly, not a file)
batch = client.messages.batches.create(requests=requests)
print(f"Batch {batch.id} submitted")

# 4. Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    print(f"Status: {status.processing_status}")
    if status.processing_status == "ended":
        break
    time.sleep(30)

# 5. Process results
for result in client.messages.batches.results(batch.id):
    ticket_id = result.custom_id
    # Each request ends as succeeded, errored, canceled, or expired.
    if result.result.type != "succeeded":
        print(f"Request {ticket_id} {result.result.type}")
        continue
    try:
        content = result.result.message.content[0].text
        classification = json.loads(content)
        db.update_ticket(ticket_id, classification=classification["classification"])
    except (json.JSONDecodeError, KeyError) as e:
        print(f"Error processing {ticket_id}: {e}")
Example 2: Data Extraction from Invoices
You have 10,000 invoice images. You want to extract vendor name, invoice number, and total amount from each.
import base64
import json
from anthropic import Anthropic

client = Anthropic()

# invoices is your own list of dicts with 'id' and 'image_path' (JPEG files assumed).
# Base64 images are large, so split the list into several batches if needed
# to keep each one under the batch size limit.
requests = []
for invoice in invoices:
    with open(invoice['image_path'], 'rb') as f:
        image_data = base64.standard_b64encode(f.read()).decode('utf-8')
    requests.append({
        "custom_id": f"invoice_{invoice['id']}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 200,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": image_data
                            }
                        },
                        {
                            "type": "text",
                            "text": """Extract these fields from the invoice:
- vendor_name
- invoice_number
- total_amount (numeric only)
Respond in JSON format."""
                        }
                    ]
                }
            ]
        }
    })
Example 3: Batch with Streaming Results
For very large batches, stream results instead of loading everything into memory.
import json
from anthropic import Anthropic
import psycopg2

client = Anthropic()
conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

batch_id = "batch_123456"  # From a previous submission

# Iterate over results one at a time instead of building a list in memory.
for index, result in enumerate(client.messages.batches.results(batch_id), start=1):
    doc_id = result.custom_id.replace("doc_", "")
    try:
        if result.result.type != "succeeded":
            raise ValueError(f"request {result.result.type}")
        content = result.result.message.content[0].text
        data = json.loads(content)
        cur.execute(
            "UPDATE documents SET classification = %s WHERE id = %s",
            (data["classification"], doc_id)
        )
    except (json.JSONDecodeError, KeyError, ValueError) as e:
        cur.execute(
            "UPDATE documents SET error = %s WHERE id = %s",
            (str(e), doc_id)
        )
    # Commit every 1000 records to avoid long transactions
    if index % 1000 == 0:
        conn.commit()

conn.commit()
conn.close()
Scaling Beyond the Basics {#scaling-beyond}
Once you’ve got batch processing working, here’s how to scale it.
Multi-Stage Pipelines
Chain batches together for complex workflows. For example, a document processing pipeline might look like:
- Stage 1: Extract key information from documents (batch of 50,000).
- Stage 2: Enrich extracted data with external APIs (batch of 50,000).
- Stage 3: Score and rank results (batch of 50,000).
Each stage waits for the previous one to complete. Use a job queue or workflow orchestrator (like Celery or Apache Airflow) to coordinate this.
Parallel Processing
If you have multiple independent datasets, process them in parallel. Submit batches for dataset A, dataset B, and dataset C simultaneously. No individual batch finishes sooner, but the total wall-clock time becomes that of the slowest batch rather than the sum of all three.
Incremental Processing
Don’t reprocess everything every time. If you processed documents 1–10,000 yesterday and today you have documents 10,001–10,500, only process the new ones.
Track which items have been processed. Store results in a database with timestamps. Before submitting a new batch, filter out items that have already been processed.
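The filtering step can be sketched in a few lines, assuming each item carries an `id` and that you can fetch the IDs already processed from your results store. The function name `select_unprocessed` is hypothetical.

```python
def select_unprocessed(items, processed_ids):
    """Return only the items whose ID has not been processed yet."""
    processed = set(processed_ids)  # set lookup keeps the filter fast
    return [item for item in items if item["id"] not in processed]


# Documents 1-10,500 exist; 1-10,000 were processed yesterday.
all_items = [{"id": i} for i in range(1, 10_501)]
already_done = range(1, 10_001)
todo = select_unprocessed(all_items, already_done)
```

For very large tables it is usually cheaper to do this filter in the database (an anti-join against the results table) than to pull every ID into application memory.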
Monitoring and Observability
Instrument your batch pipeline:
- Log every batch submission with timestamp, item count, and prompt version.
- Track batch status and alert if a batch fails.
- Monitor token usage per batch and per item.
- Sample and validate outputs from each batch.
- Track cost and alert if it spikes.
Use tools like Datadog, New Relic, or even simple CloudWatch logs. The goal is visibility into what’s happening and early warning of problems.
Integration with Data Platforms
When building systems at scale, integrate batch processing with your data warehouse or lake. For teams working on Platform Development in Sydney or other regions, this often means:
- Input: Query raw data from data warehouse (Snowflake, BigQuery, Redshift).
- Processing: Submit to batch API.
- Output: Write results back to data warehouse.
- Orchestration: Use dbt, Airflow, or Dagster to orchestrate the pipeline.
This keeps everything in your data stack and makes it easy to build downstream analytics and dashboards.
Handling Large Volumes
If you’re processing millions of items, you can’t do it in one batch. Split into multiple batches (as mentioned earlier) but also consider:
- Sampling first. Run a small batch (1,000 items) to validate your prompt and check accuracy before committing to processing 1 million.
- Progressive batching. Process in waves: 10,000 items today, 10,000 tomorrow, etc. This spreads cost and risk.
- Prioritisation. Process high-value items first. If you have 1 million documents but only 100,000 are high-priority, process those first and see if the results are useful before doing the rest.
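The sampling step above can be made reproducible with a seeded random sample, as in this sketch. The function name `pilot_sample`, the default size, and the seed value are illustrative assumptions.

```python
import random


def pilot_sample(items, size=1_000, seed=42):
    """Draw a reproducible random sample for a pilot batch."""
    if size >= len(items):
        return list(items)
    # A dedicated Random instance leaves the global random state untouched.
    return random.Random(seed).sample(items, size)


# Stub population standing in for your full dataset.
population = [f"doc_{i}" for i in range(50_000)]
pilot = pilot_sample(population)
```

Fixing the seed means you can rerun the same pilot after a prompt change and compare results on identical inputs. Taking the first 1,000 rows instead often over-represents whatever the table happens to be sorted by.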
Integration with PADISO Services
If you’re building batch processing systems for complex workflows—especially those involving AI automation, compliance, or platform engineering—PADISO can help. We work with startups and enterprises to design and implement production-grade AI systems.
Our AI & Agents Automation service includes designing batch pipelines, optimising costs, and ensuring outputs meet quality standards. We’ve helped clients reduce processing costs by 40–60% through prompt optimisation and model selection.
For teams building data platforms or undergoing digital transformation, our Platform Engineering teams (based in Sydney, Los Angeles, Chicago, Boston, Seattle, Austin, Dallas, Houston, Atlanta, Denver, San Diego, Toronto, Vancouver, and Montreal) can integrate batch processing into your data architecture, ensuring scalability and reliability.
Check out our case studies to see how we’ve helped other teams ship AI systems at scale.
Summary and Next Steps {#summary}
Batch processing with Sonnet 4.6 is a powerful way to process large volumes of data cost-effectively. Here’s what you need to remember:
Core Patterns:
- Submit requests as a list of JSON objects, wait for asynchronous processing, download the JSONL results.
- Use custom IDs to map results back to inputs.
- Chain batches for multi-stage workflows.
- Combine batch and synchronous processing as needed.
Prompt Design:
- Keep temperature low (0.0–0.3) for consistency.
- Request structured output (JSON).
- Set max_tokens conservatively.
- Version your prompts.
Validation:
- Validate JSON structure and semantics.
- Implement fallback and reprocessing logic.
- Monitor error rates.
Cost Optimisation:
- Right-size token usage.
- Choose the right model (Sonnet for most tasks, Haiku for simple ones).
- Batch in sizes of 5,000–20,000 requests.
- Deduplicate and cache where possible.
Common Pitfalls:
- Ensure unique custom IDs.
- Download results immediately.
- Use asynchronous polling.
- Implement lenient JSON parsing.
- Sample and validate outputs from the first batch.
Next Steps:
- Start small. Pick one workflow (classification, extraction, scoring) and process 1,000 items. Validate the approach before scaling.
- Measure baseline. Track cost per item, error rate, and latency. This becomes your benchmark.
- Optimise iteratively. Refine your prompt, adjust max_tokens, test different models. Measure improvement.
- Build monitoring. Set up logging and alerting so you know what’s happening in production.
- Scale progressively. Move from 1,000 to 10,000 to 100,000 items as confidence grows.
Batch processing isn’t glamorous, but it’s reliable, cost-effective, and essential for production AI systems. Master it and you’ll unlock massive scale at reasonable cost.
For guidance on integrating batch processing into a larger AI strategy or platform, reach out to PADISO. We help teams at all stages—from validating a concept to scaling to millions of items—design and implement systems that work.