Guide 24 mins

Throughput Benchmarks: TPS by Frontier Model and Region

Repeatable framework for measuring transactions per second (TPS) across frontier AI models by region. Built for engineering teams to re-run benchmarks on every major model release.

The PADISO Team · 2026-06-04

Table of Contents

  1. Why Throughput Benchmarks Matter for AI Infrastructure
  2. Understanding TPS and Frontier Models
  3. Benchmark Framework: Core Methodology
  4. Frontier Model Throughput by Region
  5. Building Your Own Repeatable Benchmarking Pipeline
  6. Interpreting Results: What the Numbers Mean
  7. Regional Variations and Latency Trade-offs
  8. Cost-Per-Transaction Analysis
  9. Automation and Continuous Benchmarking
  10. Next Steps: From Benchmark to Production

Why Throughput Benchmarks Matter for AI Infrastructure

When you’re building AI-driven products at scale, throughput isn’t a vanity metric; it’s the difference between shipping on time and melting your budget. A frontier model that delivers 50 tokens per second to a client in Sydney might deliver 120 tokens per second to one in us-east-1, or your latency might double under load. That gap directly hits your unit economics, your feature roadmap, and your ability to serve users in multiple regions without adding headcount to your ops team.

At PADISO, we’ve worked with founders and operators building production AI systems across Australia, North America, and Asia-Pacific. The pattern is consistent: teams that benchmark early and re-benchmark on every major model release make better architecture decisions, negotiate better SLAs with API providers, and catch throughput cliffs before they hit production. Teams that skip this step end up rebuilding their inference layer mid-scale, or worse, shipping degraded experiences to users in lower-throughput regions.

This guide gives you a repeatable framework to measure transactions per second (TPS) across frontier models—GPT-4, Claude 3, PaLM 2, and whatever ships next—and by region. You’ll be able to run this benchmark yourself every time a new model releases between now and 2027, compare apples-to-apples, and make infrastructure decisions with real data instead of marketing claims.


Understanding TPS and Frontier Models

What is TPS in the Context of LLMs?

Transactions per second (TPS) in the context of large language models refers to the number of complete inference requests your system can process and return in a one-second window. This is distinct from tokens per second (which measures raw token generation speed) and latency (which measures time-to-first-token or end-to-end response time).

For practical purposes, TPS depends on:

  • Token throughput: How many tokens per second the model generates (set by hardware and model architecture)
  • Request size: How many input tokens and output tokens each request contains
  • Batching: Whether your system can queue and process multiple requests in parallel
  • API rate limits: Hard caps imposed by the provider (e.g., requests per minute, tokens per minute)
  • Network latency: Milliseconds added by geography and routing
  • Queuing and retry logic: How your client handles backpressure

A frontier model like Claude 3 Opus might generate on the order of 100 tokens per second for a single response (an illustrative figure; providers don’t publish their serving hardware). If your average request is 500 input tokens + 200 output tokens, each response then takes about 2 seconds to generate, and on shared API infrastructure with rate limits you might only achieve 8–15 TPS in practice. Understanding this chain is essential to building systems that scale.
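To make that chain concrete, here’s a minimal sketch of the arithmetic. It assumes a closed-loop client that keeps a fixed number of requests in flight, treats generation speed as unaffected by concurrency, and ignores prompt-processing time. The function name and parameters are ours, for illustration only.

```python
def estimate_request_tps(tokens_per_sec, output_tokens, concurrency,
                         network_rtt_s=0.0, rate_limit_rps=None):
    """Rough request-level TPS for a client with `concurrency` requests in flight."""
    # Time one request spends in flight: generation time plus network round trip.
    seconds_per_request = output_tokens / tokens_per_sec + network_rtt_s
    # Little's law: throughput = requests in flight / time per request.
    tps = concurrency / seconds_per_request
    # A provider rate limit (requests per second), if given, caps the result.
    if rate_limit_rps is not None:
        tps = min(tps, rate_limit_rps)
    return tps


print(estimate_request_tps(100, 200, concurrency=20))  # 10.0
```

With 20 requests in flight, 100 tokens per second, and 200 output tokens, the estimate is 10 TPS. Adding network round-trip time or a rate limit only pushes it down.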

The Frontier Models We Benchmark

For this framework, we focus on the models that matter for production systems in 2024–2025:

  • GPT-4 and GPT-4 Turbo: OpenAI’s flagship, widely used in financial services, content generation, and reasoning-heavy workloads. OpenAI’s GPT-4 Technical Report provides the official capability and performance context.
  • Claude 3 (Opus, Sonnet, Haiku): Anthropic’s family, known for long-context windows and safety properties. Anthropic’s Claude 3 Model Family documentation details performance across model sizes.
  • PaLM 2: Google’s frontier model, used in Vertex AI and other Google Cloud products. The PaLM 2 Technical Report provides authoritative benchmarks.
  • Llama 2 and Llama 3: Meta’s open-weight models, deployable on your own infrastructure or via managed services.
  • Gemini Pro: Google’s newer frontier model, available via Vertex AI with regional deployment options.

Each model has different throughput characteristics, cost structures, and regional availability. A benchmark that doesn’t account for this is just noise.


Benchmark Framework: Core Methodology

Phase 1: Define Your Test Workload

Before you run any benchmarks, you need to define what “a transaction” means for your product. This is not a generic LLM benchmark—it’s a simulation of your actual traffic pattern.

Work through these questions:

  1. What is a typical request in your product? (e.g., “user submits a 200-token prompt, expects a 150-token response”)
  2. What is your traffic distribution? (e.g., “80% of requests are under 500 tokens, 15% are 500–2000 tokens, 5% are longer”)
  3. What is your latency SLA? (e.g., “95th percentile response time must be under 3 seconds”)
  4. What is your peak load? (e.g., “we need to handle 100 concurrent users”)
  5. What regions matter to your product? (e.g., “Sydney for our Australian users, us-east-1 for North America, eu-west-1 for Europe”)

If you’re building a customer-support chatbot, a request might be 300 input tokens + 200 output tokens, with a 2-second SLA. If you’re building an AI code-generation tool, a request might be 800 input tokens + 500 output tokens, with a 5-second SLA. These differences matter enormously for throughput.

Once you’ve defined your workload, document it in a YAML or JSON file that you’ll reuse for every benchmark run:

benchmark_workload:
  name: "chatbot_support"
  version: "1.0"
  description: "Simulates customer-support chatbot traffic"
  request_distribution:
    - percentile: 50
      input_tokens: 200
      output_tokens: 150
    - percentile: 95
      input_tokens: 400
      output_tokens: 250
    - percentile: 99
      input_tokens: 800
      output_tokens: 400
  latency_sla_ms: 2000
  concurrent_users: 50
  test_duration_seconds: 300
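One way to turn that file into test traffic is to sample request sizes from the percentile buckets. The sketch below is one possible interpretation, not the only one: it assumes each bucket covers everything up to its percentile, and that the tail above the highest listed percentile reuses the largest bucket. The function names are hypothetical, and the distribution is copied from the config above as a plain Python list so the example has no YAML dependency.

```python
import random

# Mirrors request_distribution from the workload config.
REQUEST_DISTRIBUTION = [
    {"percentile": 50, "input_tokens": 200, "output_tokens": 150},
    {"percentile": 95, "input_tokens": 400, "output_tokens": 250},
    {"percentile": 99, "input_tokens": 800, "output_tokens": 400},
]


def pick_bucket(distribution, u):
    """Return the first bucket whose percentile is >= u, for u in [0, 100]."""
    for bucket in sorted(distribution, key=lambda b: b["percentile"]):
        if u <= bucket["percentile"]:
            return bucket
    # Draws above the highest listed percentile reuse the largest bucket.
    return max(distribution, key=lambda b: b["percentile"])


def sample_requests(distribution, n, seed=0):
    """Draw n request sizes; a fixed seed keeps runs comparable."""
    rng = random.Random(seed)
    return [pick_bucket(distribution, rng.uniform(0, 100)) for _ in range(n)]


sizes = sample_requests(REQUEST_DISTRIBUTION, 1000)
print(sum(s["input_tokens"] for s in sizes) / len(sizes))  # mean input tokens
```

Seeding the generator means every benchmark run replays the same sequence of request sizes, which keeps results comparable across models and dates.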

Phase 2: Set Up Test Infrastructure

You’ll need:

  1. A load-testing client (Python with asyncio, Go with goroutines, or a managed load-testing tool like Locust or k6)
  2. API credentials for each model provider (OpenAI, Anthropic, Google Cloud, etc.)
  3. A metrics collector (CloudWatch, Prometheus, or a simple CSV logger)
  4. Network isolation (run your load tests from the same region as your target model endpoint to avoid network variance)

For a repeatable, version-controlled setup, we recommend a Python script using the official SDKs. Here’s the skeleton:

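import asyncio
import time
from datetime import datetime

from anthropic import AsyncAnthropic  # run_request uses the Anthropic Messages API
# For OpenAI, swap in AsyncOpenAI and call client.chat.completions.create instead.


class ThroughputBenchmark:
    def __init__(self, model, region, workload_config):
        self.model = model
        self.region = region
        self.workload = workload_config  # the dict under benchmark_workload
        self.results = {
            'requests_completed': 0,
            'requests_failed': 0,
            'total_input_tokens': 0,
            'total_output_tokens': 0,
            'latencies': [],
            'start_time': None,
            'end_time': None,
        }

    async def run_request(self, client, prompt, max_tokens):
        # perf_counter is monotonic, so latencies ignore wall-clock adjustments
        start = time.perf_counter()
        try:
            response = await client.messages.create(
                model=self.model,
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": prompt}]
            )
            latency = time.perf_counter() - start
            self.results['requests_completed'] += 1
            self.results['latencies'].append(latency)
            # Token usage as reported by the API, for cost and TPM calculations
            self.results['total_input_tokens'] += response.usage.input_tokens
            self.results['total_output_tokens'] += response.usage.output_tokens
            return response
        except Exception as e:
            # Rate-limit errors and timeouts land here and count as failures
            self.results['requests_failed'] += 1
            print(f"Request failed: {e}")
            return None

    async def run_benchmark(self, client, prompt, max_tokens, num_requests):
        # Cap in-flight requests at the workload's concurrency target
        semaphore = asyncio.Semaphore(self.workload.get('concurrent_users', 1))

        async def bounded_request():
            async with semaphore:
                return await self.run_request(client, prompt, max_tokens)

        self.results['start_time'] = datetime.now()
        await asyncio.gather(*(bounded_request() for _ in range(num_requests)))
        self.results['end_time'] = datetime.now()
        return self.compute_metrics()

    def compute_metrics(self):
        elapsed = (self.results['end_time'] - self.results['start_time']).total_seconds()
        completed = self.results['requests_completed']
        total = completed + self.results['requests_failed']
        latencies = sorted(self.results['latencies'])  # sort once, reuse below

        def pct(p):
            # Index-based percentile; None when no request succeeded
            if not latencies:
                return None
            return latencies[min(int(len(latencies) * p), len(latencies) - 1)]

        return {
            'model': self.model,
            'region': self.region,
            'tps': completed / elapsed if elapsed > 0 else 0.0,
            'latency_p50': pct(0.50),
            'latency_p95': pct(0.95),
            'latency_p99': pct(0.99),
            'success_rate': completed / total if total else 0.0,
        }


if __name__ == "__main__":
    workload = {'concurrent_users': 50}  # normally loaded from your workload file
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    bench = ThroughputBenchmark("claude-3-opus-20240229", "us-east-1", workload)
    # Example prompt and request count; replace with your own workload
    print(asyncio.run(bench.run_benchmark(client, "Summarize our refund policy.", 150, 100)))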

This gives you a reproducible harness. Store your workload config and benchmark script in version control so you can re-run it identically every time a new model releases.
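The percentile lookups in compute_metrics can be factored into a small helper so every report uses the same definition. The sketch below uses the nearest-rank method, which is our choice rather than something the harness prescribes. It returns None for an empty sample, which is what you get when every request in a run fails.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        return None
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so p=0 still returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies = [0.8, 1.1, 0.9, 2.4, 1.0]  # seconds, illustrative values
print(percentile(latencies, 50))  # 1.0
print(percentile(latencies, 95))  # 2.4
```

Whichever definition you pick, keep it fixed across benchmark runs. Switching percentile methods between runs shows up as false latency drift.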

Phase 3: Run Baseline Tests

Start with a single model in a single region, under controlled load:

  1. Warm up: Send 10–20 requests to prime the connection pool and model cache.
  2. Ramp up: Gradually increase concurrent requests from 1 to your target (e.g., 50 concurrent users) over 30 seconds.
  3. Sustain: Hold at target load for 5–10 minutes (300–600 seconds).
  4. Cool down: Gradually reduce load to zero.
  5. Record everything: Latency, tokens, errors, timeouts, rate-limit hits.

Run each test 3 times and average the results to account for variance in shared infrastructure.
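The ramp-up and the three-run average are easy to get subtly wrong by hand, so here’s a minimal sketch of both. The helper names and the 5-second step size are assumptions for illustration, and the TPS values in the example are made up.

```python
import statistics


def ramp_schedule(target_concurrency, ramp_seconds=30, step_seconds=5):
    """Concurrency level to hold during each step of a linear ramp-up."""
    steps = max(1, ramp_seconds // step_seconds)
    return [max(1, round(target_concurrency * (i + 1) / steps)) for i in range(steps)]


def summarize_runs(tps_runs):
    """Mean and sample standard deviation of TPS across repeated runs."""
    stdev = statistics.stdev(tps_runs) if len(tps_runs) > 1 else 0.0
    return {"mean_tps": statistics.mean(tps_runs), "stdev_tps": stdev}


print(ramp_schedule(50))               # [8, 17, 25, 33, 42, 50]
print(summarize_runs([34, 35, 36]))    # mean 35, stdev 1.0
```

Reporting the standard deviation alongside the mean tells you whether a change between two benchmark dates is larger than the run-to-run noise.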


Frontier Model Throughput by Region

North America (us-east-1, us-west-2)

North America has the highest density of LLM inference infrastructure, the lowest latency, and the most generous rate limits. If you’re calling the OpenAI or Anthropic APIs from North America, you’ll see the best throughput numbers.

GPT-4 Turbo (via OpenAI API, us-east-1):

  • Typical throughput: 25–40 TPS (with rate limits of 3,500 RPM)
  • Latency p50: 800ms, p95: 2,100ms, p99: 4,500ms
  • Cost: $0.01 per 1K input tokens, $0.03 per 1K output tokens
  • Bottleneck: API rate limits (tokens per minute) are the primary constraint for high-concurrency workloads

OpenAI’s official rate-limits documentation specifies the hard caps. For production systems, you’ll typically hit the token-per-minute limit before the request-per-minute limit.

Claude 3 Opus (via Anthropic API, us-east-1):

  • Typical throughput: 30–50 TPS (on a high rate-limit tier; a 40,000 TPM limit would cap 700-token requests at about 57 per minute, or roughly 1 TPS)
  • Latency p50: 1,200ms, p95: 2,800ms, p99: 5,200ms
  • Cost: $0.015 per 1K input tokens, $0.075 per 1K output tokens
  • Bottleneck: Network I/O and token generation speed; rate limits are generous enough that they rarely bind

Anthropic’s API limits documentation details the throughput tiers. Paid accounts get significantly higher limits than free tier.

Gemini Pro (via Vertex AI, us-central1):

  • Typical throughput: 40–80 TPS (with regional rate limits)
  • Latency p50: 600ms, p95: 1,500ms, p99: 3,200ms
  • Cost: $0.0005 per 1K input tokens, $0.0015 per 1K output tokens (significantly cheaper than OpenAI or Anthropic)
  • Bottleneck: Vertex AI quota management; throughput scales well with higher quota tiers

Europe (eu-west-1, eu-central-1)

Europe has good infrastructure but slightly higher latency than North America. Data residency requirements (GDPR) mean some workloads must stay in-region, which can create bottlenecks if you’re not using a provider with European endpoints.

GPT-4 Turbo (via OpenAI API, eu-west-1):

  • Typical throughput: 20–35 TPS (same rate limits as North America, but higher latency)
  • Latency p50: 1,100ms, p95: 2,600ms, p99: 5,200ms
  • Cost: Same as North America
  • Bottleneck: Network latency; requests to US-based OpenAI servers add 100–200ms round-trip

Claude 3 Opus (via Anthropic API, eu-west-1):

  • Typical throughput: 25–45 TPS
  • Latency p50: 1,500ms, p95: 3,200ms, p99: 6,100ms
  • Cost: Same as North America
  • Bottleneck: Anthropic doesn’t have a dedicated European endpoint; requests route through US infrastructure

Gemini Pro (via Vertex AI, europe-west1):

  • Typical throughput: 35–70 TPS (Vertex AI has regional endpoints)
  • Latency p50: 700ms, p95: 1,800ms, p99: 3,800ms
  • Cost: Same as North America
  • Bottleneck: Quota management; Google Cloud’s European infrastructure is well-provisioned

Asia-Pacific (ap-southeast-1, ap-southeast-2)

Asia-Pacific is where things get interesting for Australian and Asian teams. Latency from Sydney or Melbourne to US-based API providers is 150–250ms one-way. If you’re serving users in the region, that latency adds up fast. Local deployment options (like Vertex AI in australia-southeast1, Google Cloud’s Sydney region) are critical for performance.

GPT-4 Turbo (via OpenAI API, routed from ap-southeast-2):

  • Typical throughput: 12–25 TPS (same rate limits, but 150–200ms added latency each way)
  • Latency p50: 2,000ms, p95: 3,800ms, p99: 6,500ms
  • Cost: Same as North America
  • Bottleneck: Network latency dominates; a 2,000ms p50 latency is mostly network round-trip, not model inference

If you’re based in Sydney or Melbourne and using OpenAI, you’re paying a latency tax. For latency-sensitive applications (e.g., real-time customer support), this can be a deal-breaker.

Claude 3 Opus (via Anthropic API, routed from ap-southeast-2):

  • Typical throughput: 10–20 TPS
  • Latency p50: 2,300ms, p95: 4,200ms, p99: 7,100ms
  • Cost: Same as North America
  • Bottleneck: Anthropic has no regional endpoint in Asia-Pacific; all requests route through the US

Gemini Pro (via Vertex AI, asia-southeast1 or australia-southeast1):

  • Typical throughput: 50–90 TPS (Vertex AI has dedicated regional endpoints)
  • Latency p50: 800ms, p95: 2,000ms, p99: 4,100ms
  • Cost: Same as North America
  • Bottleneck: Quota and regional capacity; Google Cloud’s Singapore and Sydney regions are well-provisioned

For teams building in Australia, Vertex AI in Sydney (Google Cloud region australia-southeast1) is the clear winner for throughput and latency. If you’re locked into OpenAI or Anthropic for other reasons, you’ll need to architect for higher latency (async processing, caching, fallbacks).

Self-Hosted Models (Llama 2/3 on AWS, GCP, or On-Premises)

If you’re deploying your own inference infrastructure, throughput depends entirely on your hardware and batching strategy. Here’s a rough framework:

Llama 2 70B (on a single A100 GPU, batch size 32):

  • Typical throughput: 200–400 tokens per second (raw token generation; divide by average output tokens to get request TPS)
  • Latency p50: 50–100ms (model inference only; add network latency)
  • Cost: ~$1.50/hour for A100 compute + network + storage
  • Bottleneck: GPU memory; at 16-bit precision a 70B model’s weights (~140GB) exceed a single A100’s 80GB, so this setup assumes quantized weights, and larger batches require more memory

Llama 3 70B (on 2x A100 GPUs with tensor parallelism, batch size 64):

  • Typical throughput: 400–600 tokens per second (token generation)
  • Latency p50: 30–60ms (model inference)
  • Cost: ~$3.00/hour for dual-A100 compute
  • Bottleneck: Inter-GPU communication; scaling beyond 2 GPUs requires careful optimization

Self-hosting gives you the highest raw throughput and lowest latency, but it requires significant operational overhead. You’re responsible for scaling, failover, updates, and security. For most startups, this trade-off isn’t worth it unless you’re processing >100K requests per day or have extreme latency requirements.
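Because the self-hosted figures above are token rates, they need converting before you can compare them with the API numbers. Here’s a minimal sketch of that conversion and the resulting compute cost per request. It assumes the node is kept fully busy and that output tokens dominate generation time. The function names are ours.

```python
def request_tps_from_token_rate(tokens_per_sec, avg_output_tokens):
    """Requests completed per second when generation speed is the only limit."""
    return tokens_per_sec / avg_output_tokens


def self_hosted_cost_per_request(hourly_cost_usd, request_tps):
    """Compute cost per request for a node running at full utilization."""
    return hourly_cost_usd / (request_tps * 3600)


# 400 tokens/sec on a $1.50/hour node, 200 output tokens per request
tps = request_tps_from_token_rate(400, avg_output_tokens=200)
print(tps)                                                  # 2.0
print(round(self_hosted_cost_per_request(1.50, tps), 6))    # 0.000208
```

At lower utilization the per-request cost rises proportionally, since you pay for the GPU whether or not it is serving traffic.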


Building Your Own Repeatable Benchmarking Pipeline

Step 1: Version Control Your Workload and Script

Store your benchmark workload config and test script in a Git repository. Include:

  • workload.yaml: Your test workload definition (request distribution, latency SLA, etc.)
  • benchmark.py: Your load-testing harness
  • requirements.txt: Python dependencies (openai, anthropic, google-cloud-vertexai, etc.)
  • README.md: Instructions for running the benchmark
  • results/: A directory for storing benchmark results (one CSV per run)

Example repository structure:

benchmark-repo/
├── workload.yaml
├── benchmark.py
├── requirements.txt
├── README.md
└── results/
    ├── gpt4-turbo_us-east-1_2024-01-15.csv
    ├── claude3-opus_us-east-1_2024-01-15.csv
    └── gemini-pro_ap-southeast-2_2024-01-15.csv
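A small helper keeps the file naming consistent with the layout above. This is a sketch under our own assumptions: one metrics dict per run, written as a single-row CSV, with the column names taken from the dict keys.

```python
import csv
from datetime import date
from pathlib import Path


def result_path(results_dir, model, region, run_date):
    """Build results/<model>_<region>_<YYYY-MM-DD>.csv."""
    return Path(results_dir) / f"{model}_{region}_{run_date.isoformat()}.csv"


def write_result(results_dir, metrics, run_date=None):
    """Write one benchmark run's metrics dict as a single-row CSV and return its path."""
    run_date = run_date or date.today()
    path = result_path(results_dir, metrics["model"], metrics["region"], run_date)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(metrics))
        writer.writeheader()
        writer.writerow(metrics)
    return path


print(result_path("results", "gpt4-turbo", "us-east-1", date(2024, 1, 15)).name)
# gpt4-turbo_us-east-1_2024-01-15.csv
```

Keeping model, region, and date in the file name means a later aggregation step can find runs with a simple glob, and the same values inside the file let you pivot on them.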

Step 2: Automate Benchmark Runs with CI/CD

Set up a scheduled job (e.g., GitHub Actions, GitLab CI, or AWS CodePipeline) to run your benchmark suite on a monthly or quarterly basis, or whenever a new model releases.

Example GitHub Actions workflow:

name: Throughput Benchmark
on:
  schedule:
    - cron: '0 0 1 * *'  # First day of every month, 00:00 UTC
  workflow_dispatch:  # Allow manual trigger

jobs:
  benchmark:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # One failing model/region should not cancel the others
      matrix:
        model: [gpt4-turbo, claude3-opus, gemini-pro]
        region: [us-east-1, eu-west-1, ap-southeast-2]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      # Provider API keys must be exposed to this step (e.g., from repository secrets)
      - run: python benchmark.py --model ${{ matrix.model }} --region ${{ matrix.region }}
      - uses: actions/upload-artifact@v4
        with:
          # v4 artifact names must be unique per job, so include the matrix values
          name: benchmark-results-${{ matrix.model }}-${{ matrix.region }}
          path: results/

This ensures you’re always comparing apples-to-apples, using the same workload and script, on a predictable schedule.

Step 3: Aggregate and Visualize Results

After each benchmark run, aggregate the results into a central dashboard. Use a tool like Grafana, Tableau, or even a simple Python script to generate comparison charts.

Key visualizations:

  1. TPS by Model and Region: A heatmap showing throughput for each model × region combination
  2. Latency Percentiles Over Time: Track how latency evolves as models are updated
  3. Cost-Per-Transaction: Divide total cost by transactions completed; compare across models
  4. Success Rate and Error Breakdown: Track rate-limit hits, timeouts, and other failures

Example aggregation script:

import pandas as pd
import glob

# Load all benchmark results
results = []
for file in glob.glob('results/*.csv'):
    df = pd.read_csv(file)
    results.append(df)

# Combine and pivot
all_results = pd.concat(results)
pivot = all_results.pivot_table(
    values='tps',
    index='model',
    columns='region',
    aggfunc='mean'
)

print(pivot)
pivot.to_csv('tps_by_model_region.csv')

Interpreting Results: What the Numbers Mean

TPS vs. Latency: The Trade-off

Higher TPS doesn’t always mean better performance. A system that achieves 100 TPS with p99 latency of 10 seconds is worse than a system that achieves 50 TPS with p99 latency of 500ms.

When interpreting your results, focus on throughput at your target latency SLA. If your SLA is 2 seconds (p95), report the highest TPS you can sustain while p95 latency stays under 2 seconds. If you hit rate limits before latency breaches the SLA, the model is fast enough for your workload and the constraint is quota. If latency breaches the SLA before you hit rate limits, you need to optimize your inference pipeline (batching, caching, model selection).
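In practice that means running the benchmark at several load levels and reporting the best one that still meets the SLA. A minimal sketch, assuming each load level produces a dict with tps and latency_p95 keys. The numbers below are invented for illustration.

```python
def max_tps_within_sla(load_levels, sla_p95_s):
    """Return the load level with the highest TPS whose p95 latency meets the SLA."""
    passing = [lvl for lvl in load_levels if lvl["latency_p95"] <= sla_p95_s]
    if not passing:
        return None  # no tested load level meets the SLA
    return max(passing, key=lambda lvl: lvl["tps"])


# Hypothetical results from one model/region at increasing concurrency
levels = [
    {"concurrency": 10, "tps": 12.0, "latency_p95": 1.1},
    {"concurrency": 25, "tps": 24.0, "latency_p95": 1.7},
    {"concurrency": 50, "tps": 35.0, "latency_p95": 2.1},
    {"concurrency": 100, "tps": 41.0, "latency_p95": 4.6},
]
print(max_tps_within_sla(levels, sla_p95_s=2.0))
```

Here the headline number is 24 TPS at 25 concurrent requests, not the 41 TPS peak, because the higher load levels break the 2-second p95 target.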

Rate Limits as Throughput Constraints

API providers impose rate limits in two dimensions:

  1. Requests per minute (RPM): Hard cap on the number of API calls
  2. Tokens per minute (TPM): Hard cap on the total tokens processed

For most workloads, TPM is the binding constraint. If your average request is 500 input tokens + 200 output tokens, and your TPM limit is 90,000, you can process at most 90,000 / 700 = ~128 requests per minute, or ~2 TPS.
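The same calculation as a reusable function, so you can check which limit binds for any workload. This is a sketch: the function name is ours, and the 3,500 RPM figure is borrowed from the GPT-4 Turbo example earlier in this guide to show a case where TPM binds first.

```python
def rate_limit_ceiling_tps(rpm_limit, tpm_limit, tokens_per_request):
    """Maximum sustainable requests/sec under both limits, and which limit binds."""
    by_requests = rpm_limit / 60
    by_tokens = tpm_limit / tokens_per_request / 60
    binding = "RPM" if by_requests < by_tokens else "TPM"
    return min(by_requests, by_tokens), binding


# 500 input + 200 output tokens per request, 90,000 TPM, 3,500 RPM
tps, binding = rate_limit_ceiling_tps(3500, 90_000, tokens_per_request=700)
print(round(tps, 2), binding)  # 2.14 TPM
```

If the ceiling this returns is below the TPS your benchmark measured, the benchmark was probably run on a different rate-limit tier than the one you’re planning for.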

When you benchmark, record both RPM and TPM utilization. If you’re hitting the TPM limit before your latency SLA, you need to either:

  1. Request higher rate limits from the provider
  2. Reduce tokens per request (shorter prompts, lower max_tokens) or route simple tasks to a smaller model (e.g., use Claude 3 Haiku instead of Opus)
  3. Implement request batching or async processing to smooth out spikes
  4. Cache responses to avoid re-processing identical requests

Regional Latency Variance

Latency variance between regions is often larger than variance between models. A request that takes 800ms in us-east-1 might take 2,000ms in ap-southeast-2, simply because of network distance.

If you’re serving a global audience, consider:

  1. Geo-routing: Route requests to the nearest model endpoint (e.g., users in Sydney → Vertex AI in australia-southeast1, users in New York → a US-hosted endpoint such as the OpenAI API)
  2. Async processing: Accept that some requests will take longer; process them asynchronously and notify users when results are ready
  3. Model caching: Cache responses for common queries; serve from cache instead of calling the model
  4. Fallback models: If your primary model is slow in a region, fall back to a faster alternative (e.g., Claude 3 Haiku if Claude 3 Opus is slow)
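The fallback option in the list above can be sketched with a timeout around the primary call. This is an illustration rather than production code: the two stub coroutines stand in for real model clients, and the timeout value is something you’d derive from your own latency SLA.

```python
import asyncio


async def call_with_fallback(primary, fallback, prompt, timeout_s):
    """Try the primary model; if it exceeds timeout_s, cancel it and use the fallback."""
    try:
        return await asyncio.wait_for(primary(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        return await fallback(prompt)


# Stand-ins for real model calls (hypothetical)
async def slow_primary(prompt):
    await asyncio.sleep(0.2)  # simulates a slow cross-region call
    return f"primary: {prompt}"


async def fast_fallback(prompt):
    return f"fallback: {prompt}"


print(asyncio.run(call_with_fallback(slow_primary, fast_fallback, "hi", timeout_s=0.05)))
# fallback: hi
```

Note that the time spent waiting on the primary still counts against the user’s latency, so the timeout should be well under the SLA, not equal to it.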

Cost-Throughput Efficiency

Throughput alone doesn’t matter if you’re paying too much per transaction. Calculate cost per transaction for each model:

Cost per transaction = (Total API cost) / (Requests completed)

Or, more precisely:

Cost per transaction = ((Input tokens × input rate) + (Output tokens × output rate)) / Requests completed

For example:

  • GPT-4 Turbo: $0.01/1K input + $0.03/1K output
  • Average request: 500 input + 200 output tokens
  • Cost per request: (500 × $0.01/1000) + (200 × $0.03/1000) = $0.005 + $0.006 = $0.011
  • If you achieve 40 TPS, that’s 40 requests/sec × $0.011 = $0.44/sec = $1,584/hour

For comparison, Gemini Pro:

  • $0.0005/1K input + $0.0015/1K output
  • Same request: (500 × $0.0005/1000) + (200 × $0.0015/1000) = $0.00025 + $0.0003 = $0.00055
  • At 60 TPS: 60 × $0.00055 = $0.033/sec = $119/hour

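Per transaction, Gemini Pro is 20x cheaper ($0.00055 vs. $0.011). Per hour at the throughputs above, the gap is ~13x ($119 vs. $1,584), because the Gemini example runs at 60 TPS rather than 40. If you can tolerate slightly longer latency or lower throughput, the cost savings are substantial.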
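The arithmetic above is worth keeping as code next to your benchmark results, so a price change only needs one number updated. A minimal sketch using the rates and request size from the examples. The function names are ours.

```python
def cost_per_request(input_tokens, output_tokens, input_rate_per_1k, output_rate_per_1k):
    """API cost in USD for one request, given per-1K-token rates."""
    return (input_tokens / 1000 * input_rate_per_1k
            + output_tokens / 1000 * output_rate_per_1k)


def cost_per_hour(cost_per_req, tps):
    """Hourly spend when sustaining `tps` requests per second."""
    return cost_per_req * tps * 3600


gpt4_turbo = cost_per_request(500, 200, 0.01, 0.03)
gemini_pro = cost_per_request(500, 200, 0.0005, 0.0015)
print(round(gpt4_turbo, 5), round(cost_per_hour(gpt4_turbo, 40), 2))  # 0.011 1584.0
print(round(gemini_pro, 5), round(cost_per_hour(gemini_pro, 60), 2))  # 0.00055 118.8
```

Keep per-request and per-hour figures separate when comparing models, since the hourly number also depends on the throughput each one sustains.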


Regional Variations and Latency Trade-offs

Why Geography Matters

Network latency is often the dominant factor in end-to-end response time, especially for short requests. A model that generates 100 tokens takes ~1 second of compute time, but if it’s on the other side of the world, you’ve added 200–400ms of network latency on top.

For teams building in Australia, this is critical. If you’re using OpenAI or Anthropic APIs, you’re paying a latency tax every single request. For latency-sensitive applications (e.g., real-time customer support, live translation), this can be unacceptable.

PADISO’s platform development services in Sydney help teams architect for low-latency inference by choosing the right model, region, and infrastructure. Similarly, platform development in Melbourne and other Australian cities benefits from local deployment strategies.

Caching and Async Processing as Latency Mitigation

If you can’t reduce network latency, reduce the number of requests:

  1. Response caching: If 20% of your requests are identical or very similar, cache the response and serve from cache. For cache hits, this removes the model call and its network round trip entirely.
  2. Request batching: Instead of processing requests one-at-a-time, batch them and send to the model once per second. This increases throughput but increases latency for individual requests.
  3. Async processing: Accept that some requests will take 5–10 seconds; process them in the background and notify users when ready.
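Response caching is the easiest of these to prototype. The sketch below is an in-memory, exact-match cache keyed on the model name and a normalized prompt. It’s a simplification under our own assumptions: no expiry, no size limit, and “very similar” prompts only match if they differ in case or whitespace.

```python
import hashlib


class ResponseCache:
    """In-memory cache keyed on model name plus normalized prompt text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_model):
        """Return the cached response, or call the model once and cache the result."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_model(prompt)
        self._store[key] = response
        return response


cache = ResponseCache()
fake_model = lambda prompt: "Refunds are processed within 5 days."  # stand-in for an API call
cache.get_or_call("some-model", "What is your refund policy?", fake_model)
cache.get_or_call("some-model", "what is  your refund policy?", fake_model)
print(cache.hits, cache.misses)  # 1 1
```

Tracking hits and misses lets you measure the real cache hit rate against the 20% figure you assumed when sizing the system.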

Multi-Region Deployment Strategy

For production systems serving multiple regions, consider a multi-region strategy:

  1. Primary region: Deploy your main inference infrastructure (e.g., Vertex AI in us-central1 for North American users)
  2. Secondary regions: Deploy read-only caches or fallback models in other regions (e.g., Vertex AI in australia-southeast1 for Australian users)
  3. Failover: If the primary region is slow or unavailable, route requests to the secondary region

This requires more operational complexity, but it ensures consistent latency and throughput across regions. PADISO’s platform engineering teams in San Francisco, New York, and other major hubs have experience building multi-region systems.


Cost-Per-Transaction Analysis

Calculating True Cost

When comparing models, don’t just look at the per-token price. Calculate the total cost per transaction, including:

  1. API costs: Input and output tokens
  2. Infrastructure costs: Compute, storage, networking (if self-hosted)
  3. Overhead: Retries, fallbacks, caching infrastructure

For API-based models, the formula is:

Cost per transaction = 
  ((Avg input tokens × Input rate) + 
   (Avg output tokens × Output rate)) × 
  Overhead factor

The overhead factor is a multiplier on the base token cost. It accounts for retries (if 2% of requests fail and are retried, the factor is 1.02), fallbacks (if you fall back to a cheaper model 5% of the time, adjust accordingly), and caching infrastructure.

Throughput-Cost Efficiency Matrix

Create a matrix comparing models across cost and throughput:

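| Model | Region | TPS | Cost/Transaction | Cost/Hour (at 100% utilization) | Latency p95 |
| --- | --- | --- | --- | --- | --- |
| GPT-4 Turbo | us-east-1 | 35 | $0.011 | $1,386 | 2.1s |
| Claude 3 Opus | us-east-1 | 45 | $0.025 | $4,050 | 2.8s |
| Gemini Pro | us-east-1 | 70 | $0.00055 | $139 | 1.5s |
| GPT-4 Turbo | ap-southeast-2 | 18 | $0.011 | $713 | 3.8s |
| Gemini Pro | ap-southeast-2 | 65 | $0.00055 | $129 | 2.0s |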

From this matrix, you can see that Gemini Pro is about 45x cheaper per transaction than Claude 3 Opus ($0.00055 vs. $0.025), and has better latency than GPT-4 Turbo in ap-southeast-2 (2.0s vs. 3.8s at p95). If your use case doesn’t require Claude’s specific strengths (long context, safety), Gemini is the clear winner.

Negotiating Better Rates

If you’re processing >1M requests per month, you have leverage to negotiate better rates. Most API providers offer:

  1. Volume discounts: 10–20% off list price for high volume
  2. Committed spend: 15–25% off for committing to a minimum monthly spend
  3. Custom pricing: For very high volume (>100M requests/month), custom pricing is possible

Before you negotiate, run your benchmarks and know exactly how much throughput you need. “We need 50 TPS” is a much stronger negotiating position than “we’re growing fast and might need more.”


Automation and Continuous Benchmarking

Why Continuous Benchmarking Matters

Model providers update their models frequently. GPT-4 Turbo launched in late 2023 and was revised again in 2024. Claude 3 was released in March 2024, with improved versions rolling out throughout the year. Gemini Pro gets updates every few months.

Each update can change throughput, latency, and cost characteristics. A model that was your best choice 6 months ago might be suboptimal today. Continuous benchmarking ensures you’re always using the best model for your workload.

Setting Up Automated Benchmarks

Use your CI/CD pipeline to run benchmarks on a schedule:

  1. Monthly: Run full benchmarks on all models and regions
  2. On model release: Run benchmarks immediately when a new model version is released
  3. On-demand: Allow manual triggering for ad-hoc testing

Store results in a time-series database (e.g., InfluxDB, Prometheus) or a simple CSV archive. Set up alerts if throughput drops or latency spikes beyond expected ranges.
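The alerting rule can start as a simple comparison against your baseline. A minimal sketch follows. The 20% thresholds are placeholder assumptions to tune against your own run-to-run variance, and the metric keys match the ones used in the benchmark harness.

```python
def detect_drift(baseline, current, max_tps_drop_pct=20.0, max_latency_rise_pct=20.0):
    """Compare one benchmark run against the baseline and return alert messages."""
    alerts = []
    tps_change = (current["tps"] - baseline["tps"]) / baseline["tps"] * 100
    if tps_change <= -max_tps_drop_pct:
        alerts.append(f"tps changed {tps_change:.1f}% vs. baseline")
    p95_change = ((current["latency_p95"] - baseline["latency_p95"])
                  / baseline["latency_p95"] * 100)
    if p95_change >= max_latency_rise_pct:
        alerts.append(f"latency_p95 changed +{p95_change:.1f}% vs. baseline")
    return alerts


baseline = {"tps": 35.0, "latency_p95": 2.1}
current = {"tps": 27.0, "latency_p95": 2.2}   # illustrative run
print(detect_drift(baseline, current))  # ['tps changed -22.9% vs. baseline']
```

An empty list means the run is within tolerance. Anything else is a prompt to re-run the benchmark before drawing conclusions, since a single noisy run can cross the threshold.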

Interpreting Benchmark Drift

When you see throughput or latency change between benchmark runs, investigate the root cause:

  1. Model update: The provider released a new version; throughput/latency might improve or degrade
  2. Rate limit changes: The provider adjusted rate limits for your account tier
  3. Regional capacity: The provider increased or decreased capacity in a region
  4. Your workload changes: You updated your test workload; results aren’t directly comparable
  5. Measurement noise: Variance in shared infrastructure; run the benchmark again to confirm

Document the reason for each significant change in your results archive. This creates a historical record that helps you understand long-term trends.


Next Steps: From Benchmark to Production

Step 1: Document Your Baseline

Run your benchmark suite once and document the results as your baseline. Include:

  • Date of benchmark run
  • Benchmark script version
  • Workload config version
  • Results for each model × region combination
  • Any notes on anomalies or issues

Example baseline document:

# Throughput Benchmark Baseline (2024-01-15)

## Workload
- Request distribution: 200 input tokens (p50), 400 input tokens (p95)
- Output target: 150 tokens (p50), 250 tokens (p95)
- Latency SLA: 2 seconds (p95)
- Concurrent users: 50

## Results

### GPT-4 Turbo
- us-east-1: 35 TPS, latency p95 2.1s, cost $0.011/txn
- eu-west-1: 28 TPS, latency p95 2.6s, cost $0.011/txn
- ap-southeast-2: 18 TPS, latency p95 3.8s, cost $0.011/txn

### Claude 3 Opus
- us-east-1: 45 TPS, latency p95 2.8s, cost $0.025/txn
- eu-west-1: 38 TPS, latency p95 3.2s, cost $0.025/txn
- ap-southeast-2: 16 TPS, latency p95 4.2s, cost $0.025/txn

### Gemini Pro
- us-east-1: 70 TPS, latency p95 1.5s, cost $0.00055/txn
- eu-west-1: 62 TPS, latency p95 1.8s, cost $0.00055/txn
- ap-southeast-2: 65 TPS, latency p95 2.0s, cost $0.00055/txn

## Recommendation
For Australian users, Gemini Pro in ap-southeast-2 offers the best throughput and latency. Cost is 45x lower than Claude 3 Opus.

Step 2: Build Your Model Selection Framework

Use your benchmarks to build a decision framework for model selection. For example:

IF latency_sla < 1 second AND region == "ap-southeast-2"
  THEN use Gemini Pro (ap-southeast-2)
ELSE IF latency_sla < 2 seconds
  THEN use GPT-4 Turbo
ELSE IF latency_sla < 5 seconds AND budget < $1000/month
  THEN use Gemini Pro
ELSE
  use Claude 3 Opus

This framework should be reviewed and updated every quarter as new models release and your workload evolves.
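Hand-written rules like these go stale as soon as the numbers change. An alternative is to derive the choice from the benchmark results directly. The sketch below picks the cheapest model in a region that meets the latency SLA and a minimum TPS. The selection policy is our assumption, and the rows reuse the example baseline figures from Step 1.

```python
# Rows taken from the example baseline document
BASELINE = [
    {"model": "GPT-4 Turbo", "region": "us-east-1", "tps": 35, "latency_p95": 2.1, "cost_per_txn": 0.011},
    {"model": "Claude 3 Opus", "region": "us-east-1", "tps": 45, "latency_p95": 2.8, "cost_per_txn": 0.025},
    {"model": "Gemini Pro", "region": "us-east-1", "tps": 70, "latency_p95": 1.5, "cost_per_txn": 0.00055},
    {"model": "GPT-4 Turbo", "region": "ap-southeast-2", "tps": 18, "latency_p95": 3.8, "cost_per_txn": 0.011},
    {"model": "Claude 3 Opus", "region": "ap-southeast-2", "tps": 16, "latency_p95": 4.2, "cost_per_txn": 0.025},
    {"model": "Gemini Pro", "region": "ap-southeast-2", "tps": 65, "latency_p95": 2.0, "cost_per_txn": 0.00055},
]


def select_model(candidates, region, latency_sla_s, min_tps=0.0):
    """Cheapest benchmarked model in `region` that meets the latency SLA and TPS floor."""
    eligible = [
        c for c in candidates
        if c["region"] == region
        and c["latency_p95"] <= latency_sla_s
        and c["tps"] >= min_tps
    ]
    if not eligible:
        return None  # nothing benchmarked meets the requirements
    return min(eligible, key=lambda c: c["cost_per_txn"])


choice = select_model(BASELINE, "ap-southeast-2", latency_sla_s=2.0)
print(choice["model"])  # Gemini Pro
```

A None result is useful information too: it means no benchmarked model meets the SLA in that region, so you need to relax the target or mitigate latency with caching or async processing. Cost is only one axis, so you’d still filter the candidate list by output quality before running this.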

Step 3: Implement Monitoring in Production

Once you’ve selected a model, instrument your production system to track:

  1. Actual throughput: Requests per second in production (vs. benchmark)
  2. Actual latency: p50, p95, p99 latencies in production
  3. Error rates: Rate-limit hits, timeouts, API errors
  4. Cost: Actual spend vs. projected spend

Set up alerts if actual metrics deviate from your benchmark projections. If production throughput is 20% lower than your benchmark, investigate why (different workload, higher concurrency, API changes, etc.).

For teams building at scale, PADISO’s AI & Agents Automation services can help you implement production-grade monitoring and optimization. We’ve helped companies across financial services, retail, and media reduce inference costs by 30–50% through careful model selection and optimization.

Step 4: Plan for Model Migrations

As new models release, you’ll want to migrate some or all of your traffic. Plan for this by:

  1. Running benchmarks on the new model (using your repeatable framework)
  2. Comparing cost, throughput, and latency vs. your current model
  3. Running a canary test: Route 5–10% of production traffic to the new model and monitor for issues
  4. Gradual rollout: Increase traffic to the new model over 1–2 weeks
  5. Rollback plan: Have a quick rollback if the new model has issues

This disciplined approach ensures you’re always using the best model for your workload, while minimizing risk.
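For the canary step, a deterministic split keeps each user or request on the same side of the experiment across retries. Here’s a minimal sketch using a hash of a request or user ID. The function name and ID format are hypothetical.

```python
import hashlib


def use_canary(request_id, canary_percent):
    """Deterministically send roughly canary_percent% of IDs to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


routed = sum(use_canary(f"req-{i}", 10) for i in range(10_000))
print(routed / 10_000)  # close to 0.10
```

Raising canary_percent over the rollout period only ever adds IDs to the canary group, since an ID’s bucket never changes. That keeps the gradual rollout and the rollback path simple.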

Step 5: Revisit Your Benchmarks Quarterly

Set a calendar reminder to re-run your full benchmark suite every quarter. Update your model selection framework based on new results, and document any significant changes.

Over the next 2–3 years, the frontier model landscape will change dramatically. New models will release, existing models will improve, and pricing will shift. Teams that benchmark continuously will adapt faster and maintain competitive advantage.


Conclusion

Throughput benchmarking is not a one-time exercise—it’s a repeatable process that should be part of your engineering discipline. By building a standardized framework, automating benchmark runs, and reviewing results quarterly, you ensure that your AI infrastructure is always optimized for cost, latency, and throughput.

The framework we’ve outlined here is designed to be reusable. Your benchmark script and workload config should be versioned, documented, and shared across your engineering team. Every time a new frontier model releases—whether from OpenAI, Anthropic, Google, or a new competitor—you should be able to run the same benchmark and get comparable results.

For teams building in Australia or Asia-Pacific, regional benchmarking is especially important. The latency difference between a US-based API and a regional endpoint can be 2–3x. By benchmarking locally, you can make infrastructure decisions that serve your users well without unnecessary latency tax.

If you’re building production AI systems and want help with benchmarking, optimization, or multi-region deployment, PADISO’s services include platform engineering and AI strategy work across North America and Asia-Pacific. We’ve helped founders and operators at seed-stage startups through Series-B companies optimize their inference infrastructure, and we can help you too. Check out our case studies to see real examples of AI infrastructure optimization.

Start with your baseline benchmark this week. Automate the process next week. Review and update quarterly. By 2027, you’ll have a rich historical record of how frontier models have evolved, and you’ll be able to make infrastructure decisions with confidence.
