
Multi-Cloud Model Access: AWS Bedrock vs GCP Vertex vs Azure OpenAI

Compare AWS Bedrock, GCP Vertex AI, and Azure OpenAI for multi-cloud model access. Framework for choosing, integrating, and scaling across cloud providers.

The PADISO Team ·2026-06-02

Why Multi-Cloud Model Access Matters
The Three Platforms at a Glance
AWS Bedrock: The Managed Foundation Model Play
GCP Vertex AI: The ML-First Approach
Azure OpenAI: The Enterprise Integration Path
Decision Framework: How to Choose
Building a Multi-Cloud Model Access Strategy
Implementation Patterns and Code Examples
Cost Optimisation Across Clouds
Security, Compliance, and Governance
The Repeatable Framework for Model Releases Through 2027
Next Steps

Why Multi-Cloud Model Access Matters

If you’re shipping AI products or automating workflows at scale, you’ve already hit the hard truth: no single cloud provider owns the entire model landscape. OpenAI’s GPT-4 runs on Azure. Anthropic’s Claude lives on AWS and GCP. Meta’s Llama 2 is everywhere. Google’s Gemini is native to Vertex. And the model release cycle keeps accelerating—new models, new capabilities, new trade-offs on cost, latency, and accuracy drop every 8–12 weeks.

This isn’t a theoretical problem. Teams we work with at PADISO see it daily: a founder wants Claude for reasoning but needs GPT-4 for vision tasks. An operator needs Llama 2 for cost control but Gemini Pro for multimodal work. A PE portfolio company is locked into Azure for compliance but wants Bedrock for faster model iteration.

The old answer was “pick one cloud and live with it.” That doesn’t work anymore. The new answer is a repeatable, testable framework for model access that lets you:

Ship faster: Test models in parallel, not sequentially.
Cost less: Route to the cheapest inference path for each workload.
Stay flexible: Switch models or clouds without rewriting your application logic.
Pass audits: Maintain security and compliance posture across multiple clouds.

This guide builds that framework. It’s designed for engineering teams, fractional CTOs, and founders who need to make this decision right now and then re-run it every time a new model drops between now and 2027.

The Three Platforms at a Glance

What Each Platform Does

Amazon Bedrock is AWS’s managed service for accessing foundation models from multiple providers—Claude, Llama, Mistral, Cohere, and others—without managing your own infrastructure. You call a single API, AWS handles the model hosting, and you pay per token.

Vertex AI is Google Cloud’s unified ML platform. It includes access to Google’s own models (Gemini, PaLM, Codey), third-party models, and a full MLOps stack for fine-tuning, evaluation, and deployment. It’s less about “managed access” and more about “end-to-end ML workflows.”

Azure OpenAI Service is Microsoft’s direct line to OpenAI’s models—GPT-4, GPT-4 Turbo, GPT-3.5, Embeddings—deployed on Azure infrastructure with enterprise controls, private endpoints, and compliance features built in.

Quick Comparison Table

Dimension	AWS Bedrock	GCP Vertex AI	Azure OpenAI
Primary Models	Claude, Llama, Mistral, Cohere, Stable Diffusion	Gemini, PaLM, Codey, third-party	GPT-4, GPT-4 Turbo, GPT-3.5
Model Variety	High (multi-vendor)	High (Google + partners)	Medium (OpenAI-focused)
Ease of Use	Simple API, minimal setup	Requires Vertex project, more config	Simple API, Azure ecosystem tied-in
Fine-tuning	Limited (model-dependent)	Full MLOps suite	Custom fine-tuning available
Pricing Model	Pay-per-token (on-demand)	Pay-per-token + compute	Pay-per-token (quota-based)
Best For	Model diversity, cost optimisation, avoiding vendor lock-in	ML teams, custom training, Google-native workflows	Enterprise Azure shops where OpenAI lock-in is acceptable
Compliance	SOC 2, ISO 27001, FedRAMP	SOC 2, ISO 27001, HIPAA	SOC 2, ISO 27001, FedRAMP, HIPAA

AWS Bedrock: The Managed Foundation Model Play

What Bedrock Solves

Bedrock is AWS’s answer to “I want access to multiple foundation models without running my own infrastructure.” You don’t deploy models; AWS does. You call the Bedrock API, specify your model, send your prompt, and get your response. AWS handles scaling, availability, and updates.

The key insight: Bedrock is not a model. It’s a gateway. And that gateway includes:

Model access: Claude (Anthropic), Llama 2 / Llama 3 (Meta), Mistral (Mistral AI), Cohere Command, Stable Diffusion (Stability AI).
Provisioned throughput: Reserve capacity at a discount if you have predictable workloads.
Knowledge bases: Upload your own documents, and Bedrock handles retrieval-augmented generation (RAG) without you managing a vector database.
Agents: Built-in orchestration for multi-step workflows—ask a question, Bedrock routes to tools, calls APIs, and returns results.
Fine-tuning: Available for some models (Claude, Llama 2) but not all.

When to Use Bedrock

Use Bedrock if:

You want access to multiple models from different vendors without managing separate accounts or APIs.
You’re already on AWS and want minimal lift to add AI to your stack.
You need fast iteration on model choice—test Claude, then Llama, then Mistral without rewriting code.
You’re building RAG pipelines and want AWS to handle vector storage and retrieval.
You have variable workloads and want to avoid the overhead of provisioned capacity.

Avoid Bedrock if:

You’re Google-first and want Gemini deeply integrated with your data and ML pipelines.
You’re locked into Azure and need GPT-4 with private endpoints and VNET integration.
You need heavy fine-tuning and custom training—Bedrock’s fine-tuning is model-dependent and limited.
You’re building a multi-cloud strategy and want each cloud’s native models (not third-party models hosted on AWS).

Bedrock Pricing Deep Dive

Bedrock charges per token (input and output). Prices vary by model (approximate, as of mid-2026—check Bedrock for current rates):

Claude Opus: ~$0.005 per 1K input tokens, ~$0.025 per 1K output tokens.
Claude Sonnet: ~$0.003 per 1K input tokens, ~$0.015 per 1K output tokens.
Llama 70B: ~$0.001 per 1K input tokens, ~$0.002 per 1K output tokens.
Mistral 7B: ~$0.00015 per 1K input tokens, ~$0.0006 per 1K output tokens.

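At these rates, 1 million tokens on Claude Sonnet costs only about $3 (all input) to $15 (all output). A production RAG pipeline processing around 1 billion tokens/month, mostly input, would spend ~$3–5K/month on inference alone. Provisioned throughput can cut this by 40% if you commit to a baseline.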
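
To make the arithmetic explicit, here is a minimal sketch of how per-1K-token prices translate into a monthly bill. The function name and the 900M/100M input/output split are assumptions chosen for illustration; the rates are the approximate Claude Sonnet figures quoted above.

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the monthly inference bill in dollars for the given token volumes."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost

# Approximate Claude Sonnet rates: ~$0.003 in, ~$0.015 out per 1K tokens.
# The 900M input / 100M output split is a hypothetical RAG-style mix.
sonnet_bill = monthly_cost(900_000_000, 100_000_000, 0.003, 0.015)
print(f"${sonnet_bill:,.0f}")  # prints $4,200
```

Swapping in another model's rates gives a like-for-like comparison, provided the input/output mix is held constant.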

Access Amazon Bedrock Documentation for the latest pricing and model availability.

GCP Vertex AI: The ML-First Approach

What Vertex Solves

Vertex AI is not just a model access layer. It’s Google’s full ML platform: model hosting, fine-tuning, evaluation, monitoring, and deployment. If you’re building end-to-end ML workflows, Vertex is designed for that. If you just want to call a model, it’s overkill—but it’s there if you need it.

Vertex includes:

Generative AI models: Gemini (Google’s flagship), PaLM 2, Codey (code generation), Embeddings API.
Third-party models: Llama 2, Mistral, and others via partnerships.
Fine-tuning: Full support for custom training on your data.
Evaluation: Built-in tools to test model performance, safety, and bias.
MLOps: Pipelines, monitoring, and governance.
Endpoints: Deploy models to managed endpoints with auto-scaling.

When to Use Vertex AI

Use Vertex if:

You’re a data/ML-first team building custom models or heavy fine-tuning pipelines.
You’re already on GCP and want native integration with BigQuery, Dataflow, and other Google services.
You want Gemini (Google’s latest, most capable model) with tight integration to your data.
You need evaluation and monitoring built into your model development workflow.
You’re building a long-term ML platform, not just calling APIs.

Avoid Vertex if:

You’re not an ML team and just want to call a model API.
You need Claude or other non-Google models as your primary choice.
You’re on AWS or Azure and adding GCP just for models adds operational complexity.
You want simplicity—Vertex requires understanding Google Cloud projects, quotas, and regional availability.

Vertex AI Pricing Deep Dive

Vertex charges per token for model calls, but the structure is more complex because it bundles MLOps features:

Gemini Pro: ~$0.00025 per 1K input tokens, ~$0.0005 per 1K output tokens.
Gemini Pro Vision: ~$0.0025 per image, ~$0.00025 per 1K input tokens, ~$0.0005 per 1K output tokens.
PaLM 2: ~$0.0005 per 1K input tokens, ~$0.001 per 1K output tokens.
Fine-tuning: Charged per training hour (varies by model size).

For a comparable workload of about 1 billion tokens/month using Gemini Pro, you’d spend ~$250–500/month, significantly cheaper than Claude Sonnet on Bedrock at the rates listed earlier. But add fine-tuning, evaluation, and endpoint hosting, and costs climb.

For detailed pricing, visit Vertex AI.

Azure OpenAI: The Enterprise Integration Path

What Azure OpenAI Solves

Azure OpenAI is Microsoft’s direct partnership with OpenAI. You get OpenAI’s models (GPT-4, GPT-4 Turbo, GPT-3.5) deployed on Azure infrastructure with enterprise controls: private endpoints, managed identity, VNET integration, and compliance certifications.

The key insight: Azure OpenAI is not a multi-vendor platform. It’s a single-vendor (OpenAI) service with enterprise Azure features. If you want OpenAI’s models with Azure’s security and governance, this is the path. If you want model diversity, it’s not.

Azure OpenAI includes:

Model access: GPT-4, GPT-4 Turbo, GPT-3.5, text embeddings, DALL-E 3.
Provisioned throughput: Reserve capacity for predictable workloads at a discount.
Content filtering: Built-in safety and compliance controls.
Azure integration: Managed identity, private endpoints, VNET, Key Vault, Azure Monitor.
Quota management: Fine-grained control over token limits and rate limiting.

When to Use Azure OpenAI

Use Azure OpenAI if:

You’re an enterprise on Azure and need GPT-4 with private endpoints, VNET integration, and compliance controls.
You’re locked into OpenAI’s models (GPT-4 is your best choice for your use case) and want to avoid the OpenAI API’s public endpoints.
You need SOC 2, ISO 27001, FedRAMP, or HIPAA compliance—Azure OpenAI is certified for all of these.
You’re building on Azure AI Studio and want seamless integration with other Azure services.
You have existing Azure AD, Key Vault, and governance investments.

Avoid Azure OpenAI if:

You want model diversity and flexibility to switch between Claude, Llama, Mistral, and GPT-4.
You’re not on Azure and don’t want to add a second cloud just for models.
You want to avoid OpenAI lock-in and prefer open-source or alternative models.
You’re cost-optimising and need the cheapest inference path (Llama 2 on Bedrock is cheaper than GPT-4 on Azure).

Azure OpenAI Pricing Deep Dive

Azure OpenAI uses a pay-per-token model. List prices are broadly in line with OpenAI’s public API; what you take on in addition is the surrounding Azure infrastructure, such as private networking and provisioned capacity. Approximate rates:

GPT-4: ~$0.03 per 1K input tokens, ~$0.06 per 1K output tokens.
GPT-4 Turbo: ~$0.01 per 1K input tokens, ~$0.03 per 1K output tokens.
GPT-3.5 Turbo: ~$0.0005 per 1K input tokens, ~$0.0015 per 1K output tokens.
Provisioned throughput: 40% discount if you commit to baseline capacity.

For about 1 billion tokens/month using GPT-4 Turbo, you’d spend ~$10–15K/month on an input-heavy workload (1 million tokens costs only ~$10–30 at these rates). With provisioned throughput, ~$6–9K/month.

For details, see Azure OpenAI Service Documentation.

Decision Framework: How to Choose

Choosing between Bedrock, Vertex, and Azure OpenAI isn’t about which is “best.” It’s about which solves your problem with the least operational overhead and the lowest total cost of ownership.

Use this decision tree:

Step 1: Are You Locked Into a Cloud?

If you’re on AWS: Bedrock is your default. You get model diversity, simple API, and no cross-cloud networking overhead. Cost is higher than Vertex but lower than Azure OpenAI for most models. Start here unless you have a specific reason not to.

If you’re on GCP: Vertex AI is your default. You get Gemini (Google’s best model), full MLOps, and native integration. Cost is lowest for Gemini workloads. Start here unless you need Claude or other non-Google models.

If you’re on Azure: Azure OpenAI is your default if you need GPT-4 with private endpoints and VNET integration. If you need model diversity, you’ll need to layer in Bedrock or Vertex, which adds operational complexity. Most Azure shops choose to stay in Azure and accept OpenAI lock-in.

Step 2: What Models Do You Actually Need?

If your primary model is Claude, Bedrock is your best path. If it’s Gemini, Vertex. If it’s GPT-4 with enterprise controls, Azure OpenAI.

If you need multiple models in production (e.g., Claude for reasoning, Llama 2 for cost, Gemini for multimodal), you’re building a multi-cloud strategy. See “Building a Multi-Cloud Model Access Strategy” below.

Step 3: What’s Your Compliance Profile?

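All three platforms support SOC 2 and ISO 27001. Per the comparison table above, Bedrock and Azure OpenAI also list FedRAMP, and Vertex AI and Azure OpenAI list HIPAA, so Azure OpenAI covers the widest set. If you’re federal or highly regulated, Azure OpenAI has the strongest compliance story, but verify each provider’s current attestations for your specific service and region before relying on them.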

For SOC 2 and ISO 27001 audits, all three are audit-ready. The NIST AI Risk Management Framework applies equally. The difference is in implementation: Azure OpenAI’s private endpoints and VNET integration make compliance easier to demonstrate; Bedrock and Vertex require more custom security controls.

Step 4: What’s Your Cost Ceiling?

For the same workload:

Cheapest: Gemini Pro on Vertex (~$250–500/month for 1B tokens) or Llama on Bedrock (~$1–2K/month for 1B tokens).
Mid-range: Claude Sonnet on Bedrock (~$3–5K/month for 1B tokens).
Most expensive: GPT-4 Turbo on Azure OpenAI (~$10–15K/month for 1B tokens), with GPT-4 higher still.

If cost is your constraint, Vertex (Gemini) or Bedrock (Llama 2) wins. If you need GPT-4, Azure OpenAI is your only option among these three platforms.

Step 5: Do You Need Fine-Tuning or Custom Training?

If yes, Vertex AI is your best option. It has the most mature fine-tuning stack and evaluation tools. Bedrock supports fine-tuning for some models (Claude, Llama 2) but with less flexibility. Azure OpenAI has fine-tuning but it’s more limited.

Step 6: What’s Your Operational Capacity?

Bedrock is simplest—call an API, get a response. Vertex requires understanding Google Cloud projects, quotas, and regional availability. Azure OpenAI requires Azure expertise (managed identity, Key Vault, VNET).

If you’re a small team with no cloud expertise, Bedrock is easiest. If you’re a data team on GCP, Vertex is natural. If you’re an enterprise Azure shop, Azure OpenAI fits your existing skills.
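
The decision tree above can be sketched as a small function for a first-pass recommendation. This is only an illustration of Steps 1, 2 and 5: the function name, the return labels, and the choice of Bedrock as the fallback for an unknown cloud are assumptions, and compliance, cost and operational capacity still need a human review.

```python
def recommend_platform(current_cloud: str, primary_model: str = "",
                       needs_fine_tuning: bool = False) -> str:
    """Return a first-pass platform recommendation from the decision tree."""
    model = primary_model.lower()
    # Step 2: a specific primary model overrides the cloud default.
    if "claude" in model:
        return "bedrock"
    if "gemini" in model:
        return "vertex"
    if "gpt" in model:
        return "azure-openai"
    # Step 5: heavy fine-tuning or custom training points to Vertex AI.
    if needs_fine_tuning:
        return "vertex"
    # Step 1: otherwise default to the cloud you already run on.
    defaults = {"aws": "bedrock", "gcp": "vertex", "azure": "azure-openai"}
    # Assumption: with no existing cloud, start with the simplest API (Bedrock).
    return defaults.get(current_cloud.lower(), "bedrock")

print(recommend_platform("azure", primary_model="claude-3-sonnet"))  # prints bedrock
```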

Building a Multi-Cloud Model Access Strategy

If you’ve decided that no single platform covers your needs, you’re building a multi-cloud strategy. This is increasingly common for founders and operators who need flexibility, cost optimisation, and model diversity.

Here’s how to structure it:

Principle 1: Abstraction Layer

Don’t call Bedrock, Vertex, and Azure OpenAI APIs directly from your application code. Build an abstraction layer—a thin wrapper that handles model routing, fallback, and monitoring.

Why? Because when Claude 3.5 launches and you want to test it against GPT-4 Turbo, you change your router config, not your application code. When a model gets deprecated or pricing changes, you update one place, not 50.

Your abstraction layer should:

Accept a model name (“claude-3-sonnet”, “gpt-4-turbo”, “gemini-pro”).
Route to the correct cloud (Bedrock, Vertex, or Azure OpenAI).
Handle authentication (AWS credentials, GCP service account, Azure managed identity).
Implement fallback (if Claude is down, try Gemini).
Log and monitor (track which model was used, latency, cost, errors).
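
As a rough sketch of the fallback item in the list above, the helper below tries models in priority order and returns the first successful response. The name `call_with_fallback` and the `fake_call` stand-in are hypothetical; in practice `call_fn` would be whatever function your abstraction layer exposes for a single model call.

```python
from typing import Any, Callable, Dict, List

def call_with_fallback(call_fn: Callable[[str, str], Dict[str, Any]],
                       models: List[str], prompt: str) -> Dict[str, Any]:
    """Try each model in order; return the first successful response."""
    errors = []
    for model in models:
        try:
            result = call_fn(model, prompt)
            # Record which model actually served the request.
            return {"model": model, **result}
        except Exception as exc:  # broad on purpose: any provider error triggers fallback
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))

# Hypothetical stand-in for a real router: the first model always fails.
def fake_call(model: str, prompt: str) -> Dict[str, Any]:
    if model == "claude-3-sonnet":
        raise ConnectionError("simulated outage")
    return {"text": f"{model} answered: {prompt}"}

result = call_with_fallback(fake_call, ["claude-3-sonnet", "gemini-pro"], "ping")
print(result["model"])  # prints gemini-pro
```

In production you would likely narrow the caught exceptions to timeouts and throttling errors, so that bad requests fail fast instead of being retried on every provider.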

Principle 2: Cost Router

Once you have multiple models in production, route based on cost and capability. Not every request needs GPT-4. Some can use Llama 2 at roughly 1/30th the cost, going by the per-token prices listed earlier.

Example: For a customer support chatbot:

Simple FAQ queries: Route to Llama 2 on Bedrock (~$0.001 per 1K tokens).
Complex reasoning: Route to Claude on Bedrock (~$0.003 per 1K tokens).
Escalated issues: Route to GPT-4 on Azure OpenAI (~$0.03 per 1K tokens).

This can cut your inference bill by 60–70% without sacrificing quality.

Principle 3: Model Release Testing

Every time a new model drops (Claude 3.5, GPT-5, Gemini 2.0, Llama 4), you need to test it against your existing models in production. Your abstraction layer should support A/B testing:

Route 10% of traffic to the new model.
Compare latency, cost, and quality (via human review or automated evals).
If it’s better, promote it. If not, roll back.

This keeps your models fresh without breaking production.
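
One way to implement the 10% split is deterministic bucketing, so that a given user always lands in the same arm of the test. The sketch below assumes you have a stable `user_id`; the function name and the `experiment` label are hypothetical.

```python
import hashlib

def in_test_group(user_id: str, test_percentage: float,
                  experiment: str = "new-model") -> bool:
    """Return True if this user falls inside the test slice of traffic."""
    # Hash the experiment name with the user ID so different experiments
    # split users independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < test_percentage

# The same user gets the same answer on every call.
print(in_test_group("user-42", 0.10) == in_test_group("user-42", 0.10))  # prints True
```

Compared with a random draw per request, this keeps each user’s experience consistent and makes per-user quality comparisons easier.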

Principle 4: Vendor Lock-In Avoidance

Multi-cloud is about avoiding lock-in, but you need to be intentional about it:

Don’t use vendor-specific features (Bedrock’s Knowledge Bases, Vertex’s MLOps, Azure OpenAI’s private endpoints) unless you’re willing to be locked in.
Use standard APIs (the OpenAI API format is becoming the de facto standard, and both Bedrock and Vertex now offer OpenAI-compatible options for some models; check coverage for the models you need).
Maintain multi-model parity (ensure your critical paths work with at least two models from different vendors).

If you follow these principles, switching clouds takes days, not months.

Implementation Patterns and Code Examples

Pattern 1: Abstraction Layer in Python

Here’s a minimal abstraction layer that routes between Bedrock, Vertex, and Azure OpenAI:

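import os
from typing import Dict, Any

# Friendly name -> Bedrock model ID. These IDs are illustrative; confirm the
# current IDs for your region in the Bedrock console before relying on them.
BEDROCK_MODEL_IDS = {
    'claude-3-sonnet': 'anthropic.claude-3-sonnet-20240229-v1:0',
    'llama-2-70b': 'meta.llama2-70b-chat-v1',
}

class ModelRouter:
    def __init__(self):
        self.bedrock_client = None
        self.vertex_client = None
        self.azure_client = None
        self._init_clients()

    def _init_clients(self):
        """Initialise cloud clients based on environment variables."""
        if os.getenv('AWS_REGION'):
            import boto3
            self.bedrock_client = boto3.client('bedrock-runtime', region_name=os.getenv('AWS_REGION'))

        if os.getenv('GOOGLE_APPLICATION_CREDENTIALS'):
            import vertexai
            from vertexai.generative_models import GenerativeModel
            # GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION are assumed variable names.
            vertexai.init(
                project=os.getenv('GOOGLE_CLOUD_PROJECT'),
                location=os.getenv('GOOGLE_CLOUD_REGION', 'us-central1')
            )
            self.vertex_client = GenerativeModel  # class; instantiated per model name

        if os.getenv('AZURE_OPENAI_KEY'):
            from openai import AzureOpenAI
            self.azure_client = AzureOpenAI(
                api_key=os.getenv('AZURE_OPENAI_KEY'),
                api_version="2024-02-15-preview",
                azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')
            )

    def call_model(self, model_name: str, prompt: str, **kwargs) -> Dict[str, Any]:
        """Route to the correct cloud based on model name."""
        name = model_name.lower()
        if 'claude' in name or 'llama' in name:
            return self._call_bedrock(model_name, prompt, **kwargs)
        elif 'gemini' in name:
            # PaLM models are not routed here; they use the legacy
            # vertexai.language_models classes rather than GenerativeModel.
            return self._call_vertex(model_name, prompt, **kwargs)
        elif 'gpt' in name:
            return self._call_azure_openai(model_name, prompt, **kwargs)
        else:
            raise ValueError(f"Unknown model: {model_name}")

    def _call_bedrock(self, model_name: str, prompt: str, **kwargs) -> Dict[str, Any]:
        """Call AWS Bedrock via the Converse API (one request shape across models)."""
        if self.bedrock_client is None:
            raise RuntimeError("Bedrock client not configured (set AWS_REGION)")
        response = self.bedrock_client.converse(
            modelId=BEDROCK_MODEL_IDS.get(model_name, model_name),
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={
                "maxTokens": kwargs.get('max_tokens', 1024),
                "temperature": kwargs.get('temperature', 0.7)
            }
        )
        return {"text": response['output']['message']['content'][0]['text']}

    def _call_vertex(self, model_name: str, prompt: str, **kwargs) -> Dict[str, Any]:
        """Call a Gemini model on GCP Vertex AI."""
        if self.vertex_client is None:
            raise RuntimeError("Vertex client not configured (set GOOGLE_APPLICATION_CREDENTIALS)")
        from vertexai.generative_models import GenerationConfig
        model = self.vertex_client(model_name)
        response = model.generate_content(
            prompt,
            generation_config=GenerationConfig(
                max_output_tokens=kwargs.get('max_tokens', 1024),
                temperature=kwargs.get('temperature', 0.7)
            )
        )
        return {"text": response.text}

    def _call_azure_openai(self, model_name: str, prompt: str, **kwargs) -> Dict[str, Any]:
        """Call Azure OpenAI. Azure expects your deployment name as the model."""
        if self.azure_client is None:
            raise RuntimeError("Azure OpenAI client not configured (set AZURE_OPENAI_KEY)")
        response = self.azure_client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=kwargs.get('max_tokens', 1024),
            temperature=kwargs.get('temperature', 0.7)
        )
        return {"text": response.choices[0].message.content}

# Usage
router = ModelRouter()
response = router.call_model('claude-3-sonnet', 'What is 2+2?')
print(response)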

Pattern 2: Cost Router

Route based on cost and complexity:

class CostRouter(ModelRouter):
    MODEL_COSTS = {
        'llama-2-70b': {'input': 0.001, 'output': 0.002},  # per 1K tokens
        'claude-3-sonnet': {'input': 0.003, 'output': 0.015},
        'gpt-4-turbo': {'input': 0.01, 'output': 0.03},
        'gemini-pro': {'input': 0.00025, 'output': 0.0005}
    }
    
    def route_by_complexity(self, prompt: str, complexity: str = 'auto') -> str:
        """Route to cheapest model that meets complexity requirements."""
        if complexity == 'simple':
            return 'llama-2-70b'  # Cheapest
        elif complexity == 'medium':
            return 'claude-3-sonnet'  # Good balance
        elif complexity == 'complex':
            return 'gpt-4-turbo'  # Most capable
        else:
            # Auto-detect with a crude heuristic: reasoning keywords first, then prompt length
            text = prompt.lower()
            if 'reason' in text or 'analyze' in text:
                return 'claude-3-sonnet'
            elif len(prompt) < 100:
                return 'llama-2-70b'  # Short prompts default to the cheapest model
            else:
                return 'gpt-4-turbo'
    
    def estimate_cost(self, model: str, prompt: str, expected_output_tokens: int = 500) -> float:
        """Estimate cost for a given model and prompt."""
        input_tokens = len(prompt.split()) * 1.3  # Rough estimate
        costs = self.MODEL_COSTS.get(model, {})
        input_cost = (input_tokens / 1000) * costs.get('input', 0)
        output_cost = (expected_output_tokens / 1000) * costs.get('output', 0)
        return input_cost + output_cost

# Usage
router = CostRouter()
model = router.route_by_complexity('What is machine learning?', complexity='auto')
print(f"Selected model: {model}")
print(f"Estimated cost: ${router.estimate_cost(model, 'What is machine learning?'):.4f}")

Pattern 3: A/B Testing New Models

Test new models against production models:

import random
from datetime import datetime

class ABTestRouter(CostRouter):
    def __init__(self):
        super().__init__()
        self.test_log = []
    
    def call_with_ab_test(self, prompt: str, control_model: str, test_model: str, test_percentage: float = 0.1) -> Dict[str, Any]:
        """Route to control or test model based on percentage."""
        use_test = random.random() < test_percentage
        selected_model = test_model if use_test else control_model
        
        response = self.call_model(selected_model, prompt)
        
        # Log for analysis
        self.test_log.append({
            'timestamp': datetime.now().isoformat(),
            'prompt_length': len(prompt),
            'control_model': control_model,
            'test_model': test_model,
            'selected_model': selected_model,
            'used_test': use_test,
            'response_length': len(response.get('text', ''))
        })
        
        return response
    
    def get_ab_test_summary(self) -> Dict[str, Any]:
        """Summarise A/B test results."""
        if not self.test_log:
            return {}
        
        control_responses = [r for r in self.test_log if not r['used_test']]
        test_responses = [r for r in self.test_log if r['used_test']]
        
        return {
            'total_requests': len(self.test_log),
            'control_count': len(control_responses),
            'test_count': len(test_responses),
            'avg_response_length_control': sum(r['response_length'] for r in control_responses) / len(control_responses) if control_responses else 0,
            'avg_response_length_test': sum(r['response_length'] for r in test_responses) / len(test_responses) if test_responses else 0
        }

# Usage
router = ABTestRouter()
response = router.call_with_ab_test(
    'Explain quantum computing',
    control_model='claude-3-sonnet',
    test_model='gpt-4-turbo',
    test_percentage=0.2
)
print(router.get_ab_test_summary())
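
The test log above records response length but not latency, which is one of the metrics the A/B comparison depends on. A minimal sketch of a timing wrapper follows; `timed_call` and `fake_call` are hypothetical names, and `call_fn` stands in for whatever single-model call your router exposes.

```python
import time
from typing import Any, Callable, Dict

def timed_call(call_fn: Callable[[str, str], Dict[str, Any]],
               model: str, prompt: str) -> Dict[str, Any]:
    """Call a model and attach wall-clock latency in milliseconds."""
    start = time.perf_counter()
    response = call_fn(model, prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return {**response, "model": model, "latency_ms": latency_ms}

# Hypothetical stand-in for a real model call.
def fake_call(model: str, prompt: str) -> Dict[str, Any]:
    time.sleep(0.01)  # simulate network and inference time
    return {"text": "ok"}

timed = timed_call(fake_call, "claude-3-sonnet", "Explain quantum computing")
print(f"{timed['model']}: {timed['latency_ms']:.1f} ms")
```

Storing `latency_ms` alongside each log entry lets you compare median and tail latency between the control and test models.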

Cost Optimisation Across Clouds

Multi-cloud is only worth it if it saves money. Here’s how to optimise:

Tactic 1: Right-Size Your Models

Not every task needs your biggest, most expensive model. Profile your workloads:

Routing queries (classify customer intent): Llama 2 (roughly 1/30th the cost of GPT-4 at the per-token prices listed earlier).
Content generation (blog posts, emails): Claude 3 Sonnet (1/10th the cost of GPT-4).
Complex reasoning (financial analysis, code debugging): GPT-4 (highest cost, highest quality).

Measure quality for each model on your actual tasks (not benchmarks). You’ll often find that a smaller model is “good enough” and saves 70–80% on inference costs.

Tactic 2: Batch Processing

If your workload allows latency, batch requests and use cheaper batch APIs:

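AWS Bedrock: Batch inference is available for selected models, typically at a lower per-token price than on-demand; check which models support it in your region.
GCP Vertex AI: Batch prediction API offers 50% discount on inference.
Azure OpenAI: A batch deployment option is available for asynchronous jobs at a discount; for steady real-time traffic, provisioned throughput offers 40% discount for committed capacity.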

For a workload that can tolerate 24-hour latency (e.g., daily report generation), batch processing can cut costs by 40–50%.

Tactic 3: Caching and Memoization

If you’re asking similar questions repeatedly, cache the responses:

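import hashlib
import json

class CachingRouter(ABTestRouter):
    def __init__(self, cache_size: int = 1000):
        super().__init__()
        self.cache = {}
        self.cache_size = cache_size
    
    def call_model_cached(self, model_name: str, prompt: str, **kwargs) -> Dict[str, Any]:
        """Call model with caching keyed on model, prompt, and generation parameters."""
        # Include kwargs so a call with a different temperature or max_tokens
        # is not served an answer generated under other settings.
        key_material = json.dumps([model_name, prompt, kwargs], sort_keys=True, default=str)
        cache_key = hashlib.sha256(key_material.encode()).hexdigest()
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        response = self.call_model(model_name, prompt, **kwargs)
        
        # Simple bound: stop adding entries once the cache is full (no eviction).
        if len(self.cache) < self.cache_size:
            self.cache[cache_key] = response
        
        return response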

For a support chatbot answering FAQs, caching can reduce API calls by 40–60%, cutting costs dramatically.

Tactic 4: Provisioned Throughput

If you have predictable, sustained workloads:

AWS Bedrock: Provisioned throughput at 40% discount.
Azure OpenAI: Provisioned throughput at 40% discount.
GCP Vertex AI: Quota reservation (less of a discount, but more predictable).

For a production chatbot running 24/7, provisioned throughput can save $5–10K/month.
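
Whether a commitment pays off depends on how much of the committed capacity you actually use. The sketch below models provisioned throughput as a flat 40% discount on a committed monthly token volume, with overflow billed on-demand. That is a simplifying assumption for illustration, not how any provider actually meters provisioned capacity, and the function name and volumes are hypothetical.

```python
from typing import Tuple

def compare_costs(expected_tokens: int, committed_tokens: int,
                  price_per_1k: float, discount: float = 0.40) -> Tuple[float, float]:
    """Return (on_demand_cost, provisioned_cost) in dollars per month."""
    on_demand = expected_tokens / 1000 * price_per_1k
    # The commitment is paid in full whether or not it is used.
    committed = committed_tokens / 1000 * price_per_1k * (1 - discount)
    # Usage above the commitment is billed at the on-demand rate.
    overflow = max(expected_tokens - committed_tokens, 0) / 1000 * price_per_1k
    return on_demand, committed + overflow

# Commit to 1B tokens at $0.01 per 1K; try 50% and 80% utilisation.
low_use = compare_costs(500_000_000, 1_000_000_000, 0.01)
high_use = compare_costs(800_000_000, 1_000_000_000, 0.01)
print(low_use[0] < low_use[1])    # prints True: on-demand is cheaper at 50% utilisation
print(high_use[1] < high_use[0])  # prints True: the commitment is cheaper at 80% utilisation
```

Under this model the break-even point is 60% utilisation of the committed volume, so a commitment only makes sense for traffic you are confident will materialise.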

Tactic 5: Monitor and Iterate

Set up cost monitoring for each model and cloud:

class CostMonitor:
    def __init__(self):
        self.costs = {}
    
    def log_call(self, model: str, input_tokens: int, output_tokens: int, cost: float):
        if model not in self.costs:
            self.costs[model] = {'calls': 0, 'input_tokens': 0, 'output_tokens': 0, 'total_cost': 0}
        
        self.costs[model]['calls'] += 1
        self.costs[model]['input_tokens'] += input_tokens
        self.costs[model]['output_tokens'] += output_tokens
        self.costs[model]['total_cost'] += cost
    
    def get_cost_breakdown(self) -> Dict[str, Any]:
        return {
            model: {
                'calls': data['calls'],
                'avg_cost_per_call': data['total_cost'] / data['calls'],
                'total_cost': data['total_cost']
            }
            for model, data in self.costs.items()
        }

Review this monthly. If one model is consistently cheaper for a task, switch your router to prefer it.

Security, Compliance, and Governance

When you’re using multiple clouds for AI, security and compliance become more complex. Here’s how to stay safe:

Principle 1: Data Residency and Privacy

Where does your data go when you call a model?

AWS Bedrock: Data stays in AWS. AWS states that prompts and completions are not shared with the model providers (Anthropic, Meta, etc.), whether you use on-demand or provisioned throughput.
GCP Vertex AI: Data stays in GCP. If you use private endpoints, data doesn’t leave your VPC.
Azure OpenAI: Data stays in Azure. Private endpoints ensure data doesn’t traverse the public internet.

For regulated industries (healthcare, finance), verify that your model provider (Anthropic, OpenAI, Google) has data processing agreements (DPAs) in place. This is critical for HIPAA, GDPR, and other regulations.

Principle 2: Authentication and Authorisation

Each cloud has different auth mechanisms:

AWS Bedrock: IAM roles and policies. Use temporary credentials (STS) in production.
GCP Vertex AI: Service accounts and OAuth 2.0. Use Workload Identity Federation for external services.
Azure OpenAI: Managed identity, service principals, or API keys. Prefer managed identity in Azure.

Never hardcode API keys. Use secrets management:

import json

import boto3
from google.cloud import secretmanager as gsm
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def get_bedrock_credentials():
    """Get AWS credentials from Secrets Manager."""
    client = boto3.client('secretsmanager')
    secret = client.get_secret_value(SecretId='bedrock-credentials')
    return json.loads(secret['SecretString'])

def get_vertex_credentials():
    """Get GCP credentials from Secret Manager."""
    client = gsm.SecretManagerServiceClient()
    # Replace my-project with your GCP project ID.
    name = "projects/my-project/secrets/vertex-credentials/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return json.loads(response.payload.data.decode('UTF-8'))

def get_azure_openai_key():
    """Get Azure OpenAI key from Key Vault."""
    credential = DefaultAzureCredential()
    client = SecretClient(vault_url="https://my-vault.vault.azure.net/", credential=credential)
    secret = client.get_secret("azure-openai-key")
    return secret.value

Principle 3: Audit Logging and Monitoring

Log every model call for compliance and debugging:

import logging
from datetime import datetime

class AuditLogger:
    def __init__(self, log_file: str = 'audit.log'):
        self.logger = logging.getLogger('audit')
        # Without an explicit level, the logger inherits WARNING and drops info() calls.
        self.logger.setLevel(logging.INFO)
        # Attach the file handler only once, even if AuditLogger is created repeatedly.
        if not self.logger.handlers:
            handler = logging.FileHandler(log_file)
            formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
            handler.setFormatter(formatter)
            self.logger.addHandler(handler)
    
    def log_model_call(self, model: str, user_id: str, prompt_hash: str, response_hash: str, cost: float):
        """Log a model call for audit purposes."""
        self.logger.info(
            f"model={model} user_id={user_id} prompt_hash={prompt_hash} "
            f"response_hash={response_hash} cost={cost} timestamp={datetime.now().isoformat()}"
        )

For SOC 2 and ISO 27001 audits, you need:

Access logs: Who called which model, when, and from where.
Data logs: What data was processed (without storing the actual data).
Cost logs: How much was spent on each model (for cost control).
Error logs: When models failed or returned errors.
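
For the data-log requirement (recording what was processed without storing the data itself), one common approach is to log a hash of the prompt and response. The sketch below is a minimal illustration using SHA-256; the `fingerprint` and `build_audit_record` names and the record fields are assumptions, not a prescribed audit schema.

```python
import hashlib
from datetime import datetime, timezone
from typing import Any, Dict

def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest that identifies text without storing it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_audit_record(model: str, user_id: str, prompt: str,
                       response_text: str, cost: float) -> Dict[str, Any]:
    """Assemble an audit entry that contains hashes instead of raw content."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "user_id": user_id,
        "prompt_hash": fingerprint(prompt),
        "response_hash": fingerprint(response_text),
        "cost": cost,
    }

record = build_audit_record("claude-3-sonnet", "user-42", "What is 2+2?", "4", 0.0001)
print(len(record["prompt_hash"]))  # prints 64
```

Short or predictable prompts can be guessed by hashing candidates, so for sensitive data a keyed hash (HMAC with a secret key) is the safer variant.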

Principle 4: Content Filtering and Safety

All three platforms support content filtering:

AWS Bedrock: Optional safety filters for violence, hate speech, etc.
GCP Vertex AI: Safety settings (BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, BLOCK_LOW_AND_ABOVE).
Azure OpenAI: Content filtering built in (configurable per deployment).

Implement your own layer of filtering as well—don’t rely solely on the cloud provider:

from profanityfilter import ProfanityFilter

class SafeRouter(CostRouter):
    def __init__(self):
        super().__init__()
        self.profanity_filter = ProfanityFilter()
    
    def call_model_safe(self, model: str, prompt: str, **kwargs) -> Dict[str, Any]:
        """Call model with safety checks."""
        # Check input
        if self.profanity_filter.is_profane(prompt):
            return {"error": "Input contains inappropriate content"}
        
        # Call model
        response = self.call_model(model, prompt, **kwargs)
        
        # Check output
        if self.profanity_filter.is_profane(response.get('text', '')):
            return {"error": "Model output contains inappropriate content"}
        
        return response

Principle 5: SOC 2 and ISO 27001 Readiness

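For startups pursuing compliance, all three platforms run on infrastructure covered by their provider’s SOC 2 reports and ISO 27001 certification, but that coverage does not extend to your own application. The difference is in how much you have to build yourself: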

AWS Bedrock: Requires you to build your own audit logging, access controls, and encryption. Use PADISO’s AI Quickstart Audit to assess your readiness in 2 weeks.
GCP Vertex AI: Similar to Bedrock—you need to implement controls on top of Google’s infrastructure.
Azure OpenAI: Easiest path for compliance. Private endpoints, managed identity, and Key Vault integration are built in. You still need audit logging and access controls, but Azure handles most of the infrastructure.

If compliance is a blocker, consider working with a partner like PADISO who specialises in Security Audit (SOC 2 / ISO 27001) implementation and can help you navigate the multi-cloud compliance landscape.

The Repeatable Framework for Model Releases Through 2027

New models drop every 8–12 weeks. Your framework needs to handle them without manual rewrites. Here’s the repeatable process:

Phase 1: Detection (Week 1)

When a new model is announced:

Check availability: Is it available on Bedrock, Vertex, or Azure OpenAI?
Get pricing: What’s the cost per token?
Review capabilities: Benchmarks, supported features (vision, tools, etc.).
Assess relevance: Does it solve a problem your current models don’t?

Keep a spreadsheet:

Model	Provider	Available On	Input Cost (USD per 1K tokens)	Output Cost (USD per 1K tokens)	Capabilities	Relevant?
Claude 3.5	Anthropic	Bedrock, Vertex AI	$0.003	$0.015	Reasoning, vision, tools	Yes
GPT-4 Turbo	OpenAI	Azure OpenAI, OpenAI API	$0.01	$0.03	Vision, tools, function calling	Yes
Gemini 2.0	Google	Vertex AI	$0.0005	$0.001	Multimodal, reasoning, agents	Yes
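
The same tracking sheet can also live in code, so that benchmarks and routing read prices from one place. The sketch below assumes the table’s prices are in USD per 1,000 tokens; ModelEntry, request_cost, and the 2,000-in / 500-out request size are hypothetical names and numbers chosen for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelEntry:
    name: str
    provider: str
    input_cost_per_1k: float   # USD per 1,000 input tokens (assumed unit)
    output_cost_per_1k: float  # USD per 1,000 output tokens (assumed unit)

    def request_cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD of one request with the given token counts."""
        return ((input_tokens / 1000) * self.input_cost_per_1k
                + (output_tokens / 1000) * self.output_cost_per_1k)

# Rows copied from the tracking sheet above
CATALOG = [
    ModelEntry("Claude 3.5", "Anthropic", 0.003, 0.015),
    ModelEntry("GPT-4 Turbo", "OpenAI", 0.01, 0.03),
    ModelEntry("Gemini 2.0", "Google", 0.0005, 0.001),
]

# Compare models on a typical request: 2,000 input tokens, 500 output tokens
costs = {m.name: m.request_cost(2000, 500) for m in CATALOG}
cheapest = min(CATALOG, key=lambda m: m.request_cost(2000, 500))
```

Price alone rarely settles the choice, which is why the testing phase that follows measures quality and latency as well.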

Phase 2: Testing (Week 2–3)

Set up a test environment: Deploy the new model on the appropriate cloud.
Run your benchmarks: Test on your actual use cases (not public benchmarks).
Measure cost and latency: Real-world performance, not synthetic.
Compare to current models: Is it better, faster, cheaper?

Example benchmark:

import time
from typing import Any, Dict, List

class ModelBenchmark:
    def __init__(self):
        self.router = CostRouter()
        self.results = []
    
    def benchmark_model(self, model: str, test_cases: List[Dict[str, str]], iterations: int = 3) -> Dict[str, Any]:
        """Benchmark a model on test cases."""
        latencies = []
        costs = []
        
        for test_case in test_cases:
            for _ in range(iterations):
                # perf_counter is monotonic, so system clock adjustments can't skew the measurement
                start = time.perf_counter()
                response = self.router.call_model(model, test_case['prompt'])
                latency = time.perf_counter() - start
                latencies.append(latency)
                
                # Rough estimate: about 1.3 tokens per whitespace-separated word
                output_tokens = len(response.get('text', '').split()) * 1.3
                cost = self.router.estimate_cost(model, test_case['prompt'], int(output_tokens))
                costs.append(cost)
        
        return {
            'model': model,
            'avg_latency': sum(latencies) / len(latencies),
            'p95_latency': sorted(latencies)[int(len(latencies) * 0.95)],
            'avg_cost': sum(costs) / len(costs),
            'total_cost': sum(costs)
        }
    
    def compare_models(self, models: List[str], test_cases: List[Dict[str, str]]) -> List[Dict[str, Any]]:
        """Compare multiple models."""
        results = []
        for model in models:
            result = self.benchmark_model(model, test_cases)
            results.append(result)
        
        # Sort by cost
        results.sort(key=lambda x: x['avg_cost'])
        return results

# Usage
benchmark = ModelBenchmark()
test_cases = [
    {'prompt': 'What is machine learning?'},
    {'prompt': 'Explain quantum computing in simple terms.'},
    {'prompt': 'Write a Python function to sort a list.'}
]

results = benchmark.compare_models(
    ['claude-3-sonnet', 'gpt-4-turbo', 'gemini-pro'],
    test_cases
)

for result in results:
    print(f"{result['model']}: ${result['avg_cost']:.4f} per call, {result['avg_latency']:.2f}s latency")

Phase 3: A/B Testing (Week 4–8)

Deploy to production: Route 5–10% of traffic to the new model.
Monitor quality: Does it produce better outputs?
Monitor cost: Is it cheaper or more expensive?
Monitor errors: Does it fail more or less often?
Collect user feedback: Do users prefer it?

Use your ABTestRouter (from Section 8) to automate this.
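
For readers who want to see the mechanics of a 5–10% split, here is a minimal sketch of deterministic, hash-based assignment. It is independent of the ABTestRouter and only illustrates the idea; assign_variant and the variant labels are hypothetical names:

```python
import hashlib

def assign_variant(user_id: str, new_model_share: float = 0.05) -> str:
    """Deterministically assign a user to 'control' or 'candidate'.

    Hashing the user ID means the same user always sees the same model,
    which keeps quality and feedback comparisons clean.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < new_model_share else "control"

# Simulate 10,000 users with a 10% share for the new model
assignments = [assign_variant(f"user-{i}", 0.10) for i in range(10_000)]
candidate_share = assignments.count("candidate") / len(assignments)
```

Random per-request assignment also works, but it lets one user bounce between models, which makes user feedback harder to attribute.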

Phase 4: Decision (Week 8)

Based on testing, decide:

Promote: Switch 100% of traffic to the new model.
Hybrid: Keep both models (route based on use case or cost).
Reject: Stick with the current model.

Document your decision and the reasoning. This becomes your model evolution log.
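
One way to keep the decision consistent between releases is to write the rule down as code. The sketch below is an assumption about what such a rule could look like, not a prescribed policy: the Metrics fields, the thresholds, and the decide function are all hypothetical and should be tuned to your product.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    quality: float        # e.g. mean evaluation score, higher is better
    cost_per_call: float  # USD
    error_rate: float     # fraction of failed calls

def decide(current: Metrics, candidate: Metrics,
           min_quality_gain: float = 0.02,
           max_error_increase: float = 0.005) -> str:
    """Return 'promote', 'hybrid', or 'reject' for a candidate model."""
    # A clear reliability regression rules the candidate out
    if candidate.error_rate > current.error_rate + max_error_increase:
        return "reject"
    quality_gain = candidate.quality - current.quality
    cost_delta = candidate.cost_per_call - current.cost_per_call
    better = quality_gain >= min_quality_gain
    cheaper = cost_delta < 0
    # Promote only when nothing got worse and something improved
    if quality_gain >= 0 and cost_delta <= 0 and (better or cheaper):
        return "promote"
    # A trade-off (better but pricier, or cheaper but weaker) suits per-use-case routing
    if better or cheaper:
        return "hybrid"
    return "reject"

current = Metrics(quality=0.80, cost_per_call=0.010, error_rate=0.01)
outcome = decide(current, Metrics(quality=0.85, cost_per_call=0.008, error_rate=0.01))
```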

Phase 5: Retire (Ongoing)

When a model becomes outdated or deprecated:

Set a sunset date: Announce 6 months in advance.
Migrate workloads: Move to the new model.
Remove from router: Delete the old model from your config.
Update documentation: Record what changed and why.
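
The sunset steps above can be backed by a small check in the router’s configuration. This is a sketch under stated assumptions: the model name, the sunset date, and the retirement_status helper are hypothetical, and the 180-day notice period approximates the 6 months mentioned above.

```python
from datetime import date, timedelta
from typing import Dict

# Hypothetical sunset schedule, keyed by model name
SUNSET_DATES: Dict[str, date] = {
    "legacy-model-v1": date(2026, 12, 1),
}

def retirement_status(model: str, today: date,
                      notice: timedelta = timedelta(days=180)) -> str:
    """Return 'active', 'sunsetting', or 'retired' for a model."""
    sunset = SUNSET_DATES.get(model)
    if sunset is None:
        return "active"          # no sunset date has been set
    if today >= sunset:
        return "retired"         # remove from the router
    if today >= sunset - notice:
        return "sunsetting"      # inside the notice window: warn and migrate
    return "active"

status = retirement_status("legacy-model-v1", date(2026, 7, 1))
```

A router could log a warning for every call to a model in the sunsetting state and refuse calls once it is retired.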

Automating the Framework

You can automate this process:

import time
from typing import Dict, List

import schedule

class ModelReleaseMonitor:
    def __init__(self):
        self.models_db = {}  # Store model metadata
    
    def check_for_new_models(self):
        """Check for new model releases (daily)."""
        sources = {
            'bedrock': self._get_bedrock_models,
            'vertex': self._get_vertex_models,
            'azure': self._get_azure_openai_models,
        }
        for provider, fetch in sources.items():
            for model in fetch():
                # Key on provider and name: the same model can appear on several clouds
                key = (provider, model['name'])
                if key not in self.models_db:
                    self._alert_new_model(model, provider)
                    # Remember the model so it only triggers one alert
                    self.models_db[key] = model
    
    def _get_bedrock_models(self) -> List[Dict[str, str]]:
        """Fetch available models from Bedrock."""
        # Placeholder: call the AWS API here; returns an empty list until implemented
        return []
    
    def _get_vertex_models(self) -> List[Dict[str, str]]:
        """Fetch available models from Vertex."""
        # Placeholder: call the GCP API here; returns an empty list until implemented
        return []
    
    def _get_azure_openai_models(self) -> List[Dict[str, str]]:
        """Fetch available models from Azure OpenAI."""
        # Placeholder: call the Azure API here; returns an empty list until implemented
        return []
    
    def _alert_new_model(self, model: Dict[str, str], provider: str):
        """Alert team of new model."""
        print(f"New model detected: {model['name']} on {provider}")
        # Send Slack notification, create Jira ticket, etc.
    
    def schedule_monitoring(self):
        """Run monitoring daily."""
        schedule.every().day.at("09:00").do(self.check_for_new_models)
        while True:
            schedule.run_pending()
            time.sleep(60)

# Usage
monitor = ModelReleaseMonitor()
monitor.schedule_monitoring()

This ensures you’re always aware of new models and can evaluate them systematically.
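
As one possible way to fill in the Bedrock fetcher, the sketch below reads the model list from boto3’s Bedrock control-plane client via list_foundation_models. The function names, the returned dictionary shape, and the default region are assumptions made for illustration; the client is passed in so the function can be exercised without AWS credentials.

```python
from typing import Any, Dict, List

def fetch_bedrock_models(client: Any) -> List[Dict[str, str]]:
    """Return [{'name': ..., 'vendor': ...}] from a Bedrock control-plane client."""
    response = client.list_foundation_models()
    return [
        {"name": summary["modelId"], "vendor": summary.get("providerName", "")}
        for summary in response.get("modelSummaries", [])
    ]

def make_bedrock_client(region: str = "us-east-1") -> Any:
    """Create the real client. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so this sketch loads without boto3 installed
    return boto3.client("bedrock", region_name=region)

class StubBedrockClient:
    """Stand-in client with a made-up model, used to try the function offline."""
    def list_foundation_models(self) -> Dict[str, Any]:
        return {"modelSummaries": [
            {"modelId": "vendor.model-v1", "providerName": "Vendor"},
        ]}

models = fetch_bedrock_models(StubBedrockClient())
```

In production you would call fetch_bedrock_models(make_bedrock_client()) from the monitor. The Vertex and Azure fetchers would follow the same shape against their own APIs.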

Next Steps

You now have a framework for multi-cloud model access that works today and scales through 2027. Here’s what to do next:

Step 1: Assess Your Current State

Where are you today?

Single cloud (AWS, GCP, or Azure)?
Already using multiple clouds?
Locked into a specific model (OpenAI, Claude, Llama)?

If you’re unsure, book a 30-minute call with PADISO’s AI Advisory team in Sydney or explore our CTO as a Service offering. We help teams assess their AI readiness and build strategies like this one.

Step 2: Build Your Abstraction Layer

Start with the code examples in Section 8. Build a ModelRouter that abstracts your cloud calls. This is the foundation of everything else.

Step 3: Profile Your Workloads

For each AI task in your product:

What model are you using today?
How much does it cost per month?
How much latency can you tolerate?
What quality do you need?

Use this data to build your cost router (Section 8, Pattern 2).
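
The profiling questions above map naturally onto a small data structure. The following sketch is separate from the cost router in Section 8 and uses made-up numbers; WorkloadProfile, ModelStats, pick_model, and monthly_cost are hypothetical names:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class WorkloadProfile:
    name: str
    max_latency_s: float  # how much latency the task can tolerate
    min_quality: float    # minimum acceptable score on your own evaluation
    monthly_calls: int

@dataclass
class ModelStats:
    model: str
    avg_latency_s: float
    quality: float
    avg_cost: float       # USD per call, measured on your workload

def pick_model(profile: WorkloadProfile,
               candidates: List[ModelStats]) -> Optional[ModelStats]:
    """Cheapest candidate that meets the latency and quality limits, if any."""
    eligible = [m for m in candidates
                if m.avg_latency_s <= profile.max_latency_s
                and m.quality >= profile.min_quality]
    return min(eligible, key=lambda m: m.avg_cost, default=None)

def monthly_cost(profile: WorkloadProfile, model: ModelStats) -> float:
    return profile.monthly_calls * model.avg_cost

# Illustrative measurements only
candidates = [
    ModelStats("model-a", avg_latency_s=0.9, quality=0.92, avg_cost=0.020),
    ModelStats("model-b", avg_latency_s=0.4, quality=0.85, avg_cost=0.004),
    ModelStats("model-c", avg_latency_s=2.5, quality=0.95, avg_cost=0.001),
]
support_bot = WorkloadProfile("support-bot", max_latency_s=1.0,
                              min_quality=0.80, monthly_calls=100_000)
choice = pick_model(support_bot, candidates)
```

Here model-c is the cheapest but misses the latency limit, so the cheapest eligible option is model-b.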

Step 4: Set Up Monitoring

Deploy cost, latency, and error monitoring. You can’t optimise what you don’t measure.
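
Before wiring up a full observability stack, the three signals can be captured with a few in-memory counters. This is a minimal sketch, not a replacement for a real metrics backend; the CallMetrics class and its method names are hypothetical:

```python
from collections import defaultdict
from typing import Dict

class CallMetrics:
    """In-memory counters per model; a stand-in for a real metrics backend."""

    def __init__(self) -> None:
        self._calls: Dict[str, int] = defaultdict(int)
        self._errors: Dict[str, int] = defaultdict(int)
        self._latency_s: Dict[str, float] = defaultdict(float)
        self._cost: Dict[str, float] = defaultdict(float)

    def record(self, model: str, latency_s: float, cost: float,
               error: bool = False) -> None:
        self._calls[model] += 1
        self._latency_s[model] += latency_s
        self._cost[model] += cost
        if error:
            self._errors[model] += 1

    def summary(self, model: str) -> Dict[str, float]:
        calls = self._calls.get(model, 0)
        if calls == 0:
            return {"calls": 0, "avg_latency_s": 0.0,
                    "total_cost": 0.0, "error_rate": 0.0}
        return {
            "calls": calls,
            "avg_latency_s": self._latency_s[model] / calls,
            "total_cost": self._cost[model],
            "error_rate": self._errors.get(model, 0) / calls,
        }

metrics = CallMetrics()
metrics.record("model-a", latency_s=0.8, cost=0.002)
metrics.record("model-a", latency_s=1.2, cost=0.003)
metrics.record("model-a", latency_s=2.0, cost=0.0, error=True)
report = metrics.summary("model-a")
```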

Step 5: Plan Your Model Testing

When the next major model drops (for example GPT-5, or a successor to Claude 3.5 or Gemini 2.0), you’ll be ready to test it systematically instead of reactively.

Step 6: Document Your Decisions

Keep a log of:

Which models you tested and when.
Why you chose or rejected each model.
How costs and performance changed.
What you’d do differently next time.

This becomes your institutional knowledge.
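
A decision log does not need special tooling. One option, sketched here with hypothetical field names and example entries, is an append-only JSON Lines file kept in version control:

```python
import json
import tempfile
from datetime import date
from pathlib import Path
from typing import Dict, List

def record_decision(path: Path, model: str, decision: str,
                    reason: str, decided_on: date) -> Dict[str, str]:
    """Append one decision as a single JSON line and return the entry."""
    entry = {"date": decided_on.isoformat(), "model": model,
             "decision": decision, "reason": reason}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def load_decisions(path: Path) -> List[Dict[str, str]]:
    """Read the whole log back, oldest entry first."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# A temporary directory keeps this example safe to run anywhere
log_path = Path(tempfile.mkdtemp()) / "model_decisions.jsonl"
record_decision(log_path, "candidate-model", "hybrid",
                "Better quality, higher cost", date(2026, 6, 2))
record_decision(log_path, "older-model", "reject",
                "No gain on our test cases", date(2026, 6, 9))
history = load_decisions(log_path)
```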

Step 7: Review Quarterly

Every quarter:

Check for new models.
Review your cost breakdown by model.
Evaluate whether your router is optimal.
Update your decision framework.

Summary

Multi-cloud model access is no longer optional. With new models dropping every 8–12 weeks and different providers owning different capabilities, a single-cloud strategy limits your options and locks you into outdated models.

The framework in this guide gives you:

A decision tree to choose between AWS Bedrock, GCP Vertex AI, and Azure OpenAI based on your constraints.
An abstraction layer to route between clouds without rewriting application code.
A cost optimisation strategy to route requests to the cheapest model that meets your quality requirements.
An A/B testing framework to evaluate new models safely in production.
A security and compliance checklist to stay audit-ready across clouds.
A repeatable process to evaluate new models every 8–12 weeks through 2027.

You don’t need all of this today. Start with the abstraction layer and cost router. As your product scales and model options expand, add A/B testing and monitoring. By 2027, you’ll have a system that adapts to whatever the AI landscape throws at you.

For teams building AI products or automating operations, this framework is the difference between being stuck with yesterday’s models and having access to tomorrow’s capabilities. Implement it now, and you’ll thank yourself in 6 months when GPT-5 drops and you can test it in production without breaking anything.

If you need help implementing this framework or want a fractional CTO to guide your multi-cloud AI strategy, PADISO’s CTO Advisory team is here. We’ve helped 50+ startups and enterprises build AI strategies that scale. Book a call or explore our AI Strategy & Readiness service to get started.
