Prompt Caching for Multi-Tenant SaaS: Per-Tenant or Shared Prefix?
Choose the right prompt caching strategy for multi-tenant SaaS. Compare per-tenant vs shared prefix caching with cost, security, and compliance analysis.
Table of Contents
- Why Prompt Caching Matters for Multi-Tenant SaaS
- Understanding Prompt Caching Fundamentals
- Shared Prefix Caching: Cost Wins and Trade-Offs
- Per-Tenant Caching: Isolation, Compliance, and When to Use It
- The Math: Cost and Latency Comparison
- Security and Compliance Considerations
- Implementation Patterns for Multi-Tenant Architectures
- Making the Decision: A Decision Framework
- Real-World Scenarios and Trade-Offs
- Next Steps and Optimisation
Why Prompt Caching Matters for Multi-Tenant SaaS {#why-prompt-caching-matters}
Prompt caching has become a critical efficiency lever for SaaS platforms running large language models at scale. If you’re operating a multi-tenant AI product—whether that’s a customer support chatbot, document analysis platform, or agentic automation system—your infrastructure costs and latency directly determine unit economics and user experience.
The challenge is architectural: do you cache system prompts, context, and prefixes across all tenants to maximise cache hit rates and slash API costs, or do you isolate caches per tenant to guarantee data separation and simplify compliance?
This isn’t a theoretical question. The difference between these two approaches can mean 40–60% cost reduction on LLM inference, or it can mean failing a SOC 2 audit because cached data from one tenant leaked to another.
At PADISO, we’ve helped Sydney-based startups and enterprise teams navigate this exact decision across dozens of AI product builds. When you’re building AI & Agents Automation systems or scaling platform engineering for multi-tenant workloads, getting this decision right early saves months of refactoring and thousands in unnecessary spend.
This guide walks you through both architectures, the maths, the security implications, and a framework to choose the right one for your business.
Understanding Prompt Caching Fundamentals {#understanding-fundamentals}
Prompt caching works by storing the key-value (KV) cache outputs from processed tokens so that identical or overlapping prefixes don’t need to be recomputed on every request. Instead of running a 2,000-token system prompt and context block through the model every time a user sends a message, the cache stores the computed embeddings and attention states, and new requests only compute tokens after the cached prefix.
How Prefix Caching Works
How prompt caching works with Paged Attention and Automatic Prefix Caching explains the mechanics: when a request comes in, the system hashes the prompt prefix and checks whether that exact sequence has been cached. If it has, inference starts from the cached KV state rather than token zero.
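That hash-and-lookup flow can be sketched in a few lines. This is illustrative only: real engines such as vLLM cache KV tensors block-by-block in GPU memory, and the "KV state" here is an opaque placeholder.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache keyed by a hash of the exact prompt prefix."""

    def __init__(self):
        self._store = {}  # prefix hash -> cached KV state (opaque here)

    def _hash(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def lookup(self, prefix: str):
        """Return the cached KV state for this exact prefix, or None on a miss."""
        return self._store.get(self._hash(prefix))

    def store(self, prefix: str, kv_state) -> None:
        self._store[self._hash(prefix)] = kv_state

cache = PrefixCache()
system_prompt = "You are a helpful support agent."
assert cache.lookup(system_prompt) is None         # first request: miss, full compute
cache.store(system_prompt, "<computed KV state>")  # cache the prefix after computing it
assert cache.lookup(system_prompt) is not None     # later requests: hit, skip recompute
```

Because the key is a hash of the exact token sequence, even a one-character change in the prefix produces a different key and a cache miss.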
In a single-tenant system, this is straightforward. You cache your system prompt once, and every user request reuses it. But in multi-tenant systems, the question becomes: do multiple tenants share the same cached prefix, or does each tenant get its own isolated cache?
The Token Economics
Most LLM APIs (OpenAI, Anthropic Claude, etc.) charge per input token and per output token, and bill cached input tokens at a steep discount. Exact rates vary by provider and model; for illustration, this guide uses:
- Standard input tokens: $0.003 per 1K tokens
- Cached input tokens: $0.0003 per 1K tokens (a 90% discount)
If your system prompt and context are 5,000 tokens and a user sends 100 requests per day, the difference between cache hits and misses is 500,000 tokens × $0.0027 per 1K ≈ $1.35 per day per user, or roughly $49,000 per year for a 100-user cohort. Scale that to 10,000 users, and you're looking at nearly $5M annually in potential savings.
But that calculation assumes 100% cache hit rate. In reality, multi-tenant systems introduce complexity: not every tenant uses the same system prompt, context, or instructions.
Cache Invalidation and TTL
Caches have lifespans. Most LLM providers keep cached prefixes live for 5 minutes to 1 hour, depending on the service. If a tenant updates their system prompt or context, the old cache becomes stale. In a shared-prefix model, invalidating one tenant’s cache might affect others. In a per-tenant model, invalidation is isolated but requires more infrastructure to track.
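Because cache keys are content hashes, a stale entry usually doesn't need explicit deletion: any edit to the prompt produces a new hash, so the updated prompt simply misses the cache until its KV state is recomputed, while the old entry ages out via TTL. A minimal illustration (the 12-character truncation is an arbitrary choice):

```python
import hashlib

def prefix_hash(prompt: str) -> str:
    # Content-addressed key: identical prefixes always map to the same hash.
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

old = prefix_hash("You are a support agent. Be concise.")
new = prefix_hash("You are a support agent. Be concise and cite sources.")
assert old != new  # the edited prompt misses the old entry and must be recomputed
assert prefix_hash("You are a support agent. Be concise.") == old  # keys are stable
```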
Shared Prefix Caching: Cost Wins and Trade-Offs {#shared-prefix-strategy}
Shared prefix caching means all tenants in your SaaS platform use the same cached system prompt, instructions, and context blocks. The cache is global, and every request benefits from the same prefix hash.
When Shared Prefix Wins
Scenario 1: Standardised SaaS with Minimal Customisation
If your product has a fixed system prompt and all tenants use the same AI instructions, shared caching is a straightforward win. Examples:
- A customer support chatbot where all tenants use the same response tone and guardrails.
- A document summariser where the summarisation prompt is identical for all users.
- An email classification system with a standard taxonomy across all accounts.
In these cases, every request hits the same cached prefix, and your cost-per-request drops dramatically. A team building AI automation systems for retail or logistics often falls into this category—the underlying AI logic is the same; only the data and output formatting differ per tenant.
Scenario 2: High-Volume, Low-Variance Workloads
When you’re processing thousands of requests daily with minimal prompt variation, shared caching delivers outsized returns. Financial services platforms processing document analysis, HR platforms automating resume screening, or e-commerce platforms generating product descriptions all benefit from shared prefix caches.
The Cost Math
Let’s model a real scenario:
Platform: A multi-tenant document analysis SaaS with 500 active tenants.
Workload:
- System prompt: 1,500 tokens (same for all tenants)
- Average context per request: 3,000 tokens
- Average output: 200 tokens
- Daily requests: 50,000 across all tenants
Without caching:
- Input tokens per request: 4,500 (1,500 system + 3,000 context)
- Daily input tokens: 225M
- Daily cost @ $0.003 per 1K: $675
- Monthly cost: $20,250
With shared prefix caching (system prompt cached):
- Cached input tokens per request: 1,500 (cached at $0.0003 per 1K)
- Non-cached input tokens per request: 3,000 (context at $0.003 per 1K)
- Daily input cost: (50,000 × 1,500 × $0.0003/1K) + (50,000 × 3,000 × $0.003/1K)
- Daily input cost: $22.50 + $450 = $472.50
- Monthly cost: $14,175
- Monthly savings: $6,075 (30% reduction)
Now scale to 10,000 tenants:
- Monthly cost without caching: $405,000
- Monthly cost with shared prefix: $283,500
- Annual savings: $1.458M
These numbers compound when you add more cached prefixes (e.g., multiple instruction sets, role-based prompts) or increase request volume.
Trade-Offs and Risks
Reduced Customisation Flexibility: If tenants want custom system prompts or instructions, shared caching breaks down. You’d need to fall back to non-cached inference for tenant-specific prompts, negating the benefit.
Cache Invalidation Complexity: Updating a shared prefix requires invalidating the global cache, which affects all tenants. If you discover a bug in your system prompt and push a fix, every tenant’s cache clears simultaneously. This is usually fine, but it introduces a single point of failure.
Data Isolation Risks: This is the critical one. If your shared cache inadvertently leaks information from one tenant’s context into another’s, you’ve created a compliance nightmare. More on this in the security section.
Per-Tenant Caching: Isolation, Compliance, and When to Use It {#per-tenant-strategy}
Per-tenant caching means each tenant maintains its own isolated cache of prompts, instructions, and context. Requests from tenant A never share cached prefixes with tenant B.
When Per-Tenant Caching Is Essential
Scenario 1: Highly Customised Tenant Prompts
If your SaaS allows tenants to define custom system prompts, instructions, or tone, per-tenant caching is necessary. Examples:
- A customer service platform where each tenant configures their own chatbot personality and guardrails.
- A content generation tool where tenants specify brand voice, tone, and style guidelines.
- An internal knowledge assistant where tenants upload custom documentation and context.
In these cases, shared caching would prevent proper cache reuse because each tenant’s prefix is unique. You’d be paying for cache storage without getting cache hits.
Scenario 2: Regulated Industries and Compliance Requirements
If you’re operating in healthcare, finance, legal, or any regulated sector, per-tenant caching is often mandatory. The reason: Multi Tenant Security according to OWASP Cheat Sheet Series emphasises that shared caches without tenant prefixes create data leakage risks that violate SOC 2, HIPAA, and PCI-DSS requirements.
When pursuing SOC 2 compliance or ISO 27001 compliance, auditors explicitly require evidence of tenant isolation. A shared cache that could theoretically leak one tenant’s context to another is a finding waiting to happen.
Scenario 3: Multi-Tenant Architectures with Shared Infrastructure
If you’re running a shared Kubernetes cluster, shared database, or shared inference infrastructure, per-tenant caching provides a logical isolation boundary. Even if the underlying infrastructure is shared, tenant caches remain logically separate.
The Cost Math for Per-Tenant Caching
Using the same document analysis SaaS scenario, but now each of 500 tenants has a unique system prompt and context:
Per-Tenant Scenario:
- System prompt per tenant: 1,500 tokens (unique per tenant)
- Average context per request: 3,000 tokens
- Requests per tenant per day: 100
- Total daily requests: 50,000
With per-tenant caching:
- With 100 requests per tenant spread across the day and a 1-hour TTL, assume each tenant's prompt is re-cached roughly once per hour: ~24 cache misses and ~76 cache hits per tenant per day
- Cache-hit prompt tokens per tenant per day: 76 × 1,500 = 114,000 (billed at $0.0003 per 1K)
- Cache-miss prompt tokens per tenant per day: 24 × 1,500 = 36,000 (billed at $0.003 per 1K)
- Context tokens per tenant per day: 100 × 3,000 = 300,000 (billed at $0.003 per 1K)
- Cost per tenant per day: $0.03 + $0.11 + $0.90 ≈ $1.04
- Cost for 500 tenants per day: ~$521
- Monthly cost: ~$15,640
Comparison:
- Shared prefix caching: $14,175/month
- Per-tenant caching: ~$15,640/month (about 10% more; the premium is each tenant's hourly cache refresh, and it grows for low-traffic tenants whose caches expire between requests)
But this comparison assumes shared prefix is viable in the first place. If tenants have custom prompts, shared prefix doesn't work at all: you'd pay the full non-cached rate ($20,250/month in the original scenario), so per-tenant caching is actually the cheaper option in practice.
Cache Invalidation and Tenant Updates
With per-tenant caching, invalidating one tenant’s cache doesn’t affect others. If tenant A updates their system prompt, only tenant A’s cache clears. This is operationally cleaner and reduces the blast radius of changes.
You can also implement smarter invalidation strategies:
- Versioned prompts: Store multiple versions of a tenant’s prompt and cache each version separately. When a tenant updates their prompt, the old cached version remains available until TTL expires.
- Gradual rollout: Use feature flags to roll out prompt changes to a subset of tenants, validating cache performance before full deployment.
- Tenant-specific TTLs: High-traffic tenants might get longer cache TTLs; low-traffic tenants, shorter TTLs to save storage.
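Two of these strategies, versioned keys and traffic-based TTLs, can be sketched with hypothetical helpers. The key format and thresholds below are assumptions for illustration, not any provider's API:

```python
def prompt_cache_key(tenant_id: str, prompt_version: int) -> str:
    # Versioned prompts: v1 and v2 entries coexist until the old TTL expires.
    return f"cache:tenant_{tenant_id}:prompt_v{prompt_version}"

def cache_ttl_seconds(daily_requests: int) -> int:
    # Tenant-specific TTLs: keep high-traffic tenants' caches warm longer.
    return 3600 if daily_requests >= 1000 else 300

assert prompt_cache_key("acme", 2) == "cache:tenant_acme:prompt_v2"
assert cache_ttl_seconds(5000) == 3600  # busy tenant: 1-hour TTL
assert cache_ttl_seconds(50) == 300     # quiet tenant: 5-minute TTL
```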
The Math: Cost and Latency Comparison {#the-math}
Let’s build a comprehensive decision matrix comparing the two approaches across multiple dimensions.
Cost Analysis Framework
Variables to model:
- Number of tenants (N)
- Requests per tenant per day (R)
- System prompt size in tokens (S)
- Context size per request in tokens (C)
- Output size in tokens (O)
- Cache hit rate (H) — percentage of requests that reuse a cached prefix
- Prompt customisation rate (P) — percentage of tenants with custom prompts
Shared Prefix Cost Formula:
If all tenants share the same prompt:
Daily Cost = (N × R × C × $0.003/1K) + (N × R × S × $0.0003/1K) + (N × R × O × $0.003/1K)
= (N × R × $0.003/1K × (C + O)) + (N × R × S × $0.0003/1K)
With 500 tenants, 100 requests/day each, 3,000-token context, 1,500-token prompt, 200-token output:
Daily Cost = (500 × 100 × $0.003/1K × 3,200) + (500 × 100 × 1,500 × $0.0003/1K)
= $480 + $22.50
= $502.50
Monthly Cost = $15,075
Per-Tenant Cost Formula (with custom prompts):
With custom prompts, only a tenant's own cache can serve its requests, so the per-tenant hit rate H matters. Assuming H = 0.5, with misses billed at the standard rate:
Daily Cost = (N × R × (C + O) × $0.003/1K) + (N × R × S × (H × $0.0003 + (1 − H) × $0.003)/1K)
= (500 × 100 × 3,200 × $0.003/1K) + (500 × 100 × 1,500 × $0.00165/1K)
= $480 + $123.75
= $603.75
Monthly Cost = $18,113
In this scenario, per-tenant caching costs about 20% more than shared prefix ($15,075/month) because half the prompt tokens miss the cache, but it remains well below the uncached cost ($21,150/month), and the isolation benefit is significant for compliance.
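The cost formulas can be wrapped in a small model for experimentation. Prices are this article's illustrative rates, and this sketch bills cache misses at the standard input rate:

```python
STANDARD = 0.003 / 1000   # $ per standard input/output token (illustrative)
CACHED = 0.0003 / 1000    # $ per cached input token (illustrative)

def daily_cost(n_tenants, req_per_tenant, prompt_toks, ctx_toks, out_toks, hit_rate):
    """Daily spend when the system prompt hits the cache at `hit_rate` and
    context plus output are always billed at the standard rate."""
    reqs = n_tenants * req_per_tenant
    prompt = reqs * prompt_toks * (hit_rate * CACHED + (1 - hit_rate) * STANDARD)
    rest = reqs * (ctx_toks + out_toks) * STANDARD
    return prompt + rest

shared = daily_cost(500, 100, 1500, 3000, 200, hit_rate=1.0)    # shared prefix
isolated = daily_cost(500, 100, 1500, 3000, 200, hit_rate=0.5)  # per-tenant, H = 0.5
assert round(shared, 2) == 502.50
assert round(isolated, 2) == 603.75
```

Re-running this with your own tenant counts, token sizes, and measured hit rates is the fastest way to sanity-check a caching decision before committing to an architecture.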
Latency Impact
Prompt caching also reduces latency by eliminating redundant token processing. A cached prefix skips the forward pass for those tokens, reducing time-to-first-token (TTFT).
Without caching:
- Process 4,500 input tokens (1,500 system + 3,000 context) = ~2–4 seconds to first token (depending on model and hardware)
- Then generate output at roughly 20–50ms per token
With caching:
- Retrieve cached KV states for 1,500 tokens = ~10–50ms
- Process 3,000 context tokens = ~1–2 seconds
- Generate output at roughly 20–50ms per token
Latency improvement: roughly 1–2 seconds faster TTFT (a 40–50% reduction in this example)
For user-facing applications, this latency reduction is often more valuable than the cost savings. A chatbot that responds 1 second faster has higher user satisfaction and lower perceived lag.
Throughput and Concurrency
Caching also improves throughput by reducing the computational load on inference servers. With fewer tokens to process per request, you can serve more concurrent requests with the same hardware.
If your inference cluster can process 1,000 input tokens/second and each request without caching requires 4,500 input tokens:
- Without caching: 1,000 ÷ 4,500 ≈ 0.22 requests/second
- With caching (1,500 of those tokens served from cache): 1,000 ÷ 3,000 ≈ 0.33 requests/second
Throughput improvement: 50% more requests with the same hardware
This means you can either serve more users with existing infrastructure or downsize your inference cluster, further reducing costs.
Security and Compliance Considerations {#security-compliance}
This is where the decision gets serious. Shared prefix caching introduces security risks that per-tenant caching eliminates.
Data Leakage Risks in Shared Caches
CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-Tenant Systems documents a critical vulnerability: in shared prefix caching, information from one tenant’s cached context can be inferred by another tenant through cache timing attacks or by observing which prefixes are cached.
Example attack:
- Tenant A uploads a sensitive document to their context: "Patient John Doe has diabetes and takes insulin."
- This context is cached under a hash, say `cache_hash_12345`.
- Tenant B, through timing analysis or cache hit/miss patterns, discovers that `cache_hash_12345` is cached.
- Tenant B infers that a specific context sequence is in the system and can potentially extract information by probing candidate prefixes.
This is a side-channel attack, not a direct data breach, but it violates the confidentiality principle required for SOC 2 Type II and ISO 27001 compliance.
Compliance Frameworks and Tenant Isolation
SOC 2 Type II Requirements:
SOC 2 audits (which we help clients pass via Vanta implementation and Security Audit readiness) explicitly require:
- CC6.1 (Logical Access Controls): Logical access to systems and data is restricted to authorised users. In multi-tenant systems, this means each tenant’s data must be logically isolated.
- CC6.2 (Change Management): Changes to systems must not compromise other tenants’ data. A shared cache that could leak data violates this.
ISO 27001 Requirements:
ISO 27001 emphasises:
- A.9.1.1 (Access Control Policy): Access to information is restricted based on business requirements and the principle of least privilege.
- A.13.1.3 (Segregation of Networks): Networks must be segregated to control access. Shared caches without tenant prefixes violate logical segregation.
HIPAA and PII Regulations:
If you’re handling healthcare data (HIPAA) or personal information (GDPR, CCPA), shared caches are likely non-compliant. Regulators expect evidence that data from one individual cannot leak to another, even through side channels.
Implementing Tenant Isolation in Shared Caches
If you want the cost benefits of shared caching with the security of per-tenant isolation, you need to implement tenant-aware cache prefixes.
How to Design Shared Infrastructure Multi-Tenancy with Tenant Isolation on GCP describes the pattern:
- Prepend the tenant ID to every cached prefix: instead of caching `system_prompt_hash`, cache `tenant_123_system_prompt_hash`.
- Namespace cache keys: use Redis or Memcached with tenant-prefixed keys such as `cache:tenant_123:prompt_v1`.
- Implement cache access controls: verify that requests can only access caches belonging to their own tenant.
- Audit cache hits and misses: log which tenants access which caches for compliance audits.
This approach gives you:
- Cost benefits of caching: Shared infrastructure and cache storage.
- Security of isolation: Logical separation of tenant data.
- Compliance alignment: Clear evidence of tenant isolation for auditors.
The trade-off is marginal: prepending a tenant ID to a cache key adds ~50 bytes of overhead per cache entry, negligible compared to the prompt and context size.
Implementation Patterns for Multi-Tenant Architectures {#implementation-patterns}
Now let’s discuss how to actually implement these strategies in production.
Pattern 1: Shared System Prompt with Tenant-Namespaced Context
This is a hybrid approach: the system prompt is shared and cached globally, but each tenant’s context is cached per-tenant.
Architecture:
Request: {tenant_id: "acme", message: "Summarise this document"}
1. Retrieve cached system prompt (global, shared across all tenants)
Cache key: "system_prompt_v2_hash"
Cache hit: Yes → reuse KV states
2. Retrieve tenant-specific context (per-tenant cache)
Cache key: "tenant_acme_context_hash"
Cache hit: Maybe → reuse if context hasn't changed
3. Build final prompt: [system_prompt] + [tenant_context] + [user_message]
4. Call LLM with cached prefixes for both components
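The assembly step above can be sketched as follows. The cache key names and the tenant context store are hypothetical, and actual KV reuse happens inside the inference engine or provider API; the point is that each segment is tagged with the cache it should hit:

```python
SYSTEM_PROMPT = "You are a document analysis assistant."  # shared across all tenants

def build_prompt(tenant_id, user_message, tenant_contexts):
    """Return [shared system prompt] + [tenant context] + [user message],
    tagging each segment with the cache it should reuse."""
    return [
        {"cache_key": "system_prompt_v2_hash", "text": SYSTEM_PROMPT},  # global cache
        {"cache_key": f"tenant_{tenant_id}_context_hash",
         "text": tenant_contexts[tenant_id]},                           # per-tenant cache
        {"cache_key": None, "text": user_message},                      # never cached
    ]

segments = build_prompt("acme", "Summarise this document",
                        {"acme": "ACME style guide and glossary."})
assert segments[0]["cache_key"] == "system_prompt_v2_hash"
assert segments[1]["cache_key"] == "tenant_acme_context_hash"
```

Keeping the most stable content first matters: prefix caches only match from token zero, so the shared system prompt must precede the tenant-specific context, which must precede the per-request message.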
Cost impact:
Using our document analysis example:
- Shared system prompt (1,500 tokens): Cached globally, hit on every request.
- Tenant context (3,000 tokens): Cached per-tenant, hit rate depends on context freshness.
Assuming a 70% cache hit rate on tenant context, with cached reads billed at the discounted rate:
Daily Cost = (N × R × $0.003/1K × (C × 0.3 + O)) + (N × R × $0.0003/1K × (S + C × 0.7))
= (500 × 100 × $0.003/1K × (3,000 × 0.3 + 200)) + (500 × 100 × $0.0003/1K × (1,500 + 2,100))
= (50,000 × $0.003/1K × 1,100) + (50,000 × $0.0003/1K × 3,600)
= $165 + $54
= $219
Monthly Cost = $6,570
This is roughly 56% cheaper than caching the system prompt alone ($15,075/month including output) and maintains per-tenant isolation.
Pattern 2: Tenant-Isolated Caches with Shared Infrastructure
Each tenant has a logically isolated cache, but the infrastructure (Redis, Memcached) is shared.
Implementation:
```python
import hashlib
import json

import redis


class TenantAwareCacheManager:
    def __init__(self, redis_client):
        self.redis = redis_client

    def cache_key(self, tenant_id, prompt_type, content_hash):
        """Generate a tenant-namespaced cache key."""
        return f"cache:tenant_{tenant_id}:{prompt_type}:{content_hash}"

    def get_cached_prefix(self, tenant_id, prompt_type, content):
        """Retrieve cached KV states for a prompt, isolated by tenant."""
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        key = self.cache_key(tenant_id, prompt_type, content_hash)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set_cached_prefix(self, tenant_id, prompt_type, content, kv_states, ttl=3600):
        """Store cached KV states with tenant isolation and a TTL."""
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        key = self.cache_key(tenant_id, prompt_type, content_hash)
        self.redis.setex(key, ttl, json.dumps(kv_states))
        # Log for compliance audit
        self.audit_log(tenant_id, "cache_write", key)

    def audit_log(self, tenant_id, action, cache_key):
        """Log all cache operations for SOC 2 / ISO 27001 compliance."""
        # Store in an audit table with timestamp, tenant_id, action, cache_key
        pass
```
Benefits:
- Logical isolation: Each tenant’s cache is namespaced and cannot be accessed by other tenants.
- Shared infrastructure: Single Redis cluster serves all tenants, reducing operational overhead.
- Auditability: Every cache operation is logged with tenant ID, enabling compliance audits.
- Cost-efficient: Shared infrastructure scales better than per-tenant databases.
Pattern 3: Database Per Tenant with Isolated Prompt Caching
For highly regulated environments, some teams implement Database Per Tenant vs Shared Schema architectures where each tenant has a dedicated database and, consequently, a dedicated cache.
When to use:
- HIPAA-regulated healthcare platforms.
- Financial services with strict data segregation requirements.
- Government or defence contractors with classified data handling requirements.
Cost impact:
Database per tenant is expensive—you’re running separate infrastructure for each tenant. However, if you’re already committed to this architecture for compliance reasons, adding per-tenant prompt caching is relatively cheap (just additional cache storage per tenant’s database).
Making the Decision: A Decision Framework {#decision-framework}
Here’s a practical framework to decide between shared prefix and per-tenant caching:
Step 1: Assess Prompt Customisation
Question: Do your tenants have custom system prompts or instructions, or is the prompt standardised across all tenants?
- Standardised prompt (0–20% customisation): Shared prefix caching is viable. Move to Step 2.
- Mixed (20–80% customisation): Consider hybrid approach (shared system prompt, per-tenant context). Move to Step 2.
- Highly customised (80%+ customisation): Per-tenant caching is necessary. Skip to Step 3.
Step 2: Evaluate Compliance Requirements
Question: Are you subject to SOC 2, ISO 27001, HIPAA, GDPR, or other data protection regulations?
- No compliance requirements: Shared prefix caching is acceptable if prompt is standardised.
- SOC 2 Type II or ISO 27001: Implement tenant-aware cache namespacing (Pattern 2). This gives you cost benefits with compliance alignment.
- HIPAA, PCI-DSS, or government regulation: Per-tenant caching or database-per-tenant architecture is required. Move to Step 3.
Step 3: Model the Cost Impact
Question: How much will each approach cost, and what’s the payback period?
Use the cost formulas from the “Math” section to model both approaches. Factor in:
- API costs (token pricing)
- Infrastructure costs (cache storage, servers)
- Operational overhead (cache management, invalidation, monitoring)
If per-tenant caching costs 2–3× more than shared prefix, but you’re required to do it for compliance, the decision is made. If per-tenant caching costs 10×+ more and compliance isn’t a blocker, reconsider.
Step 4: Assess Request Volume and Cache Hit Rate
Question: How many requests per day, and what’s your expected cache hit rate?
- Low volume (<1,000 requests/day) or low hit rate (<30%): Caching benefits are marginal. Focus on architectural simplicity over cache optimisation.
- High volume (>10,000 requests/day) with high hit rate (>70%): Caching benefits are significant. Invest in robust cache management.
For high-volume, high-hit-rate scenarios, even a 10% cost reduction from caching translates to significant savings at scale.
Step 5: Plan for Evolution
Question: How will your product evolve? Will customisation increase?
If you’re starting with a standardised product but planning to add tenant customisation in 6–12 months, build for per-tenant caching from day one. Retrofitting cache isolation later is painful.
Real-World Scenarios and Trade-Offs {#real-world-scenarios}
Let’s walk through three realistic scenarios to see how the decision plays out.
Scenario A: Customer Support Chatbot SaaS
Product: White-label customer support chatbot for SMEs.
Characteristics:
- 200 active tenants
- Standardised system prompt (same tone, guardrails for all tenants)
- 10,000 requests/day across all tenants
- Tenants can customise knowledge base (FAQ, product docs) but not system prompt
- No specific compliance requirements (most tenants are SMEs)
Analysis:
- Prompt customisation: Standardised system prompt, but per-tenant knowledge base. Hybrid approach is best.
- Compliance: No strict requirements, but customer trust matters. Implement tenant-aware cache namespacing anyway (minimal overhead).
- Cost model:
- Shared system prompt (2,000 tokens): Cached globally.
- Per-tenant knowledge base (4,000 tokens): Cached per-tenant.
- Monthly cost with hybrid caching: ~$8,000.
- Monthly cost without caching: ~$18,000.
- Annual savings: $120,000.
- Hit rate: System prompt cached on every request (100%). Knowledge base cached on ~60% of requests (users ask similar questions).
Decision: Implement hybrid caching with tenant-aware namespacing. Cost savings are significant, and the architecture is straightforward.
Implementation: Use Pattern 2 (tenant-isolated caches with shared infrastructure). Cache system prompt globally, knowledge base per-tenant.
Scenario B: Healthcare AI Documentation Assistant
Product: AI tool that helps doctors generate clinical documentation from patient conversations.
Characteristics:
- 50 active tenants (hospitals and clinics)
- Highly customised prompts per tenant (each hospital has different documentation standards)
- 5,000 requests/day
- HIPAA-regulated; SOC 2 audit required in 6 months
- Patient data in context (PII-heavy)
Analysis:
- Prompt customisation: 95% of tenants have custom prompts. Shared prefix caching doesn’t work.
- Compliance: HIPAA requires strict data isolation. Shared caches are non-compliant. Per-tenant isolation is mandatory.
- Cost model:
- Per-tenant system prompt (3,000 tokens): Cached per-tenant.
- Per-tenant context (5,000 tokens): Cached per-tenant.
- Monthly cost with per-tenant caching: ~$22,000.
- Monthly cost without caching: ~$24,000.
- Annual savings: $24,000 (modest, but compliance is the driver).
- Compliance: Per-tenant caching with audit logging enables SOC 2 compliance.
Decision: Implement per-tenant caching (Pattern 2 or Pattern 3, depending on overall architecture). Compliance requirements override cost optimisation.
Implementation: Use Pattern 3 (database per tenant) if you’re building a HIPAA-compliant product anyway. If you’re using shared infrastructure, implement Pattern 2 with strict audit logging for SOC 2 readiness.
Scenario C: Enterprise AI Workflow Automation Platform
Product: Internal AI automation platform for a Fortune 500 company automating employee workflows (expense reports, leave requests, etc.).
Characteristics:
- 10,000+ employees across 50 departments
- Standardised workflows but department-specific rules (Finance has different rules than HR)
- 100,000+ requests/day
- ISO 27001 certified; SOC 2 Type II in place
- High-security environment; data isolation is critical
Analysis:
- Prompt customisation: Standardised core workflows, but 30% of departments have custom rules. Hybrid approach is optimal.
- Compliance: ISO 27001 and SOC 2 require strict isolation. Shared caches without tenant prefixes are non-compliant.
- Cost model:
- Shared core workflow prompt (2,500 tokens): Cached globally.
- Department-specific rules (1,500 tokens): Cached per-department.
- Monthly cost with hybrid caching: ~$45,000.
- Monthly cost without caching: ~$120,000.
- Annual savings: $900,000.
- Hit rate: Core prompt cached on 100% of requests. Department rules cached on ~80% of requests (rules are stable).
Decision: Implement hybrid caching (Pattern 2) with department-level isolation. Cost savings are substantial, and compliance is straightforward.
Implementation: Use Pattern 2 with department as the isolation boundary instead of individual tenant. Cache core workflows globally, department rules per-department.
Next Steps and Optimisation {#next-steps}
Once you’ve chosen your caching strategy, here’s how to implement and optimise it.
Immediate Actions (Weeks 1–2)
- Audit your current prompts: Document all system prompts, instructions, and context blocks. Identify which are shared vs. tenant-specific.
- Baseline your costs: Calculate current LLM API spend without caching. This is your benchmark.
- Choose your pattern: Use the decision framework to select shared prefix, per-tenant, or hybrid caching.
- Design cache keys: If implementing per-tenant caching, design your cache key schema (tenant_id, prompt_type, content_hash).
Implementation (Weeks 3–6)
- Integrate with LLM API: Most LLM providers (OpenAI, Anthropic, etc.) support prompt caching natively. Enable it in your API calls.
- Implement cache layer: Use Redis or Memcached for cache storage. Implement tenant-aware key namespacing.
- Add monitoring: Track cache hit rates, cache size, and cost reduction. Use dashboards to visualise impact.
- Test thoroughly: Verify that cached prefixes are correctly reused and that no data leaks occur across tenants.
Optimisation (Weeks 7–12)
- Analyse cache hit patterns: Identify which prompts and contexts have the highest hit rates. Prioritise optimising those.
- Implement cache warming: Pre-load frequently used prompts into the cache at application startup.
- Tune cache TTLs: Experiment with different cache lifespans (5 minutes, 1 hour, 24 hours) to balance hit rate and freshness.
- Implement cache versioning: As you update prompts, maintain multiple cached versions to avoid sudden invalidations.
- Set up compliance auditing: If required, implement audit logging for all cache operations (who accessed which cache, when, from which tenant).
Monitoring and Observability
For teams building AI & Agents Automation systems or pursuing AI Strategy & Readiness, monitoring is critical. Track:
- Cache hit rate (%): Percentage of requests that reuse cached prefixes. Target: >70%.
- Cost per request ($): Total API cost divided by number of requests. Track this weekly.
- Latency (ms): Time-to-first-token and total response time. Caching should reduce TTFT by 20–40%.
- Cache storage (GB): Total size of cached prefixes. Monitor for runaway growth.
- Tenant-specific metrics: Hit rate, cost, and latency per tenant. Identify outliers.
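These per-tenant metrics can be computed from plain request logs before reaching for a dashboard. The log schema below is an assumption for illustration:

```python
# Example request log rows; in production these would come from your API gateway
# or LLM client instrumentation.
logs = [
    {"tenant": "acme", "cache_hit": True,  "cost": 0.0012, "ttft_ms": 420},
    {"tenant": "acme", "cache_hit": True,  "cost": 0.0011, "ttft_ms": 390},
    {"tenant": "acme", "cache_hit": False, "cost": 0.0047, "ttft_ms": 1800},
    {"tenant": "beta", "cache_hit": False, "cost": 0.0049, "ttft_ms": 1750},
]

def tenant_metrics(logs, tenant):
    """Hit rate, cost per request, and average TTFT for one tenant."""
    rows = [r for r in logs if r["tenant"] == tenant]
    hits = sum(r["cache_hit"] for r in rows)
    return {
        "hit_rate": hits / len(rows),
        "cost_per_request": sum(r["cost"] for r in rows) / len(rows),
        "avg_ttft_ms": sum(r["ttft_ms"] for r in rows) / len(rows),
    }

metrics = tenant_metrics(logs, "acme")
assert round(metrics["hit_rate"], 2) == 0.67
assert metrics["avg_ttft_ms"] == 870
```

Note how cache misses dominate both cost and TTFT in the sample rows; outlier tenants with low hit rates are exactly the ones worth investigating first.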
Tools like Datadog, New Relic, or custom dashboards can visualise these metrics in real time.
Scaling Considerations
As your product scales, caching becomes more important:
- At 100 tenants: Caching is nice-to-have. Manual cache management is feasible.
- At 1,000 tenants: Caching is essential. Automated cache invalidation and monitoring are required.
- At 10,000+ tenants: Caching is critical to unit economics. Invest in robust cache infrastructure (Redis cluster, cache CDN, etc.).
For teams working on AI automation at enterprise scale, consider offloading cache management to a managed service (Redis Cloud, Memorystore on GCP, ElastiCache on AWS) to reduce operational overhead.
Regulatory and Compliance Evolution
As regulations evolve, your caching strategy may need to adapt. Stay informed about:
- SOC 2 audits: If pursuing SOC 2 compliance or Security Audit readiness, auditors will ask about cache isolation. Document your strategy.
- ISO 27001 updates: ISO 27001:2022 introduced new controls around logical access and segregation. Ensure your caching aligns.
- Regional data regulations: GDPR (EU), PIPEDA (Canada), PDPA (Singapore) all have implications for where and how you cache data.
When undertaking Vanta implementation or security audits, involve your infrastructure and caching teams in the audit process. Auditors will want to see evidence of tenant isolation, cache access controls, and audit logging.
Conclusion: Making the Right Call
Prompt caching is a powerful lever for reducing costs and improving latency in multi-tenant AI systems. But the architecture you choose—shared prefix vs. per-tenant—has profound implications for cost, compliance, and operational complexity.
Shared prefix caching wins when you have standardised prompts, high request volume, and no strict compliance requirements. The cost savings are substantial (30–50% reduction in API spend), and the implementation is straightforward.
Per-tenant caching is essential when tenants have custom prompts, when compliance regulations require data isolation, or when the risk of data leakage outweighs cost savings.
Hybrid approaches—shared system prompts with per-tenant context, or tenant-aware namespacing of shared caches—often provide the best balance of cost, security, and compliance.
The decision isn’t one-size-fits-all. Use the framework in this guide to assess your specific situation, model the costs, and choose the approach that aligns with your product, compliance requirements, and growth trajectory.
If you’re building AI & Agents Automation systems or scaling platform engineering for multi-tenant workloads, getting this decision right early saves months of refactoring and thousands in unnecessary spend. The teams we work with at PADISO—from Sydney startups scaling their first AI product to enterprise operators modernising legacy systems—consistently see 30–60% cost reductions and 20–40% latency improvements by optimising their caching strategy.
Start by auditing your current prompts and request patterns. Model both approaches using the cost formulas provided. Then implement your chosen strategy incrementally, monitoring cache hit rates and cost impact every week. The data will tell you if you’ve made the right call.
Ready to optimise your multi-tenant AI infrastructure? The team at PADISO specialises in AI Strategy & Readiness and Platform Design & Engineering for SaaS platforms. We help founders and engineering leaders design cost-efficient, compliant, and scalable AI systems. Reach out if you’re navigating this decision for your product.