Table of Contents
- Why Opus 4.7 Changes the RAG Game
- Understanding RAG Architecture and Opus 4.7’s Role
- Prompt Design Patterns for Retrieval Context
- Output Validation and Grounding
- Cost Optimisation in Production
- Common Failure Modes and How to Avoid Them
- Infrastructure and Vector Database Choices
- Monitoring, Observability, and Iteration
- Real-World Implementation: A Case Study
- Next Steps and Getting to Production
Why Opus 4.7 Changes the RAG Game
Retrieval-augmented generation (RAG) has become the workhorse pattern for building AI systems that need to reason over external knowledge without fine-tuning or retraining models. But RAG systems are brittle. They fail silently. A retrieval miss cascades into a hallucination. A prompt that works on Monday breaks on Wednesday when your corpus changes. And the cost of running inference at scale across thousands of queries with long context windows will bankrupt you if you’re not deliberate.
Claude Opus 4.7 brought meaningful improvements in reasoning depth, instruction-following, and context handling, but it also introduced new tradeoffs. Opus 4.7 is more expensive per token than its predecessors. It’s also more capable at handling nuance and contradiction in retrieved context, which means you can be less aggressive about filtering and ranking your retrieval results. That’s a double-edged sword: fewer false negatives, but higher latency and cost if you’re not careful.
This guide covers the patterns that work. It’s built on 50+ production RAG deployments across financial services, insurance, and SaaS—all using Opus 4.7 or its siblings. We’ll walk through prompt design, output validation, cost control, and the specific failure modes that catch most teams off guard.
If you’re building RAG systems for regulated industries—banking, insurance, healthcare—the stakes are higher. You need audit-ready retrieval logging, deterministic output validation, and a clear chain of custody from document to answer. We’ve helped teams at Australian financial services firms and insurers achieve this via AI & Agents Automation and Security Audit (SOC 2 / ISO 27001) readiness. The patterns here apply whether you’re a startup or an enterprise.
Understanding RAG Architecture and Opus 4.7’s Role
The Core RAG Loop
Retrieval-augmented generation works like this: a user asks a question, you search your knowledge base (vector database, keyword index, or hybrid) for relevant documents, you stuff those documents into the prompt context, and the LLM generates an answer grounded in that context.
The original RAG paper framed this as a way to improve factual accuracy without retraining. In practice, RAG is also a way to keep your model’s knowledge current (no stale weights), reduce hallucination risk (the model sees the ground truth), and give you an audit trail (you can log what was retrieved and why).
Opus 4.7 is a 200k-token-context model. That’s a lot of room. Where older models would choke on 20–30 retrieved documents, Opus 4.7 can comfortably handle 50–100 chunks and still reason clearly about which ones are relevant to the query. This is a real advantage in RAG systems because it means you can be less aggressive about retrieval filtering—you can afford to include borderline-relevant documents and let the model figure out which ones matter.
But that advantage comes with a cost: Opus 4.7 costs more per token. This guide assumes input tokens at ~$3 per million and output tokens at ~$15 per million. If you’re running 10,000 queries a day with 50 retrieved chunks at 500 tokens each, that is 250 million input tokens, or ~$750/day just in retrieval context. Scale that to 100,000 queries, and you’re at ~$7,500/day. That’s real money. And if your output is verbose, the cost climbs further.
Where Opus 4.7 Fits in the RAG Stack
A production RAG system has several layers:
- Retrieval layer: Vector database (Milvus, Qdrant, Pinecone, Elasticsearch) + embedding model (often open-source or from a hosted provider such as OpenAI or Voyage AI).
- Ranking/filtering layer: Re-ranker model (optional but recommended) that scores retrieved chunks by relevance.
- Prompt construction layer: Logic that assembles the final prompt with instructions, context, and query.
- LLM layer: Opus 4.7 (or another model) that generates the answer.
- Output validation layer: Logic that checks the answer for factual consistency, toxicity, or other guardrails.
Opus 4.7 sits in layer 4, but it influences all the others. Because it reasons well over ambiguous or contradictory context, you can simplify the re-ranker (layer 2), although re-ranking usually still pays for itself on cost. Because it handles long context better, you can filter less aggressively in layer 1. Because it follows instructions closely, you can be more precise in layer 3.
The tradeoff: you pay more in tokens, and you need better observability and validation (layer 5) to catch the cases where Opus 4.7’s reasoning is correct but not grounded in your retrieval context.
Why Context Quality Matters More Than Context Size
Here’s a counterintuitive insight: more retrieved context doesn’t always mean better answers. A practical guide from Pinecone on retrieval-augmented generation shows that systems with 5–10 highly relevant documents outperform systems with 50 mediocre documents. Opus 4.7 is good at filtering signal from noise, but it’s not magic. If your top 50 retrieved chunks are all tangentially related to the query, Opus 4.7 will spend tokens reasoning about them and still produce a weaker answer than if you’d retrieved 5 perfectly relevant chunks.
This means your retrieval quality is the foundation. Everything else—prompt design, output validation, cost optimisation—is built on top of a solid retrieval system.
Prompt Design Patterns for Retrieval Context
The Standard RAG Prompt Template
Most RAG prompts follow this structure:
You are an expert assistant. Answer the user's question based ONLY on the provided context.
Context:
{retrieved_documents}
Question: {user_query}
Answer:
This is a starting point, not a destination. In production, you need to be more specific.
Pattern 1: Explicit Grounding with Citation
When you need audit-ready output (regulated industries, compliance-heavy domains), you want the model to cite which document(s) it used to answer the question.
You are a financial compliance assistant. Answer the user's question using only the provided regulatory documents.
For each fact in your answer, cite the document ID and section number that supports it.
Documents:
{document_1}
{document_2}
...
Question: {user_query}
Answer (with citations):
Opus 4.7 is very good at following this instruction. It will consistently cite sources. The cost is that output tokens increase by ~10–15% because the model is generating citations alongside the answer. The benefit is that you can validate the answer by checking that each citation actually supports the claim.
This pattern is essential if you’re working in financial services or insurance. Teams at Australian banks and insurers using AI for Financial Services Sydney and AI for Insurance Sydney rely on this pattern to meet APRA, ASIC, and AUSTRAC requirements.
Pattern 2: Confidence Scoring
Ask the model to rate its confidence in the answer based on the retrieved context:
You are a customer support assistant. Answer the user's question based on the provided knowledge base.
After your answer, rate your confidence (HIGH, MEDIUM, LOW) based on whether the context fully supports your answer.
Context:
{retrieved_documents}
Question: {user_query}
Answer:
Confidence:
Opus 4.7 is calibrated well enough that its confidence ratings correlate with answer quality. A LOW confidence rating is a signal to route the query to a human or trigger a more aggressive retrieval (e.g., re-search with different keywords).
The cost is minimal—just a few extra tokens for the confidence label. The benefit is that you can build a feedback loop: low-confidence answers get flagged, humans review them, you log the feedback, and you iterate on your retrieval or prompt.
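To act on the confidence label, the response first has to be split into an answer and a rating. A minimal sketch, assuming the model follows the Answer: / Confidence: layout from the prompt above. The helper name `parse_confidence` is hypothetical, and anything that fails to parse is treated as LOW so it is routed to review instead of being shown to a user.

```python
import re

VALID_LEVELS = {"HIGH", "MEDIUM", "LOW"}

def parse_confidence(response: str) -> tuple[str, str]:
    """Split a model response into (answer, confidence_level)."""
    match = re.search(
        r"^\s*Confidence:\s*(\w+)", response, flags=re.IGNORECASE | re.MULTILINE
    )
    if match is None:
        # No label at all: keep the text but treat it as unreliable
        return response.strip(), "LOW"
    level = match.group(1).upper()
    if level not in VALID_LEVELS:
        # Unexpected label (e.g. "certain"): fail safe
        level = "LOW"
    answer = response[: match.start()].strip()
    # Drop a leading "Answer:" marker if the model echoed it
    answer = re.sub(r"^Answer:\s*", "", answer, flags=re.IGNORECASE)
    return answer, level

answer, level = parse_confidence("Answer: Refunds take 5 days.\nConfidence: high")
```

Defaulting to LOW on a parse failure is a deliberate choice here: a missing label usually means the model drifted from the requested format, which is itself a reason to look more closely.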
Pattern 3: Multi-Turn Clarification
For complex queries, ask Opus 4.7 to identify what’s missing from the context and ask a clarifying question:
You are a technical support assistant. Answer the user's question based on the provided documentation.
If the context doesn't fully answer the question, identify what information is missing and ask a clarifying question.
Context:
{retrieved_documents}
Question: {user_query}
Answer:
[If needed] Clarifying question:
This is useful for customer support or internal documentation systems where users might ask ambiguous questions. Opus 4.7 will recognise when context is missing and ask for specifics rather than hallucinating.
Pattern 4: Contrastive Retrieval in the Prompt
When your retrieval returns multiple documents with conflicting information (common in legal, policy, or regulatory domains), explicitly ask Opus 4.7 to reconcile them:
You are a policy analyst. The following documents contain potentially conflicting information about the topic.
Read all documents carefully. Identify any conflicts or contradictions.
Explain which document takes precedence (by date, authority, or scope).
Provide a single, definitive answer that reconciles the conflicts.
Documents:
{document_1}
{document_2}
{document_3}
Question: {user_query}
Answer:
Opus 4.7 is excellent at this. It will reason through contradictions, apply domain knowledge to determine which source is authoritative, and give a nuanced answer. This is much better than letting the model pick the first relevant document it sees.
Prompt Anti-Patterns to Avoid
Anti-pattern 1: Vague instructions. “Answer the question” is weaker than “Answer the question based only on the provided context. Do not use your training data.”
Anti-pattern 2: No output format. If you don’t specify how you want the answer formatted, Opus 4.7 will choose a format. Sometimes it’s right, often it’s not. Always specify: “Answer in 2–3 sentences” or “Answer as a JSON object with fields: summary, confidence, citations.”
Anti-pattern 3: Mixing retrieval context with system context. Keep the retrieved documents separate from your instructions. Bad: Context: {documents} + {system_instructions}. Good: separate the system prompt from the context block.
Anti-pattern 4: No fallback for missing context. If retrieval returns nothing, what should the model do? Hallucinate? Say “I don’t know”? Specify this explicitly: “If the context is empty or doesn’t answer the question, respond with: ‘I don’t have information about this. Please contact support.’”
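The anti-patterns above can be avoided with a small prompt builder. The sketch below is illustrative only: `build_rag_prompt`, the `<context>` and `<document>` tag names, and the exact wording are assumptions, not a required format. It keeps instructions in a system prompt, puts retrieved text in its own block, fixes the output length, and spells out the fallback.

```python
FALLBACK = "I don't have information about this. Please contact support."

# Instructions live here and never mix with retrieved text
SYSTEM_PROMPT = (
    "You are an expert assistant. Answer the question based only on the "
    "documents in the <context> block. Do not use your training data. "
    "Answer in 2-3 sentences. "
    f"If the context is empty or does not answer the question, respond with: '{FALLBACK}'"
)

def build_rag_prompt(query: str, documents: list[str]) -> tuple[str, str]:
    """Return (system_prompt, user_message) with context kept apart from instructions."""
    blocks = [
        f'<document id="{i}">\n{doc.strip()}\n</document>'
        for i, doc in enumerate(documents, start=1)
    ]
    # Make an empty retrieval explicit so the fallback instruction can trigger
    context = "\n".join(blocks) if blocks else "(no documents retrieved)"
    user_message = f"<context>\n{context}\n</context>\n\nQuestion: {query.strip()}"
    return SYSTEM_PROMPT, user_message

system, user = build_rag_prompt(
    "What is the refund window?", ["Refunds are accepted within 30 days."]
)
```

The two returned strings map onto the system and user fields that most chat-style LLM APIs expose, so the retrieved text never shares a field with the instructions.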
Output Validation and Grounding
Why Validation is Non-Negotiable
Opus 4.7 is a language model, not an oracle. It can produce plausible-sounding answers that are completely fabricated. In a RAG system, the model should only generate answers grounded in the retrieved context. Validation is how you enforce that constraint.
Validation serves three purposes:
- Factual consistency: Does the answer match the retrieved documents?
- Safety: Is the answer harmful, toxic, or inappropriate?
- Format compliance: Does the answer match the expected output schema?
Pattern 1: Citation Verification
If you’re using the citation pattern from the prompt section, validate that each citation is correct:
def validate_citations(answer, retrieved_docs):
    """Check that every citation exists and supports its claim.

    extract_citations, extract_claim_for_citation and semantic_similarity
    are placeholder helpers; retrieved_docs maps doc_id -> {section: text}.
    """
    # Extract citations from the answer (e.g., "[Doc 3, Section 2.1]")
    citations = extract_citations(answer)
    for doc_id, section in citations:
        doc = retrieved_docs.get(doc_id)
        if doc is None or section not in doc:
            return False, f"Citation {doc_id}:{section} not found"
        # Check that the cited section actually supports the claim
        claim = extract_claim_for_citation(answer, doc_id, section)
        if semantic_similarity(claim, doc[section]) <= 0.7:
            return False, f"Claim not supported by {doc_id}:{section}"
    return True, "All citations valid"
This is more than syntax checking—you’re verifying semantic consistency. Use an embedding model (open-source or from OpenAI) to compute similarity between the claim and the cited section. A threshold of 0.7–0.8 works well in practice.
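The citation check relies on an `extract_citations` helper that the snippet leaves undefined. One possible version, assuming the model writes citations exactly in the "[Doc 3, Section 2.1]" form used in the comment above; the regular expression and the choice to return IDs as strings are assumptions for illustration.

```python
import re

# Matches citations such as "[Doc 3, Section 2.1]"
CITATION_RE = re.compile(r"\[Doc\s+(\w+),\s*Section\s+([\w.]+)\]")

def extract_citations(answer: str) -> list[tuple[str, str]]:
    """Return unique (doc_id, section) pairs in order of first appearance."""
    seen: list[tuple[str, str]] = []
    for doc_id, section in CITATION_RE.findall(answer):
        if (doc_id, section) not in seen:
            seen.append((doc_id, section))
    return seen

sample = (
    "The limit is $5,000 [Doc 3, Section 2.1]. "
    "Exclusions apply [Doc 7, Section 4], see also [Doc 3, Section 2.1]."
)
citations = extract_citations(sample)
```

Because the IDs come back as strings, the document lookup needs string keys too. An answer with no matches at all is worth flagging separately, since it means the model ignored the citation instruction.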
Pattern 2: Retrieval Grounding Check
Even if the answer doesn’t have explicit citations, you can check whether it’s grounded in the retrieval context:
def check_grounding(answer, retrieved_docs, threshold=0.6):
    # embed and cosine_similarity are placeholders for your embedding stack
    if not retrieved_docs:
        return False, "No retrieval context to ground against"
    # Embed the answer
    answer_embedding = embed(answer)
    # Embed each retrieved document
    doc_embeddings = [embed(doc) for doc in retrieved_docs]
    # Find the best match across all documents
    max_similarity = max(
        cosine_similarity(answer_embedding, doc_emb)
        for doc_emb in doc_embeddings
    )
    # If the answer is too dissimilar from every document, treat it as ungrounded
    if max_similarity < threshold:
        return False, "Answer not grounded in retrieval context"
    return True, f"Answer grounded (similarity: {max_similarity:.2f})"
This is a coarse check, but it catches obvious hallucinations. The threshold depends on your domain—financial or legal documents need higher thresholds (0.7–0.8), customer support can tolerate lower (0.5–0.6).
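The grounding check depends on a `cosine_similarity` function. For reference, a dependency-free sketch over plain lists of floats follows; in practice a vectorised version from a numerical library would be faster, and returning 0.0 for a zero-length vector is a convention chosen here, not a standard.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Undefined for zero vectors; treat as "no similarity"
        return 0.0
    return dot / (norm_a * norm_b)
```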
Pattern 3: Confidence-Based Filtering
If you used the confidence scoring pattern, filter out low-confidence answers:
def filter_by_confidence(answer, confidence_score):
    if confidence_score == "HIGH":
        return True, answer
    elif confidence_score == "MEDIUM":
        # Route to human review or re-retrieve
        return False, "MEDIUM confidence. Escalating to human."
    else:  # LOW
        return False, "LOW confidence. Unable to answer reliably."
This is a simple gate that prevents low-quality answers from reaching users. In production, you’d log these cases and use them to improve retrieval or prompt design.
Pattern 4: Toxicity and Safety Checks
Even though your retrieval context is safe, the model might generate harmful content. Use a safety classifier:
def check_safety(answer):
    # Use a toxicity classifier (e.g., Perspective API, local model)
    toxicity_score = toxicity_classifier(answer)
    if toxicity_score > 0.5:
        return False, "Answer contains potentially harmful content"
    return True, answer
For regulated industries, this is mandatory. For consumer-facing products, it’s essential.
Pattern 5: Schema Validation
If you’ve specified an output format (JSON, structured text, etc.), validate that the answer matches:
import json

def validate_schema(answer, expected_schema):
    # expected_schema['properties'] maps field names to Python types, e.g. {"summary": str}
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False, "Answer is not valid JSON"
    if not isinstance(parsed, dict):
        return False, "Answer is not a JSON object"
    # Check required fields
    for field in expected_schema['required']:
        if field not in parsed:
            return False, f"Missing required field: {field}"
    # Check field types
    for field, field_type in expected_schema['properties'].items():
        if field in parsed and not isinstance(parsed[field], field_type):
            return False, f"Field {field} has wrong type"
    return True, parsed
This ensures that downstream systems can reliably parse the output.
Cost Optimisation in Production
Understanding the Cost Drivers
Opus 4.7 is expensive. Input tokens cost ~$3/million, output tokens ~$15/million. For a typical RAG query:
- System prompt: 200 tokens
- Retrieved context (50 chunks × 500 tokens): 25,000 tokens
- User query: 100 tokens
- Total input: ~25,300 tokens
- Expected output: 200–500 tokens
- Total cost per query: ~$0.08–$0.10
At 10,000 queries/day, that’s $800–$1,000/day, or ~$25k–$30k/month. At 100,000 queries/day, it’s $250k–$300k/month. Most teams don’t realise this until they hit production scale.
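The arithmetic behind these figures is easy to keep in a small helper so it can be rerun when rates or context sizes change. A sketch, assuming the ~$3 and ~$15 per-million-token rates quoted above; check them against current pricing before relying on the output.

```python
INPUT_USD_PER_MTOK = 3.00    # assumed rate, taken from the figures above
OUTPUT_USD_PER_MTOK = 15.00  # assumed rate, taken from the figures above

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the assumed rates."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000

def monthly_cost(
    queries_per_day: int, input_tokens: int, output_tokens: int, days: int = 30
) -> float:
    """USD cost of a month of traffic with a fixed per-query token profile."""
    return cost_per_query(input_tokens, output_tokens) * queries_per_day * days

low = cost_per_query(25_300, 200)
high = cost_per_query(25_300, 500)
print(f"per query: ${low:.4f} to ${high:.4f}")
print(f"monthly at 10k queries/day: ${monthly_cost(10_000, 25_300, 350):,.0f}")
```

Input tokens dominate: at these rates the 25,000 tokens of retrieved context account for more than 90% of the per-query cost, which is why the first optimisation below targets context size.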
Optimisation 1: Aggressive Retrieval Filtering
The biggest cost lever is reducing the amount of context you pass to Opus 4.7. Instead of retrieving 50 chunks and letting the model filter, retrieve 50 chunks but only pass the top 5–10 to the model:
def retrieve_and_filter(query, top_k=50, final_k=5):
    # vector_db and rerank are placeholders for your retrieval stack
    # Retrieve the top_k candidates by vector similarity
    initial_results = vector_db.search(query, top_k=top_k)
    # Re-rank with a faster model (e.g., a cross-encoder)
    reranked = rerank(query, initial_results)
    # Return only the final_k best chunks
    return reranked[:final_k]
A re-ranker model (like a cross-encoder from Hugging Face) costs almost nothing—inference is fast and cheap. It typically improves retrieval quality by 10–20% compared to vector similarity alone. The cost savings from passing fewer chunks to Opus 4.7 far outweigh the re-ranker cost.
Optimisation 2: Chunking Strategy
How you split documents into chunks affects both retrieval quality and cost. Smaller chunks are more specific, but each carries less surrounding context, so you often need to pass more of them to cover a topic. Larger chunks preserve context, but each one costs more tokens and can bury the relevant passage in unrelated text.
Optimal chunk size depends on your domain:
- Legal/Regulatory: 1,000–2,000 tokens (larger chunks preserve context)
- Customer support: 200–500 tokens (smaller chunks are more specific)
- Technical documentation: 500–1,000 tokens (balance between specificity and context)
Test different chunk sizes on a validation set and measure retrieval quality (precision, recall) and cost. Often, smaller chunks with better retrieval quality are cheaper overall than larger chunks with more false positives.
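To compare chunk sizes you need a chunker with the size as a parameter. A minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words as a stand-in for model tokens, which is an approximation; a production version would count with the tokenizer of the embedding model. The function names are hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int = 0) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # This window already reaches the end; avoid a trailing fragment
            break
    return chunks

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Chunk text by whitespace-separated words (a rough proxy for tokens)."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), chunk_size, overlap)]

pieces = chunk_text("one two three four five", chunk_size=2, overlap=0)
```

Running the same validation queries against indexes built with, say, 200, 500 and 1,000 as `chunk_size` gives the precision, recall and token-cost comparison described above.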
Optimisation 3: Caching and Deduplication
If you’re answering similar questions repeatedly, cache the results:
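from functools import lru_cache

@lru_cache(maxsize=10000)
def answer_question(query, context):
    # context must be hashable (e.g., a tuple of chunk strings) to serve as a cache key
    # opus_4_7 is a placeholder for your LLM client
    return opus_4_7.generate(query, context)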
For high-volume systems, use a distributed cache (Redis, Memcached) to share results across instances.
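A shared cache needs a key that is stable across instances. One way to derive it, sketched below under the assumption that every chunk has a stable ID: hash the normalised query, the sorted chunk IDs, and the model name. The `cache_key` helper and the "rag:" prefix are hypothetical.

```python
import hashlib
import json

def cache_key(query: str, chunk_ids: list[str], model: str) -> str:
    """Derive a deterministic cache key for a (query, context, model) triple."""
    payload = {
        # Lower-case and collapse whitespace so trivial variations share a key
        "q": " ".join(query.lower().split()),
        # Sort so retrieval order does not change the key
        "chunks": sorted(chunk_ids),
        "model": model,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "rag:" + hashlib.sha256(blob.encode("utf-8")).hexdigest()

key = cache_key("What is  the coverage limit?", ["doc-7#2", "doc-3#1"], "opus")
```

Including the chunk IDs means a re-indexed corpus produces new keys automatically, so stale answers are not served after the underlying documents change. Including the model name prevents a cached answer from one model being returned for another.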
Also, deduplicate retrieved chunks before passing to the model. If two chunks are near-identical, pass only one:
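def deduplicate_chunks(chunks, similarity_threshold=0.95):
    # Compare embeddings, not raw text; embed and cosine_similarity are placeholders
    unique_chunks = []
    unique_embeddings = []
    for chunk in chunks:
        embedding = embed(chunk)
        is_duplicate = any(
            cosine_similarity(embedding, ue) > similarity_threshold
            for ue in unique_embeddings
        )
        if not is_duplicate:
            unique_chunks.append(chunk)
            unique_embeddings.append(embedding)
    return unique_chunks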
Optimisation 4: Routing to Cheaper Models
Not every query needs Opus 4.7. Simple, factual questions might be answered perfectly well by a cheaper model like Claude Haiku or Claude Sonnet. Implement a router:
def route_query(query, retrieved_context):
    # Model names below are illustrative labels; substitute the exact model IDs from your provider
    # Classify query complexity
    complexity = classify_complexity(query)
    context_quality = assess_context_quality(query, retrieved_context)
    if complexity == "simple" and context_quality == "high":
        # Use the cheapest model
        return "claude-3-haiku"
    elif complexity == "medium":
        return "claude-3-sonnet"
    else:
        return "claude-opus-4-7"
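Haiku costs ~10x less than Opus. If you can route 50% of queries to Haiku, you cut your LLM costs by roughly 45%: half the traffic at full price plus half at one tenth of it leaves about 55% of the original bill.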
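The saving from routing depends on both the share of traffic routed and the price ratio between the models. A small sketch of the blended-cost arithmetic; the 10x price ratio is the assumption stated above, and `blended_cost_ratio` is a hypothetical helper.

```python
def blended_cost_ratio(share_routed: float, cheap_cost_ratio: float) -> float:
    """Cost after routing, as a fraction of the all-expensive-model baseline.

    share_routed: fraction of queries sent to the cheaper model (0.0 to 1.0).
    cheap_cost_ratio: cheaper model's cost relative to the expensive one (0.1 = 10x cheaper).
    """
    if not 0.0 <= share_routed <= 1.0:
        raise ValueError("share_routed must be between 0 and 1")
    if cheap_cost_ratio < 0.0:
        raise ValueError("cheap_cost_ratio must be non-negative")
    return (1.0 - share_routed) + share_routed * cheap_cost_ratio

ratio = blended_cost_ratio(share_routed=0.5, cheap_cost_ratio=0.1)
print(f"remaining cost: {ratio:.0%}, saving: {1 - ratio:.0%}")
```

Routing half the traffic to a model one tenth the price leaves about 55% of the original bill, a little short of halving it. Getting below 50% at that price ratio requires routing about 56% of queries or more.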
Optimisation 5: Output Length Control
Longer outputs cost more. Specify a maximum output length in your prompt:
Provide a concise answer in 2–3 sentences. Do not exceed 100 words.
Opus 4.7 respects length constraints. Enforcing them reduces output tokens by 20–40%.
Optimisation 6: Batch Processing
If you’re processing large volumes of queries, batch them:
from concurrent.futures import ThreadPoolExecutor

def batch_answer(queries, batch_size=100, max_workers=8):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i in range(0, len(queries), batch_size):
            batch = queries[i:i + batch_size]
            # Process the batch in parallel; map preserves input order
            results.extend(pool.map(answer_question, batch))
    return results
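Sending requests in parallel improves throughput for large backlogs. For workloads that do not need an immediate answer, Anthropic’s Message Batches API processes requests asynchronously at a 50% discount on both input and output tokens. The tradeoff is latency: results can take up to 24 hours, so it suits offline jobs such as evaluation runs or nightly re-processing, not interactive queries.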
Common Failure Modes and How to Avoid Them
Failure Mode 1: Retrieval Hallucination
What it looks like: The model generates an answer that sounds correct but isn’t supported by any retrieved document.
Why it happens: Opus 4.7 has broad training data. If it’s seen similar questions before, it will sometimes rely on training knowledge instead of the retrieval context.
How to prevent it:
- Use explicit grounding in your prompt: “Answer ONLY based on the provided context. Do not use your training knowledge.”
- Validate grounding (see the Output Validation section).
- Log all cases where grounding validation fails and review them.
Failure Mode 2: Context Overload
What it looks like: As you add more retrieved documents, answer quality decreases. The model gets confused by conflicting information.
Why it happens: Opus 4.7 can handle long context, but there’s a limit. Past ~80k tokens of context, reasoning quality degrades.
How to prevent it:
- Use a re-ranker to filter context before passing to Opus.
- Implement hybrid queries that combine dense and sparse search for better precision.
- Test context size on a validation set. Find the sweet spot (usually 5–20 documents).
Failure Mode 3: Outdated Retrieval Index
What it looks like: The model answers questions about events or facts that are now outdated. Your knowledge base is stale.
Why it happens: You updated your source documents but didn’t re-index the vector database.
How to prevent it:
- Automate index updates. When a source document changes, re-embed and re-index it.
- Implement versioning. Keep old versions of documents and allow queries to specify a date range.
- Monitor retrieval freshness. Log the timestamps of retrieved documents and alert if they’re too old.
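The freshness monitoring in the last point can start as a simple age check on whatever was retrieved. A sketch, assuming each retrieved document carries an `id` and a timezone-aware `indexed_at` timestamp; both field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def stale_documents(
    retrieved: list[dict], max_age: timedelta, now: Optional[datetime] = None
) -> list[str]:
    """Return IDs of retrieved documents indexed longer ago than max_age."""
    now = now or datetime.now(timezone.utc)
    return [doc["id"] for doc in retrieved if now - doc["indexed_at"] > max_age]

now = datetime(2025, 1, 10, tzinfo=timezone.utc)
docs = [
    {"id": "policy-a", "indexed_at": datetime(2025, 1, 9, tzinfo=timezone.utc)},
    {"id": "policy-b", "indexed_at": datetime(2024, 12, 1, tzinfo=timezone.utc)},
]
stale = stale_documents(docs, max_age=timedelta(days=7), now=now)
```

Logging the stale IDs per query, and alerting when the share of queries touching stale documents rises, turns a silent indexing failure into a visible one.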
Failure Mode 4: Embedding Drift
What it looks like: Over time, retrieval quality degrades. Queries that used to work stop working.
Why it happens: You changed your embedding model or its parameters, but didn’t re-embed your entire corpus.
How to prevent it:
- Never change embedding models without re-embedding the entire corpus.
- Version your embeddings. Store which embedding model was used for each chunk.
- Monitor embedding drift with a held-out test set. If retrieval quality drops, investigate.
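Versioning embeddings only helps if something checks the version at query time. A minimal guard, assuming each stored chunk records the name of the model that embedded it in an `embedding_model` field; the field name and the model names in the example are hypothetical.

```python
def find_version_mismatches(chunks: list[dict], query_model: str) -> list[str]:
    """Return IDs of chunks embedded with a different model than the query."""
    return [
        chunk["id"]
        for chunk in chunks
        # A missing field counts as a mismatch: unknown provenance is not safe to compare
        if chunk.get("embedding_model") != query_model
    ]

chunks = [
    {"id": "c1", "embedding_model": "embedder-v2"},
    {"id": "c2", "embedding_model": "embedder-v1"},
    {"id": "c3"},
]
mismatched = find_version_mismatches(chunks, query_model="embedder-v2")
```

Vectors from different models live in different spaces, so similarity scores between them are meaningless. Any non-empty result is a reason to exclude those chunks and schedule a re-embed.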
Failure Mode 5: Cost Explosion
What it looks like: Your monthly LLM bill is 10x what you expected.
Why it happens: You’re passing too much context to Opus 4.7, or you’re not caching results, or you’re routing all queries to Opus when cheaper models would work.
How to prevent it:
- Implement cost tracking from day one. Log tokens per query, cost per query.
- Set up alerts: if daily cost exceeds a threshold, trigger an investigation.
- Use the optimisation patterns from the Cost Optimisation section.
Failure Mode 6: Inconsistent Output Format
What it looks like: Some answers are JSON, some are plain text. Downstream systems break.
Why it happens: Your output format specification is ambiguous, or you’re not validating output schema.
How to prevent it:
- Be explicit in your prompt: “Answer as a JSON object with fields: summary (string), confidence (HIGH/MEDIUM/LOW), citations (array of strings).”
- Validate schema on every output (see Output Validation section).
- If validation fails, return a default error response with the correct schema.
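The last point, returning a default response that still matches the schema, might look like the sketch below. It assumes the three-field format from the prompt above. The extra `error` field and the function names are illustrative choices, not part of any standard.

```python
import json

REQUIRED_FIELDS = {"summary": str, "confidence": str, "citations": list}
ALLOWED_CONFIDENCE = {"HIGH", "MEDIUM", "LOW"}

def error_response(reason: str) -> dict:
    """A schema-conformant placeholder that downstream code can parse safely."""
    return {
        "summary": "Unable to produce a valid answer.",
        "confidence": "LOW",
        "citations": [],
        "error": reason,
    }

def parse_or_fallback(raw: str) -> dict:
    """Parse model output, returning an error response instead of raising."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return error_response("not valid JSON")
    if not isinstance(data, dict):
        return error_response("top-level value is not an object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return error_response(f"missing or mistyped field: {field}")
    if data["confidence"] not in ALLOWED_CONFIDENCE:
        return error_response("confidence is not HIGH, MEDIUM or LOW")
    return data

good = parse_or_fallback('{"summary": "ok", "confidence": "HIGH", "citations": ["Doc 1"]}')
bad = parse_or_fallback("The answer is 42.")
```

Because the fallback carries the same required fields, consumers need no special error path. They see a LOW-confidence answer and handle it like any other.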
Infrastructure and Vector Database Choices
Choosing a Vector Database
Your retrieval layer is the foundation of RAG. The Milvus documentation and Elasticsearch kNN search documentation cover two popular choices, but the landscape is crowded. Here’s how to think about it:
Milvus: Open-source, self-hosted, high performance. Good for teams that want control and can manage infrastructure. Scales to billions of vectors. Requires Kubernetes or similar orchestration.
Elasticsearch: Mature, battle-tested, excellent for hybrid search (keyword + vector). Good for teams with existing Elasticsearch deployments. Costs more than Milvus but easier to operate.
Qdrant: Modern, cloud-native, good developer experience. Qdrant’s hybrid queries documentation shows sophisticated retrieval patterns. Good for teams that want managed infrastructure without vendor lock-in.
Pinecone: Fully managed, serverless, simple API. Good for teams that don’t want to operate infrastructure. Higher cost, less control.
Weaviate: Open-source with managed option, strong on schema design and multimodal search. Good for teams working with structured data or images.
Decision framework:
- Self-hosted + control: Milvus
- Hybrid search + maturity: Elasticsearch
- Modern + managed: Qdrant
- Serverless + simplicity: Pinecone
- Structured data + multimodal: Weaviate
For most teams, Qdrant or Milvus are the sweet spot. They’re open-source, performant, and support both dense and sparse retrieval.
Embedding Model Choice
Your embedding model determines retrieval quality. Popular choices:
- OpenAI text-embedding-3-large: Strong general-purpose quality, costs ~$0.13 per million tokens (the smaller text-embedding-3-small is ~$0.02). Good for general-purpose retrieval.
- Voyage AI embeddings: Anthropic does not offer its own embedding model; its documentation points to Voyage AI. Worth evaluating if you’re already using Opus 4.7.
- Sentence Transformers (open-source): Free, runs locally, good quality. all-MiniLM-L6-v2 is a popular choice.
- Jina Embeddings: Long context (8k tokens), good for document-level retrieval.
For regulated industries, open-source embeddings are often preferred because they don’t require external API calls and you have full control over the model.
Hybrid Retrieval Architecture
The best production systems combine dense (vector) and sparse (keyword) retrieval:
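def hybrid_retrieve(query, vector_db, keyword_db, k=10):
    def normalise(results):
        # Min-max scale scores to [0, 1] so cosine and BM25 scores are comparable
        if not results:
            return {}
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0
        return {doc_id: (score - low) / span for doc_id, score in results}

    # Dense retrieval (embed is a placeholder for your embedding function)
    dense = normalise(vector_db.search(embed(query), top_k=k * 2))
    # Sparse retrieval
    sparse = normalise(keyword_db.search(query, top_k=k * 2))
    # Merge with a 60/40 weighting; documents found by both searches accumulate both scores
    merged = {}
    for doc_id, score in dense.items():
        merged[doc_id] = merged.get(doc_id, 0) + score * 0.6
    for doc_id, score in sparse.items():
        merged[doc_id] = merged.get(doc_id, 0) + score * 0.4
    # Return top k
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)[:k]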
Hybrid retrieval typically improves recall by 10–20% compared to dense-only retrieval. Sparse search catches exact keyword matches that dense retrieval might miss.
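Weighted score sums only work when the dense and sparse scores are on comparable scales. A common alternative that avoids the problem is reciprocal rank fusion (RRF), which uses only each document's rank in each list. The sketch below is a different merging technique from the weighted sum shown above. The constant k = 60 is a widely used default, and the function name is hypothetical.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; each list is ordered best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns 1 / (k + rank) from every list it appears in
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that rank well in both lists rise to the top without any score normalisation. The cost is that RRF discards score magnitudes, so a very strong match and a marginal one at the same rank are treated alike.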
Monitoring, Observability, and Iteration
What to Log
Production RAG systems are black boxes. You need visibility. Log everything:
import logging
import json
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

def log_rag_query(query, retrieved_docs, answer, validation_result, cost, latency_ms):
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "retrieved_docs": [doc['id'] for doc in retrieved_docs],
        "answer": answer,
        "validation_passed": validation_result['passed'],
        "validation_reason": validation_result['reason'],
        "cost_usd": cost,
        "latency_ms": latency_ms
    }
    # One JSON object per line keeps the log easy to query later
    logger.info(json.dumps(log_entry))
Store these logs in a structured format (JSON lines, BigQuery, etc.) so you can query them later.
Key Metrics to Track
- Retrieval quality: Precision, recall, MRR (mean reciprocal rank) on a held-out test set.
- Answer quality: Grounding rate (% of answers grounded in retrieval), validation pass rate.
- Cost: Cost per query, total daily/monthly cost, cost per successful answer.
- Latency: End-to-end latency, retrieval latency, LLM latency.
- User satisfaction: If you have user feedback, track it (thumbs up/down, ratings).
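The retrieval metrics in the first point are straightforward to compute once you have a labelled test set. A sketch using the standard definitions, assuming each test query comes with the set of document IDs judged relevant; the function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant result over all queries."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)

retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
p2 = precision_at_k(retrieved, relevant, k=2)
r4 = recall_at_k(retrieved, relevant, k=4)
mrr = mean_reciprocal_rank([(["b", "a"], {"a"}), (["a"], {"a"}), (["z"], {"a"})])
```

Tracking these on the same held-out queries after every index or embedding change makes regressions visible before users notice them.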
Building a Feedback Loop
Set up a system where users can flag incorrect answers:
from datetime import datetime, timedelta, timezone

def flag_answer(query_id, feedback_type, explanation):
    """
    feedback_type: 'incorrect', 'incomplete', 'irrelevant', 'harmful'
    """
    feedback = {
        "query_id": query_id,
        "feedback_type": feedback_type,
        "explanation": explanation,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }
    feedback_db.insert(feedback)
    # Alert if the flag rate over the last hour is high
    # (feedback_db, query_count_recent and alert are placeholders)
    window = timedelta(hours=1)
    recent_queries = query_count_recent(window=window)
    if recent_queries == 0:
        return
    recent_feedback_rate = feedback_db.count_recent(window=window) / recent_queries
    if recent_feedback_rate > 0.1:  # >10% flagged
        alert("High feedback rate detected. Investigate retrieval or prompt.")
Review flagged answers weekly. Identify patterns (e.g., “all medical queries are being marked incorrect”). Iterate on retrieval or prompt design based on patterns.
A/B Testing Retrieval and Prompt Changes
Before rolling out changes to production, test them:
import random
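def answer_question(query):
    # 50/50 split: old vs new prompt
    if random.random() < 0.5:
        template = old_prompt_template
        variant = "control"
    else:
        template = new_prompt_template
        variant = "treatment"
    # Fill the chosen template (templates are assumed to contain a {query} placeholder)
    prompt = template.format(query=query)
    answer = opus_4_7.generate(prompt)
    # Log the variant so metrics can be compared later
    log_variant(query, answer, variant)
    return answer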
After a week, compare metrics between variants. If the new prompt has higher validation pass rate and similar cost, roll it out to 100%.
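When comparing validation pass rates between variants, it helps to check that the difference is larger than noise. A minimal sketch using a two-proportion z-test. The counts in the example are made up for illustration, and the 1.96 cut-off corresponds to the conventional two-sided 5% significance level.

```python
import math

def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z-statistic for the difference in pass rate between variant B and variant A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one observation")
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0.0:
        return 0.0
    return (rate_b - rate_a) / std_err

# Illustrative counts: control passes 900 of 1,000, treatment passes 930 of 1,000
z = two_proportion_z(900, 1000, 930, 1000)
significant = abs(z) > 1.96
```

With only a few hundred queries per variant, a difference of one or two percentage points will usually not clear this bar, which is an argument for running the test longer before rolling out.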
Real-World Implementation: A Case Study
The Scenario
A mid-market insurance company wanted to build a claims assistant that could answer questions about claim policies, coverage limits, and claim procedures. They had ~500 internal policy documents (20 million tokens total) and expected ~5,000 queries/day from claims staff.
The Approach
Phase 1: Retrieval Setup
They chose Qdrant (managed cloud) because they wanted to avoid infrastructure overhead. They embedded all 500 documents using OpenAI’s text-embedding-3-large model and indexed them in Qdrant.
Initial retrieval quality was poor. Vector searches for “coverage limit” returned documents about “coverage denial” and “limit of liability”, which are related but not exact matches.
They implemented hybrid retrieval combining vector search (60% weight) and BM25 keyword search (40% weight). This improved precision from 0.62 to 0.84 on a 100-query validation set.
Phase 2: Prompt Design
They started with a simple prompt:
You are a claims assistant. Answer the question based on the provided policy documents.
Context: {documents}
Question: {query}
Answer:
Testing revealed that Opus 4.7 was sometimes stating policy details that the retrieved documents didn’t actually support. They switched to an explicit grounding prompt:
You are a claims assistant. Answer questions about insurance claims based ONLY on the provided policy documents.
For each fact in your answer, cite the document name and section number.
If the documents don't contain the answer, respond: "I don't have information about this. Please contact the claims team."
Documents:
{documents}
Question: {query}
Answer (with citations):
This reduced hallucinations from ~8% to ~1%.
Phase 3: Cost Optimisation
Initial cost was ~$0.12 per query (5,000 queries/day = $600/day = $18k/month). They implemented:
- Re-ranking: Used a cross-encoder to filter the top 50 retrieved documents down to 5 before passing to Opus. This reduced context from 25k tokens to 2.5k tokens.
- Output length control: Limited answers to 150 words.
- Routing: For simple factual questions (e.g., “What’s the deductible?”), routed to Claude Sonnet instead of Opus.
Final cost: ~$0.03 per query (5,000 queries/day = $150/day = $4.5k/month). 75% reduction.
Phase 4: Validation and Monitoring
They implemented:
- Citation verification: Check that each cited section actually exists and supports the claim.
- Grounding check: Embed the answer and compare to retrieved documents. Flag if similarity < 0.6.
- User feedback: Claims staff could flag incorrect answers. ~2% flagged rate.
Flagged answers were reviewed weekly. Common issues:
- “Document X doesn’t say Y” → retrieval was returning the wrong document
- “Answer is incomplete” → retrieval wasn’t returning all relevant documents
- “Answer contradicts document Z” → the model was reasoning incorrectly
They iterated on prompt and retrieval based on this feedback. After 4 weeks, flagged rate dropped to 0.5%.
Results
- Deployment time: 6 weeks from requirements to production
- Query latency: 2–3 seconds (retrieval + LLM)
- Accuracy: 99.5% of answers not flagged by users (a proxy based on user feedback, not a labelled evaluation)
- Cost: $4.5k/month
- User adoption: 90% of claims staff using the assistant within 2 weeks
- Time saved: ~30 minutes per staff member per day (claims staff no longer need to manually search policy documents)
This is a typical outcome. RAG systems are powerful when retrieval quality is high and prompts are well-designed.
Next Steps and Getting to Production
Pre-Production Checklist
- Retrieval quality validated: Precision and recall tested on a representative sample of queries (100+ queries).
- Prompt tested: Multiple prompt variants tested. Final prompt specifies grounding, output format, and fallback behaviour.
- Output validation implemented: Citations verified, grounding checked, schema validated.
- Cost modelled: Expected cost per query calculated and validated on a small pilot.
- Monitoring in place: Logging, metrics, and alerting configured.
- Feedback mechanism: Users can flag incorrect answers. Process for reviewing feedback established.
- Compliance reviewed: For regulated industries, audit trail and data governance confirmed.
- Load tested: System tested at expected peak load (e.g., 10 queries/second).
Deployment Strategy
- Pilot: Deploy to 10% of users or internal staff. Monitor for 1–2 weeks. Collect feedback.
- Ramp: Increase to 25%, then 50%, then 100% based on feedback and metrics.
- Iterate: Weekly reviews of flagged answers and metrics. Make prompt/retrieval improvements.
- Stabilise: After 4–6 weeks, metrics should stabilise. Shift to quarterly reviews.
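The pilot and ramp steps need a way to decide which users see the new system. One simple approach, sketched here as an assumption and not a prescribed mechanism, is deterministic bucketing: hash the user ID into one of 100 buckets and enable the feature for buckets below the current percentage. The function name and salt are hypothetical.

```python
import hashlib

def in_rollout(user_id: str, percent: int, salt: str = "rag-rollout") -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..99
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

users = [f"user-{i}" for i in range(1000)]
pilot = [u for u in users if in_rollout(u, 10)]
ramp = [u for u in users if in_rollout(u, 25)]
```

Because a user's bucket never changes, everyone in the 10% pilot stays enrolled when the rollout ramps to 25% and beyond, which keeps their experience and their feedback history consistent.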
Getting Help
Building production RAG systems is non-trivial. If you’re working in a regulated industry (financial services, insurance, healthcare), compliance adds another layer of complexity. If you’re at a startup and need fractional CTO support or co-build help, consider partnering with a team that has shipped RAG systems before.
At PADISO, we’ve deployed AI & Agents Automation systems for Australian financial services firms and insurers. We handle the full stack: retrieval architecture, prompt design, compliance, and production monitoring. If you’re in Sydney or Melbourne, our Fractional CTO & CTO Advisory in Sydney and Fractional CTO & CTO Advisory in Melbourne teams can help you navigate the tradeoffs and get to production faster.
For teams building platform-scale systems, our Platform Development in Sydney and Platform Development in Melbourne services cover the full infrastructure layer: vector databases, embedding pipelines, retrieval optimisation, and observability.
If you need to pass SOC 2 or ISO 27001 audits, we’ve helped teams achieve this with RAG systems via Security Audit (SOC 2 / ISO 27001) readiness. Compliance and security are not afterthoughts—they’re built in from the start.
Key Takeaways
- Retrieval quality is everything. Invest in retrieval before optimising the LLM prompt. A re-ranker is usually worth it.
- Prompt design matters, but it’s not magic. Be explicit about grounding, output format, and fallback behaviour. Test multiple variants.
- Output validation is non-negotiable. Validate citations, check grounding, and monitor for hallucinations. In regulated industries, this is mandatory.
- Cost scales with context. Opus 4.7 is powerful but expensive. Use aggressive filtering, routing, and caching to keep costs under control.
- Monitor and iterate. Log everything. Collect user feedback. Review flagged answers weekly. Iterate on retrieval and prompt based on real data.
- Compliance is foundational. If you’re in a regulated industry, build audit trails and data governance from day one. Don’t bolt it on later.
Final Thought
RAG systems are not a solved problem. Every production deployment is unique. The patterns in this guide are battle-tested, but your system will have its own quirks and failure modes. Build observability from day one. Treat feedback as signal. Iterate based on real data, not intuition.
Opus 4.7 is a capable model, but it’s still a language model. It will hallucinate. It will misunderstand context. It will be wrong. Your job is to build systems that catch those failures before they reach users. The patterns here show how.
If you’re shipping RAG systems at scale and need support with architecture, compliance, or production engineering, reach out. We’ve built dozens of these systems and we know the pitfalls. Let’s ship something that works.