Table of Contents
- Why Default Model Selection Matters
- The Core Decision Framework
- Evaluating Accuracy and Consistency
- Cost, Speed, and Operational Constraints
- Context Window and Document Handling
- Governance, Audit, and Compliance
- Building Your Re-Evaluation Cadence
- Implementation and Rollout
- Common Pitfalls and How to Avoid Them
- Next Steps: Getting Started
Why Default Model Selection Matters
Choosing a default model for document review is one of the highest-leverage decisions your engineering team will make this year. It affects latency, cost, accuracy, and your ability to scale. Yet most teams make this choice once, in a meeting, and never revisit it—even as the model landscape shifts every 90 days.
This guide gives you a repeatable framework to choose a default model now, and to re-run that decision every major model release between now and 2027. You’ll avoid lock-in, stay cost-efficient, and keep pace with the AI tooling arms race without burning engineering cycles on constant re-evaluation.
Document review is one of the earliest high-value applications of AI in enterprise workflows. Whether you’re reviewing contracts, regulatory filings, discovery documents, or customer agreements, the stakes are real: a poor model choice costs you either in raw dollars (overspending on inference) or in accuracy (missing critical details, requiring human rework, or worse—missing compliance issues).
The problem is that “best” is contextual. GPT-4o might be right for your team. So might Claude 3.5 Sonnet. Or a smaller, cheaper model fine-tuned on your specific document types. The framework below removes guesswork and replaces it with measurable criteria.
The Core Decision Framework
Your default model choice hinges on three non-negotiable inputs:
1. Your Document Types and Complexity
Start here. Not all documents are created equal.
Simple documents (standardised forms, invoices, single-page contracts) can often run on smaller, cheaper models like GPT-4 Mini or Claude 3.5 Haiku. You’re extracting structured fields, flagging obvious red flags, and moving on.
Medium complexity (multi-page contracts, regulatory filings, discovery documents with cross-references) need models that maintain coherence across 10–50 pages and catch subtle relationships between clauses. This is where Claude 3.5 Sonnet and GPT-4o live.
High complexity (litigation discovery, regulatory investigations, complex financial instruments, documents requiring domain reasoning) demand models that reason deeply, hold context across 100+ pages, and make defensible inferences. This is where reasoning models and extended context windows become non-negotiable.
Be specific about your mix. If 70% of your document volume is simple and 30% is complex, your default model strategy will differ from a team where the split is 30/70. Document one or two representative examples from each category. You’ll use these for testing later.
2. Your Accuracy Floor
What’s the cost of a mistake? This determines how aggressive you can be on model selection.
High-stakes review (regulatory compliance, legal discovery, financial audit): You need ≥95% accuracy on critical fields. A missed compliance flag or misread contract clause costs you more than the inference bill. This pushes you toward larger, more capable models, and likely toward human-in-the-loop verification on high-risk documents.
Medium-stakes review (customer contract triage, internal document categorisation, vendor assessment): 85–92% accuracy is often acceptable. You’ll catch most issues; some rework is expected and budgeted.
Low-stakes review (internal knowledge capture, rough document classification, preliminary screening): 75–80% accuracy is fine. You’re using AI to save time on preliminary work; humans will review the output anyway.
Don’t guess at your accuracy floor. Run a small pilot on 50–100 representative documents with your current process (human review, current tooling, whatever you have). Measure what “correct” looks like. Then test models against that ground truth.
3. Your Operational Constraints
Three variables matter:
Budget per document. If you’re reviewing 1,000 documents per month, a $0.10 difference per document is $1,000/month. At $0.01 per document, it’s $100/month. Know your volume and your acceptable cost per unit. Anthropic’s model selection guide and OpenAI’s reasoning documentation both publish pricing; use it to calculate total monthly cost for your expected volume.
Latency tolerance. Do you need results in seconds (interactive review, real-time triage) or is batch processing overnight acceptable (end-of-day reporting, bulk processing)? Smaller models are faster; larger models and reasoning models are slower. Some workflows can tolerate 30-second responses; others need sub-second.
Scale and variability. If your document volume is stable and predictable, you can optimise for that baseline. If it’s spiky (regulatory deadlines, M&A activity, litigation holds), you need a model that scales without degrading quality or breaking your budget.
Write these down. “We process 500–2,000 documents per month, need results within 24 hours, and can spend up to $500/month on inference” is a constraint you can optimise against. “We need a good model” is not.
Evaluating Accuracy and Consistency
Once you’ve defined your constraints, you test. This is non-negotiable.
Building Your Test Set
Gather 50–150 representative documents from your actual corpus. Aim for distribution:
- 60% from your most common document type
- 20% from your second most common
- 20% edge cases: unusual formatting, poor scans, handwritten notes, mixed languages, unusual structures
Have a subject-matter expert (or two) review each document and produce a ground-truth output: extracted fields, flagged clauses, compliance issues, whatever your workflow requires. This is your gold standard. It takes time, but it’s worth it—you’re buying certainty about your model choice.
Running the Test
For each model you’re considering (typically 2–4), run your test set through the same prompt, same instructions, same extraction format. Measure:
Exact match accuracy: Percentage of documents where the model output matches ground truth perfectly.
Field-level accuracy: For extracted fields (dates, amounts, party names), measure accuracy per field. A model might get 95% of contract dates right but only 80% of obligation amounts.
False positive and false negative rates: If you’re flagging compliance issues or red flags, measure separately: How many issues did the model flag that weren’t real? How many real issues did it miss? A model that flags everything is useless; a model that misses half your issues is worse.
Latency: Time per document. Multiply by your expected monthly volume.
Cost: Per-document inference cost × monthly volume.
Document all of this in a simple spreadsheet. You’ll refer back to it every quarter.
Consistency and Drift
Accuracy matters, but so does consistency. A model that’s 92% accurate but varies wildly (90% on one batch, 94% on another, 89% on a third) is harder to work with than a model that’s consistently 88%.
Run your test set twice, a week apart, on the same model. Measure whether results are identical. They should be (models are deterministic at temperature=0). If they’re not, investigate: Are you hitting rate limits? Are there API changes? This is a yellow flag.
Also test consistency across document types. A model might be 94% accurate on clean PDFs but only 78% accurate on scanned images or handwritten documents. This matters if your document mix includes both.
Cost, Speed, and Operational Constraints
Accuracy is only half the equation. You need to trade accuracy against cost and speed in a way that makes sense for your business.
The Cost-Accuracy Frontier
Plot your test results on a simple graph:
- X-axis: cost per document (in cents or dollars)
- Y-axis: accuracy (as a percentage)
You’ll see a frontier: as cost increases, accuracy generally increases, but with diminishing returns. GPT-4 Mini might be $0.005 per document at 82% accuracy. Claude 3.5 Sonnet might be $0.03 per document at 94% accuracy. GPT-4o might be $0.015 at 91%. A reasoning model might be $0.10 at 97%.
Your job is to find the point on that frontier that matches your constraints. If you’re processing 1,000 documents per month and can spend $50/month, you’re looking at $0.05 per document max. If you need 95% accuracy and can spend $100/month, you’re at $0.10 per document.
This is not about picking the “best” model. It’s about picking the model that gives you the accuracy you need at a cost you can sustain.
Speed and Batch vs. Real-Time
If your workflow is batch (overnight processing, end-of-day reports), you can afford slower models. If it’s interactive (users waiting for results while reviewing), you need speed.
Measure latency end-to-end: API call to response. Include any retries or error handling. Some models are 2–3x faster than others on the same task. If your SLA is “results within 30 seconds” and a model averages 45 seconds, it’s disqualified, regardless of accuracy.
For batch workflows, speed matters less, but throughput does. If you’re processing 10,000 documents, a model that’s 2x slower might force you into more parallel API calls, which costs more. Calculate total cost including parallelisation overhead.
Volume Elasticity
Some teams have predictable, flat volume. Others spike dramatically (litigation holds, regulatory deadlines, M&A). If you spike, you need a model that scales without degrading quality or breaking your budget.
Test this: Run your test set through a model at 2x volume (or via an API that’s under load). Does accuracy drop? Does latency degrade? Does cost scale linearly or super-linearly?
Smaller models often scale more gracefully. Larger models sometimes degrade under load. This is worth knowing before you’re in a crisis.
Context Window and Document Handling
One of the biggest practical decisions is how much of a document you can fit into a single API call.
Context Window Trade-Offs
Larger context windows (100K+ tokens) let you send entire documents, multi-page contracts, or even collections of documents in a single request. This is convenient and often more accurate (the model sees everything at once).
Smaller context windows (8K–32K tokens) force you to chunk documents, summarise sections, or make multiple calls. This is more complex but often cheaper and faster.
For a 20-page contract (typically 5,000–8,000 tokens), an 8K context window is tight. A 32K window is comfortable. A 100K window is overkill but gives you flexibility.
Measure your document sizes in tokens (use your model’s tokenizer). If 80% of your documents are <8K tokens, a smaller context window is fine. If you regularly see 20K+ token documents, you need a larger window or a chunking strategy.
Chunking and Multi-Call Workflows
If you chunk documents (break them into sections, process each separately, then synthesise), you lose some context. A clause on page 10 might reference definitions on page 2; if you process them separately, the model might miss the connection.
Test this explicitly. Take a complex document, process it whole (if your model allows), then process it in chunks. Compare results. If chunking causes accuracy to drop >5%, it’s a problem. If it’s <2%, you can live with it.
Multi-call workflows also cost more. If you make 3 API calls per document instead of 1, your cost triples (all else equal). This might be acceptable if the model is cheaper per call, but measure it.
Handling Unusual Formats
Test your models on documents that represent your actual mix:
- Scanned PDFs (images, not searchable text)
- Handwritten notes or annotations
- Tables and structured data embedded in prose
- Mixed languages
- Poor formatting or encoding issues
Some models handle images better than others. Some struggle with tables. Some are better at multilingual text. Your test set should include these edge cases, and your accuracy measurement should flag which models struggle with which types.
If 20% of your documents are scanned images and a model’s accuracy on those is 70% (vs. 92% on clean PDFs), that’s a critical constraint on your choice.
Governance, Audit, and Compliance
If your document review touches regulated workflows (legal, financial services, healthcare), you need to think about governance, auditability, and compliance.
Audit Trails and Reproducibility
Can you reproduce the model’s output for a specific document? Can you explain why it made a particular decision?
For SOC 2 and ISO 27001 compliance, you’ll need to log:
- Which model version was used
- When the document was processed
- What the model output was
- Whether a human reviewed or overrode the output
Some models and providers make this easier than others. OpenAI and Anthropic both provide request IDs and versioning. Make sure your chosen model supports the logging and reproducibility your compliance framework requires.
Model Transparency and Explainability
If you need to explain a decision to a regulator or a court, can you? Some models (especially smaller, fine-tuned models) are more interpretable than large foundation models.
For high-stakes review, consider building in an explainability layer: Have the model not just extract a field but explain its reasoning. “The contract date is 2024-03-15 because the signature block shows ‘Signed this 15th day of March, 2024’.” This is more defensible in an audit or legal proceeding.
Data Residency and Privacy
Where does your data go? If you’re processing confidential documents (M&A, litigation, healthcare), you need to know whether the model provider can access your data, whether it’s used for training, and whether it meets your data residency requirements.
Anthropic’s documentation and OpenAI’s policies both address this; review them carefully if privacy is a constraint.
For highly sensitive work, consider on-premise or private deployment options. They’re more expensive, but they give you control.
Regulatory and Industry-Specific Guidance
Your industry may have specific guidance on AI use in document review. The ILTA’s best practices guide covers legal workflows. The American Bar Association’s guidance on choosing AI tools addresses legal-specific concerns. If you’re in financial services, APRA, ASIC, and AUSTRAC have their own frameworks.
Read these before you commit to a model. They’ll flag constraints you might otherwise miss.
Building Your Re-Evaluation Cadence
This is the part most teams skip, and it’s why they end up locked into suboptimal models for years.
Quarterly Model Landscape Review
Every quarter (January, April, July, October), spend 2 hours reviewing what’s new:
- What new models were released?
- What changed in pricing?
- What changed in context window, latency, or accuracy benchmarks?
- Are any of your current models being deprecated?
You don’t need to test everything. But you should know what’s available. Create a simple spreadsheet: Model name, release date, key specs, approximate cost, whether it’s worth testing.
Annual Re-Evaluation
Once a year (pick a month—June or September works well), re-run your full test suite on 2–3 candidate models. This takes a day or two of engineering time, but it’s worth it.
You’re measuring:
- Has your current default model’s accuracy drifted?
- Has a new model entered the market that’s better on your specific use case?
- Has pricing changed enough to shift your cost-accuracy frontier?
- Have your document types or accuracy requirements changed?
Document the results. If your current model is still best, keep it. If something new is better, plan a migration. If something is close but cheaper, consider it.
Trigger-Based Re-Evaluation
Beyond the schedule, re-evaluate immediately if:
- Your accuracy floor changes (new regulatory requirement, higher-stakes use case)
- Your volume changes dramatically (10x growth, new product launch)
- A model you’re using is deprecated
- A major new model is released (GPT-5, Claude 4, etc.)
- Your current model’s accuracy drops >3% (indicates possible drift or API changes)
Building the Playbook
Document your re-evaluation process in a playbook:
- Test set: Where is it stored? How is it versioned? Who maintains it?
- Models to test: What’s your candidate set? (Your current default + 2–3 challengers)
- Evaluation criteria: What metrics matter? (Accuracy, cost, latency, consistency)
- Decision rule: How do you decide to switch? (E.g., “Switch if accuracy is ≥2% higher and cost is ≤10% more, or if cost is ≥20% lower and accuracy is within 1%”)
- Rollout plan: If you switch, how do you migrate? (Gradual rollout, A/B testing, full cutover?)
- Rollback plan: If the new model underperforms in production, how do you revert?
This playbook should be owned by your engineering team, not by management. It’s a living document. Update it every time you re-evaluate.
Implementation and Rollout
Once you’ve chosen your default model, you need to ship it. This is straightforward but easy to mess up.
Prompt Engineering and Consistency
Your model choice is only as good as your prompt. Spend time on this:
- System prompt: Define the model’s role clearly. “You are a contract review specialist. Extract key dates, parties, obligations, and red flags.”
- Instructions: Be specific about format. “Return output as JSON with keys: contract_date, parties (array), obligations (array), red_flags (array).”
- Examples: Provide 2–3 in-context examples of input and expected output. This dramatically improves consistency.
- Error handling: What should the model do if it can’t find a field? (Return null, return “Not found”, return best guess?) Be explicit.
Test your prompt on your test set. Measure accuracy with your final prompt. If accuracy drops significantly from your earlier testing, your prompt needs work.
Versioning and Rollback
Version your prompt and your model choice together. “Default model: Claude 3.5 Sonnet, Prompt v2.3, deployed 2024-09-15.” If you need to rollback, you can do it quickly.
Keep the previous version live for 24–48 hours. Run both in parallel. Compare outputs on a sample of documents. If the new version is clearly better, switch fully. If it’s worse or inconsistent, rollback.
Monitoring in Production
You can’t measure accuracy on every document (that would require human review of everything). But you can measure:
- Output consistency: Are results stable day-to-day? Are there sudden changes?
- Error rates: How many documents fail to process? How many return obviously wrong outputs (dates in the future, amounts that are nonsensical)?
- Latency: Is response time stable?
- Cost: Is inference cost tracking your budget?
Set up alerts. If accuracy (measured on a sample) drops >5%, or if error rate spikes, investigate immediately. This might indicate an API change, a model update, or a problem with your prompt.
Gradual Rollout
Don’t flip a switch and move 100% of your traffic to a new model. Instead:
- Week 1: Route 10% of documents to the new model. Compare outputs on a sample.
- Week 2: 25% of documents.
- Week 3: 50% of documents.
- Week 4: 100% of documents.
This gives you time to catch issues before they affect your entire workflow. If something goes wrong at 10%, you fix it before it affects 100%.
Common Pitfalls and How to Avoid Them
Pitfall 1: Choosing Based on Benchmarks, Not Your Data
GPT-4o scores 92% on MMLU. Claude 3.5 Sonnet scores 88%. Therefore, GPT-4o is better, right?
Wrong. MMLU is a general-knowledge benchmark. Your documents might be highly specialised contracts, medical records, or regulatory filings. A model’s performance on MMLU tells you almost nothing about its performance on your specific documents.
Fix: Always test on your actual data. Benchmarks are useful for getting in the ballpark, but your test set is ground truth.
Pitfall 2: Underestimating the Cost of Human Review
You choose a model that’s 88% accurate, thinking you’ll catch the remaining 12% with spot checks. In practice, you end up reviewing 50% of documents because you don’t trust the model.
If you need 95%+ accuracy, don’t pick an 88% model and hope. The cost of human review will dwarf your inference savings.
Fix: Be honest about your accuracy floor. If you need 95%, test models at 95%+. If you can live with 85%, test at 85%+. Don’t pick a model and then complain about the rework.
Pitfall 3: Ignoring Edge Cases
Your test set is clean, well-formatted documents. Your production documents include scanned images, handwritten notes, mixed languages, and corrupted PDFs.
The model that scored 94% on your test set scores 72% on your edge cases. Now you’re in trouble.
Fix: Include edge cases in your test set from the start. Weight them appropriately (if 15% of your documents are scanned images, 15% of your test set should be too). Measure accuracy per category.
Pitfall 4: Locking In to a Model
You choose GPT-4o. You build your entire workflow around it. Then OpenAI raises prices 40%, or deprecates the model, or releases a new model that’s 30% cheaper and just as accurate.
Now you’re stuck. Re-architecting your workflow is expensive.
Fix: Build abstraction. Use a wrapper that lets you swap models without changing your application code. Keep your test set and evaluation process alive. Re-evaluate annually. Plan for model switching as a normal operational task, not a crisis.
Pitfall 5: Not Measuring Consistency
You test a model once, get 92% accuracy, and declare victory. But the model’s accuracy varies 88–96% day-to-day. You don’t know this because you only tested once.
In production, some days your accuracy is good, some days it’s terrible. You can’t rely on it.
Fix: Test multiple times (at least twice, a week apart). Run batches of different sizes. Measure variance explicitly.
Pitfall 6: Ignoring Compliance and Auditability
You choose a model because it’s cheap and fast. Later, your compliance team asks: “Can you explain why the model flagged this clause as a red flag?” You can’t. The model doesn’t provide explanations, and you don’t have audit logs.
Now you’re scrambling to add logging and explainability after the fact.
Fix: Before you choose a model, know your compliance requirements. If you need audit trails, explainability, or data residency guarantees, bake those into your model selection criteria. Don’t compromise on them later.
Next Steps: Getting Started
You now have a framework. Here’s how to execute it.
Week 1: Define Your Constraints
- Document your document types. Gather 5–10 representative examples of each type you process. Categorise by complexity.
- Define your accuracy floor. Work with your team to decide: What accuracy do you need? What’s the cost of a mistake?
- Calculate your operational constraints. How many documents per month? What’s your budget? What’s your latency requirement?
Write this down. One page. This is your north star.
Week 2–3: Build Your Test Set
- Gather 50–150 representative documents. Include your most common types and edge cases.
- Have a subject-matter expert review each one. Produce ground-truth outputs: extracted fields, flags, whatever your workflow requires.
- Document the ground truth. Store it in a version-controlled location. You’ll use this for years.
Week 4: Test Candidate Models
- Identify 2–4 candidate models. Usually: your current model (if you have one) + 2–3 challengers.
- Run your test set through each model. Use the same prompt, same instructions, same format.
- Measure accuracy, cost, latency, and consistency. Document everything.
If you need help with this—if you’re not sure which models to test, or how to set up the evaluation infrastructure—PADISO’s AI advisory team in Sydney can guide you through the process. We’ve done this for dozens of teams across financial services, legal, and enterprise workflows.
Week 5: Make Your Decision
- Plot your results. Cost vs. accuracy. Speed vs. accuracy. Consistency.
- Apply your constraints. Which models meet your accuracy floor? Which fit your budget? Which have acceptable latency?
- Pick your default. Document the decision: why this model, what trade-offs you accepted, what the alternative was.
Week 6+: Implement and Monitor
- Write your prompt. Be specific. Include examples. Test it.
- Set up monitoring. Track accuracy (on a sample), error rates, latency, cost.
- Plan your rollout. Gradual migration, not a big bang.
- Schedule your re-evaluation. Quarterly landscape review. Annual full re-evaluation.
Building Long-Term Capability
This isn’t a one-time project. You’re building a capability: the ability to choose and re-evaluate models as the landscape evolves.
Own this in your engineering team. Assign an owner (a senior engineer, an ML engineer, or a product manager). Give them 2–4 hours per quarter for the landscape review, and 1–2 days per year for the full re-evaluation. That’s it. But it’s non-negotiable.
If you’re building document review into a product (for your customers, not just internal use), this becomes even more critical. You’re making model choices that affect your customers’ workflows. PADISO’s case studies show teams that built this capability early—they were able to iterate on accuracy and cost every quarter, while competitors were stuck with their original choice.
For teams in regulated industries (financial services, legal, healthcare), PADISO’s security audit and compliance work can help you audit your model choice against SOC 2, ISO 27001, and industry-specific requirements. We’ve helped teams in Sydney and across Australia build audit-ready AI workflows.
If you’re a fractional CTO or engineering leader building this for the first time, PADISO’s advisory services can accelerate your process. We’ve guided teams through model selection for everything from contract review to regulatory document processing to discovery workflows.
Summary
Choosing a default model for document review is not a one-time decision. It’s a process:
- Define your constraints (document types, accuracy floor, budget, latency, volume)
- Build a test set that represents your actual workload
- Test candidate models against that test set
- Measure accuracy, cost, latency, and consistency explicitly
- Make a decision based on trade-offs, not hype
- Implement with monitoring and a rollback plan
- Re-evaluate quarterly and annually as the landscape evolves
This framework is repeatable. You can run it every time a major new model is released. You won’t be locked in. You won’t overspend. And you’ll stay ahead of the curve as AI models improve.
Start this week. Define your constraints. Build your test set. By month-end, you’ll have a defensible default model and a process to keep it current through 2027.
The teams that do this early win. The teams that don’t end up reworking their choice in a panic when their model is deprecated or a better option emerges. Be the former.