Table of Contents
- Why Default Model Selection Matters
- The Core Evaluation Framework
- Benchmarking Code Quality and Accuracy
- Context Window and Token Economics
- Speed, Latency, and Developer Experience
- Cost Analysis and Scaling
- Security, Privacy, and Compliance
- Integration and Routing Strategies
- Running Your First Model Selection Cycle
- Maintaining Your Default Model Strategy
Why Default Model Selection Matters
Choosing a default model for code generation is not a one-time decision. It’s the foundation of your team’s development velocity, cost structure, and technical debt trajectory. A poorly chosen default will compound across thousands of code completions, pull requests, and architectural decisions over the next 12–24 months.
The problem is acute because the landscape shifts every quarter. In 2024 alone, we’ve seen Claude 3.5 Sonnet redefine what’s possible in structured reasoning, GPT-4o raise the bar on speed and cost efficiency, and Mistral’s open-weight models challenge the closed-garden assumption. By 2025 and beyond, that list will only grow.
This guide provides a repeatable framework—one you can run today, and run again in six months, twelve months, or whenever a major model release forces a re-evaluation. The framework is built for engineering teams at startups and mid-market companies who need to ship fast, control costs, and maintain security without hiring a dedicated AI ops team.
We’ve built this framework across dozens of engagements at PADISO, where we partner with founders and engineering leaders to ship AI products and modernise operations. The teams that win are those that treat model selection as a quarterly or bi-annual ritual, not a one-off choice.
The Core Evaluation Framework
A default model for code generation must excel across five dimensions:
- Code Quality: Does it generate correct, maintainable, idiomatic code?
- Efficiency: What’s the latency, token cost, and developer friction?
- Context Handling: Can it work with your real codebase and architectural patterns?
- Security and Compliance: Does it meet your data governance and regulatory requirements?
- Operational Fit: Can your team integrate it into your existing CI/CD, IDE, and workflow?
These dimensions are not equally weighted. Your weighting depends on your stage, industry, and team size. A fintech company pursuing SOC 2 or ISO 27001 compliance will weight security and data governance heavily. A seed-stage startup building a consumer app will prioritise speed and cost. A mid-market SaaS platform modernising with agentic AI will care deeply about context window and multi-turn reasoning.
The framework works because it decouples selection from implementation. You can evaluate models without committing to a vendor, and you can run the evaluation multiple times as the market evolves.
Why Not Just Use “The Best” Model?
The temptation is to default to the largest, most capable model available—Claude 3.5 Sonnet, GPT-4o, or the latest flagship. This is often a mistake.
Capability is not the primary constraint for most teams. What matters is the optimal model for your specific use case, at a cost and latency your team can sustain. Smaller models like Mistral Small or Claude 3 Haiku can match their larger siblings on narrowly scoped, structured code generation tasks when properly prompted, and they typically cost 60–80% less. Verify both claims against your own benchmark before relying on them.
Moreover, defaulting to a single model creates vendor lock-in and brittle systems. If OpenAI raises prices, changes their API, or deprioritises code generation, your entire pipeline breaks. Teams that build model-agnostic routing from day one stay nimble.
Benchmarking Code Quality and Accuracy
Code quality is the hardest dimension to measure, and the easiest to get wrong. You cannot rely on marketing claims or academic benchmarks. You must test against your own codebase, your own patterns, and your own standards.
Setting Up Your Benchmark Suite
Start by collecting 20–50 representative code generation tasks from your actual backlog or recent pull requests. These should span:
- Frontend logic: React components, state management, API integration
- Backend services: REST endpoints, database queries, middleware
- Infrastructure: Terraform, Docker, Kubernetes manifests
- Data pipelines: ETL jobs, transformations, validation logic
- Tests and utilities: Unit tests, fixtures, helper functions
For each task, define a clear spec: the input (a prompt or partial code), the expected output (a working function or module), and the success criteria (does it compile, does it pass tests, is it idiomatic?).
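One way to make each spec concrete is a small record per task plus a function that applies the success criteria. The sketch below is illustrative only: `BenchmarkTask`, `passes`, and the sample `slugify` task are hypothetical names, and running generated code with `exec` is assumed to happen inside a sandbox, never on a developer machine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkTask:
    """One benchmark item: the prompt plus the checks generated code must satisfy."""
    name: str
    prompt: str                          # input sent to every candidate model
    entry_point: str                     # function the generated code must define
    checks: Callable[[Callable], bool]   # success criteria expressed as behaviour checks


def passes(task: BenchmarkTask, generated_source: str) -> bool:
    """Return True if the source runs, defines the entry point, and satisfies the checks."""
    namespace: dict = {}
    try:
        exec(generated_source, namespace)   # assumption: executed only in a sandbox
        candidate = namespace[task.entry_point]
        return bool(task.checks(candidate))
    except Exception:
        return False                        # syntax error, missing name, or failed check


slugify_task = BenchmarkTask(
    name="slugify",
    prompt="Write slugify(title) that lowercases and joins words with hyphens.",
    entry_point="slugify",
    checks=lambda f: f("Hello World") == "hello-world" and f("  A  B ") == "a-b",
)

good = "def slugify(title):\n    return '-'.join(title.lower().split())\n"
bad = "def slugify(title):\n    return title.lower()\n"
print(passes(slugify_task, good), passes(slugify_task, bad))
```

A real suite would swap the lambda for your project's test runner. The point is that every task carries its own pass/fail definition before any model is called.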
Do not use generic benchmarks like OpenAI’s code generation guide or Anthropic’s code generation documentation as your sole source of truth. These are useful for understanding model capabilities, but they do not reflect your team’s standards, your codebase’s complexity, or your production constraints.
Running Your Benchmark
For each task, generate code from your candidate models using identical prompts and temperature settings. Then evaluate:
- Correctness: Does the code compile and run without errors?
- Test Pass Rate: What percentage of your test suite passes on the first try?
- Idiomatic Style: Does the code match your team’s conventions and patterns?
- Security Issues: Are there obvious vulnerabilities (SQL injection, unvalidated input, missing auth checks)?
- Maintainability: Would a junior engineer understand this code? Are there unnecessary abstractions?
Score each dimension on a 0–10 scale. Weight the scores based on your priorities. For a fintech team, security might be 40% of the score. For a consumer app team, speed and idiomatic style might dominate.
The result is a scorecard. Model A might score 8.5 overall, Model B 8.2, Model C 7.9. But look at the breakdown. If Model C scores 9.5 on security and 7.0 on speed, and you’re building a payment system, that’s a different story than if Model A is fast but weak on security.
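The weighting step can be sketched in a few lines. The weights and per-dimension scores below are made-up numbers for a hypothetical payments team, chosen only to show how a security-heavy weighting can reverse a ranking that a plain average would suggest.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine 0-10 dimension scores into one number using weights that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[dim] * w for dim, w in weights.items())


# Hypothetical weights: security counts for 40% of the total.
weights = {"correctness": 0.25, "tests": 0.15, "style": 0.10,
           "security": 0.40, "maintainability": 0.10}

# Hypothetical scores: Model A has the higher plain average, Model C the better security.
model_a = {"correctness": 9.0, "tests": 9.0, "style": 9.0,
           "security": 7.0, "maintainability": 9.0}
model_c = {"correctness": 8.0, "tests": 7.5, "style": 7.0,
           "security": 9.5, "maintainability": 7.5}

for name, scores in {"Model A": model_a, "Model C": model_c}.items():
    print(f"{name}: {weighted_score(scores, weights):.2f}")
```

With these inputs Model A scores 8.20 and Model C scores 8.38, even though Model A's unweighted average is higher.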
Common Pitfalls in Code Quality Benchmarking
Using one untuned prompt for all models: Keep the task, context, and temperature identical so the comparison stays fair, but tune the wording per model, because different models respond to different prompt styles. Claude tends to do well with detailed context and explicit reasoning steps. GPT-4 often excels with concise, direct prompts. Mistral models respond well to code examples and structured output formats.
Not testing edge cases: Models often fail on boundary conditions, error handling, and unusual inputs. Include tasks that test these scenarios explicitly.
Ignoring prompt engineering: A 10-minute investment in prompt refinement can shift a model’s score by 1–2 points. Use Google’s Gemini API documentation and Mistral’s code generation guide to learn model-specific prompt patterns.
Not testing with real context: Models perform better when given relevant context (existing code, architecture diagrams, style guides). Test with and without context to understand the impact.
Context Window and Token Economics
Context window—the amount of code and documentation a model can ingest in a single request—is a hidden multiplier on developer productivity.
A model with a 4k token context window forces you to send just the function you’re trying to write. A model with 100k or 200k tokens lets you send your entire service, its dependencies, and your style guide in a single request. On tasks that depend on surrounding code, that extra context can be worth 2–3 points on a 10-point quality scale.
Calculating Your Effective Context Needs
Audit your recent code generation requests. For each one, measure:
- Code context: How many tokens of existing code does the engineer include?
- Documentation: How many tokens of README, API specs, or architecture docs?
- Examples: How many tokens of example code or test cases?
- Prompt overhead: How many tokens for the actual generation request?
Sum these up. If your median request is 8k tokens (code + context + prompt), a 4k context model will force truncation or multiple round-trips. A 100k model will handle it comfortably.
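The audit can be reduced to a short script. The request rows below are invented sample data, and `reserved_output` reflects an assumption that the context window has to hold the response as well as the input.

```python
from statistics import median

# Hypothetical audit rows: token counts per request, split by the four categories above.
requests = [
    {"code": 5200, "docs": 1500, "examples": 900, "prompt": 400},
    {"code": 3000, "docs": 0, "examples": 600, "prompt": 300},
    {"code": 9000, "docs": 2500, "examples": 1200, "prompt": 500},
]


def fits(total_tokens: int, context_window: int, reserved_output: int = 1000) -> bool:
    """A request fits if its input plus the room reserved for output stays within the window."""
    return total_tokens + reserved_output <= context_window


totals = [sum(r.values()) for r in requests]
median_total = median(totals)

for window in (4_000, 100_000):
    share = sum(fits(t, window) for t in totals) / len(totals)
    print(f"{window:>7} window: median request {median_total} tokens, {share:.0%} fit")
```

In practice the counts would come from your provider's tokenizer or from usage fields in API responses rather than hand-entered numbers.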
Why does this matter? Because truncation forces you to make trade-offs:
- You omit architectural context, and the model generates code that doesn’t fit your system
- You omit test cases, and the model doesn’t understand your error-handling patterns
- You omit style examples, and the model generates code that requires rewrites
Each trade-off costs developer time. A model that costs 20% more but eliminates three rounds of revision is a net win.
Token Economics Across the Lifecycle
Context window also affects your token spend across a full development cycle. Consider this scenario:
Scenario A: 4k context model
- Initial generation: 3k tokens input, 1k tokens output = 4k tokens
- Revision 1: 3k tokens input (re-send context), 800 tokens output = 3.8k tokens
- Revision 2: 3k tokens input, 600 tokens output = 3.6k tokens
- Total: 11.4k tokens, three round-trips, 15 minutes developer time
Scenario B: 100k context model
- Initial generation: 8k tokens input (full context + tests + style guide), 1.2k tokens output = 9.2k tokens
- Revision 1: 8k tokens input, 900 tokens output = 8.9k tokens
- Total: 18.1k tokens, two round-trips, 5 minutes developer time
Scenario B uses 59% more tokens but saves one round-trip and 10 minutes of developer time per task. Over a year with 2,000 code generation tasks, that’s about 333 hours of developer time, or roughly $50k at a fully loaded rate of about $150 per hour.
When evaluating models, do not look at token cost in isolation. Look at token cost per completed task, including revisions.
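The per-task arithmetic from the two scenarios can be reproduced as follows. The token counts and minutes are the ones given above; the helper name `task_total` is just for illustration.

```python
def task_total(rounds: list[tuple[int, int]]) -> int:
    """Sum input and output tokens over every round-trip needed to finish one task."""
    return sum(inp + out for inp, out in rounds)


# (input_tokens, output_tokens) per round-trip, taken from the two scenarios.
scenario_a = [(3000, 1000), (3000, 800), (3000, 600)]
scenario_b = [(8000, 1200), (8000, 900)]

total_a, total_b = task_total(scenario_a), task_total(scenario_b)
extra_tokens = (total_b - total_a) / total_a        # relative token overhead of Scenario B
minutes_saved_per_task = 15 - 5                     # developer minutes, from the scenarios
hours_saved_per_year = 2000 * minutes_saved_per_task / 60

print(total_a, total_b, f"{extra_tokens:.0%}", round(hours_saved_per_year))
```

Swapping in your own round-trip counts gives a tokens-per-completed-task figure you can compare across candidate models.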
Checking Token Costs Across Providers
As of late 2024, the landscape looks roughly like this (prices change quarterly; verify with current documentation):
- GPT-4o: ~$0.005 per 1k input tokens, ~$0.015 per 1k output tokens. 128k context.
- Claude 3.5 Sonnet: ~$0.003 per 1k input tokens, ~$0.015 per 1k output tokens. 200k context.
- Mistral Large: ~$0.004 per 1k input tokens, ~$0.012 per 1k output tokens. 32k context (or 128k with extended).
- Gemini 2.0 Flash: ~$0.000075 per 1k input tokens, ~$0.0003 per 1k output tokens (that is, ~$0.075 and ~$0.30 per 1M). 1M context.
These prices are approximate and subject to change. Always verify with current AWS Bedrock documentation and provider pricing pages.
For code generation specifically, the input-to-output ratio matters more than absolute price. If you’re sending 8k tokens of context and getting 1k tokens of code, the input cost dominates. Models with lower input costs (Claude, Mistral) win. If you’re generating long functions or entire modules, output cost matters more.
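To see how the input-to-output ratio shifts where the money goes, here is a minimal sketch using one of the approximate price pairs listed above ($0.003 input and $0.015 output per 1k tokens). The function name and the two request shapes are illustrative assumptions.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_1k: float, output_per_1k: float) -> tuple[float, float]:
    """Return (total cost in dollars, share of that cost due to input tokens)."""
    input_cost = input_tokens / 1000 * input_per_1k
    output_cost = output_tokens / 1000 * output_per_1k
    total = input_cost + output_cost
    return total, input_cost / total


# Approximate late-2024 prices from the list above; verify before relying on them.
cost, input_share = request_cost(8000, 1000, input_per_1k=0.003, output_per_1k=0.015)
print(f"context-heavy request: ${cost:.3f}, {input_share:.0%} from input")

cost, input_share = request_cost(1000, 4000, input_per_1k=0.003, output_per_1k=0.015)
print(f"output-heavy request: ${cost:.3f}, {input_share:.0%} from input")
```

For the context-heavy shape, input accounts for most of the bill. For the output-heavy shape, it is a small fraction.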
Speed, Latency, and Developer Experience
A model that’s 10% cheaper but adds 2 seconds of latency to every code completion will degrade your team’s experience and slow your iteration velocity.
Latency has two components: time to first token (how long before the model starts responding) and tokens per second (how fast it generates). A model that takes 500ms to start but generates at 100 tokens/second will feel fast. A model that starts instantly but generates at 20 tokens/second will feel slow.
Measuring Latency in Your Context
Run your 20–50 benchmark tasks through each candidate model and measure:
- Time to first token: How long from request to first character of output?
- Total generation time: How long from request to complete output?
- Tokens per second: Calculate from total time and token count.
- End-to-end latency: Include network round-trip, your API gateway, and any post-processing.
Do this under realistic conditions: during your team’s peak hours, with your production API load, using your actual integration (IDE plugin, CLI tool, API endpoint).
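A minimal sketch of the measurement itself, assuming your integration exposes the response as an iterable of chunks. `fake_stream` is a hypothetical stand-in for a provider's streaming API, and each chunk is treated as one token for simplicity.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict[str, float]:
    """Consume a token stream and report time to first token, total time, and tokens/second."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter()   # first chunk arrived
        count += 1
    end = time.perf_counter()
    if first_token_at is None:                     # empty stream
        first_token_at = end
    generation_time = end - first_token_at
    return {
        "time_to_first_token_s": first_token_at - start,
        "total_time_s": end - start,
        "tokens": count,
        # Rate over the generation phase only, so a slow start does not hide a fast stream.
        "tokens_per_second": count / generation_time if generation_time > 0 else 0.0,
    }


def fake_stream(n_tokens: int, startup_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a provider's streaming response (hypothetical, for local testing)."""
    time.sleep(startup_s)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)


stats = measure_stream(fake_stream(n_tokens=20, startup_s=0.05, per_token_s=0.005))
print({k: round(v, 3) for k, v in stats.items()})
```

Wrapping the real client's stream in `measure_stream` from inside your IDE plugin or gateway captures end-to-end latency rather than the provider's own numbers.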
A 500ms difference might not sound like much, and the raw waiting time is indeed small: across 50 completions per day per engineer it comes to about 2 minutes per week per engineer, or roughly 20 minutes per week for a 10-person team. The bigger effect is on flow and feedback loops, which is where slow completions really cost you.
The Developer Experience Multiplier
Speed also affects perceived quality. A model that takes 3 seconds to generate code feels slow and unreliable, even if the code is correct. A model that takes 1 second feels snappy and trustworthy.
This is not merely psychological. Fast feedback loops improve decision-making. If a developer can generate, review, and revise code in 30 seconds, they’ll iterate more and build better mental models. If each cycle takes 5 minutes, they’ll iterate less and make fewer refinements.
When evaluating latency, also consider:
- Streaming vs. batch: Streaming (returning tokens as they’re generated) feels faster than waiting for a complete response, even if the total time is the same.
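- Caching: Do you have context that repeats across requests? Some providers (Anthropic, Google) offer prompt caching, which can cut latency and cost substantially for repeated context. Anthropic cites reductions of up to 85% in latency and up to 90% in cost for long prompts.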
- Batching: For non-interactive tasks (code review, test generation), batching requests can reduce per-request overhead.
If your team is using code generation in an IDE (GitHub Copilot, VS Code IntelliSense, JetBrains AI Assistant), latency is even more critical. A 2-second delay in an IDE feels like the tool is broken. A 500ms delay feels responsive.
Cost Analysis and Scaling
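Cost matters, but only in context. A model that costs 2x more per token but generates code that requires 50% fewer revisions can still be a net win on total cost of ownership, because fewer revision rounds mean fewer tokens re-sent and less developer time spent.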
Building a Cost Model
Start with your usage projections. How many code generation requests per engineer per day? How many engineers? How many tokens per request (input + output)?
For a 10-person engineering team:
- 50 code generation requests per engineer per day = 500 requests/day
- Average request: 5k input tokens + 1k output tokens = 6k tokens/request
- Total: 3M tokens/day (2.5M input + 0.5M output), or ~90M tokens/month (~75M input + ~15M output)
Now multiply each part by model pricing:
- GPT-4o: 75M input tokens @ $0.005/1k = $375; 15M output tokens @ $0.015/1k = $225. Total: $600/month
- Claude 3.5 Sonnet: 75M input tokens @ $0.003/1k = $225; 15M output tokens @ $0.015/1k = $225. Total: $450/month
- Mistral Large: 75M input tokens @ $0.004/1k = $300; 15M output tokens @ $0.012/1k = $180. Total: $480/month
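The same projection can be kept as a small script so it is easy to rerun when prices or usage change. This sketch starts from the per-request profile above (10 engineers, 50 requests each per day, 5k input and 1k output tokens) and assumes a 30-day month. The prices are the approximate figures quoted earlier in this guide.

```python
REQUESTS_PER_DAY = 10 * 50            # 10 engineers x 50 requests each
INPUT_TOKENS, OUTPUT_TOKENS = 5_000, 1_000
DAYS_PER_MONTH = 30                   # assumption: usage on every day of the month

# Approximate per-1k-token prices quoted earlier in this guide: (input, output).
PRICES = {
    "GPT-4o": (0.005, 0.015),
    "Claude 3.5 Sonnet": (0.003, 0.015),
    "Mistral Large": (0.004, 0.012),
}


def monthly_cost(input_per_1k: float, output_per_1k: float) -> float:
    """Monthly spend in dollars for the usage profile defined above."""
    monthly_requests = REQUESTS_PER_DAY * DAYS_PER_MONTH
    input_cost = monthly_requests * INPUT_TOKENS / 1000 * input_per_1k
    output_cost = monthly_requests * OUTPUT_TOKENS / 1000 * output_per_1k
    return input_cost + output_cost


for model, (inp, out) in PRICES.items():
    print(f"{model}: ${monthly_cost(inp, out):,.0f}/month")
```

Input and output volumes are priced separately here, which matters because output tokens cost several times more than input tokens at these rates.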
But this is incomplete. You also need to account for:
- Revision costs: If a model generates code that requires 30% more revisions, you’re paying 30% more in tokens.
- Integration costs: Some models require custom infrastructure (self-hosting, fine-tuning, API gateway). Others are plug-and-play.
- Latency costs: Slower models might require better caching or batching infrastructure, adding operational cost.
- Security and compliance costs: Some models require SOC 2 or ISO 27001 compliance. If you’re pursuing these certifications (as many teams are when they reach Series A), the compliance overhead matters.
At PADISO, we help teams navigate this with Security Audit support and AI Advisory Services. The teams that get this right are those that factor compliance into model selection early, not as an afterthought.
Scaling Cost
Your cost model should project forward 12–24 months. As your team grows and you add more use cases (not just code completion, but code review, test generation, documentation), token usage will grow.
If your token usage doubles, which model scales better? Some providers offer volume discounts. Some have hard rate limits that force you to upgrade to enterprise plans. Some have no public pricing, which means you’ll negotiate with sales, and the outcome depends on your leverage.
For startups, this is a critical consideration. A model that costs $500/month at 10 engineers might cost $50k/month at 100 engineers if growth pushes you past rate limits and onto an enterprise plan, whereas linear scaling would put it near $5k/month. A model whose price scales linearly with usage is the better bet.
Security, Privacy, and Compliance
For many teams—especially those in financial services, healthcare, or regulated industries—security and compliance are not secondary concerns. They are primary constraints.
Data Governance Questions
Before committing to a model, answer these questions:
- Data retention: Does the provider store your code? For how long? Can you opt out?
- Training data: Will your code be used to train future models? Can you opt out?
- Data location: Where are your requests processed? In which regions are they stored?
- Encryption: Is data encrypted in transit and at rest?
- Access controls: Who at the provider can access your code? Under what circumstances?
- Audit trails: Can you get logs of who accessed your data and when?
- Compliance certifications: Does the provider have SOC 2, ISO 27001, HIPAA, or other certifications relevant to your industry?
For a fintech team pursuing APRA CPS 234 compliance, these questions are non-negotiable. For a consumer app team, they’re less critical but still important.
Self-Hosted and Open-Weight Models
One option is to self-host an open-weight model. Models like Mistral 7B, Code Llama, or DeepSeek-Coder give you complete control over data and can run on your infrastructure.
The trade-off is operational overhead. Self-hosting requires:
- GPU infrastructure (cost: $500–$5,000/month depending on scale)
- Monitoring and maintenance (cost: 0.5 FTE engineer)
- Fine-tuning and optimisation (cost: 1–2 weeks of engineering time per model release)
For many teams, this overhead is not worth the security benefit. Managed services from reputable providers (OpenAI, Anthropic, Google, Mistral) offer strong security guarantees with far less operational overhead.
But for teams in highly regulated industries or with strict data residency requirements, self-hosting is the right call. This is especially true in Australia, where some teams have requirements to keep data on Australian soil. We help teams navigate this decision at PADISO, working across Sydney, Melbourne, Perth, and other Australian cities.
Compliance Audit Readiness
If you’re pursuing SOC 2 or ISO 27001 compliance, your model selection affects your audit readiness. Some models come with compliance documentation and audit support. Others require you to build it yourself.
When evaluating models, ask:
- Does the provider have SOC 2 Type II certification?
- Do they have ISO 27001 certification?
- Do they have a Data Processing Agreement (DPA) that covers your use case?
- Do they participate in audit processes (e.g., via Vanta)?
Managed providers such as OpenAI and Anthropic publish SOC 2 reports, but not every provider does, and a report may not cover every region or use case. If you’re pursuing compliance (and you should be by Series A), factor this into your model selection.
Integration and Routing Strategies
Choosing a default model does not mean choosing only one model. The most sophisticated teams use a routing strategy: a default model for most tasks, with fallback models for specific scenarios.
When to Route to Different Models
Consider routing to a different model when:
- Task complexity: A smaller, cheaper model handles simple completions. A larger model handles complex refactoring or architectural decisions.
- Context size: A model with a larger context window handles tasks that require full codebase context. A smaller model handles isolated functions.
- Latency requirements: An interactive IDE use case needs fast latency. A batch code review can tolerate slower models.
- Cost sensitivity: A high-volume, low-stakes task (generating boilerplate) can use a cheap model. A critical production task uses a more capable model.
- Compliance: A task involving sensitive data routes to a self-hosted or compliant model. A task with public data can use a public API.
For example:
- Default: Claude 3.5 Sonnet for most tasks (good balance of quality, speed, cost)
- Fallback 1: Mistral Small for simple completions and boilerplate (cheaper, faster)
- Fallback 2: GPT-4o for complex architectural decisions (slightly better reasoning)
- Fallback 3: Self-hosted Code Llama for tasks with sensitive financial data (compliant)
This routing strategy requires:
- A routing layer: Logic that determines which model to use based on task characteristics
- Unified prompts: Prompts that work across multiple models (with model-specific tweaks)
- Monitoring: Tracking which model was used, how long it took, and whether the output was useful
Building a Routing Layer
You can build routing logic into your IDE plugin, API gateway, or CI/CD pipeline. The logic might look like:
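def choose_model(task):
    """Pick a model from task traits; the model names are placeholders."""
    if task.is_sensitive:
        return "self-hosted-model"       # sensitive data stays on your infrastructure
    if task.context_tokens > 50_000:
        return "claude-large-context"    # needs full-codebase context
    if task.is_simple and task.is_low_stakes:
        return "mistral-small"           # cheap and fast for boilerplate
    return "claude-default"              # everything else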
This is simple to implement and gives you flexibility without operational overhead.
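The monitoring requirement can live in the same layer as the routing rules. The sketch below is one possible shape, not a prescribed design: `Task`, `route`, `run_task`, and the stub clients are hypothetical, and real clients would wrap each provider's SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    context_tokens: int = 0
    is_sensitive: bool = False


def route(task: Task) -> str:
    """Trimmed-down routing rules; the model names are placeholders."""
    if task.is_sensitive:
        return "self-hosted"
    if task.context_tokens > 50_000:
        return "large-context"
    return "default"


def run_task(task: Task, clients: dict[str, Callable[[str], str]],
             usage_log: list[dict]) -> str:
    """Route the task, call the chosen client, and record what happened for later review."""
    model = route(task)
    started = time.perf_counter()
    output = clients[model](task.prompt)
    usage_log.append({
        "model": model,
        "seconds": time.perf_counter() - started,
        "output_chars": len(output),
    })
    return output


# Stub clients stand in for real provider SDK calls.
clients = {name: (lambda prompt, n=name: f"[{n}] code for: {prompt}")
           for name in ("self-hosted", "large-context", "default")}

usage_log: list[dict] = []
run_task(Task("parse a CSV row"), clients, usage_log)
run_task(Task("refactor billing", context_tokens=80_000), clients, usage_log)
run_task(Task("mask card numbers", is_sensitive=True), clients, usage_log)
print([entry["model"] for entry in usage_log])
```

Keeping the log next to the routing decision means the data needed to refine the rules accumulates from the first day of use.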
Avoiding Over-Complexity
A common mistake is to build too complex a routing strategy too early. Start with a single default model. Once you have 3–6 months of usage data, identify patterns: which tasks fail most often? Which tasks take longest? Which tasks cost the most? Then introduce routing to address those patterns.
For a 10-person team, a single default model with one or two fallbacks is usually sufficient. For a 100-person company with diverse use cases, you might have 4–6 models in your routing pool.
Running Your First Model Selection Cycle
Now that you understand the dimensions, here’s how to run a rigorous model selection process for your team.
Phase 1: Preparation (1 week)
- Assemble your benchmark suite: Collect 30–50 representative code generation tasks from your backlog. Include frontend, backend, infrastructure, and test tasks.
- Define scoring criteria: Weight code quality (40%), latency (25%), cost (20%), and security (15%). Adjust weights based on your priorities.
- Select candidate models: Choose 3–5 models to evaluate. Include your current default (if you have one), a smaller/cheaper alternative, and a larger/more capable alternative.
- Prepare your environment: Set up API access for each model. Standardise your prompts and temperature settings.
Phase 2: Evaluation (2–3 weeks)
- Generate code: Run each benchmark task through each candidate model. Collect the outputs.
- Score code quality: Have 2–3 engineers independently score each output on correctness, idiomatic style, and security. Average the scores.
- Measure latency: Record time to first token, total generation time, and tokens per second for each task.
- Calculate costs: Sum the input and output tokens for each task and model. Project to monthly and annual costs.
- Document results: Create a scorecard for each model showing quality, latency, and cost.
Phase 3: Analysis (1 week)
- Compare scorecards: Rank models by overall score (using your weighted criteria).
- Identify trade-offs: Which model is fastest? Which is cheapest? Which has the best code quality? Are these the same model?
- Drill into edge cases: For the top 2–3 models, run additional benchmarks on edge cases (error handling, security, complex refactoring).
- Assess integration effort: How hard is it to integrate each model into your IDE, CLI, or API?
- Validate with your team: Have engineers who will use the model review the results and provide feedback.
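Ranking by weighted criteria needs latency and cost on the same 0–10 scale as the quality and security scores. The sketch below uses the 40/25/20/15 weights from Phase 1 and a simple min-max rescaling. Both the rescaling choice and the three result rows are assumptions for illustration, and min-max scaling exaggerates small gaps when the candidates are close.

```python
WEIGHTS = {"quality": 0.40, "latency": 0.25, "cost": 0.20, "security": 0.15}

# Hypothetical results: quality and security are 0-10 scores;
# latency (seconds) and cost (dollars per month) are raw measurements.
results = {
    "model_a": {"quality": 8.5, "latency_s": 1.2, "cost_usd": 600, "security": 7.0},
    "model_b": {"quality": 8.0, "latency_s": 0.8, "cost_usd": 450, "security": 8.0},
    "model_c": {"quality": 7.5, "latency_s": 2.0, "cost_usd": 480, "security": 9.5},
}


def lower_is_better(value: float, values: list[float]) -> float:
    """Map a raw measurement to 0-10, where the smallest observed value scores 10."""
    lo, hi = min(values), max(values)
    return 10.0 if hi == lo else 10.0 * (hi - value) / (hi - lo)


def overall(name: str) -> float:
    """Weighted overall score for one model, using the rescaled latency and cost."""
    r = results[name]
    latencies = [m["latency_s"] for m in results.values()]
    costs = [m["cost_usd"] for m in results.values()]
    parts = {
        "quality": r["quality"],
        "latency": lower_is_better(r["latency_s"], latencies),
        "cost": lower_is_better(r["cost_usd"], costs),
        "security": r["security"],
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


ranking = sorted(results, key=overall, reverse=True)
print([(name, round(overall(name), 2)) for name in ranking])
```

With these inputs the fastest, cheapest model ranks first despite not having the best quality or security score, which is the kind of trade-off worth reviewing with the team before deciding.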
Phase 4: Decision (1 week)
- Choose your default: Based on your analysis, select a default model. Document the decision and the criteria that led to it.
- Plan your fallbacks: Identify 1–2 fallback models for specific scenarios (complex tasks, sensitive data, high-latency tolerance).
- Set up monitoring: Configure logging to track which model is used, latency, token cost, and user satisfaction.
- Communicate to your team: Announce the decision, explain the reasoning, and provide guidance on when to use fallbacks.
Phase 5: Rollout (2–4 weeks)
- Integrate into your stack: Update your IDE plugins, CLI tools, API endpoints, and CI/CD pipelines to use the new default model.
- Run a pilot: Have a subset of your team (2–3 engineers) use the new model for 1–2 weeks. Collect feedback.
- Iterate: Based on pilot feedback, tune your prompts, routing logic, or model selection.
- Full rollout: Roll out to your entire team. Monitor latency, cost, and code quality metrics.
- Document learnings: Record what worked, what didn’t, and what you’d do differently next time.
Maintaining Your Default Model Strategy
Model selection is not a one-time decision. The landscape evolves constantly. New models are released every quarter. Prices change. Capabilities improve. Your team’s needs shift.
Quarterly Review Ritual
Every quarter (or whenever a major model release happens), run a lightweight review:
- Benchmark new models: Run your benchmark suite against any new models that launched. How do they compare to your current default?
- Check pricing changes: Have any providers changed their pricing? Do the economics still favour your current default?
- Review usage metrics: Which tasks use your default? Which use fallbacks? Are there patterns you should address?
- Gather team feedback: Ask your engineers: Is the model meeting your needs? Are there pain points? What would make it better?
- Decide: Do you stay with your current default, or switch? If you switch, what’s the migration plan?
This review should take 3–5 days of engineering time, not weeks. You’re not re-running your full evaluation. You’re checking whether your assumptions still hold.
Monitoring and Observability
To make informed decisions, you need visibility into how your models are performing in production.
Track:
- Latency: Time to first token, total generation time, tokens per second. Set alerts if latency degrades.
- Cost: Input tokens, output tokens, cost per task. Track trends over time.
- Quality: Code review comments, test failures, security issues. Are they increasing or decreasing?
- Usage: Which models are being used for which tasks? Are your routing rules working as intended?
- User satisfaction: Periodically survey your team. Are they happy with the model? Would they want to switch?
Tools like Vanta (which we integrate with for compliance audits) can help with logging and audit trails. For code generation specifically, you can build custom monitoring into your IDE plugin or API gateway.
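A latency alert can start as a comparison between a recent window and a baseline window. The sketch below uses the 95th percentile and a 20% tolerance. Both choices are assumptions to tune for your team, and the sample data is synthetic.

```python
from statistics import quantiles


def p95(samples: list[float]) -> float:
    """95th percentile of latency samples in seconds (needs at least two samples)."""
    return quantiles(samples, n=20)[-1]   # last of the 19 cut points is the 95th percentile


def latency_degraded(baseline: list[float], recent: list[float],
                     tolerance: float = 0.20) -> bool:
    """Flag a regression when recent p95 exceeds baseline p95 by more than `tolerance`."""
    return p95(recent) > p95(baseline) * (1 + tolerance)


# Synthetic samples: the "slower" window is the baseline scaled up by 50%.
baseline = [0.80 + 0.01 * i for i in range(40)]
slower = [s * 1.5 for s in baseline]

print(latency_degraded(baseline, baseline), latency_degraded(baseline, slower))
```

Percentiles are used instead of the mean because a handful of very slow completions is what engineers notice, and an average can hide them.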
When to Re-Evaluate Fully
Do a full re-evaluation (Phases 1–4 above) when:
- A major model release changes the landscape: GPT-5, Claude 4, or a new open-weight model that’s 2+ generations ahead of what you’re using.
- Your usage patterns change significantly: You’re now generating 10x more code, or using code generation for new domains (e.g., infrastructure-as-code, SQL queries).
- Your team grows or changes: You’re now 50 people instead of 10, or you’ve hired specialists in areas where code generation is critical.
- Your business requirements shift: You’re pursuing compliance (SOC 2, ISO 27001), expanding to new regions, or entering regulated industries.
- Your current model underperforms: Code quality is declining, latency is unacceptable, or costs are out of control.
When these events happen, re-run the full evaluation. The landscape moves fast. A model that was best-in-class 6 months ago might be outdated now.
Building a Sustainable Model Selection Practice
Choosing a default model for code generation is the foundation of your team’s AI infrastructure. But it’s not an isolated decision. It’s part of a larger strategy around AI readiness, platform engineering, and technical leadership.
At PADISO, we work with teams across Australia—from Sydney startups to Melbourne enterprises to Darwin defence contractors—to build sustainable AI practices. This includes:
- AI Strategy and Readiness: Understanding where code generation fits into your product and operations strategy
- Platform Design and Engineering: Building the infrastructure to support code generation at scale
- Fractional CTO leadership: Providing the technical guidance to navigate model selection, integration, and scaling
- Security and Compliance: Ensuring your model choices support your compliance goals
The teams that win are those that treat model selection as a strategic capability, not a tactical choice. They build processes, monitor outcomes, and iterate continuously.
If you’re building AI products or modernising your operations with code generation, book a call with our team. We can help you design a model selection process tailored to your team, industry, and stage.
Key Takeaways
- Default model selection matters: It compounds across thousands of completions and affects your velocity, cost, and code quality.
- Use a repeatable framework: Evaluate models across code quality, latency, cost, context, and security. Weight the dimensions based on your priorities.
- Benchmark against your own code: Don’t rely on marketing claims or academic benchmarks. Test against your real codebase and standards.
- Account for total cost of ownership: A model that costs more but requires fewer revisions is often a net win.
- Build in flexibility: Use a default model with 1–2 fallbacks for specific scenarios. Avoid over-complexity early.
- Monitor and iterate: Track latency, cost, and quality metrics. Review quarterly and re-evaluate fully when the landscape shifts.
- Plan for scale: Your model selection should support your team’s growth from 10 engineers to 100+ without major re-architecture.
The landscape will continue to evolve through 2025, 2026, and beyond. New models will be released. Prices will change. Your team’s needs will shift. By building a repeatable process now, you’ll stay ahead of the curve and make model selection decisions that compound over time.
Next Steps
- Assemble your benchmark suite this week: Collect 30–50 representative code generation tasks from your backlog.
- Identify your candidate models: Choose 3–5 models to evaluate based on your priorities and constraints.
- Run Phase 1 (Preparation): Spend 1 week setting up your evaluation environment and standardising your prompts.
- Start Phase 2 (Evaluation): Begin generating code and scoring outputs. This will take 2–3 weeks.
- Schedule a quarterly review: Set a reminder to re-evaluate your model selection every three months, or whenever a major model release happens.
If you need help designing your evaluation process, integrating models into your stack, or scaling code generation across your team, reach out to PADISO. We’ve built this framework with dozens of teams and can help you adapt it to your context.
Your default model choice is too important to leave to chance. Build a process, measure outcomes, and iterate. That’s how you win with AI.