Table of Contents
- What are agentic sales workflows and why they matter
- The architecture of production-grade sales agents
- Tool design and integration patterns
- State management and memory in sales agents
- Evaluation and observability at scale
- Common production failures and how to avoid them
- Human-in-the-loop workflows and override patterns
- Cost control and efficiency optimisation
- Security, compliance, and audit readiness
- Deployment and operational patterns
What are agentic sales workflows and why they matter
Agentic sales workflows are autonomous systems that handle repetitive sales tasks (account research, outreach drafting, CRM updates, lead scoring, and next-action planning) without human intervention at every step. Unlike traditional chatbots or rule-based automation, AI agents autonomously plan and act with tools to achieve defined goals, deciding which actions to take based on real-time context and feedback.
The business case is straightforward: a sales team of 20 people spends roughly 40% of their time on non-selling work—data entry, research, email drafting, follow-up scheduling. That’s 8 people’s worth of capacity locked in admin. A production agentic sales workflow can reclaim 60–80% of that time, freeing your sales team to focus on relationship-building, negotiation, and closing.
But moving from proof-of-concept to production is where most teams stumble. The gap between a working prototype and a system handling 1,000+ leads per week with 99%+ accuracy is vast. It involves architecture decisions, tool integration complexity, state management, evaluation frameworks, and operational discipline that many teams underestimate.
At PADISO, we’ve built and scaled agentic sales systems across fintech, SaaS, and enterprise software companies. The patterns we’ve learned—and the failures we’ve debugged—form the backbone of this guide.
The architecture of production-grade sales agents
Core agent loop and orchestration
A production agentic sales workflow sits on top of a core agent loop: observe → plan → act → reflect. The agent observes the current state (a lead record, email thread, opportunity stage). It plans the next action (research competitor, draft outreach, update CRM). It acts (calls tools, integrates with external systems). It reflects on the outcome and decides whether to continue, escalate, or hand off to a human.
The orchestration layer determines how this loop scales. Microsoft’s AI agent design patterns guide outlines several patterns: sequential (one action at a time), parallel (multiple agents working on different aspects), hierarchical (a coordinator agent delegating to specialist agents), and reactive (responding to external events).
For sales workflows, a hierarchical pattern works best in production:
- Coordinator agent: Receives a lead or opportunity, decides the overall workflow (research → outreach → follow-up).
- Research agent: Gathers company data, funding history, recent news, decision-maker details.
- Outreach agent: Drafts personalised emails, LinkedIn messages, or call scripts.
- CRM agent: Updates Salesforce, HubSpot, or Pipedrive with findings and next actions.
- Escalation agent: Identifies edge cases (high-value deals, complex scenarios) and flags them for human review.
This separation of concerns means each agent can be trained, evaluated, and iterated independently. A research agent can use different tools and models than an outreach agent. If the research agent fails on 5% of lookups, you can fix it without touching the outreach logic.
Model selection and routing
Production systems rarely use a single model. The best teams use model routing: lightweight models for simple tasks, heavier models for complex reasoning.
- Fast-path tasks (lead classification, CRM updates, scheduling): Claude Haiku, GPT-4o mini, or Gemini 1.5 Flash. Cost: $0.0008–$0.003 per 1K input tokens.
- Medium-complexity tasks (email drafting, objection handling): Claude 3.5 Sonnet or GPT-4 Turbo. Cost: $0.003–$0.01 per 1K input tokens.
- Complex reasoning (multi-step research, deal strategy): Claude 3 Opus or GPT-4. Cost: $0.015–$0.03 per 1K input tokens.
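For a sales workflow processing 1,000 leads per week, routing matters. If 80% of leads hit the fast path (Haiku), 15% hit the medium path (Sonnet), and 5% hit the complex path (Opus), your average token cost per lead drops by well over 60% versus using Opus for everything. Taking the low end of each price range above, the blended rate works out to roughly $0.0018 per 1K input tokens against $0.015 for Opus alone, before accounting for the classifier’s own overhead.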
Implementation: Use a classifier agent that examines the lead data (company size, industry, deal stage) and returns a routing decision. LangChain’s agent documentation and Anthropic’s tools guide both show patterns for conditional routing.
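The routing arithmetic is easy to check with a short sketch. The prices are the illustrative low-end input prices from the tiers above, not current list prices, and `route_lead`, its thresholds, and the lead fields (`deal_value`, `stage`, `needs_outreach`) are hypothetical stand-ins for whatever your classifier actually inspects.

```python
# Illustrative input prices in USD per 1K tokens, taken from the tiers above.
PRICE_PER_1K = {"fast": 0.0008, "medium": 0.003, "complex": 0.015}


def route_lead(lead: dict) -> str:
    """Pick a model tier from simple lead attributes (hypothetical rules)."""
    if lead.get("deal_value", 0) >= 100_000 or lead.get("stage") == "negotiation":
        return "complex"
    if lead.get("needs_outreach", False):
        return "medium"
    return "fast"


def blended_price(mix: dict) -> float:
    """Weighted average price per 1K tokens for a traffic mix that sums to 1."""
    return sum(share * PRICE_PER_1K[tier] for tier, share in mix.items())


mix = {"fast": 0.80, "medium": 0.15, "complex": 0.05}
blended = blended_price(mix)
saving = 1 - blended / PRICE_PER_1K["complex"]
print(f"blended: ${blended:.5f} per 1K tokens, saving vs complex-only: {saving:.0%}")
```

A rule-based router like this is the cheapest option. A small classifier model can replace `route_lead` later without changing the cost calculation.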
Error handling and fallback chains
Production agents fail. A tool times out. An API returns unexpected data. A model hallucinates. The difference between a prototype and production is how you handle failure.
Implement a fallback chain:
- Primary action: Agent calls Tool A (e.g., ZoomInfo API for company research).
- Timeout fallback (5 seconds): If no response, use cached data or a secondary tool (e.g., LinkedIn API).
- Parse error fallback: If the response is malformed, retry with a simpler model prompt or a different tool.
- Escalation fallback: If both tools fail, flag the lead for human research and move to the next lead.
Do not retry infinitely. Set a retry budget: 3 attempts per tool, then escalate. Infinite retries waste tokens and delay processing.
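A minimal sketch of such a fallback chain follows, assuming each tool is a plain Python callable that raises `TimeoutError` when it is too slow and `ValueError` when the response cannot be parsed. The two lookup functions are hypothetical stand-ins for real API clients, which would raise their own exception types.

```python
from typing import Callable

MAX_ATTEMPTS_PER_TOOL = 3  # retry budget from the text


def run_with_fallbacks(tools: list[Callable[[str], dict]], company: str) -> dict:
    """Try each tool in order, up to MAX_ATTEMPTS_PER_TOOL times, then escalate."""
    attempts_log = []
    for tool in tools:
        for attempt in range(1, MAX_ATTEMPTS_PER_TOOL + 1):
            try:
                data = tool(company)
                return {"status": "ok", "source": tool.__name__, "data": data}
            except (TimeoutError, ValueError) as exc:
                # TimeoutError: tool too slow; ValueError: malformed response.
                attempts_log.append((tool.__name__, attempt, type(exc).__name__))
    # Every tool exhausted its budget: hand the lead to a human.
    return {"status": "escalate", "reason": "all tools failed", "attempts": attempts_log}


def primary_lookup(company: str) -> dict:
    """Hypothetical primary tool that always times out in this demo."""
    raise TimeoutError("simulated timeout")


def secondary_lookup(company: str) -> dict:
    """Hypothetical secondary tool returning canned data."""
    return {"company": company, "employees": 250}


result = run_with_fallbacks([primary_lookup, secondary_lookup], "Acme Corp")
print(result["status"], result["source"])
```

The attempts log is kept so that an escalated lead arrives with a record of what was already tried.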
Tool design and integration patterns
Designing tools for agentic use
Tools are the bridge between an agent and the real world. A poorly designed tool will cause the agent to fail, hallucinate, or waste tokens. A well-designed tool is clear, bounded, and returns structured data.
Good tool design follows these principles:
1. Single responsibility: One tool does one thing. Instead of a generic “search the web” tool, create specific tools: search_company_news, fetch_linkedin_profile, lookup_funding_history. The agent can then decide which tool to use based on its goal.
2. Clear input schema: Define exactly what inputs the tool accepts. Use JSON Schema with descriptions.
{
"name": "search_company_news",
"description": "Searches recent news about a company. Returns up to 10 articles from the past 90 days.",
"input_schema": {
"type": "object",
"properties": {
"company_name": {
"type": "string",
"description": "The company name to search for (e.g., 'Acme Corp')"
},
"limit": {
"type": "integer",
"description": "Number of articles to return (1-10). Defaults to 5.",
"default": 5
}
},
"required": ["company_name"]
}
}
3. Predictable output: Return structured JSON, not prose. The agent can parse structured data reliably.
{
"articles": [
{
"title": "Acme Corp raises $50M Series B",
"source": "TechCrunch",
"published_date": "2024-11-15",
"url": "https://techcrunch.com/..."
}
],
"total_found": 12,
"query_time_ms": 245
}
4. Timeout and rate limits: Every tool should have a hard timeout (5–10 seconds for API calls). Document rate limits. If a tool has a limit of 100 calls per minute, the agent needs to know this to avoid throttling.
5. Error responses: Return structured errors, not exceptions.
{
"error": "rate_limit_exceeded",
"retry_after_seconds": 30,
"message": "API rate limit hit. Try again in 30 seconds."
}
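The five principles can be combined in one tool function. This is a sketch only: `_fetch_articles` is a hypothetical stand-in for a real news API client, and the `invalid_input` and `timeout` error codes are assumed names in the same style as the error payload above.

```python
def _fetch_articles(company_name: str, limit: int) -> list:
    """Hypothetical provider call; replace with a real news API client."""
    sample = [{"title": f"{company_name} raises $50M Series B", "source": "TechCrunch"}]
    return sample[:limit]


def search_company_news(company_name: str, limit: int = 5) -> dict:
    """Validate inputs against the schema and always return a structured dict."""
    if not isinstance(company_name, str) or not company_name.strip():
        return {"error": "invalid_input", "message": "company_name must be a non-empty string."}
    if not isinstance(limit, int) or not 1 <= limit <= 10:
        return {"error": "invalid_input", "message": "limit must be an integer from 1 to 10."}
    try:
        articles = _fetch_articles(company_name.strip(), limit)
    except TimeoutError:
        # Convert the exception into data the agent can reason about.
        return {"error": "timeout", "message": "News provider did not respond in time."}
    return {"articles": articles, "total_found": len(articles)}


print(search_company_news("Acme Corp"))
print(search_company_news("Acme Corp", limit=50))
```

Because failures come back as data rather than exceptions, the agent can read the error code and choose a fallback instead of crashing mid-workflow.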
Integrating with sales platforms
Most sales workflows need to read from and write to a CRM (Salesforce, HubSpot, Pipedrive). This integration is where production systems often fail.
Common pitfalls:
- No authentication caching: Every tool call re-authenticates with the CRM. This is slow and burns API calls.
- No field mapping: The agent doesn’t know which CRM field corresponds to “decision maker title” or “annual revenue.”
- No conflict resolution: Two agents write to the same lead simultaneously, and the second write overwrites the first.
- No audit trail: You can’t see what the agent wrote or why.
Production patterns:
- Centralised CRM client: Create a single CRM client that handles authentication, token refresh, and connection pooling. All agents use this client.
- Field schema registry: Maintain a mapping of logical fields to CRM fields. This decouples the agent logic from CRM specifics.
CRM_FIELD_MAPPING = {
    "company_name": "Account.Name",
    "decision_maker": "Contact.Title",
    "annual_revenue": "Account.AnnualRevenue",
    "last_contact_date": "Account.LastActivityDate",
    "next_action": "Opportunity.NextStep"
}
- Optimistic locking: When writing to a lead, include the last-modified timestamp. If the timestamp doesn’t match, the write fails, and the agent retries with fresh data.
- Write audit trail: Every agent write logs what was written, when, and why. Store this in a separate table or a CRM custom field.
{
"timestamp": "2024-11-20T14:32:15Z",
"agent": "outreach_agent",
"action": "draft_email",
"fields_updated": ["next_action", "email_draft"],
"reason": "Personalised outreach based on recent Series B funding"
}
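The field registry and the audit trail fit together naturally: translate logical names to CRM names at the boundary, and record the logical names in the audit entry. The sketch below assumes a two-field subset of the mapping, and the helper names `to_crm_payload` and `audit_entry` are hypothetical.

```python
from datetime import datetime, timezone

# Subset of the logical-to-CRM mapping, for illustration.
CRM_FIELD_MAPPING = {
    "company_name": "Account.Name",
    "annual_revenue": "Account.AnnualRevenue",
}


def to_crm_payload(logical_update: dict) -> dict:
    """Translate logical field names to CRM field names; reject unknown fields."""
    unknown = set(logical_update) - set(CRM_FIELD_MAPPING)
    if unknown:
        raise KeyError(f"No CRM mapping for: {sorted(unknown)}")
    return {CRM_FIELD_MAPPING[name]: value for name, value in logical_update.items()}


def audit_entry(agent: str, action: str, logical_update: dict, reason: str) -> dict:
    """Build an audit record in the same shape as the example above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "action": action,
        "fields_updated": sorted(logical_update),
        "reason": reason,
    }


update = {"company_name": "Acme Corp", "annual_revenue": 12_000_000}
payload = to_crm_payload(update)
entry = audit_entry("crm_agent", "update_lead_record", update, "Research findings")
print(payload)
print(entry["fields_updated"])
```

Rejecting unmapped fields up front means a typo in an agent's output fails loudly instead of silently writing nothing.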
Building a tool library
Production sales agents typically use 8–15 tools. Here’s a typical set:
Research tools:
- search_company_news (Crunchbase, news APIs)
- fetch_linkedin_profile (LinkedIn API or scraping service)
- lookup_funding_history (Crunchbase API)
- search_competitor_analysis (Semrush, SimilarWeb APIs)
- get_decision_maker_contact (Hunter.io, RocketReach APIs)
Outreach tools:
- draft_email (local LLM call with templates)
- personalise_message (inserts company-specific details)
- check_email_deliverability (SendGrid, Mailgun APIs)
CRM tools:
- read_lead_record (Salesforce/HubSpot read)
- update_lead_record (Salesforce/HubSpot write)
- create_task (schedule follow-up)
- log_activity (record research findings)
Utility tools:
- get_current_time (for scheduling)
- format_phone_number (normalise formats)
- check_domain_validity (validate email domains)
Each tool should have a test suite (unit tests for logic, integration tests for API calls) and monitoring (error rates, latency, cost per call).
State management and memory in sales agents
The memory problem
A sales workflow might take days to complete. An agent researches a lead on Monday. On Tuesday, it drafts an email. On Wednesday, it follows up. Without proper state management, the agent on Wednesday doesn’t know what happened on Monday.
This is the memory problem. LLMs have finite context windows (8K–200K tokens depending on the model). You can’t fit a month of lead history into the context. You need external memory.
Structured session state
The simplest pattern is a structured session state: a JSON object that tracks the lead, the workflow stage, and the actions taken so far.
{
"lead_id": "lead_12345",
"company_name": "Acme Corp",
"workflow_stage": "outreach_drafted",
"created_at": "2024-11-18T10:00:00Z",
"last_updated": "2024-11-20T14:32:15Z",
"actions": [
{
"timestamp": "2024-11-18T10:00:00Z",
"agent": "research_agent",
"action": "research_company",
"findings": {
"founded": 2015,
"employees": 250,
"recent_news": "Series B funding announced",
"decision_makers": ["Alice Johnson (CEO)", "Bob Smith (VP Sales)"]
}
},
{
"timestamp": "2024-11-20T14:32:15Z",
"agent": "outreach_agent",
"action": "draft_email",
"output": "Hi Alice, I saw that Acme just raised Series B..."
}
]
}
Store this in a database (PostgreSQL, DynamoDB, or Firestore). When an agent starts working on a lead, fetch the session state, pass it to the agent, and update it when the agent finishes.
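The fetch, pass, update cycle can be sketched with a small store class. This example uses an in-memory SQLite database purely as a stand-in for PostgreSQL, DynamoDB, or Firestore, and the `SessionStore` class and its table layout are assumptions for illustration.

```python
import json
import sqlite3


class SessionStore:
    """Persist one JSON session document per lead."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (lead_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def load(self, lead_id: str) -> dict:
        row = self.conn.execute(
            "SELECT state FROM sessions WHERE lead_id = ?", (lead_id,)
        ).fetchone()
        if row is None:
            # First time this lead is seen: start an empty session.
            return {"lead_id": lead_id, "workflow_stage": "new", "actions": []}
        return json.loads(row[0])

    def save(self, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (lead_id, state) VALUES (?, ?)",
            (state["lead_id"], json.dumps(state)),
        )
        self.conn.commit()


store = SessionStore(sqlite3.connect(":memory:"))
state = store.load("lead_12345")
state["actions"].append({"agent": "research_agent", "action": "research_company"})
state["workflow_stage"] = "researched"
store.save(state)
print(store.load("lead_12345")["workflow_stage"])
```

Storing the whole session as one document keeps reads and writes simple. The trade-off is that concurrent writers need the locking discussed later in this guide.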
Contextual summarisation
As a session grows, the full history becomes too large to pass to the model. Use contextual summarisation: summarise old actions into a narrative, keep recent actions in full detail.
{
"lead_id": "lead_12345",
"summary": "Acme Corp (Series B, 250 employees) was researched on Nov 18. Key contacts: Alice Johnson (CEO), Bob Smith (VP Sales). Recent focus: Series B funding announced in Oct 2024.",
"recent_actions": [
{
"timestamp": "2024-11-20T14:32:15Z",
"agent": "outreach_agent",
"action": "draft_email",
"output": "Hi Alice, I saw that Acme just raised Series B..."
}
]
}
When you summarise, you lose detail, but you keep the essential context. The agent can still reason about the lead, and the model has room in its context window for new reasoning.
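A compaction step along these lines can run whenever the session grows past a size limit. In this sketch `naive_summary` is a deliberately simple placeholder for an LLM summarisation call, and `compact_session` with its `keep_recent` parameter is a hypothetical helper.

```python
def naive_summary(actions: list) -> str:
    """Stand-in for an LLM summarisation call: one short clause per old action."""
    return "; ".join(
        f"{a['agent']} did {a['action']} on {a['timestamp'][:10]}" for a in actions
    )


def compact_session(state: dict, keep_recent: int = 1, summarise=naive_summary) -> dict:
    """Fold all but the last `keep_recent` actions into the running summary."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    actions = state.get("actions", [])
    old, recent = actions[:-keep_recent], actions[-keep_recent:]
    parts = [state.get("summary", ""), summarise(old) if old else ""]
    return {
        "lead_id": state["lead_id"],
        "summary": " ".join(p for p in parts if p),
        "recent_actions": recent,
    }


state = {
    "lead_id": "lead_12345",
    "actions": [
        {"timestamp": "2024-11-18T10:00:00Z", "agent": "research_agent", "action": "research_company"},
        {"timestamp": "2024-11-20T14:32:15Z", "agent": "outreach_agent", "action": "draft_email"},
    ],
}
compacted = compact_session(state)
print(compacted["summary"])
```

Keeping the original full history in cold storage is worth considering, since a summary cannot be expanded back into the detail it replaced.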
Conversation history and turn tracking
For multi-turn workflows (e.g., agent sends an email, prospect replies, agent responds), maintain a conversation history with turn tracking.
{
"conversation_id": "conv_67890",
"lead_id": "lead_12345",
"turns": [
{
"turn_number": 1,
"timestamp": "2024-11-20T14:32:15Z",
"actor": "agent",
"message": "Hi Alice, I saw that Acme just raised Series B...",
"tool_calls": ["draft_email"]
},
{
"turn_number": 2,
"timestamp": "2024-11-21T09:15:00Z",
"actor": "prospect",
"message": "Thanks for reaching out. We're always interested in learning about new solutions."
},
{
"turn_number": 3,
"timestamp": "2024-11-21T10:00:00Z",
"actor": "agent",
"message": "Great! I'd love to show you how we've helped companies like...",
"tool_calls": ["personalise_message", "schedule_call"]
}
]
}
This structure makes it easy for the agent to understand the conversation context and respond appropriately.
Evaluation and observability at scale
Why evaluation matters
McKinsey’s research on one year of agentic AI emphasises that rigorous evaluation is what separates a working agent from a production agent. You need to measure accuracy, latency, cost, and business impact.
Without evaluation, you’re flying blind. You deploy an agent, and two weeks later, you discover it’s been sending personalised emails to the wrong decision makers. Or it’s taking 30 seconds per lead when it should take 3 seconds. Or it’s costing $0.50 per lead when your budget is $0.10.
Building evaluation datasets
Start with a manual evaluation dataset: 100–500 leads that you’ve already researched and validated. For each lead, capture:
- Input: The raw lead data (name, company, title, email).
- Ground truth: The correct research findings, decision maker, email subject, call-to-action.
- Metadata: Lead difficulty (easy, medium, hard), industry, company size.
Use this dataset to evaluate your agent before production. Run the agent on every lead in the dataset. Compare the agent’s output to ground truth. Calculate:
- Accuracy: % of leads where the agent’s research matches ground truth.
- Completeness: % of leads where the agent found all required information.
- Latency: Average time per lead.
- Cost: Average token cost per lead.
{
"evaluation_run": "2024-11-20_production_candidate_v3",
"dataset_size": 250,
"results": {
"research_accuracy": 0.94,
"decision_maker_accuracy": 0.87,
"email_quality_score": 0.91,
"avg_latency_seconds": 4.2,
"avg_cost_per_lead": 0.087,
"total_cost": 21.75
},
"by_difficulty": {
"easy": {"accuracy": 0.98, "latency": 2.1},
"medium": {"accuracy": 0.93, "latency": 4.5},
"hard": {"accuracy": 0.82, "latency": 7.2}
}
}
Set thresholds: if accuracy drops below 90%, don’t deploy. If cost per lead exceeds budget, optimise the model routing.
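The metrics and the deployment gate can be sketched in a few lines. The record layout (`latency_seconds`, `cost_usd`, a tuple of required fields) and the function names are assumptions. The 90% accuracy floor and the $0.10 cost ceiling echo the figures used in this section.

```python
def evaluate(predictions: list[dict], ground_truth: list[dict], required: tuple) -> dict:
    """Compare agent outputs with ground truth, lead by lead."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be non-empty and equal in length")
    n = len(ground_truth)
    correct = complete = 0
    for pred, truth in zip(predictions, ground_truth):
        if all(pred.get(field) == truth[field] for field in required):
            correct += 1
        if all(pred.get(field) not in (None, "") for field in required):
            complete += 1
    return {
        "accuracy": correct / n,
        "completeness": complete / n,
        "avg_latency_seconds": sum(p["latency_seconds"] for p in predictions) / n,
        "avg_cost_per_lead": sum(p["cost_usd"] for p in predictions) / n,
    }


def ready_to_deploy(metrics: dict, min_accuracy: float = 0.90, max_cost: float = 0.10) -> bool:
    """Deployment gate: block on low accuracy or over-budget cost."""
    return metrics["accuracy"] >= min_accuracy and metrics["avg_cost_per_lead"] <= max_cost


truth = [{"decision_maker": n} for n in ("Alice Johnson", "Bob Smith", "Carol Lee", "Dan Wu")]
preds = [
    {"decision_maker": "Alice Johnson", "latency_seconds": 2.0, "cost_usd": 0.08},
    {"decision_maker": "Bob Smith", "latency_seconds": 4.0, "cost_usd": 0.08},
    {"decision_maker": "Someone Else", "latency_seconds": 6.0, "cost_usd": 0.08},
    {"decision_maker": None, "latency_seconds": 4.0, "cost_usd": 0.08},
]
metrics = evaluate(preds, truth, required=("decision_maker",))
print(metrics, ready_to_deploy(metrics))
```

Exact-match comparison suits fields like names and titles. Free-text outputs such as email drafts need a rubric or a grader model instead.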
Continuous monitoring in production
Once deployed, monitor continuously. For each lead, log:
- Agent decisions: Which tools did it call? What was the reasoning?
- Tool outputs: What data did each tool return?
- Agent output: What did the agent decide to do?
- Human feedback: Did a sales rep approve the agent’s output? Did they edit it?
Use human feedback as a signal. If a sales rep edits 30% of the agent’s emails, the outreach agent needs retraining. If a sales rep ignores 50% of the agent’s research findings, the research agent is providing low-value data.
Observability stack
For production agentic workflows, you need:
- Structured logging: Log every agent action, tool call, and decision. Use JSON to make logs queryable.
- Tracing: Track the full execution path of a workflow. Use tools like Datadog, New Relic, or open-source solutions (Jaeger, Tempo) to visualise the trace.
- Metrics: Track latency (p50, p95, p99), error rates, cost per lead, tool success rates.
- Dashboards: Build dashboards showing agent health, error trends, cost trends, and business impact (leads processed, conversion rates).
- Alerting: Alert on anomalies: error rate spikes, latency degradation, cost overruns, tool failures.
Nebius’s guide on launching production agents at scale discusses evaluation and monitoring in detail.
Common production failures and how to avoid them
Hallucination and false confidence
The failure: An agent confidently states that “Acme Corp was founded in 1987 and has 10,000 employees” when the actual founding year is 2007 and employee count is 100. The agent didn’t look this up; it hallucinated based on training data.
Why it happens: Models are trained to be helpful and confident. They’ll make up plausible-sounding details rather than say “I don’t know.”
How to prevent it:
- Require tool use: Design the agent so it must call a tool to answer factual questions. Don’t let it answer from memory.
- Validate outputs: After a tool returns data, have the agent verify it. If it finds conflicting information, escalate.
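- Use structured outputs: Ask the agent to return a confidence score with each finding (e.g., “Confidence: 0.95”). Self-reported scores are not automatically well calibrated, so check them against your evaluation dataset before using them as thresholds.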
- Test rigorously: Your evaluation dataset should include edge cases where hallucination is likely (obscure companies, conflicting data sources).
Tool timeouts and cascading failures
The failure: The research agent calls the LinkedIn API, which times out after 30 seconds. The agent retries. The retry times out. The agent retries again. After 3 minutes, the workflow is still stuck on one lead, and 100 other leads are queued.
Why it happens: Teams often don’t set timeouts on external API calls. Or they set timeouts but don’t implement fallbacks.
How to prevent it:
- Hard timeouts on all external calls: 5–10 seconds, no exceptions.
- Circuit breakers: If a tool fails 5 times in a row, stop calling it for 5 minutes. Use cached data or a fallback tool instead.
- Async processing: Don’t block on slow tools. If a tool is slow, queue it for async processing and continue with other tasks.
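A circuit breaker with the thresholds above (5 consecutive failures, 5-minute cooldown) might look like the following sketch. The class and method names are illustrative, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time


class CircuitBreaker:
    """Stop calling a failing tool until a cooldown has passed."""

    def __init__(self, failure_threshold=5, cooldown_seconds=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed (calls allowed)

    def allow_call(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and allow a trial call.
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
blocked = not breaker.allow_call()   # breaker is open right after the fifth failure
now[0] = 301.0
recovered = breaker.allow_call()     # cooldown has elapsed
print(blocked, recovered)
```

While `allow_call()` returns `False`, the caller should go straight to cached data or the fallback tool rather than waiting on the broken one.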
State corruption and race conditions
The failure: Two agents are processing the same lead simultaneously. Agent A reads the lead record, which has next_action = null. Agent B reads the same record. Agent A sets next_action = "send_email" and writes back. Agent B sets next_action = "schedule_call" and writes back. Agent B’s write overwrites Agent A’s write. Now the lead has the wrong next action.
Why it happens: Distributed systems are hard. Without proper locking or versioning, concurrent writes cause data loss.
How to prevent it:
- Optimistic locking: Include a version number or timestamp in the read. When writing, check that the version hasn’t changed. If it has, the write fails, and the agent retries with fresh data.
- Serialisation: Process each lead sequentially. Use a queue (SQS, RabbitMQ, Pub/Sub) to ensure only one agent works on a lead at a time.
- Audit trails: Log every write. If corruption happens, you can see what went wrong and roll back.
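The version check is simple to sketch in memory. `LeadStore` and `VersionConflict` are hypothetical names. In a real database the same check is usually a conditional update (for example, an `UPDATE ... WHERE version = ?` that affects zero rows on conflict).

```python
class VersionConflict(Exception):
    """Raised when a record changed between read and write."""


class LeadStore:
    def __init__(self):
        self._rows = {}  # lead_id -> (version, record)

    def read(self, lead_id: str):
        version, record = self._rows.get(lead_id, (0, {}))
        return version, dict(record)

    def write(self, lead_id: str, expected_version: int, record: dict) -> int:
        current_version, _ = self._rows.get(lead_id, (0, {}))
        if current_version != expected_version:
            raise VersionConflict(f"expected v{expected_version}, found v{current_version}")
        self._rows[lead_id] = (current_version + 1, dict(record))
        return current_version + 1


def update_next_action(store: LeadStore, lead_id: str, decide, max_attempts: int = 3) -> int:
    """Read, decide, write; on conflict re-read fresh data and try again."""
    for _ in range(max_attempts):
        version, record = store.read(lead_id)
        record["next_action"] = decide(record)
        try:
            return store.write(lead_id, version, record)
        except VersionConflict:
            continue
    raise VersionConflict("gave up after repeated conflicts")


# Replay the failure scenario: both agents read version 0, then both try to write.
store = LeadStore()
version_a, record_a = store.read("lead_12345")
version_b, record_b = store.read("lead_12345")
record_a["next_action"] = "send_email"
store.write("lead_12345", version_a, record_a)
record_b["next_action"] = "schedule_call"
try:
    store.write("lead_12345", version_b, record_b)
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
print(conflict_detected, store.read("lead_12345"))
```

Agent B's stale write is rejected instead of silently overwriting Agent A's, and B can then re-read the record and decide whether its action still makes sense.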
Model context window exhaustion
The failure: The agent is processing a lead with a 6-month conversation history. The model context window is 8K tokens. The conversation history alone is 7K tokens. That leaves only 1K tokens for the agent’s reasoning and output, so the agent makes poor decisions because it has no room to think.
Why it happens: As workflows mature, context grows. Teams don’t proactively manage context.
How to prevent it:
- Context budgeting: Allocate tokens: 30% for input (lead data, history), 50% for reasoning, 20% for output. If input exceeds 30%, summarise or truncate.
- Sliding window: Keep only the last N turns of conversation, not the entire history.
- Hierarchical summarisation: Summarise old context into a narrative. Keep recent context in full detail.
- Model routing: If a workflow needs more context, use a model with a larger window (e.g., a Claude model with a 200K-token context window).
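Context budgeting and the sliding window combine into one helper: keep the newest turns until the input share of the window is used up. The four-characters-per-token estimate below is a rough assumption, and a production system would count tokens with the model's own tokenizer. `fit_history` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_history(turns: list[str], context_window: int, input_share: float = 0.30) -> list[str]:
    """Keep the most recent turns that fit inside the input share of the window."""
    budget = int(context_window * input_share)
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))  # restore chronological order


turns = [f"turn {i} " + "x" * 392 for i in range(10)]  # 399 chars, about 99 tokens each
window = fit_history(turns, context_window=1000)
print(len(window), window[-1][:6])
```

Turns that fall outside the window are the natural input for hierarchical summarisation, so nothing is lost outright.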
Cost explosion
The failure: You deploy an agent that processes 1,000 leads per week. Each lead costs $0.50 in tokens. That’s $500 per week, or $26,000 per year. You budgeted $5,000 per year.
Why it happens: Teams don’t model costs upfront. They optimise for accuracy, not efficiency.
How to prevent it:
- Cost modelling: Before deployment, estimate cost per lead. Model different scenarios: 100 leads/week, 1,000 leads/week, 10,000 leads/week.
- Model routing: Use cheaper models for simple tasks, expensive models for complex tasks.
- Prompt optimisation: Shorter prompts = fewer tokens. Remove unnecessary examples, use concise language.
- Caching: If multiple leads need the same research (e.g., “Who are the decision makers at Acme Corp?”), cache the result.
- Monitoring: Track cost per lead continuously. Alert if it drifts above budget.
Human-in-the-loop workflows and override patterns
When to escalate to humans
Not every lead should be handled entirely by the agent. Some require human judgment:
- High-value deals: If a lead is worth $100K+, a human should review the research and outreach.
- Complex scenarios: If a lead has unusual attributes (e.g., a private company with no public funding data), escalate.
- Low confidence: If the agent’s confidence score is below a threshold (e.g., 0.7), escalate.
- Objections: If a prospect responds with an objection, escalate to a sales rep who can handle it.
- Regulatory concerns: If a lead is in a regulated industry (financial services, healthcare), a human should verify compliance.
Define escalation rules explicitly:
{
"escalation_rules": [
{
"condition": "deal_value > 100000",
"action": "escalate",
"reason": "High-value deal requires human review"
},
{
"condition": "research_confidence < 0.7",
"action": "escalate",
"reason": "Low confidence in research findings"
},
{
"condition": "prospect_response_contains_objection",
"action": "escalate",
"reason": "Objection handling requires human skill"
}
]
}
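One way to make such rules executable is to pair each reason with a predicate function rather than evaluating condition strings, which avoids running arbitrary expressions. The lead field names below mirror the conditions in the rules above and are otherwise assumptions.

```python
# Each rule pairs a human-readable reason with a predicate over the lead record.
ESCALATION_RULES = [
    ("High-value deal requires human review",
     lambda lead: lead.get("deal_value", 0) > 100_000),
    ("Low confidence in research findings",
     lambda lead: lead.get("research_confidence", 1.0) < 0.7),
    ("Objection handling requires human skill",
     lambda lead: lead.get("prospect_response_contains_objection", False)),
]


def escalation_reasons(lead: dict) -> list[str]:
    """Return every reason that applies; an empty list means no escalation."""
    return [reason for reason, applies in ESCALATION_RULES if applies(lead)]


lead = {"deal_value": 250_000, "research_confidence": 0.65}
reasons = escalation_reasons(lead)
print(reasons)
```

Returning all matching reasons, not just the first, gives the reviewing sales rep the full picture of why the lead landed in their queue.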
Approval workflows
Before an agent sends an email or calls an API that modifies data, require approval from a human (or a more conservative model).
Pattern 1: Async approval
The agent drafts an email. The email is queued for approval. A sales manager reviews it in Slack or a web interface. Once approved, the agent sends it.
{
"approval_request": {
"id": "apr_123456",
"lead_id": "lead_12345",
"action": "send_email",
"draft": "Hi Alice, I saw that Acme just raised Series B...",
"created_at": "2024-11-20T14:32:15Z",
"expires_at": "2024-11-20T15:32:15Z",
"status": "pending"
}
}
If the manager approves, the agent sends the email. If the manager rejects, the agent learns why and adjusts its drafting logic.
Pattern 2: Confidence-based approval
If the agent’s confidence is above 0.9, send immediately. If it’s between 0.7 and 0.9, queue for approval. If it’s below 0.7, escalate.
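The three bands translate directly into a small routing function. The function name and return labels are illustrative, and the boundary values (exactly 0.7 or 0.9) are assigned to the approval queue here as an assumption, since the text does not say which side they fall on.

```python
def approval_route(confidence: float, auto_send_above: float = 0.9,
                   escalate_below: float = 0.7) -> str:
    """Map a confidence score to send, queue_for_approval, or escalate."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > auto_send_above:
        return "send"
    if confidence >= escalate_below:
        return "queue_for_approval"
    return "escalate"


for score in (0.95, 0.80, 0.50):
    print(score, approval_route(score))
```

Keeping the thresholds as parameters makes it easy to start conservatively (queue almost everything) and loosen them as override rates fall.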
Learning from overrides
When a human overrides an agent decision, capture it and use it to improve the agent.
{
"override": {
"id": "ovr_789012",
"approval_request_id": "apr_123456",
"agent_output": "Hi Alice, I saw that Acme just raised Series B...",
"human_output": "Hi Alice, Congratulations on the Series B! I'd love to show you how...",
"reason": "More congratulatory tone, better opening hook",
"timestamp": "2024-11-20T14:35:00Z"
}
}
Analyse overrides to find patterns. If 30% of overrides are “tone too formal,” retrain the outreach agent with examples of conversational emails.
Cost control and efficiency optimisation
Token budgeting
Every agent action costs tokens. Research costs tokens. Drafting costs tokens. Reasoning costs tokens. Without budgeting, costs spiral.
Set a token budget per lead:
- Research phase: 2,000 tokens max (2–3 tool calls).
- Outreach phase: 1,500 tokens max (drafting + personalisation).
- CRM update phase: 500 tokens max.
- Total: 4,000 tokens per lead.
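At Sonnet’s input rate of $0.003 per 1K tokens, that’s $0.012 per lead, or $12 per 1,000 leads. Output tokens are billed at a higher rate, so treat this as a floor rather than a forecast.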
If an agent exceeds the budget, it escalates or uses a fallback (cached data, simpler logic).
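A per-lead budget tracker can enforce these limits at runtime. This is a sketch: the `TokenBudget` class is hypothetical, and the token counts passed to `charge` would come from the usage figures your model provider returns with each response.

```python
# Per-phase token limits from the list above.
PHASE_BUDGETS = {"research": 2000, "outreach": 1500, "crm_update": 500}


class TokenBudget:
    """Track token usage per phase for a single lead."""

    def __init__(self, budgets: dict = PHASE_BUDGETS):
        self.budgets = dict(budgets)
        self.used = {phase: 0 for phase in budgets}

    def charge(self, phase: str, tokens: int) -> bool:
        """Record usage; return False once the phase is over budget."""
        self.used[phase] += tokens
        return self.used[phase] <= self.budgets[phase]

    def cost_usd(self, price_per_1k: float = 0.003) -> float:
        """Cost of tokens used so far at a flat per-1K price."""
        return sum(self.used.values()) / 1000 * price_per_1k


budget = TokenBudget()
first_ok = budget.charge("research", 1200)    # within the 2,000-token limit
second_ok = budget.charge("research", 900)    # 2,100 total: over budget
full_budget_cost = sum(PHASE_BUDGETS.values()) / 1000 * 0.003
print(first_ok, second_ok, round(full_budget_cost, 3))
```

When `charge` returns `False`, the workflow should stop calling the model for that phase and take the escalation or fallback path.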
Batch processing and parallelisation
Processing leads sequentially is slow. Process them in batches.
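from concurrent.futures import ThreadPoolExecutor

# fetch_leads and process_lead are the workflow's own functions, defined elsewhere.
leads = fetch_leads(limit=100)
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    # Submit every lead up front so up to 10 run concurrently.
    futures = [executor.submit(process_lead, lead) for lead in leads]
    for future in futures:
        # Wait up to 30 seconds for each result; a timeout error is raised if exceeded.
        result = future.result(timeout=30)
        results.append(result)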
With 10 parallel workers, you process 100 leads in roughly the time it takes to process 10 sequentially. But be careful: parallelisation can cause rate limiting, state corruption, and higher costs. Use connection pooling, implement backoff, and monitor carefully.
Caching and deduplication
Many leads share attributes. If you’ve already researched “Acme Corp,” don’t research it again for the next Acme employee.
Implement a cache:
cache = {
    "company_research": {},  # key: company_name, value: research findings
    "decision_makers": {},   # key: company_name, value: list of decision makers
    "recent_news": {},       # key: company_name, value: news articles
}

def get_company_research(company_name):
    if company_name in cache["company_research"]:
        return cache["company_research"][company_name]
    # Not in cache, fetch and cache
    research = call_research_tool(company_name)
    cache["company_research"][company_name] = research
    return research
With a 1,000-lead batch where 30% are from the same companies, caching saves 300 API calls and $3–$5 in token costs.
Prompt optimisation
Every token in your prompt costs money. Optimise prompts to be concise without losing clarity.
Before:
You are a sales research agent. Your job is to research companies and identify decision makers.
You should use tools to find information about the company, including its founding date,
employee count, recent funding, and key executives. You should also search for recent news
about the company. Once you have gathered this information, you should identify the most
relevant decision makers for a sales outreach.
After:
Research a company and identify key decision makers. Use tools to find: founding date,
employee count, recent funding, executives, and recent news.
The second version is roughly two-thirds shorter (22 words versus 65) and carries the same instructions.
Security, compliance, and audit readiness
Data handling and privacy
Sales agents handle sensitive data: prospect names, emails, company information, conversation history. You need to protect it.
Principles:
- Minimise data: Only collect and store data you need.
- Encrypt in transit: Use HTTPS (TLS) for all API calls.
- Encrypt at rest: Use AES-256 for database encryption.
- Access control: Only agents and systems that need data can access it. Use IAM roles and policies.
- Data retention: Define how long you keep data. Delete old leads after 90 days.
For Australian businesses, you must comply with the Privacy Act 1988 and Australian Privacy Principles. For financial services, you must comply with APRA CPS 234 and ASIC RG 271. If you work with regulated entities, PADISO’s financial services AI advisory can help you navigate compliance.
Audit trails and logging
Every agent action should be logged:
- Who: Which agent, which model, which user initiated the workflow.
- What: Which tools were called, what data was accessed, what decisions were made.
- When: Timestamp of each action.
- Why: The reasoning for the decision (if available).
- Outcome: What happened as a result.
{
"audit_log": {
"id": "log_123456",
"timestamp": "2024-11-20T14:32:15Z",
"agent": "research_agent",
"model": "claude-3-5-sonnet",
"lead_id": "lead_12345",
"action": "call_tool",
"tool_name": "search_company_news",
"tool_input": {"company_name": "Acme Corp", "limit": 5},
"tool_output": {"articles": [...], "total_found": 12},
"cost_tokens": 342,
"cost_usd": 0.0031
}
}
Store logs in a tamper-evident system (write-once storage, immutable logs). Retain logs for as long as your regulators require: seven years is a common requirement for financial records, but the period varies by jurisdiction and industry.
Model governance and version control
When you update a model or prompt, you need to track it. Use version control:
{
"model_version": {
"id": "mv_v3_2024_11_20",
"agent": "research_agent",
"model_name": "claude-3-5-sonnet",
"model_version": "claude-3-5-sonnet-20241022",
"system_prompt": "Research a company and identify key decision makers...",
"tools": ["search_company_news", "fetch_linkedin_profile", ...],
"created_at": "2024-11-20T10:00:00Z",
"created_by": "alice@company.com",
"approval_status": "approved",
"evaluation_results": {"accuracy": 0.94, "latency": 4.2},
"deployed_at": "2024-11-20T15:00:00Z"
}
}
Before deploying a new model version, require approval from a human (manager, compliance officer). Document why the change was made and what the impact is.
SOC 2 and ISO 27001 readiness
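If you’re building agentic systems for enterprise customers, they’ll ask: “Are you SOC 2 Type II certified? Are you ISO 27001 certified?” Strictly speaking, SOC 2 is an auditor’s attestation report rather than a certification, but buyers use the terms interchangeably. Both demonstrate that you have controls around security, availability, and confidentiality.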
For agentic systems, focus on:
- Access control: Who can deploy agents? Who can modify tools? Who can access logs?
- Change management: How do you test and approve new agent versions?
- Incident response: If an agent makes a mistake or a tool fails, how do you respond?
- Monitoring and alerting: How do you detect anomalies?
- Data protection: How do you protect prospect data?
PADISO’s AI Quickstart Audit is a fixed-fee, 2-week diagnostic that assesses your AI systems against SOC 2 and ISO 27001 standards. It tells you what’s missing and what to fix first.
Deployment and operational patterns
Local vs. cloud deployment
Local deployment (on-premises): Your infrastructure, your data never leaves your network. Best for regulated industries (financial services, healthcare). Drawback: you manage infrastructure, scaling, and uptime.
Cloud deployment (AWS, Azure, GCP): Managed infrastructure, auto-scaling, high availability. Drawback: data is in the cloud, compliance is shared responsibility.
For sales agents, cloud deployment is typical. Use managed services:
- Compute: AWS Lambda, Google Cloud Functions, or Azure Functions for serverless execution.
- Orchestration: AWS Step Functions, Google Cloud Workflows, or Temporal for workflow management.
- Storage: PostgreSQL (RDS) or DynamoDB for session state and audit logs.
- Queues: SQS, Pub/Sub, or RabbitMQ for lead processing.
- Monitoring: CloudWatch, Datadog, or New Relic.
CI/CD for agents
Treat agent code like software. Use version control (Git), continuous integration (run tests on every commit), and continuous deployment (auto-deploy to staging, manual approval to production).
name: Deploy Agent
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run unit tests
        run: pytest tests/
      - name: Run integration tests
        run: pytest tests/integration/
      - name: Evaluate on test dataset
        run: python eval.py --dataset test_leads_100.json
  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Deploy to staging
        run: terraform apply -auto-approve -var-file=staging.tfvars
      - name: Run smoke tests
        run: pytest tests/smoke/
      - name: Manual approval for production
        uses: actions/github-script@v6
        with:
          script: |
            // Require approval from a human before deploying to production
Gradual rollout and canary deployments
Don’t deploy a new agent version to 100% of traffic immediately. Use a canary deployment:
- Deploy to 5% of traffic. Monitor for errors, latency, cost.
- If metrics are good, increase to 10%.
- Continue increasing until you reach 100%.
- If metrics degrade at any point, rollback to the previous version.
import random

def route_lead_to_agent(lead):
    random_value = random.random()
    if random_value < 0.05:  # 5% to canary
        return process_with_agent(lead, version="v3_canary")
    else:  # 95% to stable
        return process_with_agent(lead, version="v2_stable")
Monitor canary metrics separately. If the canary’s accuracy is 92% and the stable version’s accuracy is 94%, the canary is not ready. Investigate why and fix it.
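Random routing sends the same lead to different versions on different days, which muddies a multi-day workflow. One alternative, sketched here as an assumption and not as part of the pattern above, is to hash the lead ID so each lead sticks to one version for the whole rollout. The function names are hypothetical.

```python
import hashlib


def canary_bucket(lead_id: str) -> float:
    """Map a lead ID to a stable value in [0, 1) using a hash."""
    digest = hashlib.sha256(lead_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def pick_version(lead_id: str, canary_share: float = 0.05) -> str:
    """Send a fixed share of leads to the canary, always the same leads."""
    return "v3_canary" if canary_bucket(lead_id) < canary_share else "v2_stable"


lead_ids = [f"lead_{i}" for i in range(10_000)]
canary_count = sum(pick_version(lead_id) == "v3_canary" for lead_id in lead_ids)
print(f"{canary_count / len(lead_ids):.1%} of leads routed to the canary")
```

Raising `canary_share` from 0.05 to 0.10 keeps every lead already on the canary there and adds new ones, so the per-version metrics stay comparable as the rollout widens.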
Operational runbooks
Create runbooks for common operational scenarios:
Runbook: Agent latency spike
- Check CloudWatch metrics. Is CPU high? Memory high? Network latency high?
- Check tool latency. Which tool is slow? Is it timing out?
- Check error logs. Are there error spikes?
- If a tool is slow, activate fallback (use cached data or a different tool).
- If a tool is down, disable it and route to fallback.
- If infrastructure is overloaded, scale up (increase Lambda concurrency, add more pods).
- Once resolved, post-mortem: why did this happen? How do we prevent it?
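The fallback step in the latency runbook (serve cached data when a tool is slow or down) can also be built into the tool wrapper so it activates automatically. This is a minimal sketch using only the standard library. The function name `call_with_fallback`, the plain dict used as a cache, and the 5-second default timeout are assumptions; a production system would more likely use a shared cache such as Redis.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(tool: Callable[[str], Any], key: str,
                       cache: Dict[str, Any], timeout_s: float = 5.0) -> Any:
    """Call tool(key) with a timeout; on timeout or error, serve the cached value."""
    future = _executor.submit(tool, key)
    try:
        result = future.result(timeout=timeout_s)
    except Exception:
        # Covers both a timeout and an exception raised by the tool itself.
        # cancel() only helps if the call has not started; a running call
        # keeps going in its worker thread and its result is discarded.
        future.cancel()
        if key in cache:
            return cache[key]
        raise
    cache[key] = result  # refresh the cache on every successful call
    return result
```

If there is no cached value for the key, the original error propagates, which is the signal to escalate or route to a different tool.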
Runbook: Cost overrun
- Check cost metrics. Which agent is expensive? Which tool?
- Check token usage. Are prompts too long? Are we retrying too much?
- Check model routing. Are we using expensive models for simple tasks?
- Optimise: shorten prompts, reduce tool calls, use cheaper models.
- If cost is still high, reduce traffic to the agent or disable it.
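The first three checks in the cost runbook come down to attributing spend to models, tokens, and retries. A rough sketch of that arithmetic follows. The prices are hypothetical placeholders, not real provider rates, and the model assumes each retry resends the same tokens, which is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {
    "small": {"input": 0.0005, "output": 0.0015},
    "large": {"input": 0.0100, "output": 0.0300},
}


@dataclass
class Call:
    model: str
    input_tokens: int
    output_tokens: int
    retries: int = 0


def call_cost(call: Call) -> float:
    """Cost of one logical call, counting every retry as a full resend."""
    price = PRICE_PER_1K[call.model]
    attempts = 1 + call.retries
    per_attempt = (call.input_tokens * price["input"]
                   + call.output_tokens * price["output"]) / 1000
    return attempts * per_attempt


def cost_breakdown(calls: Iterable[Call]) -> Dict[str, float]:
    """Total spend per model, to show where the money goes."""
    totals: Dict[str, float] = {}
    for call in calls:
        totals[call.model] = totals.get(call.model, 0.0) + call_cost(call)
    return totals


def cost_per_lead(calls: Iterable[Call], leads_processed: int) -> float:
    if leads_processed <= 0:
        raise ValueError("leads_processed must be positive")
    return sum(call_cost(c) for c in calls) / leads_processed
```

A breakdown dominated by the large model on short prompts points to a routing problem; a high retry count points to tool or prompt instability rather than prompt length.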
Monitoring dashboards
Build dashboards that show:
- Volume: Leads processed per hour, per day, per week.
- Latency: p50, p95, p99 latency per lead.
- Accuracy: Research accuracy, email quality, decision-maker identification accuracy.
- Cost: Cost per lead, total cost per day, cost trends.
- Errors: Error rate, error types, tool failure rates.
- Business impact: Leads qualified, conversion rate, revenue influenced.
Update dashboards in real time. Alert on anomalies: error rate > 5%, latency p99 > 10 seconds, cost per lead > budget.
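The three alert rules translate directly into a check that can run over each monitoring window. The sketch below uses the nearest-rank method for percentiles. The function names and the alert labels are illustrative, and a real setup would normally express these rules in the monitoring tool itself rather than in application code.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(latencies_s: Sequence[float], errors: int, total: int,
                 cost_usd: float, budget_per_lead_usd: float) -> List[str]:
    """Return the names of the alert rules that fire for this window."""
    alerts: List[str] = []
    if total == 0:
        return alerts  # nothing processed, nothing to evaluate
    if errors / total > 0.05:
        alerts.append("error_rate")
    if percentile(latencies_s, 99) > 10.0:
        alerts.append("latency_p99")
    if cost_usd / total > budget_per_lead_usd:
        alerts.append("cost_per_lead")
    return alerts
```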
Putting it all together: A production architecture
Here’s how a production agentic sales workflow fits together:
┌─────────────────────────────────────────────────────────────┐
│  Inbound Leads (Salesforce, HubSpot, API)                   │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Lead Queue (SQS / Pub/Sub)                                 │
│  - Deduplication: check if lead already processed           │
│  - Prioritisation: high-value deals first                   │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Coordinator Agent (Lambda / Cloud Function)                │
│  - Fetch lead from queue                                    │
│  - Fetch session state from database                        │
│  - Decide workflow: research → outreach → follow-up         │
│  - Route to specialist agents                               │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
    Research     Outreach        CRM
      Agent        Agent        Agent
        │            │            │
        └────────────┼────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Tool Execution Layer                                       │
│  - CRM client (Salesforce, HubSpot)                         │
│  - Research tools (LinkedIn, Crunchbase, news APIs)         │
│  - Email tools (SendGrid, draft engine)                     │
│  - Scheduling tools (Calendly)                              │
│  - Caching layer (Redis)                                    │
│  - Circuit breakers, timeouts, retries                      │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Session State Database (PostgreSQL / DynamoDB)             │
│  - Lead record, workflow stage, actions taken               │
│  - Conversation history, research findings                  │
│  - Audit trail                                              │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Approval / Escalation Layer                                │
│  - High-value deals → human review                          │
│  - Low confidence → escalate                                │
│  - Objections → sales rep                                   │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
    Approved     Escalated  Monitoring &
     Actions     to Humans  Observability
        │            │            │
        └────────────┼────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  Outbound Actions                                           │
│  - Send email via SendGrid                                  │
│  - Update CRM (Salesforce, HubSpot)                         │
│  - Schedule call (Calendly)                                 │
│  - Log activity                                             │
└─────────────────────────────────────────────────────────────┘
Each component is independent, testable, and monitorable. If the research agent fails, the outreach agent can still work (using cached research). If the CRM write fails, the action is queued for retry. If a tool times out, a fallback is used.
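The "queued for retry" behaviour for failed CRM writes can be sketched as a bounded retry loop with a dead-letter list. This is an in-memory illustration only: the name `drain_with_retry`, the `write` callable, and the limit of three attempts are assumptions, and a production version would use a durable queue with a delay between attempts rather than retrying immediately.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_with_retry(actions: List[dict], write: Callable[[dict], None],
                     max_attempts: int = 3) -> Tuple[List[dict], List[dict]]:
    """Try each action; requeue failures up to max_attempts, then dead-letter them."""
    queue: Deque[Tuple[dict, int]] = deque((a, 1) for a in actions)
    done: List[dict] = []
    dead: List[dict] = []
    while queue:
        action, attempt = queue.popleft()
        try:
            write(action)
            done.append(action)
        except Exception:
            if attempt < max_attempts:
                # Requeue at the back so other actions are not blocked
                queue.append((action, attempt + 1))
            else:
                dead.append(action)  # needs human attention
    return done, dead
```

Dead-lettered actions are the ones to surface in the escalation layer, because the agent has exhausted its automatic recovery options for them.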
Summary and next steps
Building agentic sales workflows in production is complex. It’s not just about calling an LLM API. It’s about architecture, tool design, state management, evaluation, monitoring, and operational discipline.
Here’s what you need to do:
Week 1–2: Foundation
- Define your sales workflow (research → outreach → follow-up).
- Choose your tools (CRM, research APIs, email platform).
- Design your tool schema (inputs, outputs, error handling).
- Build a basic agent loop (coordinator + specialist agents).
Week 3–4: Evaluation
- Create a test dataset (100–500 leads with ground truth).
- Run your agent on the test dataset.
- Measure accuracy, latency, cost.
- Identify failure modes and fix them.
Week 5–6: Production readiness
- Implement state management (session state database).
- Add approval workflows and escalation rules.
- Build monitoring and alerting.
- Create runbooks for common failures.
- Deploy to staging and run smoke tests.
Week 7+: Launch and iterate
- Canary deploy to 5% of traffic.
- Monitor metrics closely.
- Collect human feedback and use it to improve the agent.
- Gradually increase to 100%.
- Continuously optimise: reduce cost, improve accuracy, increase speed.
If you’re building this at a startup or scale-up, you may not have the in-house expertise to do this alone. PADISO’s platform engineering services and AI & Agents Automation can help you design, build, and deploy production-grade agentic systems. We’ve done this for dozens of companies, and we know the patterns, the pitfalls, and how to get to market fast.
If you’re in financial services, PADISO’s AI advisory for financial services ensures your agents are APRA, ASIC, and AUSTRAC compliant by design. If you’re in insurance, our insurance AI advisory covers claims automation, conduct risk, and underwriting.
For a quick assessment of where you are and what to build first, book an AI Quickstart Audit. It’s a fixed-fee, 2-week diagnostic that tells you exactly what to ship first, what to retire, and what 90 days could unlock.
Or book a call with our Sydney-based AI advisory team to discuss your specific workflow and challenges. We’ll help you ship faster and avoid the pitfalls that slow down most teams.
The future of sales is agentic. The teams that ship production agentic workflows first will have a massive competitive advantage. The time to start is now.