
AI Agents in Production: Agentic Sales Workflows

Engineering patterns for agentic sales workflows in production. Real architectures, code-level recommendations, and operational quirks at scale.

The PADISO Team ·2026-06-01

What are agentic sales workflows and why they matter
The architecture of production-grade sales agents
Tool design and integration patterns
State management and memory in sales agents
Evaluation and observability at scale
Common production failures and how to avoid them
Human-in-the-loop workflows and override patterns
Cost control and efficiency optimisation
Security, compliance, and audit readiness
Deployment and operational patterns

What are agentic sales workflows and why they matter

Agentic sales workflows are autonomous systems that handle repetitive sales tasks (account research, outreach drafting, CRM updates, lead scoring, and next-action planning) without human intervention on every step. Unlike traditional chatbots or rule-based automation, AI agents plan and act with tools to achieve defined goals, deciding which actions to take based on real-time context and feedback.

The business case is straightforward: a sales team of 20 people spends roughly 40% of their time on non-selling work—data entry, research, email drafting, follow-up scheduling. That’s 8 people’s worth of capacity locked in admin. A production agentic sales workflow can reclaim 60–80% of that time, freeing your sales team to focus on relationship-building, negotiation, and closing.

But moving from proof-of-concept to production is where most teams stumble. The gap between a working prototype and a system handling 1,000+ leads per week with 99%+ accuracy is vast. It involves architecture decisions, tool integration complexity, state management, evaluation frameworks, and operational discipline that many teams underestimate.

At PADISO, we’ve built and scaled agentic sales systems across fintech, SaaS, and enterprise software companies. The patterns we’ve learned—and the failures we’ve debugged—form the backbone of this guide.

The architecture of production-grade sales agents

Core agent loop and orchestration

A production agentic sales workflow sits on top of a core agent loop: observe → plan → act → reflect. The agent observes the current state (a lead record, email thread, opportunity stage). It plans the next action (research competitor, draft outreach, update CRM). It acts (calls tools, integrates with external systems). It reflects on the outcome and decides whether to continue, escalate, or hand off to a human.
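A minimal Python sketch of the observe → plan → act → reflect loop follows. It is an illustration under stated assumptions, not a framework API: the planner and executor are hard-coded stubs standing in for a model call and a tool call, and the names (`LeadState`, `plan_next_action`, `MAX_STEPS`) are hypothetical.

```python
from dataclasses import dataclass, field

MAX_STEPS = 5  # assumed cap so a confused agent cannot loop forever


@dataclass
class LeadState:
    lead_id: str
    researched: bool = False
    email_drafted: bool = False
    history: list = field(default_factory=list)


def plan_next_action(state: LeadState) -> str:
    """Stub planner: a real system would ask a model to choose the action."""
    if not state.researched:
        return "research_company"
    if not state.email_drafted:
        return "draft_outreach"
    return "done"


def act(state: LeadState, action: str) -> dict:
    """Stub executor: a real system would call a tool here."""
    if action == "research_company":
        state.researched = True
    elif action == "draft_outreach":
        state.email_drafted = True
    return {"action": action, "ok": True}


def run_agent_loop(state: LeadState) -> str:
    for _ in range(MAX_STEPS):
        action = plan_next_action(state)   # observe the state and plan
        if action == "done":
            return "completed"
        outcome = act(state, action)       # act
        state.history.append(outcome)      # reflect: record the outcome
        if not outcome["ok"]:
            return "escalated"             # hand off to a human
    return "escalated"                     # step budget exhausted
```

The step cap matters more than it looks: without it, a planner that keeps choosing the same failing action never terminates.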

The orchestration layer determines how this loop scales. Microsoft’s AI agent design patterns guide outlines several patterns: sequential (one action at a time), parallel (multiple agents working on different aspects), hierarchical (a coordinator agent delegating to specialist agents), and reactive (responding to external events).

For sales workflows, a hierarchical pattern works best in production:

Coordinator agent: Receives a lead or opportunity, decides the overall workflow (research → outreach → follow-up).
Research agent: Gathers company data, funding history, recent news, decision-maker details.
Outreach agent: Drafts personalised emails, LinkedIn messages, or call scripts.
CRM agent: Updates Salesforce, HubSpot, or Pipedrive with findings and next actions.
Escalation agent: Identifies edge cases (high-value deals, complex scenarios) and flags them for human review.

This separation of concerns means each agent can be trained, evaluated, and iterated independently. A research agent can use different tools and models than an outreach agent. If the research agent fails on 5% of lookups, you can fix it without touching the outreach logic.
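The hierarchical pattern can be sketched as a coordinator that dispatches to specialist functions. This is a simplified illustration: the specialists below are stubs that return canned data, and the `SPECIALISTS` registry and the $100K escalation threshold are assumptions chosen for the example.

```python
def research_agent(lead: dict) -> dict:
    """Stub specialist: a real one would call research tools."""
    return {**lead, "research": {"employees": 250}}


def outreach_agent(lead: dict) -> dict:
    """Stub specialist: a real one would draft with a model."""
    return {**lead, "email_draft": f"Hi, I saw the news about {lead['company_name']}..."}


def crm_agent(lead: dict) -> dict:
    """Stub specialist: a real one would write to the CRM."""
    return {**lead, "crm_updated": True}


# Registry of specialists; each can be swapped or tested on its own.
SPECIALISTS = {
    "research": research_agent,
    "outreach": outreach_agent,
    "crm": crm_agent,
}


def coordinator(lead: dict, high_value_threshold: int = 100_000) -> dict:
    """Decide the workflow, delegate each step, and escalate edge cases."""
    if lead.get("deal_value", 0) >= high_value_threshold:
        return {**lead, "status": "escalated"}
    for step in ("research", "outreach", "crm"):
        lead = SPECIALISTS[step](lead)
    return {**lead, "status": "completed"}
```

Because each specialist is a plain function behind a registry key, replacing the research agent does not require touching the outreach or CRM logic.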

Model selection and routing

Production systems rarely use a single model. The best teams use model routing: lightweight models for simple tasks, heavier models for complex reasoning.

Fast-path tasks (lead classification, CRM updates, scheduling): Claude Haiku, GPT-4o mini, or Gemini 1.5 Flash. Cost: $0.0008–$0.003 per 1K input tokens.
Medium-complexity tasks (email drafting, objection handling): Claude 3.5 Sonnet or GPT-4 Turbo. Cost: $0.003–$0.01 per 1K input tokens.
Complex reasoning (multi-step research, deal strategy): Claude 3 Opus or GPT-4. Cost: $0.015–$0.03 per 1K input tokens.

For a sales workflow processing 1,000 leads per week, routing matters. If 80% of leads hit the fast path (Haiku), 15% hit the medium path (Sonnet), and 5% hit the complex path (Opus), your average token cost per lead drops by roughly 80% or more versus using Opus for everything, at the prices listed above and assuming each path uses a similar number of tokens.

Implementation: Use a classifier agent that examines the lead data (company size, industry, deal stage) and returns a routing decision. LangChain’s agent documentation and Anthropic’s tools guide both show patterns for conditional routing.
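As a rough sketch of routing plus the cost arithmetic behind it, the Python below uses a rule-based router in place of a classifier agent. The routing thresholds and lead fields (`deal_value`, `stage`, `needs_outreach`) are hypothetical, and the prices are simply the upper end of each tier listed above.

```python
# Per-1K-input-token prices: upper end of each tier listed above.
PRICE_PER_1K = {"fast": 0.003, "medium": 0.01, "complex": 0.03}


def route_lead(lead: dict) -> str:
    """Return a tier name. The thresholds are illustrative assumptions."""
    if lead.get("deal_value", 0) >= 100_000 or lead.get("stage") == "negotiation":
        return "complex"
    if lead.get("needs_outreach", False):
        return "medium"
    return "fast"


def blended_price(mix: dict) -> float:
    """Average price per 1K tokens for a traffic mix such as {'fast': 0.8, ...}."""
    return sum(share * PRICE_PER_1K[tier] for tier, share in mix.items())


# The 80/15/5 split from the text above.
mix = {"fast": 0.80, "medium": 0.15, "complex": 0.05}
blended = blended_price(mix)
saving = 1 - blended / PRICE_PER_1K["complex"]
print(f"blended: ${blended:.4f} per 1K tokens, saving vs complex-only: {saving:.0%}")
```

The exact saving depends on the prices and per-path token counts you plug in, so rerun the calculation whenever provider pricing or your traffic mix changes.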

Error handling and fallback chains

Production agents fail. A tool times out. An API returns unexpected data. A model hallucinates. The difference between a prototype and production is how you handle failure.

Implement a fallback chain:

Primary action: Agent calls Tool A (e.g., ZoomInfo API for company research).
Timeout fallback (5 seconds): If no response, use cached data or a secondary tool (e.g., LinkedIn API).
Parse error fallback: If the response is malformed, retry with a simpler model prompt or a different tool.
Escalation fallback: If both tools fail, flag the lead for human research and move to the next lead.

Do not retry infinitely. Set a retry budget: 3 attempts per tool, then escalate. Infinite retries waste tokens and delay processing.
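The chain above can be sketched in Python as follows. This is one possible arrangement, not a prescribed one: it tries each tool with a retry budget, then falls back to cached data, then escalates. The tool functions are hypothetical stand-ins, and the backoff delays are kept tiny so the example runs quickly.

```python
import time
from typing import Callable, Optional

RETRY_BUDGET = 3  # attempts per tool, as suggested above


def call_with_retries(tool: Callable[[str], dict], company: str) -> Optional[dict]:
    """Try one tool up to RETRY_BUDGET times; return None if it never succeeds."""
    for attempt in range(RETRY_BUDGET):
        try:
            return tool(company)
        except (TimeoutError, ValueError):
            time.sleep(0.01 * (2 ** attempt))  # short exponential backoff
    return None


def research_with_fallbacks(company: str, tools: list, cache: dict) -> dict:
    """Walk the chain: each tool in order, then cached data, then escalation."""
    for tool in tools:
        result = call_with_retries(tool, company)
        if result is not None:
            return {"status": "ok", "source": tool.__name__, "data": result}
    if company in cache:
        return {"status": "stale", "source": "cache", "data": cache[company]}
    return {"status": "escalate", "reason": "all research tools failed"}


# Hypothetical tools used only to exercise the chain.
def primary_tool(company: str) -> dict:
    raise TimeoutError("primary provider timed out")


def secondary_tool(company: str) -> dict:
    return {"company": company, "employees": 250}
```

Returning a status field rather than raising lets the coordinator decide what to do with stale or missing research without wrapping every call in its own error handling.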

Tool design and integration patterns

Designing tools for agentic use

Tools are the bridge between an agent and the real world. A poorly designed tool will cause the agent to fail, hallucinate, or waste tokens. A well-designed tool is clear, bounded, and returns structured data.

Good tool design follows these principles:

1. Single responsibility: One tool does one thing. Instead of a generic “search the web” tool, create specific tools: search_company_news, fetch_linkedin_profile, lookup_funding_history. The agent can then decide which tool to use based on its goal.

2. Clear input schema: Define exactly what inputs the tool accepts. Use JSON Schema with descriptions.

{
  "name": "search_company_news",
  "description": "Searches recent news about a company. Returns up to 10 articles from the past 90 days.",
  "input_schema": {
    "type": "object",
    "properties": {
      "company_name": {
        "type": "string",
        "description": "The company name to search for (e.g., 'Acme Corp')"
      },
      "limit": {
        "type": "integer",
        "description": "Number of articles to return (1-10). Defaults to 5.",
        "default": 5
      }
    },
    "required": ["company_name"]
  }
}

3. Predictable output: Return structured JSON, not prose. The agent can parse structured data reliably.

{
  "articles": [
    {
      "title": "Acme Corp raises $50M Series B",
      "source": "TechCrunch",
      "published_date": "2024-11-15",
      "url": "https://techcrunch.com/..."
    }
  ],
  "total_found": 12,
  "query_time_ms": 245
}

4. Timeout and rate limits: Every tool should have a hard timeout (5–10 seconds for API calls). Document rate limits. If a tool has a limit of 100 calls per minute, the agent needs to know this to avoid throttling.

5. Error responses: Return structured errors, not exceptions.

{
  "error": "rate_limit_exceeded",
  "retry_after_seconds": 30,
  "message": "API rate limit hit. Try again in 30 seconds."
}
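Principles 4 and 5 can be combined in a small wrapper. The sketch below is one way to do it in Python with the standard library: it runs a tool in a worker thread, enforces a hard timeout, and converts any exception into a structured error. The `call_tool` name and the result shape are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_tool(fn, *args, timeout_s: float = 5.0) -> dict:
    """Run a tool with a hard timeout and always return a structured result."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return {"ok": True, "data": future.result(timeout=timeout_s)}
    except FutureTimeout:
        return {"ok": False, "error": "timeout",
                "message": f"Tool did not respond within {timeout_s} seconds."}
    except Exception as exc:  # convert any tool exception into data
        return {"ok": False, "error": "tool_error", "message": str(exc)}
    finally:
        # Do not block on a hung call; the worker thread is left to finish.
        executor.shutdown(wait=False)


# Hypothetical tools used only to demonstrate both failure modes.
def slow_tool(company_name: str) -> dict:
    time.sleep(0.2)
    return {"articles": []}


def broken_tool(company_name: str) -> dict:
    raise ValueError("unexpected response shape")
```

Note that a thread-based timeout stops the caller from waiting but does not cancel the underlying request, so the HTTP client inside each tool should also set its own timeout.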

Integrating with sales platforms

Most sales workflows need to read from and write to a CRM (Salesforce, HubSpot, Pipedrive). This integration is where production systems often fail.

Common pitfalls:

No authentication caching: Every tool call re-authenticates with the CRM. This is slow and burns API calls.
No field mapping: The agent doesn’t know which CRM field corresponds to “decision maker title” or “annual revenue.”
No conflict resolution: Two agents write to the same lead simultaneously, and the second write overwrites the first.
No audit trail: You can’t see what the agent wrote or why.

Production patterns:

Centralised CRM client: Create a single CRM client that handles authentication, token refresh, and connection pooling. All agents use this client.
Field schema registry: Maintain a mapping of logical fields to CRM fields. This decouples the agent logic from CRM specifics.

# Logical field name -> Salesforce object.field
CRM_FIELD_MAPPING = {
    "company_name": "Account.Name",
    "decision_maker_title": "Contact.Title",
    "annual_revenue": "Account.AnnualRevenue",
    "last_contact_date": "Account.LastActivityDate",
    "next_action": "Opportunity.NextStep"
}

Optimistic locking: When writing to a lead, include the last-modified timestamp. If the timestamp doesn’t match, the write fails, and the agent retries with fresh data.
Write audit trail: Every agent write logs what was written, when, and why. Store this in a separate table or a CRM custom field.

{
  "timestamp": "2024-11-20T14:32:15Z",
  "agent": "outreach_agent",
  "action": "draft_email",
  "fields_updated": ["next_action", "email_draft"],
  "reason": "Personalised outreach based on recent Series B funding"
}

Building a tool library

Production sales agents typically use 8–15 tools. Here’s a typical set:

Research tools:

search_company_news (Crunchbase, news APIs)
fetch_linkedin_profile (LinkedIn API or scraping service)
lookup_funding_history (Crunchbase API)
search_competitor_analysis (Semrush, SimilarWeb APIs)
get_decision_maker_contact (Hunter.io, RocketReach APIs)

Outreach tools:

draft_email (local LLM call with templates)
personalise_message (inserts company-specific details)
check_email_deliverability (SendGrid, Mailgun APIs)

CRM tools:

read_lead_record (Salesforce/HubSpot read)
update_lead_record (Salesforce/HubSpot write)
create_task (schedule follow-up)
log_activity (record research findings)

Utility tools:

get_current_time (for scheduling)
format_phone_number (normalise formats)
check_domain_validity (validate email domains)

Each tool should have a test suite (unit tests for logic, integration tests for API calls) and monitoring (error rates, latency, cost per call).

State management and memory in sales agents

The memory problem

A sales workflow might take days to complete. An agent researches a lead on Monday. On Tuesday, it drafts an email. On Wednesday, it follows up. Without proper state management, the agent on Wednesday doesn’t know what happened on Monday.

This is the memory problem. LLMs have finite context windows (8K–200K tokens depending on the model). You can’t fit a month of lead history into the context. You need external memory.

Structured session state

The simplest pattern is a structured session state: a JSON object that tracks the lead, the workflow stage, and the actions taken so far.

{
  "lead_id": "lead_12345",
  "company_name": "Acme Corp",
  "workflow_stage": "outreach_drafted",
  "created_at": "2024-11-18T10:00:00Z",
  "last_updated": "2024-11-20T14:32:15Z",
  "actions": [
    {
      "timestamp": "2024-11-18T10:00:00Z",
      "agent": "research_agent",
      "action": "research_company",
      "findings": {
        "founded": 2015,
        "employees": 250,
        "recent_news": "Series B funding announced",
        "decision_makers": ["Alice Johnson (CEO)", "Bob Smith (VP Sales)"]
      }
    },
    {
      "timestamp": "2024-11-20T14:32:15Z",
      "agent": "outreach_agent",
      "action": "draft_email",
      "output": "Hi Alice, I saw that Acme just raised Series B..."
    }
  ]
}

Store this in a database (PostgreSQL, DynamoDB, or Firestore). When an agent starts working on a lead, fetch the session state, pass it to the agent, and update it when the agent finishes.
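A minimal sketch of the fetch, update, and save cycle, using Python's built-in `sqlite3` as a stand-in for PostgreSQL, DynamoDB, or Firestore. The table name, function names, and default state shape are assumptions for illustration.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the session table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_state "
        "(lead_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def load_state(conn: sqlite3.Connection, lead_id: str) -> dict:
    """Fetch the session state, or return a fresh one for a new lead."""
    row = conn.execute(
        "SELECT state FROM session_state WHERE lead_id = ?", (lead_id,)
    ).fetchone()
    if row is None:
        return {"lead_id": lead_id, "workflow_stage": "new", "actions": []}
    return json.loads(row[0])


def save_state(conn: sqlite3.Connection, state: dict) -> None:
    """Insert or overwrite the session state for this lead."""
    conn.execute(
        "INSERT OR REPLACE INTO session_state (lead_id, state) VALUES (?, ?)",
        (state["lead_id"], json.dumps(state)),
    )
    conn.commit()


# Usage: load before the agent runs, save when it finishes.
conn = open_store()
state = load_state(conn, "lead_12345")
state["actions"].append({"agent": "research_agent", "action": "research_company"})
state["workflow_stage"] = "researched"
save_state(conn, state)
```

Storing the state as a single JSON document keeps the sketch short; a production schema would likely add a version column so concurrent writers can be detected.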

Contextual summarisation

As a session grows, the full history becomes too large to pass to the model. Use contextual summarisation: summarise old actions into a narrative, keep recent actions in full detail.

{
  "lead_id": "lead_12345",
  "summary": "Acme Corp (Series B, 250 employees) was researched on Nov 18. Key contacts: Alice Johnson (CEO), Bob Smith (VP Sales). Recent focus: Series B funding announced in Oct 2024.",
  "recent_actions": [
    {
      "timestamp": "2024-11-20T14:32:15Z",
      "agent": "outreach_agent",
      "action": "draft_email",
      "output": "Hi Alice, I saw that Acme just raised Series B..."
    }
  ]
}

When you summarise, you lose detail, but you keep the essential context. The agent can still reason about the lead, and the model has room in its context window for new reasoning.
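The compaction step might look like the Python sketch below. It keeps the most recent actions in full and folds older ones into the summary string. The `one_line` summariser is a deliberately simple placeholder (a production system might call a model instead), and `KEEP_RECENT` is an assumed setting.

```python
KEEP_RECENT = 2  # assumed number of actions kept in full detail


def one_line(action: dict) -> str:
    """Placeholder summariser; a production system might call a model here."""
    day = action["timestamp"][:10]
    return f"{day}: {action['agent']} ran {action['action']}."


def compact_session(session: dict, keep_recent: int = KEEP_RECENT) -> dict:
    """Fold older actions into the summary; keep the newest ones verbatim."""
    actions = session.get("actions", [])
    split = max(len(actions) - keep_recent, 0)
    old, recent = actions[:split], actions[split:]
    parts = [session.get("summary", "")] + [one_line(a) for a in old]
    return {
        "lead_id": session["lead_id"],
        "summary": " ".join(p for p in parts if p),
        "recent_actions": recent,
    }
```

Running compaction on every save, rather than only when the context overflows, keeps the state size predictable.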

Conversation history and turn tracking

For multi-turn workflows (e.g., agent sends an email, prospect replies, agent responds), maintain a conversation history with turn tracking.

{
  "conversation_id": "conv_67890",
  "lead_id": "lead_12345",
  "turns": [
    {
      "turn_number": 1,
      "timestamp": "2024-11-20T14:32:15Z",
      "actor": "agent",
      "message": "Hi Alice, I saw that Acme just raised Series B...",
      "tool_calls": ["draft_email"]
    },
    {
      "turn_number": 2,
      "timestamp": "2024-11-21T09:15:00Z",
      "actor": "prospect",
      "message": "Thanks for reaching out. We're always interested in learning about new solutions."
    },
    {
      "turn_number": 3,
      "timestamp": "2024-11-21T10:00:00Z",
      "actor": "agent",
      "message": "Great! I'd love to show you how we've helped companies like...",
      "tool_calls": ["personalise_message", "schedule_call"]
    }
  ]
}

This structure makes it easy for the agent to understand the conversation context and respond appropriately.

Evaluation and observability at scale

Why evaluation matters

McKinsey’s research on one year of agentic AI emphasises that the difference between a working agent and a production agent is rigorous evaluation. Measure four things: accuracy, latency, cost, and business impact.

Without evaluation, you’re flying blind. You deploy an agent, and two weeks later, you discover it’s been sending personalised emails to the wrong decision makers. Or it’s taking 30 seconds per lead when it should take 3 seconds. Or it’s costing $0.50 per lead when your budget is $0.10.

Building evaluation datasets

Start with a manual evaluation dataset: 100–500 leads that you’ve already researched and validated. For each lead, capture:

Input: The raw lead data (name, company, title, email).
Ground truth: The correct research findings, decision maker, email subject, call-to-action.
Metadata: Lead difficulty (easy, medium, hard), industry, company size.

Use this dataset to evaluate your agent before production. Run the agent on every lead in the dataset. Compare the agent’s output to ground truth. Calculate:

Accuracy: % of leads where the agent’s research matches ground truth.
Completeness: % of leads where the agent found all required information.
Latency: Average time per lead.
Cost: Average token cost per lead.

{
  "evaluation_run": "2024-11-20_production_candidate_v3",
  "dataset_size": 250,
  "results": {
    "research_accuracy": 0.94,
    "decision_maker_accuracy": 0.87,
    "email_quality_score": 0.91,
    "avg_latency_seconds": 4.2,
    "avg_cost_per_lead": 0.087,
    "total_cost": 21.75
  },
  "by_difficulty": {
    "easy": {"accuracy": 0.98, "latency": 2.1},
    "medium": {"accuracy": 0.93, "latency": 4.5},
    "hard": {"accuracy": 0.82, "latency": 7.2}
  }
}

Set thresholds: if accuracy drops below 90%, don’t deploy. If cost per lead exceeds budget, optimise the model routing.
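A small Python sketch of the aggregation and the deployment gate follows. It assumes each evaluated lead has already been scored into a record with `correct`, `complete`, `latency_s`, and `cost_usd` fields; those field names and the $0.10 cost budget are hypothetical.

```python
from statistics import mean

ACCURACY_THRESHOLD = 0.90     # from the deployment rule above
COST_BUDGET_PER_LEAD = 0.10   # hypothetical budget in USD


def summarise_run(records: list) -> dict:
    """Aggregate per-lead results into the four headline metrics."""
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
        "completeness": mean(1.0 if r["complete"] else 0.0 for r in records),
        "avg_latency_seconds": mean(r["latency_s"] for r in records),
        "avg_cost_per_lead": mean(r["cost_usd"] for r in records),
    }


def deployment_blockers(metrics: dict) -> list:
    """Return reasons not to deploy; an empty list means the run passes."""
    blockers = []
    if metrics["accuracy"] < ACCURACY_THRESHOLD:
        blockers.append("accuracy below threshold")
    if metrics["avg_cost_per_lead"] > COST_BUDGET_PER_LEAD:
        blockers.append("cost per lead over budget")
    return blockers
```

Returning the list of blockers, rather than a single pass/fail flag, makes the CI log say why a candidate was rejected.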

Continuous monitoring in production

Once deployed, monitor continuously. For each lead, log:

Agent decisions: Which tools did it call? What was the reasoning?
Tool outputs: What data did each tool return?
Agent output: What did the agent decide to do?
Human feedback: Did a sales rep approve the agent’s output? Did they edit it?

Use human feedback as a signal. If a sales rep edits 30% of the agent’s emails, the outreach agent needs retraining. If a sales rep ignores 50% of the agent’s research findings, the research agent is providing low-value data.

Observability stack

For production agentic workflows, you need:

Structured logging: Log every agent action, tool call, and decision. Use JSON to make logs queryable.
Tracing: Track the full execution path of a workflow. Use tools like Datadog, New Relic, or open-source solutions (Jaeger, Tempo) to visualise the trace.
Metrics: Track latency (p50, p95, p99), error rates, cost per lead, tool success rates.
Dashboards: Build dashboards showing agent health, error trends, cost trends, and business impact (leads processed, conversion rates).
Alerting: Alert on anomalies: error rate spikes, latency degradation, cost overruns, tool failures.
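For the latency metrics, a sketch using only Python's standard library is shown below. In practice your metrics backend usually computes percentiles for you; this simply illustrates what p50, p95, and p99 mean. The 10-second alert limit is an assumed value.

```python
from statistics import quantiles


def latency_percentiles(latencies_s: list) -> dict:
    """Return p50, p95 and p99 from per-lead latencies in seconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, and so on.
    cuts = quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def latency_alert(latencies_s: list, p95_limit_s: float = 10.0) -> bool:
    """True when p95 latency exceeds the limit (the default is an assumption)."""
    return latency_percentiles(latencies_s)["p95"] > p95_limit_s
```

Alerting on p95 rather than the mean catches the slow tail that averages hide, which is usually where stuck tool calls show up first.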

Nebius’s guide on launching production agents at scale discusses evaluation and monitoring in detail.

Common production failures and how to avoid them

Hallucination and false confidence

The failure: An agent confidently states that “Acme Corp was founded in 1987 and has 10,000 employees” when the actual founding year is 2007 and employee count is 100. The agent didn’t look this up; it hallucinated based on training data.

Why it happens: Models are trained to be helpful and confident. They’ll make up plausible-sounding details rather than say “I don’t know.”

How to prevent it:

Require tool use: Design the agent so it must call a tool to answer factual questions. Don’t let it answer from memory.
Validate outputs: After a tool returns data, have the agent verify it. If it finds conflicting information, escalate.
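Use structured outputs: Ask the agent to return a confidence score with each finding (for example, “Confidence: 0.95”). Treat the score as a routing signal, not a guarantee: check it against your evaluation dataset before you rely on it.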
Test rigorously: Your evaluation dataset should include edge cases where hallucination is likely (obscure companies, conflicting data sources).

Tool timeouts and cascading failures

The failure: The research agent calls the LinkedIn API, which times out after 30 seconds. The agent retries. The retry times out. The agent retries again. After 90 seconds, the workflow is still stuck on one lead, and 100 other leads are queued.

Why it happens: Teams often don’t set timeouts on external API calls. Or they set timeouts but don’t implement fallbacks.

How to prevent it:

Hard timeouts on all external calls: 5–10 seconds, no exceptions.
Circuit breakers: If a tool fails 5 times in a row, stop calling it for 5 minutes. Use cached data or a fallback tool instead.
Async processing: Don’t block on slow tools. If a tool is slow, queue it for async processing and continue with other tasks.
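A circuit breaker with the numbers above (5 consecutive failures, 5-minute cooldown) can be sketched in a few lines of Python. This is a simplified, single-process version for illustration; the class and method names are assumptions, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time


class CircuitBreaker:
    """Stop calling a tool after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_s=300.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if the tool may be called right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown over: close the breaker and let calls through again.
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

The caller checks `allow()` before each tool call and switches to cached data or a fallback tool when it returns `False`.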

State corruption and race conditions

The failure: Two agents are processing the same lead simultaneously. Agent A reads the lead record, which has next_action = null. Agent B reads the same record. Agent A sets next_action = "send_email" and writes back. Agent B sets next_action = "schedule_call" and writes back. Agent B’s write overwrites Agent A’s write. Now the lead has the wrong next action.

Why it happens: Distributed systems are hard. Without proper locking or versioning, concurrent writes cause data loss.

How to prevent it:

Optimistic locking: Include a version number or timestamp in the read. When writing, check that the version hasn’t changed. If it has, the write fails, and the agent retries with fresh data.
Serialisation: Process each lead sequentially. Use a queue (SQS, RabbitMQ, Pub/Sub) to ensure only one agent works on a lead at a time.
Audit trails: Log every write. If corruption happens, you can see what went wrong and roll back.
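Optimistic locking can be sketched in Python as below, using an in-memory store as a stand-in for the CRM. The class and function names are hypothetical. In a real database the version check and the write must happen atomically, for example as a single conditional update, which this single-threaded sketch does not attempt to show.

```python
class VersionConflict(Exception):
    """Raised when a record changed between the read and the write."""


class LeadStore:
    """In-memory stand-in for a CRM record store with version numbers."""

    def __init__(self):
        self._records = {}

    def _get(self, lead_id: str) -> dict:
        return self._records.setdefault(lead_id, {"version": 0, "fields": {}})

    def read(self, lead_id: str):
        record = self._get(lead_id)
        return record["version"], dict(record["fields"])

    def write(self, lead_id: str, expected_version: int, updates: dict) -> int:
        record = self._get(lead_id)
        if record["version"] != expected_version:
            raise VersionConflict(f"{lead_id} changed since version {expected_version}")
        record["fields"].update(updates)
        record["version"] += 1
        return record["version"]


def update_with_retry(store: LeadStore, lead_id: str, decide, max_attempts: int = 3) -> int:
    """Re-read and re-decide on conflict instead of overwriting blindly."""
    for _ in range(max_attempts):
        version, fields = store.read(lead_id)
        try:
            return store.write(lead_id, version, decide(fields))
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up on {lead_id} after {max_attempts} attempts")
```

In the scenario above, Agent B's write would be rejected because its version is stale, so it re-reads the record, sees `next_action = "send_email"`, and decides again with that context.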

Model context window exhaustion

The failure: The agent is processing a lead with a 6-month conversation history. The model context window is 8K tokens. The conversation history alone is 7K tokens. There’s only 1K tokens left for the agent’s reasoning. The agent makes poor decisions because it has no room to think.

Why it happens: As workflows mature, context grows. Teams don’t proactively manage context.

How to prevent it:

Context budgeting: Allocate tokens: 30% for input (lead data, history), 50% for reasoning, 20% for output. If input exceeds 30%, summarise or truncate.
Sliding window: Keep only the last N turns of conversation, not the entire history.
Hierarchical summarisation: Summarise old context into a narrative. Keep recent context in full detail.
Model routing: If a workflow needs more context, use a model with a larger window (e.g., a Claude model with a 200K-token window).
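Context budgeting and the sliding window combine naturally: keep the newest turns that fit inside the input share of the window. The Python sketch below assumes a rough four-characters-per-token estimate, which is a heuristic only; use your model provider's tokenizer for real budgets. The function names are hypothetical.

```python
INPUT_PERCENT = 30  # share of the window reserved for input, per the split above


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_history(turns: list, context_window: int) -> list:
    """Keep the most recent turns whose combined size fits the input budget."""
    budget = context_window * INPUT_PERCENT // 100
    kept, used = [], 0
    for turn in reversed(turns):       # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))        # restore chronological order
```

Turns that fall outside the window are the natural input for hierarchical summarisation, so nothing is silently lost.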

Cost explosion

The failure: You deploy an agent that processes 1,000 leads per week. Each lead costs $0.50 in tokens. That’s $500 per week, or $26,000 per year. You budgeted $5,000 per year.

Why it happens: Teams don’t model costs upfront. They optimise for accuracy, not efficiency.

How to prevent it:

Cost modelling: Before deployment, estimate cost per lead. Model different scenarios: 100 leads/week, 1,000 leads/week, 10,000 leads/week.
Model routing: Use cheaper models for simple tasks, expensive models for complex tasks.
Prompt optimisation: Shorter prompts = fewer tokens. Remove unnecessary examples, use concise language.
Caching: If multiple leads need the same research (e.g., “Who are the decision makers at Acme Corp?”), cache the result.
Monitoring: Track cost per lead continuously. Alert if it drifts above budget.
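The cost model is simple enough to script. The Python sketch below reproduces the arithmetic from the failure scenario and inverts it to find the per-lead cost a budget allows. The function names are illustrative.

```python
WEEKS_PER_YEAR = 52


def annual_cost(leads_per_week: int, cost_per_lead_usd: float) -> float:
    """Yearly spend for a given weekly volume and per-lead cost."""
    return leads_per_week * cost_per_lead_usd * WEEKS_PER_YEAR


def max_cost_per_lead(annual_budget_usd: float, leads_per_week: int) -> float:
    """Highest per-lead cost that still fits the annual budget."""
    return annual_budget_usd / (leads_per_week * WEEKS_PER_YEAR)


# Model the three volume scenarios at the $0.50 per lead from the example.
for volume in (100, 1_000, 10_000):
    print(f"{volume:>6} leads/week -> ${annual_cost(volume, 0.50):,.0f} per year")

print(f"target: ${max_cost_per_lead(5_000, 1_000):.3f} per lead")
```

For the scenario above, a $5,000 annual budget at 1,000 leads per week leaves just under $0.10 per lead, which is the number the routing and prompt work has to hit.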

Human-in-the-loop workflows and override patterns

When to escalate to humans

Not every lead should be handled entirely by the agent. Some require human judgment:

High-value deals: If a lead is worth $100K+, a human should review the research and outreach.
Complex scenarios: If a lead has unusual attributes (e.g., a private company with no public funding data), escalate.
Low confidence: If the agent’s confidence score is below a threshold (e.g., 0.7), escalate.
Objections: If a prospect responds with an objection, escalate to a sales rep who can handle it.
Regulatory concerns: If a lead is in a regulated industry (financial services, healthcare), a human should verify compliance.

Define escalation rules explicitly:

{
  "escalation_rules": [
    {
      "condition": "deal_value > 100000",
      "action": "escalate",
      "reason": "High-value deal requires human review"
    },
    {
      "condition": "research_confidence < 0.7",
      "action": "escalate",
      "reason": "Low confidence in research findings"
    },
    {
      "condition": "prospect_response_contains_objection",
      "action": "escalate",
      "reason": "Objection handling requires human skill"
    }
  ]
}
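One way to make rules like these executable is to pair each reason with a predicate, as in the Python sketch below. The rule list mirrors the three conditions above; the lead field names are assumed to match the condition strings, and evaluating the string conditions directly is deliberately avoided.

```python
# Each rule pairs a human-readable reason with a predicate over the lead.
ESCALATION_RULES = [
    {
        "reason": "High-value deal requires human review",
        "check": lambda lead: lead.get("deal_value", 0) > 100_000,
    },
    {
        "reason": "Low confidence in research findings",
        "check": lambda lead: lead.get("research_confidence", 1.0) < 0.7,
    },
    {
        "reason": "Objection handling requires human skill",
        "check": lambda lead: lead.get("prospect_response_contains_objection", False),
    },
]


def escalation_reasons(lead: dict) -> list:
    """Return every reason this lead should go to a human (empty list if none)."""
    return [rule["reason"] for rule in ESCALATION_RULES if rule["check"](lead)]
```

Collecting all matching reasons, rather than stopping at the first, gives the reviewing sales rep the full picture of why the lead was flagged.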

Approval workflows

Before an agent sends an email or calls an API that modifies data, require approval from a human (or a more conservative model).

Pattern 1: Async approval

The agent drafts an email. The email is queued for approval. A sales manager reviews it in Slack or a web interface. Once approved, the agent sends it.

{
  "approval_request": {
    "id": "apr_123456",
    "lead_id": "lead_12345",
    "action": "send_email",
    "draft": "Hi Alice, I saw that Acme just raised Series B...",
    "created_at": "2024-11-20T14:32:15Z",
    "expires_at": "2024-11-20T15:32:15Z",
    "status": "pending"
  }
}

If the manager approves, the agent sends the email. If the manager rejects, capture the reason so it can feed back into the outreach agent’s prompts and examples.

Pattern 2: Confidence-based approval

If the agent’s confidence is above 0.9, send immediately. If it’s between 0.7 and 0.9, queue for approval. If it’s below 0.7, escalate.
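The thresholds translate directly into a small function. In this Python sketch the boundary handling is an assumption, since the text does not say which side 0.9 and 0.7 fall on: exactly 0.9 is queued for approval, and exactly 0.7 is queued rather than escalated.

```python
def approval_decision(confidence: float) -> str:
    """Map a confidence score to one of three actions."""
    if confidence > 0.9:
        return "send"                 # high confidence: send immediately
    if confidence >= 0.7:
        return "queue_for_approval"   # middle band: a human reviews first
    return "escalate"                 # low confidence: hand off entirely
```

Keeping the thresholds in one function makes them easy to tune as override data accumulates.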

Learning from overrides

When a human overrides an agent decision, capture it and use it to improve the agent.

{
  "override": {
    "id": "ovr_789012",
    "approval_request_id": "apr_123456",
    "agent_output": "Hi Alice, I saw that Acme just raised Series B...",
    "human_output": "Hi Alice, Congratulations on the Series B! I'd love to show you how...",
    "reason": "More congratulatory tone, better opening hook",
    "timestamp": "2024-11-20T14:35:00Z"
  }
}

Analyse overrides to find patterns. If 30% of overrides are “tone too formal,” retrain the outreach agent with examples of conversational emails.

Cost control and efficiency optimisation

Token budgeting

Every agent action costs tokens. Research costs tokens. Drafting costs tokens. Reasoning costs tokens. Without budgeting, costs spiral.

Set a token budget per lead:

Research phase: 2,000 tokens max (2–3 tool calls).
Outreach phase: 1,500 tokens max (drafting + personalisation).
CRM update phase: 500 tokens max.
Total: 4,000 tokens per lead.

At $0.003 per 1K tokens (using Sonnet), that’s $0.012 per lead, or $12 per 1,000 leads.

If an agent exceeds the budget, it escalates or uses a fallback (cached data, simpler logic).
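A per-lead budget tracker might look like the Python sketch below. The phase budgets come from the list above; the class name, method names, and the flat $0.003 per 1K price are assumptions for illustration.

```python
PHASE_BUDGETS = {"research": 2000, "outreach": 1500, "crm_update": 500}


class TokenBudget:
    """Track token usage per phase for a single lead."""

    def __init__(self, budgets: dict = PHASE_BUDGETS):
        self.budgets = dict(budgets)
        self.used = {phase: 0 for phase in budgets}

    def charge(self, phase: str, tokens: int) -> bool:
        """Record usage; return False once the phase is over budget."""
        self.used[phase] += tokens
        return self.used[phase] <= self.budgets[phase]

    def cost_usd(self, price_per_1k: float = 0.003) -> float:
        """Cost of all tokens used so far at a flat per-1K price."""
        return sum(self.used.values()) / 1000 * price_per_1k
```

When `charge` returns `False`, the caller stops spending on that phase and switches to the fallback or escalation path described above.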

Batch processing and parallelisation

Processing leads sequentially is slow. Process them in batches.

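from concurrent.futures import ThreadPoolExecutor

# fetch_leads and process_lead are your own functions (not shown here).
leads = fetch_leads(limit=100)
results = []

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(process_lead, lead) for lead in leads]
    for lead, future in zip(leads, futures):
        try:
            # Wait up to 30 seconds for each result
            results.append(future.result(timeout=30))
        except Exception as exc:
            # One failed or slow lead should not abort the whole batch
            results.append({"lead": lead, "error": str(exc)})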

With 10 parallel workers, you process 100 leads in roughly the time it takes to process 10 sequentially. But be careful: parallelisation can cause rate limiting, state corruption, and higher costs. Use connection pooling, implement backoff, and monitor carefully.

Caching and deduplication

Many leads share attributes. If you’ve already researched “Acme Corp,” don’t research it again for the next Acme employee.

Implement a cache:

cache = {
    "company_research": {},  # key: company_name, value: research findings
    "decision_makers": {},   # key: company_name, value: list of decision makers
    "recent_news": {}        # key: company_name, value: news articles
}

def get_company_research(company_name):
    if company_name in cache["company_research"]:
        return cache["company_research"][company_name]
    
    # Not in cache, fetch and cache
    research = call_research_tool(company_name)
    cache["company_research"][company_name] = research
    return research

With a 1,000-lead batch where 30% of leads belong to a company you have already researched, caching saves about 300 API calls and $3–$5 in token costs.

Prompt optimisation

Every token in your prompt costs money. Optimise prompts to be concise without losing clarity.

Before:

You are a sales research agent. Your job is to research companies and identify decision makers. 
You should use tools to find information about the company, including its founding date, 
employee count, recent funding, and key executives. You should also search for recent news 
about the company. Once you have gathered this information, you should identify the most 
relevant decision makers for a sales outreach.

After:

Research a company and identify key decision makers. Use tools to find: founding date, 
employee count, recent funding, executives, and recent news.

The second version is about two-thirds shorter (22 words versus 65) and gives the same instructions.

Security, compliance, and audit readiness

Data handling and privacy

Sales agents handle sensitive data: prospect names, emails, company information, conversation history. You need to protect it.

Principles:

Minimise data: Only collect and store data you need.
Encrypt in transit: Use HTTPS, TLS for all API calls.
Encrypt at rest: Use AES-256 for database encryption.
Access control: Only agents and systems that need data can access it. Use IAM roles and policies.
Data retention: Define how long you keep data. Delete old leads after 90 days.

For Australian businesses, you must comply with the Privacy Act 1988 and Australian Privacy Principles. For financial services, you must comply with APRA CPS 234 and ASIC RG 271. If you work with regulated entities, PADISO’s financial services AI advisory can help you navigate compliance.

Audit trails and logging

Every agent action should be logged:

Who: Which agent, which model, which user initiated the workflow.
What: Which tools were called, what data was accessed, what decisions were made.
When: Timestamp of each action.
Why: The reasoning for the decision (if available).
Outcome: What happened as a result.

{
  "audit_log": {
    "id": "log_123456",
    "timestamp": "2024-11-20T14:32:15Z",
    "agent": "research_agent",
    "model": "claude-3-5-sonnet",
    "lead_id": "lead_12345",
    "action": "call_tool",
    "tool_name": "search_company_news",
    "tool_input": {"company_name": "Acme Corp", "limit": 5},
    "tool_output": {"articles": [...], "total_found": 12},
    "cost_tokens": 342,
    "cost_usd": 0.0031
  }
}

Store logs in a tamper-evident system (write-once storage, immutable logs). Set the retention period from the rules that apply to you; seven years is a common requirement in many jurisdictions.
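One common technique for making a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The Python sketch below illustrates the idea under simple assumptions (JSON-serialisable entries, an in-memory list). It complements write-once storage rather than replacing it, and the function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash: str, entry: dict) -> str:
    payload = json.dumps(entry, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, entry: dict) -> dict:
    """Append an entry whose hash covers the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {"entry": entry, "prev_hash": prev_hash, "hash": _digest(prev_hash, entry)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True
```

Running `verify_chain` on a schedule, and storing the latest hash somewhere the logging system cannot write to, turns silent edits into detectable events.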

Model governance and version control

When you update a model or prompt, you need to track it. Use version control:

{
  "model_version": {
    "id": "mv_v3_2024_11_20",
    "agent": "research_agent",
    "model_name": "claude-3-5-sonnet",
    "model_version": "claude-3-5-sonnet-20241022",
    "system_prompt": "Research a company and identify key decision makers...",
    "tools": ["search_company_news", "fetch_linkedin_profile", ...],
    "created_at": "2024-11-20T10:00:00Z",
    "created_by": "alice@company.com",
    "approval_status": "approved",
    "evaluation_results": {"accuracy": 0.94, "latency": 4.2},
    "deployed_at": "2024-11-20T15:00:00Z"
  }
}

Before deploying a new model version, require approval from a human (manager, compliance officer). Document why the change was made and what the impact is.

SOC 2 and ISO 27001 readiness

If you’re building agentic systems for enterprise customers, they’ll ask: “Do you have a SOC 2 Type II report? Are you ISO 27001 certified?” SOC 2 is an auditor’s attestation report and ISO 27001 is a certification; both show you have controls around security, availability, and confidentiality.

For agentic systems, focus on:

Access control: Who can deploy agents? Who can modify tools? Who can access logs?
Change management: How do you test and approve new agent versions?
Incident response: If an agent makes a mistake or a tool fails, how do you respond?
Monitoring and alerting: How do you detect anomalies?
Data protection: How do you protect prospect data?

PADISO’s AI Quickstart Audit is a fixed-fee, 2-week diagnostic that assesses your AI systems against SOC 2 and ISO 27001 standards. It tells you what’s missing and what to fix first.

Deployment and operational patterns

Local vs. cloud deployment

Local deployment (on-premises): Your infrastructure, your data never leaves your network. Best for regulated industries (financial services, healthcare). Drawback: you manage infrastructure, scaling, and uptime.

Cloud deployment (AWS, Azure, GCP): Managed infrastructure, auto-scaling, high availability. Drawback: data is in the cloud, compliance is shared responsibility.

For sales agents, cloud deployment is typical. Use managed services:

Compute: AWS Lambda, Google Cloud Functions, or Azure Functions for serverless execution.
Orchestration: AWS Step Functions, Google Cloud Workflows, or Temporal for workflow management.
Storage: PostgreSQL (RDS) or DynamoDB for session state and audit logs.
Queues: SQS, Pub/Sub, or RabbitMQ for lead processing.
Monitoring: CloudWatch, Datadog, or New Relic.

CI/CD for agents

Treat agent code like software. Use version control (Git), continuous integration (run tests on every commit), and continuous deployment (auto-deploy to staging, manual approval to production).

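name: Deploy Agent

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes dependencies (including pytest) are listed in requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/ --ignore=tests/integration --ignore=tests/smoke
      - name: Run integration tests
        run: pytest tests/integration/
      - name: Evaluate on test dataset
        run: python eval.py --dataset test_leads_100.json

  deploy-staging:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: hashicorp/setup-terraform@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Deploy to staging
        run: |
          terraform init -input=false
          terraform apply -auto-approve -var-file=staging.tfvars
      - name: Run smoke tests
        run: pytest tests/smoke/

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    # Manual approval: add required reviewers to the "production" environment
    # in the repository settings; this job pauses until a human approves it.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Deploy to production
        # Assumes production has its own Terraform state and a production.tfvars file
        run: |
          terraform init -input=false
          terraform apply -auto-approve -var-file=production.tfvars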
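
The "Evaluate on test dataset" step only protects production if `eval.py` exits non-zero when quality regresses. The guide does not show `eval.py`, so the following is a hypothetical sketch of the gating logic. It assumes each test record holds a `lead` and an `expected` label, that `run_agent` (a placeholder here) returns a predicted label, and that 0.90 is an illustrative threshold.

```python
from typing import Callable


def run_agent(lead: dict) -> str:
    """Placeholder for the real agent call; assumed to return a label."""
    return "qualified" if lead.get("employees", 0) >= 50 else "unqualified"


def accuracy(records: list[dict], predict: Callable[[dict], str]) -> float:
    """Fraction of records where the prediction matches the expected label."""
    if not records:
        raise ValueError("evaluation dataset is empty")
    correct = sum(1 for r in records if predict(r["lead"]) == r["expected"])
    return correct / len(records)


def gate_exit_code(records: list[dict], predict: Callable[[dict], str],
                   min_accuracy: float = 0.90) -> int:
    """Return 0 when the agent meets the threshold, 1 otherwise."""
    score = accuracy(records, predict)
    print(f"accuracy={score:.3f} threshold={min_accuracy:.3f}")
    # A non-zero exit code fails the CI job and blocks the deploy.
    return 0 if score >= min_accuracy else 1
```

In an actual `eval.py`, the script would load the file passed via `--dataset` with `json.load` and finish with `sys.exit(gate_exit_code(records, run_agent))`.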

Gradual rollout and canary deployments

Don’t deploy a new agent version to 100% of traffic immediately. Use a canary deployment:

Deploy to 5% of traffic. Monitor for errors, latency, cost.
If metrics are good, increase to 10%.
Continue increasing until you reach 100%.
If metrics degrade at any point, roll back to the previous version.

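import random


def route_lead_to_agent(lead):
    # Send roughly 5% of leads to the canary and the rest to the stable version.
    # process_with_agent is defined elsewhere in the agent codebase.
    if random.random() < 0.05:  # 5% to canary
        return process_with_agent(lead, version="v3_canary")
    # 95% to stable
    return process_with_agent(lead, version="v2_stable")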

Monitor canary metrics separately. If the canary’s accuracy is 92% and the stable version’s accuracy is 94%, the canary is not ready. Investigate why and fix it.
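
Purely random routing can send the same lead to different versions on a retry, which muddies per-version metrics. One possible refinement, sketched here under assumptions, hashes the lead ID into a stable bucket and makes the promote-or-roll-back decision explicit. The stage percentages, the version names, and the one-point accuracy tolerance are illustrative choices, not values taken from a specific system.

```python
import hashlib

ROLLOUT_STAGES = [5, 10, 25, 50, 100]  # percent of traffic; illustrative


def bucket(lead_id: str) -> int:
    """Map a lead ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(lead_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(lead_id: str, canary_percent: int) -> str:
    """The same lead always gets the same version at a given percentage."""
    return "v3_canary" if bucket(lead_id) < canary_percent else "v2_stable"


def next_canary_percent(current: int, canary_accuracy: float,
                        stable_accuracy: float, tolerance: float = 0.01) -> int:
    """Advance one stage if the canary holds up, otherwise roll back to 0."""
    if canary_accuracy < stable_accuracy - tolerance:
        return 0  # roll back: all traffic returns to the stable version
    later = [p for p in ROLLOUT_STAGES if p > current]
    return later[0] if later else current
```

Because buckets are fixed, a lead that was in the canary at 5% stays in the canary at 10%, so its history is handled by one version throughout the rollout.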

Operational runbooks

Create runbooks for common operational scenarios:

Runbook: Agent latency spike

Check CloudWatch metrics. Is CPU high? Memory high? Network latency high?
Check tool latency. Which tool is slow? Is it timing out?
Check error logs. Are there error spikes?
If a tool is slow, activate fallback (use cached data or a different tool).
If a tool is down, disable it and route to fallback.
If infrastructure is overloaded, scale up (increase Lambda concurrency, add more pods).
Once resolved, post-mortem: why did this happen? How do we prevent it?
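
The "activate fallback" step in this runbook is easier to execute if the fallback path already exists in code. The sketch below is one way it might look, assuming a tool is a plain callable keyed by a lookup string and that serving the last good result is acceptable. `call_with_fallback` and the dict-based cache are hypothetical; a production system would more likely use Redis with expiry.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

# Shared worker pool so a slow tool call can be abandoned by the caller.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(tool: Callable[[str], Any], key: str,
                       cache: dict, timeout_s: float) -> dict:
    """Call `tool(key)`; on timeout or error, serve the last cached result."""
    future = _pool.submit(tool, key)
    try:
        result = future.result(timeout=timeout_s)
    except Exception as exc:  # includes the timeout raised by result()
        future.cancel()  # best effort: a call that already started keeps running
        if key in cache:
            return {"data": cache[key], "source": "cache",
                    "reason": type(exc).__name__}
        raise  # nothing cached, so surface the failure to the caller
    cache[key] = result
    return {"data": result, "source": "live", "reason": None}
```

Logging the `source` field on every call gives the dashboard a direct measure of how often the agent is running on stale data.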

Runbook: Cost overrun

Check cost metrics. Which agent is expensive? Which tool?
Check token usage. Are prompts too long? Are we retrying too much?
Check model routing. Are we using expensive models for simple tasks?
Optimise: shorten prompts, reduce tool calls, use cheaper models.
If cost is still high, reduce traffic to the agent or disable it.
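
The first three checks in this runbook amount to attributing spend to an agent, a tool, or a model. A minimal sketch of that attribution follows, assuming every model call is logged with its agent name, model name, and token counts. The model names and per-token prices are hypothetical placeholders; substitute your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call in USD."""
    price = PRICE_PER_1K[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def cost_by(calls: list[dict], field: str) -> dict[str, float]:
    """Sum cost grouped by a field such as 'agent' or 'model', highest first."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call[field]] += call_cost(
            call["model"], call["input_tokens"], call["output_tokens"]
        )
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Grouping the same call log by `agent` and then by `model` usually shows quickly whether the problem is prompt length, retries, or model routing.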

Monitoring dashboards

Build dashboards that show:

Volume: Leads processed per hour, per day, per week.
Latency: p50, p95, p99 latency per lead.
Accuracy: Research accuracy, email quality, decision-maker identification accuracy.
Cost: Cost per lead, total cost per day, cost trends.
Errors: Error rate, error types, tool failure rates.
Business impact: Leads qualified, conversion rate, revenue influenced.

Update dashboards in real time. Alert on anomalies: error rate > 5%, latency p99 > 10 seconds, cost per lead > budget.
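
The three alert conditions can be expressed as a small check over a window of metrics. This is a sketch under assumptions: the function names are hypothetical, the percentile uses the simple nearest-rank method, and in practice a monitoring platform such as CloudWatch or Datadog would typically evaluate these rules for you.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for coarse alert thresholds."""
    if not values:
        raise ValueError("no values to compute a percentile from")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_alerts(latencies_s: list[float], errors: int, total: int,
                 cost_usd: float, budget_per_lead_usd: float) -> list[str]:
    """Return the names of the alerts that fire for one metrics window."""
    alerts: list[str] = []
    if total == 0:
        return alerts  # nothing processed, nothing to evaluate
    if errors / total > 0.05:
        alerts.append("error_rate")
    if latencies_s and percentile(latencies_s, 99) > 10.0:
        alerts.append("latency_p99")
    if cost_usd / total > budget_per_lead_usd:
        alerts.append("cost_per_lead")
    return alerts
```

Evaluating over a rolling window (for example the last 15 minutes) rather than per request keeps a single slow lead from paging the on-call engineer.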

Putting it all together: A production architecture

Here’s how a production agentic sales workflow fits together:

┌─────────────────────────────────────────────────────────────┐
│ Inbound Leads (Salesforce, HubSpot, API)                    │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ Lead Queue (SQS / Pub/Sub)                                  │
│ - Deduplication: check if lead already processed            │
│ - Prioritisation: high-value deals first                    │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ Coordinator Agent (Lambda / Cloud Function)                 │
│ - Fetch lead from queue                                     │
│ - Fetch session state from database                         │
│ - Decide workflow: research → outreach → follow-up          │
│ - Route to specialist agents                                │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
   Research      Outreach       CRM
   Agent         Agent          Agent
        │            │            │
        └────────────┼────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ Tool Execution Layer                                        │
│ - CRM client (Salesforce, HubSpot)                          │
│ - Research tools (LinkedIn, Crunchbase, news APIs)          │
│ - Email tools (SendGrid, draft engine)                      │
│ - Scheduling tools (Calendly)                               │
│ - Caching layer (Redis)                                     │
│ - Circuit breakers, timeouts, retries                      │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ Session State Database (PostgreSQL / DynamoDB)              │
│ - Lead record, workflow stage, actions taken                │
│ - Conversation history, research findings                   │
│ - Audit trail                                               │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ Approval / Escalation Layer                                 │
│ - High-value deals → human review                           │
│ - Low confidence → escalate                                 │
│ - Objections → sales rep                                    │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
        ▼            ▼            ▼
   Approved     Escalated    Monitoring &
   Actions      to Humans    Observability
        │            │            │
        └────────────┼────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ Outbound Actions                                            │
│ - Send email via SendGrid                                   │
│ - Update CRM (Salesforce, HubSpot)                          │
│ - Schedule call (Calendly)                                  │
│ - Log activity                                              │
└─────────────────────────────────────────────────────────────┘

Each component is independent, testable, and monitorable. If the research agent fails, the outreach agent can still work (using cached research). If the CRM write fails, the action is queued for retry. If a tool times out, a fallback is used.
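
To make the coordinator's routing decision concrete, here is a minimal sketch of the "decide workflow" and escalation logic from the diagram. It assumes the session state records which stages are complete and the confidence of the latest agent output. The `SessionState` fields, the 50,000 deal-value threshold, and the 0.7 confidence floor are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field

WORKFLOW = ["research", "outreach", "follow_up"]  # stage order from the diagram


@dataclass
class SessionState:
    lead_id: str
    deal_value: float
    completed: list[str] = field(default_factory=list)
    confidence: float = 1.0   # confidence of the most recent agent output
    approved: bool = False    # set once a human has reviewed the lead


def next_step(state: SessionState,
              high_value_threshold: float = 50_000,
              min_confidence: float = 0.7) -> str:
    """Return the next specialist stage, 'escalate', or 'done'."""
    if state.confidence < min_confidence:
        return "escalate"  # low confidence goes to a human
    remaining = [s for s in WORKFLOW if s not in state.completed]
    if not remaining:
        return "done"
    # High-value deals need human review before any outbound contact.
    needs_review = state.deal_value >= high_value_threshold and not state.approved
    if remaining[0] != "research" and needs_review:
        return "escalate"
    return remaining[0]
```

Keeping this decision a pure function of the session state makes it straightforward to unit test and to replay against the audit trail when investigating an incident.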

Summary and next steps

Building agentic sales workflows in production is complex. It’s not just about calling an LLM API. It’s about architecture, tool design, state management, evaluation, monitoring, and operational discipline.

Here’s what you need to do:

Week 1–2: Foundation

Define your sales workflow (research → outreach → follow-up).
Choose your tools (CRM, research APIs, email platform).
Design your tool schema (inputs, outputs, error handling).
Build a basic agent loop (coordinator + specialist agents).

Week 3–4: Evaluation

Create a test dataset (100–500 leads with ground truth).
Run your agent on the test dataset.
Measure accuracy, latency, cost.
Identify failure modes and fix them.

Week 5–6: Production readiness

Implement state management (session state database).
Add approval workflows and escalation rules.
Build monitoring and alerting.
Create runbooks for common failures.
Deploy to staging and run smoke tests.

Week 7+: Launch and iterate

Canary deploy to 5% of traffic.
Monitor metrics closely.
Collect human feedback and use it to improve the agent.
Gradually increase to 100%.
Continuously optimise: reduce cost, improve accuracy, increase speed.

If you’re building this at a startup or scaling-stage company, you likely don’t have the in-house expertise to do this alone. PADISO’s platform engineering services and AI & Agents Automation can help you design, build, and deploy production-grade agentic systems. We’ve done this for dozens of companies, and we know the patterns, the pitfalls, and how to get to market fast.

If you’re in financial services, PADISO’s AI advisory for financial services ensures your agents are APRA, ASIC, and AUSTRAC compliant by design. If you’re in insurance, our insurance AI advisory covers claims automation, conduct risk, and underwriting.

For a quick assessment of where you are and what to build first, book an AI Quickstart Audit. It’s a fixed-fee, 2-week diagnostic that tells you exactly what to ship first, what to retire, and what 90 days could unlock.

Or book a call with our Sydney-based AI advisory team to discuss your specific workflow and challenges. We’ll help you ship faster and avoid the pitfalls that slow down most teams.

The future of sales is agentic. The teams that ship production agentic workflows first will have a massive competitive advantage. The time to start is now.

Want to talk through your situation?

Book a 30-minute call with Kevin (Founder/CEO). No pitch - direct advice on what to do next.
