Guide 25 mins

AI Agents in Production: Agentic Engineering Workflows

Production-ready agentic AI workflows: orchestration patterns, tool integration, deployment architecture, and operational lessons from scaling AI agents.

The PADISO Team · 2026-06-09

Table of Contents

  1. What Are Agentic Workflows and Why They Matter
  2. Orchestration Patterns That Scale
  3. Tool Integration and Function Calling
  4. State Management and Memory in Agents
  5. Deployment Architecture for Production
  6. Observability, Logging, and Debugging
  7. Cost Control and Token Optimization
  8. Safety, Guardrails, and Responsible AI
  9. Common Failure Modes and How to Fix Them
  10. Bringing It Together: A Real-World Implementation

What Are Agentic Workflows and Why They Matter {#what-are-agentic-workflows}

An agentic workflow is fundamentally different from a simple chatbot or prompt-response system. Rather than returning an answer in a single pass, an agent operates in a loop: it observes its environment, decides what action to take, executes that action (often calling external tools or APIs), and then repeats until it reaches a goal or terminal state.

This loop-based approach unlocks capabilities that static LLM calls cannot achieve. An agent can break down complex tasks into subtasks, recover from errors, gather missing information, and adapt its strategy based on feedback. In production, this means agents can handle real business logic: reconciling invoices across multiple systems, orchestrating multi-step data pipelines, conducting compliance reviews, or managing customer support escalations with context and nuance.

The shift to agentic systems is not hype. Teams at scale—from fintech to healthcare to e-commerce—are moving beyond single-prompt engineering to multi-step workflows because the business case is clear: agents reduce manual work, improve consistency, and handle edge cases that would otherwise require human intervention. However, shipping agents to production is not the same as running a prototype in a notebook. The engineering discipline required is substantial.

At PADISO, we’ve worked with founders and operators across seed-stage startups and enterprise teams modernising their operations with agentic AI. The patterns that work in production are not always intuitive, and the operational gotchas are real. This guide covers the patterns, architectures, and lessons we’ve seen teams learn—often the hard way.


Orchestration Patterns That Scale {#orchestration-patterns}

Orchestration is the backbone of any agentic system. It defines how the agent loop runs, how decisions are made, and how state flows through the system. There are several proven patterns, each with trade-offs.

The Agentic Loop: Plan-Act-Observe

The simplest and most common pattern is the plan-act-observe loop. The agent:

  1. Plans: Decides what action to take next (or reports it is done)
  2. Acts: Executes the action (calls a tool, queries a database, invokes an API)
  3. Observes: Receives the result and updates its understanding
  4. Repeats: Loops until the goal is achieved or a terminal state is reached

This loop is synchronous by default, which works well for latency-sensitive applications (customer support, real-time data lookup) but can be inefficient for long-running tasks. OpenAI’s tools and agents documentation describes how to structure tool definitions and function calling to support this pattern reliably.

The key implementation detail: the agent must have a clear way to signal “I am done” or “I need help from a human.” Without explicit terminal states, agents can loop indefinitely or get stuck in unproductive cycles.
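To make the terminal-state point concrete, here is a minimal sketch of a plan-act-observe loop in which "done" and "escalate" are explicit action kinds. The names `Action`, `run_loop`, and `toy_policy` are hypothetical, and the toy policy stands in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Action:
    kind: str                                  # "tool", "done", or "escalate"
    tool: str = ""
    args: dict = field(default_factory=dict)
    message: str = ""


def run_loop(decide: Callable[[list], Action],
             tools: dict[str, Callable[..., Any]],
             max_steps: int = 10) -> dict:
    """Plan-act-observe until the policy emits a terminal action."""
    observations: list = []
    for _ in range(max_steps):
        action = decide(observations)                    # plan
        if action.kind in ("done", "escalate"):          # explicit terminal states
            return {"status": action.kind, "message": action.message}
        result = tools[action.tool](**action.args)       # act
        observations.append((action.tool, result))       # observe
    # Running out of steps is treated as a request for human help.
    return {"status": "escalate", "message": "step limit reached"}


def toy_policy(observations: list) -> Action:
    """Stand-in for the LLM: look something up once, then finish."""
    if not observations:
        return Action(kind="tool", tool="lookup", args={"key": "a"})
    return Action(kind="done", message=f"found {observations[-1][1]}")


outcome = run_loop(toy_policy, {"lookup": lambda key: {"a": 1}[key]})
```

Because the step limit maps to "escalate" rather than silently returning, a policy that never signals completion still ends in a defined state.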

Multi-Agent Orchestration

As complexity grows, a single agent often cannot handle everything. Multi-agent systems delegate tasks to specialist agents, each with its own set of tools and expertise. For example:

  • A router agent receives the user request and decides which specialist agent should handle it
  • A data agent handles queries and aggregation across databases
  • A compliance agent checks policies and flags violations
  • A communication agent drafts messages or reports

Frameworks like CrewAI are purpose-built for this pattern. They handle agent-to-agent communication, task delegation, and result aggregation. The trade-off is added latency and complexity in debugging—when something goes wrong, you need to trace across multiple agents and their tool calls.

Hierarchical Planning

For long-running, complex workflows, hierarchical planning works better than flat loops. The agent first breaks the goal into subgoals (planning phase), then executes each subgoal in sequence, with the ability to revise the plan if new information emerges.

Example: An agent tasked with “onboard a new customer” might plan:

  1. Validate customer data
  2. Create account in billing system
  3. Provision infrastructure
  4. Send welcome email
  5. Schedule kickoff call

If step 2 fails (duplicate email), the agent can either retry, skip, or escalate rather than blindly looping. Anthropic’s agentic workflows documentation provides patterns for implementing this kind of structured planning.

Reactive vs. Proactive Agents

Reactive agents respond to external triggers (a user message, a webhook, a scheduled job). Proactive agents can initiate actions without external prompts—for example, monitoring a system and alerting when anomalies are detected.

In production, most agents are reactive for safety reasons: you want a human or a policy layer to initiate high-impact actions. However, hybrid approaches work well: a reactive agent handles the primary workflow, but can trigger proactive sub-agents for background tasks (e.g., gathering data while the main agent waits for user input).


Tool Integration and Function Calling {#tool-integration}

Tools are the agent’s interface to the world. Without tools, an agent is just a text generator. With tools, it becomes an actor.

Designing Tool Schemas

Every tool must have a clear schema that the LLM can understand. The schema defines:

  • Name: What the tool does (e.g., query_database, send_email)
  • Description: Plain English explanation of when to use it
  • Parameters: Input arguments with types and constraints
  • Return type: What the tool outputs

Poor tool design is a common failure point. If your schema is ambiguous, the agent will misuse the tool. For example:

Bad schema:

Tool: "search"
Description: "Search for information"
Parameters: query (string)

Better schema:

Tool: "search_customer_invoices"
Description: "Search for invoices by customer ID, date range, or status. Use this when the user asks about a specific customer's billing history."
Parameters:
  - customer_id (string, required): The unique customer identifier
  - start_date (ISO 8601, optional): Search from this date
  - end_date (ISO 8601, optional): Search until this date
  - status (enum: draft, sent, paid, overdue, optional): Filter by invoice status
Returns: List of invoices with ID, amount, date, and status

The second schema is verbose but precise. The agent knows exactly when to use it and what to expect.
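One way to make the better schema machine-readable is to express its parameters as JSON Schema, which most function-calling APIs accept in some form. The sketch below is an assumption about layout: the wrapper keys (`name`, `description`, `parameters`) follow a common convention but differ between providers, so check your provider's reference before reusing it.

```python
import json

search_customer_invoices = {
    "name": "search_customer_invoices",
    "description": (
        "Search for invoices by customer ID, date range, or status. "
        "Use this when the user asks about a specific customer's billing history."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string",
                            "description": "The unique customer identifier"},
            "start_date": {"type": "string", "format": "date",
                           "description": "Search from this date (ISO 8601)"},
            "end_date": {"type": "string", "format": "date",
                         "description": "Search until this date (ISO 8601)"},
            "status": {"type": "string",
                       "enum": ["draft", "sent", "paid", "overdue"],
                       "description": "Filter by invoice status"},
        },
        "required": ["customer_id"],
        "additionalProperties": False,   # reject parameters the tool does not define
    },
}

# Serialise to confirm the definition is plain JSON that can be sent to an API.
schema_json = json.dumps(search_customer_invoices, indent=2)
```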

Function Calling Reliability

LLMs sometimes hallucinate tool calls—they invent parameters that don’t exist, or call tools that don’t match the task. This is more common with smaller models or when schemas are unclear.

Mitigation strategies:

  1. Validation before execution: Parse the tool call and validate all parameters before executing. If validation fails, return an error to the agent and let it retry.
  2. Constrained generation: Some platforms (like Claude’s tool use) use constrained decoding to ensure the LLM only generates valid tool calls. Use these when available.
  3. Fallback tools: Define a “clarify” tool that the agent can call if it is unsure. This is better than a hallucinated call.
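As an illustration of the first strategy, the sketch below validates a proposed tool call against a small hand-written rule table and returns a list of problems that can be fed back to the agent. The rule format and the `validate_call` name are assumptions made for this example; a JSON Schema validator would serve the same purpose in a larger system.

```python
from datetime import date

# Hypothetical rule table for the invoice search tool described earlier.
SCHEMA = {
    "customer_id": {"type": str, "required": True},
    "start_date": {"type": str, "required": False, "iso_date": True},
    "status": {"type": str, "required": False,
               "enum": {"draft", "sent", "paid", "overdue"}},
}


def validate_call(args: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the call is safe to run."""
    errors = [f"unknown parameter: {name}" for name in args if name not in schema]
    for name, rule in schema.items():
        if name not in args:
            if rule["required"]:
                errors.append(f"missing required parameter: {name}")
            continue
        value = args[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name} must be {rule['type'].__name__}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name} must be one of {sorted(rule['enum'])}")
        if rule.get("iso_date"):
            try:
                date.fromisoformat(value)
            except ValueError:
                errors.append(f"{name} must be an ISO 8601 date")
    return errors
```

Returning every problem at once, rather than failing on the first, gives the agent enough information to correct the call in a single retry.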

Tool Composition and Chains

Often, a single tool call is not enough. The agent needs to chain multiple tools. For example:

  1. Call get_customer_data to fetch customer info
  2. Call check_inventory with the customer’s preferred products
  3. Call calculate_discount based on customer tier and inventory levels
  4. Call create_order with the final details

Frameworks like LangChain provide abstractions for building tool chains. However, in production, explicit orchestration is often clearer than implicit chaining. Define the sequence of tool calls in code, not in the agent’s reasoning.
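A minimal sketch of explicit orchestration for the four-step chain above might look like this. The tool bodies are stubs with made-up data and discount rules; only the tool names come from the example, and `place_order` is a hypothetical wrapper.

```python
def get_customer_data(customer_id):
    return {"id": customer_id, "tier": "gold", "preferred": ["widget"]}


def check_inventory(products):
    stock = {"widget": 12, "gadget": 0}          # stubbed stock levels
    return {p: stock.get(p, 0) for p in products}


def calculate_discount(tier, inventory):
    base = {"gold": 0.10, "silver": 0.05}.get(tier, 0.0)
    # Illustrative rule: extra 5% when every requested product is well stocked.
    bonus = 0.05 if inventory and all(q >= 10 for q in inventory.values()) else 0.0
    return round(base + bonus, 2)


def create_order(customer_id, items, discount):
    return {"customer_id": customer_id, "items": items, "discount": discount}


def place_order(customer_id):
    """The sequence lives in code, so each step and its failure path is visible."""
    customer = get_customer_data(customer_id)
    inventory = check_inventory(customer["preferred"])
    available = [p for p, qty in inventory.items() if qty > 0]
    if not available:
        return {"status": "failed", "reason": "nothing in stock"}
    discount = calculate_discount(customer["tier"], inventory)
    return {"status": "ok", "order": create_order(customer_id, available, discount)}
```

The agent can then be given `place_order` as a single tool, which removes four opportunities for it to call the steps in the wrong order.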

Handling Tool Failures

Tools fail in production. Networks are unreliable, databases go down, APIs rate-limit. The agent must handle these gracefully.

For each tool, define:

  • Timeout: How long to wait before giving up
  • Retry logic: Should we retry immediately, with backoff, or escalate?
  • Fallback behavior: If the tool fails, what does the agent do? (e.g., use cached data, skip the step, ask a human)

Example:

# call_tool and log_error are application-specific helpers.
def call_tool_with_fallback(tool_name, params):
    """Call a tool; retry once on timeout, then report failure to the agent."""
    try:
        return call_tool(tool_name, params, timeout=5)
    except TimeoutError:
        # Retry once with longer timeout
        try:
            return call_tool(tool_name, params, timeout=15)
        except Exception as e:
            log_error(f"Tool {tool_name} failed after retry: {e}")
            return {"status": "failed", "message": "Tool unavailable, escalating to human"}
    except Exception as e:
        # Non-timeout errors are not retried
        log_error(f"Tool {tool_name} failed: {e}")
        return {"status": "failed", "message": str(e)}

This is not glamorous, but it prevents agents from getting stuck.


State Management and Memory in Agents {#state-management}

Agents need memory to function effectively. Without it, every loop iteration starts from scratch, and the agent cannot learn from its past actions or maintain context across a long conversation.

Short-Term Memory: The Conversation History

The simplest form of memory is the conversation history—the list of all messages exchanged with the agent. This is passed to the LLM with each inference, so the agent can see what it has already tried.

In production, conversation history has costs:

  1. Token cost: Longer histories consume more tokens, increasing latency and cost
  2. Context window limits: Very long histories exceed the LLM’s context window, forcing you to summarize or truncate
  3. Relevance: Old information in the history can distract the agent from the current task

Mitigation:

  • Summarization: Periodically summarize the conversation and replace the full history with the summary
  • Windowing: Keep only the last N messages
  • Hierarchical memory: Store detailed history separately; pass only a summary to the LLM

Long-Term Memory: Knowledge Bases and Embeddings

For agents that need to remember facts across sessions (e.g., “this customer always prefers email over phone”), use a knowledge base. Store facts as embeddings and retrieve relevant facts at inference time.

LlamaIndex provides tooling for building and querying knowledge bases. The pattern is:

  1. Index: Convert facts into embeddings and store in a vector database
  2. Query: At runtime, embed the current task and retrieve similar facts
  3. Augment: Pass retrieved facts to the agent as context

This is more scalable than storing everything in the conversation history, but it adds latency (embedding + retrieval) and requires careful tuning of retrieval parameters.

State Machines and Explicit State Tracking

For workflows with clear stages (e.g., “awaiting customer input” → “processing” → “complete”), use an explicit state machine. Define states, transitions, and the valid actions in each state.

Example:

class OrderProcessingAgent:
    states = ["created", "validating", "processing", "shipped", "complete"]
    
    def __init__(self, order_id):
        self.order_id = order_id
        self.state = "created"
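        self.history = []  # (state_before, action, state_after) tuples
    
    def step(self, action):
        previous = self.state
        if self.state == "created":
            if action == "validate":
                self.state = "validating"
        elif self.state == "validating":
            if action == "pass":
                self.state = "processing"
            elif action == "fail":
                self.state = "created"  # Retry
        # ... more transitions
        # Record the state before and after, so a replay shows which
        # transition was taken (or that the action was ignored)
        self.history.append((previous, action, self.state))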

Explicit state machines prevent the agent from taking invalid actions and make debugging easier. The downside is that you must define all states and transitions upfront.

Distributed State and Persistence

In production, agents often run across multiple processes or machines. State must be persisted to a database so that if the agent crashes, it can resume from where it left off.

For each agent run, store:

  • Agent ID and run ID
  • Current state
  • Conversation history (or a reference to it)
  • Tool call results and timestamps
  • Any errors or exceptions

Use transactions to ensure atomicity: either the entire step (state update + history append) succeeds, or none of it does. This prevents inconsistencies if a crash occurs mid-step.
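The atomicity requirement can be sketched with the standard-library `sqlite3` module, whose connection object acts as a transaction context manager. The table layout and the `record_step` helper are assumptions for illustration; a production system would more likely use Postgres or similar, but the pattern is the same.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runs (
        run_id TEXT PRIMARY KEY,
        agent_id TEXT,
        state TEXT CHECK (state IN
            ('created', 'validating', 'processing', 'shipped', 'complete'))
    );
    CREATE TABLE steps (
        run_id TEXT REFERENCES runs(run_id),
        seq INTEGER, action TEXT, result TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
""")
conn.execute("INSERT INTO runs VALUES ('run_456', 'agent_123', 'created')")
conn.commit()


def record_step(conn, run_id, seq, action, result, new_state):
    """Write the history row and the state change in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO steps (run_id, seq, action, result) VALUES (?, ?, ?, ?)",
            (run_id, seq, action, json.dumps(result)),
        )
        conn.execute("UPDATE runs SET state = ? WHERE run_id = ?",
                     (new_state, run_id))


record_step(conn, "run_456", 1, "validate", {"ok": True}, "validating")
```

If the state update fails (here, because of the `CHECK` constraint), the history row written just before it is rolled back too, so the two tables never disagree.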


Deployment Architecture for Production {#deployment-architecture}

Moving an agent from a notebook to production requires infrastructure decisions.

Synchronous vs. Asynchronous Execution

Synchronous: The user or client waits for the agent to complete. The agent loop runs in the request handler.

  • Pros: Simple to implement, immediate feedback
  • Cons: Limited by HTTP request timeouts (typically 30-60 seconds); blocks resources

Asynchronous: The user submits a task, and the agent runs in the background. The user polls or gets notified when done.

  • Pros: Supports long-running tasks, better resource utilization
  • Cons: More complex; requires job queue, polling, or webhooks

For agents that complete in <5 seconds, synchronous is fine. For anything longer, go async.

Typical async architecture:

  1. User submits task via API → stored in database
  2. Task enqueued to a job queue (e.g., Celery, BullMQ)
  3. Worker picks up task, runs agent loop
  4. Results stored in database
  5. User polls /status/{task_id} or receives webhook notification
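The five steps above can be sketched with only the standard library. Here an in-process `queue.Queue` stands in for Celery or BullMQ and a dict stands in for the database; `submit`, `status`, and `run_agent` are hypothetical names, and `run_agent` is a placeholder for the real agent loop.

```python
import queue
import threading
import uuid

tasks: dict[str, dict] = {}     # stands in for the database
jobs: queue.Queue = queue.Queue()   # stands in for the job queue


def submit(payload: dict) -> str:
    """Steps 1-2: store the task, then enqueue it."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "queued", "payload": payload, "result": None}
    jobs.put(task_id)
    return task_id


def run_agent(payload: dict) -> dict:
    return {"echo": payload}    # placeholder for the real agent loop


def worker() -> None:
    """Steps 3-4: pick up tasks, run the agent, store the result."""
    while True:
        task_id = jobs.get()
        tasks[task_id]["status"] = "running"
        try:
            tasks[task_id]["result"] = run_agent(tasks[task_id]["payload"])
            tasks[task_id]["status"] = "complete"
        except Exception as exc:
            tasks[task_id]["result"] = {"error": str(exc)}
            tasks[task_id]["status"] = "failed"
        finally:
            jobs.task_done()


def status(task_id: str) -> dict:
    """Step 5: what a /status/{task_id} handler would return."""
    task = tasks[task_id]
    return {"status": task["status"], "result": task["result"]}


threading.Thread(target=worker, daemon=True).start()
task_id = submit({"invoice_id": "inv_1"})
jobs.join()   # demo only; a real client would poll status() instead
```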

Containerization and Scaling

Deploy agents in containers (Docker) so they can be versioned, tested, and scaled independently. Each agent type (or agent + LLM combination) can have its own container.

For scaling, use an orchestrator:

  • Kubernetes: Full control, handles auto-scaling, but operational overhead
  • Serverless (AWS Lambda, Google Cloud Functions): Simple for short-running agents (<15 min), but limited customization
  • Managed platforms (Modal, Replicate): Middle ground; good for ML workloads

Choose based on your task duration and traffic pattern. Short, bursty tasks → serverless. Long, predictable tasks → Kubernetes.

API Gateway and Rate Limiting

Agents consume tokens and compute. Without rate limiting, a single user can exhaust your budget or cause denial of service.

Implement:

  • Per-user rate limits: Max tasks per minute
  • Per-task cost limits: Abort if a task consumes >N tokens
  • Global rate limits: Max concurrent agents

Use a gateway (Kong, Envoy, or cloud provider’s API Gateway) to enforce these before tasks reach the agent.
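When a gateway is not yet in place, a per-user limit can be approximated in application code. This is a sketch of a sliding-window limiter held in memory; the `RateLimiter` class is hypothetical, and the optional `now` argument exists only to make the behaviour easy to test.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_tasks per user within a sliding time window."""

    def __init__(self, max_tasks: int, window_seconds: float = 60.0):
        self.max_tasks = max_tasks
        self.window = window_seconds
        self.hits = defaultdict(deque)   # user_id -> timestamps of accepted tasks

    def allow(self, user_id: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self.hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_tasks:
            return False
        recent.append(now)
        return True
```

State held in one process does not survive restarts or scale across workers, which is one reason to move this check to a gateway or a shared store.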

Monitoring and Alerting

Set up alerts for:

  • Agent failures: If >5% of tasks fail in a 5-minute window
  • Latency: If p99 latency exceeds threshold
  • Cost overruns: If token consumption spikes
  • Tool failures: If a specific tool fails repeatedly

Teams often overlook cost monitoring. A single agent loop that calls an expensive API 100 times can rack up bills quickly. Log every tool call with its cost and set budgets per agent, per user, per day.

Versioning and Rollback

Agents change frequently. You might update the system prompt, add new tools, or switch LLM models. Version everything:

  • Agent version: Increment when you change the prompt, tools, or logic
  • LLM version: Track which model version was used
  • Tool versions: If a tool’s schema or behavior changes, increment

Store all versions in git and deploy via CI/CD. If a new version causes problems, rollback is a single command.


Observability, Logging, and Debugging {#observability}

When an agent fails or produces a wrong answer, debugging is hard. You need visibility into every step.

Structured Logging

Log every significant event in structured format (JSON). Include:

{
  "timestamp": "2025-01-15T10:30:45Z",
  "agent_id": "agent_123",
  "run_id": "run_456",
  "event_type": "tool_call",
  "tool_name": "query_database",
  "tool_input": {"customer_id": "cust_789"},
  "tool_output": {"status": "success", "rows": 5},
  "duration_ms": 234,
  "tokens_used": {"input": 120, "output": 45}
}

Structured logs are queryable. You can filter by agent, tool, or outcome and analyze patterns.
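A small helper keeps every event in the same shape. The sketch below uses only the standard library; `log_event` is a hypothetical name, and the field names mirror the example record above.

```python
import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))


def log_event(event_type: str, agent_id: str, run_id: str, **fields) -> str:
    """Emit one JSON object per line and return it for convenience."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "agent_id": agent_id,
        "run_id": run_id,
        "event_type": event_type,
        **fields,
    }
    line = json.dumps(record, default=str)   # default=str covers non-JSON types
    logger.info(line)
    return line


line = log_event(
    "tool_call", "agent_123", "run_456",
    tool_name="query_database",
    tool_input={"customer_id": "cust_789"},
    duration_ms=234,
)
```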

Tracing Agent Loops

Use distributed tracing (e.g., OpenTelemetry, Datadog) to visualize the agent’s execution path. Each tool call becomes a span, and you can see the full timeline:

Agent Start
├─ Tool: validate_input (50ms)
├─ Tool: fetch_data (200ms)
├─ Tool: process_data (150ms)
└─ Tool: save_result (100ms)
Agent End (500ms total)

Traces make it obvious where time is spent and where bottlenecks occur.

Logging LLM Calls

Log every LLM call with:

  • Prompt: The full input to the LLM (watch for sensitive data)
  • Completion: The full output
  • Model and parameters: Which model, temperature, max_tokens, etc.
  • Tokens and cost: How many tokens, what it cost

Do not log sensitive data (passwords, API keys, PII). Implement a redaction layer if needed.
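A redaction layer can start as a handful of regular expressions applied before anything is written to a log. The patterns below are illustrative assumptions, not a complete PII detector: they cover email addresses, keys with a few common prefixes, and card-like digit runs.

```python
import re

# Illustrative patterns only; real deployments need a broader, tested set.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b(?:sk|pk|api)[-_][A-Za-z0-9_-]{16,}\b"), "[API_KEY]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
]


def redact(text: str) -> str:
    """Replace anything matching a sensitive pattern with a placeholder label."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text
```

Applying `redact` to prompts and completions at the logging boundary keeps the rest of the agent code unchanged.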

Debugging Failed Runs

When an agent produces a wrong answer or gets stuck, you need to replay the run. Store:

  1. All inputs (user message, context, tool results)
  2. All LLM calls and responses
  3. All state transitions

Then, in a notebook or debug environment, replay the run step-by-step. Often, you’ll spot the issue immediately (e.g., the agent misunderstood the task, or a tool returned unexpected data).

User Feedback and Continuous Improvement

In production, collect user feedback on agent outputs. Did the agent help? Was the answer correct? Use this feedback to:

  1. Identify failure modes (e.g., “agent always fails on invoices with special characters”)
  2. Improve the prompt or tool definitions
  3. Collect training data for fine-tuning

Implement a simple feedback UI: thumbs up/down, or a form to report issues. Even 1% feedback rate is valuable.


Cost Control and Token Optimization {#cost-control}

Running agents at scale is expensive. Each LLM call consumes tokens, and tokens cost money. A single agent loop might make 5-10 tool calls, each requiring an LLM inference. Costs add up fast.

Token Budgeting

Before deploying an agent, estimate tokens per run:

  1. System prompt: ~500 tokens (typical)
  2. User input: ~100 tokens
  3. Tool definitions: ~1000 tokens (schemas for all tools)
  4. Per loop iteration: ~200 tokens (agent reasoning + tool results)
  5. Typical loop count: 3-5 iterations

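Estimate: 500 + 100 + 1000 + (200 × 4) = 2400 tokens per run. This counts the system prompt and tool definitions once; most APIs resend the full context on every LLM call, so treat it as a lower bound.

At $0.01 per 1K input tokens (GPT-4 Turbo pricing), that’s $0.024 per run. Scale to 1000 runs/day: $24/day or $720/month. Now add error cases, retries, and debugging, and you’re at $1000+/month for a single agent.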
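The arithmetic is easy to keep in a small helper so it can be rerun as prompts and prices change. The first function below follows the additive estimate used here; the second is an added assumption that models each LLM call resending the accumulated context, which is how most chat APIs bill input tokens. Both function names are hypothetical.

```python
def estimate_additive(system=500, user=100, tools=1000,
                      per_iteration=200, iterations=4, price_per_1k=0.01):
    """Fixed context counted once, plus per-iteration tokens."""
    tokens = system + user + tools + per_iteration * iterations
    return tokens, tokens / 1000 * price_per_1k


def estimate_resending(system=500, user=100, tools=1000,
                       per_iteration=200, iterations=4, price_per_1k=0.01):
    """Each call resends the fixed context plus everything added so far."""
    fixed = system + user + tools
    tokens = sum(fixed + per_iteration * k for k in range(iterations))
    return tokens, tokens / 1000 * price_per_1k


additive_tokens, additive_cost = estimate_additive()
resend_tokens, resend_cost = estimate_resending()
monthly_cost = additive_cost * 1000 * 30   # 1000 runs/day for 30 days
```

Comparing the two results shows how quickly resent context comes to dominate the bill as the iteration count grows.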

Optimize:

  • Smaller models: GPT-3.5 or Claude 3 Haiku cost a small fraction of GPT-4 Turbo per token (about 20x less for GPT-3.5 at the prices quoted in this guide)
  • Fewer tool definitions: Only include tools relevant to the task
  • Summarize context: Don’t pass the entire conversation history; summarize it
  • Batch processing: Process multiple tasks in one LLM call if possible

Caching and Memoization

If the agent often processes similar tasks, cache results. For example:

  • Cache customer data: “I already fetched this customer’s info 5 minutes ago, reuse it”
  • Cache tool results: “This database query returned the same result last time, don’t repeat it”

Implement a simple cache (in-memory or Redis) with TTL (time-to-live). Be careful with freshness: stale data can cause wrong decisions.
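A minimal in-memory version of such a cache is sketched below. `TTLCache` and `get_or_fetch` are hypothetical names, and the injectable `clock` is there so expiry can be tested without sleeping; Redis with key expiry would play the same role across processes.

```python
import time


class TTLCache:
    """Cache values for ttl_seconds, then refetch on the next access."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}   # key -> (stored_at, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]              # still fresh
        value = fetch()                  # missing or stale: call the tool
        self._store[key] = (now, value)
        return value
```

Choosing the TTL per tool is the freshness trade-off in practice: customer profiles may tolerate minutes, while inventory or balances may tolerate none.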

Model Selection and Fine-Tuning

Different models have different costs and capabilities:

  • GPT-4 Turbo: Most capable, most expensive (~$0.01/1K input tokens)
  • GPT-3.5: Cheaper (~$0.0005/1K input tokens), good for simple tasks
  • Claude 3 Opus: Competitive with GPT-4, good reasoning
  • Claude 3 Haiku: Very cheap, good for simple classification
  • Open-source (Llama 2, Mistral): Run locally, no per-token API cost (you pay for the compute instead)

For agents, start with a capable model (GPT-4 or Claude Opus) to get the logic right. Once you understand the task, experiment with cheaper models. Often, a smaller model with better prompting beats a larger model.

Fine-tuning is rarely needed for agents (you’re better off improving the prompt), but if you have 100+ examples of correct behavior, fine-tuning can improve accuracy and reduce cost.

Async Processing and Batch APIs

If latency is not critical, use batch APIs. OpenAI and Anthropic both offer batch endpoints at a 50% discount. You submit a set of requests, and results come back asynchronously, typically within 24 hours.

Batch mode is great for:

  • Nightly report generation
  • Bulk data processing
  • Scheduled agents that don’t need immediate results

Not suitable for:

  • Real-time customer-facing agents
  • Interactive workflows

Safety, Guardrails, and Responsible AI {#safety-guardrails}

Agents have agency: they can call APIs, modify databases, send messages. Without guardrails, they can cause harm.

Input Validation

Before the agent even starts, validate the user input:

  • Length: Reject inputs >10K characters (prevents token explosion)
  • Format: Ensure input matches expected format
  • Content: Check for prompts trying to manipulate the agent (e.g., “ignore your instructions”)

This is not foolproof, but it catches obvious attacks.

Tool Authorization

Not all agents should have access to all tools. Implement role-based access:

  • Admin agent: Can delete users, modify policies
  • Support agent: Can view customer data, create tickets (but not delete)
  • Data agent: Can query databases (read-only)

Check authorization before executing each tool call. If the agent tries to call an unauthorized tool, reject it and inform the agent.
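A sketch of that check, using the three roles above: the role-to-tool mapping, the tool names, and the `execute_tool_call` wrapper are all assumptions for illustration.

```python
# Hypothetical allow-list per agent role.
ROLE_TOOLS = {
    "admin": {"view_customer", "create_ticket", "delete_user",
              "modify_policy", "query_database"},
    "support": {"view_customer", "create_ticket"},
    "data": {"query_database"},
}


def authorize(role: str, tool_name: str) -> bool:
    """Unknown roles get no tools at all."""
    return tool_name in ROLE_TOOLS.get(role, set())


def execute_tool_call(role: str, tool_name: str, args: dict, registry: dict) -> dict:
    """Check authorization first; return a denial the agent can read and act on."""
    if not authorize(role, tool_name):
        return {"status": "denied",
                "message": f"Role '{role}' may not call '{tool_name}'."}
    return {"status": "ok", "result": registry[tool_name](**args)}
```

Returning the denial as a normal tool result, rather than raising, lets the agent see why the call was rejected and choose another approach.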

Action Confirmation

For high-impact actions (delete, transfer funds, send to external system), require explicit confirmation before execution.

Pattern:

  1. Agent decides to execute action (e.g., “I will delete this user”)
  2. System does not execute; instead, asks for confirmation: “Are you sure you want to delete user_123?”
  3. Human approves or rejects
  4. If approved, agent executes; if rejected, agent tries alternative approach

This prevents accidental harm from agent mistakes.
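The confirmation pattern amounts to parking the call until a human responds. In this sketch the set of high-impact tools, the in-memory `pending` store, and the function names are assumptions; a real system would persist pending approvals and authenticate the approver.

```python
import uuid

HIGH_IMPACT = {"delete_user", "transfer_funds", "send_external"}
pending = {}   # approval_id -> (tool_name, args); in memory for the sketch


def request_action(tool_name: str, args: dict, registry: dict) -> dict:
    """Run low-impact tools immediately; park high-impact ones for approval."""
    if tool_name not in HIGH_IMPACT:
        return {"status": "executed", "result": registry[tool_name](**args)}
    approval_id = str(uuid.uuid4())
    pending[approval_id] = (tool_name, args)
    return {"status": "awaiting_approval", "approval_id": approval_id,
            "prompt": f"Are you sure you want to run {tool_name} with {args}?"}


def resolve(approval_id: str, approved: bool, registry: dict) -> dict:
    """Called once the human approves or rejects the parked action."""
    tool_name, args = pending.pop(approval_id)
    if not approved:
        return {"status": "rejected",
                "message": "Human rejected the action; try another approach."}
    return {"status": "executed", "result": registry[tool_name](**args)}
```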

Hallucination Detection

Agents sometimes make up facts or claim to have done things they haven’t. Detect this by:

  1. Cross-checking: After the agent claims to have done something, verify it actually happened
  2. Confidence scoring: Ask the agent how confident it is in its answer; if low, escalate
  3. Fact verification: For factual claims, check against a knowledge base

If hallucination is detected, inform the user and ask the agent to correct itself.

Audit Trails

For compliance reasons (especially in regulated industries), log every action the agent takes:

  • What action was taken
  • When
  • By which agent
  • With what authorization
  • What was the result

Store audit logs immutably (e.g., in a database with no delete permissions). This gives auditors a verifiable record of what each agent did and under what authorization.

If you’re pursuing SOC 2 or ISO 27001 compliance, audit trails are essential. Many teams use Vanta to automate evidence collection for compliance audits.

Responsible AI and Bias

Agents can amplify biases in training data. For example, if your agent recommends loan approvals and was trained on biased historical data, it is likely to reproduce that discrimination in its recommendations.

Mitigations:

  1. Diverse training data: Ensure training data represents diverse groups
  2. Fairness metrics: Measure agent performance across demographic groups
  3. Human review: For high-stakes decisions, always have a human in the loop
  4. Transparency: Tell users that an agent made a decision; explain why

For agents making decisions about people (hiring, lending, content moderation), this is not optional: it is an ethical requirement and, in many jurisdictions, a legal one.


Common Failure Modes and How to Fix Them {#failure-modes}

We’ve seen these problems repeatedly in production. Here’s how to fix them.

Agent Loops Forever

Problem: The agent gets stuck in an infinite loop, repeating the same action and getting the same result.

Cause: No terminal condition, or the agent doesn’t recognize when it’s done.

Fix:

  1. Define explicit terminal states (“done”, “escalated”, “failed”)
  2. Limit loop iterations: if >10 iterations, force termination
  3. Detect loops: if the agent repeats the same tool call, break the loop and escalate

Example:

max_iterations = 10
iteration_count = 0
while not agent.is_done() and iteration_count < max_iterations:
    action = agent.decide_action()
    result = execute_action(action)
    agent.observe(result)
    iteration_count += 1

# Escalate only if the agent is still unfinished; a run that completes on
# the last allowed iteration should not be escalated
if not agent.is_done():
    escalate_to_human("Agent exceeded max iterations")
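The loop above caps iterations but does not cover the third fix, detecting repeated tool calls. One way to sketch that check is to count identical (tool, arguments) pairs; the `LoopDetector` class and its threshold are assumptions for illustration.

```python
import json
from collections import Counter


class LoopDetector:
    """Flag a run once the same tool call has been made max_repeats times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def is_looping(self, tool_name: str, args: dict) -> bool:
        # Sort keys so argument order does not hide a repeat.
        key = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats
```

Calling `is_looping` before each tool execution lets the run break out and escalate well before the iteration cap is reached.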

Tool Misuse

Problem: The agent calls the wrong tool, or calls a tool with wrong parameters.

Cause: Unclear tool schema, or the agent misunderstood the task.

Fix:

  1. Improve tool schema (be specific, include examples)
  2. Validate parameters before execution; return clear error if invalid
  3. Provide examples in the system prompt: “When the user asks for a customer’s invoices, call the get_invoices tool like this: …”
  4. If the agent keeps misusing a tool, remove it or rename it for clarity

High Token Consumption

Problem: A single agent run consumes 50K+ tokens, costing $0.50+.

Cause: Long conversation history, verbose tool results, or the agent is looping and retrying.

Fix:

  1. Summarize conversation history every 5-10 messages
  2. Truncate tool results: instead of returning 1000 rows, return top 10 + a summary
  3. Implement token budgets: if a single run exceeds budget, escalate
  4. Use cheaper models for simple tasks
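Truncating tool results (fix 2) can be done at the tool boundary, before the result reaches the model. The `truncate_rows` helper below is a hypothetical sketch that returns the first rows plus enough metadata for the agent to know data was omitted.

```python
def truncate_rows(rows: list, limit: int = 10) -> dict:
    """Return the first `limit` rows and a short summary of what was left out."""
    shown = rows[:limit]
    omitted = len(rows) - len(shown)
    note = (f"Showing {len(shown)} of {len(rows)} rows; refine the query to see others."
            if omitted else f"Showing all {len(rows)} rows.")
    return {"rows": shown, "total_rows": len(rows), "omitted": omitted, "note": note}
```

If the tool can sort by relevance first, the rows that survive truncation are the ones most likely to matter to the task.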

Agent Produces Wrong Answer

Problem: The agent’s output is factually incorrect or doesn’t match the user’s intent.

Cause: Misunderstood the task, used stale data, or hallucinated.

Fix:

  1. Improve the system prompt: be explicit about the task and expected output format
  2. Provide examples in the prompt: “For input X, the correct output is Y”
  3. Verify data freshness: if tool results are >1 hour old, refresh
  4. Add a verification step: after the agent produces an answer, have it double-check

Slow Agent Execution

Problem: Agent takes 30+ seconds per run.

Cause: Slow tool calls, many sequential tool calls, or large model inference time.

Fix:

  1. Profile: log duration of each tool call and LLM call
  2. Parallelize: if multiple tools are independent, call them concurrently
  3. Optimize tools: add caching, indexes, or simpler queries
  4. Use smaller, faster models: GPT-3.5 is 2-3x faster than GPT-4
  5. Reduce context: fewer tool definitions and shorter conversation history = faster inference
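For the parallelization fix, independent I/O-bound tool calls can run concurrently with the standard-library thread pool. The two fetch functions are stubs that sleep to simulate network latency, and `gather_context` is a hypothetical wrapper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_customer(customer_id):
    time.sleep(0.2)                      # simulated network latency
    return {"id": customer_id}


def fetch_invoices(customer_id):
    time.sleep(0.2)                      # simulated network latency
    return [{"invoice": "inv_1", "customer_id": customer_id}]


def gather_context(customer_id):
    """Run two independent tool calls at the same time instead of back to back."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        customer = pool.submit(fetch_customer, customer_id)
        invoices = pool.submit(fetch_invoices, customer_id)
        return {"customer": customer.result(timeout=5),
                "invoices": invoices.result(timeout=5)}


context = gather_context("cust_789")
```

This only helps when the calls do not depend on each other's output; a chain where one result feeds the next still has to run in sequence.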

Agent Escalates Everything

Problem: The agent escalates to a human for every task, making it useless.

Cause: Over-cautious guardrails, or the agent lacks confidence.

Fix:

  1. Review escalation logic: is it too strict?
  2. Improve the prompt: give the agent permission to act independently
  3. Add more examples: show the agent cases where it should act without escalation
  4. Reduce uncertainty: provide more context or better tools

Bringing It Together: A Real-World Implementation {#real-world-implementation}

Let’s walk through a realistic example: an invoice reconciliation agent for a SaaS company.

The Task

The company receives invoices from vendors in various formats (PDF, email, API). They need to:

  1. Extract invoice data (vendor, amount, date, line items)
  2. Match invoices to purchase orders
  3. Check for discrepancies (amount mismatch, duplicate invoices)
  4. Flag for approval if amount >$10K
  5. Post to accounting system

Manually, this takes 30 minutes per invoice. With an agent, the company wants to automate 80% of invoices and escalate only edge cases.

Architecture

Tools:

  • extract_invoice_data: OCR + LLM to extract from PDF
  • query_purchase_orders: Search PO database by vendor, amount, date
  • check_duplicate: Query invoice history for duplicates
  • post_to_accounting: Create journal entry in accounting system
  • escalate_for_approval: Flag invoice for human review

State Machine:

received → extracting → matching → validating → posting → complete
   ↓                                                ↓
   └─────────────────────────────────────────→ escalated

Orchestration:

  1. Invoice received (webhook from email or API)
  2. Enqueued to job queue
  3. Worker starts agent with invoice ID
  4. Agent loops: extract → match → validate → post
  5. Results stored in database
  6. Accounting team notified of new entries

Implementation Details

System Prompt:

You are an invoice reconciliation agent. Your job is to process vendor invoices and post them to the accounting system.

For each invoice:
1. Extract vendor name, amount, date, and line items
2. Search for matching purchase orders
3. Check for duplicates
4. If amount > $10,000, escalate for approval
5. If all checks pass, post to accounting

If you cannot extract data (e.g., invoice is corrupted), escalate immediately.
If you find a mismatch (e.g., invoice amount != PO amount), escalate with details.

Be precise. Double-check amounts. If unsure, escalate.

Tool Definitions:

tools = [
    {
        "name": "extract_invoice_data",
        "description": "Extract vendor, amount, date, and line items from an invoice PDF.",
        "parameters": {
            "invoice_id": "str (required)",
            "file_path": "str (required, path to PDF)"
        },
        "returns": {
            "vendor_name": "str",
            "amount": "float",
            "date": "ISO 8601 date",
            "line_items": "list of {description, quantity, unit_price}"
        }
    },
    # ... more tools
]

Error Handling:

def run_agent(invoice_id):
    agent = InvoiceReconciliationAgent(invoice_id)
    max_iterations = 10
    iteration = 0
    
    while not agent.is_done() and iteration < max_iterations:
        try:
            action = agent.decide_action()
            result = execute_action(action)
            agent.observe(result)
            iteration += 1
        except ToolTimeoutError as e:
            log_error(f"Tool timeout: {e}")
            agent.escalate(f"Tool failed: {e}")
            break
        except Exception as e:
            log_error(f"Unexpected error: {e}")
            agent.escalate(f"Internal error: {e}")
            break
    
    # Escalate only if the cap was hit while the agent was still unfinished
    if iteration >= max_iterations and not agent.is_done():
        agent.escalate("Max iterations exceeded")
    
    return agent.get_result()

Monitoring:

metrics = {
    "invoices_processed": counter(),
    "invoices_posted": counter(),
    "invoices_escalated": counter(),
    "avg_processing_time": histogram(),
    "tokens_per_invoice": histogram(),
    "cost_per_invoice": histogram(),
}

# Log metrics after each run
log_metrics({
    "invoice_id": invoice_id,
    "status": agent.status,  # posted, escalated, failed
    "duration_seconds": duration,
    "tokens_used": tokens,
    "cost": tokens * price_per_token,
})

Results

After 3 months:

  • 80% of invoices processed fully automatically
  • 15% escalated for approval (amount >$10K or mismatch)
  • 5% failed (corrupted PDFs, unrecognized vendors)
  • Processing time: 2 minutes per invoice (down from 30 minutes)
  • Cost: $0.15 per invoice (at 300 tokens per run × $0.0005/token)
  • Accuracy: 99.2% (2 errors in 250 invoices, both caught in QA)

The 5% failure rate is acceptable; the team reviews failed invoices manually. The 15% escalation rate is also acceptable; approval takes 5 minutes, much faster than full manual processing.

The team’s next step: fine-tune the agent on the 2 error cases to improve accuracy further.


Conclusion and Next Steps

Agentic workflows are not a silver bullet. They require discipline in design, implementation, and operations. But when done right, they unlock significant value: faster processing, lower costs, and better consistency.

The patterns in this guide—orchestration, tool integration, state management, deployment, observability, cost control, and safety—are the foundation of production-grade agentic systems. Master these, and you can build agents that work reliably at scale.

If you’re building agentic systems, start with a clear use case (like invoice reconciliation). Define the task precisely, design tools that match the task, and iterate based on real data. Invest in observability from day one; it will save you hours of debugging.

For teams looking to move beyond prototypes, PADISO offers AI & Agents Automation services tailored to your infrastructure and business logic. We’ve shipped agentic systems across fintech, healthcare, and e-commerce, and we understand the operational challenges. Whether you need help with platform engineering to support your agents, security audit and compliance for regulated environments, or fractional CTO guidance on architecture, we’re here to help.

Reach out if you’re scaling agents in production. Let’s build something that works.


Further Reading and Resources

For implementation support, consider partnering with a team experienced in production AI systems. PADISO’s platform engineering and AI advisory services are built for teams shipping at scale.
