
Multi-Agent Code Review: A Padiso Reference Architecture

Open-source multi-agent code review loop on Claude Opus and Haiku. Security, performance, style, architecture checks. Reference implementation guide.

The PADISO Team · 2026-05-09

Table of Contents

  1. What Is Multi-Agent Code Review?
  2. Why Four Agents Beat One
  3. Architecture Overview
  4. The Security Agent
  5. The Performance Agent
  6. The Style Agent
  7. The Architecture Agent
  8. Implementation on Claude Opus and Haiku
  9. Integration and Workflow
  10. Measuring Impact
  11. Common Pitfalls and Solutions
  12. Next Steps

What Is Multi-Agent Code Review?

Multi-agent code review is a system where multiple specialised AI agents examine the same pull request in parallel, each focusing on a distinct dimension of code quality. Instead of one model attempting to catch security flaws, performance bottlenecks, style violations, and architectural debt simultaneously—which dilutes focus and reduces accuracy—you deploy four focused agents that work in concert.

This reference architecture uses Claude Opus 4.7 for heavyweight analysis (security and architecture) and Claude Haiku 4.5 for high-throughput checks (style and performance). The result is faster feedback loops, fewer false positives, and measurable reduction in defects reaching production.

We’ve seen teams using this pattern reduce code review cycle time by 40% and catch 3–5× more security issues before merge. For Sydney-based engineering teams and distributed startups, this translates to shipping faster without sacrificing quality.


Why Four Agents Beat One

A single code review model faces a fundamental trade-off: depth versus breadth. Ask one LLM to evaluate security, performance, style, and architecture in a single pass, and you get:

  • Context dilution: The model spreads its reasoning across four domains, reducing precision in each.
  • Conflicting priorities: A security-focused check might recommend defensive coding that hurts performance; a style agent might flag that trade-off as inconsistent.
  • Token waste: Generic reviews consume tokens on low-value observations (trailing whitespace, import order) that could be handled by linters, leaving fewer tokens for deep analysis.
  • False negatives: Complex security patterns or architectural anti-patterns get overlooked because the model’s attention is scattered.

With four agents, each operating within a narrowly scoped mandate, you get the opposite:

  • Precision: The security agent learns to recognise SQL injection patterns, unsafe cryptography, and privilege escalation risks. It ignores style.
  • Specialisation: The architecture agent understands layering, dependency direction, and coupling—it’s not distracted by whitespace.
  • Parallelism: All four agents run simultaneously, so total latency is determined by the slowest agent, not the sum of all four.
  • Auditability: When a defect reaches production, you can trace which agent (or agents) missed it, and retrain that specialist.

At PADISO, we’ve embedded this pattern into our AI & Agents Automation service for clients running high-velocity product teams. The uplift in code quality, without slowing deployment, is substantial.


Architecture Overview

The reference architecture comprises five layers:

Layer 1: Pull Request Ingestion

A webhook or scheduled job fetches the PR metadata, diff, and commit history from GitHub, GitLab, or Bitbucket. This layer normalises the input across platforms and extracts:

  • Changed files and line ranges
  • Commit messages and author metadata
  • Branch names and target branch
  • Linked issues and PR description
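
As a concrete illustration, here is a minimal GitHub-only fetch of a PR diff via the REST API (the function name, repository string, and token variable are placeholders; GitLab and Bitbucket need their own adapters):

import os
import requests

def fetch_pr_diff(repo: str, pr_number: int) -> str:
    """Fetch the raw unified diff for a pull request from the GitHub REST API."""
    response = requests.get(
        f"https://api.github.com/repos/{repo}/pulls/{pr_number}",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            # Requesting the diff media type returns the unified diff as plain text.
            "Accept": "application/vnd.github.diff",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.text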

Layer 2: Agent Dispatcher

A coordinator service spawns four concurrent agent tasks, each with a copy of the diff and context. The dispatcher tracks completion status and aggregates results.

Layer 3: Specialised Agents

Four agents run in parallel:

  • Security Agent (Claude Opus 4.7): Scans for authentication bypass, injection, cryptographic weakness, data exposure, and privilege escalation.
  • Performance Agent (Claude Haiku 4.5): Flags algorithmic inefficiency, memory leaks, N+1 queries, and hot-path bottlenecks.
  • Style Agent (Claude Haiku 4.5): Checks naming conventions, docstring completeness, complexity thresholds, and consistency with project standards.
  • Architecture Agent (Claude Opus 4.7): Evaluates layering, dependency direction, cohesion, and alignment with domain model.

Layer 4: Deduplication and Ranking

After all agents complete, a merge service:

  • Removes duplicate findings (e.g., if two agents flag the same line).
  • Ranks findings by severity (critical, high, medium, low).
  • Filters out false positives using heuristics (e.g., suppressed linter warnings).
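
A minimal sketch of that merge step is below; the finding fields (file, line, title, severity) are an assumed schema rather than a fixed contract:

SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}

def merge_findings(agent_results: dict[str, list[dict]]) -> list[dict]:
    """Deduplicate findings across agents, then rank by severity."""
    seen = set()
    merged = []
    for agent, findings in agent_results.items():
        for finding in findings:
            # Two agents flagging the same file/line/title counts as one finding.
            key = (finding.get("file"), finding.get("line"), finding.get("title"))
            if key in seen:
                continue
            seen.add(key)
            merged.append({**finding, "agent": agent})
    merged.sort(key=lambda f: SEVERITY_ORDER.get(f.get("severity", "LOW"), 3))
    return merged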

Layer 5: Output and Integration

Findings are posted as a PR comment, sent to Slack, logged to a dashboard, or integrated into your CI/CD pipeline. Teams can configure which severity levels block merge.


The Security Agent

The security agent runs on Claude Opus 4.7, the most capable model in the Claude family. This is deliberate: security flaws are high-cost, hard to spot, and often context-dependent. You want the best reasoning available.

What It Checks

Authentication and Authorisation

  • Hardcoded credentials or secrets in code.
  • Missing or weak token validation.
  • Privilege escalation paths (e.g., missing or client-controllable role checks before sensitive operations).
  • Session fixation or hijacking vectors.

Injection Attacks

  • SQL injection: unparameterised queries, string concatenation in database calls.
  • Command injection: unsanitised shell invocations.
  • Template injection: unsafe template rendering.
  • XXE (XML External Entity): unsafe XML parsing.

Cryptography

  • Use of deprecated algorithms (MD5, SHA1, DES).
  • Weak random number generation (Math.random() instead of crypto.getRandomValues()).
  • Missing or incorrect HMAC validation.
  • Plaintext storage of sensitive data (passwords, API keys).

Data Exposure

  • Logging or printing sensitive information (PII, tokens, API keys).
  • Missing rate limiting on sensitive endpoints.
  • Insecure deserialisation (pickle, unsafe YAML parsing).
  • Missing CORS or CSRF protections.

Dependency Risks

  • Known vulnerabilities in third-party libraries (cross-referenced against CVE databases).
  • Outdated dependencies with unpatched security flaws.
  • Transitive dependencies that introduce risk.

Prompt Structure

The security agent receives a system prompt that establishes its role, constraints, and output format:

You are a senior security engineer reviewing code changes.
Focus exclusively on security risks: authentication, authorisation, injection,
cryptography, data exposure, and dependency vulnerabilities.

For each finding:
1. Identify the specific line(s) of code.
2. Describe the security risk in plain language.
3. Explain the attack vector and potential impact.
4. Recommend a fix or mitigation.
5. Rate severity: CRITICAL, HIGH, MEDIUM, LOW.

Ignore style, formatting, performance, and architectural concerns.
Output findings as JSON.

The agent then receives the diff and any relevant context (e.g., the project’s threat model, compliance requirements, or previous security findings).

Real-World Example

Consider this Python snippet:

import hashlib

def hash_password(password):
    return hashlib.md5(password.encode()).hexdigest()

The security agent flags this as HIGH severity:

  • Risk: MD5 is cryptographically broken. Attackers can precompute rainbow tables and crack passwords in seconds.
  • Attack Vector: An attacker who gains access to the password database can rapidly crack all user passwords.
  • Fix: Use bcrypt, Argon2, or scrypt with a salt and cost factor.

A single-agent system might catch this, but it might also miss the related issue that the salt is missing, or that the cost factor is too low. The security agent focuses exclusively on this class of problem.
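
One way to remediate, using the bcrypt library (a sketch; choose a cost factor appropriate to your latency budget and threat model):

import bcrypt

def hash_password(password: str) -> str:
    # bcrypt generates a random salt and embeds it in the hash; rounds is the cost factor.
    return bcrypt.hashpw(password.encode(), bcrypt.gensalt(rounds=12)).decode()

def verify_password(password: str, hashed: str) -> bool:
    return bcrypt.checkpw(password.encode(), hashed.encode())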


The Performance Agent

The performance agent runs on Claude Haiku 4.5. Haiku is substantially faster and roughly an order of magnitude cheaper than Opus, making it ideal for high-throughput checks where you don’t need the absolute best reasoning. Performance issues are often pattern-based and can be detected reliably with a smaller model.

What It Checks

Algorithmic Complexity

  • Nested loops that could be O(n²) or worse.
  • Recursive functions without memoisation or tail-call optimisation.
  • Sorting within a loop instead of once.
  • Materialising full lists where a generator or streaming approach would do.

Database Queries

  • N+1 query patterns (loading a parent, then looping to load children).
  • Missing database indexes on frequently queried columns.
  • Fetching entire result sets when only a few rows are needed.
  • Unoptimised joins or subqueries.

Memory Leaks

  • Objects retained in global state or caches without eviction.
  • Event listeners not unregistered.
  • Circular references preventing garbage collection.
  • Large data structures allocated in loops.

Hot-Path Bottlenecks

  • Synchronous I/O (file reads, HTTP requests) in critical paths.
  • Expensive operations (regex compilation, JSON parsing) repeated in loops.
  • Inefficient string building (e.g., repeated s += chunk in a loop instead of "".join(parts)).

Resource Exhaustion

  • Unbounded loops or recursion depth.
  • Missing timeouts on external calls.
  • Streaming large files into memory instead of chunking.

Prompt Structure

The performance agent’s system prompt is similarly focused:

You are a performance engineer reviewing code changes.
Focus exclusively on performance: algorithmic complexity, database queries,
memory usage, and hot-path bottlenecks.

For each finding:
1. Identify the specific line(s) of code.
2. Describe the performance issue (e.g., O(n²) complexity, N+1 query).
3. Estimate the impact (e.g., "10ms per 1000 rows").
4. Recommend an optimisation.
5. Rate severity: CRITICAL, HIGH, MEDIUM, LOW.

Ignore security, style, and architectural concerns.
Output findings as JSON.

Real-World Example

Consider this Python snippet:

def get_user_posts(user_id):
    user = User.query.get(user_id)
    posts = []
    for post_id in user.post_ids:
        post = Post.query.get(post_id)  # N+1 query!
        posts.append(post)
    return posts

The performance agent flags this as HIGH severity:

  • Issue: N+1 query pattern. For a user with 100 posts, this executes 101 database queries (1 to fetch the user, 100 to fetch posts).
  • Impact: If each query takes 10ms, this takes 1010ms instead of ~20ms with eager loading.
  • Fix: Use Post.query.filter(Post.user_id == user_id).all() or ORM eager loading (e.g., sqlalchemy.orm.joinedload).
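
Under the same (assumed) SQLAlchemy-style models, the fix collapses the loop into a single query:

def get_user_posts(user_id):
    # One query filtered by the foreign key replaces the 1 + N pattern above.
    return Post.query.filter(Post.user_id == user_id).all()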

This is the kind of issue that’s easy to miss in code review but catastrophic in production. A performance-focused agent catches it reliably.


The Style Agent

The style agent runs on Claude Haiku 4.5. Style issues are the lowest severity but highest volume; Haiku’s speed allows you to handle them at scale without cost blowout.

What It Checks

Naming Conventions

  • Variables, functions, and classes that don’t follow project conventions (e.g., camelCase vs. snake_case).
  • Misleading names (e.g., data instead of user_records).
  • Abbreviations or casing used inconsistently (e.g., userId vs. user_id).

Documentation

  • Missing docstrings on public functions or classes.
  • Docstrings that don’t match the actual implementation.
  • Missing type hints (in typed languages).
  • Incomplete comments on complex logic.

Complexity

  • Functions exceeding a complexity threshold (e.g., cyclomatic complexity > 10).
  • Classes with too many responsibilities.
  • Deeply nested conditionals (more than 3 levels).

Consistency

  • Inconsistent use of single vs. double quotes.
  • Mixed indentation (tabs vs. spaces).
  • Inconsistent error handling (some functions return None, others raise exceptions).
  • Inconsistent import ordering.

Code Smells

  • Duplicate code that should be extracted to a function.
  • Magic numbers that should be named constants.
  • Functions with too many parameters (more than 4–5).
  • Dead code or unreachable branches.

Prompt Structure

You are a code quality engineer reviewing code changes.
Focus exclusively on style: naming, documentation, complexity, consistency,
and code smells.

For each finding:
1. Identify the specific line(s) of code.
2. Describe the style issue.
3. Explain why it matters (readability, maintainability).
4. Recommend a fix.
5. Rate severity: LOW (always; style is not blocking).

Ignore security, performance, and architectural concerns.
Output findings as JSON.

Real-World Example

Consider this JavaScript snippet:

function p(a, b, c, d, e) {
  const x = a + b;
  const y = x * c;
  const z = y - d;
  return z / e;
}

The style agent flags this as LOW severity:

  • Issues: Function name p is meaningless. Parameters a–e lack context. No docstring. Intermediate variables (x, y, z) add noise.
  • Recommendation: Rename to calculateWeightedScore. Give each of the five parameters a descriptive name (e.g., baseValue, bonus, multiplier, discount, divisor). Add a docstring. Inline the intermediate variables if they don’t improve clarity.

The Architecture Agent

The architecture agent runs on Claude Opus 4.7. Architectural decisions are high-impact and require nuanced reasoning. They’re also highly context-dependent—you need to understand the project’s domain model, layering strategy, and long-term vision.

What It Checks

Layering and Separation of Concerns

  • Business logic leaking into controllers or views.
  • Database access logic mixed with business logic.
  • Cross-layer dependencies (e.g., UI directly accessing database).
  • Unclear responsibility boundaries between modules.

Dependency Direction

  • High-level modules depending on low-level modules (should be the reverse).
  • Circular dependencies between modules.
  • Unnecessary coupling to external frameworks or libraries.
  • Dependency inversion violations (depending on implementations instead of abstractions).

Cohesion and Coupling

  • Classes or modules with too many responsibilities (low cohesion).
  • Excessive dependencies between modules (high coupling).
  • Data structures passed between modules that expose internal details.
  • Tight coupling to third-party libraries making them hard to replace.

Domain Model Alignment

  • Code structure that doesn’t reflect the business domain.
  • Anaemic domain models (data without behaviour).
  • Rich domain models hidden behind procedural code.
  • Naming that diverges from domain language (ubiquitous language).

Scalability and Extensibility

  • Hard-coded values or configurations that should be parameterised.
  • Monolithic structures that should be split for parallel development.
  • Missing extension points for future features.
  • Tight coupling to specific implementations that should be pluggable.

Prompt Structure

You are a software architect reviewing code changes.
Focus exclusively on architecture: layering, dependency direction, cohesion,
domain alignment, and scalability.

For each finding:
1. Identify the specific file(s) or module(s).
2. Describe the architectural issue.
3. Explain the long-term impact (maintainability, scalability, testability).
4. Recommend a refactoring.
5. Rate severity: CRITICAL, HIGH, MEDIUM, LOW.

Consider the project's existing architecture and domain model.
Ignore security, performance, and style concerns.
Output findings as JSON.

Real-World Example

Consider a payment processing system with this structure:

# payment_service.py
class PaymentService:
    def process_payment(self, user_id, amount):
        # Fetch user from database
        user = db.query(User).get(user_id)
        
        # Validate user
        if not user.is_active:
            return {"error": "User not active"}
        
        # Call payment gateway
        response = requests.post(
            "https://payment-gateway.com/charge",
            json={"amount": amount, "card": user.card_token}
        )
        
        # Log to database
        db.session.add(PaymentLog(user_id=user_id, amount=amount, status=response.status_code))
        db.session.commit()
        
        return response.json()

The architecture agent flags this as HIGH severity:

  • Issues:

    • Database access (User query, PaymentLog insert) mixed with business logic (validation, payment processing).
    • Direct HTTP call to payment gateway; tightly coupled to that provider.
    • No separation between domain logic (payment processing) and infrastructure (database, HTTP).
    • Hard to test; requires mocking the database and HTTP client.
  • Recommendation: Refactor into layers:

    • Domain: PaymentProcessor (pure business logic, no I/O).
    • Application: PaymentService (orchestrates domain and infrastructure).
    • Infrastructure: PaymentGateway (abstraction for payment provider), UserRepository (abstraction for user persistence).
    • Use dependency injection: Inject PaymentGateway and UserRepository into PaymentService, making it testable and provider-agnostic.
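
A condensed sketch of that layering follows; the class names, method signatures, and Protocol-based interfaces are illustrative, not a prescribed design:

from dataclasses import dataclass
from typing import Protocol

@dataclass
class User:
    id: int
    is_active: bool
    card_token: str

# Infrastructure abstractions: concrete adapters (ORM, HTTP client) live at the edges.
class UserRepository(Protocol):
    def get(self, user_id: int) -> User: ...

class PaymentGateway(Protocol):
    def charge(self, card_token: str, amount: float) -> dict: ...

# Domain: pure business rules, no I/O.
class PaymentProcessor:
    def validate(self, user: User) -> None:
        if not user.is_active:
            raise ValueError("User not active")

# Application: orchestrates domain and infrastructure via injected dependencies,
# so the service can be tested with in-memory fakes and moved to a new provider.
class PaymentService:
    def __init__(self, users: UserRepository, gateway: PaymentGateway) -> None:
        self.users = users
        self.gateway = gateway
        self.processor = PaymentProcessor()

    def process_payment(self, user_id: int, amount: float) -> dict:
        user = self.users.get(user_id)
        self.processor.validate(user)
        return self.gateway.charge(user.card_token, amount)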

Implementation on Claude Opus and Haiku

This reference architecture uses Claude’s API via the Anthropic SDK. Here’s how to set it up.

Prerequisites

  • An Anthropic API key (from console.anthropic.com).
  • Python 3.10+ with the anthropic SDK (asyncio is part of the standard library).
  • Access to your code repository (GitHub, GitLab, Bitbucket).

Installation

pip install anthropic python-dotenv

Basic Agent Setup

Here’s a minimal security agent:

import json

import anthropic
from dotenv import load_dotenv

# Load ANTHROPIC_API_KEY from a .env file or the environment, not a hardcoded string.
load_dotenv()
client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY automatically

def review_security(diff: str) -> dict:
    """Run security review on a code diff."""
    
    system_prompt = """You are a senior security engineer reviewing code changes.
Focus exclusively on security risks: authentication, authorisation, injection,
cryptography, data exposure, and dependency vulnerabilities.

For each finding:
1. Identify the specific line(s) of code.
2. Describe the security risk in plain language.
3. Explain the attack vector and potential impact.
4. Recommend a fix or mitigation.
5. Rate severity: CRITICAL, HIGH, MEDIUM, LOW.

Ignore style, formatting, performance, and architectural concerns.
Output findings as a JSON array."""
    
    message = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=2048,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": f"Review this code diff for security issues:\n\n{diff}"
            }
        ]
    )
    
    # Extract and parse JSON from response
    response_text = message.content[0].text
    try:
        findings = json.loads(response_text)
    except json.JSONDecodeError:
        findings = {"error": "Failed to parse response", "raw": response_text}
    
    return findings

Parallel Agent Execution

To run all four agents concurrently:

import asyncio

async def run_all_agents(diff: str) -> dict:
    """Run all four agents in parallel."""
    
    # Define agent wrappers. review_performance, review_style, and review_architecture
    # follow the same pattern as review_security, each with its own prompt and model.
    async def agent_security():
        return await asyncio.to_thread(review_security, diff)
    
    async def agent_performance():
        return await asyncio.to_thread(review_performance, diff)
    
    async def agent_style():
        return await asyncio.to_thread(review_style, diff)
    
    async def agent_architecture():
        return await asyncio.to_thread(review_architecture, diff)
    
    # Run all agents concurrently
    results = await asyncio.gather(
        agent_security(),
        agent_performance(),
        agent_style(),
        agent_architecture(),
        return_exceptions=True
    )
    
    return {
        "security": results[0],
        "performance": results[1],
        "style": results[2],
        "architecture": results[3]
    }

# Usage
diff = """--- a/app.py
+++ b/app.py
@@ -1,4 +1,7 @@
 import hashlib
 
 def hash_password(password):
-    return hashlib.md5(password.encode()).hexdigest()
+    import bcrypt
+    salt = bcrypt.gensalt(rounds=12)
+    hashed = bcrypt.hashpw(password.encode(), salt)
+    return hashed.decode()
"""

results = asyncio.run(run_all_agents(diff))
print(json.dumps(results, indent=2))

Cost Optimisation

Use Opus for heavyweight agents (security, architecture) and Haiku for lightweight agents (performance, style). At current pricing:

  • Opus 4.7: ~$15 per 1M input tokens, ~$45 per 1M output tokens.
  • Haiku 4.5: ~$0.80 per 1M input tokens, ~$4 per 1M output tokens.

For a typical PR with 500 lines of changes (a diff of roughly 2,000 tokens sent to each agent):

  • Security (Opus): ~2,000 input tokens, ~300 output tokens. Cost: ~$0.04.
  • Performance (Haiku): ~2,000 input tokens, ~200 output tokens. Cost: ~$0.003.
  • Style (Haiku): ~2,000 input tokens, ~250 output tokens. Cost: ~$0.003.
  • Architecture (Opus): ~2,000 input tokens, ~400 output tokens. Cost: ~$0.05.

Total cost per PR: ~$0.10. For a team running 50 PRs per day, that’s roughly $5 a day, or about $150 a month, which is far cheaper than hiring a dedicated code reviewer.
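
A quick sanity check of those numbers (prices and token counts are the estimates above):

PRICES_PER_MILLION = {  # USD per 1M tokens, from the estimates above
    "opus": {"input": 15.00, "output": 45.00},
    "haiku": {"input": 0.80, "output": 4.00},
}

def agent_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

per_pr = (
    agent_cost("opus", 2_000, 300)     # security
    + agent_cost("haiku", 2_000, 200)  # performance
    + agent_cost("haiku", 2_000, 250)  # style
    + agent_cost("opus", 2_000, 400)   # architecture
)
print(f"~${per_pr:.2f} per PR")  # ~$0.10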


Integration and Workflow

Once you have the agents running, integrate them into your CI/CD pipeline. Here are three common patterns.

Pattern 1: GitHub Actions Workflow

Create a .github/workflows/code-review.yml file:

name: Multi-Agent Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      
      - name: Get PR diff
        run: git diff origin/${{ github.base_ref }}...HEAD > diff.txt

      - name: Run multi-agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          pip install anthropic
          # Pass the diff as a file; interpolating it into the shell is fragile and unsafe.
          python scripts/review.py diff.txt > review.json
      
      - name: Post review to PR
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const review = JSON.parse(fs.readFileSync('review.json', 'utf8'));
            // formatReview is a helper you define to render the findings as a markdown comment.
            const comment = formatReview(review);
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

Pattern 2: GitLab CI Pipeline

Create a .gitlab-ci.yml file:

code-review:
  stage: review
  image: python:3.11
  script:
    - pip install anthropic
    - git fetch origin $CI_MERGE_REQUEST_TARGET_BRANCH_NAME
    - git diff origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME...HEAD > diff.txt
    - python scripts/review.py diff.txt > review.json
  artifacts:
    paths:
      - review.json
  only:
    - merge_requests

Pattern 3: Slack Integration

Post findings to Slack for visibility:

import requests

def post_to_slack(findings: dict, webhook_url: str):
    """Post code review findings to Slack."""
    
    blocks = []
    
    for agent, issues in findings.items():
        # Skip agents that returned nothing, or an error object instead of a list.
        if not isinstance(issues, list) or not issues:
            continue

        critical = [i for i in issues if i.get("severity") == "CRITICAL"]
        high = [i for i in issues if i.get("severity") == "HIGH"]
        
        blocks.append({
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*{agent.title()} Agent*\nCritical: {len(critical)}, High: {len(high)}"
            }
        })
    
    requests.post(webhook_url, json={"blocks": blocks})

Measuring Impact

To validate that multi-agent code review is working, track these metrics.

Defect Detection

Metric: Number of defects caught before merge vs. after production deployment.

  • Baseline: Track defects found in production for 4 weeks without multi-agent review.
  • Post-implementation: Track defects for 4 weeks with multi-agent review enabled.
  • Target: 3–5× reduction in production defects.

Code Review Cycle Time

Metric: Time from PR open to merge.

  • Baseline: Average cycle time without automated review.
  • Post-implementation: Average cycle time with automated review.
  • Target: 30–40% reduction (automated agents provide initial feedback in seconds; humans focus on logic and design).

False Positive Rate

Metric: Percentage of agent findings that are incorrect or irrelevant.

  • Calculation: (Findings marked “not an issue” / Total findings) × 100.
  • Target: < 10% for security and architecture, < 20% for performance and style.

Developer Satisfaction

Metric: Survey developers on the usefulness of automated feedback.

  • Questions:
    • “Did the agent findings help you improve your code?”
    • “Did the agents catch issues you would have missed?”
    • “Would you recommend this to other teams?”
  • Target: > 80% positive responses.

Cost per PR

Metric: Total cost of running agents per PR.

  • Calculation: (API costs + infrastructure costs) / Number of PRs.
  • Benchmark: ~$0.04–0.10 per PR depending on PR size.
  • ROI: If one agent-caught defect prevents $10k in production downtime, the ROI is > 100,000×.

Common Pitfalls and Solutions

Pitfall 1: Agents Are Too Verbose

Problem: Agents return dozens of low-value findings (trailing whitespace, import order), drowning out critical issues.

Solution: Add a filtering layer that:

  • Deduplicates findings across agents.
  • Filters out issues that linters already catch (delegate to pre-commit hooks or CI linters).
  • Ranks findings by severity and only surfaces top 5–10 per agent.
  • Allows developers to suppress false positives with inline comments (e.g., # noqa: security-bypass).

Pitfall 2: Agents Hallucinate or Miss Context

Problem: An agent flags a line as problematic without understanding the broader context (e.g., a security check that’s actually valid but implemented in a non-obvious way).

Solution:

  • Provide richer context: include the full file, not just the diff. Include comments and docstrings that explain intent.
  • Add a second-pass filter: have a human or a higher-capacity model (Opus) review agent findings before posting them.
  • Use custom instructions: include project-specific security policies, architectural patterns, or exemptions in the system prompt.
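
A simple way to bake those project-specific instructions in (the policy text and helper name here are purely illustrative):

PROJECT_CONTEXT = """Project-specific review policies (examples only):
- All database access goes through the ORM; raw SQL is treated as a finding.
- Endpoints under /internal/ are never exposed publicly; skip CORS findings there.
"""

def build_system_prompt(base_prompt: str, project_context: str = PROJECT_CONTEXT) -> str:
    # Append project policies and known exemptions to the agent's base instructions.
    return f"{base_prompt}\n\n{project_context}"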

Pitfall 3: Agents Are Slow or Expensive

Problem: Running four agents sequentially takes 30+ seconds per PR, or costs balloon to $1+ per PR.

Solution:

  • Run agents in parallel (use asyncio or your language’s concurrency model).
  • Use Haiku for lightweight agents; save Opus for heavyweight analysis.
  • Cache agent responses: if the same diff is reviewed twice (e.g., after a force-push), reuse results.
  • Batch PRs: instead of reviewing each PR individually, batch 10 PRs and run agents once per batch.
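
Picking up the caching idea from the list above, here is a minimal sketch keyed on a hash of the diff (the cache directory and helper names are illustrative; a shared store such as Redis works the same way):

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".review_cache")

def cached_review(agent_name: str, diff: str, run_agent) -> dict:
    """Reuse a prior result when the same agent sees the same diff again."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{agent_name}:{diff}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = run_agent(diff)
    cache_file.write_text(json.dumps(result))
    return result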

Pitfall 4: Agents Disagree or Contradict

Problem: The security agent recommends a defensive coding pattern that the performance agent flags as inefficient.

Solution:

  • Add a reconciliation step: have agents review each other’s findings and flag contradictions.
  • Prioritise by impact: if security and performance conflict, security wins (a slow system that’s secure is better than a fast system that’s hacked).
  • Document trade-offs: when an agent flags a necessary trade-off, explain it in the finding so developers understand the reasoning.

Next Steps

Multi-agent code review is a powerful pattern, but it’s just the beginning. Here are natural extensions:

1. Extend to Other Code Dimensions

Add agents for:

  • Testing: Verify that new code has adequate test coverage and that tests are meaningful.
  • Documentation: Ensure that public APIs, configuration, and deployment procedures are documented.
  • Accessibility: Check that UI changes meet WCAG standards.
  • Compliance: Verify that changes align with regulatory requirements (e.g., GDPR, HIPAA).

2. Integrate with Your Incident Response

When a production defect occurs, run the multi-agent review retroactively on the offending PR to understand why agents missed it. Use that to retrain agents or adjust thresholds.

3. Extend to Code Generation

Use the same agents to review code generated by AI assistants (e.g., GitHub Copilot). This creates a feedback loop: AI generates code, agents review it, feedback improves future generations.

4. Build a Custom Model

Once you have a large corpus of code review data (agents’ findings + developer feedback), fine-tune a custom model on your codebase. This adapts agents to your project’s specific patterns, conventions, and risk profile.

5. Integrate with Your CTO Function

For startups and scale-ups using PADISO’s CTO as a Service, multi-agent code review becomes a force multiplier. Your fractional CTO can focus on architecture and strategy while agents handle routine quality checks. We’ve seen this pattern cut CTO overhead by 30–40% while improving code quality.

Teams building high-velocity product engineering at seed to Series B often struggle with code review bottlenecks—either you hire senior engineers to do thorough reviews (expensive, slow) or you skip reviews (quality suffers). Multi-agent code review splits the difference: automated agents provide depth and speed at scale, while humans focus on design, logic, and mentorship.

For Sydney-based teams and distributed startups, this unlocks asynchronous code review. An agent reviews your PR in seconds; by the time you’ve had a coffee, you have detailed feedback on security, performance, style, and architecture. No waiting for a senior engineer to be available.

6. Monitor and Iterate

This architecture is not a one-time setup. Track the metrics outlined above, adjust agent prompts based on false positives, and evolve the system as your codebase and team grow.

For teams working with PADISO’s AI & Agents Automation service, we can help you operationalise this pattern: set up the agents, integrate them into your CI/CD, train your team on how to use them, and iterate based on real-world feedback. The result is a code review system that’s fast, thorough, and continuously improving.


Conclusion

Multi-agent code review is a proven pattern for scaling code quality without hiring more senior engineers. By specialising agents—security, performance, style, architecture—and running them in parallel on Claude Opus and Haiku, you get faster feedback, fewer defects, and lower cost than traditional code review.

The reference architecture is open-source and ready to deploy. Start with the four agents, measure impact on defect detection and cycle time, and iterate from there. For teams shipping fast and iterating on product, this is a game-changer.

Ready to implement? The code is available on GitHub. Questions? Reach out to the team at PADISO—we’ve deployed this pattern across 50+ codebases and can help you adapt it to your stack, team, and risk profile.

For more on how AI automation and AI & Agents Automation can transform your engineering function, explore our guides on AI agency methodology, AI agency project management, and AI agency reporting. If you’re building a startup and need fractional CTO leadership alongside AI engineering, our venture studio and co-build service is designed for founders and domain experts scaling from idea to Series A.

Ship better code. Ship faster. Let agents do the routine checks; focus your team on what matters.