
Versioning Agent Skills Across a Portfolio of Mid-Market Companies

Master agent skill versioning, mono-repo patterns, and governance across portfolio companies. Learn semver, changelogs, eval suites, and rollback drills.

The PADISO Team · 2026-05-05

Table of Contents

  1. Why Agent Skill Versioning Matters at Scale
  2. Understanding Agent Skills and Governance
  3. Mono-Repo vs Per-Portco Patterns
  4. Semantic Versioning for Agent Skills
  5. Changelogs and Change Management
  6. Building Robust Evaluation Suites
  7. Rollback Drills and Incident Response
  8. Implementing Governance Across Your Portfolio
  9. Real-World Patterns from PADISO Quarterly Drills
  10. Next Steps and Quick Wins

Why Agent Skill Versioning Matters at Scale {#why-versioning-matters}

When you’re running a portfolio of mid-market companies—whether through private equity, a venture studio, or a multi-brand holding—autonomous agents become operational infrastructure. They’re not experiments anymore. They’re shipping code to production, orchestrating workflows, and directly impacting revenue and cost.

But here’s the problem: agent skills are fragile. A prompt tweak in one company’s workflow automation can cascade into hallucinations across three others sharing the same skill. A deployment that works in staging fails silently in production because the evaluation suite didn’t catch edge cases. A rollback takes six hours instead of six minutes because nobody documented the skill’s dependencies.

Without proper versioning and governance, you’re running a portfolio where each company is an isolated island, duplicating effort, diverging on quality, and burning engineering cycles on preventable failures.

Agent skills are fast becoming an industry standard: structured, reusable instructions managed through version control and code review. The best-run portfolios treat them like production code: versioned, tested, documented, and rollback-ready.

This guide walks through the patterns PADISO runs across portfolio companies every quarter. We’ll cover mono-repo architecture, semantic versioning for agents, evaluation suites that actually catch bugs, and the rollback drills that keep your portfolio resilient.


Understanding Agent Skills and Governance {#understanding-skills}

What Are Agent Skills?

Agent skills are discrete, reusable instructions—essentially functions or modules—that autonomous agents can invoke to accomplish specific tasks. A skill might be “fetch customer invoice,” “update Salesforce record,” “validate email format,” or “generate compliance report.”

Unlike traditional APIs or microservices, agent skills exist in a grey zone: they’re part prompt engineering, part code, part data schema. They’re versioned, but they’re also sensitive to context, model behaviour, and upstream data quality.

The VoltAgent awesome-agent-skills repository curates official agent skills from leading teams including Anthropic, Google Labs, Vercel, and Stripe. These resources show how enterprise teams structure skills with clear documentation, dependency graphs, and versioning conventions.
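
To make this concrete, here’s a minimal sketch of what a skill’s manifest might carry. The field names (prompt, input_schema, output_schema) are illustrative assumptions for this guide, not a published standard:

# sketch: a minimal skill manifest as a typed structure (field names are
# assumptions for illustration, not an industry standard)
from dataclasses import dataclass, field

@dataclass
class SkillManifest:
    name: str          # e.g. "fetch-invoice"
    version: str       # semver string, e.g. "1.0.0"
    prompt: str        # the instruction template the agent executes
    input_schema: dict = field(default_factory=dict)   # JSON Schema for inputs
    output_schema: dict = field(default_factory=dict)  # JSON Schema for outputs
    dependencies: list = field(default_factory=list)   # skills/utilities it relies on

manifest = SkillManifest(
    name="fetch-invoice",
    version="1.0.0",
    prompt="Fetch the invoice identified by {invoice_id} and return its fields.",
    input_schema={"type": "object", "required": ["invoice_id"]},
    output_schema={"type": "object", "required": ["invoice_number", "amount"]},
)

The point is that prompt, schema, and version travel together, so a change to any one of them is a visible, reviewable event.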

Why Governance Breaks Down

In a portfolio setting, governance fails for three reasons:

1. Skill Duplication Across Companies. Company A builds a “fetch invoice” skill. Company B builds the same thing, slightly different. Company C doesn’t know either exists and builds a third version. Six months later, you have three unmaintained variants and no single source of truth.

2. Prompt Drift. A skill’s prompt changes because an engineer tweaked it to fix a bug in one context. That tweak breaks the skill in another company’s workflow. No changelog. No evaluation suite. No rollback plan.

3. Dependency Hell. Skills depend on other skills, APIs, and data schemas. When Company A upgrades a foundational skill, Company B’s workflow breaks silently because nobody documented the dependency or ran a cross-portco evaluation suite.

The Governance Layers

Proper agent skill governance has four layers:

  1. Versioning: Semantic versioning (semver) with clear breaking vs. non-breaking changes
  2. Change Management: Changelogs, pull requests, and code review processes
  3. Evaluation: Automated test suites that verify skills work across different contexts and datasets
  4. Deployment & Rollback: Staged rollouts, feature flags, and fast rollback procedures

We’ll cover each in detail below.


Mono-Repo vs Per-Portco Patterns {#mono-repo-patterns}

The Mono-Repo Approach

A mono-repo is a single Git repository containing all agent skills across all portfolio companies. Structure looks like this:

agent-skills/
├── skills/
│   ├── fetch-invoice/
│   │   ├── v1.0.0/
│   │   │   ├── skill.yaml
│   │   │   ├── prompt.md
│   │   │   ├── schema.json
│   │   │   └── tests/
│   │   ├── v1.1.0/
│   │   └── v2.0.0/
│   ├── validate-email/
│   ├── update-crm/
│   └── generate-report/
├── shared/
│   ├── utils/
│   ├── evaluators/
│   └── rollback-scripts/
├── companies/
│   ├── company-a/
│   │   ├── skill-dependencies.yaml
│   │   ├── skill-versions.lock
│   │   └── eval-config.yaml
│   ├── company-b/
│   └── company-c/
├── CHANGELOG.md
└── README.md

Advantages:

  • Single source of truth for all skills
  • Shared evaluation infrastructure
  • Easy to track cross-portco impact of changes
  • Consistent versioning and naming conventions
  • Simplified dependency management

Disadvantages:

  • Requires strong governance discipline
  • Risk of one company’s change affecting others
  • Larger repository, slower clones
  • Requires cross-company coordination on deployments

The Per-Portco Approach

Each company has its own repository. A central registry tracks which skills exist where.

company-a-skills/
├── skills/
├── CHANGELOG.md
└── skill-registry.yaml

company-b-skills/
├── skills/
├── CHANGELOG.md
└── skill-registry.yaml

skill-registry/ (central)
├── registry.yaml
├── compatibility-matrix.yaml
└── shared-evaluators/

Advantages:

  • Company autonomy and faster iteration
  • Smaller repositories
  • Easier to onboard new companies
  • Lower blast radius for company-specific changes

Disadvantages:

  • Skill duplication and drift
  • Harder to track cross-portco dependencies
  • Inconsistent versioning and naming
  • Evaluation and testing fragmentation

The Hybrid Pattern (Recommended)

The best pattern for mid-market portfolios is a hybrid: a mono-repo for core, shared skills (those used by 2+ companies) and per-portco repos for company-specific skills. A central skill registry tracks everything.

shared-agent-skills/ (mono-repo)
├── skills/
│   ├── fetch-invoice/ (used by A, B, C)
│   ├── validate-email/ (used by A, B)
│   └── update-crm/ (used by all)
├── evaluators/
├── shared-utils/
└── CHANGELOG.md

company-a-skills/ (per-portco)
├── skills/
│   ├── custom-pricing-logic/ (A only)
│   └── industry-specific-validation/ (A only)
├── CHANGELOG.md
└── dependencies.lock (pins shared-agent-skills versions)

skill-registry/ (central)
├── registry.yaml
├── dependency-graph.json
└── audit-log.yaml

This approach gives you:

  • Shared skill governance for high-value, cross-portco skills
  • Company autonomy for specialised, company-specific skills
  • Clear dependency tracking via the central registry
  • Manageable blast radius when changes happen

When you’re scaling across portfolio modernisation and platform re-platforming projects, this hybrid pattern keeps governance tight without strangling iteration.
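
To illustrate the payoff, here’s a small sketch of how a central registry answers the first question every change review asks: which companies does this skill touch? The registry.yaml layout shown inline is an assumption for illustration:

# sketch: query a central registry for cross-portco blast radius
# (the registry layout here is an assumption, not a prescribed format)
import yaml  # pip install pyyaml

REGISTRY = """
skills:
  fetch-invoice:
    version: "1.1.0"
    used_by: [company-a, company-b, company-c]
  custom-pricing-logic:
    version: "1.2.0"
    used_by: [company-a]
"""

def blast_radius(registry: dict, skill: str) -> list:
    """Return the companies affected by a change to `skill`."""
    return registry["skills"].get(skill, {}).get("used_by", [])

registry = yaml.safe_load(REGISTRY)
print(blast_radius(registry, "fetch-invoice"))  # ['company-a', 'company-b', 'company-c']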


Semantic Versioning for Agent Skills {#semantic-versioning}

The Semver Framework

Semantic versioning (semver) for agent skills follows the pattern: MAJOR.MINOR.PATCH

  • MAJOR: Breaking changes (skill output schema changes, required inputs change, behaviour fundamentally changes)
  • MINOR: Backwards-compatible additions (new optional parameters, improved accuracy, new optional outputs)
  • PATCH: Bug fixes and non-functional improvements (prompt refinement, performance improvement, documentation fix)

Examples in Practice

Scenario 1: Prompt Refinement (PATCH) Your “fetch-invoice” skill sometimes hallucinates invoice numbers. You refine the prompt to be more explicit about the format. Output schema unchanged. Backwards compatible.

version: 1.0.1  # was 1.0.0
changes:
  - Refined prompt to reduce hallucination in invoice-number field
  - No schema changes
  - No breaking changes
breaking: false

Scenario 2: Add Optional Output (MINOR) You add an optional “invoice_status” field to the output. Existing consumers ignore it. New consumers can use it.

version: 1.1.0  # was 1.0.1
changes:
  - Added optional invoice_status field to output
  - Backwards compatible; existing consumers unaffected
breaking: false

Scenario 3: Change Required Input (MAJOR) You refactor the skill to require a new mandatory “company_id” parameter. Existing calls without this parameter will fail.

version: 2.0.0  # was 1.1.0
changes:
  - Made company_id a required parameter
  - Changed output schema for invoice_date (ISO 8601 format)
  - Consumers must update their calls
breaking: true

Communicating Semver Changes

Every version change must include:

  1. Version bump in skill.yaml
  2. Changelog entry (see section below)
  3. Migration guide (for MAJOR versions)
  4. Evaluation results (see section below)
  5. Git tag (v1.0.0, v1.1.0, etc.)

Without this discipline, teams won’t know whether they can upgrade safely or need to plan migration work. Ambiguity kills portfolios.

Handling Pre-Release Versions

For experimental skills or major rewrites, use pre-release tags:

v2.0.0-alpha.1  # Early development
v2.0.0-beta.1   # Feature-complete, testing phase
v2.0.0-rc.1     # Release candidate, ready for final eval
v2.0.0          # Stable release

This lets you iterate on major versions without forcing all consumers to track unstable code.
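
Pre-release precedence is easy to get wrong in tooling, so here’s a sketch of the ordering rule in code. It only handles the alpha/beta/rc convention shown above (full semver allows arbitrary identifiers), and a real setup might lean on an existing version-parsing library instead:

# sketch: ordering skill versions, including pre-release tags
# (hand-rolled for illustration; only covers the alpha/beta/rc convention)
import re

def parse(version: str):
    m = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-([a-z]+)\.(\d+))?", version)
    if not m:
        raise ValueError(f"not a semver string: {version}")
    major, minor, patch, tag, tag_n = m.groups()
    # a stable release (no tag) sorts after any pre-release of the same version
    stage = {"alpha": 0, "beta": 1, "rc": 2, None: 3}[tag]
    return (int(major), int(minor), int(patch), stage, int(tag_n or 0))

versions = ["v2.0.0", "v2.0.0-alpha.1", "v2.0.0-rc.1", "v1.1.0", "v2.0.0-beta.1"]
print(sorted(versions, key=parse))
# ['v1.1.0', 'v2.0.0-alpha.1', 'v2.0.0-beta.1', 'v2.0.0-rc.1', 'v2.0.0']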


Changelogs and Change Management {#changelogs}

The CHANGELOG Format

Every skill repository needs a CHANGELOG.md following the Keep a Changelog format:

# Changelog

All notable changes to this project will be documented in this file.

## [2.0.0] - 2025-02-15

### Changed
- **BREAKING**: Made `company_id` a required parameter
- Refactored invoice-fetching logic to use new v3 API endpoint
- Changed `invoice_date` output format to ISO 8601

### Added
- New optional `invoice_status` field
- New optional `payment_terms` field

### Fixed
- Fixed hallucination in invoice_number field (prompt refinement)
- Improved error handling for invalid company IDs

### Migration Guide

All calls to fetch-invoice v2.0.0 must now include company_id.

Before:

fetch-invoice(invoice_id="INV-001")

After:

fetch-invoice(invoice_id="INV-001", company_id="acme-corp")

## [1.1.0] - 2025-01-20

### Added
- New optional `include_line_items` parameter
- New optional `line_items` field in output

## [1.0.0] - 2024-12-01

### Added
- Initial release of fetch-invoice skill

The Change Management Process

Every skill change follows this workflow:

  1. Create feature branch: git checkout -b skill/fetch-invoice/add-line-items
  2. Make changes: Update prompt, schema, tests
  3. Run evaluation suite: Verify the change doesn’t break anything
  4. Update CHANGELOG.md: Document the change with version bump
  5. Create pull request: Include evaluation results, migration guide (if breaking)
  6. Code review: At least one other engineer reviews
  7. Merge and tag: git tag v1.1.0 and push
  8. Notify consumers: Alert all companies using this skill of the new version

Without this process, changes are invisible. Teams upgrade and get surprised.

The Skill Dependency Manifest

Each company maintains a skill-versions.lock file that pins exact versions:

# company-a/skill-versions.lock
skills:
  fetch-invoice: "1.0.0"
  validate-email: "2.1.3"
  update-crm: "3.0.0-rc.1"
  custom-pricing: "1.2.0"

generated: 2025-02-15T10:30:00Z
updated_by: alice@company-a.com
reason: "Tested new fetch-invoice 1.0.0, ready for production"

This lock file prevents “dependency drift” where different environments run different skill versions. It’s like package-lock.json for agent skills.
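
As a sketch of how you’d enforce the lock file, the check below compares pinned versions against what’s actually deployed. The deployed map is stubbed; in practice it would come from your runtime’s API:

# sketch: detect drift between skill-versions.lock and what's actually running
# (the deployed map is stubbed; in practice it comes from your runtime's API)
import yaml  # pip install pyyaml

LOCK = """
skills:
  fetch-invoice: "1.0.0"
  validate-email: "2.1.3"
"""

def find_drift(lock: dict, deployed: dict) -> list:
    """Return (skill, pinned, actual) for every skill not running its pinned version."""
    return [
        (skill, pinned, deployed.get(skill))
        for skill, pinned in lock["skills"].items()
        if deployed.get(skill) != pinned
    ]

deployed = {"fetch-invoice": "1.0.0", "validate-email": "2.1.4"}
for skill, pinned, actual in find_drift(yaml.safe_load(LOCK), deployed):
    print(f"DRIFT: {skill} pinned at {pinned}, running {actual}")
# DRIFT: validate-email pinned at 2.1.3, running 2.1.4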


Building Robust Evaluation Suites {#evaluation-suites}

Why Standard Unit Tests Aren’t Enough

Agent skills are different from traditional code. A unit test that passes doesn’t guarantee the skill works correctly:

  • Prompt sensitivity: Small wording changes change behaviour
  • Model variance: Different models (Claude, GPT-4, Gemini) behave differently
  • Context dependence: Skill performance depends on upstream data quality
  • Silent failures: Skills can hallucinate plausible-sounding wrong answers

You need evaluation suites that test:

  1. Accuracy: Does the skill produce correct outputs?
  2. Consistency: Does it produce the same output for the same input?
  3. Edge cases: How does it handle malformed, missing, or extreme inputs?
  4. Cross-model compatibility: Does it work with different LLMs?
  5. Cost: Does it stay within expected token budgets?
  6. Latency: Does it respond within SLA?

Building an Evaluation Suite

Here’s a practical structure:

# skills/fetch-invoice/eval/config.yaml
evaluation:
  name: "fetch-invoice"
  versions_to_test: ["1.0.0", "1.1.0", "2.0.0-rc.1"]
  
  test_datasets:
    - name: "standard-invoices"
      file: "datasets/standard-invoices.jsonl"
      size: 100
      description: "Normal invoice requests"
    
    - name: "edge-cases"
      file: "datasets/edge-cases.jsonl"
      size: 50
      description: "Malformed, missing, or extreme inputs"
    
    - name: "company-a-production-sample"
      file: "datasets/company-a-sample.jsonl"
      size: 500
      description: "Real production data from Company A"
  
  evaluators:
    - type: "schema-compliance"
      description: "Does output match expected schema?"
      threshold: 100%
    
    - type: "accuracy"
      description: "Does output match ground truth?"
      threshold: 95%
      groundtruth_file: "datasets/standard-invoices-groundtruth.jsonl"
    
    - type: "hallucination-detection"
      description: "Does skill hallucinate?"
      threshold: 0%  # no hallucinations allowed
      checks:
        - invoice_number_format
        - currency_code_validity
        - date_plausibility
    
    - type: "latency"
      description: "Does skill respond within SLA?"
      threshold: 2000ms  # p95
    
    - type: "cost"
      description: "Token usage within budget?"
      threshold: 500  # tokens per request, p95
  
  cross_model_testing:
    - model: "claude-3-5-sonnet"
      region: "us-east-1"
    - model: "gpt-4-turbo"
      region: "us-east-1"
    - model: "gemini-2.0-flash"
      region: "us-central1"
  
  regression_testing:
    - "Does v2.0.0 work with Company A's real data?"
    - "Does v2.0.0 handle the edge case that broke v1.1.0?"
    - "Does v2.0.0 maintain backwards compatibility with old inputs?"

Running Evaluations in CI/CD

Every pull request that changes a skill must run the evaluation suite automatically:

# .github/workflows/skill-eval.yml
name: Skill Evaluation

on:
  pull_request:
    paths:
      - 'skills/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history so the diff against origin/main resolves
      
      - name: Detect changed skills
        id: changes
        run: |
          git diff --name-only origin/main..HEAD | grep 'skills/' | cut -d'/' -f2 | sort -u > /tmp/changed-skills.txt
          cat /tmp/changed-skills.txt
      
      - name: Run evaluation suite
        run: |
          for skill in $(cat /tmp/changed-skills.txt); do
            echo "Evaluating $skill..."
            python eval/run_evaluation.py --skill $skill --versions all
          done
      
      - name: Check thresholds
        run: |
          # the Actions shell runs with -e, so check the exit code explicitly
          # to dump the report before failing the job
          if ! python eval/check_thresholds.py --report eval-results.json; then
            echo "Evaluation failed. See results below."
            cat eval-results.json
            exit 1
          fi
      
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: eval-results
          path: eval-results.json
      
      - name: Comment on PR
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
            const comment = `## Skill Evaluation Results\n\n${JSON.stringify(results, null, 2)}`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

This ensures every change is evaluated before it ships. No surprises in production.


Rollback Drills and Incident Response {#rollback-drills}

Why Rollback Matters

When a skill fails in production—and it will—you need to roll back in minutes, not hours. If a bad skill update causes silent failures across three companies, the cost of a slow rollback is measured in lost revenue and customer trust.

PADISO runs quarterly rollback drills across portfolio companies. Here’s the pattern.

The Rollback Procedure

Step 1: Detect the Failure Monitoring alerts on skill failure rates, output schema violations, or evaluation metric degradation:

# monitoring/alerts.yaml
alerts:
  - name: skill_failure_rate_spike
    metric: skill.failure_rate
    threshold: ">5% (vs. baseline 0.5%)"
    window: 5m
    severity: critical
    action: "Page on-call engineer"
  
  - name: skill_output_schema_violation
    metric: skill.schema_violations
    threshold: ">0"
    window: 1m
    severity: critical
    action: "Page on-call engineer + trigger rollback playbook"
  
  - name: skill_latency_degradation
    metric: skill.p95_latency
    threshold: ">2x baseline"
    window: 10m
    severity: warning
    action: "Alert on-call engineer"

Step 2: Trigger Rollback The on-call engineer runs a single command:

./scripts/rollback-skill.sh \
  --skill fetch-invoice \
  --from-version 2.0.0 \
  --to-version 1.1.0 \
  --companies company-a,company-b,company-c \
  --reason "Critical: 15% failure rate detected"

This script:

  1. Verifies the target version exists and is tested
  2. Checks dependencies (is anything depending on v2.0.0 that won’t work with v1.1.0?)
  3. Updates the skill-versions.lock files
  4. Triggers redeployment to all specified companies
  5. Monitors failure rates for 5 minutes
  6. Sends notifications to all affected teams
  7. Creates an incident ticket with rollback details
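
The heart of that script is a small operation: rewrite each company’s lock file to the known-good version, then redeploy. A minimal sketch in Python, with the commit, redeploy, and monitoring steps left as comments:

# sketch: the core move of scripts/rollback-skill.sh, in Python
# (pin the skill back to the known-good version in a company's lock file)
import yaml  # pip install pyyaml

def rollback_lock(lock_text: str, skill: str, to_version: str, reason: str) -> str:
    """Rewrite a skill-versions.lock so `skill` is pinned to `to_version`."""
    lock = yaml.safe_load(lock_text)
    previous = lock["skills"].get(skill)
    lock["skills"][skill] = to_version
    lock["reason"] = reason
    print(f"{skill}: {previous} -> {to_version}")
    return yaml.safe_dump(lock, sort_keys=False)

lock_text = 'skills:\n  fetch-invoice: "2.0.0"\n'
print(rollback_lock(lock_text, "fetch-invoice", "1.1.0",
                    "Critical: 15% failure rate detected"))
# the real script writes this back per company, commits it, triggers
# redeployment, and watches failure rates for the 5-minute window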

Step 3: Investigate and Fix While the rollback is live, the engineering team investigates:

# Pull the evaluation results for the bad version
python eval/compare-versions.py \
  --skill fetch-invoice \
  --version-a 1.1.0 \
  --version-b 2.0.0 \
  --output comparison-report.json

# Review the diff
git diff v1.1.0..v2.0.0 -- skills/fetch-invoice/

# Check the evaluation suite results from the PR
cat pr-evaluation-results.json

Step 4: Fix and Re-Release Once the fix is made:

# Increment patch version
# Update CHANGELOG.md
# Run evaluation suite
# Tag and push
git tag v2.0.1
git push --tags

# Deploy to canary first (10% of traffic)
./scripts/deploy-skill.sh \
  --skill fetch-invoice \
  --version 2.0.1 \
  --canary true \
  --canary-percentage 10

# Monitor for 15 minutes
# If stable, roll out to 100%
./scripts/deploy-skill.sh \
  --skill fetch-invoice \
  --version 2.0.1 \
  --canary false

Quarterly Rollback Drills

Every quarter, PADISO runs a “failure game day” across the portfolio:

Drill 1: Simulate a Skill Failure We deliberately introduce a bug into a non-critical skill in a staging environment and see how fast the team detects and rolls back.

Drill 2: Test Cross-Portco Rollback We trigger a rollback that affects multiple companies simultaneously and measure:

  • Time to detect failure
  • Time to rollback
  • Any cascading failures
  • Communication lag

Drill 3: Dependency Chain Failure We break a shared skill (used by 3+ companies) and verify that rollback doesn’t break any dependent skills.

Drill 4: Data Corruption Scenario We simulate a skill that produces corrupt output and verify the evaluation suite would have caught it before production.

After each drill, we document:

  • What went well
  • What failed
  • Root cause
  • Process improvements
  • Updated runbooks

This is how you keep a portfolio resilient. Not through hope, but through practice.


Implementing Governance Across Your Portfolio {#governance-implementation}

Phase 1: Audit Current State (Weeks 1-2)

Before implementing versioning and governance, understand what you have:

# Inventory all agent skills across portfolio
for company in company-a company-b company-c; do
  echo "=== $company ==="
  find $company -name "*.yaml" -o -name "*.py" | grep -i skill
  find $company -name "*.md" | grep -i agent
done > skill-inventory.txt

# Identify duplicates
python scripts/find-duplicate-skills.py --inventory skill-inventory.txt

# Map dependencies
python scripts/map-skill-dependencies.py --output dependency-graph.json

# Assess evaluation coverage
python scripts/audit-evaluation-coverage.py --output eval-audit.json
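
The duplicate-detection step is doing something simple under the hood. Here’s a hedged sketch of one way find-duplicate-skills.py could work, using name similarity as a first-pass heuristic (real duplicates still need a human look at prompts and schemas):

# sketch: flag skills whose names are near-duplicates across companies
# (a heuristic first pass, not a definitive duplicate detector)
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicates(inventory: dict, threshold: float = 0.85):
    """inventory maps company -> list of skill names."""
    pairs = []
    flat = [(c, s) for c, skills in inventory.items() for s in skills]
    for (ca, sa), (cb, sb) in combinations(flat, 2):
        if ca == cb:
            continue  # only compare across companies
        if SequenceMatcher(None, sa, sb).ratio() >= threshold:
            pairs.append((f"{ca}/{sa}", f"{cb}/{sb}"))
    return pairs

inventory = {
    "company-a": ["fetch-invoice", "validate-email"],
    "company-b": ["fetch_invoices", "check-email"],
}
print(find_near_duplicates(inventory))
# [('company-a/fetch-invoice', 'company-b/fetch_invoices')]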

Phase 2: Design Governance Model (Weeks 3-4)

Decide: mono-repo, per-portco, or hybrid? We recommend hybrid for most portfolios:

# governance/model.yaml
governance_model: hybrid

shared_skills:
  skills:
    - fetch-invoice    # used by A, B, C
    - validate-email   # used by A, B
    - update-crm       # used by all
  repository: shared-agent-skills
  versioning: strict-semver
  evaluation: mandatory
  code-review: 2 approvals
  deployment: canary (10%, 5min) -> 100%

company_specific_skills:
  skills:
    company-a: [custom-pricing, industry-validation]
    company-b: [compliance-check, audit-log]
    company-c: [custom-integration]
  repository: per-company
  versioning: semver
  evaluation: mandatory
  code-review: 1 approval
  deployment: direct or canary (depends on risk)

central_registry:
  repository: skill-registry
  tracks: all skills, versions, dependencies, owners, SLAs
  updated: on every skill change
  audited: monthly

Phase 3: Migrate to Versioning (Weeks 5-8)

Migrate existing skills to the new structure:

  1. Assign initial versions: All current skills become v1.0.0
  2. Create CHANGELOG.md: Retroactively document any known issues
  3. Build evaluation suites: Start with basic accuracy tests
  4. Set up CI/CD: Automate evaluation on every PR
  5. Create skill-versions.lock: Pin all companies to v1.0.0 initially
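
Step 5 is mechanical enough to script. A sketch, assuming the lock-file layout shown earlier:

# sketch: Phase 3, step 5 -- generate an initial lock file pinning a company
# to v1.0.0 of every skill it uses (layout mirrors the earlier lock example)
from datetime import datetime, timezone
import yaml  # pip install pyyaml

def initial_lock(skills: list, author: str) -> str:
    lock = {
        "skills": {name: "1.0.0" for name in skills},
        "generated": datetime.now(timezone.utc).isoformat(),
        "updated_by": author,
        "reason": "Initial migration: all skills pinned to v1.0.0",
    }
    return yaml.safe_dump(lock, sort_keys=False)

print(initial_lock(["fetch-invoice", "validate-email"], "alice@company-a.com"))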

Phase 4: Establish Governance Processes (Weeks 9-12)

  • Code review process: Who approves skill changes?
  • Evaluation thresholds: What accuracy/latency/cost limits do we enforce?
  • Deployment cadence: Weekly? Daily? On-demand with approval?
  • Rollback procedures: Who can trigger rollback? How fast?
  • Incident response: What’s the escalation path?
  • Training: Ensure all engineers understand the system

Document everything in a GOVERNANCE.md file in each repository:

# Agent Skills Governance

## Versioning
- Semantic versioning (MAJOR.MINOR.PATCH)
- MAJOR: breaking changes
- MINOR: backwards-compatible additions
- PATCH: bug fixes

## Code Review
- Shared skills: 2 approvals from different teams
- Company-specific skills: 1 approval
- Evaluation results must be passing

## Evaluation
- All skills must pass accuracy threshold (95%)
- Schema compliance: 100%
- Hallucination detection: 0%
- Latency: <2000ms p95

## Deployment
- Shared skills: canary 10% for 5min, then 100%
- Company-specific: direct deployment with approval
- Rollback: on-call engineer can trigger immediately

## Incident Response
- Failure rate >5%: page on-call
- Schema violations: immediate rollback
- Silent failures: investigate within 1 hour

Real-World Patterns from PADISO Quarterly Drills {#quarterly-drills}

Case Study 1: The Shared Skill Cascade

Scenario: Company A’s engineering team optimised the “update-crm” skill (used by A, B, C) to handle a new field. They bumped the version from 1.2.0 to 1.3.0, but forgot to update the output schema documentation.

Company B’s workflow automation relied on the old schema. When B upgraded to 1.3.0, their downstream process failed silently—it was looking for a field that no longer existed.

Detection: Company B’s revenue reporting job failed. The team noticed 6 hours later when monthly reports weren’t generated.

Root Cause: The change shipped as MINOR on the assumption it was backwards-compatible, but it removed a field from the output schema (a breaking change), and the documentation wasn’t updated. Company B didn’t realise the schema had changed.

Fix:

  1. Roll back Company B to v1.2.0 (5 minutes)
  2. Re-release the change as v2.0.0 (MAJOR, breaking change)
  3. Update CHANGELOG.md with explicit migration guide
  4. Company B manually upgraded with awareness of the schema change

Lesson Learned:

  • Schema changes = MAJOR version bump, even if technically backwards-compatible
  • Evaluation suite should detect schema drift and fail the PR
  • All companies using a shared skill must be notified of MAJOR changes with migration guides
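
That second lesson is automatable. Here’s a sketch of a schema-drift gate for CI, assuming output schemas live as JSON Schema in each version’s schema.json:

# sketch: fail the PR if a field disappears or changes type between the old
# and new output schemas (JSON Schema object layout assumed)
import json

def schema_drift(old: dict, new: dict) -> list:
    """Return human-readable breaking changes between two object schemas."""
    breaks = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            breaks.append(f"field removed: {name}")
        elif new_props[name].get("type") != spec.get("type"):
            breaks.append(f"type changed: {name}")
    for name in old.get("required", []):
        if name not in new.get("required", []) and name not in new_props:
            breaks.append(f"required field dropped: {name}")
    return breaks

old = json.loads('{"properties": {"status": {"type": "string"}}, "required": ["status"]}')
new = json.loads('{"properties": {}, "required": []}')
print(schema_drift(old, new))
# ['field removed: status', 'required field dropped: status']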

Case Study 2: The Prompt Hallucination

Scenario: Company C’s “fetch-invoice” skill started hallucinating invoice numbers. The prompt had been tweaked to handle a new edge case, but the tweak made the skill more prone to making up plausible-sounding invoice numbers.

Detection: Company C’s accounts team noticed invoices being marked as “paid” that didn’t exist. The skill was returning hallucinated invoice IDs that didn’t match any real records.

Root Cause: The evaluation suite tested accuracy on standard invoices but didn’t test for hallucination on edge cases (missing invoices, ambiguous requests).

Fix:

  1. Roll back to v1.0.0 (2 minutes)
  2. Enhance evaluation suite with hallucination detection
  3. Rewrite the prompt more carefully
  4. Add specific test cases for edge cases
  5. Re-release as v1.0.2 with updated evaluation results

Lesson Learned:

  • Prompt changes need evaluation suites that specifically test for hallucination
  • Edge-case datasets are critical
  • PATCH version bumps can still introduce regressions; evaluation is mandatory

Case Study 3: The Dependency Hell

Scenario: The “validate-email” skill (v2.0.0) depended on a regex utility that was shared across multiple skills. When the utility was updated, the email validation started failing on certain domain formats.

Company A was using validate-email v2.0.0, but the utility update wasn’t reflected in their skill-versions.lock. So they got the new utility code without the new validate-email version that was compatible with it.

Detection: Company A’s email validation started failing for valid email addresses with certain domain structures.

Root Cause: Dependency tracking was incomplete. The skill-versions.lock tracked skills but not shared utilities.

Fix:

  1. Roll back Company A’s shared utilities to the previous version (3 minutes)
  2. Update the skill-versions.lock format to include utility versions
  3. Update CI/CD to verify utility compatibility
  4. Re-release validate-email v2.0.1 with explicit utility version requirement

Lesson Learned:

  • Skills have implicit dependencies (shared code, APIs, data schemas)
  • Dependency tracking must be explicit and verified
  • Evaluation suites should test against multiple versions of dependencies

Case Study 4: The Silent Failure

Scenario: The “generate-report” skill started producing invalid JSON in certain edge cases. The skill didn’t error; it just returned malformed output that downstream processes couldn’t parse.

Company B’s reporting pipeline started silently dropping records. No errors in logs. Just missing data.

Detection: Manual audit of report completeness revealed 15% of records were missing.

Root Cause: The evaluation suite tested “happy path” scenarios but not schema validation on real production data with edge cases.

Fix:

  1. Roll back to v1.0.0 (2 minutes)
  2. Add schema validation to the skill itself (return error instead of malformed JSON)
  3. Enhance evaluation suite to test against real production datasets
  4. Add monitoring for output schema violations
  5. Re-release as v1.1.0 with stricter validation

Lesson Learned:

  • Silent failures are worse than loud failures
  • Skills should validate their own output and error loudly
  • Evaluation suites should test against real production data, not just synthetic datasets
  • Monitoring should alert on schema violations, not just error rates
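
The second lesson, skills validating their own output, looks like this in practice. A sketch using JSON Schema validation via the jsonschema package, with an illustrative report schema:

# sketch: a skill validating its own output before returning it, so bad JSON
# fails loudly instead of silently corrupting downstream reports
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["report_id", "rows"],
    "properties": {
        "report_id": {"type": "string"},
        "rows": {"type": "array"},
    },
}

def generate_report(raw_model_output: str) -> dict:
    try:
        result = json.loads(raw_model_output)  # malformed JSON raises here
        validate(result, OUTPUT_SCHEMA)        # schema violations raise here
    except (json.JSONDecodeError, ValidationError) as e:
        # error loudly: downstream sees a failure, not missing records
        raise RuntimeError(f"generate-report produced invalid output: {e}") from e
    return result

print(generate_report('{"report_id": "R-1", "rows": []}')["report_id"])  # R-1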

Next Steps and Quick Wins {#next-steps}

This Week

  1. Audit your current state: Inventory all agent skills across your portfolio. Are they versioned? Documented? Tested?

  2. Read the standards: Familiarise yourself with how agent skills are becoming an industry standard and explore the VoltAgent awesome-agent-skills repository to see how leading teams structure skills.

  3. Choose your governance model: Mono-repo, per-portco, or hybrid? Document the decision.

  4. Design your versioning scheme: Use semantic versioning. Document what MAJOR, MINOR, and PATCH mean for your skills.

This Month

  1. Migrate one shared skill: Pick a skill used by 2+ companies. Version it, add a CHANGELOG.md, build an evaluation suite, and deploy with the new process. Use this as your template.

  2. Set up CI/CD: Automate evaluation on every PR. Fail PRs that don’t meet your thresholds.

  3. Create a skill registry: Build a central source of truth that tracks all skills, versions, dependencies, and owners.

  4. Run your first rollback drill: Intentionally break a skill and see how fast your team detects and recovers. Document the gaps.

This Quarter

  1. Migrate all shared skills: Version and govern everything used by 2+ companies.

  2. Establish governance processes: Document code review, evaluation, deployment, and incident response procedures.

  3. Train your teams: Ensure all engineers understand the versioning, governance, and rollback procedures.

  4. Run quarterly drills: Make rollback and incident response a muscle memory exercise.

Connecting to Your Broader AI Strategy

Agent skill versioning and governance aren’t isolated technical problems. They’re part of your broader AI strategy and readiness across the portfolio.

If you’re modernising mid-market companies with agentic AI vs traditional automation approaches, skill governance determines whether you ship fast or get stuck in chaos.

If you’re running AI automation and orchestration across portfolio companies, skill versioning ensures consistency and prevents the silent failures that kill trust.

If you’re pursuing SOC 2 or ISO 27001 compliance, skill governance creates the audit trail and change control that auditors demand.

Working with PADISO

If you’re managing a portfolio of mid-market companies and need help implementing agent skill governance, PADISO runs this at scale. We design the governance model, set up the infrastructure, run the quarterly drills, and embed the practices into your teams.

Our fractional CTO and AI & Agents Automation services cover:

  • Governance design: Choosing the right versioning, evaluation, and deployment patterns for your portfolio
  • Infrastructure setup: Building the mono-repo, evaluation suites, CI/CD pipelines, and rollback automation
  • Team enablement: Training your engineers on the processes and running quarterly drills
  • Ongoing optimisation: Monitoring failure rates, identifying bottlenecks, and continuously improving the system

We’ve run this pattern across portfolios managing 50+ agent skills in production. The difference between chaotic and controlled is governance.


Summary

Versioning agent skills across a mid-market portfolio requires four things:

  1. A governance model: Mono-repo for shared skills, per-portco for company-specific skills, with a central registry tracking everything.

  2. Semantic versioning: MAJOR for breaking changes, MINOR for backwards-compatible additions, PATCH for bug fixes. Every version must have a CHANGELOG entry and evaluation results.

  3. Robust evaluation suites: Test accuracy, consistency, edge cases, cross-model compatibility, cost, and latency. Make evaluation mandatory in CI/CD.

  4. Fast rollback procedures: Detect failures, trigger rollback in one command, investigate, fix, and re-release. Practice this quarterly until it’s automatic.

Without these, your portfolio is a collection of isolated islands, each duplicating effort, diverging on quality, and burning engineering cycles on preventable failures.

With these, you ship faster, fail safer, and scale confidently across companies.