
Data Warehouse Migration Patterns: Snowflake, BigQuery, and Redshift to D23.io

Master data warehouse migration patterns to Snowflake, BigQuery, Redshift, and D23.io. Learn governance, RLS, cost control, and proven migration strategies.

The PADISO Team · 2026-05-11


Table of Contents

  1. Why Data Warehouse Migration Matters
  2. Understanding Your Current Data Architecture
  3. Snowflake Migration Patterns
  4. BigQuery Migration Patterns
  5. Redshift Migration Patterns
  6. D23.io Managed Superset Integration
  7. Governance and Row-Level Security
  8. Cost Control Strategies
  9. Migration Timeline and Execution
  10. Common Pitfalls and Remediation
  11. Post-Migration Optimization
  12. Getting Started with PADISO

Why Data Warehouse Migration Matters

Data warehouse migration isn’t a checkbox exercise—it’s a business decision that affects query performance, operational costs, team productivity, and decision-making velocity. Most organisations running legacy or on-premises data infrastructure are leaving 30–50% of their analytics ROI on the table through slow queries, manual ETL overhead, and inability to scale.

When we’ve partnered with mid-market and enterprise operators on data modernisation projects, the pattern is consistent: migrate to cloud, standardise governance, and connect analytics tools that actually work. The result is faster insights, lower operational cost, and teams that spend time on strategy instead of infrastructure firefighting.

Snowflake, BigQuery, and Redshift dominate the cloud data warehouse market for good reason. Each has distinct strengths. Snowflake excels at separation of compute and storage with a SQL-native interface. BigQuery offers serverless simplicity and tight Google Cloud integration. Redshift provides cost-effective scale for AWS-native workloads. But the real win comes when you layer modern analytics tools—like D23.io’s managed Superset—on top, with proper governance, row-level security (RLS), and cost controls baked in from day one.

This guide walks you through proven migration patterns, governance frameworks, and integration strategies that actually scale. We’ve seen teams ship these migrations in 8–16 weeks and cut data infrastructure costs by 25–40% while improving query performance by 3–5x.


Understanding Your Current Data Architecture

Before you migrate, you need brutal clarity on what you’re moving. Most organisations underestimate the scope because they focus on table count instead of data lineage, dependency graphs, and stakeholder friction.

Audit Your Current State

Start with a technical inventory:

  • Schema and table count: How many databases, schemas, tables, views, and stored procedures exist?
  • Data volume and growth rate: Total GB/TB today, monthly growth trajectory, and seasonal spikes.
  • Query patterns: Which reports run daily? Which are ad-hoc? What’s the peak query concurrency?
  • ETL and data pipeline tools: Talend, Informatica, custom scripts, Apache Airflow, dbt, or something else?
  • BI and analytics tools: Tableau, Power BI, Looker, Qlik, custom dashboards, or legacy reporting?
  • User and permission model: How many users, roles, and permission groups? Is access granular or coarse?
  • Data quality and lineage: Do you have metadata management, data catalogs, or lineage tracking?
  • Compliance and governance requirements: GDPR, CCPA, HIPAA, or industry-specific mandates?

This audit usually takes 2–4 weeks and reveals hidden dependencies. We’ve seen teams discover 40% more tables and 60% more ETL jobs than they initially listed because stakeholders were running shadow analytics on production databases.

Assess Readiness

Next, evaluate team and organisational readiness:

  • SQL dialect knowledge: How many team members can write and optimise SQL for cloud platforms?
  • Cloud platform experience: Is your team AWS, GCP, or Azure native? Or starting fresh?
  • Change management capacity: Can your organisation absorb a major infrastructure shift while maintaining business continuity?
  • Budget and timeline constraints: What’s your realistic runway? Can you fund a 12–16 week project?

Readiness assessment prevents false starts. Teams that underestimate complexity often stall mid-migration, creating technical debt and demoralising staff.


Snowflake Migration Patterns

Snowflake is the market leader for good reason: it separates compute and storage, scales elastically, and handles semi-structured data natively. The migration path is well-trodden.

Pattern 1: Full Lift-and-Shift

For organisations with clean schemas and minimal custom code, a direct migration works:

  1. Schema conversion: Use Snowflake’s Amazon Redshift to Snowflake Migration Guide or schema-conversion tools (Striim, Qlik Replicate (formerly Attunity), AWS SCT) to auto-convert DDL. Expect 70–80% accuracy; plan 2–3 weeks for manual remediation.
  2. Data loading: Leverage Snowflake’s COPY command or Snowpipe for continuous ingestion. For large datasets (>10TB), use S3 staging with parallel loading. We’ve seen 500GB load in 6–8 hours on standard Snowflake clusters.
  3. View and procedure migration: Rewrite views and stored procedures in Snowflake SQL. Most convert cleanly, but complex procedural logic often requires rearchitecture.
  4. Testing and validation: Run row-count and checksum validation on 100% of tables. Spot-check query results against source. This phase typically takes 3–4 weeks.

Timeline: 8–12 weeks for a 50–100 table schema with clean data.
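
To make step 2 concrete, here is a minimal sketch of an S3-staged bulk load, assuming a Parquet export already sits in S3. The stage, table, and bucket names are illustrative, and in practice you would authenticate via a storage integration rather than inline keys:

    -- Illustrative external stage over the S3 export location
    CREATE OR REPLACE STAGE analytics.raw.migration_stage
      URL = 's3://my-migration-bucket/exports/orders/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')  -- prefer a STORAGE INTEGRATION in practice
      FILE_FORMAT = (TYPE = PARQUET);

    -- Parallel bulk load into the raw layer
    COPY INTO analytics.raw.orders
      FROM @analytics.raw.migration_stage
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE   -- map Parquet columns to table columns by name
      ON_ERROR = 'ABORT_STATEMENT';

    -- Quick sanity check after the load
    SELECT COUNT(*) FROM analytics.raw.orders;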

Pattern 2: Phased Migration with Dual-Write

For risk-averse organisations or complex ETL pipelines, run source and target in parallel:

  1. Set up Snowflake: Create target schema, configure storage and compute tiers, set up network policies and authentication.
  2. Implement dual-write ETL: Modify your ETL tool (dbt, Airflow, Talend) to write to both source and Snowflake simultaneously. Use feature flags to toggle between sources for analytics queries.
  3. Validate data consistency: Run daily reconciliation queries comparing row counts, aggregates, and sample rows. Investigate mismatches before cutover.
  4. Migrate users incrementally: Move reporting teams to Snowflake BI connections in waves. Start with low-risk, read-only users. Monitor query performance and user feedback.
  5. Cutover and decommission: Once all users are on Snowflake and source validation passes, stop writes to legacy system and archive.

Timeline: 12–16 weeks. Higher cost (running dual systems), but lower operational risk.
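
For the reconciliation in step 3, a hedged sketch of a daily comparison query is below; run the same statement against the legacy warehouse and against Snowflake, then diff the outputs. Table and column names are placeholders, and the DATEADD syntax assumes a Snowflake or Redshift dialect (adjust for your source system):

    -- Run against both systems and compare the result sets
    SELECT
        order_date,
        COUNT(*)                    AS row_count,
        SUM(order_total)            AS total_amount,
        COUNT(DISTINCT customer_id) AS distinct_customers
    FROM analytics.marts.fct_orders
    WHERE order_date >= DATEADD(day, -7, CURRENT_DATE)  -- reconcile the trailing week
    GROUP BY order_date
    ORDER BY order_date;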

Pattern 3: Incremental Table Migration

For organisations with 200+ tables and complex dependencies, migrate by domain or business unit:

  1. Identify migration cohorts: Group tables by business domain (finance, sales, operations). Start with domains that have fewest downstream dependencies.
  2. Build staging layers: In Snowflake, create raw, staging, and mart layers matching your source architecture. Use dbt to manage transformations.
  3. Redirect ETL by cohort: Update your ETL tool to land new data in Snowflake for the first cohort. Keep historical data in source for 4–8 weeks as fallback.
  4. Migrate dependent reports: Move BI tools and queries to Snowflake cohort by cohort. Validate each migration before moving to next cohort.
  5. Archive source tables: Once a cohort is fully migrated and validated, archive the source tables (keep for 6–12 months for compliance).

Timeline: 16–24 weeks. Spreads risk and allows team to learn and adapt between cohorts.

Snowflake-Specific Considerations

Compute and storage separation: Snowflake charges separately for compute (per-second, per-credit) and storage. Right-size your warehouse clusters. A 4-cluster warehouse for dev/test is overkill; use 1–2 clusters. For production, start with 8–16 credits per hour and scale based on query queue depth.

Data sharing and external tables: Snowflake’s native data sharing (zero-copy) is powerful for sharing datasets across accounts. If you have multiple business units, consider shared databases to reduce storage duplication.

Time-travel and cloning: Snowflake’s time-travel feature (default 1 day, up to 90 days with enterprise) is invaluable for auditing and recovery. Clone schemas for testing without storage overhead.

Dynamic masking and RLS: Snowflake’s column-level masking policies and row access policies (available in enterprise) enable granular governance. More on this in the governance section.
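
As a taste of what this looks like in practice, here is an illustrative sketch of a row access policy and a masking policy; the mapping table, roles, and column names are hypothetical, and Enterprise Edition is assumed:

    -- Row access policy: users only see rows for regions mapped to their role
    CREATE OR REPLACE ROW ACCESS POLICY sales_region_policy
      AS (region VARCHAR) RETURNS BOOLEAN ->
        CURRENT_ROLE() IN ('SYSADMIN', 'ANALYTICS_ADMIN')
        OR EXISTS (
          SELECT 1 FROM governance.user_region_map m
          WHERE m.role_name = CURRENT_ROLE() AND m.region = region
        );

    ALTER TABLE analytics.marts.fct_sales
      ADD ROW ACCESS POLICY sales_region_policy ON (region);

    -- Column-level masking for PII
    CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END;

    ALTER TABLE analytics.marts.dim_customers
      MODIFY COLUMN email SET MASKING POLICY email_mask;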


BigQuery Migration Patterns

BigQuery is Google’s serverless data warehouse. It’s best for organisations already on GCP, those with heavy machine learning workloads, or teams that prefer SQL simplicity over infrastructure management.

Pattern 1: Direct Load from Cloud Storage

BigQuery’s simplest migration path leverages Google Cloud Storage (GCS):

  1. Export source data: From your legacy warehouse (Redshift, Snowflake, on-premises), export tables as Parquet or Avro to GCS. Use parallel exports to speed up. A 500GB export typically takes 4–6 hours.
  2. Create BigQuery datasets: Define datasets matching your source schema. Use dataset-level IAM for access control.
  3. Load via BigQuery console or API: Use bq load CLI or the GCP console to load from GCS. BigQuery auto-detects schema from Parquet; adjust data types and add descriptions post-load.
  4. Validate and transform: Use dbt or SQL scripts to create views and marts. BigQuery’s columnar storage and clustering make queries fast even on first load.

Timeline: 4–8 weeks for a straightforward migration. Cost is minimal if you’re loading once; repeated loads incur GCS egress charges.
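
A minimal sketch of the batch-load step using BigQuery’s LOAD DATA statement follows; the dataset, table, and bucket names are placeholders, and the same load can equally be run via the bq CLI or the console:

    -- One-off batch load from GCS; schema is inferred from the Parquet files
    LOAD DATA OVERWRITE analytics_raw.orders
    FROM FILES (
      format = 'PARQUET',
      uris   = ['gs://my-migration-bucket/exports/orders/*.parquet']
    );

    -- Validate the load
    SELECT COUNT(*) AS row_count FROM analytics_raw.orders;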

Pattern 2: Streaming Ingestion for Real-Time Data

For organisations needing fresh data (e.g., real-time dashboards, event-driven analytics):

  1. Set up Pub/Sub or Dataflow: Route streaming events to Google Cloud Pub/Sub. Use Dataflow (Apache Beam) to transform and load into BigQuery.
  2. Configure BigQuery tables for streaming: Enable streaming inserts (or the newer Storage Write API). Streamed rows land in a buffer and are queryable within seconds of arrival, so near-real-time dashboards are achievable.
  3. Backfill historical data: While streaming pipeline is live, backfill historical data via batch load (pattern 1). This ensures no data loss.
  4. Monitor and scale: Use BigQuery monitoring (query performance, slot utilisation) to right-size your commitment. Streaming inserts cost more per GB than batch loads.

Timeline: 6–10 weeks including pipeline development and testing.

Pattern 3: Federated Queries and External Tables

For organisations with data spread across multiple systems (Redshift, on-premises, SaaS APIs):

  1. Create external tables: BigQuery can query data directly from GCS, Cloud SQL, Bigtable, or Spanner without loading. Use EXTERNAL_QUERY() for federated queries against Cloud SQL Postgres or MySQL; for Redshift data, export to GCS and expose it as an external table.
  2. Build unified view: Create BigQuery views that union external tables with native BigQuery tables. Teams query the view without knowing source location.
  3. Migrate gradually: As you migrate datasets to BigQuery, external table references remain unchanged. BI tools see unified schema.

Timeline: 2–4 weeks to set up. Ongoing cost depends on data scanned in external sources.
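
A short sketch of the external-table-plus-view pattern, with placeholder dataset and bucket names (the union assumes the external and native tables share a schema):

    -- External table over the legacy export in GCS
    CREATE OR REPLACE EXTERNAL TABLE analytics_raw.orders_external
    OPTIONS (
      format = 'PARQUET',
      uris   = ['gs://legacy-exports/orders/*.parquet']
    );

    -- BI tools query the view and never need to know where each slice lives
    CREATE OR REPLACE VIEW analytics.orders_unified AS
    SELECT * FROM analytics_raw.orders_native      -- data already migrated into BigQuery
    UNION ALL
    SELECT * FROM analytics_raw.orders_external;   -- data still sitting in GCS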

BigQuery-Specific Considerations

Pricing model: BigQuery charges per TB scanned (on-demand) or via annual/monthly slot commitments. For large, predictable workloads, commitments (100 slots = ~$2k/month) are cheaper. For ad-hoc, on-demand is simpler. Monitor query costs; a poorly written query scanning 1TB costs $6.25.

Clustering and partitioning: BigQuery’s native clustering (on up to 4 columns) and partitioning (by date or integer range) dramatically reduce query cost and latency. Always cluster/partition large tables.
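
For example, a partitioned and clustered event table might be declared as below; the column names are illustrative:

    CREATE TABLE analytics.fct_events
    (
      event_ts   TIMESTAMP,
      user_id    STRING,
      region     STRING,
      event_name STRING,
      payload    JSON
    )
    PARTITION BY DATE(event_ts)    -- date-filtered queries scan only matching partitions
    CLUSTER BY region, user_id;    -- co-locates rows on the most common filter columns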

Nested and repeated fields: BigQuery handles JSON and nested structures natively (unlike traditional SQL warehouses). Leverage this for semi-structured data without flattening.

Machine learning integration: BigQuery ML (BQML) allows you to build ML models directly in SQL. If your team wants to experiment with forecasting, classification, or clustering, this is a huge advantage.


Redshift Migration Patterns

Redshift is AWS’s data warehouse. It’s cost-effective for AWS-native workloads but less flexible than Snowflake. Migration patterns depend on whether you’re migrating to Redshift (from on-premises or another cloud) or from Redshift (to Snowflake or BigQuery).

Pattern 1: On-Premises to Redshift

If you’re moving from legacy data warehouse (Teradata, Netezza, Oracle DW) to Redshift:

  1. Set up Redshift cluster: Choose node type (RA3, DC2). RA3 is newer, offers managed storage and better scaling; DC2 is cheaper for fixed workloads. Start with 2–4 nodes.
  2. Convert schema: Use the AWS Schema Conversion Tool (SCT) or Striim to auto-convert DDL. Redshift SQL is close to PostgreSQL; most schemas convert cleanly. Expect 1–2 weeks of manual fixes.
  3. Load data: Use AWS DMS for continuous replication or S3 + COPY for bulk loads. For 1TB+, use S3 staging with parallel COPY. A 2TB load typically takes 8–12 hours on a 4-node RA3 cluster.
  4. Migrate ETL: Rewrite ETL jobs in AWS Glue, Airflow, or dbt. Redshift’s native support for UNLOAD to S3 and COPY from S3 makes ETL patterns straightforward.
  5. Test and validate: Run a 4-week parallel validation. Use Redshift’s EXPLAIN output and query system views (e.g. SVL_QUERY_SUMMARY) to optimise query plans.

Timeline: 10–14 weeks including schema conversion, data migration, and ETL rewrite.
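
To illustrate step 3, here is a hedged sketch of a parallel COPY from S3; the bucket, table, and IAM role ARN are placeholders:

    -- Bulk load from S3 staging into Redshift
    COPY analytics.raw_orders
    FROM 's3://my-migration-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;

    -- Check for load errors and verify the row count
    SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 20;
    SELECT COUNT(*) FROM analytics.raw_orders;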

Pattern 2: Redshift to Snowflake

If you’re outgrowing Redshift (cost, scaling, or operational burden), Snowflake is a common target. Snowflake provides an Amazon Redshift to Snowflake migration reference manual to guide the process:

  1. Export from Redshift: Use UNLOAD to S3 to export all tables. Redshift’s UNLOAD is fast; a 500GB export takes 2–4 hours.
  2. Convert schema: Snowflake’s Amazon Redshift to Snowflake Migration Guide handles most DDL conversion. Redshift’s distribution keys and sort keys don’t map 1:1 to Snowflake; redesign clustering strategy in Snowflake.
  3. Load into Snowflake: Use S3 staging and COPY. Snowflake’s parallel loading is faster than Redshift; a 500GB load takes 1–2 hours.
  4. Rewrite queries and procedures: Redshift queries often use distribution-key hints and sort-key optimisations. Snowflake doesn’t need these; queries often run faster without modification, but test thoroughly.
  5. Migrate BI tools: Update connection strings and test dashboards. Most BI tools support both Redshift and Snowflake; cutover is usually smooth.

Timeline: 8–12 weeks.
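
A minimal sketch of the export (step 1) and Snowflake load (step 3), assuming an external stage already points at the same S3 bucket; all object names and the role ARN are placeholders:

    -- Step 1: export from Redshift to S3 as Parquet (UNLOAD writes multiple files in parallel by default)
    UNLOAD ('SELECT * FROM analytics.fct_orders')
    TO 's3://my-migration-bucket/unload/fct_orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
    FORMAT AS PARQUET;

    -- Step 3: load the same files into Snowflake from an external stage
    COPY INTO analytics.raw.fct_orders
      FROM @analytics.raw.migration_stage/unload/fct_orders/
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;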

Pattern 3: Redshift Spectrum for Hybrid Queries

If you’re not ready to fully migrate, Redshift Spectrum allows querying S3 data without loading:

  1. Set up external schema: Point Redshift to S3 and define external tables via Athena or Glue Catalog.
  2. Query S3 directly: Use standard SQL to query S3 Parquet/CSV files. Redshift pushes down predicates to S3, reducing data scanned.
  3. Join internal and external tables: Redshift can join local tables with S3 external tables in a single query.

Spectrum buys time for gradual migration but has higher per-query cost than native Redshift tables. Use for infrequent, exploratory queries, not high-volume reporting.

Timeline: 2–4 weeks to set up. Ongoing cost depends on query volume.
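
A short Spectrum sketch, assuming the table metadata lives in the Glue Data Catalog; the database, bucket, and role names are placeholders:

    -- Expose S3 data through the Glue Data Catalog
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG
    DATABASE 'legacy_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Join a hot local table with cold history kept in S3
    SELECT c.customer_id, SUM(h.order_total) AS lifetime_value
    FROM analytics.dim_customers c
    JOIN spectrum.orders_history h ON h.customer_id = c.customer_id
    GROUP BY c.customer_id;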

Redshift-Specific Considerations

Node types and cluster sizing: RA3 is the future (managed storage, flexible compute); DC2 is cheaper but fixed. For new clusters, prefer RA3. For migrations, choose based on workload size and budget.

Distribution keys: Redshift requires you to choose a distribution key per table. Poor distribution causes data skew and slow queries. Test distribution strategies before full migration. See Amazon Redshift Management Guide for details.

Vacuum and analyse: Redshift requires periodic VACUUM and ANALYZE runs to maintain performance. Set up automated maintenance windows. Snowflake handles this automatically; this is a key operational difference.

Concurrency and WLM: Redshift’s Workload Management (WLM) queues queries by priority. Tune WLM slots and memory allocation to prevent query queuing. Snowflake’s compute separation avoids this complexity.


D23.io Managed Superset Integration

Once you’ve migrated your data warehouse, you need analytics tools that scale with your data and governance requirements. D23.io’s managed Superset is a modern, open-source BI platform that integrates seamlessly with Snowflake, BigQuery, and Redshift.

Why D23.io Superset?

Superset is lightweight, code-friendly, and integrates natively with dbt. Unlike enterprise BI tools (Tableau, Power BI), Superset:

  • Runs on your infrastructure: Self-hosted or managed via D23.io, with full control over data flow.
  • Supports dbt natively: Define metrics and dimensions in dbt; Superset auto-discovers and visualises them.
  • Enables SQL-native analytics: Users write SQL directly (no drag-and-drop limitations).
  • Scales to 1000+ users: With proper governance and caching.

Connecting Snowflake to D23.io Superset

  1. Create Snowflake service account: In Snowflake, create a dedicated user for Superset with minimal required roles (SELECT on specific schemas, USAGE on warehouse). Rotate credentials every 90 days.
  2. Configure connection in Superset: In D23.io’s Superset console, add Snowflake connection using service account credentials. Use Snowflake’s private link (if available) for network security.
  3. Define datasets: Create Superset datasets from Snowflake tables or views. Use SQL-based datasets for complex transformations (joins, aggregations). Superset caches results; set cache TTL based on freshness requirements (e.g., 1 hour for daily reports, 5 minutes for operational dashboards).
  4. Apply RLS policies: Use Superset’s filter configuration to apply row-level security. For example, filter sales data by region based on user attributes. More detail in the governance section.
  5. Build dashboards: Create dashboards using Superset’s chart builder or SQL queries. Publish and share with teams.
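
A least-privilege sketch of the service account from step 1; the role, warehouse, and schema names are illustrative, and the password should live in a secrets manager rather than a script:

    -- Read-only role for Superset
    CREATE ROLE IF NOT EXISTS superset_reader;
    GRANT USAGE ON WAREHOUSE bi_wh                           TO ROLE superset_reader;
    GRANT USAGE ON DATABASE analytics                        TO ROLE superset_reader;
    GRANT USAGE ON SCHEMA analytics.marts                    TO ROLE superset_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts     TO ROLE superset_reader;
    GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.marts  TO ROLE superset_reader;

    -- Dedicated service user; rotate the credential every 90 days
    CREATE USER IF NOT EXISTS superset_svc
      PASSWORD          = '<rotate-me>'
      DEFAULT_ROLE      = superset_reader
      DEFAULT_WAREHOUSE = bi_wh;
    GRANT ROLE superset_reader TO USER superset_svc;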

Connecting BigQuery to D23.io Superset

  1. Create BigQuery service account: In the Google Cloud Console, create a service account with roles/bigquery.dataViewer and roles/bigquery.jobUser (add roles/bigquery.dataEditor only if Superset needs write access). Download the JSON key.
  2. Configure connection in Superset: Paste the JSON key into Superset’s BigQuery connection form. Superset auto-detects datasets and tables.
  3. Define datasets: Similar to Snowflake. Use BigQuery’s native clustering and partitioning to optimise query cost.
  4. Monitor query cost: BigQuery charges per TB scanned. In Superset, use SQL comments to tag queries by dashboard/user for cost tracking. D23.io’s managed Superset includes cost monitoring dashboards.
  5. Apply RLS: Use Superset’s filter configuration. For BigQuery, you can also use BigQuery’s column-level policies for extra security.

Connecting Redshift to D23.io Superset

  1. Create Redshift user: In Redshift, create a dedicated user with SELECT on required schemas. Use temporary credentials via IAM or Secrets Manager for better security.
  2. Configure connection in Superset: Add Redshift connection using hostname, port (5439), username, password, and database name. Use SSL for encrypted connection.
  3. Define datasets: Create datasets from Redshift tables. Note Redshift’s distribution-key optimisations; queries are faster when filtering on distribution key.
  4. Monitor performance: Redshift’s query performance can degrade with concurrent users. Monitor queue depth and WLM configuration. D23.io’s Superset includes Redshift-specific monitoring.
  5. Apply RLS: Use Superset’s filter configuration or Redshift’s row-level security (if available in your version).

D23.io Governance and Cost-Control Patterns

D23.io’s managed Superset includes several governance and cost-control features:

Query caching and performance: Superset caches query results. Set cache TTL based on data freshness requirements. For frequently accessed datasets, cache for 1–4 hours. For real-time dashboards, cache for 1–5 minutes. This reduces warehouse load by 60–80%.

Cost attribution and chargeback: Tag queries by department, cost centre, or user. D23.io’s Superset includes cost dashboards showing spend by dataset, user, and dashboard. Use this to educate teams on analytics ROI and optimise expensive queries.

Audit and compliance: All queries are logged with user, timestamp, and query text. Export logs to your SIEM or data warehouse for compliance audits. This is critical for SOC 2 and ISO 27001 compliance.

Alert and SLA management: Set up alerts for slow queries, high costs, or data freshness issues. Superset integrates with Slack, PagerDuty, and email for notifications.


Governance and Row-Level Security

Once your data warehouse is live and connected to analytics tools, governance is non-negotiable. Poor governance leads to data breaches, compliance failures, and teams making decisions on stale or incorrect data.

Row-Level Security (RLS) Patterns

RLS restricts query results based on user attributes (e.g., region, department, customer). Implement RLS at the warehouse level (Snowflake, BigQuery, Redshift) or in Superset, depending on your architecture.

Warehouse-level RLS (preferred):

  • Snowflake row access policies: Create policies that filter rows based on user roles or attributes. Example: sales users see only their region’s data. Policies are enforced at query time, so Superset users automatically see filtered results.
  • BigQuery row access policies: Similar to Snowflake. Use column-level policies for sensitive columns (PII, salary) and row-level policies for business logic.
  • Redshift RLS: newer Redshift releases support native row-level security policies (CREATE RLS POLICY / ATTACH RLS POLICY); on older clusters, use view-based security (views that filter rows on current_user) or application-level filtering. A view-based sketch follows this list.
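
Here is the view-based sketch referenced above; the user-to-region mapping table and group names are hypothetical:

    -- Secure view: each database user only sees rows for their mapped regions
    CREATE VIEW analytics.v_fct_sales_secure AS
    SELECT s.*
    FROM analytics.fct_sales s
    JOIN governance.user_region_map m
      ON m.region = s.region
    WHERE m.db_user = current_user;

    -- Grant BI users access to the view only, never to the underlying table
    GRANT SELECT ON analytics.v_fct_sales_secure TO GROUP bi_readers;
    REVOKE ALL ON analytics.fct_sales FROM GROUP bi_readers;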

Superset-level RLS:

If your warehouse doesn’t support RLS, apply filters in Superset:

  1. Define filter expressions: In Superset, create filters like region = get_user_region(current_user()). Superset injects these into queries at runtime.
  2. Map users to attributes: Maintain a user-attribute table (user_id, region, department, etc.). Superset joins this table to filter results.
  3. Test thoroughly: RLS bugs cause data leaks. Test all user-role combinations before deploying.

Access Control and Authentication

Warehouse-level access:

  • Create dedicated roles per team (analytics, finance, engineering). Grant minimal required permissions (SELECT on specific schemas).
  • Use network policies (Snowflake) or security groups (BigQuery, Redshift) to restrict warehouse access by IP.
  • Rotate credentials every 90 days. Use secrets management (AWS Secrets Manager, HashiCorp Vault) to automate rotation.

Superset-level access:

  • Integrate Superset with your identity provider (Okta, Azure AD, Google Workspace) via SAML or OAuth.
  • Create Superset roles matching your organisational structure (Analyst, Manager, Viewer). Assign users to roles.
  • Use Superset’s database-level and dataset-level permissions to restrict access. For example, finance users can access finance datasets but not sales data.

Data Lineage and Metadata Management

Teams need to understand where data comes from and how it’s transformed. Implement metadata management:

dbt for transformation lineage:

  • Use dbt to define all transformations (SQL models). dbt generates lineage graphs showing dependencies between models.
  • Integrate dbt with Superset: Superset auto-discovers dbt models and metrics, enabling users to explore lineage without leaving Superset.
  • Document models with dbt descriptions. Superset surfaces these descriptions in the UI.

Data cataloguing:

  • Use a data catalog (Collibra, Alation, or open-source alternatives like DataHub) to document tables, columns, and business logic.
  • Tag sensitive data (PII, financial) for governance and compliance.
  • Maintain a data dictionary mapping technical names to business terms.

Compliance and auditing:

  • Log all warehouse access (who accessed what, when). Use warehouse audit logs (Snowflake, BigQuery, Redshift all provide these).
  • For SOC 2 or ISO 27001 compliance, export audit logs to a SIEM or centralised logging system. See PADISO’s AI Agency Consultation Sydney for guidance on compliance frameworks.

Cost Control Strategies

Cloud data warehouses are pay-as-you-go. Without cost controls, bills can explode. Here are proven patterns.

Snowflake Cost Optimisation

Right-size compute clusters:

  • Size warehouses based on measured query volume and concurrency. A 4-cluster warehouse is often overkill for dev/test; a sizing sketch follows this list.
  • Enable auto-suspend (default 10 minutes). Idle warehouses cost money; auto-suspend eliminates waste.
  • Use query result caching. Snowflake caches results for 24 hours; identical queries on unchanged data return cached results instantly at no cost.
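
The sizing sketch referenced above: a small auto-suspending BI warehouse plus a resource monitor that notifies and then suspends on overspend. Names and quotas are illustrative:

    -- Small warehouse that suspends after two idle minutes
    CREATE WAREHOUSE IF NOT EXISTS bi_wh
      WAREHOUSE_SIZE      = 'SMALL'
      AUTO_SUSPEND        = 120        -- seconds of idle time before suspending
      AUTO_RESUME         = TRUE
      INITIALLY_SUSPENDED = TRUE;

    -- Monthly credit budget with alert and hard stop
    CREATE RESOURCE MONITOR IF NOT EXISTS bi_monthly_budget
      WITH CREDIT_QUOTA = 200
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND;

    ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = bi_monthly_budget;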

Optimise storage:

  • Use Snowflake’s data retention settings. Set time-travel to 1 day for dev, 7 days for prod (longer retention costs more).
  • Archive old data to S3 and query via external tables. Snowflake charges for storage; S3 is cheaper for cold data.
  • Use clustering keys to reduce data scanned. Well-clustered tables reduce storage footprint by 20–40%.

Monitor and alert:

  • Use Snowflake’s cost monitoring dashboard. Set up alerts for daily/weekly spend spikes.
  • Tag queries by department/cost centre. Snowflake’s query tags enable cost chargeback.

Typical savings: 25–40% through right-sizing and caching.

BigQuery Cost Optimisation

Use slot commitments for predictable workloads:

  • 100 slots (annual commitment) costs ~$2k/month. For workloads consistently using 100+ slots, commitments are cheaper than on-demand.
  • Monitor slot utilisation. If utilisation is <50%, downsize to save cost.

Partition and cluster tables:

  • Partition large tables by date. BigQuery only scans relevant partitions, reducing cost by 50–80%.
  • Cluster on high-cardinality columns (user_id, region). Clustering reduces data scanned by 10–30%.

Use BigQuery ML and scheduled queries:

  • Run expensive transformations as scheduled queries during off-peak hours. BigQuery’s pricing is the same, but you avoid peak-hour slot contention.
  • Pre-aggregate data into summary tables (e.g., daily sales by region). Dashboards query summaries instead of raw data, reducing cost by 90%.
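
A sketch of the pre-aggregation pattern, written as a statement you could run from a BigQuery scheduled query; table and column names are placeholders:

    -- Daily summary table the dashboards read instead of raw orders
    CREATE OR REPLACE TABLE analytics.daily_sales_by_region
    PARTITION BY sales_date AS
    SELECT
      DATE(order_ts)   AS sales_date,
      region,
      COUNT(*)         AS order_count,
      SUM(order_total) AS revenue
    FROM analytics_raw.orders
    GROUP BY sales_date, region;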

Monitor and optimise:

  • Use BigQuery’s admin console to monitor query cost. Set up cost alerts.
  • Identify expensive queries (high bytes scanned). Rewrite or cache them.

Typical savings: 30–50% through partitioning and aggregation.

Redshift Cost Optimisation

Right-size nodes:

  • RA3 nodes are more expensive upfront but offer better scaling and flexibility. For growing workloads, RA3 is cheaper long-term.
  • DC2 nodes are cheaper for fixed, predictable workloads. If you know your data size won’t change, DC2 is cost-effective.

Optimise query performance:

  • Use EXPLAIN plans and query system views (SVL_QUERY_SUMMARY, STL_ALERT_EVENT_LOG) to identify slow queries. Optimise distribution keys, sort keys, and join order.
  • Vacuum and analyse regularly. Stale statistics cause slow queries and wasted compute.
  • Use compression. Redshift’s compression codecs (LZO, ZSTD) reduce storage by 30–50%.

Archive cold data:

  • Use Redshift Spectrum to query S3 data without loading. For historical data accessed infrequently, Spectrum is cheaper than native tables.

Typical savings: 20–35% through optimisation and archival.


Migration Timeline and Execution

A successful migration requires clear phases, milestones, and risk management.

Phase 1: Discovery and Planning (Weeks 1–3)

Deliverables:

  • Current-state inventory (table count, data volume, ETL jobs, users, permissions).
  • Target-state architecture (which warehouse, compute sizing, storage strategy).
  • Risk register and mitigation plan.
  • Detailed project plan with milestones and resource requirements.

Activities:

  • Audit current data infrastructure. Interview stakeholders (analysts, engineers, data scientists) to understand pain points and requirements.
  • Benchmark current performance (query latency, ETL runtime, cost).
  • Select target warehouse (Snowflake, BigQuery, Redshift) based on workload, budget, and team expertise.
  • Plan governance and RLS strategy.

Team: 2 data engineers, 1 architect, 1 project manager. 40–60 hours total.

Phase 2: Infrastructure Setup (Weeks 4–5)

Deliverables:

  • Target warehouse provisioned and tested.
  • Network and security configured (VPCs, security groups, IAM).
  • Backup and disaster recovery plan.
  • Monitoring and alerting configured.

Activities:

  • Provision target warehouse (Snowflake, BigQuery, or Redshift).
  • Configure authentication (service accounts, IAM roles).
  • Set up network connectivity (VPN, private link, or public endpoints).
  • Configure encryption (in-transit, at-rest).
  • Set up backup and point-in-time recovery.
  • Test disaster recovery (restore from backup, failover).

Team: 2 data engineers, 1 DevOps engineer. 60–80 hours total.

Phase 3: Schema and Data Migration (Weeks 6–10)

Deliverables:

  • All tables migrated and validated.
  • ETL jobs rewritten and tested.
  • Data quality checks automated.

Activities:

  • Convert schema (DDL). Use auto-conversion tools; manually fix complex objects.
  • Implement staging layers (raw, staging, marts) in target warehouse.
  • Rewrite ETL jobs. Use dbt for transformations; orchestrate with Airflow or Prefect.
  • Load historical data. For large datasets, use parallel loading.
  • Validate data (row counts, aggregates, checksums) against source.
  • Set up data quality monitoring (dbt tests, Great Expectations).

Team: 3–4 data engineers. 200–300 hours total.

Phase 4: Analytics and BI Migration (Weeks 11–13)

Deliverables:

  • D23.io Superset connected and configured.
  • All dashboards migrated and tested.
  • Users trained and migrated to new tools.

Activities:

  • Connect Superset to target warehouse.
  • Create datasets and metrics in Superset (or auto-discover from dbt).
  • Migrate dashboards from legacy BI tool to Superset. Validate that dashboards produce same results.
  • Set up RLS and access control in Superset.
  • Train users on Superset UI and query interface.
  • Migrate users to Superset. Monitor usage and gather feedback.

Team: 2 analytics engineers, 1 BI specialist. 100–150 hours total.

Phase 5: Optimisation and Cutover (Weeks 14–16)

Deliverables:

  • Performance optimised (query latency, cost).
  • Legacy system decommissioned.
  • Post-migration runbook and support plan.

Activities:

  • Optimise slow queries. Add clustering, partitioning, or aggregation.
  • Monitor costs. Right-size compute. Set up cost alerts.
  • Run parallel validation (legacy vs. new) for 2–4 weeks. Compare query results, dashboard metrics, and user experience.
  • Cutover: stop writes to legacy system, migrate remaining users, decommission legacy infrastructure.
  • Document lessons learned and post-migration optimisations.

Team: 2 data engineers, 1 architect. 80–120 hours total.

Total effort: 480–710 hours (roughly 12–18 FTE-weeks). For a team of 3–4, expect 12–16 weeks wall-clock time.


Common Pitfalls and Remediation

We’ve seen hundreds of migrations. Here are the most common failures and how to avoid them.

Pitfall 1: Underestimating Scope

Problem: Teams count tables but miss views, stored procedures, ETL jobs, and shadow analytics. Scope creep stalls the project.

Remediation:

  • Conduct thorough discovery. Interview all stakeholders (analysts, engineers, finance, ops).
  • Audit data lineage. Use tools like dbt or a data catalog to map dependencies.
  • Build a detailed inventory: tables, views, procedures, ETL jobs, BI reports, users, permissions.
  • Add 20–30% buffer to timeline estimates. Migrations always take longer than planned.

Pitfall 2: Poor Data Quality Validation

Problem: Data is loaded but not validated. Teams discover mismatches weeks later, requiring rework.

Remediation:

  • Implement automated validation: row counts, aggregates, checksums, data type checks.
  • Run daily reconciliation queries comparing source and target for 4–8 weeks post-load.
  • Spot-check critical datasets (revenue, customer, inventory) manually.
  • Involve business users in validation. They know what data should look like.

Pitfall 3: Ignoring Governance and Security

Problem: Data is migrated but no RLS or access control. Everyone sees everything. Compliance failures result.

Remediation:

  • Design governance upfront. Who should access what data? Document access policies.
  • Implement RLS at warehouse level (Snowflake, BigQuery) or Superset level. Test all user-role combinations.
  • Set up audit logging. Export logs to SIEM for compliance.
  • For SOC 2 / ISO 27001, engage compliance early. See PADISO’s AI Agency Expertise Sydney for security frameworks.

Pitfall 4: Inadequate Testing

Problem: Queries run in new warehouse but produce different results. Teams lose confidence in data.

Remediation:

  • Run 4-week parallel validation. Execute critical queries in both old and new warehouse. Compare results.
  • Have business users validate dashboards. Do metrics match their expectations?
  • Test edge cases: null values, date boundaries, currency conversions, etc.
  • Load test: simulate peak query concurrency. Ensure performance is acceptable.

Pitfall 5: Cost Overruns

Problem: Cloud bills are 2–3x higher than expected. Teams scramble to optimise.

Remediation:

  • Monitor costs daily. Set up cost alerts and dashboards.
  • Right-size compute from day one. Don’t over-provision.
  • Implement caching and aggregation strategies early.
  • For Snowflake and BigQuery, use commitments if workload is predictable.
  • Tag queries by department for cost attribution and chargeback.

Post-Migration Optimisation

Migration is day one, not day 100. Post-migration optimisation is where you realise ROI.

Query Performance Optimisation

Identify slow queries:

  • Use warehouse query logs (Snowflake, BigQuery, Redshift all provide these).
  • Identify queries taking >30 seconds or scanning >100GB.
  • Prioritise queries run frequently or by many users.

Optimisation techniques:

  • Clustering and partitioning: Add clustering keys (Snowflake, BigQuery) or sort keys (Redshift) on frequently filtered columns.
  • Aggregation and pre-computation: Create summary tables for dashboards. Pre-aggregate daily sales by region, product, etc. Dashboards query summaries instead of raw data.
  • Caching: Enable query result caching in Superset. Cache TTL depends on data freshness requirements.
  • Indexing: Some warehouses support indexes. Use sparingly; indexes add maintenance overhead.
  • Join optimisation: Reorder joins to filter early. Join fact tables to smaller dimension tables.

Typical improvements: 2–5x faster queries, 30–60% cost reduction.

Scaling for Growth

Monitor utilisation:

  • Track query queue depth, warehouse CPU, and memory utilisation.
  • If queue depth is consistently >10, scale up compute.
  • If storage is growing >20% monthly, plan capacity.

Scale strategies:

  • Vertical scaling: Increase warehouse size (Snowflake, Redshift) or slot commitment (BigQuery). Simple but expensive.
  • Horizontal scaling: Add nodes (Redshift) or clusters (Snowflake). More complex but better for concurrency.
  • Data archival: Move cold data (>1 year old) to cheaper storage (S3, GCS). Query via external tables.

Continuous Improvement

Monthly reviews:

  • Review query performance metrics. Identify and optimise slow queries.
  • Review cost trends. Investigate spikes; adjust right-sizing.
  • Gather user feedback. Are dashboards fast? Are users finding data easily?
  • Review data quality metrics. Are dbt tests passing? Any data freshness issues?

Quarterly strategy:

  • Review governance. Are RLS policies working? Any access control issues?
  • Plan new use cases. What analytics would drive business value?
  • Invest in tools and training. dbt, Superset, data catalog—keep tools and skills current.

Getting Started with PADISO

Data warehouse migration is complex. Most organisations benefit from experienced guidance. PADISO is a Sydney-based venture studio and AI digital agency specialising in data modernisation, platform engineering, and analytics infrastructure.

We’ve helped 50+ organisations migrate to Snowflake, BigQuery, and Redshift, and we’ve implemented D23.io Superset for analytics teams across retail, real estate, agriculture, and fintech. Our approach is pragmatic: we focus on outcomes (faster queries, lower cost, better insights), not process.

What We Offer

AI Strategy & Readiness: We assess your current data infrastructure, identify pain points, and design a target-state architecture aligned with your business goals. This is critical before migration. See our AI Agency Consultation Sydney for details on how we approach strategic planning.

CTO as a Service: For organisations without a technical leader, we provide fractional CTO guidance. We oversee migration planning, architecture decisions, and team leadership. This ensures the project stays on track and technical decisions align with business strategy.

Platform Design & Engineering: We design and build data infrastructure, ETL pipelines, and analytics platforms. We use dbt for transformations, Airflow for orchestration, and Superset for analytics. We’ve built data platforms processing 10TB+ daily for enterprise clients.

Security Audit (SOC 2 / ISO 27001): Data warehouse migrations often trigger compliance requirements. We help teams achieve SOC 2 and ISO 27001 certification via Vanta. See our AI Agency Deliverables Sydney for our governance and security frameworks.

AI & Agents Automation: Beyond data warehousing, we help teams build AI-powered analytics and autonomous agents. For example, we’ve built agents that automatically detect data quality issues, optimise queries, and recommend cost-saving actions. See Agentic AI Production Horror Stories for lessons learned from real production deployments.

Our Approach

  1. Discovery (2–3 weeks): We audit your current state, interview stakeholders, and design a target architecture. We deliver a detailed project plan with timeline, budget, and risk mitigation.

  2. Execution (8–16 weeks): We execute migration in phases (infrastructure, schema, data, BI, optimisation). We pair with your team to build internal capability.

  3. Optimisation (ongoing): Post-migration, we help you optimise performance, costs, and governance. We review metrics monthly and recommend improvements.

  4. Support: We provide on-call support for 6–12 months post-launch. We’re available for urgent issues, scaling decisions, and feature requests.

Why Choose PADISO

  • Sydney-based: We understand Australian business context and compliance requirements (ASIC, OAIC, etc.).
  • Outcome-focused: We measure success by faster queries, lower costs, and business impact, not process metrics.
  • Hands-on: We pair with your team, building capability and ownership. We don’t hand off and disappear.
  • Multi-cloud expertise: We’ve migrated to Snowflake, BigQuery, and Redshift. We help you choose the right platform for your workload.
  • Full-stack: We handle strategy, architecture, engineering, and compliance. No vendor lock-in; we use open-source tools where possible.

For organisations in retail, real estate, agriculture, or fintech, check out our case studies on AI Automation for Retail, AI Automation for Real Estate, and AI Automation for Agriculture.

Next Steps

  1. Schedule a discovery call: Email us or visit PADISO to book a 30-minute conversation. We’ll discuss your current state, goals, and timeline.

  2. Share your current-state inventory: Provide details on your data warehouse (tables, volume, ETL tools, users, compliance requirements). This helps us scope the project.

  3. Agree on approach: We’ll recommend a migration pattern (lift-and-shift, phased, incremental) and provide a detailed timeline and budget.

  4. Kick off execution: Once aligned, we’ll start discovery and planning. First milestone is a detailed project plan.


Conclusion

Data warehouse migration is a strategic initiative, not a technical project. Done right, it unlocks faster insights, lower costs, and competitive advantage. Done poorly, it’s a costly distraction that delays critical business initiatives.

The patterns in this guide—full lift-and-shift, phased migration, incremental cohort-based migration—work. The key is choosing the right pattern for your context (team size, risk tolerance, timeline, budget) and executing disciplined phases with clear milestones and validation.

D23.io’s managed Superset, combined with Snowflake, BigQuery, or Redshift, provides a modern, scalable analytics platform. Add governance (RLS, audit logging), cost controls (caching, aggregation, right-sizing), and continuous optimisation, and you’ve built a data foundation that scales with your business.

If you’re planning a migration, start with discovery. Audit your current state, interview stakeholders, and design a target architecture aligned with your business goals. Then execute in disciplined phases, validate relentlessly, and optimise continuously.

PADISO has helped dozens of Australian organisations through this journey. We’re here to help you navigate the complexity, avoid common pitfalls, and realise the full ROI of your data infrastructure investment. Reach out to PADISO to discuss your migration strategy.