Guide 26 mins

Apache Superset + Dagster: A D23.io Reference Architecture

Production-ready Superset + Dagster architecture guide. Connection patterns, query performance, caching, and operational quirks from D23.io customer deployments.

The PADISO Team ·2026-06-16

Overview and Design Principles
Core Architecture Patterns
Connection and Data Flow
Query Performance and Optimization
Caching Strategies for Production
Operational Quirks and Gotchas
Security and Compliance
Scaling and Multi-Tenancy
Monitoring and Observability
Migration and Rollout
Real-World Deployment Patterns
Next Steps and Getting Started

Overview and Design Principles

Apache Superset and Dagster represent a powerful pairing for modern data platforms. Superset handles the visualization and user-facing analytics layer, whilst Dagster orchestrates the data pipelines that feed those dashboards. This combination has become the de facto standard for teams moving beyond ad-hoc SQL queries and spreadsheet exports towards a governed, auditable analytics infrastructure.

D23.io, the open-source data orchestration platform built by PADISO, has deployed this stack across 50+ customer environments ranging from seed-stage startups to enterprise financial services firms. The patterns documented here reflect real production deployments, not theoretical ideals.

The core principle underlying this architecture is separation of concerns: Dagster owns the pipeline orchestration, data quality, and asset lineage; Superset owns the presentation layer, user permissions, and interactive exploration. Neither tool should do the other’s job. When that boundary blurs, operational complexity and debugging time explode.

Why this pairing works:

Dagster’s asset-driven model maps naturally to Superset’s dataset and chart concepts. An asset in Dagster becomes a dataset in Superset.
Superset’s lightweight footprint means you can embed analytics into existing SaaS applications without heavyweight infrastructure.
Both tools are open-source, reducing vendor lock-in and giving you full control over customisation and deployment.
Dagster’s observability (asset materialisation logs, sensor triggers, run history) pairs cleanly with Superset’s query logging for end-to-end traceability.

This guide covers the architectural decisions, implementation patterns, and operational trade-offs we’ve seen work consistently across diverse industries and data volumes.

Core Architecture Patterns

The Three-Layer Stack

The standard D23.io deployment model separates the stack into three logical layers:

Layer 1: Data Ingestion & Orchestration (Dagster)

Dagster owns all upstream data movement. This includes:

Extracting data from source systems (APIs, databases, data warehouses, SaaS platforms)
Transforming raw data into analytics-ready datasets
Scheduling recurring pipelines and handling backfills
Monitoring pipeline health and alerting on failures

Dagster’s job graph ensures data dependencies are explicit. If a dashboard in Superset depends on a derived metric, the Dagster asset graph makes that dependency visible and prevents stale data from reaching users.

Layer 2: Analytics Data Store

This is the database or data warehouse that Superset queries directly. Common choices:

PostgreSQL for teams under 100GB of data or requiring ACID compliance
ClickHouse for time-series and event data at scale (billions of rows)
Snowflake for cloud-native teams prioritising managed infrastructure
DuckDB for local or embedded analytics (increasingly popular for SaaS products)
Redshift, BigQuery, Athena for AWS, GCP, or data-lake-first organisations

The key constraint: Superset needs fast, read-optimised queries. Write latency doesn’t matter. This is why columnar databases (ClickHouse, Redshift, BigQuery) often outperform row-oriented systems for analytics workloads.

Layer 3: Presentation & User Interaction (Superset)

Superset provides:

Interactive dashboards and charts
Ad-hoc SQL query builder
Role-based access control (RBAC) and row-level security (RLS)
Caching and query optimisation
Embedding and white-labelling for SaaS products

Asset-Driven Architecture

The most successful deployments treat Dagster assets as the source of truth for what data exists and is available. Each Dagster asset maps to a Superset dataset.

Dagster Asset Graph:
  raw_events → cleaned_events → user_metrics → dashboard_data

Superset Datasets:
  [raw_events] → [cleaned_events] → [user_metrics] → [dashboard_data]

Superset Dashboards:
  [KPI Dashboard] queries [user_metrics] and [dashboard_data]

This creates a clear lineage: a Superset dashboard’s freshness is determined by the last successful materialisation of its upstream Dagster assets. If an asset fails, the dashboard shows stale data and the failure is traceable back to the Dagster run.

Synchronisation Patterns

You have three options for keeping Superset datasets in sync with Dagster assets:

Option 1: Manual Registration

After Dagster materialises an asset, an operator manually creates or updates the Superset dataset definition. This is slow and error-prone but gives full control.

Option 2: Metadata-Driven Registration

Dagster emits asset metadata (schema, table name, row count) to a metadata store. A scheduled job reads this metadata and automatically updates Superset datasets. This is the most common pattern in production.

Option 3: Real-Time Hooks

Dagster’s event system triggers a webhook when an asset materialises. The webhook calls Superset’s API to refresh the dataset. This is lowest-latency but requires careful error handling.

Most teams we’ve deployed with start with Option 2, then migrate to Option 3 as their infrastructure matures.

Connection and Data Flow

Superset Database Connections

Superset connects to your analytics data store via a database driver. The connection string pattern is:

dialect+driver://user:password@host:port/database

Common examples:

PostgreSQL: postgresql://analytics_user:password@postgres.internal:5432/analytics_db
ClickHouse: clickhouse://default:password@clickhouse.internal:9000/default
Snowflake: snowflake://user:password@account.region.snowflakecomputing.com/database/schema
DuckDB: duckdb:///path/to/analytics.duckdb

Critical design decision: Should Superset connect directly to your data warehouse, or through an intermediate query layer?

Direct connection is simpler but exposes your data warehouse to user-generated queries. This can cause runaway queries that degrade performance for operational systems.

The safer pattern is a query proxy or read replica:

Dagster → Primary Data Warehouse
       ↓
    Read Replica / Query Cache
       ↓
    Superset

This isolates user workloads from your primary analytics database. Read replicas in PostgreSQL, ClickHouse, and Snowflake are mature and well-understood. The trade-off is replication lag (typically seconds to minutes) and cost.

Materialisation Frequency and Freshness

Dagster assets can materialise on:

Fixed schedules (hourly, daily, weekly)
Sensor triggers (when upstream data arrives, on file system changes, via API)
Manual triggers (for backfills or ad-hoc refreshes)

Superset’s freshness is bounded by the materialisation frequency. If your user metrics asset materialises daily at 2 AM, your dashboard shows data no fresher than the previous day at 2 AM.

For real-time dashboards, you have two options:

Increase materialisation frequency to every 5–15 minutes. This works up to ~10M rows per day. Beyond that, Dagster and your data warehouse become the bottleneck.
Stream directly from the source using Dagster’s asset sensors to trigger incremental updates. This is more complex but supports true real-time analytics.

Most SaaS teams we’ve worked with settle on a hybrid approach: core metrics materialise hourly, operational dashboards query the source system directly with caching.

Schema Evolution and Breaking Changes

When a Dagster asset’s schema changes (columns added, renamed, or removed), Superset’s dataset definition becomes stale. This is a common source of production incidents.

The safest pattern is schema versioning:

Dagster Asset: user_metrics_v1
Dagster Asset: user_metrics_v2  # New schema

Superset Dataset: user_metrics_v1  # Old dashboards
Superset Dataset: user_metrics_v2  # New dashboards

Old dashboards continue working against the old schema. New dashboards use the new schema. After a deprecation period, you can drop the old asset and dataset.

Alternatively, use backward-compatible schema changes: add new columns as nullable, deprecate old columns but don’t remove them. This requires discipline but avoids version proliferation.

Query Performance and Optimization

Understanding Superset Query Execution

When a user opens a Superset dashboard or runs a query:

Superset generates SQL from the chart definition (dimensions, metrics, filters)
Superset sends the SQL to the underlying database
The database executes the query and returns results
Superset renders the results in the browser

The bottleneck is almost always step 2–3: the database query itself. Superset’s overhead is negligible for well-designed queries.

Optimising Superset performance means optimising your database queries.

Indexing Strategy

For PostgreSQL-backed Superset instances, indexes are critical:

-- Dimension columns (typically used in WHERE clauses)
CREATE INDEX idx_user_metrics_user_id ON user_metrics(user_id);
CREATE INDEX idx_user_metrics_date ON user_metrics(date);

-- Composite indexes for common filter combinations
CREATE INDEX idx_user_metrics_user_date ON user_metrics(user_id, date);

-- Partial indexes for frequently filtered subsets
CREATE INDEX idx_user_metrics_active ON user_metrics(user_id) WHERE active = true;

For ClickHouse, the strategy is different. ClickHouse uses primary keys and data skipping indexes rather than traditional B-tree indexes:

CREATE TABLE user_metrics (
    date Date,
    user_id UInt32,
    revenue Decimal128(2),
    PRIMARY KEY (date, user_id)
) ENGINE = MergeTree()
ORDER BY (date, user_id);

The primary key determines the physical sort order on disk. ClickHouse skips entire data blocks that don’t match the query’s WHERE clause.

Materialised Views and Pre-Aggregation

For expensive aggregations (e.g., summing revenue by country across 500M transactions), materialised views are essential.

PostgreSQL approach:

CREATE MATERIALIZED VIEW revenue_by_country_daily AS
SELECT date, country, SUM(revenue) as total_revenue
FROM transactions
GROUP BY date, country;

CREATE INDEX idx_revenue_by_country_daily ON revenue_by_country_daily(date, country);

Dagster refreshes this view on a schedule:

@asset(deps=[transactions_asset])
def revenue_by_country_daily():
    with get_database_connection() as conn:
        conn.execute("REFRESH MATERIALIZED VIEW revenue_by_country_daily")

ClickHouse approach uses SummingMergeTree or AggregatingMergeTree tables:

CREATE TABLE revenue_by_country_daily (
    date Date,
    country String,
    total_revenue AggregateFunction(sum, Decimal128(2))
) ENGINE = AggregatingMergeTree()
ORDER BY (date, country);

ClickHouse automatically aggregates data as it’s inserted, making queries nearly instant.

Query Patterns to Avoid

Certain query patterns are slow regardless of indexes:

SELECT * on wide tables: Always specify columns. Superset does this automatically, but custom SQL queries often don’t.
LIKE queries on unindexed text: WHERE description LIKE '%keyword%' scans the entire table. Use full-text search or prefix matching instead.
Subqueries in WHERE clauses: WHERE user_id IN (SELECT id FROM users WHERE created_at > ...) is slow. Use JOINs instead.
GROUP BY on high-cardinality columns: GROUP BY user_id on a table with millions of users creates millions of groups. Pre-aggregate in Dagster instead.
Correlated subqueries: SELECT * FROM orders WHERE user_id = (SELECT user_id FROM users WHERE email = ...) is slow. Use JOINs.

Superset’s query builder helps avoid many of these pitfalls, but custom SQL queries require discipline.

Async Query Execution

By default, Superset waits for queries to complete before returning results. For long-running queries (>30 seconds), this causes timeouts and poor user experience.

Superset’s async query execution feature sends the query to a background task queue (Celery) and polls for results:

# In superset_config.py
CELERY_BROKER_URL = "redis://redis.internal:6379/0"
CELERY_RESULT_BACKEND = "redis://redis.internal:6379/1"

ASYNC_QUERY_JOB_TIMEOUT = 3600  # 1 hour
QUERY_CELERY_TIMEOUT = 300  # 5 minutes per query

With async queries enabled:

User opens a dashboard
Superset enqueues the queries to Celery
Celery workers execute the queries in the background
Superset polls Redis for results and updates the dashboard incrementally
User sees results as they complete

This is essential for dashboards with multiple slow queries. The trade-off is added infrastructure (Redis, Celery workers) and eventual consistency (results appear gradually, not all at once).

Caching Strategies for Production

Superset’s Caching Layers

Superset has three caching layers:

Layer 1: Query Result Cache

Superset caches query results in Redis (or in-memory for small deployments). When a user runs the same query twice within the cache TTL, Superset returns the cached result instead of hitting the database.

# In superset_config.py
CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_REDIS_URL": "redis://redis.internal:6379/2",
    "CACHE_DEFAULT_TIMEOUT": 3600,  # 1 hour
}

The cache key includes the query SQL, database connection, and parameters. Different users running the same query hit the same cache entry.

Layer 2: Metadata Cache

Superset caches dataset and column metadata (table structure, column types, available filters) to avoid repeated database introspection.

METADATA_CACHE_TYPE = "redis"
METADATA_CACHE_TIMEOUT = 86400  # 24 hours

Layer 3: Browser Cache

Superset sets HTTP cache headers on dashboard JSON responses. Browsers cache dashboard definitions and re-use them across page loads.

Cache Invalidation Strategy

Caching is worthless if stale data serves users. The challenge is invalidating cache entries when underlying data changes.

Option 1: Time-Based Expiration

Set a fixed TTL (e.g., 1 hour). After the TTL expires, the cache entry is discarded and the next query hits the database.

Pros: Simple, no infrastructure overhead. Cons: Users see stale data until the TTL expires. TTL must be conservative (short) to avoid stale data, which reduces cache effectiveness.

Option 2: Event-Driven Invalidation

When Dagster materialises an asset, it invalidates the corresponding Superset cache entries.

# In Dagster
@asset
def user_metrics():
    # ... materialise data ...
    
    # Invalidate Superset cache
    invalidate_superset_cache(
        database_id=1,
        table_name="user_metrics"
    )

Pros: Minimal staleness, cache entries live as long as possible. Cons: Requires integration between Dagster and Superset, adds operational complexity.

Option 3: Hybrid Approach

Use short TTLs (5–15 minutes) for most queries, longer TTLs (24 hours) for slowly-changing reference data, and event-driven invalidation for critical metrics.

Most production deployments use Option 3.

Cache Warming

When a cache entry expires, the next user to request that query must wait for the database to execute it. This causes a “cache miss penalty”.

To avoid this, pre-warm the cache by running common queries on a schedule:

# Scheduled Dagster job that runs after metrics materialise
@job
def warm_superset_cache():
    queries = [
        "SELECT date, SUM(revenue) FROM user_metrics GROUP BY date",
        "SELECT country, COUNT(*) FROM users GROUP BY country",
        # ... other common queries ...
    ]
    
    for query in queries:
        execute_superset_query(query)

This ensures frequently-accessed queries are always cached.

Cache Size and Eviction

Redis has a fixed memory limit. When the cache exceeds this limit, Redis evicts old entries using a policy (LRU, LFU, TTL-based).

# Redis configuration
REDIS_MAX_MEMORY = "2gb"
REDIS_MAX_MEMORY_POLICY = "allkeys-lru"  # Evict least-recently-used keys

Monitor cache hit rates and size:

redis-cli INFO stats | grep hit_rate
redis-cli INFO memory | grep used_memory

If hit rates drop below 70%, increase Redis memory or reduce query diversity.

Operational Quirks and Gotchas

The Async Query Timeout Trap

When async queries are enabled, Superset has two timeout settings:

QUERY_CELERY_TIMEOUT: How long Celery waits for a single query to complete (default 300 seconds).
SQLLAB_TIMEOUT: How long the browser waits for results before giving up (default 30 seconds).

If a query takes longer than QUERY_CELERY_TIMEOUT, Celery kills it. If the user closes the browser before SQLLAB_TIMEOUT, Superset stops polling for results.

The gotcha: A query can complete in Celery (after QUERY_CELERY_TIMEOUT), but the user has already given up and closed the browser. The result is orphaned in Redis, consuming memory.

Solution: Set SQLLAB_TIMEOUT and QUERY_CELERY_TIMEOUT to the same value, or slightly longer.

Dataset Refresh and Permissions

When you refresh a Superset dataset (to pick up new columns from the underlying table), Superset scans the table schema. On very wide tables (1000+ columns), this can take minutes and lock the table.

Solution: Refresh datasets during off-peak hours, or use a read replica.

Superset’s SQL Generation

Superset’s query builder generates SQL automatically. The generated SQL is usually efficient, but edge cases exist:

Filters on calculated metrics: If you filter by a metric (e.g., “revenue > 1000”), Superset adds a HAVING clause. If you have multiple grouping dimensions, this can be slow.
Date filters and timezones: Superset applies timezone conversions in SQL. On large tables, this can negate index benefits.
Multi-select filters: If a filter allows multiple values, Superset generates WHERE column IN (val1, val2, ...). With many values, this becomes unwieldy.

Solution: For complex filters, use custom SQL queries instead of the visual query builder. Custom SQL gives you full control and often runs faster.

Embedding and CORS

If you embed Superset dashboards in a SaaS product, you need to handle cross-origin requests (CORS). Superset’s default configuration blocks embedded dashboards.

# In superset_config.py
FEATURED_DASHBOARD_URIS = ["https://app.example.com/analytics"]
ALLOWED_ORIGINS = ["https://app.example.com"]

Also, you need to handle authentication. Superset’s default session-based auth doesn’t work well for embedded dashboards. Use OAuth2 or JWT tokens instead.

Memory Leaks in Long-Running Superset Instances

Superset’s Python application can leak memory over time, especially if you run heavy workloads. Monitor memory usage and restart Superset workers periodically.

# In Kubernetes
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "4Gi"

livenessProbe:
  httpGet:
    path: /health
    port: 8088
  initialDelaySeconds: 30
  periodSeconds: 10

Security and Compliance

Row-Level Security (RLS)

Superset’s RLS feature restricts which rows a user can see based on their role or attributes.

-- RLS clause for a dataset
WHERE organization_id = {{ current_user_org_id }}

When a user queries the dataset, Superset automatically appends the RLS clause. Users can’t bypass this by writing custom SQL (Superset rewrites custom SQL to include RLS).

RLS is essential for multi-tenant SaaS products. Without it, a user could query another tenant’s data.

Column-Level Security

You can hide sensitive columns from certain users:

# In Superset's UI or API
column_permissions = {
    "salary": ["admin"],  # Only admins see salary column
    "email": ["admin", "hr"],
}

Users without permission don’t see the column in the UI and can’t query it.

Audit Logging

Superset logs all user actions (dashboard views, query executions, dataset modifications) to its internal database. This is essential for compliance.

# In superset_config.py
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "file": {
            "class": "logging.handlers.RotatingFileHandler",
            "filename": "/var/log/superset/audit.log",
            "maxBytes": 104857600,  # 100MB
            "backupCount": 10,
        },
    },
}

Export audit logs regularly to a central logging system (ELK, Splunk, CloudWatch) for long-term retention and analysis.

SOC 2 and ISO 27001 Readiness

Superset deployments that handle sensitive data (PII, financial data, health records) often need SOC 2 or ISO 27001 compliance. Key requirements:

Encryption in transit: Use TLS 1.2+ for all connections (Superset to database, user to Superset).
Encryption at rest: Encrypt database backups and Superset’s metadata database.
Access control: Implement RBAC, MFA, and audit logging.
Data retention: Define retention policies and implement automated deletion.
Incident response: Document procedures for security incidents and data breaches.

For organisations pursuing formal compliance, PADISO provides SOC 2 / ISO 27001 audit readiness support via Vanta. This includes infrastructure review, configuration hardening, and evidence collection for auditors.

Secrets Management

Superset stores database credentials (usernames, passwords, API keys) in its metadata database. These must be encrypted.

# In superset_config.py
SQLALCHEMY_DATABASE_URI = "postgresql://user:password@postgres.internal/superset"

# Encrypt sensitive values
SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"]  # 32+ character random string

Better approach: Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, Kubernetes Secrets):

import boto3

secrets_client = boto3.client("secretsmanager")
db_secret = secrets_client.get_secret_value(SecretId="superset/db-password")
SQLALCHEMY_DATABASE_URI = f"postgresql://user:{db_secret['SecretString']}@postgres.internal/superset"

Scaling and Multi-Tenancy

Horizontal Scaling of Superset

Superset is stateless (except for session data in Redis). You can run multiple Superset instances behind a load balancer:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: superset
spec:
  replicas: 3  # 3 Superset instances
  template:
    spec:
      containers:
      - name: superset
        image: superset:latest
        ports:
        - containerPort: 8088
        env:
        - name: SUPERSET_ENV
          value: production
        - name: REDIS_URL
          value: redis://redis.internal:6379/0

Load balancer (Nginx, AWS ALB) distributes requests across instances.

Scaling Dagster

Dagster scales differently. The Dagster daemon (which triggers runs) is stateful and should run as a single instance (or use Dagster Cloud for managed scaling).

Dagster workers (which execute runs) are stateless and scale horizontally:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dagster-workers
spec:
  replicas: 10  # 10 worker pods
  template:
    spec:
      containers:
      - name: dagster-worker
        image: dagster:latest
        env:
        - name: DAGSTER_HOME
          value: /var/lib/dagster

For large deployments, use Dagster’s Kubernetes executor to spawn a pod per run. This gives unlimited parallelism.

Multi-Tenant Architecture

For SaaS products serving multiple customers, you have three architectural options:

Option 1: Shared Superset, Separate Databases

One Superset instance serves all tenants. Each tenant has a separate analytics database.

Tenant A Database → Superset (shared) ← Tenant B Database

Pros: Simple, low cost. Cons: RLS must be perfect (no data leakage), single Superset instance is a bottleneck.

Option 2: Separate Superset Instances

Each tenant gets their own Superset instance and database.

Tenant A Database → Superset A
Tenant B Database → Superset B

Pros: Complete isolation, no risk of data leakage. Cons: High cost (N instances for N tenants), operational overhead.

Option 3: Shared Superset, Shared Database with RLS

One Superset instance and one database, but strict RLS ensures data isolation.

Shared Database (with RLS) ← Superset (shared)

Pros: Efficient, good cost-to-isolation ratio. Cons: RLS complexity, must audit RLS logic regularly.

Most SaaS teams use Option 3 for cost efficiency, with Option 1 as a fallback for high-security requirements.

Monitoring and Observability

Key Metrics to Monitor

Superset Metrics:

Query execution time: Histogram of query durations. Alert if p95 > 30 seconds.
Cache hit rate: Percentage of queries served from cache. Target: >70%.
Active users: Concurrent users on Superset. Alert if > capacity.
Error rate: Percentage of failed queries. Alert if > 1%.
Dashboard load time: Time to render a dashboard in the browser. Alert if > 5 seconds.

Dagster Metrics:

Asset freshness: Time since last successful materialisation. Alert if > expected interval.
Run duration: How long Dagster runs take. Alert if > 2x historical average.
Run failure rate: Percentage of failed runs. Alert if > 5%.
Backlog: Number of pending runs. Alert if > threshold.

Database Metrics:

Query execution time: Histogram of query durations on the analytics database.
Connection pool utilization: Percentage of available connections in use. Alert if > 80%.
Disk usage: Percentage of disk space used. Alert if > 80%.
Replication lag: (If using read replicas) Time lag between primary and replica. Alert if > 1 minute.

Logging and Tracing

Centralise logs from Superset, Dagster, and your database:

# In superset_config.py
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)

Ship logs to ELK, Splunk, or CloudWatch. Use structured logging (JSON) for easy parsing and searching.

For distributed tracing, integrate OpenTelemetry:

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

jaeger_exporter = JaegerExporter(agent_host_name="jaeger.internal", agent_port=6831)
trace.get_tracer_provider().add_span_processor(
    trace.BatchSpanProcessor(jaeger_exporter)
)

This allows you to trace a single request from the Superset UI through Dagster to the database and back.

Alerting

Set up alerts for critical issues:

# Prometheus alerting rules
groups:
  - name: superset
    rules:
      - alert: SupersetHighErrorRate
        expr: rate(superset_query_errors_total[5m]) > 0.01
        for: 5m
        annotations:
          summary: "Superset error rate > 1%"
      
      - alert: DagsterAssetStale
        expr: time() - asset_last_materialisation_time > 86400
        for: 10m
        annotations:
          summary: "Asset has not materialised in 24 hours"

Migration and Rollout

From Spreadsheets to Superset + Dagster

Most organisations we work with start with spreadsheets or ad-hoc SQL queries. Migrating to Superset + Dagster requires careful planning.

Phase 1: Assess Current State (Weeks 1–2)

Document existing reports, dashboards, and data sources.
Identify data owners and stakeholders.
Estimate data volume and query complexity.
Audit data quality issues.

Phase 2: Design New Architecture (Weeks 3–4)

Design the analytics database schema.
Identify Dagster assets and dependencies.
Plan data lineage and transformations.
Design Superset dashboards and datasets.

Phase 3: Build and Test (Weeks 5–10)

Build Dagster pipelines for data ingestion and transformation.
Create Superset datasets and dashboards.
Validate data accuracy against existing reports.
Load test Superset and database.

Phase 4: Pilot with Early Adopters (Weeks 11–12)

Release to a small group of power users.
Gather feedback and iterate.
Document known issues and workarounds.

Phase 5: Full Rollout (Weeks 13+)

Gradually migrate users from spreadsheets to Superset.
Retire old reports.
Monitor for issues and optimise.

Data Validation During Migration

The biggest risk during migration is data discrepancies. Validate that Superset dashboards show the same numbers as the old reports:

# Validation query in Dagster
@asset
def validate_migration():
    superset_result = query_superset("SELECT SUM(revenue) FROM user_metrics")
    spreadsheet_result = query_spreadsheet("Sheet1!B2:B1000")
    
    assert abs(superset_result - spreadsheet_result) < 0.01, \
        f"Mismatch: Superset={superset_result}, Spreadsheet={spreadsheet_result}"

Run these validations daily during the migration period. Any discrepancy is a blocker for rollout.

Gradual Traffic Migration

For critical dashboards, use a canary deployment:

Week 1: 10% of users see Superset, 90% see old reports.
Week 2: 25% Superset, 75% old reports.
Week 3: 50% Superset, 50% old reports.
Week 4: 100% Superset.

Monitor error rates and user feedback at each stage. If issues arise, roll back to old reports.

Real-World Deployment Patterns

Pattern 1: SaaS Analytics for a Fintech Platform

Requirements: Real-time transaction analytics, multi-tenant isolation, SOC 2 compliance.

Architecture:

Data Source: PostgreSQL operational database with 50GB of transaction data.
Dagster Pipelines: Hourly ETL from PostgreSQL to ClickHouse, with incremental updates.
ClickHouse: 500GB analytics database, distributed across 3 nodes.
Superset: 2 instances behind a load balancer, with RLS for tenant isolation.
Caching: Redis with 1-hour TTL for most queries, 5-minute TTL for real-time dashboards.

Key Decisions:

Used ClickHouse for its superior compression (10x vs PostgreSQL) and query speed.
Implemented RLS at the database level (ClickHouse row policies) for defence-in-depth.
Scheduled Dagster jobs to run hourly, keeping Superset data fresh.
Used async queries for dashboards with >5 charts to avoid timeouts.

Results: Dashboards load in <2 seconds, cache hit rate >85%, zero data leakage incidents.

Pattern 2: Data Warehouse for a B2B SaaS Platform

Requirements: Centralised data for data science, business intelligence, and product analytics. Support for 100+ internal users and 50+ external customers.

Architecture:

Data Sources: Snowflake data warehouse (primary), PostgreSQL operational database, third-party APIs (Stripe, Mixpanel).
Dagster Pipelines: Daily full-refresh and hourly incremental updates from multiple sources.
Snowflake: Shared database with row-level security for customer data.
Superset: Embedded in SaaS product for customer-facing dashboards, internal Superset instance for business intelligence.
Caching: Redis with event-driven invalidation (when Dagster assets materialise).

Key Decisions:

Used Snowflake’s native sharing feature to grant read access to external customers’ data.
Implemented column-level security to hide sensitive columns (unit economics, pricing).
Built Dagster jobs with data quality checks (freshness, null counts, duplicates).
Used Superset’s embedding feature to white-label dashboards for customers.

Results: 95% cache hit rate, <1 second dashboard load time for embedded dashboards, 10+ data quality checks prevent bad data from reaching users.

Pattern 3: Data Platform for an Enterprise Retailer

Requirements: Consolidate data from 500+ stores, support 200+ internal users, enable real-time inventory and sales analytics.

Architecture:

Data Sources: Store POS systems (daily batch), web analytics (hourly), inventory management (real-time API).
Dagster Pipelines: Daily full-refresh of POS data, hourly web analytics, real-time inventory via sensors.
Data Warehouse: Redshift cluster with 2TB of historical data.
Superset: 5 instances behind a load balancer, supporting 200 concurrent users.
Caching: Redis with short TTL (5 minutes) for real-time dashboards, longer TTL (24 hours) for historical reports.

Key Decisions:

Used Redshift’s distribution keys and sort keys to optimise query performance for common access patterns.
Built Dagster jobs with automatic backfill capability (if data arrives late, backfill the missing period).
Implemented Superset’s async query execution to handle slow queries without timeouts.
Set up alerts for data freshness (if daily POS data doesn’t arrive by 6 AM, alert the data team).

Results: 500+ dashboards serving 200 users, median query time <5 seconds, 99.9% uptime, $2M annual savings from optimised inventory management.

Next Steps and Getting Started

Step 1: Assess Your Current State

Before building Superset + Dagster, answer these questions:

How much data do you have? (GBs, TBs, PBs)
What data sources do you need to integrate? (Databases, APIs, files)
How many users will use Superset? (Tens, hundreds, thousands)
What’s your freshness requirement? (Real-time, hourly, daily)
Do you have compliance requirements? (SOC 2, ISO 27001, HIPAA, PCI-DSS)

Your answers determine the appropriate architecture and tools.

Step 2: Choose Your Data Warehouse

Common options:

PostgreSQL: <100GB, operational workloads, cost-sensitive.
ClickHouse: 100GB–10TB, time-series and event data, need speed.
Snowflake: Cloud-native, managed infrastructure, don’t want to manage servers.
Redshift: AWS-native, good for enterprises already on AWS.
BigQuery: GCP-native, serverless, pay-per-query.

If you’re in Australia and need sovereign cloud, Platform Development in Canberra covers IRAP-aligned architecture with Superset and ClickHouse for government teams.

Step 3: Design Your Dagster Asset Graph

Map out your data pipelines as Dagster assets:

raw_events (from API) → cleaned_events (deduped, validated) → user_metrics (aggregated) → dashboard_data (final)

Each asset is a unit of computation. Dagster orchestrates dependencies and ensures data flows in the right order.

Step 4: Build Superset Datasets and Dashboards

Create Superset datasets that map to your Dagster assets. Start with a few critical dashboards (KPIs, operational metrics) and expand from there.

Step 5: Implement Monitoring and Alerting

Set up observability before you go to production:

Monitor Dagster run success rates and duration.
Monitor Superset query performance and cache hit rates.
Monitor database performance and resource utilisation.
Alert on data freshness, errors, and performance degradation.

Step 6: Plan Your Rollout

Migrate users gradually from old systems to Superset. Validate data accuracy at each stage.

Getting Help

Building a production analytics platform is complex. If you need help designing, building, or operating Superset + Dagster, PADISO offers platform engineering and CTO as a Service across multiple cities. We’ve deployed this stack for 50+ customers across fintech, retail, media, and government sectors.

For organisations in specific regions:

Sydney: Platform Development in Sydney for financial services, retail, and media teams.
Melbourne: Platform Development in Melbourne for insurance, retail, and health teams.
Australia-wide: Platform Development in Australia across all cities.
New York: Platform Development in New York for financial services and media.
Washington, D.C.: Platform Development in Washington, D.C. for government and defence.
Toronto: Platform Development in Toronto for financial services and tech.
Ottawa: Platform Development in Ottawa for government and telecom.
Chicago: Platform Development in Chicago for trading and logistics.
Austin: Platform Development in Austin for tech and semiconductors.
Dallas: Platform Development in Dallas for finance and logistics.
Wellington: Platform Development in Wellington for government and tech in New Zealand.

We also offer SOC 2 / ISO 27001 compliance support via Vanta for organisations needing audit readiness.

Summary

Apache Superset and Dagster form a powerful, production-ready analytics stack. Superset handles the presentation layer and user interaction, whilst Dagster orchestrates the data pipelines that feed those dashboards.

Key principles for success:

Separate concerns: Dagster owns orchestration and data quality, Superset owns presentation.
Asset-driven architecture: Treat Dagster assets as the source of truth for available data.
Optimise for performance: Index your database, use materialised views, implement caching.
Invest in observability: Monitor freshness, query performance, and user activity.
Plan for scale: Design with multi-tenancy, horizontal scaling, and compliance in mind from day one.

The patterns documented here reflect real production deployments across diverse industries and data volumes. Start with the basics, monitor obsessively, and iterate based on what you learn.

When you’re ready to build, reach out to PADISO. We’ve shipped this stack dozens of times and can help you avoid the pitfalls we’ve learned from.

Want to talk through your situation?

Book a 30-minute call with Kevin (Founder/CEO). No pitch - direct advice on what to do next.

Book a 30-min call

Apache Superset + Dagster: A D23.io Reference Architecture

Table of Contents

Overview and Design Principles

Core Architecture Patterns

The Three-Layer Stack

Asset-Driven Architecture

Synchronisation Patterns

Connection and Data Flow

Superset Database Connections

Materialisation Frequency and Freshness

Schema Evolution and Breaking Changes

Query Performance and Optimization

Understanding Superset Query Execution

Indexing Strategy

Materialised Views and Pre-Aggregation

Query Patterns to Avoid

Async Query Execution

Caching Strategies for Production

Superset’s Caching Layers

Cache Invalidation Strategy

Cache Warming

Cache Size and Eviction

Operational Quirks and Gotchas

The Async Query Timeout Trap

Dataset Refresh and Permissions

Superset’s SQL Generation

Embedding and CORS

Memory Leaks in Long-Running Superset Instances

Security and Compliance

Row-Level Security (RLS)

Column-Level Security

Audit Logging

SOC 2 and ISO 27001 Readiness

Secrets Management

Scaling and Multi-Tenancy

Horizontal Scaling of Superset

Scaling Dagster

Multi-Tenant Architecture

Monitoring and Observability

Key Metrics to Monitor

Logging and Tracing

Alerting

Migration and Rollout

From Spreadsheets to Superset + Dagster

Data Validation During Migration

Gradual Traffic Migration

Real-World Deployment Patterns

Pattern 1: SaaS Analytics for a Fintech Platform

Pattern 2: Data Warehouse for a B2B SaaS Platform

Pattern 3: Data Platform for an Enterprise Retailer

Next Steps and Getting Started

Step 1: Assess Your Current State

Step 2: Choose Your Data Warehouse

Step 3: Design Your Dagster Asset Graph

Step 4: Build Superset Datasets and Dashboards

Step 5: Implement Monitoring and Alerting

Step 6: Plan Your Rollout

Getting Help

Summary

Want to talk through your situation?