Table of Contents
- Overview and Design Principles
- Core Architecture Patterns
- Connection and Data Flow
- Query Performance and Optimization
- Caching Strategies for Production
- Operational Quirks and Gotchas
- Security and Compliance
- Scaling and Multi-Tenancy
- Monitoring and Observability
- Migration and Rollout
- Real-World Deployment Patterns
- Next Steps and Getting Started
Overview and Design Principles
Apache Superset and Dagster represent a powerful pairing for modern data platforms. Superset handles the visualization and user-facing analytics layer, whilst Dagster orchestrates the data pipelines that feed those dashboards. This combination has become the de facto standard for teams moving beyond ad-hoc SQL queries and spreadsheet exports towards a governed, auditable analytics infrastructure.
D23.io, the open-source data orchestration platform built by PADISO, has deployed this stack across 50+ customer environments ranging from seed-stage startups to enterprise financial services firms. The patterns documented here reflect real production deployments, not theoretical ideals.
The core principle underlying this architecture is separation of concerns: Dagster owns the pipeline orchestration, data quality, and asset lineage; Superset owns the presentation layer, user permissions, and interactive exploration. Neither tool should do the other’s job. When that boundary blurs, operational complexity and debugging time explode.
Why this pairing works:
- Dagster’s asset-driven model maps naturally to Superset’s dataset and chart concepts. An asset in Dagster becomes a dataset in Superset.
- Superset’s lightweight footprint means you can embed analytics into existing SaaS applications without heavyweight infrastructure.
- Both tools are open-source, reducing vendor lock-in and giving you full control over customisation and deployment.
- Dagster’s observability (asset materialisation logs, sensor triggers, run history) pairs cleanly with Superset’s query logging for end-to-end traceability.
This guide covers the architectural decisions, implementation patterns, and operational trade-offs we’ve seen work consistently across diverse industries and data volumes.
Core Architecture Patterns
The Three-Layer Stack
The standard D23.io deployment model separates the stack into three logical layers:
Layer 1: Data Ingestion & Orchestration (Dagster)
Dagster owns all upstream data movement. This includes:
- Extracting data from source systems (APIs, databases, data warehouses, SaaS platforms)
- Transforming raw data into analytics-ready datasets
- Scheduling recurring pipelines and handling backfills
- Monitoring pipeline health and alerting on failures
Dagster’s job graph ensures data dependencies are explicit. If a dashboard in Superset depends on a derived metric, the Dagster asset graph makes that dependency visible and prevents stale data from reaching users.
Layer 2: Analytics Data Store
This is the database or data warehouse that Superset queries directly. Common choices:
- PostgreSQL for teams under 100GB of data or requiring ACID compliance
- ClickHouse for time-series and event data at scale (billions of rows)
- Snowflake for cloud-native teams prioritising managed infrastructure
- DuckDB for local or embedded analytics (increasingly popular for SaaS products)
- Redshift, BigQuery, Athena for AWS, GCP, or data-lake-first organisations
The key constraint: Superset needs fast, read-optimised queries. Write latency doesn’t matter. This is why columnar databases (ClickHouse, Redshift, BigQuery) often outperform row-oriented systems for analytics workloads.
Layer 3: Presentation & User Interaction (Superset)
Superset provides:
- Interactive dashboards and charts
- Ad-hoc SQL query builder
- Role-based access control (RBAC) and row-level security (RLS)
- Caching and query optimisation
- Embedding and white-labelling for SaaS products
Asset-Driven Architecture
The most successful deployments treat Dagster assets as the source of truth for what data exists and is available. Each Dagster asset maps to a Superset dataset.
Dagster Asset Graph:
raw_events → cleaned_events → user_metrics → dashboard_data
Superset Datasets:
[raw_events] → [cleaned_events] → [user_metrics] → [dashboard_data]
Superset Dashboards:
[KPI Dashboard] queries [user_metrics] and [dashboard_data]
This creates a clear lineage: a Superset dashboard’s freshness is determined by the last successful materialisation of its upstream Dagster assets. If an asset fails, the dashboard shows stale data and the failure is traceable back to the Dagster run.
Synchronisation Patterns
You have three options for keeping Superset datasets in sync with Dagster assets:
Option 1: Manual Registration
After Dagster materialises an asset, an operator manually creates or updates the Superset dataset definition. This is slow and error-prone but gives full control.
Option 2: Metadata-Driven Registration
Dagster emits asset metadata (schema, table name, row count) to a metadata store. A scheduled job reads this metadata and automatically updates Superset datasets. This is the most common pattern in production.
Option 3: Real-Time Hooks
Dagster’s event system triggers a webhook when an asset materialises. The webhook calls Superset’s API to refresh the dataset. This is lowest-latency but requires careful error handling.
Most teams we’ve deployed with start with Option 2, then migrate to Option 3 as their infrastructure matures.
Connection and Data Flow
Superset Database Connections
Superset connects to your analytics data store via a database driver. The connection string pattern is:
dialect+driver://user:password@host:port/database
Common examples:
- PostgreSQL:
postgresql://analytics_user:password@postgres.internal:5432/analytics_db - ClickHouse:
clickhouse://default:password@clickhouse.internal:9000/default - Snowflake:
snowflake://user:password@account.region.snowflakecomputing.com/database/schema - DuckDB:
duckdb:///path/to/analytics.duckdb
Critical design decision: Should Superset connect directly to your data warehouse, or through an intermediate query layer?
Direct connection is simpler but exposes your data warehouse to user-generated queries. This can cause runaway queries that degrade performance for operational systems.
The safer pattern is a query proxy or read replica:
Dagster → Primary Data Warehouse
↓
Read Replica / Query Cache
↓
Superset
This isolates user workloads from your primary analytics database. Read replicas in PostgreSQL, ClickHouse, and Snowflake are mature and well-understood. The trade-off is replication lag (typically seconds to minutes) and cost.
Materialisation Frequency and Freshness
Dagster assets can materialise on:
- Fixed schedules (hourly, daily, weekly)
- Sensor triggers (when upstream data arrives, on file system changes, via API)
- Manual triggers (for backfills or ad-hoc refreshes)
Superset’s freshness is bounded by the materialisation frequency. If your user metrics asset materialises daily at 2 AM, your dashboard shows data no fresher than the previous day at 2 AM.
For real-time dashboards, you have two options:
-
Increase materialisation frequency to every 5–15 minutes. This works up to ~10M rows per day. Beyond that, Dagster and your data warehouse become the bottleneck.
-
Stream directly from the source using Dagster’s asset sensors to trigger incremental updates. This is more complex but supports true real-time analytics.
Most SaaS teams we’ve worked with settle on a hybrid approach: core metrics materialise hourly, operational dashboards query the source system directly with caching.
Schema Evolution and Breaking Changes
When a Dagster asset’s schema changes (columns added, renamed, or removed), Superset’s dataset definition becomes stale. This is a common source of production incidents.
The safest pattern is schema versioning:
Dagster Asset: user_metrics_v1
Dagster Asset: user_metrics_v2 # New schema
Superset Dataset: user_metrics_v1 # Old dashboards
Superset Dataset: user_metrics_v2 # New dashboards
Old dashboards continue working against the old schema. New dashboards use the new schema. After a deprecation period, you can drop the old asset and dataset.
Alternatively, use backward-compatible schema changes: add new columns as nullable, deprecate old columns but don’t remove them. This requires discipline but avoids version proliferation.
Query Performance and Optimization
Understanding Superset Query Execution
When a user opens a Superset dashboard or runs a query:
- Superset generates SQL from the chart definition (dimensions, metrics, filters)
- Superset sends the SQL to the underlying database
- The database executes the query and returns results
- Superset renders the results in the browser
The bottleneck is almost always step 2–3: the database query itself. Superset’s overhead is negligible for well-designed queries.
Optimising Superset performance means optimising your database queries.
Indexing Strategy
For PostgreSQL-backed Superset instances, indexes are critical:
-- Dimension columns (typically used in WHERE clauses)
CREATE INDEX idx_user_metrics_user_id ON user_metrics(user_id);
CREATE INDEX idx_user_metrics_date ON user_metrics(date);
-- Composite indexes for common filter combinations
CREATE INDEX idx_user_metrics_user_date ON user_metrics(user_id, date);
-- Partial indexes for frequently filtered subsets
CREATE INDEX idx_user_metrics_active ON user_metrics(user_id) WHERE active = true;
For ClickHouse, the strategy is different. ClickHouse uses primary keys and data skipping indexes rather than traditional B-tree indexes:
CREATE TABLE user_metrics (
date Date,
user_id UInt32,
revenue Decimal128(2),
PRIMARY KEY (date, user_id)
) ENGINE = MergeTree()
ORDER BY (date, user_id);
The primary key determines the physical sort order on disk. ClickHouse skips entire data blocks that don’t match the query’s WHERE clause.
Materialised Views and Pre-Aggregation
For expensive aggregations (e.g., summing revenue by country across 500M transactions), materialised views are essential.
PostgreSQL approach:
CREATE MATERIALIZED VIEW revenue_by_country_daily AS
SELECT date, country, SUM(revenue) as total_revenue
FROM transactions
GROUP BY date, country;
CREATE INDEX idx_revenue_by_country_daily ON revenue_by_country_daily(date, country);
Dagster refreshes this view on a schedule:
@asset(deps=[transactions_asset])
def revenue_by_country_daily():
with get_database_connection() as conn:
conn.execute("REFRESH MATERIALIZED VIEW revenue_by_country_daily")
ClickHouse approach uses SummingMergeTree or AggregatingMergeTree tables:
CREATE TABLE revenue_by_country_daily (
date Date,
country String,
total_revenue AggregateFunction(sum, Decimal128(2))
) ENGINE = AggregatingMergeTree()
ORDER BY (date, country);
ClickHouse automatically aggregates data as it’s inserted, making queries nearly instant.
Query Patterns to Avoid
Certain query patterns are slow regardless of indexes:
-
SELECT * on wide tables: Always specify columns. Superset does this automatically, but custom SQL queries often don’t.
-
LIKE queries on unindexed text:
WHERE description LIKE '%keyword%'scans the entire table. Use full-text search or prefix matching instead. -
Subqueries in WHERE clauses:
WHERE user_id IN (SELECT id FROM users WHERE created_at > ...)is slow. Use JOINs instead. -
GROUP BY on high-cardinality columns:
GROUP BY user_idon a table with millions of users creates millions of groups. Pre-aggregate in Dagster instead. -
Correlated subqueries:
SELECT * FROM orders WHERE user_id = (SELECT user_id FROM users WHERE email = ...)is slow. Use JOINs.
Superset’s query builder helps avoid many of these pitfalls, but custom SQL queries require discipline.
Async Query Execution
By default, Superset waits for queries to complete before returning results. For long-running queries (>30 seconds), this causes timeouts and poor user experience.
Superset’s async query execution feature sends the query to a background task queue (Celery) and polls for results:
# In superset_config.py
CELERY_BROKER_URL = "redis://redis.internal:6379/0"
CELERY_RESULT_BACKEND = "redis://redis.internal:6379/1"
ASYNC_QUERY_JOB_TIMEOUT = 3600 # 1 hour
QUERY_CELERY_TIMEOUT = 300 # 5 minutes per query
With async queries enabled:
- User opens a dashboard
- Superset enqueues the queries to Celery
- Celery workers execute the queries in the background
- Superset polls Redis for results and updates the dashboard incrementally
- User sees results as they complete
This is essential for dashboards with multiple slow queries. The trade-off is added infrastructure (Redis, Celery workers) and eventual consistency (results appear gradually, not all at once).
Caching Strategies for Production
Superset’s Caching Layers
Superset has three caching layers:
Layer 1: Query Result Cache
Superset caches query results in Redis (or in-memory for small deployments). When a user runs the same query twice within the cache TTL, Superset returns the cached result instead of hitting the database.
# In superset_config.py
CACHE_CONFIG = {
"CACHE_TYPE": "redis",
"CACHE_REDIS_URL": "redis://redis.internal:6379/2",
"CACHE_DEFAULT_TIMEOUT": 3600, # 1 hour
}
The cache key includes the query SQL, database connection, and parameters. Different users running the same query hit the same cache entry.
Layer 2: Metadata Cache
Superset caches dataset and column metadata (table structure, column types, available filters) to avoid repeated database introspection.
METADATA_CACHE_TYPE = "redis"
METADATA_CACHE_TIMEOUT = 86400 # 24 hours
Layer 3: Browser Cache
Superset sets HTTP cache headers on dashboard JSON responses. Browsers cache dashboard definitions and re-use them across page loads.
Cache Invalidation Strategy
Caching is worthless if stale data serves users. The challenge is invalidating cache entries when underlying data changes.
Option 1: Time-Based Expiration
Set a fixed TTL (e.g., 1 hour). After the TTL expires, the cache entry is discarded and the next query hits the database.
Pros: Simple, no infrastructure overhead. Cons: Users see stale data until the TTL expires. TTL must be conservative (short) to avoid stale data, which reduces cache effectiveness.
Option 2: Event-Driven Invalidation
When Dagster materialises an asset, it invalidates the corresponding Superset cache entries.
# In Dagster
@asset
def user_metrics():
# ... materialise data ...
# Invalidate Superset cache
invalidate_superset_cache(
database_id=1,
table_name="user_metrics"
)
Pros: Minimal staleness, cache entries live as long as possible. Cons: Requires integration between Dagster and Superset, adds operational complexity.
Option 3: Hybrid Approach
Use short TTLs (5–15 minutes) for most queries, longer TTLs (24 hours) for slowly-changing reference data, and event-driven invalidation for critical metrics.
Most production deployments use Option 3.
Cache Warming
When a cache entry expires, the next user to request that query must wait for the database to execute it. This causes a “cache miss penalty”.
To avoid this, pre-warm the cache by running common queries on a schedule:
# Scheduled Dagster job that runs after metrics materialise
@job
def warm_superset_cache():
queries = [
"SELECT date, SUM(revenue) FROM user_metrics GROUP BY date",
"SELECT country, COUNT(*) FROM users GROUP BY country",
# ... other common queries ...
]
for query in queries:
execute_superset_query(query)
This ensures frequently-accessed queries are always cached.
Cache Size and Eviction
Redis has a fixed memory limit. When the cache exceeds this limit, Redis evicts old entries using a policy (LRU, LFU, TTL-based).
# Redis configuration
REDIS_MAX_MEMORY = "2gb"
REDIS_MAX_MEMORY_POLICY = "allkeys-lru" # Evict least-recently-used keys
Monitor cache hit rates and size:
redis-cli INFO stats | grep hit_rate
redis-cli INFO memory | grep used_memory
If hit rates drop below 70%, increase Redis memory or reduce query diversity.
Operational Quirks and Gotchas
The Async Query Timeout Trap
When async queries are enabled, Superset has two timeout settings:
- QUERY_CELERY_TIMEOUT: How long Celery waits for a single query to complete (default 300 seconds).
- SQLLAB_TIMEOUT: How long the browser waits for results before giving up (default 30 seconds).
If a query takes longer than QUERY_CELERY_TIMEOUT, Celery kills it. If the user closes the browser before SQLLAB_TIMEOUT, Superset stops polling for results.
The gotcha: A query can complete in Celery (after QUERY_CELERY_TIMEOUT), but the user has already given up and closed the browser. The result is orphaned in Redis, consuming memory.
Solution: Set SQLLAB_TIMEOUT and QUERY_CELERY_TIMEOUT to the same value, or slightly longer.
Dataset Refresh and Permissions
When you refresh a Superset dataset (to pick up new columns from the underlying table), Superset scans the table schema. On very wide tables (1000+ columns), this can take minutes and lock the table.
Solution: Refresh datasets during off-peak hours, or use a read replica.
Superset’s SQL Generation
Superset’s query builder generates SQL automatically. The generated SQL is usually efficient, but edge cases exist:
-
Filters on calculated metrics: If you filter by a metric (e.g., “revenue > 1000”), Superset adds a HAVING clause. If you have multiple grouping dimensions, this can be slow.
-
Date filters and timezones: Superset applies timezone conversions in SQL. On large tables, this can negate index benefits.
-
Multi-select filters: If a filter allows multiple values, Superset generates
WHERE column IN (val1, val2, ...). With many values, this becomes unwieldy.
Solution: For complex filters, use custom SQL queries instead of the visual query builder. Custom SQL gives you full control and often runs faster.
Embedding and CORS
If you embed Superset dashboards in a SaaS product, you need to handle cross-origin requests (CORS). Superset’s default configuration blocks embedded dashboards.
# In superset_config.py
FEATURED_DASHBOARD_URIS = ["https://app.example.com/analytics"]
ALLOWED_ORIGINS = ["https://app.example.com"]
Also, you need to handle authentication. Superset’s default session-based auth doesn’t work well for embedded dashboards. Use OAuth2 or JWT tokens instead.
Memory Leaks in Long-Running Superset Instances
Superset’s Python application can leak memory over time, especially if you run heavy workloads. Monitor memory usage and restart Superset workers periodically.
# In Kubernetes
resources:
requests:
memory: "2Gi"
limits:
memory: "4Gi"
livenessProbe:
httpGet:
path: /health
port: 8088
initialDelaySeconds: 30
periodSeconds: 10
Security and Compliance
Row-Level Security (RLS)
Superset’s RLS feature restricts which rows a user can see based on their role or attributes.
-- RLS clause for a dataset
WHERE organization_id = {{ current_user_org_id }}
When a user queries the dataset, Superset automatically appends the RLS clause. Users can’t bypass this by writing custom SQL (Superset rewrites custom SQL to include RLS).
RLS is essential for multi-tenant SaaS products. Without it, a user could query another tenant’s data.
Column-Level Security
You can hide sensitive columns from certain users:
# In Superset's UI or API
column_permissions = {
"salary": ["admin"], # Only admins see salary column
"email": ["admin", "hr"],
}
Users without permission don’t see the column in the UI and can’t query it.
Audit Logging
Superset logs all user actions (dashboard views, query executions, dataset modifications) to its internal database. This is essential for compliance.
# In superset_config.py
LOGGING_CONFIG = {
"version": 1,
"disable_existing_loggers": False,
"handlers": {
"file": {
"class": "logging.handlers.RotatingFileHandler",
"filename": "/var/log/superset/audit.log",
"maxBytes": 104857600, # 100MB
"backupCount": 10,
},
},
}
Export audit logs regularly to a central logging system (ELK, Splunk, CloudWatch) for long-term retention and analysis.
SOC 2 and ISO 27001 Readiness
Superset deployments that handle sensitive data (PII, financial data, health records) often need SOC 2 or ISO 27001 compliance. Key requirements:
- Encryption in transit: Use TLS 1.2+ for all connections (Superset to database, user to Superset).
- Encryption at rest: Encrypt database backups and Superset’s metadata database.
- Access control: Implement RBAC, MFA, and audit logging.
- Data retention: Define retention policies and implement automated deletion.
- Incident response: Document procedures for security incidents and data breaches.
For organisations pursuing formal compliance, PADISO provides SOC 2 / ISO 27001 audit readiness support via Vanta. This includes infrastructure review, configuration hardening, and evidence collection for auditors.
Secrets Management
Superset stores database credentials (usernames, passwords, API keys) in its metadata database. These must be encrypted.
# In superset_config.py
SQLALCHEMY_DATABASE_URI = "postgresql://user:password@postgres.internal/superset"
# Encrypt sensitive values
SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"] # 32+ character random string
Better approach: Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, Kubernetes Secrets):
import boto3
secrets_client = boto3.client("secretsmanager")
db_secret = secrets_client.get_secret_value(SecretId="superset/db-password")
SQLALCHEMY_DATABASE_URI = f"postgresql://user:{db_secret['SecretString']}@postgres.internal/superset"
Scaling and Multi-Tenancy
Horizontal Scaling of Superset
Superset is stateless (except for session data in Redis). You can run multiple Superset instances behind a load balancer:
apiVersion: apps/v1
kind: Deployment
metadata:
name: superset
spec:
replicas: 3 # 3 Superset instances
template:
spec:
containers:
- name: superset
image: superset:latest
ports:
- containerPort: 8088
env:
- name: SUPERSET_ENV
value: production
- name: REDIS_URL
value: redis://redis.internal:6379/0
Load balancer (Nginx, AWS ALB) distributes requests across instances.
Scaling Dagster
Dagster scales differently. The Dagster daemon (which triggers runs) is stateful and should run as a single instance (or use Dagster Cloud for managed scaling).
Dagster workers (which execute runs) are stateless and scale horizontally:
apiVersion: apps/v1
kind: Deployment
metadata:
name: dagster-workers
spec:
replicas: 10 # 10 worker pods
template:
spec:
containers:
- name: dagster-worker
image: dagster:latest
env:
- name: DAGSTER_HOME
value: /var/lib/dagster
For large deployments, use Dagster’s Kubernetes executor to spawn a pod per run. This gives unlimited parallelism.
Multi-Tenant Architecture
For SaaS products serving multiple customers, you have three architectural options:
Option 1: Shared Superset, Separate Databases
One Superset instance serves all tenants. Each tenant has a separate analytics database.
Tenant A Database → Superset (shared) ← Tenant B Database
Pros: Simple, low cost. Cons: RLS must be perfect (no data leakage), single Superset instance is a bottleneck.
Option 2: Separate Superset Instances
Each tenant gets their own Superset instance and database.
Tenant A Database → Superset A
Tenant B Database → Superset B
Pros: Complete isolation, no risk of data leakage. Cons: High cost (N instances for N tenants), operational overhead.
Option 3: Shared Superset, Shared Database with RLS
One Superset instance and one database, but strict RLS ensures data isolation.
Shared Database (with RLS) ← Superset (shared)
Pros: Efficient, good cost-to-isolation ratio. Cons: RLS complexity, must audit RLS logic regularly.
Most SaaS teams use Option 3 for cost efficiency, with Option 1 as a fallback for high-security requirements.
Monitoring and Observability
Key Metrics to Monitor
Superset Metrics:
- Query execution time: Histogram of query durations. Alert if p95 > 30 seconds.
- Cache hit rate: Percentage of queries served from cache. Target: >70%.
- Active users: Concurrent users on Superset. Alert if > capacity.
- Error rate: Percentage of failed queries. Alert if > 1%.
- Dashboard load time: Time to render a dashboard in the browser. Alert if > 5 seconds.
Dagster Metrics:
- Asset freshness: Time since last successful materialisation. Alert if > expected interval.
- Run duration: How long Dagster runs take. Alert if > 2x historical average.
- Run failure rate: Percentage of failed runs. Alert if > 5%.
- Backlog: Number of pending runs. Alert if > threshold.
Database Metrics:
- Query execution time: Histogram of query durations on the analytics database.
- Connection pool utilization: Percentage of available connections in use. Alert if > 80%.
- Disk usage: Percentage of disk space used. Alert if > 80%.
- Replication lag: (If using read replicas) Time lag between primary and replica. Alert if > 1 minute.
Logging and Tracing
Centralise logs from Superset, Dagster, and your database:
# In superset_config.py
import logging
from pythonjsonlogger import jsonlogger
logger = logging.getLogger()
logHandler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter()
logHandler.setFormatter(formatter)
logger.addHandler(logHandler)
logger.setLevel(logging.INFO)
Ship logs to ELK, Splunk, or CloudWatch. Use structured logging (JSON) for easy parsing and searching.
For distributed tracing, integrate OpenTelemetry:
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
jaeger_exporter = JaegerExporter(agent_host_name="jaeger.internal", agent_port=6831)
trace.get_tracer_provider().add_span_processor(
trace.BatchSpanProcessor(jaeger_exporter)
)
This allows you to trace a single request from the Superset UI through Dagster to the database and back.
Alerting
Set up alerts for critical issues:
# Prometheus alerting rules
groups:
- name: superset
rules:
- alert: SupersetHighErrorRate
expr: rate(superset_query_errors_total[5m]) > 0.01
for: 5m
annotations:
summary: "Superset error rate > 1%"
- alert: DagsterAssetStale
expr: time() - asset_last_materialisation_time > 86400
for: 10m
annotations:
summary: "Asset has not materialised in 24 hours"
Migration and Rollout
From Spreadsheets to Superset + Dagster
Most organisations we work with start with spreadsheets or ad-hoc SQL queries. Migrating to Superset + Dagster requires careful planning.
Phase 1: Assess Current State (Weeks 1–2)
- Document existing reports, dashboards, and data sources.
- Identify data owners and stakeholders.
- Estimate data volume and query complexity.
- Audit data quality issues.
Phase 2: Design New Architecture (Weeks 3–4)
- Design the analytics database schema.
- Identify Dagster assets and dependencies.
- Plan data lineage and transformations.
- Design Superset dashboards and datasets.
Phase 3: Build and Test (Weeks 5–10)
- Build Dagster pipelines for data ingestion and transformation.
- Create Superset datasets and dashboards.
- Validate data accuracy against existing reports.
- Load test Superset and database.
Phase 4: Pilot with Early Adopters (Weeks 11–12)
- Release to a small group of power users.
- Gather feedback and iterate.
- Document known issues and workarounds.
Phase 5: Full Rollout (Weeks 13+)
- Gradually migrate users from spreadsheets to Superset.
- Retire old reports.
- Monitor for issues and optimise.
Data Validation During Migration
The biggest risk during migration is data discrepancies. Validate that Superset dashboards show the same numbers as the old reports:
# Validation query in Dagster
@asset
def validate_migration():
superset_result = query_superset("SELECT SUM(revenue) FROM user_metrics")
spreadsheet_result = query_spreadsheet("Sheet1!B2:B1000")
assert abs(superset_result - spreadsheet_result) < 0.01, \
f"Mismatch: Superset={superset_result}, Spreadsheet={spreadsheet_result}"
Run these validations daily during the migration period. Any discrepancy is a blocker for rollout.
Gradual Traffic Migration
For critical dashboards, use a canary deployment:
- Week 1: 10% of users see Superset, 90% see old reports.
- Week 2: 25% Superset, 75% old reports.
- Week 3: 50% Superset, 50% old reports.
- Week 4: 100% Superset.
Monitor error rates and user feedback at each stage. If issues arise, roll back to old reports.
Real-World Deployment Patterns
Pattern 1: SaaS Analytics for a Fintech Platform
Requirements: Real-time transaction analytics, multi-tenant isolation, SOC 2 compliance.
Architecture:
- Data Source: PostgreSQL operational database with 50GB of transaction data.
- Dagster Pipelines: Hourly ETL from PostgreSQL to ClickHouse, with incremental updates.
- ClickHouse: 500GB analytics database, distributed across 3 nodes.
- Superset: 2 instances behind a load balancer, with RLS for tenant isolation.
- Caching: Redis with 1-hour TTL for most queries, 5-minute TTL for real-time dashboards.
Key Decisions:
- Used ClickHouse for its superior compression (10x vs PostgreSQL) and query speed.
- Implemented RLS at the database level (ClickHouse row policies) for defence-in-depth.
- Scheduled Dagster jobs to run hourly, keeping Superset data fresh.
- Used async queries for dashboards with >5 charts to avoid timeouts.
Results: Dashboards load in <2 seconds, cache hit rate >85%, zero data leakage incidents.
Pattern 2: Data Warehouse for a B2B SaaS Platform
Requirements: Centralised data for data science, business intelligence, and product analytics. Support for 100+ internal users and 50+ external customers.
Architecture:
- Data Sources: Snowflake data warehouse (primary), PostgreSQL operational database, third-party APIs (Stripe, Mixpanel).
- Dagster Pipelines: Daily full-refresh and hourly incremental updates from multiple sources.
- Snowflake: Shared database with row-level security for customer data.
- Superset: Embedded in SaaS product for customer-facing dashboards, internal Superset instance for business intelligence.
- Caching: Redis with event-driven invalidation (when Dagster assets materialise).
Key Decisions:
- Used Snowflake’s native sharing feature to grant read access to external customers’ data.
- Implemented column-level security to hide sensitive columns (unit economics, pricing).
- Built Dagster jobs with data quality checks (freshness, null counts, duplicates).
- Used Superset’s embedding feature to white-label dashboards for customers.
Results: 95% cache hit rate, <1 second dashboard load time for embedded dashboards, 10+ data quality checks prevent bad data from reaching users.
Pattern 3: Data Platform for an Enterprise Retailer
Requirements: Consolidate data from 500+ stores, support 200+ internal users, enable real-time inventory and sales analytics.
Architecture:
- Data Sources: Store POS systems (daily batch), web analytics (hourly), inventory management (real-time API).
- Dagster Pipelines: Daily full-refresh of POS data, hourly web analytics, real-time inventory via sensors.
- Data Warehouse: Redshift cluster with 2TB of historical data.
- Superset: 5 instances behind a load balancer, supporting 200 concurrent users.
- Caching: Redis with short TTL (5 minutes) for real-time dashboards, longer TTL (24 hours) for historical reports.
Key Decisions:
- Used Redshift’s distribution keys and sort keys to optimise query performance for common access patterns.
- Built Dagster jobs with automatic backfill capability (if data arrives late, backfill the missing period).
- Implemented Superset’s async query execution to handle slow queries without timeouts.
- Set up alerts for data freshness (if daily POS data doesn’t arrive by 6 AM, alert the data team).
Results: 500+ dashboards serving 200 users, median query time <5 seconds, 99.9% uptime, $2M annual savings from optimised inventory management.
Next Steps and Getting Started
Step 1: Assess Your Current State
Before building Superset + Dagster, answer these questions:
- How much data do you have? (GBs, TBs, PBs)
- What data sources do you need to integrate? (Databases, APIs, files)
- How many users will use Superset? (Tens, hundreds, thousands)
- What’s your freshness requirement? (Real-time, hourly, daily)
- Do you have compliance requirements? (SOC 2, ISO 27001, HIPAA, PCI-DSS)
Your answers determine the appropriate architecture and tools.
Step 2: Choose Your Data Warehouse
Common options:
- PostgreSQL: <100GB, operational workloads, cost-sensitive.
- ClickHouse: 100GB–10TB, time-series and event data, need speed.
- Snowflake: Cloud-native, managed infrastructure, don’t want to manage servers.
- Redshift: AWS-native, good for enterprises already on AWS.
- BigQuery: GCP-native, serverless, pay-per-query.
If you’re in Australia and need sovereign cloud, Platform Development in Canberra covers IRAP-aligned architecture with Superset and ClickHouse for government teams.
Step 3: Design Your Dagster Asset Graph
Map out your data pipelines as Dagster assets:
raw_events (from API) → cleaned_events (deduped, validated) → user_metrics (aggregated) → dashboard_data (final)
Each asset is a unit of computation. Dagster orchestrates dependencies and ensures data flows in the right order.
Step 4: Build Superset Datasets and Dashboards
Create Superset datasets that map to your Dagster assets. Start with a few critical dashboards (KPIs, operational metrics) and expand from there.
Step 5: Implement Monitoring and Alerting
Set up observability before you go to production:
- Monitor Dagster run success rates and duration.
- Monitor Superset query performance and cache hit rates.
- Monitor database performance and resource utilisation.
- Alert on data freshness, errors, and performance degradation.
Step 6: Plan Your Rollout
Migrate users gradually from old systems to Superset. Validate data accuracy at each stage.
Getting Help
Building a production analytics platform is complex. If you need help designing, building, or operating Superset + Dagster, PADISO offers platform engineering and CTO as a Service across multiple cities. We’ve deployed this stack for 50+ customers across fintech, retail, media, and government sectors.
For organisations in specific regions:
- Sydney: Platform Development in Sydney for financial services, retail, and media teams.
- Melbourne: Platform Development in Melbourne for insurance, retail, and health teams.
- Australia-wide: Platform Development in Australia across all cities.
- New York: Platform Development in New York for financial services and media.
- Washington, D.C.: Platform Development in Washington, D.C. for government and defence.
- Toronto: Platform Development in Toronto for financial services and tech.
- Ottawa: Platform Development in Ottawa for government and telecom.
- Chicago: Platform Development in Chicago for trading and logistics.
- Austin: Platform Development in Austin for tech and semiconductors.
- Dallas: Platform Development in Dallas for finance and logistics.
- Wellington: Platform Development in Wellington for government and tech in New Zealand.
We also offer SOC 2 / ISO 27001 compliance support via Vanta for organisations needing audit readiness.
Summary
Apache Superset and Dagster form a powerful, production-ready analytics stack. Superset handles the presentation layer and user interaction, whilst Dagster orchestrates the data pipelines that feed those dashboards.
Key principles for success:
- Separate concerns: Dagster owns orchestration and data quality, Superset owns presentation.
- Asset-driven architecture: Treat Dagster assets as the source of truth for available data.
- Optimise for performance: Index your database, use materialised views, implement caching.
- Invest in observability: Monitor freshness, query performance, and user activity.
- Plan for scale: Design with multi-tenancy, horizontal scaling, and compliance in mind from day one.
The patterns documented here reflect real production deployments across diverse industries and data volumes. Start with the basics, monitor obsessively, and iterate based on what you learn.
When you’re ready to build, reach out to PADISO. We’ve shipped this stack dozens of times and can help you avoid the pitfalls we’ve learned from.