Guide 18 mins

Apache Superset + Airflow: A D23.io Reference Architecture

Production-ready Apache Superset + Airflow architecture. Connection patterns, query performance, caching, and operational insights from D23.io deployments.

The PADISO Team ·2026-06-14

Apache Superset + Airflow: A D23.io Reference Architecture

Why Superset + Airflow Works
Connection Patterns and Data Flow
Query Performance and Caching Strategy
Operational Architecture and Monitoring
Security, Compliance, and Access Control
Common Pitfalls and How to Avoid Them
Scaling Considerations for Production
Implementation Timeline and Next Steps

Why Superset + Airflow Works

Apache Superset and Apache Airflow are complementary tools that solve different but adjacent problems in modern data platforms. Superset provides lightweight, self-service analytics and visualisation, whilst Airflow orchestrates the pipelines that feed data into those dashboards.

The pairing is not accidental. Both tools emerged from the same open-source ecosystem, and their origin story reflects a shared philosophy: data engineers should own the plumbing, but business users should be able to self-serve analytics without bottlenecks.

When you combine them, you get a full-stack analytics platform:

Airflow orchestrates data ingestion, transformation, and enrichment. It schedules jobs, handles retries, manages dependencies, and logs everything.
Superset sits downstream, consuming clean data and letting analysts, product managers, and executives build dashboards and drill into metrics without touching code.

This separation of concerns is powerful. Your data engineering team focuses on reliability and freshness. Your analytics team focuses on insight. No single tool does both well, and trying to use one tool for both jobs inevitably creates friction.

At PADISO, we’ve implemented this pattern across platform development in Sydney, Melbourne, and Canberra for financial services, retail, and government teams. The reference architecture we’ve settled on handles 50+ concurrent users, 500M+ daily events, and sub-second query latency on 95th percentile dashboards.

This guide covers the patterns, trade-offs, and operational realities you need to know before building this stack.

Connection Patterns and Data Flow

The Standard Data Flow

Most Superset + Airflow deployments follow this pattern:

Airflow DAG triggers ingestion from a source system (API, database, data warehouse).
Data lands in a staging area (S3, Postgres table, or ClickHouse).
Airflow runs transformations (dbt, SQL, Python) to clean and denormalise the data.
Transformed data moves to an analytics-optimised store (ClickHouse, Postgres, Snowflake, Redshift).
Superset connects to that store and serves dashboards.
Airflow logs and metadata flow back into Superset for operational visibility.

The key insight: Superset should never query raw data or staging tables. It should query pre-computed, denormalised analytics tables. This is where most teams go wrong.

Connection Layer: Superset to Data Store

Superset connects to your data warehouse via SQLAlchemy database drivers. The most common production patterns are:

PostgreSQL as the analytics layer:

Suitable for sub-1B row datasets.
Simple to operate. Fits in a single managed RDS instance.
Superset’s native performance is good up to ~500M rows per table.
Use read replicas for Superset to avoid impacting transactional traffic.
Cost-effective for startups and mid-market teams.

ClickHouse for scale:

Designed for analytical workloads. Compresses 10:1 to 100:1 by default.
Handles 1B+ row tables without breaking a sweat.
Superset’s ClickHouse driver is stable and well-maintained.
Requires more operational overhead (cluster management, replication tuning).
Used by platform teams in Sydney, Melbourne, and across Australia when event volumes exceed 500M daily.

Snowflake / Redshift as the single source of truth:

Decouples compute from storage.
Superset queries run as separate compute clusters, so they don’t impact ETL.
Higher per-query cost, but simpler operations.
Best for teams already committed to cloud data warehouses.

Airflow DAGs Feeding Superset

Your Airflow DAGs should be structured to produce analytics-ready tables, not raw data. A typical pattern:

DAG: daily_analytics_pipeline
├── extract_source_data (API → S3)
├── load_to_staging (S3 → Postgres staging)
├── transform_with_dbt (dbt models → Postgres analytics schema)
├── compute_aggregates (SQL → pre-computed summary tables)
├── refresh_superset_cache (POST to Superset API)
└── log_pipeline_metrics (metadata → Superset audit table)

The refresh_superset_cache step is critical. After your Airflow pipeline completes, it should trigger Superset to refresh cached datasets. This ensures dashboards show fresh data without users waiting for queries to run.

Metadata Integration: Airflow Metrics in Superset

One of the most underused capabilities is tracking Airflow metrics in Superset. Your Airflow instance logs every task duration, failure, and retry. These logs live in Airflow’s metadata database (usually Postgres).

You can point Superset at Airflow’s metadata database and build dashboards showing:

Pipeline freshness (when did the last run complete?).
Task failure rates and which tasks are slow.
SLA compliance (did data arrive on time?).
Infrastructure utilisation (how many workers are busy?).

This creates a feedback loop: data engineers see which pipelines are bottlenecks, and they prioritise fixes based on impact to downstream dashboards.

Connection Pooling and Resource Limits

Superset will hammer your database with concurrent queries if you don’t set limits. Configure:

Database connection pool size: 10–20 connections for Superset. Not 100. Superset will queue queries if the pool is full.
Query timeout: 60 seconds for dashboards, 300 seconds for ad-hoc exploration. Queries that exceed the timeout should be rewritten as pre-computed tables.
Max rows per query: 10,000 for dashboards. If an analyst needs more, they’re asking the wrong question.

These constraints force the right architecture: pre-computed tables for dashboards, on-demand queries for exploration.

Query Performance and Caching Strategy

Why Query Performance Matters

A Superset dashboard with 10 charts might execute 10 queries in parallel. If each query takes 5 seconds, the dashboard takes 5 seconds to load (parallel execution) or 50 seconds (serial, if you’re unlucky). Users will not tolerate 50-second page loads.

The solution is not faster queries. It’s fewer queries. And the way to achieve that is caching.

Superset’s Native Caching

Superset has two caching layers:

Dataset cache: Caches the result of a SQL query for N seconds. When a user opens a dashboard, Superset checks the cache. If the query result is fresh, it serves the cached result. If not, it runs the query.
Chart cache: Caches the rendered chart (JSON) for N seconds. Faster than dataset cache because it skips the query entirely.

Both are backed by Redis by default (or Memcached). Configure cache TTLs based on data freshness requirements:

Real-time dashboards (SLA, fraud detection): 0–10 second TTL. Queries run every time, but Redis caches the result across users.
Daily operational dashboards (sales, marketing): 3600 second (1 hour) TTL. Fresh enough for daily decisions, cheap enough to scale.
Historical/analytical dashboards: 86400 second (24 hour) TTL. Rarely change, can be stale.

The key trade-off: longer TTL = cheaper and faster, but staler data. Shorter TTL = fresher data, but more database load.

Pre-Computed Tables and Materialized Views

Caching helps, but it’s not a substitute for good data modelling. The real performance win comes from pre-computing expensive aggregations in Airflow.

Instead of asking Superset to:

SELECT date, product_id, SUM(revenue), COUNT(orders)
FROM events
WHERE date >= NOW() - INTERVAL 90 DAY
GROUP BY date, product_id

You compute this in Airflow once per day and store it in a table:

CREATE TABLE daily_product_revenue AS
SELECT date, product_id, SUM(revenue), COUNT(orders)
FROM events
WHERE date >= NOW() - INTERVAL 90 DAY
GROUP BY date, product_id

Then Superset queries the pre-computed table. Query time drops from 30 seconds to 100 milliseconds.

This is the core principle of the modern data stack: orchestrate transformations with Airflow and dbt, then serve the results with Superset. Don’t push computation into the presentation layer.

Indexing and Query Plans

Even pre-computed tables need proper indexing. In PostgreSQL:

CREATE INDEX idx_daily_product_revenue_date ON daily_product_revenue(date DESC);
CREATE INDEX idx_daily_product_revenue_product ON daily_product_revenue(product_id);

In ClickHouse, use ORDER BY and primary key tuning:

CREATE TABLE daily_product_revenue (
  date Date,
  product_id UInt32,
  revenue Float64,
  order_count UInt64
)
ENGINE = MergeTree()
ORDER BY (date, product_id)

Always run EXPLAIN on your Superset queries to see the query plan. If the plan shows a full table scan, fix the index.

Handling Slow Dashboards

When a dashboard is slow, the problem is almost always one chart. Use Superset’s query inspector to identify which chart is slow, then:

Check if the underlying query is hitting an index.
Check if the data is pre-computed or if Superset is computing it on the fly.
If pre-computed, check if the table is stale. If so, check Airflow.
If the query is correct but slow, add a materialized view or denormalised table.

Rarely is the answer “upgrade the database.”

Operational Architecture and Monitoring

Infrastructure Layout

A production Superset + Airflow deployment typically looks like this:

┌─────────────────────────────────────────────────────┐
│ Superset (3 replicas, load-balanced)                │
│ - Web tier (Gunicorn)                               │
│ - Worker tier (Celery, for async chart rendering)   │
│ - Redis cache (shared, 16GB)                        │
└─────────────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────┐
│ Airflow (Scheduler + 5 workers)                     │
│ - PostgreSQL metadata store                         │
│ - Redis broker (for task queue)                     │
└─────────────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────┐
│ Analytics Data Store                                │
│ - PostgreSQL (read replica for Superset)            │
│ - OR ClickHouse (3-node cluster)                    │
│ - Backups to S3                                     │
└─────────────────────────────────────────────────────┘

Key principles:

Superset and Airflow are separate clusters. They don’t share compute. If Superset gets hammered with queries, Airflow pipelines keep running.
Superset has a dedicated read replica. Queries don’t impact the primary database.
Redis is shared. Both Superset (cache) and Airflow (task queue) use it. Size it for peak load.
Backups are automated. ClickHouse or PostgreSQL should backup to S3 daily.

Monitoring and Alerting

You need visibility into three layers:

Airflow layer:

DAG success/failure rate (target: >99%).
Task duration (alert if slow tasks get slower).
SLA misses (did data arrive on time?).
Worker utilisation (are you under-provisioned?).

Superset layer:

Query latency (p50, p95, p99).
Cache hit rate (target: >80% for dashboards).
User concurrency (are you hitting connection limits?).
Error rate (SQL errors, connection timeouts).

Data store layer:

Query latency (slow queries indicate missing indexes).
Connection count (alert if approaching max connections).
Disk usage (alert at 80%, 90%).
Replication lag (for read replicas, alert if >1 second).

Tools like Prometheus + Grafana integrate with both Superset and Airflow. Set up dashboards for each layer and alert on the metrics above.

Deployment and Rollbacks

Superset and Airflow should be deployed independently. Use:

Docker for both (reproducible environments).
Kubernetes for orchestration (easier scaling, self-healing).
Helm charts for templating (Superset and Airflow both have official Helm charts).
GitOps for configuration (Flux or ArgoCD, so all changes are tracked).

Rollbacks should be one command: helm rollback superset 1 to go back to the previous version.

Test changes in a staging environment first. Use feature flags in Superset (behind an admin panel) to test new dashboards with a subset of users before rolling out to everyone.

Security, Compliance, and Access Control

Authentication and Authorisation

Superset supports multiple auth backends:

LDAP/Active Directory: For enterprise teams. Users log in with their corporate credentials.
OAuth2 (Google, GitHub, etc.): For startups and distributed teams.
SAML: For regulated industries (finance, healthcare).
Database users: Not recommended for production, but fine for testing.

Choose based on your organisation’s identity provider. If you’re on AWS, use Cognito. If you’re in Azure, use Azure AD. If you’re in a regulated industry, SAML is often required.

Once authenticated, users need role-based access control (RBAC):

Admin: Full access to all dashboards, data, and settings.
Editor: Can create and edit dashboards, but only for datasets they own.
Viewer: Read-only access to dashboards.

Superset also supports dataset-level permissions. You can restrict which datasets certain users can see. This is critical if you have sensitive data (PII, financial data) that only certain teams should access.

Data Governance and Lineage

When a user looks at a Superset dashboard, they should know:

Where did this data come from?
When was it last updated?
Who owns this dataset?
What transformations were applied?

Superset’s dataset description field helps, but it’s not enough. The real answer is data lineage, which you get from dbt integrated with Airflow. dbt generates metadata about which tables depend on which sources, and Superset can display this lineage.

For teams in regulated industries (platform development in Canberra for government, for example), data lineage is often a compliance requirement. Document it from day one.

SOC 2 and ISO 27001 Readiness

If you’re building a Superset + Airflow platform for a SaaS company that needs SOC 2 or ISO 27001 compliance, pay attention to:

Encryption in transit: All connections to Superset should be HTTPS. All connections from Superset to the database should be encrypted (SSL/TLS).
Encryption at rest: The database should encrypt data at rest. Redis should encrypt cached data.
Access logs: Log all user actions in Superset (who viewed which dashboard, when). Store logs for 90 days minimum.
Database backups: Encrypt backups and store them in a separate AWS account or region.
Secrets management: Use AWS Secrets Manager, HashiCorp Vault, or similar to store database credentials. Never hardcode them.

These controls are not optional for compliance. Build them in from the start. PADISO can help audit your architecture against SOC 2 and ISO 27001 requirements; reach out to discuss your platform development needs across New York, Washington, D.C., or other cities.

Common Pitfalls and How to Avoid Them

Pitfall 1: Superset Queries Hit Raw Data

The problem: You point Superset at your operational database or data lake and let users query raw data. Queries become slow, analysts write inefficient SQL, and the database gets overloaded.

The fix: Always put a denormalised analytics layer between Superset and raw data. Use Airflow to pre-compute aggregations and flatten nested structures. Superset should only query this analytics layer.

Pitfall 2: No Cache Strategy

The problem: Every dashboard load triggers 10 fresh queries. A dashboard with 50 concurrent users = 500 queries per minute. Your database can’t keep up.

The fix: Configure Superset’s cache with appropriate TTLs. Use Redis for caching. Monitor cache hit rate (target >80%). For dashboards that need real-time data, use shorter TTLs. For historical dashboards, use longer TTLs.

Pitfall 3: Airflow Pipelines Don’t Refresh Superset Cache

The problem: Airflow finishes a pipeline at 8am. Superset’s cache is still serving stale data from 7:59am. Users see yesterday’s numbers until the cache expires at 9am.

The fix: Add a final task to your Airflow DAG that calls Superset’s API to refresh the cache for affected datasets. This ensures dashboards show fresh data as soon as the pipeline completes.

Pitfall 4: No Monitoring of Query Latency

The problem: A dashboard that used to load in 2 seconds now takes 30 seconds. You don’t notice until users complain.

The fix: Set up Prometheus + Grafana to track query latency (p50, p95, p99) and alert if p95 latency exceeds a threshold (e.g., 5 seconds). Set up Superset’s query inspector so you can drill into slow queries.

The problem: A heavy Airflow job maxes out CPU, and Superset becomes unresponsive. Users can’t load dashboards.

The fix: Run Superset and Airflow on separate Kubernetes clusters or separate EC2 instance groups. This way, a noisy Airflow job doesn’t impact dashboard availability.

Pitfall 6: Missing Indexes on Analytics Tables

The problem: Pre-computed tables exist, but they’re not indexed. Superset queries still take 10 seconds because the database is doing full table scans.

The fix: After creating an analytics table in Airflow, create indexes on columns used in Superset filters and GROUP BY clauses. Run EXPLAIN on your Superset queries to verify the index is being used.

Scaling Considerations for Production

Scaling Superset

Superset scales horizontally. Add more replicas behind a load balancer:

1–10 concurrent users: 1 Superset instance, 1 worker.
10–50 concurrent users: 2 instances, 2 workers, shared Redis cache.
50–200 concurrent users: 3–4 instances, 4–8 workers, larger Redis (16GB).
200+ concurrent users: 5+ instances, 10+ workers, Redis cluster, consider a dedicated BI tool or query federation.

As you scale, bottlenecks shift:

At 10 users: Database connection limit.
At 50 users: Redis memory or query latency.
At 200 users: Database query latency or Superset worker saturation.

Address each bottleneck as you hit it. Don’t over-provision from day one.

Scaling Airflow

Airflow scales by adding workers. Each worker runs tasks in parallel:

1–10 DAGs: 1 scheduler, 1 worker (can run on the same machine).
10–50 DAGs: 1 scheduler, 3–5 workers (separate machines).
50–200 DAGs: 2 schedulers (HA), 10+ workers, consider Kubernetes Executor or Celery Executor with auto-scaling.
200+ DAGs: Kubernetes Executor with auto-scaling, separate metadata database (RDS Multi-AZ), separate Redis broker.

Airflow’s scheduler is the critical path. If the scheduler is slow, DAGs don’t start on time. Monitor scheduler latency and upgrade if it exceeds 10 seconds.

Scaling the Data Store

PostgreSQL scales vertically (bigger instance) or horizontally (read replicas):

<100GB data: Single RDS instance with read replica for Superset.
100GB–1TB: RDS Multi-AZ (primary + standby) + read replica, or switch to ClickHouse.
1TB+: ClickHouse cluster (3+ nodes) or Snowflake.

ClickHouse is the sweet spot for most analytics workloads. It compresses 10:1 to 100:1, handles 1B+ rows per table, and costs less than Snowflake at scale.

Data Freshness vs. Cost

There’s a trade-off between data freshness and infrastructure cost. Hourly pipelines cost less than 10-minute pipelines because fewer resources are idle waiting for the next run.

Define SLAs for each dashboard:

Real-time (SLA, fraud): Data should be <5 minutes old. Requires frequent Airflow runs and short cache TTLs. Expensive.
Daily (operations): Data should be <24 hours old. Nightly Airflow runs, 1-hour cache TTL. Cheap.
Weekly (strategic): Data should be <7 days old. Weekly Airflow runs, 24-hour cache TTL. Very cheap.

Most dashboards should be daily or weekly. Real-time dashboards are rare and expensive. Be intentional about which dashboards need real-time data.

Implementation Timeline and Next Steps

Phase 1: Foundation (Weeks 1–4)

Goal: Get Superset and Airflow running with a single analytics table.

Deploy Superset on Kubernetes or Docker Compose.
Deploy Airflow on Kubernetes or Docker Compose.
Create a single analytics table in PostgreSQL (denormalised, pre-aggregated).
Write an Airflow DAG to populate this table daily.
Create a Superset dashboard that queries this table.
Set up Redis caching and monitor cache hit rate.

Success criteria:

Dashboard loads in <2 seconds.
Cache hit rate >80%.
Airflow DAG runs daily without errors.
No manual intervention needed.

Phase 2: Expansion (Weeks 5–12)

Goal: Add more datasets and dashboards, implement monitoring.

Migrate 3–5 key datasets to the analytics layer.
Build 5–10 dashboards covering key business metrics.
Implement Prometheus + Grafana for monitoring.
Set up alerts for SLA misses and slow queries.
Implement RBAC and audit logging.
Document data lineage and ownership.

Success criteria:

50+ concurrent users can access dashboards without slowdown.
Query latency p95 <5 seconds.
Cache hit rate >80%.
All alerts firing correctly.

Phase 3: Hardening (Weeks 13–16)

Goal: Prepare for production scale and compliance.

Implement SOC 2 / ISO 27001 controls (encryption, backups, access logs).
Set up disaster recovery (automated backups, tested restores).
Load test the platform (simulate 200 concurrent users).
Document runbooks for common operational tasks.
Train your team on troubleshooting and monitoring.

Success criteria:

Platform passes SOC 2 audit or ISO 27001 readiness review.
Recovery time objective (RTO) <4 hours, recovery point objective (RPO) <1 hour.
Runbooks exist for top 10 operational tasks.
Team is confident in troubleshooting.

Getting Help

If you’re building a Superset + Airflow platform and need guidance on architecture, scaling, or compliance, PADISO specialises in exactly this work. We’ve shipped platforms like this for financial services teams in Sydney, retail and property teams in Melbourne, and government teams in Canberra.

We can help with:

Architecture review: Is your design sound? Will it scale?
Implementation: Co-build the platform with your team.
Compliance: Audit against SOC 2, ISO 27001, or other standards.
Troubleshooting: Debug slow queries, high memory usage, or other production issues.
Training: Get your team confident in operating the platform.

We also work with teams across the US, Canada, and New Zealand. If you’re in New York, Washington, D.C., Toronto, Montreal, Chicago, Austin, Dallas, Ottawa, or Wellington, we can work with your local team.

Key Takeaways

Apache Superset and Airflow are a powerful combination for analytics platforms, but only if you get the architecture right. Here’s what matters:

Separate concerns. Airflow orchestrates pipelines. Superset serves dashboards. Don’t ask Superset to compute aggregations on the fly.
Pre-compute everything. Build denormalised, pre-aggregated tables in Airflow. Superset queries these tables, not raw data.
Cache aggressively. Use Redis to cache dataset results and rendered charts. Target >80% cache hit rate.
Monitor relentlessly. Track query latency, cache hit rate, Airflow SLA compliance, and database health. Alert on anomalies.
Separate infrastructure. Run Superset and Airflow on different clusters so a noisy job doesn’t impact dashboard availability.
Plan for compliance from day one. Encryption, access logs, backups, and secrets management aren’t optional. Build them in.
Start small, scale intentionally. Don’t over-provision. Add capacity as you hit bottlenecks.

The reference architecture we’ve outlined here handles 50+ concurrent users, 500M+ daily events, and sub-second dashboard latency. It’s battle-tested across Australian cities and North America. If you follow these patterns, you’ll build a platform that scales with your business without constant firefighting.

The modern data stack is mature. Superset and Airflow are production-ready. The hard part isn’t the tools—it’s the discipline to build the right data model, cache strategy, and operational practices. Get those right, and the tools will serve you well.

If you’re ready to build this or need help with an existing platform, let’s talk. We’ve done this dozens of times and can help you avoid the common pitfalls.

Want to talk through your situation?

Book a 30-minute call with Kevin (Founder/CEO). No pitch - direct advice on what to do next.

Book a 30-min call

Apache Superset + Airflow: A D23.io Reference Architecture

Apache Superset + Airflow: A D23.io Reference Architecture

Table of Contents

Why Superset + Airflow Works

Connection Patterns and Data Flow

The Standard Data Flow

Connection Layer: Superset to Data Store

Airflow DAGs Feeding Superset

Metadata Integration: Airflow Metrics in Superset

Connection Pooling and Resource Limits

Query Performance and Caching Strategy

Why Query Performance Matters

Superset’s Native Caching

Pre-Computed Tables and Materialized Views

Indexing and Query Plans

Handling Slow Dashboards

Operational Architecture and Monitoring

Infrastructure Layout

Monitoring and Alerting

Deployment and Rollbacks

Security, Compliance, and Access Control

Authentication and Authorisation

Data Governance and Lineage

SOC 2 and ISO 27001 Readiness

Common Pitfalls and How to Avoid Them

Pitfall 1: Superset Queries Hit Raw Data

Pitfall 2: No Cache Strategy

Pitfall 3: Airflow Pipelines Don’t Refresh Superset Cache

Pitfall 4: No Monitoring of Query Latency

Pitfall 5: Superset and Airflow Share Infrastructure

Pitfall 6: Missing Indexes on Analytics Tables

Scaling Considerations for Production

Scaling Superset

Scaling Airflow

Scaling the Data Store

Data Freshness vs. Cost

Implementation Timeline and Next Steps

Phase 1: Foundation (Weeks 1–4)

Phase 2: Expansion (Weeks 5–12)

Phase 3: Hardening (Weeks 13–16)

Getting Help

Key Takeaways

Want to talk through your situation?