PADISO.ai: AI Agent Orchestration Platform - Launching May 2026
Back to Blog
Guide 20 mins

Apache Superset + Databricks: A D23.io Reference Architecture

Production-grade Apache Superset + Databricks architecture for analytics. Connection patterns, query optimisation, caching, and operational insights from D23.io deployments.

The PADISO Team ·2026-06-11

Apache Superset + Databricks: A D23.io Reference Architecture

Table of Contents

  1. Why This Architecture Matters
  2. Core Connection Patterns
  3. Query Performance and Optimisation
  4. Caching Strategy
  5. Security and Compliance
  6. Operational Considerations
  7. Real-World Deployment Lessons
  8. Scaling Beyond MVP
  9. Common Pitfalls and Solutions
  10. Next Steps

Why This Architecture Matters

Apache Superset paired with Databricks has become a standard choice for teams building modern data platforms. The combination offers a powerful, open-source analytics layer on top of a unified lakehouse that handles both data engineering and analytics workloads. Yet moving from a proof-of-concept to a production system requires careful planning around connection pooling, query performance, caching, and operational reliability.

This guide distils production lessons from D23.io customer deployments across financial services, retail, and media sectors. We’ve seen teams ship Superset + Databricks in four weeks, cut per-seat BI costs by 30–50%, and pass SOC 2 / ISO 27001 audits with confidence. We’ve also seen teams struggle with runaway query costs, connection timeouts, and slow dashboard load times because they skipped critical design decisions early.

This reference architecture covers the decisions that matter: how to connect Superset to Databricks, how to structure queries for sub-second dashboard performance, when and how to cache, and how to operate the stack reliably in production.


Core Connection Patterns

Direct JDBC Connection vs. Databricks SQL Warehouse

The first decision is whether to connect Superset directly to your Databricks workspace via JDBC, or route queries through a Databricks SQL Warehouse (the managed SQL endpoint). Both are valid; the choice depends on your cost model, query latency requirements, and team skill.

Direct JDBC Connection connects Superset to the Databricks workspace cluster. This is simpler to set up and works well for exploratory analytics and lower-concurrency dashboards. You point Superset’s database driver at your Databricks host URL, cluster ID, and a personal access token. The driver handles query submission and result streaming. Costs scale with cluster compute time, so if you’re running a cluster 24/7 to serve Superset, your bill grows linearly with uptime, not query volume.

Databricks SQL Warehouse is a managed, auto-scaling SQL endpoint. Superset queries route through the warehouse, which spins up compute on demand and scales down when idle. You pay per DBU (Databricks Unit) consumed, not per hour of cluster uptime. For production dashboards with variable traffic, this is usually cheaper. The trade-off is slightly higher per-query latency (50–200ms overhead) and a small fixed cost even if the warehouse is unused.

Our recommendation: use Databricks SQL Warehouse for production dashboards and shared analytics workloads. Use a direct cluster connection only for ad-hoc exploration or if you have a dedicated cluster that’s already running for other workloads (e.g., scheduled ETL jobs). If you’re running Superset at scale across multiple teams, a SQL Warehouse is non-negotiable for cost control and isolation.

When you set up the connection in Superset, use the official Databricks Apache Superset integration documentation as your reference. The driver requires a host URL (e.g., adb-1234567890.cloud.databricks.com), HTTP path (e.g., /sql/1.0/warehouses/abc123def456), and a personal access token. Store the token in Superset’s encrypted secret manager, never in plaintext configuration.

Connection Pooling and Concurrency

Superset uses SQLAlchemy for database connections. By default, SQLAlchemy maintains a pool of connections to Databricks. The pool size and overflow settings are critical: too small, and concurrent dashboard requests queue and timeout; too large, and you exhaust Databricks connection limits and incur unnecessary overhead.

For a production Superset deployment serving 10–50 concurrent users, start with a pool size of 10–15 connections and a max overflow of 10. This means Superset will maintain 10 connections in the pool and allow up to 10 additional temporary connections under load. If you’re serving 100+ concurrent users, increase the pool size to 20–30 and overflow to 20.

Configure this in Superset’s superset_config.py:

SQLALCHEMY_ENGINE_KWARGS = {
    "connect_args": {
        "http_path": "/sql/1.0/warehouses/your-warehouse-id",
    },
    "pool_size": 15,
    "max_overflow": 10,
    "pool_pre_ping": True,
    "pool_recycle": 3600,
}

The pool_pre_ping setting pings each connection before reuse to detect stale connections. The pool_recycle setting closes and recreates connections every hour, preventing timeout issues from long-lived idle connections.

Monitor connection usage in production. If you see connection pool exhaustion warnings in Superset logs, increase the pool size. If you see idle connections consuming Databricks resources, reduce the pool size or lower the pool recycle interval.

Authentication and Token Rotation

Superset connects to Databricks using a personal access token (PAT). In production, this token should belong to a service account, not an individual user. Create a dedicated Databricks user (e.g., superset-service@company.com) with the minimum required permissions: read access to the catalogs and schemas that Superset dashboards query.

Tokens expire and need rotation. Plan to rotate the Superset service account token every 90 days. Store the token in Superset’s secret backend (e.g., AWS Secrets Manager, HashiCorp Vault, or Kubernetes secrets). When you rotate the token, update the secret and restart Superset’s web and worker services.

For teams pursuing SOC 2 / ISO 27001 compliance, token rotation and audit logging are mandatory. Ensure Databricks workspace audit logs capture token creation, use, and rotation. Superset should log all database connection attempts and query submissions.


Query Performance and Optimisation

Understanding Databricks Query Execution

When Superset submits a query to Databricks, the query is parsed, optimized by Catalyst (Databricks’ SQL optimizer), and executed on the warehouse or cluster. The entire round-trip—from Superset to Databricks and back—typically takes 500ms to 2 seconds for a simple query, depending on query complexity, data volume, and warehouse size.

Superset dashboards often contain 4–8 queries running in parallel when a user opens the dashboard. If each query takes 2 seconds, the dashboard loads in ~2 seconds (parallel execution). If queries take 10 seconds, the dashboard takes 10+ seconds. This is why query performance is the single biggest lever for dashboard UX.

Materialized Views and Aggregate Tables

The most effective way to speed up Superset dashboards is to pre-compute aggregations. Instead of running expensive GROUP BY queries on billions of raw rows, query a pre-computed aggregate table that’s updated on a schedule.

For example, if your dashboard shows daily revenue by region, don’t query the raw transactions table (1 billion rows). Instead, create a materialized view that aggregates transactions by day and region, and query that (365 × 50 = 18k rows).

In Databricks, create materialized views using Delta tables:

CREATE OR REPLACE TABLE revenue_by_day_region AS
SELECT
  DATE(transaction_date) AS date,
  region,
  SUM(amount) AS revenue,
  COUNT(*) AS transaction_count
FROM transactions
GROUP BY DATE(transaction_date), region;

CREATE INDEX idx_revenue_date ON revenue_by_day_region(date);

Then update this table on a schedule (e.g., every hour via a Databricks job). In Superset, point your dashboard charts at revenue_by_day_region instead of the raw transactions table. Query time drops from 5–10 seconds to 100–200ms.

This is a standard pattern across all our platform engineering engagements in Sydney, Melbourne, and Australia broadly. Teams typically create 5–15 key aggregate tables that power 80% of dashboard traffic.

Partitioning and Pruning

Databricks uses partition pruning to skip data that doesn’t match the query filter. If your transactions table is partitioned by date, and a query filters WHERE date >= '2024-01-01', Databricks only scans partitions from 2024 onwards, not the entire table.

Always partition large fact tables by a temporal column (date, month, or year). For tables larger than 1 billion rows, partitioning is non-negotiable.

CREATE TABLE transactions (
  id LONG,
  amount DECIMAL(10, 2),
  transaction_date DATE,
  region STRING
)
PARTITIONED BY (transaction_date)
USING DELTA;

When you create Superset dashboards, always include a date filter (e.g., “Last 90 days”). This ensures queries only scan relevant partitions, not the entire table.

Statistics and Query Hints

Databricks Catalyst optimizer uses table statistics to choose efficient query plans. If statistics are stale, the optimizer might choose a suboptimal plan. Regularly compute statistics on large tables:

ANALYZE TABLE transactions COMPUTE STATISTICS;
ANALYZE TABLE transactions COMPUTE STATISTICS FOR COLUMNS amount, region;

Run this weekly or after major data loads. Superset queries will be faster because the optimizer has accurate row counts and cardinality estimates.

For queries that are still slow despite optimisation, use Databricks SQL query hints to guide the optimizer:

SELECT /*+ BROADCAST(regions) */ t.amount, r.name
FROM transactions t
JOIN regions r ON t.region_id = r.id
WHERE t.transaction_date >= '2024-01-01';

The BROADCAST hint tells the optimizer to broadcast the small regions table to all executors, avoiding a slower shuffle join.

Superset-Level Query Optimisation

Superset generates SQL from chart definitions. Sometimes the generated SQL is inefficient. Use Superset’s “View Query” feature to inspect the SQL before saving a chart. If it looks slow, switch to “Custom SQL” and write the query manually.

For example, Superset’s GROUP BY interface might generate a query that groups by 10 columns. If you only need 3, write custom SQL that groups by only those 3. This reduces shuffle size and speeds up the query.

Also, use Superset’s “Limit” setting (e.g., LIMIT 10000) to cap result set size. If a dashboard chart is showing 100k rows in a table, the browser rendering is slow. Limit to 10k rows and let users drill down for details.


Caching Strategy

Query Caching in Superset

Superset can cache query results in-memory (Redis) or in a database backend. Caching is crucial for dashboard performance: if 50 users open the same dashboard, the first user’s queries hit Databricks, but the next 49 users’ queries hit the cache.

By default, Superset caches query results for 10 minutes. This is configurable. For dashboards that refresh hourly (e.g., daily revenue reports), set the cache TTL to 60 minutes. For real-time dashboards, set it to 1 minute or disable caching.

Configure caching in superset_config.py:

CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_REDIS_URL": "redis://localhost:6379/1",
    "CACHE_DEFAULT_TIMEOUT": 600,  # 10 minutes
}

SUPERSET_CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_REDIS_URL": "redis://localhost:6379/2",
    "CACHE_DEFAULT_TIMEOUT": 3600,  # 1 hour for dashboard queries
}

Use a dedicated Redis instance for Superset caching. If Redis is shared with other applications and fills up, Superset cache eviction can cause unexpected query latency spikes.

Databricks Query Cache

Databricks SQL Warehouse has a built-in query result cache. If the same query runs twice within a short window (e.g., 1 hour), the second run hits the cache and returns in <100ms. This is automatic and requires no configuration.

However, the Databricks cache is small (a few GB per warehouse). For large result sets or high-concurrency scenarios, it’s not reliable. Superset’s Redis cache is more predictable.

Cache Invalidation

When underlying data changes, cached results become stale. Plan cache invalidation carefully.

For dashboards powered by materialized views (e.g., daily revenue), invalidate the Superset cache once per day after the materialized view is refreshed. Use Superset’s Python API or a scheduled task:

from superset import db
from superset.models.cache import CacheKey

# After materialized view refresh, clear cache for dashboards that depend on it
db.session.query(CacheKey).filter(
    CacheKey.cache_key.like("%revenue_by_day_region%")
).delete()
db.session.commit()

For real-time dashboards, disable caching or use a very short TTL (1–5 minutes).

Distributed Caching for Multi-Region Deployments

If Superset is deployed across multiple regions or availability zones, use a distributed cache (e.g., Redis Cluster or AWS ElastiCache) so all instances share the same cache. Otherwise, each Superset instance has its own cache, and cache hit rates plummet.


Security and Compliance

Row-Level Security (RLS)

Superset supports row-level security: different users see different data based on their role or attributes. This is essential for multi-tenant dashboards or regulated industries.

Implement RLS in Superset by adding a WHERE clause to datasets:

SELECT * FROM transactions
WHERE region IN (
  SELECT region FROM user_regions WHERE username = '{{ current_username() }}'
)

When a user views a dashboard, Superset substitutes {{ current_username() }} with the logged-in user’s username and applies the WHERE clause. The user only sees transactions for their assigned regions.

Alternatively, implement RLS in Databricks using row filters on tables, and Superset will inherit those permissions.

Encryption in Transit and at Rest

All communication between Superset and Databricks must use TLS 1.2 or higher. Databricks enforces this by default; ensure your Superset configuration uses https:// URLs.

Superset’s metadata (dashboards, charts, user credentials) should be stored in an encrypted database. Use PostgreSQL or MySQL with encrypted connections and transparent data encryption (TDE) enabled.

For teams pursuing SOC 2 or ISO 27001 compliance via Vanta, encryption is a mandatory control. Ensure Vanta can audit encryption settings in Superset and Databricks.

Audit Logging

Enable audit logging in both Superset and Databricks. Superset logs user logins, dashboard views, and query executions. Databricks logs workspace activity, query execution, and data access.

Centralise logs in a SIEM (e.g., Datadog, Splunk, ELK Stack) for analysis and alerting. Set up alerts for suspicious activity: e.g., a user accessing dashboards outside business hours, or a query scanning an unusually large amount of data.

For SOC 2 / ISO 27001 audits, audit logs are critical evidence of security controls in place.


Operational Considerations

Monitoring and Alerting

Monitor these key metrics in production:

  • Dashboard load time: Track p50, p95, and p99 latencies. Alert if p95 > 5 seconds.
  • Query execution time: Track Databricks query execution time and Superset query submission latency separately. Alert if execution time > 30 seconds (indicates runaway query or data explosion).
  • Cache hit rate: Track the percentage of queries served from cache vs. hitting Databricks. Aim for >70% hit rate on dashboards.
  • Connection pool usage: Track active and idle connections. Alert if pool exhaustion occurs.
  • Databricks warehouse cost: Track DBU consumption by dashboard / user. Alert if a single dashboard is consuming >100 DBUs/hour (indicates inefficient query).
  • Error rate: Track query failures, timeouts, and authentication errors. Alert on any spike.

Use Superset’s built-in metrics (available via the API) and Databricks’ SQL warehouse monitoring dashboard. Integrate with your monitoring stack (Datadog, Prometheus, etc.).

Scaling Superset

Superset itself is stateless (dashboards and charts are stored in the metadata database, not in Superset’s memory). You can run multiple Superset instances behind a load balancer to scale horizontally.

For 10–50 concurrent users, a single Superset instance (4 CPU, 8GB RAM) is sufficient. For 100+ concurrent users, run 2–3 instances. For 500+ concurrent users, run 4–6 instances and ensure the metadata database (PostgreSQL/MySQL) is also scaled (read replicas, connection pooling).

Superset’s async query execution (Celery workers) should also be scaled. By default, Superset has 1–2 workers. For high-concurrency dashboards, run 4–8 workers so queries don’t queue.

Scheduled Refreshes and Materialized Views

Schedule materialized view refreshes during off-peak hours. If your dashboard traffic peaks during 9am–5pm, refresh materialized views at 6pm–8pm.

Use Databricks Jobs to schedule refreshes:

# Databricks Job: refresh_revenue_aggregates
from databricks.sql import connect

conn = connect(host="adb-1234567890.cloud.databricks.com", http_path="/sql/1.0/warehouses/abc123")
cursor = conn.cursor()
cursor.execute("REFRESH MATERIALIZED VIEW revenue_by_day_region")
cursor.close()

Set this job to run daily at 8pm. Monitor job success in Databricks and alert on failures.

Backup and Disaster Recovery

Superset metadata (dashboards, charts, users) should be backed up regularly. Use PostgreSQL or MySQL backups (daily snapshots, continuous replication).

Databricks data is protected by Delta Lake’s ACID guarantees and automatic backups. However, ensure you have a recovery plan if the Databricks workspace is deleted or compromised. Test recovery procedures quarterly.


Real-World Deployment Lessons

Lesson 1: Start with Materialized Views, Not Raw Tables

Our first D23.io customer tried to query a 5-billion-row raw transactions table directly in Superset. Dashboard load time was 30+ seconds. We created 10 materialized views (revenue by day/region, by customer segment, by product category, etc.). Load time dropped to 2–3 seconds, and query costs dropped 40%.

Learning: design your data model for Superset from day one. Don’t assume Superset can handle raw data at scale.

Lesson 2: Connection Pooling Failures are Silent

One customer experienced intermittent dashboard failures (some users could load dashboards, others got timeouts). The root cause: connection pool was exhausted because a buggy chart was holding connections open. We added connection pool monitoring and the pool_pre_ping setting. Problem solved.

Learning: monitor connection pool metrics from day one. Don’t wait for user complaints.

Lesson 3: Token Rotation is Non-Negotiable

A financial services customer’s Superset stopped working after 90 days. The service account token expired and was revoked by their IAM team. We implemented automatic token rotation and monitoring. Now they rotate tokens every 60 days without downtime.

Learning: build token rotation into your deployment automation. Test it in staging before production.

Lesson 4: Cache Invalidation is Harder Than It Looks

A retail customer had a dashboard showing “Top 10 Products by Revenue.” The materialized view was refreshed daily at midnight, but the Superset cache wasn’t invalidated. Users saw stale top-10 lists until 10am. We implemented cache invalidation as part of the materialized view refresh job.

Learning: tie cache invalidation to data refresh logic. Don’t rely on TTL-based expiration for critical dashboards.

Lesson 5: Query Hints Require Expertise

A media company’s dashboard had a slow join between a 100-million-row events table and a 1-million-row users table. Catalyst was choosing a shuffle join instead of a broadcast join. We added a BROADCAST(users) hint and query time dropped from 15 seconds to 2 seconds.

Learning: keep SQL performance tuning expertise on your team or budget for external help. Query hints are powerful but require deep Databricks knowledge.


Scaling Beyond MVP

Multi-Warehouse Deployments

As your analytics workload grows, you might need multiple Databricks SQL Warehouses: one for interactive dashboards (optimised for latency), one for batch analytics (optimised for cost), and one for data science (optimised for flexibility).

Route Superset queries to the interactive warehouse. Route scheduled reports and ETL jobs to the batch warehouse. This isolation prevents dashboard slowdowns when a batch job consumes resources.

In Superset, create separate database connections for each warehouse and assign dashboards to the appropriate connection.

Embedding Superset Dashboards

Many teams want to embed Superset dashboards in external applications (e.g., a customer-facing analytics portal). Superset supports embedding via iframes or the Python SDK.

For embedded dashboards, implement guest access and row-level security to ensure users only see data they’re authorised to view. Use Superset’s guest token feature to generate short-lived, read-only tokens for embedded sessions.

Custom Metrics and KPIs

As Superset usage grows, define a shared library of metrics and KPIs. For example:

  • Revenue = SUM(amount) WHERE status = ‘completed’
  • Customer Lifetime Value = SUM(revenue) GROUP BY customer_id
  • Churn Rate = COUNT(churned_customers) / COUNT(total_customers)

Store these in Superset as saved metrics or in a metrics layer (e.g., dbt, Looker, Cube.js). This ensures consistency across all dashboards.

Governance and Data Lineage

As dashboards proliferate, governance becomes critical. Document which dashboards depend on which tables, which teams own which dashboards, and which dashboards are used for decision-making.

Use Databricks’ data lineage tools (available in Databricks’ admin console) to track data flows from source to dashboard. Integrate with Superset’s API to document dashboard lineage.

For teams modernising regulated monoliths or managing multi-tenant SaaS platforms, this governance is essential. Check out PADISO’s platform engineering services in Wellington, Ottawa, and other key cities for hands-on governance implementation.


Common Pitfalls and Solutions

Pitfall 1: Runaway Query Costs

Problem: A dashboard query scans 100 billion rows because the WHERE clause is missing. Databricks charges for the entire scan.

Solution: Add data type validation and partition pruning to all Superset queries. Use Databricks’ query cost estimation to preview cost before execution. Set query timeout limits (e.g., 60 seconds) to prevent runaway queries.

Pitfall 2: Slow Joins

Problem: A dashboard joins a 1-billion-row fact table with a 10-million-row dimension table. The join is slow because Catalyst chooses a shuffle join instead of a broadcast join.

Solution: Add a BROADCAST hint or materialise the join result in a pre-computed table. Monitor join performance and profile slow queries using Databricks’ query profile tool.

Pitfall 3: Memory Pressure on Superset

Problem: A user exports a dashboard with 1 million rows. Superset’s memory spikes and the export fails or times out.

Solution: Limit result set size (LIMIT 100k), paginate large exports, and run exports asynchronously in a background worker. Monitor Superset memory usage and scale instances if needed.

Pitfall 4: Timezone Issues

Problem: A dashboard shows revenue by date, but the date is off by one day because Superset and Databricks have different timezone settings.

Solution: Use UTC everywhere. Store all timestamps in Databricks as UTC, configure Superset to use UTC, and let the frontend handle timezone conversion for display.

Pitfall 5: Stale Metadata

Problem: A new column is added to a Databricks table, but Superset’s metadata cache doesn’t reflect it. The new column doesn’t appear in Superset’s column picker.

Solution: Superset automatically syncs metadata from Databricks on a schedule (default: every 24 hours). For faster updates, manually trigger a metadata refresh in Superset’s database settings.


Next Steps

Building a production Superset + Databricks architecture is achievable in 4–8 weeks. Here’s a phased approach:

Week 1–2: Design and Proof of Concept

Week 3–4: Production Deployment

  • Deploy Superset to production (Kubernetes, managed cloud, or self-hosted).
  • Configure connection pooling, caching (Redis), and authentication.
  • Set up monitoring and alerting for query performance, cache hit rates, and Databricks costs.
  • Implement row-level security and audit logging.
  • Create runbooks for common operational tasks (token rotation, cache invalidation, materialized view refresh).

Week 5–6: Governance and Documentation

  • Document data lineage (which dashboards depend on which tables).
  • Define a shared metrics library.
  • Train users on dashboard best practices (date filters, LIMIT clauses, etc.).
  • Set up governance policies: who can create dashboards, which tables are approved for analytics, etc.

Week 7–8: Scaling and Optimisation

  • Monitor production dashboards for 2–4 weeks. Identify slow queries and optimise them.
  • Profile slow queries using Databricks’ query profile tool. Add hints or materialise joins if needed.
  • Scale Superset instances and Databricks warehouses based on usage patterns.
  • Plan for multi-warehouse deployments if workload diversity increases.

For teams in Sydney, Melbourne, or across Australia, PADISO offers hands-on platform engineering services to design and deploy Superset + Databricks architectures. We’ve shipped this stack in financial services, retail, and media, and we understand the operational quirks and cost levers.

For teams in other regions, PADISO has platform development teams in New York, Washington, D.C., Toronto, Chicago, and Dallas–Fort Worth who specialise in data platforms and analytics. We also support government and defence teams in Canberra and Ottawa with sovereign cloud and compliance-ready architectures.

If you’re building a modern analytics platform and need fractional CTO leadership, hands-on engineering, or security audit support, reach out to PADISO. We ship Superset + Databricks in production, and we can help you avoid the pitfalls we’ve learned from 50+ deployments.


Key Takeaways

  1. Choose Databricks SQL Warehouse over direct cluster connections for production dashboards. It’s cheaper and easier to scale.
  2. Design materialized views before building dashboards. Pre-computed aggregations are the single biggest lever for performance.
  3. Implement connection pooling, caching, and monitoring from day one. Don’t add them after you hit performance issues in production.
  4. Use row-level security and audit logging to meet compliance requirements and protect sensitive data.
  5. Monitor query costs, dashboard load times, and cache hit rates continuously. Optimise based on real production data.
  6. Plan for token rotation, backup, and disaster recovery before you go live.
  7. Start with materialized views and simple queries. Optimise complex queries only after profiling and identifying bottlenecks.

Apache Superset + Databricks is a powerful, cost-effective foundation for modern analytics. With careful architecture and operational discipline, you can ship production dashboards in weeks, not months, and scale to thousands of users without breaking the bank.

Want to talk through your situation?

Book a 30-minute call with Kevin (Founder/CEO). No pitch — direct advice on what to do next.

Book a 30-min call