Table of Contents
- Introduction
- Understanding the Superset + Hyper Stack
- Connection Architecture and Patterns
- Query Performance Optimisation
- Caching Strategies for Production
- Operational Considerations
- Security and Compliance
- Real-World Deployment Lessons
- Migration and Implementation
- Conclusion and Next Steps
Introduction
Apache Superset has become the default open-source analytics and data visualisation platform for teams building modern data stacks. When paired with Hyper—Tableau’s columnar in-memory data engine—you get a powerful combination that delivers sub-second query latency, handles complex aggregations, and scales to billions of rows without the per-seat licensing costs of legacy BI tools.
This guide pulls directly from D23.io customer deployments across financial services, retail, media, and government sectors. We’ve built and operated this stack in Sydney, Melbourne, and across Australia, as well as in New York, Washington DC, Toronto, and other major markets. The patterns documented here reflect production-hardened decisions about connection pooling, query caching, metadata management, and the operational quirks that separate a proof-of-concept from a system that runs reliably for thousands of users.
If you’re evaluating Superset + Hyper for a data platform, modernising a legacy BI stack, or building embedded analytics into a SaaS product, this reference architecture will help you avoid the common pitfalls and ship faster.
Understanding the Superset + Hyper Stack
What Superset Does
Apache Superset is a modern data visualisation and BI platform designed for speed and scale. Unlike traditional BI tools, Superset is:
- Open source: Full control over the codebase, no vendor lock-in, and the ability to extend with custom plugins and integrations.
- Database-agnostic: Works with PostgreSQL, MySQL, Snowflake, BigQuery, ClickHouse, Druid, and dozens of other backends.
- API-first: Designed to be embedded into applications, dashboards, and workflows.
- Cost-efficient: No per-seat licensing; you pay for infrastructure, not user access.
Superset sits between your data warehouse and your users. It handles SQL generation, query execution, caching, permission management, and the rendering of charts, tables, and dashboards. The Apache Superset Documentation provides the canonical reference for configuration and deployment.
What Hyper Brings
Hyper is Tableau’s proprietary columnar in-memory database engine. When used as a Superset backend, it provides:
- Sub-second latency: Columnar compression and in-memory execution mean queries that would take minutes in traditional row-oriented databases run in milliseconds.
- Efficient aggregations: Superset dashboards often run dozens of aggregation queries (sums, counts, averages across dimensions). Hyper’s columnar format and vectorised execution make these cheap.
- Automatic indexing: Hyper builds indices automatically; you don’t need to manually manage them.
- Data refresh flexibility: Hyper supports incremental updates, full refreshes, and scheduled materialisations.
The Superset + Hyper combination is particularly powerful for:
- Embedded analytics: SaaS platforms that embed dashboards into their product.
- High-concurrency BI: Teams with dozens or hundreds of simultaneous dashboard users.
- Ad-hoc exploration: Analysts running exploratory queries that would timeout against slower backends.
- Cost reduction: Replacing per-seat BI tools (Tableau, Looker, Power BI) with an open-source alternative.
Connection Architecture and Patterns
Direct Connections vs. Connection Pooling
Superset connects to Hyper (or any database) via a SQLAlchemy driver. The first architectural decision is whether to:
- Use direct connections: Each Superset web worker or query worker opens its own connection to Hyper.
- Use a connection pool: A proxy (like PgBouncer or pgpool2) sits between Superset and Hyper, managing a pool of reusable connections.
In production, always use a connection pool. Here’s why:
- Hyper has a finite number of concurrent connections. Without pooling, a spike in dashboard requests can exhaust available connections, causing cascading failures.
- Connection setup (authentication, negotiation) adds latency. Reusing connections from a pool is 10–100× faster than opening new ones.
- A pool allows graceful degradation: when Hyper is under load, the pool queues requests instead of rejecting them.
For Platform Development in Sydney, we typically recommend a pool size of 20–50 connections, depending on:
- Number of Superset web workers (usually 4–8 per server).
- Number of async query workers (usually 4–16).
- Expected concurrent dashboard users (rule of thumb: 1 connection per 5–10 simultaneous users).
Connection String Configuration
Hyper uses a Postgres-compatible wire protocol. The Superset connection string looks like:
postgres://user:password@hyper-host:5432/database?sslmode=require
Key parameters:
- sslmode=require: Always use SSL/TLS in production. Hyper enforces this; unencrypted connections are rejected.
- connect_timeout=10: Fail fast if Hyper is unreachable.
- statement_timeout=300000: Kill queries that exceed 5 minutes. Prevents resource exhaustion from runaway queries.
- application_name=superset: Helps with debugging and monitoring; queries appear tagged in Hyper’s logs.
In the SQLAlchemy engine options, configure:
engine_kwargs = {
'pool_size': 30,
'max_overflow': 10,
'pool_recycle': 3600,
'pool_pre_ping': True,
'connect_args': {
'connect_timeout': 10,
'options': '-c statement_timeout=300000'
}
}
The pool_pre_ping option is critical: before reusing a connection from the pool, Superset sends a lightweight ping to Hyper. If the connection is stale, it’s discarded and a new one is opened. This prevents “connection lost” errors mid-query.
Multi-Tenant Isolation
For SaaS platforms using Superset, isolation is a security and performance requirement. There are three patterns:
- Schema-level isolation: Each tenant has its own schema within the same Hyper database. Superset enforces row-level security (RLS) to prevent cross-tenant leakage.
- Database-level isolation: Each tenant has its own Hyper database. Superset maintains separate database connections per tenant.
- Separate Hyper instances: Each tenant has a dedicated Hyper cluster. Maximum isolation but highest operational overhead.
For most SaaS deployments, schema-level isolation with RLS is the sweet spot. It minimises operational complexity whilst maintaining strong data boundaries. The Platform Development in New York and Platform Development in Toronto teams have deployed this pattern at scale across multi-tenant financial services and media platforms.
Query Performance Optimisation
Understanding Superset’s Query Flow
When a user clicks a dashboard or runs an ad-hoc query in Superset, the following happens:
- Superset generates SQL from the chart definition (dimensions, metrics, filters).
- The query is executed against Hyper (or checked against cache).
- Results are post-processed (sorting, pivoting, formatting).
- Results are cached for future requests.
- The chart is rendered client-side.
Performance bottlenecks can occur at any step. Most commonly:
- Slow SQL generation: Superset generates inefficient SQL (unnecessary joins, subqueries).
- Slow Hyper execution: The SQL is correct but Hyper is slow (missing indices, full table scans).
- Network latency: Results transfer is slow (large result sets, high latency to Hyper).
- Post-processing: Superset’s Python post-processing is slow (sorting millions of rows).
Query Optimisation Techniques
1. Leverage Hyper’s Columnar Format
Hyper is optimised for column-oriented workloads. Design your fact tables with:
- Narrow columns: Only include columns you actually query. Hyper reads entire columns into memory; extra columns waste memory and CPU.
- Appropriate data types: Use
INTinstead ofBIGINTwhen possible. UseDECIMAL(9,2)instead ofFLOATfor financial data (precision matters). - Partitioning by time: If your fact table has billions of rows, partition by date or month. Hyper can then prune partitions during query execution.
2. Pre-Aggregation and Materialised Views
Dashboards often run the same aggregations repeatedly. Instead of computing them on-the-fly, pre-compute them:
- Hourly aggregations: For high-cardinality dimensions (product ID, user ID), pre-compute hourly sums and counts. Queries against the aggregation table are 100–1000× faster.
- Materialised views: Use Hyper’s
CREATE TABLE ASto build materialised views. Refresh them on a schedule (e.g., every hour). - Incremental updates: For append-only fact tables, only compute aggregations for new data, then merge with previous results.
For Platform Development in Melbourne, we built a customer dashboard that originally ran 50 queries per page load. By pre-aggregating into hourly and daily rollups, we reduced that to 5 queries, cutting dashboard load time from 8 seconds to under 1 second.
3. SQL Optimisation
Superset generates SQL automatically, but it’s not always optimal. Common issues:
- Cartesian products: When joining multiple tables, Superset can generate unintended Cartesian products, multiplying row counts.
- Unnecessary subqueries: Superset wraps queries in subqueries for filtering; these can prevent query optimisation.
- Missing WHERE clauses: Ensure date filters are pushed down to the fact table, not applied in post-processing.
To debug, enable Superset’s SQL lab and inspect the generated SQL. Use Hyper’s EXPLAIN command to see the query plan:
EXPLAIN SELECT ... FROM ...
Look for sequential scans (slow) vs. index scans (fast). If Hyper is doing sequential scans on large tables, you may need to:
- Increase Hyper’s memory allocation (more memory = more indices built automatically).
- Explicitly create indices on frequently filtered columns.
- Partition the table to enable partition pruning.
4. Result Set Size
Superset caches query results in-memory (or in Redis). Large result sets consume cache memory and slow down network transfer. For dashboards:
- Limit to 10,000 rows: Most visualisations don’t need more. Set
SUPERSET_ROWS_LIMITto 10,000. - Use sampling for exploration: For ad-hoc queries returning millions of rows, use
TABLESAMPLEto return a representative sample. - Aggregate before visualisation: If a chart shows 1,000 data points, pre-aggregate to 1,000 rows in the database, not in Superset.
Monitoring Query Performance
Set up monitoring for:
- Query execution time: Median, p95, p99 latency. Track trends over time.
- Query count: How many queries per minute? Spikes indicate cache misses or new dashboards.
- Hyper connection pool: Active connections, queue depth, connection errors.
- Cache hit rate: What percentage of queries hit cache vs. execute against Hyper?
For Platform Development in Canberra and Platform Development in Washington, D.C., we instrument Superset with Prometheus metrics exported via a custom plugin. This allows alerting on:
- Query p99 latency > 5 seconds (indicates Hyper is struggling).
- Cache hit rate < 80% (indicates cache is undersized or thrashing).
- Connection pool exhaustion (indicates concurrency is too high).
Caching Strategies for Production
Understanding Superset’s Cache Layers
Superset has multiple caching layers:
- Query cache: Results of SQL queries executed against Hyper. Stored in Redis or in-memory.
- Metadata cache: Schema information (tables, columns, data types). Refreshed periodically.
- Thumbnail cache: Pre-rendered dashboard thumbnails.
- Data cache: For embedded analytics, cached chart data.
Query Cache Configuration
The query cache is critical for performance. Configure it in superset_config.py:
CACHE_CONFIG = {
'CACHE_TYPE': 'redis',
'CACHE_REDIS_URL': 'redis://redis-host:6379/0',
'CACHE_DEFAULT_TIMEOUT': 3600, # 1 hour
}
DATA_CACHE_CONFIG = {
'CACHE_TYPE': 'redis',
'CACHE_REDIS_URL': 'redis://redis-host:6379/1',
'CACHE_DEFAULT_TIMEOUT': 3600,
}
Key settings:
- CACHE_TYPE: Use Redis for production. In-memory caching doesn’t work across multiple Superset workers.
- CACHE_REDIS_URL: Point to a Redis cluster with sufficient memory. Rule of thumb: 10–20 GB for a team of 50+ analysts.
- CACHE_DEFAULT_TIMEOUT: How long before a cached result expires. Balance freshness vs. cache efficiency.
Cache Invalidation Strategies
Caching is hard. The biggest mistake teams make is caching stale data. Strategies:
- Time-based expiration: Cache expires after N seconds. Simple but can serve stale data.
- Event-based invalidation: When data is loaded into Hyper, trigger cache invalidation. Requires integration with your data pipeline.
- Manual invalidation: Analysts manually refresh cache when they know data has changed. Error-prone.
- Smart TTLs: Cache frequently-accessed queries longer; cache rarely-accessed queries shorter.
For dashboards with hourly data refreshes, a 1-hour cache TTL is typical. For real-time dashboards, reduce to 5–15 minutes. For exploratory SQL lab queries, use 30-minute TTLs.
The Feature Flags in Apache Superset and Preset blog post discusses advanced caching patterns, including conditional cache invalidation based on feature flags.
Redis Cluster Sizing
Redis memory is your cache capacity. Estimate:
- Average result set size: 100 KB per query result (rule of thumb).
- Queries per hour: If 1,000 unique queries run per hour, and each is cached for 1 hour, you need 100 MB just for query results.
- Metadata cache: Add 10–20% overhead for metadata.
- Headroom: Keep 20–30% of Redis memory free to avoid eviction.
For a team of 100 analysts with 5,000 queries/hour:
5,000 queries/hour × 100 KB/query ÷ 3,600 seconds/hour = 139 KB/second
Over 1 hour: 500 MB
With 20% metadata overhead: 600 MB
With 30% headroom: 600 MB ÷ 0.7 = 857 MB
Recommended Redis size: 2–4 GB
Monitor Redis memory usage and eviction rate. If evictions spike, increase Redis size or reduce cache TTL.
Operational Considerations
Deployment Architecture
A production Superset + Hyper deployment typically looks like:
┌─────────────────────────────────────────────────────────────┐
│ Load Balancer (ALB / NLB) │
└────────────────┬────────────────────────────────────────────┘
│
┌──────────┼──────────┐
│ │ │
┌──▼──┐ ┌──▼──┐ ┌──▼──┐
│ Web │ │ Web │ │ Web │ Superset Web Workers
│ 1 │ │ 2 │ │ 3 │ (4–8 per server)
└──┬──┘ └──┬──┘ └──┬──┘
│ │ │
└──────────┼──────────┘
│
┌──────────┴──────────┐
│ │
┌──▼──────────┐ ┌──────▼──┐
│ Query │ │ Metadata │
│ Workers │ │ Service │
│ (4–16) │ │ │
└──┬──────────┘ └──────┬───┘
│ │
└──────────┬──────────┘
│
┌──────────┴────────────────┐
│ │
┌──▼──────────┐ ┌───────────▼──┐
│ Connection │ │ Redis Cache │
│ Pool │ │ (2–4 GB) │
│ (PgBouncer) │ │ │
└──┬──────────┘ └──────────────┘
│
│
┌──▼──────────────────┐
│ Hyper Instance(s) │
│ (In-memory, HA) │
└─────────────────────┘
High Availability
For production deployments:
- Superset web/worker redundancy: Run at least 3 web workers and 3 query workers. Use a load balancer to distribute traffic.
- Hyper replication: Hyper supports replication. Run a primary + standby; on failure, promote the standby.
- Redis persistence: Enable Redis AOF (Append-Only File) or RDB snapshots. On restart, cache is restored, reducing query load.
- Metadata database: Store Superset metadata (dashboards, users, permissions) in a highly available PostgreSQL cluster (RDS Multi-AZ, or self-managed with streaming replication).
For Platform Development in Australia, we recommend:
- 3+ availability zones: Distribute Superset workers and Redis across AZs.
- Hyper primary in one AZ, standby in another: Automatic failover via a virtual IP (VIP) or DNS failover.
- Metadata database: Multi-AZ RDS or equivalent managed service.
Scaling Patterns
As user count grows, scale:
- Horizontal scaling of web workers: Add more Superset web workers behind the load balancer.
- Horizontal scaling of query workers: Add more async query workers. Use a task queue (Celery) to distribute queries.
- Vertical scaling of Hyper: Increase memory and CPU. Hyper scales well vertically; 256 GB of RAM is not uncommon for large deployments.
- Redis scaling: Add Redis replicas for read-heavy workloads (cache reads). Use Redis Cluster for write-heavy workloads.
For Platform Development in Austin and Platform Development in Dallas, we’ve scaled Superset + Hyper to support 500+ concurrent users and 100,000+ queries/day. The bottleneck is typically Hyper’s memory (not CPU) and Redis throughput (not capacity).
Monitoring and Alerting
Instrument:
- Superset metrics: Query latency, cache hit rate, user count, dashboard views.
- Hyper metrics: Connection count, query latency, memory usage, CPU usage.
- Redis metrics: Memory usage, eviction rate, command latency.
- Infrastructure: Server CPU, memory, disk I/O, network.
Set up alerts:
- Query p99 latency > 5 seconds.
- Cache hit rate < 80%.
- Hyper memory usage > 90%.
- Redis eviction rate > 0 (any evictions indicate undersizing).
- Connection pool exhaustion.
Security and Compliance
Authentication and Authorisation
Superset supports multiple authentication backends:
- LDAP/Active Directory: Integrate with corporate identity.
- OAuth 2.0: Integrate with Google, GitHub, or custom OAuth providers.
- SAML 2.0: Integrate with Okta, Azure AD, or other SAML IdPs.
- Database: Store users and passwords in Superset’s metadata database (not recommended for production).
For Platform Development in Ottawa and Platform Development in Wellington, we typically recommend OAuth 2.0 or SAML 2.0 for sovereign cloud deployments, with strict data residency requirements.
Authorisation is role-based:
- Admin: Full access to all dashboards, datasets, and settings.
- Alpha: Can create and edit dashboards and datasets.
- Gamma: Can view dashboards and run SQL lab queries (read-only).
- Public: Anonymous access (for embedded dashboards).
For multi-tenant deployments, use row-level security (RLS) to enforce data boundaries at the database level, not in Superset.
Audit Logging
Enable audit logging to track:
- Who accessed which dashboards and when.
- Who modified which dashboards and when.
- Which queries were executed and by whom.
- Failed authentication attempts.
Store audit logs in a secure, immutable location (e.g., S3 with object lock, or a dedicated audit log database). Retain for at least 1 year.
Data Encryption
- In transit: Use TLS 1.2+ for all connections (Superset to Hyper, Superset to Redis, user browsers to Superset).
- At rest: Encrypt Hyper data files, Redis data, and metadata database. Use cloud provider encryption (S3 encryption, RDS encryption) or self-managed encryption (LUKS, dm-crypt).
For regulated industries (financial services, healthcare, government), encryption is non-negotiable. For Platform Development in New York, we work with teams pursuing SOC 2 compliance. Superset + Hyper is SOC 2-compatible when deployed with appropriate controls (encryption, access logging, change management).
Compliance Frameworks
Superset deployments can support:
- SOC 2 Type II: Requires controls around access, change management, and audit logging. Achievable with proper configuration.
- ISO 27001: Requires an information security management system. Superset is a component; the full system must be certified.
- HIPAA: Requires encryption, access controls, and audit logging. Achievable but requires careful design.
- PCI DSS: For payment card data. Requires network segmentation, encryption, and access controls.
- GDPR / Privacy Act: Requires data minimisation, user consent, and the right to be forgotten. Design your Superset datasets accordingly.
For compliance, engage with your security and legal teams early. The Services | PADISO page outlines how we support compliance via platform engineering and security audits.
Real-World Deployment Lessons
Lesson 1: Connection Pool Exhaustion
The problem: A customer deployed Superset with a connection pool of 10 connections. During peak hours, dashboards started timing out. Investigation revealed the pool was exhausted; new requests were queued indefinitely.
The fix: Increased pool size to 40 connections. Added monitoring for active connections and queue depth. Within a week, timeouts disappeared.
Takeaway: Right-size your connection pool from day one. Monitor it continuously. It’s easier to add connections than to debug cascade failures.
Lesson 2: Cache Thrashing
The problem: A customer had a 500 MB Redis cache but was evicting 10% of keys per hour. Dashboards felt slow; queries were re-executing instead of hitting cache.
The fix: Analysed cache hit patterns. Found that exploratory SQL lab queries (run once, never again) were filling the cache. Implemented separate cache pools: 300 MB for dashboard queries (1-hour TTL), 100 MB for lab queries (5-minute TTL). Hit rate jumped to 95%.
Takeaway: Monitor cache eviction rate. If it’s > 1%, increase cache size or reduce TTL. Use separate cache pools for different query types.
Lesson 3: Runaway Queries
The problem: An analyst ran a query that joined three large tables without a WHERE clause. The query consumed 90% of Hyper’s CPU for 30 minutes, blocking all other queries.
The fix: Set statement_timeout=300000 (5 minutes) at the connection level. The runaway query was killed after 5 minutes. Added a dashboard showing top 10 longest-running queries; analysts now self-police.
Takeaway: Always set statement timeouts. Monitor query execution times. Educate analysts on query best practices (filter early, aggregate in the database, not in post-processing).
Lesson 4: Metadata Refresh Lag
The problem: A customer added a new table to Hyper. It didn’t appear in Superset’s dataset list for 2 hours. Users thought it was broken.
The fix: Metadata is cached for 1 hour by default. Reduced to 5 minutes. Also educated users: metadata refresh happens on a schedule; if they need immediate access, they can manually refresh via the UI.
Takeaway: Document cache TTLs. Provide a manual refresh button for impatient users. Consider event-driven metadata refresh if your data pipeline can trigger it.
Lesson 5: Hyper Memory Exhaustion
The problem: A customer loaded 500 GB of historical data into Hyper, but Hyper only had 256 GB of memory. Query performance degraded as Hyper spilled to disk.
The fix: Partitioned data by year. Only kept the last 2 years in Hyper (100 GB). Archived older data to S3. Queries against recent data were fast; analysts could query historical data via Athena if needed.
Takeaway: Don’t load everything into Hyper. Be selective. Partition by time. Archive cold data. Hyper’s power is in-memory speed; use it for hot data only.
Migration and Implementation
Pre-Migration Assessment
Before migrating from a legacy BI tool (Tableau, Looker, Power BI):
- Inventory dashboards: How many? Which are actively used?
- Identify data sources: Which databases or data warehouses do dashboards connect to?
- Assess complexity: How many custom calculations, filters, and drill-downs per dashboard?
- Gauge user adoption: Are users actually using the BI tool, or is it mostly unused?
- Evaluate cost: What are you currently paying for per-seat licenses?
For a typical customer, migration takes 8–12 weeks:
- Weeks 1–2: Infrastructure setup (Superset, Hyper, Redis).
- Weeks 3–4: Data pipeline integration (load data into Hyper).
- Weeks 5–8: Dashboard migration (rebuild dashboards in Superset).
- Weeks 9–10: Testing and tuning (performance optimisation, caching).
- Weeks 11–12: Rollout (user training, cutover).
Building Your First Dashboards
Start with high-impact, low-complexity dashboards:
- Executive dashboard: Key metrics (revenue, users, churn). 5–10 charts. Usually fast to build.
- Operational dashboard: Daily/weekly metrics for a specific team (sales, operations). 10–20 charts.
- Exploratory datasets: Expose raw tables for analysts to explore ad-hoc.
Use Superset’s Apache Superset Documentation and Preset Docs to learn the UI and SQL lab. Build dashboards incrementally; don’t try to migrate everything at once.
Training and Adoption
- Dashboard users: 1-hour training on how to use filters, drill-downs, and export data.
- Dashboard creators: 4-hour training on building dashboards, writing SQL, and best practices.
- Administrators: 8-hour training on deployment, scaling, security, and troubleshooting.
Provide documentation and templates. Create a Slack channel for questions. Celebrate quick wins (“We cut BI costs by 60%!”).
Conclusion and Next Steps
Apache Superset + Hyper is a powerful, cost-effective alternative to legacy BI platforms. When deployed with the patterns outlined in this guide—connection pooling, intelligent caching, query optimisation, and proper monitoring—it scales to support thousands of users and billions of rows of data.
The key takeaways:
- Connection pooling is non-negotiable: Use PgBouncer or equivalent. Right-size the pool. Monitor it.
- Cache aggressively, but invalidate carefully: Redis is your friend. Monitor hit rates and eviction.
- Optimise queries at the database level: Pre-aggregate, partition, and index in Hyper. Don’t rely on post-processing in Superset.
- Monitor everything: Query latency, cache hit rate, connection pool, Hyper memory. Set up alerts.
- Start small, iterate quickly: Migrate one dashboard at a time. Learn from real usage patterns.
If you’re building a data platform in Sydney, Melbourne, or across Australia, we’ve deployed this stack at scale. For teams in New York, Washington DC, Toronto, or elsewhere, the patterns are identical. Platform Development in Sydney, Platform Development in Melbourne, and Platform Development in Australia teams at PADISO have shipped Superset + Hyper for financial services, retail, media, and government customers.
Ready to move forward? Here’s what to do next:
- Assess your current BI stack: What are you using? What’s working? What’s not?
- Estimate your data volume and query patterns: How many rows? How many queries per day? What’s the current latency?
- Plan your infrastructure: How many Superset workers? How much Hyper memory? How much Redis?
- Start with a pilot: Build one dashboard. Measure performance. Iterate.
- Engage with your team: Get buy-in from analysts, engineers, and leadership. Training and adoption are critical.
For guidance on architecture, deployment, or optimisation, the Services | PADISO team offers CTO as a Service and Platform Design & Engineering support. We can help you design, build, and operate Superset + Hyper at scale.
For more on how we approach data platforms and analytics infrastructure, explore Platform Development in Chicago, Platform Development in Austin, and Platform Development in Dallas. Each city has unique requirements; we tailor our approach accordingly.
The future of BI is open source, cost-efficient, and fast. Superset + Hyper is how you get there.