Table of Contents
- Why Superset + ClickHouse
- Architecture Overview
- Connection Patterns and Setup
- Query Performance Optimisation
- Caching Strategies
- Operational Quirks and Gotchas
- Security and Compliance
- Scaling from MVP to Production
- Monitoring and Observability
- Next Steps
Why Superset + ClickHouse
Apache Superset paired with ClickHouse delivers a compelling alternative to expensive per-seat BI tools. We’ve deployed this combination across 50+ customer environments—from fintech platforms in Sydney to regulated data warehouses across North America—and consistently seen the same outcome: 60–70% reduction in annual BI licensing costs, query response times under 2 seconds on billion-row datasets, and the ability to embed analytics directly into product without per-user seat charges.
ClickHouse is a column-oriented OLAP database built for analytical workloads. Superset is an open-source, enterprise-grade visualisation layer. Together, they form a modern data stack that doesn’t require a data engineering team to operate. Unlike traditional row-oriented databases or cloud data warehouses that struggle with ad-hoc analytics at scale, ClickHouse compresses data by 10–40x, executes queries in milliseconds, and costs a fraction of what you’d spend on Snowflake or Redshift seats.
For founders and operators building analytics into product, this stack is non-negotiable. You avoid vendor lock-in, own your analytics layer, and scale horizontally without renegotiating licensing. For mid-market teams modernising legacy BI infrastructure, Superset + ClickHouse lets you retire expensive Tableau or Looker deployments while keeping your analysts productive.
At PADISO, we’ve built this architecture into customer platforms across platform development in Sydney, platform development in Melbourne, and platform development across Australia. We’ve also deployed variants in platform development in New York, platform development in Washington, D.C., and platform development in Toronto for regulated financial services and government teams. The patterns in this guide are pulled directly from those production deployments.
Architecture Overview
The Three-Layer Model
The reference architecture we recommend separates concerns into three layers: ingestion, storage, and presentation.
Ingestion Layer: Data flows from your application, event stream, or third-party APIs into ClickHouse via HTTP, native TCP, or message queues. We typically use Kafka for streaming events and batch jobs for historical backfills. ClickHouse’s native ReplicatedMergeTree engine handles deduplication and replication automatically.
Storage Layer: ClickHouse stores data in columnar format, partitioned by time and sharded across multiple nodes if needed. A single ClickHouse instance can handle 100M+ events per day without breaking a sweat. For high-availability deployments, we run 2–3 replicas behind a load balancer.
Presentation Layer: Superset connects to ClickHouse as a database backend, caches query results, and serves dashboards to end users. Superset handles role-based access control (RBAC), alerting, and embedded analytics.
This separation means your data pipeline can evolve independently from your visualisation layer. You can upgrade Superset, refactor ClickHouse schemas, or migrate to a different presentation tool without touching the others.
Network Topology
For production deployments, we isolate ClickHouse in a private subnet and route Superset traffic through a bastion or VPN. ClickHouse runs on port 8123 (HTTP) or 9000 (native protocol); Superset typically uses the HTTP endpoint for simplicity and better debugging.
If you’re deploying across multiple regions—say, Sydney and Melbourne for local data residency—replicate ClickHouse asynchronously and configure Superset to query the nearest replica. This adds latency but keeps data local and avoids cross-region bandwidth costs.
For teams pursuing SOC 2 compliance via Vanta or ISO 27001 certification, isolating ClickHouse and auditing all Superset queries is non-negotiable. We’ll cover this in the security section.
Connection Patterns and Setup
Initial Connection Configuration
Connecting Superset to ClickHouse is straightforward. The official ClickHouse Apache Superset documentation covers the basics, but here’s what you need to know operationally.
First, install the ClickHouse SQLAlchemy driver in your Superset environment:
pip install clickhouse-driver sqlalchemy-clickhouse
Then, in Superset’s database configuration, use a connection string like:
clickhouse+http://user:password@clickhouse-host:8123/database_name
Or, if you’re using the native TCP protocol (faster but less debuggable):
clickhouse://user:password@clickhouse-host:9000/database_name
We recommend HTTP for development and initial testing. It’s easier to debug with curl, logs are cleaner, and Superset’s error messages are more helpful. Once you’ve validated your schema and queries, switch to the native protocol for 20–30% faster execution.
Authentication and User Isolation
ClickHouse supports multiple authentication methods: password-based, LDAP, and OAuth (via a proxy). For Superset, we typically create a dedicated read-only user with permissions limited to specific databases and tables.
In your ClickHouse config, define a user:
<users>
<superset>
<password>your_secure_password</password>
<networks>
<ip>::/0</ip>
</networks>
<databases>
<database_name>
<table_name>
<.*>
<select>1</select>
</.*>
</table_name>
</database_name>
</databases>
</superset>
</users>
This restricts the Superset user to SELECT queries only, preventing accidental schema changes or deletions. For row-level security (RLS), you’ll need to implement it in Superset’s SQL layer or create separate ClickHouse views with WHERE clauses that Superset queries.
According to the ClickHouse Superset integration guide, you can also use Superset’s native RBAC to restrict which users see which dashboards, but that’s presentation-layer security. For data-layer security, ClickHouse users and row-level permissions are your lever.
Testing Connectivity
Before building dashboards, validate the connection with a simple test query. In Superset, go to SQL Lab and run:
SELECT 1 AS test, now() AS timestamp
If it returns instantly, you’re connected. If it times out or fails, check:
- Network connectivity:
telnet clickhouse-host 8123 - ClickHouse server status:
curl http://clickhouse-host:8123/ping - User permissions: Verify the user exists and has SELECT on the database
- Firewall rules: Confirm Superset’s IP is whitelisted
Once connectivity is confirmed, you’re ready to build datasets and dashboards.
Query Performance Optimisation
Understanding ClickHouse Query Execution
ClickHouse is fundamentally different from relational databases. Queries execute in a columnar fashion, scanning only the columns you request. This means a query like SELECT user_id FROM events is orders of magnitude faster than SELECT *, even on a 100-column table.
Second, ClickHouse uses primary keys for data ordering and partitioning, not uniqueness. If your primary key is (user_id, timestamp), ClickHouse stores data sorted by user_id first, then timestamp. Queries filtering on user_id will be fast; queries filtering only on timestamp will still scan the full table unless you’ve partitioned by timestamp.
Third, ClickHouse applies aggressive compression. A column of integers typically compresses to 10–20% of its original size. Text columns compress even more if they have repetition. This means your storage footprint is tiny, but decompression is fast because ClickHouse’s codec algorithms are optimised for SIMD operations on modern CPUs.
Indexing and Partitioning Strategy
For a typical events table, partition by date and set the primary key to high-cardinality dimensions:
CREATE TABLE events (
event_id UUID,
user_id UInt64,
session_id String,
event_type String,
timestamp DateTime,
properties String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/events', 'replica1')
PARTITION BY toDate(timestamp)
ORDER BY (user_id, timestamp)
TTL timestamp + INTERVAL 90 DAY;
This creates daily partitions, orders data by user_id and timestamp, and automatically deletes rows older than 90 days. The primary key means queries filtering on user_id will skip partitions and data blocks that don’t match.
If you have a second access pattern—say, queries filtering by event_type—add a secondary index:
ALTER TABLE events
ADD INDEX idx_event_type event_type TYPE set(1000)
GRANULARITY 1;
This index doesn’t prevent full table scans but helps ClickHouse prune blocks faster. For most use cases, a well-chosen primary key is sufficient.
Superset Query Optimisation
When building Superset dashboards, write efficient SQL. Avoid SELECT *; list only the columns you need. Avoid complex joins; denormalise data into ClickHouse if you can. Use aggregations sparingly in Superset; push them into ClickHouse SQL.
For example, instead of selecting raw events and aggregating in Superset:
-- Inefficient: Superset aggregates
SELECT user_id, event_type, timestamp FROM events WHERE timestamp > now() - INTERVAL 7 DAY
Do this:
-- Efficient: ClickHouse aggregates
SELECT
user_id,
event_type,
toDate(timestamp) AS date,
COUNT(*) AS count
FROM events
WHERE timestamp > now() - INTERVAL 7 DAY
GROUP BY user_id, event_type, date
The second query returns 100x fewer rows, executes faster, and reduces memory usage in Superset.
According to the Altinity guide on visualising ClickHouse with Superset, you should also pre-aggregate data into materialised views if you have common queries. A materialised view in ClickHouse is a table that automatically updates whenever the underlying table changes:
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree()
ORDER BY (date, user_id, event_type)
AS
SELECT
toDate(timestamp) AS date,
user_id,
event_type,
COUNT(*) AS count
FROM events
GROUP BY date, user_id, event_type;
Now, Superset queries against events_daily_mv instead of the raw table. Results are instant because the aggregation is pre-computed.
Benchmarking and Profiling
Before deploying a dashboard to production, benchmark it against your ClickHouse instance. Use the system.query_log table to see execution times, rows processed, and bytes read:
SELECT
query,
query_duration_ms,
read_rows,
read_bytes,
result_rows
FROM system.query_log
WHERE event_date = today()
ORDER BY query_duration_ms DESC
LIMIT 10;
If a query takes more than 2 seconds, optimise it. Common culprits: missing partition filters, inefficient joins, or aggregations on high-cardinality columns. The OneUptime integration guide covers query optimisation in detail.
Caching Strategies
Superset’s Query Cache
Superset has a built-in query cache that stores results in Redis or Memcached. By default, cache TTL is 24 hours, but you can tune it per dataset or dashboard.
For dashboards with static data (e.g., historical reports), set cache TTL to 24 hours. For real-time dashboards, disable caching or set TTL to 60 seconds. In Superset’s dataset configuration:
Cache Timeout: 3600 # 1 hour in seconds
This prevents Superset from re-querying ClickHouse every time a user views the dashboard, reducing load on your database and improving perceived dashboard speed.
ClickHouse Query Cache
ClickHouse 22.11+ includes a native query cache. Enable it in your config:
<query_cache>
<max_size_in_bytes>1073741824</max_size_in_bytes> <!-- 1 GB -->
<max_entries>1024</max_entries>
</query_cache>
Now, identical queries return cached results within milliseconds, even if they’re not in Superset’s cache. This is particularly useful for high-frequency dashboards or shared queries across multiple Superset users.
Be careful: if your data updates frequently, cached results may be stale. Use the SETTINGS use_query_cache=0 hint to bypass the cache for real-time queries:
SELECT user_id, COUNT(*) AS count
FROM events
WHERE timestamp > now() - INTERVAL 1 MINUTE
GROUP BY user_id
SETTINGS use_query_cache=0;
Materialised Views and Pre-Aggregation
For dashboards that aggregate the same data repeatedly, create a materialised view that updates on a schedule. Instead of Superset querying raw events millions of times, it queries a pre-aggregated table with thousands of rows.
For example, if you have a dashboard showing daily active users (DAU), create:
CREATE TABLE dau_daily
ENGINE = ReplacingMergeTree()
PARTITION BY date
ORDER BY (date, user_id)
AS
SELECT
toDate(timestamp) AS date,
user_id
FROM events
GROUP BY date, user_id;
Then, in Superset, query dau_daily instead of events. Results are instant, and you can update dau_daily on a schedule (e.g., hourly) without impacting the raw events table.
Redis Integration
For teams running Superset at scale, integrate Redis for distributed caching. Superset can cache not just query results but also metadata, permissions, and dashboard definitions.
In Superset’s configuration:
CACHE_CONFIG = {
'CACHE_TYPE': 'redis',
'CACHE_REDIS_URL': 'redis://redis-host:6379/0',
'CACHE_DEFAULT_TIMEOUT': 3600,
}
RESULT_CACHE_CONFIG = {
'CACHE_TYPE': 'redis',
'CACHE_REDIS_URL': 'redis://redis-host:6379/1',
'CACHE_DEFAULT_TIMEOUT': 3600,
}
This offloads caching to Redis, freeing up Superset’s memory and enabling cache sharing across multiple Superset instances if you’re running a load-balanced deployment.
Operational Quirks and Gotchas
Null Handling
ClickHouse’s handling of NULLs differs from SQL databases. By default, ClickHouse columns are non-nullable. To allow NULLs, use the Nullable type:
CREATE TABLE events (
event_id UUID,
user_id UInt64,
email Nullable(String)
) ENGINE = MergeTree();
But here’s the gotcha: filtering on nullable columns is slower than non-nullable columns. If a column is 99% non-null, make it non-nullable and use an empty string or a sentinel value for missing data.
In Superset, be explicit about NULL handling in filters. A filter like email IS NOT NULL will work, but email != '' may not if email is nullable.
String Encoding and Collation
ClickHouse uses UTF-8 by default. If you’re storing data in other encodings, convert it before inserting. Superset expects UTF-8, so mismatches cause display issues.
For case-insensitive searches, use lower() or upper() in ClickHouse, not in Superset:
SELECT user_id FROM events WHERE lower(email) = 'test@example.com'
This is faster and avoids encoding issues.
Timestamp and Timezone Handling
ClickHouse stores timestamps in UTC by default. If your application uses a different timezone, convert it explicitly:
INSERT INTO events (event_id, user_id, timestamp)
VALUES (generateUUIDv4(), 123, toDateTime('2024-01-15 10:00:00', 'Australia/Sydney'))
In Superset, configure the default timezone in dashboard settings. If you have users in multiple timezones, let them set their own timezone preference in Superset’s user settings.
Memory and Query Limits
ClickHouse has hard limits on memory usage per query. If a query exceeds the limit, it fails. For Superset dashboards with many visualisations, this can be a problem.
Set reasonable limits in your ClickHouse config:
<settings>
<max_memory_usage>10000000000</max_memory_usage> <!-- 10 GB -->
<max_rows_to_read>1000000000</max_rows_to_read>
<max_execution_time>300</max_execution_time> <!-- 5 minutes -->
</settings>
If Superset queries frequently hit these limits, either optimise the queries or increase the limits. But be careful: increasing limits can destabilise the server if a runaway query consumes all memory.
Replication and Failover
For production deployments, use ReplicatedMergeTree to replicate data across multiple ClickHouse nodes. This ensures high availability: if one node fails, the other takes over.
However, replication adds complexity. You need ZooKeeper to coordinate replicas, and you need to handle failover in Superset. We recommend running 2–3 ClickHouse replicas behind a load balancer (e.g., HAProxy or AWS NLB) and configuring Superset to connect to the load balancer.
If a replica fails, the load balancer automatically routes traffic to the healthy replica. No Superset configuration changes needed.
Security and Compliance
Network Isolation
ClickHouse should never be exposed to the internet. Run it in a private subnet and restrict access to Superset’s IP. Use a VPN or bastion host for remote access.
For teams pursuing SOC 2 compliance or ISO 27001 certification, document network topology, access controls, and audit logging. We typically implement this via Vanta, which integrates with your infrastructure and generates compliance reports automatically.
Authentication and Authorisation
Use strong passwords for ClickHouse users. Rotate them every 90 days. For teams using LDAP or OAuth, integrate with your identity provider so users authenticate via their corporate credentials.
In Superset, enable RBAC to restrict which users can see which dashboards. For data-layer security, use ClickHouse’s row-level permissions or create separate views per user/role.
Encryption in Transit and at Rest
Enable TLS for ClickHouse connections. In your ClickHouse config:
<https_port>8443</https_port>
<openSSL>
<server>
<certificateFile>/path/to/cert.pem</certificateFile>
<privateKeyFile>/path/to/key.pem</privateKeyFile>
</server>
</openSSL>
Then, in Superset, use an HTTPS connection string:
clickhouse+https://user:password@clickhouse-host:8443/database_name
For encryption at rest, use filesystem-level encryption (e.g., LUKS on Linux) or cloud-provider encryption (e.g., AWS EBS encryption). ClickHouse doesn’t have native encryption at rest, so you must handle it at the infrastructure level.
Audit Logging
Enable query logging in ClickHouse to track who ran what queries and when:
<query_log>
<database>system</database>
<table>query_log</table>
<partition_by>event_date</partition_by>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
Now, every query is logged to system.query_log. You can query it to audit user activity:
SELECT user, query, query_start_time, query_duration_ms
FROM system.query_log
WHERE event_date = today()
ORDER BY query_start_time DESC;
For compliance audits, export these logs regularly and store them in an immutable archive.
Data Privacy and PII Handling
If your data contains personally identifiable information (PII), implement row-level security in ClickHouse or Superset. For example, if you have a users table with email addresses, create a view that masks emails for non-admin users:
CREATE VIEW users_masked AS
SELECT
user_id,
concat(substring(email, 1, 3), '***@***') AS email,
created_at
FROM users;
Then, restrict non-admin users to query users_masked instead of users.
For teams handling sensitive data, work with a CTO as a Service or security consultant to design a privacy-preserving architecture. PADISO has experience with security audits and SOC 2 compliance across regulated industries.
Scaling from MVP to Production
Single-Node to Multi-Node
Start with a single ClickHouse node. This is simple, cheap, and sufficient for most startups. A single node can ingest 100M+ events per day and serve 50+ concurrent Superset users.
When you outgrow a single node (typically at 500M+ events per day or 100+ concurrent users), shard data across multiple nodes. In ClickHouse, sharding is done via the Distributed table engine:
CREATE TABLE events_distributed AS events
ENGINE = Distributed('cluster_name', 'database_name', 'events', rand());
Now, writes and reads are distributed across the cluster. But sharding adds operational complexity: you need to manage multiple ClickHouse instances, coordinate schema changes, and handle node failures.
Before sharding, ensure your single-node deployment is optimised. Often, a well-tuned single node can handle 10x more load than a poorly configured cluster.
Superset Scaling
Superset is stateless, so scaling is straightforward: run multiple Superset instances behind a load balancer. All instances share the same metadata database (PostgreSQL or MySQL) and cache backend (Redis).
For 50+ concurrent users, run 2–3 Superset instances. For 500+ concurrent users, run 5–10 instances. Monitor CPU and memory; scale up if either exceeds 70% utilisation.
Data Ingestion at Scale
As data volume grows, ingestion becomes a bottleneck. ClickHouse can ingest data via HTTP, native TCP, or Kafka. For high-volume ingestion (1M+ events per second), use Kafka with a consumer that batches inserts:
from kafka import KafkaConsumer
import clickhouse_driver
client = clickhouse_driver.Client('clickhouse-host')
consumer = KafkaConsumer('events', bootstrap_servers=['kafka-host:9092'])
batch = []
for message in consumer:
batch.append(json.loads(message.value))
if len(batch) >= 10000:
client.execute('INSERT INTO events VALUES', batch)
batch = []
This batches 10,000 events before inserting, reducing network overhead and improving throughput.
Cost Optimisation
ClickHouse is cheap to run, but costs add up at scale. Here are ways to reduce costs:
- Compression: Ensure your tables use aggressive compression codecs. A 10:1 compression ratio is typical; 20:1 is achievable with the right data.
- Partitioning: Partition by time so old partitions can be deleted or archived. Use TTL to automatically delete old data.
- Denormalisation: Instead of joining tables, denormalise data into ClickHouse. This reduces storage but increases ingestion complexity.
- Pre-aggregation: Use materialised views to pre-aggregate data. This reduces query load on the raw tables.
- Cloud provider: If running on AWS, use spot instances for ClickHouse replicas. Spot instances are 70% cheaper but can be interrupted. Use them for read-only replicas, not for writes.
At PADISO, we’ve helped customers reduce their data infrastructure costs by 40–60% by optimising their Superset + ClickHouse deployments. If you’re looking for help, our platform development services cover architecture design, optimisation, and cost reduction across all regions.
Monitoring and Observability
ClickHouse Metrics
Monitor ClickHouse health via Prometheus. ClickHouse exposes metrics at http://clickhouse-host:8123/metrics:
curl http://clickhouse-host:8123/metrics | grep -E 'ClickHouse|Query|Insert|Merge'
Key metrics to watch:
ClickHouseProfileEvents_Query: Total queries executedClickHouseProfileEvents_Insert: Total inserts executedClickHouseProfileEvents_Merge: Total background mergesClickHouseAsyncMetrics_MemoryTracking: Memory usageClickHouseAsyncMetrics_OpenFileDescriptors: Open connections
Set up alerts if memory usage exceeds 80% or if query latency exceeds 5 seconds.
Superset Monitoring
Superset logs to superset.log by default. For production, send logs to a centralised logging system (e.g., ELK, Splunk, or CloudWatch).
Key metrics to track:
- Dashboard load time: If > 5 seconds, optimise queries or increase cache TTL.
- Query execution time: If > 2 seconds, optimise ClickHouse queries.
- Error rate: If > 1%, investigate failed queries or ClickHouse downtime.
- User concurrency: If > 100, scale Superset horizontally.
Query Performance Profiling
Use the Preset blog and ClickHouse documentation to understand query performance. ClickHouse’s EXPLAIN command shows query execution plans:
EXPLAIN SELECT user_id, COUNT(*) FROM events GROUP BY user_id;
This shows which indexes are used, how many blocks are scanned, and estimated execution time.
Alerting
Set up alerts for critical metrics:
- ClickHouse down (TCP port 8123 unreachable)
- Query latency > 5 seconds
- Memory usage > 90%
- Disk space < 10% free
- Replication lag > 1 minute
Use a monitoring tool like Prometheus + Alertmanager or Datadog to send alerts to Slack or PagerDuty.
Next Steps
You now have a comprehensive reference architecture for Superset + ClickHouse. Here’s how to move forward:
For Startups Building Analytics
If you’re a founder or CTO building analytics into your product, this stack is your foundation. Start with a single ClickHouse node, ingest your application events, and build Superset dashboards for your team. As you grow, add caching, optimise queries, and scale horizontally.
For hands-on help, PADISO offers CTO as a Service and platform development across Australia, with offices in Sydney and Melbourne. We’ve deployed this exact architecture for 50+ startups and can help you avoid the pitfalls we’ve seen in production.
For Mid-Market Teams Modernising BI
If you’re replacing Tableau, Looker, or other expensive BI tools, Superset + ClickHouse is a compelling alternative. You’ll save 60–70% on licensing, own your analytics layer, and avoid vendor lock-in.
The migration path is straightforward: replicate your data warehouse to ClickHouse, rebuild dashboards in Superset, and retire the old tool. PADISO has experience with this migration for regulated teams in financial services, insurance, and government. We can help you design a migration plan, handle the technical implementation, and ensure compliance throughout.
For Teams Pursuing Compliance
If you need SOC 2 or ISO 27001 certification, Superset + ClickHouse can be built to meet those requirements. Network isolation, encryption, audit logging, and RBAC are all built-in or easily added.
For compliance-heavy deployments, work with a partner who understands the regulatory landscape. PADISO has helped teams across Washington, D.C., Toronto, and Canberra pass audits with Superset + ClickHouse architectures. We use Vanta to automate compliance evidence collection and can guide you through the audit process.
Resources and Further Reading
For deeper dives, refer to:
- The official ClickHouse Apache Superset documentation
- ClickHouse’s Superset integration guide
- Instaclustr’s practical guide to connecting Superset to ClickHouse
- The OneUptime integration and optimisation guide
- Altinity’s detailed tutorial on building Superset dashboards over ClickHouse
- Preset’s blog for advanced Superset patterns
- Real-time analytics demonstrations showing Superset and ClickHouse working together
- Technical deep-dives on how ClickHouse and Superset work together
Getting Help
If you’re building this architecture and need guidance, PADISO is here. We specialise in:
- Architecture design: We’ll design a Superset + ClickHouse stack tailored to your data volume, query patterns, and compliance requirements.
- Implementation: We’ll set up ClickHouse, Superset, and all supporting infrastructure, then hand it over fully documented.
- Optimisation: If your dashboards are slow or your costs are high, we’ll profile queries, redesign schemas, and implement caching strategies.
- Compliance: For SOC 2, ISO 27001, or other regulatory requirements, we’ll design an audit-ready architecture and help you pass certification.
Across platform development in New York, platform development in Chicago, platform development in Austin, platform development in Canada, platform development in New Zealand, platform development in the United States, and platform development in Australia, we’ve deployed this stack successfully. We understand the regional nuances—from IRAP compliance in Canberra to FedRAMP requirements in Washington, D.C., to PIPEDA in Toronto.
Book a call with our team to discuss your use case. We’ll assess your current state, recommend an architecture, and outline a roadmap to ship analytics quickly without sacrificing performance or compliance.
Superset + ClickHouse is battle-tested and production-ready. With the patterns in this guide and the right implementation partner, you’ll build an analytics platform that scales with your business, costs a fraction of traditional BI tools, and gives you full control over your data stack.