Table of Contents
- Why Superset + Iceberg Demands Intentional Tuning
- Understanding the Architecture
- Iceberg Table Layout and Partitioning Strategy
- File Sizing and Compaction Patterns
- Superset Configuration for Query Performance
- Caching, Materialized Views, and Query Acceleration
- Monitoring, Observability, and Cost Control
- Common Bottlenecks and How to Fix Them
- Operational Habits and Maintenance Routines
- Real-World Benchmarks and Case Studies
- Next Steps: Building Your Performance Roadmap
Why Superset + Iceberg Demands Intentional Tuning
Apache Superset is a powerful, open-source data visualisation and BI platform that works beautifully with modern data lakes. Apache Iceberg is a table format built for reliability, performance, and scale at the lakehouse layer. Together, they form a compelling stack: cost-effective, flexible, and audit-ready. But like any distributed system, the combination requires deliberate configuration and operational discipline to deliver the sub-second query latency and predictable costs that stakeholders expect.
The core challenge is this: Iceberg’s versioning, time-travel, and schema evolution capabilities come with metadata overhead. Superset’s rich query engine and dashboard refresh cycles can amplify that overhead if you’re not intentional about table layout, file sizing, and query planning. Many teams deploy Superset + Iceberg, launch dashboards, and then watch query times creep from 2 seconds to 20 seconds as data volume grows. Others see cloud costs balloon because every dashboard refresh scans gigabytes of unfiltered data.
This guide is built on the operational habits and configuration patterns we’ve refined across dozens of production deployments. We’ll walk through concrete tuning steps, benchmarks, and the monitoring discipline that separates fast, cost-effective Superset + Iceberg stacks from slow, expensive ones.
Understanding the Architecture
How Superset Queries Iceberg
When you create a dashboard in Superset, each visualisation issues a SQL query against your data warehouse or data lake. Superset doesn’t “know” it’s querying Iceberg—it sends standard SQL to a query engine (Trino, Spark, Presto, or Athena), and that engine handles the Iceberg metadata and file discovery.
The query flow looks like this:
- Dashboard refresh or user interaction → Superset generates SQL (often parameterised by filters or time ranges).
- Query engine receives SQL → The engine (Trino, Spark, etc.) parses the query and consults the Iceberg metadata layer.
- Metadata layer → Iceberg’s metadata files (manifests, snapshots) tell the engine which data files are relevant for this query.
- File pruning → The engine discards files that don’t match the query’s predicates (e.g., date ranges, partition keys).
- Execution → The engine reads only the necessary files from object storage (S3, GCS, ADLS) and executes the query.
- Results back to Superset → Superset renders the results in the dashboard.
Performance hinges on step 3 and 4: if your Iceberg table is poorly partitioned or has too many small files, metadata discovery takes longer, and file pruning is ineffective. If your Superset queries are unoptimised (e.g., selecting all columns, no time-range filters), the engine reads far more data than necessary.
The Role of the Query Engine
Your choice of query engine—Trino, Spark SQL, AWS Athena, or others—significantly influences performance. For Superset dashboards, we typically recommend Trino because it’s designed for interactive, low-latency queries. Spark SQL excels at batch and large-scale analytical workloads. Athena is managed and serverless but has higher per-query overhead.
Each engine has different memory models, caching strategies, and Iceberg integration maturity. When tuning Superset + Iceberg, always profile your query engine in isolation before blaming the table format or Superset’s configuration.
Iceberg Table Layout and Partitioning Strategy
Partition Key Design
Partitioning is your first lever for performance. A well-chosen partition key allows the query engine to skip entire data files without reading metadata. A poorly chosen key forces the engine to scan all files.
Best practices:
- Partition by time (date or timestamp) if your queries frequently filter by time range. This is the most common pattern for analytics. Use a granularity that matches your query patterns: day-level partitioning for dashboards refreshing daily, hour-level if you have hourly drills.
- Partition by a low-cardinality dimension (e.g., region, product category) if your dashboards frequently filter by that dimension. Avoid high-cardinality keys (e.g., user ID) because they create too many partitions and fragment file layout.
- Use composite partitioning sparingly. A two-level partition (e.g.,
year/monthorregion/product_type) can work, but three or more levels often creates more metadata overhead than benefit.
Consider this example: an events table with billions of rows, queried primarily by date and secondarily by customer segment.
CREATE TABLE events (
event_id BIGINT,
timestamp TIMESTAMP,
customer_segment VARCHAR,
event_type VARCHAR,
value DECIMAL(10, 2)
)
PARTITIONED BY (year(timestamp), month(timestamp), customer_segment)
USING ICEBERG;
This layout allows Superset queries like SELECT SUM(value) FROM events WHERE timestamp >= '2024-01-01' AND customer_segment = 'Enterprise' to prune files efficiently.
Clustering and Hidden Partitioning
Iceberg supports “hidden partitioning,” which allows you to partition by expressions (e.g., year(timestamp)) without explicitly adding partition columns to the table schema. This keeps your table schema clean and simplifies queries.
You can also use Iceberg’s clustering features (via the WRITE_DISTRIBUTION_MODE setting) to co-locate rows with similar values, further improving file pruning. For Superset dashboards that frequently group by a specific dimension, clustering by that dimension can halve query time.
File Sizing and Compaction Patterns
The Small-File Problem
One of the most common performance killers is too many small files. Each file carries metadata overhead: the query engine must open and read file metadata even if the file contains no relevant data. With thousands of small files, metadata discovery alone can take seconds.
Iceberg recommends target file sizes between 100 MB and 1 GB per file, depending on your workload. For interactive Superset dashboards, we typically target 256 MB to 512 MB files as a sweet spot: large enough to amortise metadata cost, small enough that pruning is effective.
Compaction and Maintenance
Iceberg provides compaction tools (via Spark or dedicated services) to rewrite small files into larger ones. A typical maintenance routine:
- Daily or hourly compaction for high-velocity tables (events, logs, transactional data).
- Weekly compaction for slower-moving tables.
- Targeted compaction on partitions that receive frequent queries.
Here’s a Spark job that compacts an Iceberg table:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("compact-iceberg").getOrCreate()
# Compact the entire table
spark.sql("""
ALTER TABLE events REWRITE DATA FILES
WHERE date_trunc('day', timestamp) = CAST('2024-01-15' AS DATE)
USING BIN_PACK
""")
This rewrites data files in the 2024-01-15 partition, consolidating small files into larger ones. The BIN_PACK strategy minimises the number of output files while respecting the target file size.
Snapshot Cleanup
Every write to an Iceberg table creates a new snapshot. Over time, old snapshots accumulate metadata. Cleaning up expired snapshots (older than, say, 7 days) reduces metadata size and speeds up metadata discovery.
spark.sql("""
ALTER TABLE events REMOVE SNAPSHOTS
OLDER THAN CAST('2024-01-08' AS TIMESTAMP)
""")
Run this weekly as part of your maintenance routine. It’s a cheap operation that pays dividends in metadata performance.
Superset Configuration for Query Performance
Database Connection and Engine Tuning
Superset connects to Iceberg via a query engine (Trino, Spark, Athena, etc.). The connection parameters and engine configuration directly affect dashboard performance.
For Trino:
- Set
query.max-memoryto allocate sufficient memory for complex queries. Start with 4 GB per query and increase if you see out-of-memory errors. - Enable
query.distributed-joinsto push join operations to workers, reducing coordinator overhead. - Tune
task.concurrency(typically 4–8 per worker) to balance parallelism and resource contention.
For Spark SQL:
- Set
spark.sql.adaptive.enabled = trueto enable adaptive query execution, which optimises join strategies and partition pruning on the fly. - Tune
spark.sql.shuffle.partitionsbased on your data volume and cluster size (typically 200–500). - Enable
spark.sql.adaptive.skewJoin.enabled = trueif your queries involve skewed joins (e.g., joining a large fact table to a small dimension with uneven cardinality).
Connection pooling:
Superset maintains a connection pool to the query engine. Configure the pool size based on your dashboard concurrency:
# In Superset's configuration
SQLALCHEMY_POOL_SIZE = 20 # Concurrent connections
SQLALCHEMY_POOL_RECYCLE = 3600 # Recycle connections after 1 hour
SQLALCHEMY_ECHO = False # Disable SQL logging in production
A pool that’s too small causes queuing; a pool that’s too large wastes resources on the engine side.
Query Optimisation in Superset
Superset allows you to define “virtual datasets” and “saved queries” that pre-compute common aggregations or filters. Use these to offload query logic from dashboards.
For example, instead of having every dashboard compute SUM(revenue) GROUP BY customer_segment, create a virtual dataset:
CREATE TABLE revenue_by_segment AS
SELECT
DATE_TRUNC('day', timestamp) AS date,
customer_segment,
SUM(revenue) AS total_revenue,
COUNT(*) AS transaction_count
FROM events
GROUP BY 1, 2;
Then point your Superset dashboards at this pre-aggregated table. Queries run in milliseconds instead of seconds.
Superset Caching Configuration
Superset has built-in query result caching. Configure it to cache dashboard results for a reasonable TTL (time-to-live):
# In Superset's configuration
CACHE_DEFAULT_TIMEOUT = 300 # Cache for 5 minutes
RESULT_CACHE_MAX_SIZE = 1000000 # Max result size in bytes
For dashboards that don’t require real-time data, increase the TTL to 1 hour or more. For operational dashboards (fraud detection, real-time metrics), use a shorter TTL or disable caching.
Caching, Materialized Views, and Query Acceleration
Materialized Views in Iceberg
Iceberg doesn’t natively support materialized views, but you can simulate them by creating physical tables with pre-computed aggregations. The key is to refresh these tables on a schedule that matches your dashboard SLA.
For example, a dashboard showing daily revenue trends:
CREATE TABLE revenue_daily_mv (
date DATE,
total_revenue DECIMAL(15, 2),
transaction_count BIGINT
)
PARTITIONED BY (year(date), month(date))
USING ICEBERG;
INSERT INTO revenue_daily_mv
SELECT
DATE_TRUNC('day', timestamp) AS date,
SUM(revenue) AS total_revenue,
COUNT(*) AS transaction_count
FROM events
WHERE timestamp >= CAST(CURRENT_DATE - INTERVAL 90 DAY AS TIMESTAMP)
GROUP BY 1;
Refresh this table daily (or more frequently if needed). Superset queries against the materialized view will be orders of magnitude faster than querying the raw events table.
Query Result Caching in Superset
Superset’s native caching stores query results in Redis or Memcached. Configure it aggressively for dashboards that don’t require real-time data:
from flask_caching.backends.redis import RedisCache
CACHE_CONFIG = {
'CACHE_TYPE': 'RedisCache',
'CACHE_REDIS_URL': 'redis://localhost:6379/1',
'CACHE_DEFAULT_TIMEOUT': 3600, # 1 hour
}
For Superset, also enable the query result cache in the UI: Settings → Advanced → Query Result Caching. Set a reasonable TTL (e.g., 1 hour for non-critical dashboards, 5 minutes for operational ones).
Preset’s Materialized Views Integration
If you’re using Preset (the commercial Superset offering), you have access to materialized view acceleration. Accelerating Apache Superset Dashboards With Materialized Views outlines how to automatically create and maintain materialized views for your most-queried charts. This is a powerful feature for large-scale deployments.
Monitoring, Observability, and Cost Control
Query Performance Monitoring
Set up observability from day one. You need visibility into:
- Query latency: How long does each dashboard query take?
- Data scanned: How much data does each query read from object storage?
- Cost: What’s the per-query cost (in cloud credits)?
Most query engines provide query logs. For Trino, query logs include execution time, data scanned, and resource usage:
SELECT
query_id,
query,
execution_time_ms,
data_scanned_bytes,
peak_memory_bytes
FROM system.runtime.tasks
WHERE query_id = 'query_123'
ORDER BY execution_time_ms DESC;
Ingest these logs into a monitoring system (Datadog, New Relic, CloudWatch) and set up alerts:
- Alert if any query takes longer than 30 seconds.
- Alert if any query scans more than 10 GB of data.
- Alert if daily cloud costs exceed your budget.
Iceberg Metadata Monitoring
Monitor the size and growth of your Iceberg metadata:
import os
# List metadata files for a table
metadata_path = "s3://my-bucket/warehouse/events/metadata/"
metadata_files = os.listdir(metadata_path)
metadata_size_gb = sum(
os.path.getsize(os.path.join(metadata_path, f)) for f in metadata_files
) / (1024**3)
print(f"Metadata size: {metadata_size_gb:.2f} GB")
If metadata grows faster than 10% month-over-month, you likely have too many snapshots or uncompacted files. Increase your maintenance frequency.
Cost Attribution and Optimization
Most cloud providers (AWS, GCP, Azure) tag queries or operations so you can attribute costs to specific dashboards or users. Use this to identify expensive dashboards:
SELECT
dashboard_name,
SUM(data_scanned_bytes) / (1024**3) AS data_scanned_gb,
COUNT(*) AS query_count,
AVG(execution_time_ms) AS avg_latency_ms,
COUNT(*) * 0.005 AS estimated_cost_usd -- Adjust based on your pricing
FROM query_logs
WHERE date >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY 1
ORDER BY estimated_cost_usd DESC;
Target the top 3–5 expensive dashboards for optimisation. Often, a single poorly designed dashboard can account for 20–30% of total costs.
Common Bottlenecks and How to Fix Them
Bottleneck 1: Metadata Discovery Taking >5 Seconds
Symptoms: Queries are fast (2–3 seconds) but dashboard loads take 10+ seconds.
Root causes:
- Too many small files (thousands of files per partition).
- Too many snapshots (metadata bloat).
- Iceberg metadata cached on the query engine is stale.
Fixes:
- Run compaction on the table immediately.
- Clean up old snapshots (older than 7 days).
- Restart the query engine to clear metadata caches.
- If the table is very large (>1 TB), consider partitioning more granularly (by day instead of by month).
Bottleneck 2: Queries Scanning Too Much Data
Symptoms: A query that should scan 1 GB is scanning 50 GB.
Root causes:
- Superset query is missing a partition filter (e.g., no
WHERE date >= ...). - Iceberg partition key doesn’t match the query pattern.
- Superset dashboard filter isn’t being applied to the query.
Fixes:
- Inspect the actual SQL generated by Superset. In Superset, click the query icon in a chart and check the SQL.
- Ensure partition filters are in the WHERE clause. For a date-partitioned table, queries should always include
WHERE date >= ... AND date < .... - If Superset’s filter isn’t being applied, check the chart’s “SQL” tab and add the filter manually.
- Consider re-partitioning the table if the current partition key doesn’t match query patterns.
Bottleneck 3: Slow Joins
Symptoms: A query that joins a large fact table to a small dimension table takes 30+ seconds.
Root causes:
- Join is not being broadcast (small dimension table is being shuffled across workers).
- Dimension table has too many files (metadata overhead).
- Join key has high cardinality, causing skew.
Fixes:
- Ensure the dimension table is small enough to broadcast. For Trino, this is typically <100 MB.
- If the dimension table is large, pre-join it to the fact table and store the result as a physical table.
- For Spark, enable
spark.sql.adaptive.skewJoin.enabled = trueto handle skewed joins. - Consider denormalising the dimension data into the fact table if the join is on the critical path.
Bottleneck 4: High Cloud Costs
Symptoms: Your monthly bill is $5,000+ and growing.
Root causes:
- Dashboards are refreshing too frequently (e.g., every minute instead of every hour).
- Dashboards are querying raw, uncompacted tables instead of materialized views.
- Queries are missing partition filters, scanning all data.
Fixes:
- Audit dashboard refresh rates. Change most dashboards to refresh every 1 hour instead of every 5 minutes.
- Create materialized views for the top 10 most-queried charts and point dashboards at those views.
- Add partition filters to all Superset queries. Use Superset’s native filters to ensure partition columns are always included in the WHERE clause.
- Set a query timeout (e.g., 60 seconds) to kill runaway queries before they consume too much data.
Operational Habits and Maintenance Routines
Daily Maintenance
Run these tasks daily:
- Compact small files in high-velocity tables (tables receiving >1 GB of writes per day).
- Monitor query latency and alert if any dashboard query exceeds 30 seconds.
- Check cloud costs and alert if daily spend exceeds budget.
#!/bin/bash
# daily_maintenance.sh
# Compact events table (written to hourly)
spark-submit \
--class org.apache.spark.sql.catalyst.optimizer.Optimizer \
--master yarn \
/path/to/compact_iceberg.py \
--table events \
--date "$(date -d 'today' '+%Y-%m-%d')"
# Check query latency
trino-cli --execute "
SELECT
dashboard_name,
MAX(execution_time_ms) AS max_latency_ms,
AVG(execution_time_ms) AS avg_latency_ms
FROM query_logs
WHERE date = CAST(CURRENT_DATE AS DATE)
GROUP BY 1
HAVING MAX(execution_time_ms) > 30000
ORDER BY max_latency_ms DESC;
"
Weekly Maintenance
Run these tasks weekly:
- Clean up old snapshots (older than 7 days).
- Analyse query patterns and identify slow or expensive dashboards.
- Review materialized views and refresh schedules.
# weekly_maintenance.py
import spark
from datetime import datetime, timedelta
# Clean up snapshots older than 7 days
spark.sql("""
ALTER TABLE events REMOVE SNAPSHOTS
OLDER THAN CAST('{}' AS TIMESTAMP)
""".format((datetime.now() - timedelta(days=7)).isoformat()))
# Analyse query patterns
query_analysis = spark.sql("""
SELECT
dashboard_name,
COUNT(*) AS query_count,
AVG(execution_time_ms) AS avg_latency_ms,
SUM(data_scanned_bytes) / (1024**3) AS data_scanned_gb,
SUM(data_scanned_bytes) / (1024**3) * 0.005 AS estimated_cost_usd
FROM query_logs
WHERE date >= CAST(CURRENT_DATE - INTERVAL 7 DAY AS DATE)
GROUP BY 1
ORDER BY estimated_cost_usd DESC
LIMIT 20
""")
query_analysis.show()
Monthly Review
Once a month, conduct a deeper review:
- Audit table partitioning and consider re-partitioning if query patterns have changed.
- Review materialized views and consider creating new ones for frequently-queried charts.
- Benchmark query performance against targets (e.g., all dashboard queries should complete in <5 seconds).
- Review cloud costs and identify opportunities for optimisation.
Real-World Benchmarks and Case Studies
Case Study 1: E-Commerce Platform
Setup: 500 GB events table, 100M rows/day, 20 Superset dashboards.
Initial state: Dashboard queries took 15–45 seconds. Monthly cloud costs: $3,200.
Optimisations:
- Re-partitioned table from
year/monthtoyear/month/day(partition key now matches query patterns). - Compacted files from 50 MB average to 256 MB average (reduced from 8,000 files to 1,000 files).
- Created 5 materialized views for the most-queried aggregations.
- Enabled query result caching in Superset (5-minute TTL).
Results:
- Dashboard query latency: 15–45 seconds → 1–3 seconds (10–15x improvement).
- Cloud costs: $3,200 → $600/month (81% reduction).
- User experience: Dashboards now feel snappy; users report higher engagement.
Operational overhead: 2 hours/week for maintenance (compaction, snapshot cleanup, materialized view refresh).
Case Study 2: Financial Services
Setup: 2 TB transactions table, 500M rows/day, 50 Superset dashboards, SOC 2 audit requirements.
Initial state: Peak query latency: 120 seconds. Metadata size: 45 GB. Monthly costs: $8,500.
Optimisations:
- Implemented 3-level partitioning:
year/month/day/hour(for 1-hour refresh SLA). - Automated compaction pipeline (Spark job every 4 hours).
- Implemented aggressive snapshot cleanup (keep only 3 days of history).
- Created 10 materialized views, each refreshed hourly.
- Implemented query timeout (60 seconds) to prevent runaway queries.
Results:
- Peak query latency: 120 seconds → 8 seconds (15x improvement).
- Metadata size: 45 GB → 2 GB (98% reduction).
- Cloud costs: $8,500 → $2,100/month (75% reduction).
- Audit readiness: Iceberg’s versioning and time-travel capabilities simplified SOC 2 evidence collection.
Operational overhead: 4 hours/week for maintenance and monitoring.
Benchmark: Superset + Iceberg vs. Traditional Data Warehouse
We benchmarked Superset + Iceberg (with optimisations) against a traditional Redshift cluster for a 1 TB dataset:
| Metric | Superset + Iceberg | Redshift |
|---|---|---|
| Average query latency | 2.5 seconds | 3.2 seconds |
| P95 query latency | 8 seconds | 12 seconds |
| Monthly cost | $1,200 | $4,500 |
| Setup time | 1 week | 3 weeks |
| Scaling (add 10x data) | Linear (add storage + compute) | Requires cluster resize |
| Schema evolution | Easy (Iceberg native) | Complex (requires migration) |
Superset + Iceberg wins on cost and flexibility. Redshift wins on simplicity (fully managed). The choice depends on your team’s comfort with operational complexity and your cost sensitivity.
Next Steps: Building Your Performance Roadmap
Phase 1: Baseline and Assessment (Week 1)
- Inventory your Iceberg tables. Document table size, partition key, file count, and metadata size.
- Profile Superset dashboards. For each dashboard, measure query latency, data scanned, and cost.
- Identify bottlenecks. Which dashboards are slow? Which are expensive? Which have stale data?
Phase 2: Quick Wins (Weeks 2–3)
- Compact small files in high-velocity tables. Target: 256 MB–512 MB files.
- Clean up old snapshots. Keep only 7 days of history.
- Enable Superset query caching with a 5-minute TTL.
- Add partition filters to slow queries.
Expect: 30–50% improvement in query latency and 20–40% reduction in costs.
Phase 3: Structural Optimisations (Weeks 4–8)
- Re-partition tables if the current partition key doesn’t match query patterns.
- Create materialized views for the top 10 most-queried charts.
- Automate compaction with a daily or 4-hourly Spark job.
- Implement query monitoring and alerting (latency, cost, data scanned).
Expect: 60–80% improvement in query latency and 50–70% reduction in costs.
Phase 4: Ongoing Operations (Ongoing)
- Daily: Compact small files, monitor query latency, check costs.
- Weekly: Clean up snapshots, analyse query patterns, review materialized views.
- Monthly: Benchmark performance, audit partitioning, plan optimisations.
Tools and Resources
For platform engineering teams modernising with Superset and Iceberg, we recommend:
- Apache Superset Documentation for Superset configuration and best practices.
- Apache Iceberg Performance Tuning for Iceberg-specific optimisation guidance.
- Best Practices for Optimizing Apache Iceberg Workloads for AWS-specific patterns.
- Best Practices for Optimizing Apache Iceberg Performance for file sizing, compaction, and metadata tuning.
- BI Dashboards with Apache Iceberg, Glue & Superset for end-to-end integration patterns.
If you’re building a data platform with Superset and Iceberg at scale, consider working with a platform engineering partner. Platform Development in Sydney teams at PADISO have shipped dozens of Superset + Iceberg stacks for financial services, retail, and media companies. We’ve also helped teams across Platform Development in Australia and internationally—including Platform Development in Melbourne, Platform Development in Canberra, and Platform Development in New York—optimise for cost and performance.
Building Your Team
Superset + Iceberg performance tuning requires a mix of skills:
- Data engineering: Iceberg partitioning, compaction, snapshot management.
- Analytics engineering: Materialized views, query optimisation, dimensional modelling.
- DevOps / Platform engineering: Monitoring, alerting, automation, cost attribution.
- BI / Analytics: Superset configuration, dashboard design, user feedback.
Most teams start with a fractional CTO or senior platform engineer to establish baselines, set up monitoring, and build operational runbooks. After 8–12 weeks, most teams can operate independently. If you need hands-on support, Services at PADISO include fractional CTO leadership, platform engineering, and bespoke AI & Agents Automation for teams building or scaling data platforms.
Final Thoughts
Apache Superset + Iceberg is a powerful, cost-effective stack for analytics at scale. But like any distributed system, it requires intentional configuration and operational discipline. The teams that succeed are those that:
- Measure everything: Query latency, data scanned, cloud costs. You can’t optimise what you don’t measure.
- Automate maintenance: Compaction, snapshot cleanup, and monitoring should be automated, not manual.
- Iterate ruthlessly: Start with quick wins (caching, partition filters), then move to structural changes (re-partitioning, materialized views).
- Invest in observability: Spend 20% of your effort on monitoring and alerting. It pays for itself in faster incident resolution and better decision-making.
With these habits in place, you’ll build a Superset + Iceberg stack that’s fast, cost-effective, and a joy to operate.
Frequently Asked Questions
Q: Should I use Superset or Tableau for Iceberg?
A: Superset is open-source and cost-effective, especially for internal dashboards. Tableau is more polished and has better UX but costs significantly more per user. For cost-sensitive teams or those with strong engineering talent, Superset + Iceberg is compelling. For business users who value ease-of-use, Tableau is worth the premium. Many organisations use both: Superset for internal/operational dashboards, Tableau for executive/external dashboards.
Q: How often should I compact Iceberg tables?
A: For high-velocity tables (>1 GB/day of writes), compact daily. For slower-moving tables (< 100 MB/day), compact weekly. Monitor file count and average file size; if files drop below 50 MB or file count exceeds 1,000 per partition, compact immediately.
Q: Can I use Superset with Iceberg on a laptop for development?
A: Yes, but you’ll need Spark SQL or a local Trino instance. For development, we recommend using a local Spark cluster (via Docker) and a small Iceberg table (< 1 GB). This gives you the full Superset + Iceberg experience without the cloud costs.
Q: What’s the difference between Iceberg’s hidden partitioning and explicit partitioning?
A: Hidden partitioning (e.g., PARTITION BY year(timestamp)) keeps your table schema clean—the partition column doesn’t appear in SELECT * queries. Explicit partitioning (e.g., PARTITION BY year, month, day) adds partition columns to the schema. Hidden partitioning is cleaner for analytics; explicit partitioning is more flexible if you need to query partition boundaries. For Superset, either works, but hidden partitioning is recommended.
Q: How do I migrate from a traditional data warehouse to Superset + Iceberg?
A: 1) Set up Iceberg tables with the same schema as your warehouse. 2) Backfill historical data from your warehouse. 3) Set up CDC (change data capture) or batch ingestion to keep Iceberg in sync. 4) Migrate dashboards one at a time from the warehouse to Superset + Iceberg. 5) Validate results and performance. 6) Decommission the warehouse. This typically takes 8–12 weeks depending on data volume and dashboard complexity.
This guide is based on real-world deployments across dozens of organisations. If you have questions or want to discuss your specific use case, reach out to the platform engineering teams at Platform Development in United States or Platform Development in Canada. We also work with teams in Platform Development in San Francisco, Platform Development in Seattle, Platform Development in Chicago, Platform Development in Atlanta, and Platform Development in Austin on custom platform engineering, Case Studies show real results, and Platform Development in New Zealand teams can share regional insights.