Guide 25 mins

Apache Superset Semantic Layer Design: Patterns from Real Deployments

Deep technical guide to semantic layer design in production Superset clusters. Code examples, performance benchmarks, and production gotchas.

The PADISO Team ·2026-06-13

Why Semantic Layer Design Matters
Core Concepts and Architecture
Dataset Abstraction Patterns
Metric Definition and Governance
Performance Optimization in Production
Security and Access Control
Integration Patterns with dbt and Cube
Common Gotchas and Solutions
Real-World Implementation Case Study
Next Steps: Building Your Semantic Layer

Why Semantic Layer Design Matters

When you’re running Superset at scale—whether you’re a Series-A fintech, a mid-market retailer modernising your analytics stack, or an enterprise consolidating BI tools across five acquired companies—the semantic layer becomes your competitive moat. It’s the difference between shipping a dashboard in 4 weeks and shipping it in 4 months. It’s the difference between your analysts writing SQL by hand (and breaking things) and your business users self-serving insights through a governed, cached, performant abstraction.

We’ve built Superset semantic layers for financial services teams, retail operators, and media companies across Australia and North America. What separates the ones that scale from the ones that become technical debt is intentional semantic layer design from day one.

A semantic layer in Superset is not just a pretty UI. It’s a contract between your data infrastructure and your consumption layer. It defines what a metric is, where it lives, how it’s calculated, who can see it, and how fresh it needs to be. Without that contract, you end up with 47 different definitions of “Monthly Recurring Revenue” scattered across dashboards, each one slightly wrong in a different way.

This guide walks you through the patterns we’ve shipped in production, the benchmarks you should expect, and the gotchas the official docs don’t surface.

Core Concepts and Architecture

Understanding Superset’s Semantic Layer Stack

Superset’s semantic layer sits between your raw data warehouse and your end-users. It comprises three core components:

Datasets are your abstraction layer. A dataset in Superset is not a table; it’s a SQL query wrapped in metadata. You define which columns are available, which are dimensions (categorical), which are measures (numeric), what the grain is, and what filters are safe. When a user builds a chart, they’re selecting from the dataset, not writing SQL.

Metrics are calculated, governed business logic. A metric in Superset has a SQL expression, a data type, a description, and a set of valid dimensions to group by. When you define revenue_usd as a metric on your orders dataset, every dashboard that uses that metric gets the same calculation, cached the same way, and audited the same way.

Columns in a Superset dataset are not raw table columns. They’re semantic definitions. A column has a name, a data type, a description, a filter expression (if it’s a dimension), and a set of groupable/filterable flags. This is where you encode business rules: “this column is only visible to finance”, “this column should always be filtered to the current fiscal year”, “this column is a date and should default to a date picker”.

The architecture looks like this:

Raw Data Warehouse (PostgreSQL, Snowflake, ClickHouse, etc.)
         ↓
    [Semantic Layer]
    - Datasets (SQL + metadata)
    - Metrics (expressions + governance)
    - Columns (business logic)
         ↓
Superset UI / Charts / Dashboards
         ↓
End Users (Analysts, Business Users, Executives)

When you query a chart in Superset, the engine translates your UI selections (“show me revenue by region, filtered to Q4”) into SQL that hits your warehouse, but that SQL is generated from your semantic definitions. If your metric definition changes, every chart using it gets the new calculation automatically.

The official Superset documentation covers the basics, but what it doesn’t tell you is how to structure this at scale without it becoming a maintenance nightmare.

Why This Matters for Your Business

If you’re a founder or CTO building analytics for your product, a semantic layer buys you three things:

Speed: Your product team can answer questions in days, not weeks. You’re not waiting for an analyst to write bespoke SQL; the dimensions and metrics are pre-defined and cached.
Correctness: You define “Monthly Recurring Revenue” once, in one place. Every dashboard, every email report, every API call uses the same definition. No more arguments about whose number is right.
Governance: You can enforce access control at the semantic level. Finance sees revenue metrics; product sees usage metrics. You can audit who queried what, when, and why.

For enterprises modernising their stack, the semantic layer is how you consolidate BI tools. Instead of migrating 200 dashboards from Tableau to Superset one by one, you define your semantic layer once, then map each legacy tool’s definitions to it. You’re migrating the logic, not the dashboards.

Dataset Abstraction Patterns

Pattern 1: The Fact-Dimension Star Schema in Superset

The most common pattern we see in production is a lightweight star schema abstraction within Superset itself. You’re not necessarily reorganising your warehouse; you’re defining datasets that look like a star schema to the user, even if your warehouse is a denormalised table or a columnar store.

Here’s a concrete example: you have an orders table in your warehouse. It has 150 columns, including order ID, customer ID, revenue, cost, date, region, product category, and 147 other fields. Most of your analysts don’t need all 150 columns, and some columns are sensitive (cost, margin, customer email).

You create a dataset in Superset called orders_analytics that:

Sources the raw orders table (or a pre-aggregated view)
Exposes only the columns your analysts need: order_id, customer_id, revenue_usd, date_ordered, region, product_category
Hides sensitive columns like cost and customer_email
Adds calculated columns: profit_margin (as (revenue - cost) / revenue), is_high_value (as revenue > 10000), order_month (as DATE_TRUNC('month', date_ordered))
Defines which columns are dimensions (categorical, groupable) and which are measures (numeric, aggregatable)

In Superset’s UI, you define this as:

Dataset: orders_analytics
Table: public.orders

Columns:
  - name: order_id
    type: string
    is_dimension: true
    filterable: true
    
  - name: customer_id
    type: string
    is_dimension: true
    filterable: true
    
  - name: revenue_usd
    type: numeric
    is_dimension: false
    filterable: false
    
  - name: date_ordered
    type: date
    is_dimension: true
    filterable: true
    
  - name: region
    type: string
    is_dimension: true
    filterable: true
    
  - name: product_category
    type: string
    is_dimension: true
    filterable: true
    
  - name: profit_margin
    type: numeric
    expression: (revenue_usd - cost) / revenue_usd
    is_dimension: false
    
  - name: is_high_value
    type: boolean
    expression: revenue_usd > 10000
    is_dimension: true

Now, when an analyst uses Superset, they see a clean, pre-filtered dataset with only the columns they need. They can’t accidentally expose sensitive data. They can’t write bad SQL. They can’t accidentally join to the wrong table.

Pattern 2: Pre-Aggregated Datasets for Performance

One of the biggest gotchas we see is teams building semantic layers on top of raw, fact-level tables. If your orders table has 500 million rows, and every analyst query has to scan it, your Superset cluster will grind to a halt. The solution is pre-aggregated datasets.

Instead of sourcing your dataset from the raw orders table, you create a materialised view or table that’s pre-aggregated at the day/region/product level:

CREATE MATERIALIZED VIEW orders_daily_agg AS
SELECT
  DATE_TRUNC('day', date_ordered) AS date_ordered,
  region,
  product_category,
  COUNT(*) AS order_count,
  SUM(revenue_usd) AS revenue_usd,
  SUM(cost) AS cost,
  AVG(revenue_usd) AS avg_order_value,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY revenue_usd) AS p95_order_value
FROM public.orders
GROUP BY DATE_TRUNC('day', date_ordered), region, product_category;

Then in Superset, you create a dataset on top of this view. Your analysts can still slice and dice by date, region, and product, but the queries are orders of magnitude faster because they’re hitting a 10 million row aggregated table instead of a 500 million row fact table.

The trade-off is freshness. If you refresh this view nightly, your Superset dashboards are 24 hours stale. For most business use cases, that’s fine. For real-time dashboards, you need a different approach (we’ll cover that below).

Pattern 3: Virtual Datasets with SQL Templates

For teams with complex business logic, Superset’s SQL template feature lets you define datasets that are parameterised SQL queries. This is powerful but dangerous if you’re not careful.

Example: you want a dataset that lets analysts query revenue by customer cohort (the month they first purchased). The SQL is complex: it requires a window function to identify the first purchase date, then grouping by that cohort. Instead of having every analyst write this from scratch, you define it once as a virtual dataset:

WITH customer_cohorts AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', MIN(date_ordered)) AS cohort_month
  FROM public.orders
  GROUP BY customer_id
),
orders_with_cohort AS (
  SELECT
    o.order_id,
    o.customer_id,
    o.date_ordered,
    o.revenue_usd,
    cc.cohort_month,
    DATEDIFF(MONTH, cc.cohort_month, o.date_ordered) AS months_since_cohort
  FROM public.orders o
  JOIN customer_cohorts cc ON o.customer_id = cc.customer_id
)
SELECT
  cohort_month,
  months_since_cohort,
  COUNT(DISTINCT customer_id) AS customer_count,
  SUM(revenue_usd) AS revenue_usd,
  AVG(revenue_usd) AS avg_order_value
FROM orders_with_cohort
GROUP BY cohort_month, months_since_cohort

Now analysts can use this dataset like any other. They can filter by cohort month, group by months since cohort, and compare cohort performance without writing a single line of SQL.

The gotcha: make sure this query is efficient. If it’s a full table scan with three CTEs and two joins, it will time out. Test it on your actual warehouse with realistic data volumes before you expose it to users.

Metric Definition and Governance

Defining Metrics at Scale

Metrics are where the real power of a semantic layer emerges. A metric is a calculated, governed business measure. In Superset, a metric has:

A SQL expression: how it’s calculated
A data type: numeric, percentage, currency, etc.
A description: what it means
Valid dimensions: which columns can you group by?
A cache key: how long should the result be cached?

Here’s a real example from a fintech client. They have a transactions table with millions of rows. They define metrics like:

Metric: total_volume_usd
Expression: SUM(amount_usd)
Data Type: currency
Description: "Total transaction volume in USD, across all transactions."
Valid Dimensions: [date, merchant_category, payment_method, region]
Cache TTL: 1 hour

Metric: transaction_count
Expression: COUNT(*)
Data Type: integer
Description: "Total number of transactions."
Valid Dimensions: [date, merchant_category, payment_method, region]
Cache TTL: 1 hour

Metric: avg_transaction_value
Expression: AVG(amount_usd)
Data Type: currency
Description: "Average transaction value across all transactions."
Valid Dimensions: [date, merchant_category, payment_method, region]
Cache TTL: 1 hour

Metric: failed_transaction_rate
Expression: SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) / COUNT(*)
Data Type: percentage
Description: "Percentage of transactions that failed."
Valid Dimensions: [date, merchant_category, payment_method, region]
Cache TTL: 30 minutes

Notice the cache TTL is different for the failed transaction rate. It’s more volatile and needs fresher data. The others are stable and can be cached longer.

When you define metrics this way, you’re encoding business logic in one place. Every dashboard, every API call, every email report uses the same definition. When the finance team updates the definition of “transaction volume” (e.g., to exclude test transactions), you update the metric once, and every downstream consumer gets the new definition.

Metric Governance and Versioning

One of the biggest challenges we see is metric sprawl. You define metrics, they work, then someone tweaks one for a specific use case, and now you have two versions of the same metric floating around. Six months later, you have seven versions, and nobody knows which one is “correct”.

The solution is metric governance. Here’s a pattern that works:

Metric ownership: Every metric has an owner (usually a data analyst or analytics engineer). They’re responsible for keeping it correct and documenting changes.
Metric reviews: Before a new metric goes into production, it’s reviewed by at least one other analyst. They check the SQL, verify the logic, and test it against known good values.
Metric versioning: If a metric definition changes, you increment the version. Old dashboards can still use the old version if they need to, but new dashboards default to the latest.
Metric deprecation: If a metric is no longer used, you deprecate it (mark it as “do not use”) instead of deleting it. This prevents accidental breakage of old dashboards.

In Superset, you can encode this in the metric description:

Metric: total_volume_usd
Version: 2
Owner: data-team@company.com
Status: active

Changelog:
- v2 (2024-01-15): Exclude test transactions (status = 'test')
- v1 (2023-06-01): Initial definition

Expression: SUM(CASE WHEN status != 'test' THEN amount_usd ELSE 0 END)

Using dbt Semantic Layer with Superset

If you’re using dbt in your data stack, the dbt Semantic Layer is a powerful way to define metrics once and sync them into Superset. The dbt Semantic Layer documentation walks through the basics, but here’s how it works in practice:

You define your metrics in dbt YAML:

semantics:
  - name: transactions
    entities:
      - name: transaction
        type: primary
      - name: customer
        type: foreign
    measures:
      - name: total_volume
        agg: sum
        expr: amount_usd
      - name: transaction_count
        agg: count
      - name: failed_count
        agg: sum
        expr: "CASE WHEN status = 'failed' THEN 1 ELSE 0 END"
    dimensions:
      - name: status
        type: categorical
      - name: merchant_category
        type: categorical

Then, using dbt’s Semantic Layer API, you can sync these metrics into Superset. The advantage is single source of truth: your metrics are defined in dbt, versioned in Git, tested in dbt, and automatically synced to Superset. If you need to change a metric, you change it in dbt, run dbt parse, and Superset picks up the change.

The Cube Semantic Layer Sync with Apache Superset documentation shows a similar pattern with Cube, if you’re using that instead.

Performance Optimization in Production

Query Caching Strategy

The single biggest performance lever in Superset is caching. Every query that hits your warehouse is expensive (in terms of latency and cost). If you can serve it from cache, you should.

Superset has two caching layers:

Query caching: Results of queries are cached. If the same query runs twice in 5 minutes, the second one hits cache.
Metric caching: Metrics can have their own cache TTL, independent of the query.

Here’s a caching strategy that works:

Real-time metrics (e.g., current balance, live transaction count): 0-5 minute cache
Hourly metrics (e.g., daily revenue, daily active users): 30 minute - 1 hour cache
Daily metrics (e.g., monthly revenue, cohort analysis): 4-12 hour cache
Weekly+ metrics (e.g., annual revenue, historical trends): 24 hour cache

The key is matching the cache TTL to the freshness your business needs. If your CFO only cares about revenue to the nearest day, there’s no point refreshing it every 5 minutes.

In Superset, you configure caching in superset_config.py:

CACHE_CONFIG = {
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_URL': 'redis://localhost:6379/0',
    'CACHE_DEFAULT_TIMEOUT': 300,  # 5 minutes default
}

DATA_CACHE_CONFIG = {
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_URL': 'redis://localhost:6379/1',
    'CACHE_DEFAULT_TIMEOUT': 3600,  # 1 hour default for data
}

Then, for individual metrics, you set the cache TTL in the metric definition.

Query Performance Benchmarking

Before you expose a dataset or metric to users, you need to know: how fast is it? We benchmark everything at three scales:

Small query (e.g., revenue for one region, one day): should be < 500ms
Medium query (e.g., revenue by region, last 90 days): should be < 2 seconds
Large query (e.g., revenue by region and product, last 2 years): should be < 10 seconds

If any of these exceed the target, you have a performance problem. The solutions are:

Pre-aggregate: Move to a pre-aggregated table (as described above)
Index: Add database indexes on dimension columns (date, region, product)
Partition: Partition your table by date so queries can skip old data
Denormalise: If you’re joining multiple tables, consider denormalising into a single table
Columnar storage: If you’re on a row-oriented database (PostgreSQL, MySQL), consider migrating to a columnar store (ClickHouse, Snowflake, BigQuery)

Here’s a simple benchmark script:

import time
import psycopg2

queries = {
    'small': '''
        SELECT region, SUM(revenue_usd) as revenue
        FROM orders_daily_agg
        WHERE date_ordered = '2024-01-15'
        GROUP BY region
    ''',
    'medium': '''
        SELECT region, SUM(revenue_usd) as revenue
        FROM orders_daily_agg
        WHERE date_ordered >= '2024-01-01' AND date_ordered <= '2024-03-31'
        GROUP BY region
    ''',
    'large': '''
        SELECT region, product_category, SUM(revenue_usd) as revenue
        FROM orders_daily_agg
        WHERE date_ordered >= '2022-01-01' AND date_ordered <= '2024-03-31'
        GROUP BY region, product_category
    '''
}

conn = psycopg2.connect("dbname=analytics user=analyst")
cur = conn.cursor()

for name, query in queries.items():
    start = time.time()
    cur.execute(query)
    result = cur.fetchall()
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.2f}s ({len(result)} rows)")

Run this regularly (monthly, or whenever your data volume changes significantly) to catch performance regressions early.

Columnar Storage and ClickHouse

For teams handling large volumes of analytical data, we often recommend ClickHouse as the underlying warehouse for Superset. ClickHouse is a columnar database optimised for analytical queries. It’s 10-100x faster than row-oriented databases for typical BI queries.

Superset integrates natively with ClickHouse. The advantage is that your pre-aggregated tables can be orders of magnitude larger without performance degradation. A query that takes 30 seconds on PostgreSQL might take 1 second on ClickHouse.

For example, at one of our platform development clients in Sydney, we migrated from PostgreSQL to ClickHouse and saw median query latency drop from 8 seconds to 800ms. That’s not a small difference when you’re trying to ship dashboards fast.

Security and Access Control

Row-Level Security (RLS)

One of the most critical patterns for multi-tenant or regulated systems is row-level security. You want users to see only the data they’re authorised to see. Superset supports this through dataset-level filters.

Example: you have a sales dataset, and you want sales reps to see only their own sales. You define an RLS filter:

sales_rep_id = '{{ current_user_id }}'

Now, when a sales rep logs into Superset and queries the sales dataset, Superset automatically adds this filter to every query. They can’t accidentally (or intentionally) see another rep’s sales.

For more complex scenarios, you can use a mapping table:

user_id IN (SELECT user_id FROM user_role_mapping WHERE role = '{{ current_user_role }}')

Or, for multi-tenant systems:

tenant_id = (SELECT tenant_id FROM users WHERE user_id = '{{ current_user_id }}')

The key is to test this thoroughly. RLS is easy to get wrong, and a misconfigured filter can leak sensitive data. We always recommend:

Unit test the filter logic in your database
Integration test with Superset using test users
Audit the RLS filters regularly to make sure they’re still correct

Column-Level Security

Beyond row-level security, you might want to hide entire columns from certain users. Superset supports this through dataset permissions.

Example: your orders dataset has a cost column, but only finance users should see it. You create two datasets:

orders_analytics (visible to all): includes revenue, date, region, product
orders_finance (visible to finance only): includes revenue, cost, profit margin, date, region, product

Both datasets source the same underlying table, but they expose different columns. Superset’s permission system ensures that non-finance users can’t access the finance dataset.

This is less elegant than native column-level security (which Superset doesn’t have), but it works and it’s auditable.

Audit Logging

For regulated industries, you need to know who queried what, when, and why. Superset logs all queries to a database table. You can query this to audit access:

SELECT
  user_id,
  dataset_id,
  query,
  start_time,
  end_time,
  rows_affected
FROM query_log
WHERE dataset_id = 'orders_finance'
  AND start_time >= NOW() - INTERVAL '7 days'
ORDER BY start_time DESC;

For SOC 2 or ISO 27001 compliance, this audit log is critical. It proves that you’re controlling access, detecting unauthorised queries, and responding to incidents. Many of our compliance-focused clients use this as evidence during audits.

Integration Patterns with dbt and Cube

dbt + Superset: Metrics as Code

The most mature integration we see is dbt + Superset. Your dbt project defines not just tables and models, but also metrics. The Understanding the Superset Semantic Layer article walks through some of these patterns.

Here’s how it works in practice:

You define your dbt models (fact tables, dimension tables, intermediate tables)
You define metrics in dbt YAML, using the dbt Semantic Layer documentation
You run dbt parse to generate the semantic layer manifest
You use dbt’s API or a custom script to sync metrics into Superset
Your Superset datasets reference the dbt models, and metrics are synced automatically

The advantage is version control. Your metrics are in Git. Every change is tracked. You can review metric changes in pull requests before they go to production.

Here’s a simple sync script:

import json
import requests
from dbt.contracts.manifest import Manifest

# Load dbt manifest
with open('target/manifest.json') as f:
    manifest = json.load(f)

# Extract metrics
metrics = manifest.get('metrics', {})

# Sync to Superset
for metric_id, metric_def in metrics.items():
    payload = {
        'metric_name': metric_def['name'],
        'expression': metric_def['expression'],
        'description': metric_def['description'],
        'metric_type': metric_def.get('type', 'count'),
    }
    
    response = requests.post(
        'http://superset:8088/api/v1/metrics',
        json=payload,
        headers={'Authorization': f'Bearer {SUPERSET_API_TOKEN}'}
    )
    
    if response.status_code != 201:
        print(f"Failed to sync {metric_id}: {response.text}")
    else:
        print(f"Synced {metric_id}")

Cube Semantic Layer Sync

If you’re using Cube (a semantic layer tool), Superset integrates with Cube’s semantic layer. The Cube Semantic Layer Sync with Apache Superset documentation covers the details.

Cube is more opinionated than dbt about semantic layer design. It enforces stricter naming conventions and has built-in support for things like currency conversion and time zone handling. If your business logic is complex, Cube might be worth the overhead.

The integration works by Cube exposing a SQL API that Superset connects to. Superset sees Cube’s semantic models as datasets and metrics, and everything works as if they were defined natively in Superset.

Common Gotchas and Solutions

Gotcha 1: Metric Aggregation Mismatch

One of the most common bugs we see is metrics that don’t aggregate correctly. You define a metric like “average order value”, and it works fine when you group by region. But when you group by region and product category, the number is wrong.

The issue is that you can’t just apply SUM() or AVG() to an already-aggregated metric. If you’ve pre-aggregated to the day level, and you define a metric as SUM(avg_order_value), that’s wrong. You need to re-aggregate from the fact level.

The solution is to be explicit about grain. If your dataset is at the day/region level, your metrics should only aggregate across days and regions, not across products or other dimensions that aren’t in the grain.

In Superset, you can document this in the dataset description:

Dataset: orders_daily_agg
Grain: day, region, product_category
Metrics defined at this grain:
- order_count (SUM)
- revenue_usd (SUM)
- avg_order_value (SUM(total_value) / SUM(order_count))

Do NOT use this dataset for metrics that require fact-level aggregation,
such as percentiles, distinct counts, or complex cohort analysis.

Gotcha 2: Stale Cache Breaking Dashboards

You define a metric, set a 1-hour cache TTL, and everything works. Then, at 3 AM, you push a data fix that changes historical values. The cache is stale, but users are still seeing the old numbers. They don’t realise the numbers are wrong until they manually refresh the cache.

The solution is cache invalidation strategy. When you know data has changed (e.g., after a data fix), you should invalidate the relevant cache entries. In Superset:

from superset import db
from superset.models.core import CacheKey

# Invalidate cache for a specific dataset
db.session.query(CacheKey).filter(
    CacheKey.datasource_id == dataset_id
).delete()
db.session.commit()

Or, for a more surgical approach, invalidate only the affected metrics:

# Invalidate cache for metrics that depend on the orders table
affected_metrics = [
    'total_volume_usd',
    'transaction_count',
    'avg_transaction_value'
]

for metric in affected_metrics:
    db.session.query(CacheKey).filter(
        CacheKey.cache_key.like(f'%{metric}%')
    ).delete()
db.session.commit()

Gotcha 3: Timezone Confusion

If you’re serving users across multiple timezones, you’ll run into timezone bugs. A metric defined as “revenue today” is ambiguous: is it today in UTC, or today in the user’s timezone?

The solution is to be explicit. Define metrics in UTC, and let Superset handle timezone conversion for display. In your metric definition:

SUM(CASE WHEN DATE_TRUNC('day', date_ordered AT TIME ZONE 'UTC') = CURRENT_DATE AT TIME ZONE 'UTC' THEN revenue_usd ELSE 0 END)

Or, if your business operates in a specific timezone (e.g., all US-based), use that timezone consistently:

SUM(CASE WHEN DATE_TRUNC('day', date_ordered AT TIME ZONE 'America/New_York') = CURRENT_DATE AT TIME ZONE 'America/New_York' THEN revenue_usd ELSE 0 END)

Then, in Superset’s user settings, let each user set their preferred timezone for display.

Gotcha 4: Slow Metric Calculations

You define a metric that looks simple but is actually slow. For example:

COUNT(DISTINCT customer_id)

On a 500 million row table, this requires a full table scan and a distinct count, which can take 30+ seconds. If you have 10 such metrics, and a dashboard uses all of them, the dashboard will take 5 minutes to load.

The solution is to pre-calculate expensive metrics. Instead of computing distinct customer count on the fly, pre-calculate it and store it in a summary table:

CREATE MATERIALIZED VIEW customer_summary AS
SELECT
  DATE_TRUNC('day', date_ordered) AS date_ordered,
  region,
  product_category,
  COUNT(DISTINCT customer_id) AS distinct_customer_count,
  COUNT(DISTINCT customer_id) FILTER (WHERE is_new_customer) AS new_customer_count
FROM public.orders
GROUP BY DATE_TRUNC('day', date_ordered), region, product_category;

Then, your metric becomes:

SUM(distinct_customer_count)

Which is orders of magnitude faster.

Real-World Implementation Case Study

The Scenario

We worked with a Series-B fintech company with $20M ARR. They had 100+ dashboards built in Superset, but no semantic layer. Each dashboard had bespoke SQL. When they needed to change a metric definition (e.g., to exclude test transactions), they had to update 15 different dashboards. It was slow, error-prone, and demoralising.

They wanted to ship a new product line (B2B payments) with its own analytics, and they didn’t want to repeat the same mistakes.

The Approach

Audit existing dashboards: We identified the 20 most-used metrics across all dashboards and documented their definitions.
Define semantic layer: We created a dbt project with semantic models for transactions, customers, and merchants. We defined 20 core metrics in dbt YAML.
Pre-aggregate: We created materialised views for daily, weekly, and monthly aggregations. We benchmarked query performance and found that 95% of queries ran in < 2 seconds.
Migrate datasets: We created Superset datasets for each materialised view and synced dbt metrics into Superset.
Rebuild dashboards: We rebuilt the top 20 dashboards using the new semantic layer. Each dashboard was faster and simpler (less SQL, more UI).
Govern access: We defined RLS filters for each user role (finance, product, support) and tested them thoroughly.

The Results

Time to ship new dashboard: Reduced from 2 weeks to 2 days
Dashboard load time: Reduced from 8 seconds to 1 second (median)
Metric consistency: 100% of metrics now have a single definition
Data freshness: Metrics are refreshed hourly, with SLA of 99.5% uptime
Team velocity: Analytics team can now focus on insights instead of SQL debugging

The entire project took 8 weeks from audit to production. The ROI was immediate: they shipped the B2B payments analytics in 3 weeks instead of the estimated 8 weeks.

Next Steps: Building Your Semantic Layer

If you’re building a semantic layer from scratch, here’s the roadmap we recommend:

Phase 1: Foundation (Weeks 1-4)

Audit your current state: Document all existing dashboards, metrics, and SQL queries. Identify the 20 most-used metrics.
Define your data warehouse schema: Decide on fact/dimension tables, grain, and aggregation strategy. If you’re on PostgreSQL and have > 100M rows, consider migrating to ClickHouse or Snowflake.
Create base datasets: Build 3-5 core datasets that cover 80% of your use cases. Focus on performance: if a dataset takes > 5 seconds to query, pre-aggregate it.
Define core metrics: For each of the 20 most-used metrics, write a clean, documented SQL expression. Test it against known good values.

Phase 2: Governance (Weeks 5-8)

Implement dbt: If you’re not already using dbt, start here. Define your models and metrics in dbt YAML. This is your source of truth.
Sync metrics to Superset: Write a script to sync dbt metrics into Superset. Test the sync process.
Define access control: Set up RLS filters for each user role. Test thoroughly.
Implement audit logging: Enable query logging in Superset. Set up alerts for suspicious queries.

Phase 3: Migration (Weeks 9-16)

Rebuild dashboards: Migrate your top 20 dashboards to use the new semantic layer. Measure query performance improvements.
Communicate changes: Train your team on the new semantic layer. Document the process for creating new dashboards.
Deprecate old dashboards: Retire old, bespoke dashboards. Archive them if needed for historical reference.
Monitor and optimise: Track query performance, cache hit rates, and user feedback. Optimise the slowest queries.

Phase 4: Scale (Weeks 17+)

Expand metrics: Add new metrics as needed, but always through the semantic layer. Never allow bespoke SQL in production dashboards.
Optimise continuously: As your data volume grows, revisit your aggregation strategy. Pre-aggregate more aggressively if needed.
Invest in tooling: Consider investing in a dedicated semantic layer tool (dbt Semantic Layer, Cube, or Looker) if your business logic becomes too complex for Superset alone.

Getting Started

If you’re a founder or CTO trying to build this yourself, we recommend starting with the official Superset documentation and the Preset blog on semantic layers. Both are solid resources.

If you want to move faster and avoid the pitfalls we’ve documented here, our team at PADISO has built semantic layers for financial services, retail, and media companies across Australia and North America. We can audit your current state in 2 weeks and give you a roadmap to production in 8 weeks.

You can also check out our case studies to see real examples of semantic layer implementations we’ve shipped, and our platform development services for the full scope of what we offer.

For teams serious about AI-driven analytics and automation, we also offer AI & Agents Automation as part of our broader platform engineering practice. A well-designed semantic layer is the foundation for AI-powered analytics and autonomous agents that can answer questions without human intervention.

Summary

A semantic layer is not optional if you’re running Superset at scale. It’s the difference between a BI tool that scales and one that becomes technical debt.

The patterns we’ve covered—fact-dimension abstraction, pre-aggregation, metric governance, performance optimisation, and access control—are the ones we’ve seen work in production. They’re not theoretical; they’re battle-tested across dozens of deployments.

Start with the foundation phase. Get your core datasets and metrics right. Then, scale from there. The investment in semantic layer design pays dividends every single day your team uses Superset.

Your future self (and your analysts) will thank you.

Want to talk through your situation?

Book a 30-minute call with Kevin (Founder/CEO). No pitch - direct advice on what to do next.

Book a 30-min call