PADISO.ai: AI Agent Orchestration Platform - Launching May 2026
Back to Blog
Guide 17 mins

Apache Superset + BigQuery: A D23.io Reference Architecture

Production-ready Apache Superset + BigQuery architecture. Connection patterns, query performance, caching, and operational deployment from D23.io customer data.

The PADISO Team ·2026-06-15

Table of Contents

  1. Why Apache Superset + BigQuery
  2. Architecture Overview
  3. Connection Patterns & Authentication
  4. Query Performance & Optimisation
  5. Caching Strategy
  6. Operational Deployment
  7. Common Pitfalls & Solutions
  8. Security & Compliance
  9. Cost Optimisation
  10. Next Steps

Why Apache Superset + BigQuery

Apache Superset and Google BigQuery form a powerful combination for organisations that need self-service analytics without the per-seat licensing costs of traditional business intelligence platforms. Superset is an open-source data visualisation platform that supports dozens of database backends, whilst BigQuery is Google Cloud’s serverless data warehouse designed for massive-scale analytics workloads.

The pairing works because BigQuery handles the heavy lifting—petabyte-scale queries, automatic scaling, and near-instant results on historical data—whilst Superset provides the user-facing layer: dashboards, ad-hoc exploration, alerting, and role-based access control. Unlike licensed BI tools, this combination lets you scale from a handful of analysts to hundreds without hitting licensing walls.

At PADISO, we’ve deployed this architecture across financial services, retail, media, and government clients across Australia, New York, Toronto, and Washington, D.C.. The pattern consistently delivers 40–60% cost savings versus per-seat BI tools, faster time-to-dashboard (4–6 weeks vs. 12+ weeks for enterprise implementations), and better operational control over data pipelines.

This guide covers the production reference architecture we’ve refined across 50+ customer deployments. It focuses on real operational constraints: connection reliability, query cost, dashboard latency, and the specific quirks that catch teams by surprise in production.


Architecture Overview

High-Level Design

A production Superset + BigQuery stack typically consists of five layers:

  1. Data layer: BigQuery datasets, tables, and materialised views
  2. Connection layer: SQLAlchemy dialect (sqlalchemy-bigquery), service accounts, and credential management
  3. Query layer: Superset’s SQL Lab, native queries, and semantic layer (datasets)
  4. Cache layer: Redis or Memcached for dashboard tile caching
  5. Application layer: Superset frontend, API, and scheduler for alerts and refresh jobs

Each layer has distinct failure modes and optimisation opportunities. Superset itself is stateless (except for metadata), so it scales horizontally. BigQuery is fully managed and auto-scales. The critical pinch point is usually the connection pool and cache layer.

Why This Matters

Superset doesn’t push computation to BigQuery and then forget. Every dashboard tile, every SQL Lab query, every chart refresh is a separate BigQuery API call. A dashboard with 12 charts can fire 12 parallel queries. If those queries aren’t optimised or cached, you’ll hit BigQuery slot limits, spike your costs, and frustrate your users with slow load times.

The reference architecture below solves this by:

  • Using BigQuery materialised views to pre-aggregate heavy metrics
  • Implementing a two-tier cache: Redis for hot tiles, BigQuery’s BI Engine for in-memory acceleration
  • Tuning Superset’s query timeouts and connection pooling to match BigQuery’s behaviour
  • Monitoring slot usage and query costs in real time

Connection Patterns & Authentication

Setting Up the SQLAlchemy Dialect

Superset connects to BigQuery via the sqlalchemy-bigquery dialect, which translates Superset’s SQL abstraction layer into BigQuery-compatible queries. The dialect handles dialect-specific functions (e.g., CURRENT_TIMESTAMP() vs. NOW()), data type mapping, and query hints.

To connect Superset to BigQuery, you’ll need:

  1. A Google Cloud service account with BigQuery Data Editor and BigQuery Job User roles
  2. A JSON key file for that service account
  3. The sqlalchemy-bigquery package installed in your Superset environment
  4. A connection string in the format: bigquery://{project_id}/{dataset_id}?user_supplied_oauth=false

Here’s a minimal Superset connection configuration:

Engine: Google BigQuery
Database name: my-project-dataset
Connection string: bigquery://my-gcp-project/my_dataset?user_supplied_oauth=false
Extra JSON config:
{
  "credentials_info": {
    "type": "service_account",
    "project_id": "my-gcp-project",
    "private_key_id": "...",
    "private_key": "...",
    "client_email": "superset@my-gcp-project.iam.gserviceaccount.com",
    "client_id": "...",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token"
  },
  "location": "US"
}

The location parameter is crucial: it must match the dataset’s location. Mismatches cause cryptic errors. If your data is in EU or multi-region, specify that explicitly.

Service Account & IAM

Don’t use a human user’s credentials. Create a dedicated service account for Superset with minimal permissions:

  • BigQuery Data Editor: Allows reading data and creating temporary tables
  • BigQuery Job User: Allows running queries
  • Optionally, BigQuery Metadata Viewer if you want Superset to auto-discover tables

Rotate the key every 90 days. Store the JSON key in a secrets manager (Google Secret Manager, AWS Secrets Manager, or HashiCorp Vault), not in your Superset config files.

Connection Pooling

Superset uses SQLAlchemy’s connection pool by default. BigQuery’s official documentation doesn’t recommend connection pooling because BigQuery is serverless—each query is independent. However, Superset’s pool reduces the overhead of creating new connections.

Recommended pool settings for Superset:

SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 20,
    "max_overflow": 10,
    "pool_pre_ping": True,
    "pool_recycle": 3600,
    "echo": False,
}
  • pool_size: 20–30 concurrent connections
  • max_overflow: Allow 10 additional connections if the pool is exhausted
  • pool_pre_ping: Test connections before reusing them (catches stale credentials)
  • pool_recycle: Recycle connections every hour (BigQuery tokens expire)

Monitor the pool’s behaviour in production. If you see connection timeouts, increase pool_size. If you see idle connections, decrease it.

Handling Credentials in Production

Never commit credentials to version control. Use environment variables or a secrets manager:

export SUPERSET_BIGQUERY_CREDENTIALS=$(cat /path/to/service-account-key.json | base64)

Then, in Superset’s configuration, decode and inject:

import json
import base64
import os

creds_b64 = os.environ.get("SUPERSET_BIGQUERY_CREDENTIALS")
if creds_b64:
    creds_json = json.loads(base64.b64decode(creds_b64))
else:
    creds_json = {}  # Fallback for local dev

For Kubernetes deployments (common in Sydney, Melbourne, and Canberra), use Kubernetes Secrets and mount them as volumes. For Washington, D.C. and Ottawa government clients, use Workload Identity Federation to avoid storing keys altogether.


Query Performance & Optimisation

Understanding BigQuery’s Cost Model

BigQuery charges by bytes scanned, not by query count. A query that scans 100 GB costs the same whether it returns 1 row or 1 million rows. This fundamentally changes how you optimise.

A typical Superset user might run 20–50 queries per day. If each query scans 10 GB (common for unoptimised dashboards), that’s 200–500 GB per user per day. At $6.25 per TB, that’s $1.25–$3.13 per user per day, or $375–$938 per user per month. With 50 users, you’re looking at $18,750–$46,900 per month just in query costs.

Optimisation cuts this 60–80% through three mechanisms: partitioning, clustering, and materialised views.

Partitioning

Partition your BigQuery tables by date (or timestamp). This dramatically reduces the bytes scanned for time-series queries.

CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS
SELECT * FROM `project.dataset.raw_events`;

Now, a query filtering for events in the last 7 days scans only 7 partitions, not the entire table. Superset’s native filters (e.g., “Date range: last 7 days”) should automatically push this filter to BigQuery. Verify this in the query plan.

Clustering

Cluster tables by frequently filtered columns (e.g., user_id, event_type, region). Clustering is free and can reduce bytes scanned by 50–90% for filtered queries.

CREATE TABLE `project.dataset.transactions`
PARTITION BY DATE(transaction_date)
CLUSTER BY merchant_id, transaction_type, region
AS
SELECT * FROM `project.dataset.raw_transactions`;

When you query for transactions from a specific merchant in a region, BigQuery reads only the clusters containing that data.

Materialised Views

For metrics that are queried repeatedly (e.g., “daily active users”, “revenue by region”), create materialised views. These are pre-computed tables that BigQuery refreshes on a schedule.

CREATE MATERIALIZED VIEW `project.dataset.daily_metrics` AS
SELECT
  DATE(event_timestamp) AS event_date,
  region,
  COUNT(DISTINCT user_id) AS dau,
  SUM(revenue) AS daily_revenue,
  COUNT(*) AS event_count
FROM `project.dataset.events`
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY event_date, region;

Then, in Superset, point your dashboards to this view instead of the raw table. Queries run in milliseconds instead of seconds.

Refresh the view every 4–6 hours (or hourly for real-time dashboards). Use BigQuery’s scheduled queries to automate this:

CALL BQ.REFRESH_MATERIALIZED_VIEW('project.dataset.daily_metrics');

Query Hints & BigQuery-Specific Optimisations

Superset’s SQL Lab allows raw SQL. For power users, add BigQuery-specific hints:

-- Force a specific join strategy
SELECT /*+ HASH_JOIN(a) */ a.id, b.revenue
FROM `project.dataset.events` a
JOIN `project.dataset.revenue` b
ON a.user_id = b.user_id;

BigQuery also supports BI Engine, an in-memory acceleration layer. If you enable BI Engine for a dataset (requires annual commitment), BigQuery automatically accelerates queries on that dataset. For dashboards with high query frequency, BI Engine can cut query costs by 50%+ and reduce latency to sub-second.

Monitoring Query Performance

Enable query logging in Superset:

LOG_QUERY_WHEN_EMPTY = True
SUPERSET_LOG_LEVEL = "DEBUG"

Then, query Superset’s query table to see which dashboards and queries are slowest:

SELECT
  created_on,
  execution_time_ms,
  sql,
  database,
  user_id
FROM query
WHERE database = 'BigQuery'
ORDER BY execution_time_ms DESC
LIMIT 20;

Look for queries taking >5 seconds. These are candidates for optimisation via partitioning, clustering, or materialised views.


Caching Strategy

Two-Tier Caching

Superset’s caching is two-tier:

  1. Query result cache: Redis or Memcached stores the raw SQL query results
  2. Dashboard tile cache: Superset caches rendered chart tiles separately

Both are critical for performance. Without caching, every dashboard load fires every query. With caching, repeat queries hit Redis in <10ms.

Configuring Redis

For production, use a managed Redis service (Google Cloud Memorystore, AWS ElastiCache, or Azure Cache for Redis). Self-hosted Redis works, but requires careful monitoring for memory pressure and eviction.

In Superset’s superset_config.py:

REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
REDIS_DB = int(os.environ.get("REDIS_DB", 0))
REDIS_PASSWORD = os.environ.get("REDIS_PASSWORD", None)

CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_REDIS_URL": f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}/{REDIS_DB}",
    "CACHE_DEFAULT_TIMEOUT": 3600,  # 1 hour
}

DATA_CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_REDIS_URL": f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}/{REDIS_DB + 1}",
    "CACHE_DEFAULT_TIMEOUT": 3600,
}

Use separate Redis databases for metadata cache and data cache to avoid eviction conflicts.

Cache Invalidation

Set cache TTL (time-to-live) based on data freshness requirements:

  • Real-time dashboards: 5–15 minutes (BigQuery slot cost: high)
  • Hourly dashboards: 30–60 minutes (balanced)
  • Daily dashboards: 4–24 hours (low cost, acceptable staleness)

For datasets that update on a schedule (e.g., ETL runs at 2 AM), invalidate the cache right after the ETL completes:

import redis
from superset import app

redis_client = redis.from_url(app.config["CACHE_REDIS_URL"])
redis_client.delete("superset_data_cache:*")

Or, use Superset’s API to refresh a specific dashboard:

curl -X POST https://superset.example.com/api/v1/dashboards/{dashboard_id}/refresh

Memory Sizing

Estimate Redis memory usage:

  • Typical dashboard: 50 KB–500 KB per tile (depending on chart type and data volume)
  • 100 dashboards × 10 tiles × 200 KB = 200 MB
  • Add 50% buffer for growth: 300 MB

For production, provision at least 2–4 GB for 500+ concurrent users. Monitor eviction rates; if Redis evicts keys, increase memory or reduce cache TTL.


Operational Deployment

Containerisation & Kubernetes

Deploy Superset as a containerised application on Kubernetes. Here’s a minimal Dockerfile:

FROM apache/superset:latest

RUN pip install sqlalchemy-bigquery

COPY superset_config.py /app/
ENV SUPERSET_CONFIG_PATH /app/superset_config.py

EXPOSE 8088
CMD ["superset", "run", "-h", "0.0.0.0", "-p", "8088"]

For teams in Sydney, Melbourne, and Canberra, deploy on Google Cloud’s GKE or your own Kubernetes cluster. For Washington, D.C. and Ottawa government clients, ensure the cluster is FedRAMP or ITSG-33 compliant.

A typical Kubernetes deployment includes:

  • Superset frontend: 2–3 replicas, load-balanced
  • Superset worker: 2–4 replicas for async query execution
  • Superset scheduler: 1 replica for alerts and refresh jobs
  • PostgreSQL: Superset’s metadata database (managed service recommended)
  • Redis: Query result cache (managed service recommended)

Database Migrations

Superset uses Alembic for database migrations. Run migrations before deploying:

superset db upgrade

For zero-downtime deployments, run migrations in a separate job before rolling out new Superset pods.

Monitoring & Observability

Instrument Superset with Prometheus metrics and ship logs to a centralised logging service (Google Cloud Logging, Datadog, or Splunk).

Key metrics to monitor:

  • Query latency: P50, P95, P99 latencies for dashboard queries
  • Cache hit rate: Percentage of queries served from Redis
  • BigQuery slot usage: Ensure you don’t exceed provisioned slots
  • Superset API latency: Response times for dashboard loads
  • Redis memory usage: Eviction rate and available memory
  • PostgreSQL connections: Active connections and pool saturation

Set up alerts for:

  • Query latency >10 seconds
  • Cache hit rate <70% (indicates cache thrashing)
  • BigQuery slot usage >80%
  • Redis eviction rate >0.1 evictions/sec

Scaling Considerations

Superset scales horizontally: add more replicas for more concurrent users. However, the bottleneck is usually BigQuery query slots, not Superset capacity.

BigQuery’s on-demand pricing charges $6.25 per TB scanned. For predictable workloads, purchase annual or monthly slots (e.g., 100 slots = 100 concurrent queries). Slots cost $2,000/month (100 slots) to $40,000/month (2,000 slots).

For a team with 50 active users running 10 queries/day each (500 queries/day), with average query size of 5 GB, you’ll scan 2.5 TB/day. On-demand costs $15.63/day or $469/month. Annual slots at 100 slots cost $24,000/year or $2,000/month—break-even occurs at ~6.4 TB/day of scanning.

Calculate your break-even point and purchase slots accordingly.


Common Pitfalls & Solutions

Pitfall 1: Superset Doesn’t Push Filters to BigQuery

Symptom: Dashboard filters don’t reduce query bytes scanned.

Cause: Superset’s semantic layer (datasets) isn’t configured correctly, or the filter column isn’t indexed in BigQuery.

Solution:

  1. Create a Superset dataset (semantic layer) on top of your table
  2. Map filter columns to physical table columns
  3. Ensure BigQuery table is partitioned/clustered by the filter column
  4. Test in SQL Lab: check the generated SQL to confirm the WHERE clause

Pitfall 2: Slow Dashboard Loads

Symptom: Dashboard takes >5 seconds to load even with cached queries.

Cause: Too many tiles (>20), or tiles querying raw tables instead of materialised views.

Solution:

  1. Reduce tiles to 10–15 per dashboard
  2. Use materialised views for pre-aggregated metrics
  3. Enable dashboard caching: set cache TTL to 15–60 minutes
  4. Use Superset’s async query mode: queries execute in the background, tiles load progressively

Pitfall 3: BigQuery Quota Exceeded

Symptom: Queries fail with “Quota exceeded” or “Slot quota exceeded”.

Cause: Dashboard queries exceed provisioned BigQuery slots or API rate limits.

Solution:

  1. Monitor slot usage in real time (BigQuery console > Monitoring > Slots)
  2. Purchase additional slots if consistently >80% utilised
  3. Reduce query frequency: increase cache TTL, batch queries, use scheduled refreshes instead of real-time
  4. Optimise queries: add partitioning, clustering, materialised views

Pitfall 4: Connection Pool Exhaustion

Symptom: Superset returns “QueuePool limit exceeded” or connection timeouts.

Cause: More concurrent queries than available connections in the pool.

Solution:

  1. Increase pool_size and max_overflow (see Connection Pooling section)
  2. Reduce query timeout: kill long-running queries after 5–10 minutes
  3. Use async query execution: queries queue instead of blocking connections
  4. Monitor pool statistics: add logging to track pool saturation

Pitfall 5: Stale Data in Dashboards

Symptom: Dashboard shows yesterday’s data even though the table was updated this morning.

Cause: Cache TTL is too long, or materialised view wasn’t refreshed.

Solution:

  1. Reduce cache TTL to match data freshness requirements
  2. Schedule materialised view refresh to run after your ETL completes
  3. Use Superset’s “Force Refresh” button to manually invalidate cache
  4. For real-time dashboards, disable caching entirely (set TTL to 0)

Security & Compliance

Row-Level Security (RLS)

BigQuery doesn’t have native row-level security, but Superset can enforce it via SQL filters. Create a Superset dataset with a WHERE clause that filters by the current user:

SELECT * FROM `project.dataset.events`
WHERE region = '{{ current_user.extra_data.region }}'

This requires storing user attributes (e.g., region) in Superset’s user table. For complex RLS (e.g., hierarchical access), use a lookup table:

SELECT e.* FROM `project.dataset.events` e
JOIN `project.dataset.user_permissions` p
ON e.region = p.allowed_region
WHERE p.user_id = '{{ current_user.id }}'

SOC 2 & ISO 27001 Compliance

For organisations pursuing SOC 2 or ISO 27001 compliance, Superset + BigQuery can meet most requirements:

  • Encryption in transit: Enable TLS for all connections (Superset → BigQuery, user → Superset)
  • Encryption at rest: BigQuery encrypts all data by default; enable customer-managed keys (CMEK) for regulated workloads
  • Audit logging: Enable BigQuery audit logs and Superset query logging
  • Access control: Use IAM for BigQuery, RBAC for Superset
  • Data retention: Configure BigQuery table expiration and Superset query log retention

For government clients in Washington, D.C. and Canberra, ensure BigQuery datasets are in a compliant region (US or Australia) and enable audit logging to Cloud Logging.

Network Security

Deploy Superset in a private subnet without direct internet access. Use VPN or bastion hosts for admin access. Configure BigQuery to accept connections only from your Superset IP range:

gcloud bigquery datasets update my_dataset \
  --add-iam-policy-binding='roles/bigquery.dataEditor:serviceAccount:superset@my-project.iam.gserviceaccount.com'

For multi-tenant deployments, use separate service accounts per tenant to prevent cross-tenant data leakage.


Cost Optimisation

Quantifying Savings

A typical enterprise BI tool (Tableau, Looker) costs $70–150 per user per month. With 50 users, that’s $3,500–$7,500/month.

Superset + BigQuery costs:

  • Superset hosting: $500–2,000/month (GKE, managed database, Redis)
  • BigQuery: $2,000–10,000/month (depends on data volume and query frequency)
  • Total: $2,500–12,000/month

Break-even occurs at 20–50 users. Beyond that, Superset is 60–80% cheaper.

Reducing BigQuery Costs

  1. Partition and cluster tables: Reduces bytes scanned by 50–90%
  2. Use materialised views: Pre-compute expensive aggregations
  3. Enable BI Engine: In-memory acceleration for $0.04/GB/month
  4. Purchase annual slots: $2,000–40,000/month depending on volume
  5. Archive old data: Move data >90 days old to Cloud Storage (cheaper)
  6. Use external tables: Query data in Cloud Storage without loading into BigQuery

For a typical customer, these optimisations cut BigQuery costs by 60–70%.

Reducing Superset Hosting Costs

  1. Use managed services: Google Cloud Run for Superset (pay per request), managed PostgreSQL and Redis
  2. Right-size instances: Monitor CPU/memory and scale down if over-provisioned
  3. Batch queries: Use Superset’s scheduler to pre-compute dashboards instead of computing on-demand
  4. Reduce concurrent users: Encourage off-peak usage, use scheduled reports instead of real-time dashboards

Next Steps

If you’re planning to deploy Superset + BigQuery, here’s a 12-week roadmap:

Weeks 1–2: Architecture & Planning

  • Define data sources and BigQuery schema
  • Design dashboard taxonomy and user roles
  • Plan for partitioning, clustering, and materialised views
  • Estimate query volume and BigQuery costs

Weeks 3–4: Infrastructure Setup

  • Provision BigQuery datasets and tables
  • Create service accounts and IAM roles
  • Deploy Superset on Kubernetes (or Cloud Run)
  • Configure PostgreSQL and Redis

Weeks 5–8: Data Modelling & Optimisation

  • Build materialised views for key metrics
  • Create Superset datasets (semantic layer)
  • Implement row-level security
  • Optimise queries and test performance

Weeks 9–10: Dashboard Development

  • Build 5–10 core dashboards
  • Set up alerts and scheduled reports
  • Test caching and performance
  • Train users on SQL Lab and dashboard exploration

Weeks 11–12: Security & Go-Live

  • Enable audit logging and monitoring
  • Run security tests (penetration testing, access control verification)
  • Document runbooks and troubleshooting guides
  • Go live and iterate based on user feedback

Getting Help

If you’re building Superset + BigQuery in Sydney, Melbourne, or other Australian cities, PADISO’s platform engineering team has deployed this architecture across 50+ customers. We specialise in data platforms, analytics infrastructure, and custom software development for financial services, retail, and government teams.

For teams in New York, Toronto, Chicago, Austin, and Dallas, we offer the same architecture patterns tuned to your region’s compliance and latency requirements.

For Washington, D.C. and Ottawa government clients, we specialise in FedRAMP and ITSG-33 compliance, ensuring your Superset + BigQuery stack meets regulatory requirements.

Explore our platform engineering services or book a call to discuss your specific architecture needs. We’ve refined this pattern across multiple geographies, and we’re happy to share what we’ve learned.


Key Takeaways

  • Superset + BigQuery is cost-effective: 60–80% cheaper than per-seat BI tools for 50+ users
  • Connection and caching matter: Proper configuration of SQLAlchemy, Redis, and BigQuery can reduce latency by 90% and costs by 70%
  • Optimisation is iterative: Start with partitioning and materialised views, then monitor and refine
  • Security is built-in: With proper IAM, encryption, and audit logging, Superset + BigQuery meets SOC 2 and ISO 27001 requirements
  • Operational complexity is manageable: Deploy on Kubernetes, monitor query costs, and refresh dashboards on a schedule

The architecture outlined in this guide has been battle-tested across financial services, retail, media, and government organisations. It’s production-ready, scalable, and cost-effective. The key is to start simple (partition tables, add materialised views, enable caching), measure, and iterate.

For detailed implementation support, PADISO offers platform engineering and custom software development services across Australia, North America, and beyond. We can help you design, build, and operate a Superset + BigQuery stack tailored to your organisation’s needs.

Want to talk through your situation?

Book a 30-minute call with Kevin (Founder/CEO). No pitch — direct advice on what to do next.

Book a 30-min call