Table of Contents
- Why Superset + Athena
- Architecture Overview
- Connection Patterns and Setup
- Query Performance Optimisation
- Caching Strategies
- Operational Considerations
- Cost Management
- Security and Compliance
- Troubleshooting Common Issues
- Next Steps
Why Superset + Athena
Apache Superset and Amazon Athena together form a compelling stack for organisations running analytics on data lakes. Athena is serverless—you pay per query, no infrastructure to manage. Superset is open-source and flexible, letting you build dashboards, charts, and ad-hoc SQL without vendor lock-in.
Why this combination matters: you avoid per-seat BI licensing costs (Tableau, Looker, PowerBI), you eliminate data warehouse operational overhead (no Redshift clusters to scale), and you keep data in S3 where it already lives. For Platform Development in Sydney teams modernising analytics, this stack cuts licensing spend by 60–80% while maintaining query performance for most analytical workloads.
At PADISO, we’ve deployed Superset + Athena across dozens of customer organisations—financial services, retail, media, and government. The pattern is consistent: teams start with a single Superset instance, connect it to Athena, and within weeks they’re decommissioning expensive BI tools. The real complexity isn’t the connection; it’s optimising query patterns, managing Athena costs, and handling the operational quirks that emerge at scale.
This guide pulls from those real deployments. We’ll walk through production architecture, connection patterns, performance tuning, and the operational gotchas that will hit you if you skip them.
Architecture Overview
High-Level Stack
At its core, the stack looks like this:
- Data Layer: Data in S3 (Parquet, ORC, or CSV), partitioned by date or business entity.
- Query Engine: Amazon Athena, which uses Presto under the hood and queries S3 directly.
- Metadata: AWS Glue Data Catalog, which Athena uses to understand table schemas.
- BI Layer: Apache Superset, running on EC2, ECS, or managed Kubernetes, pointing to Athena as a SQL data source.
- Caching: Redis or Memcached (optional but recommended for production).
- Access Control: IAM roles for Superset to assume, VPC endpoints for private connectivity.
Why This Architecture Works
Superset connects to Athena via a JDBC or Python DB API driver. Apache Superset’s user documentation covers supported databases; Athena is listed as a first-class SQL engine. Superset sends SQL queries to Athena, Athena parses them against your Glue Data Catalog, scans S3, and returns results. Superset caches those results and renders dashboards.
The beauty is simplicity: no ETL pipeline, no data warehouse, no schema synchronisation. Your source data lives in S3 (often the output of a data lake pipeline like Spark, dbt, or Fivetran). Athena queries it directly. Superset visualises it.
For teams in Platform Development in Melbourne or Platform Development in Canberra running regulated workloads, this architecture also simplifies compliance: data is immutable in S3, Athena is stateless, and Superset is the only stateful component you need to harden for SOC 2 or ISO 27001 audit readiness.
Deployment Topology
Superset typically runs in one of three ways:
- Self-Hosted on EC2: Simple, full control, no vendor fees. Best for teams with DevOps capacity.
- Managed Kubernetes (EKS): Scales horizontally, supports multiple replicas, integrates with your platform infrastructure.
- Preset (Superset-as-a-Service): Hosted by Preset Inc., handles updates and scaling, but introduces a third-party dependency.
For production workloads we recommend self-hosted on EKS with Redis caching and a private VPC endpoint to Athena. This gives you control, cost predictability, and compliance flexibility. Preset’s documentation is solid if you choose the managed route, but you’ll pay per user seat and lose some customisation.
Connection Patterns and Setup
Prerequisites
Before you connect Superset to Athena, you need:
- An S3 bucket where Athena will write query results (Athena needs this; it’s separate from your data bucket).
- A Glue Data Catalog with tables pointing to your S3 data. If you’re using dbt, Spark, or Fivetran, this is usually automatic.
- IAM permissions for Superset to call Athena’s
StartQueryExecution,GetQueryExecution, andGetQueryResultsAPIs. - Superset installed (via Docker, pip, or a managed service).
Installing the Athena Driver
Superset doesn’t ship with Athena support out of the box. You need the PyAthena driver:
pip install pyathena
If you’re using Preset, this is already included. If you’re self-hosting, add it to your requirements.txt and rebuild your Superset image.
Configuring the Database Connection
In Superset, go to Settings > Database Connections > Add Database. Choose Amazon Athena from the dropdown (or Presto if Athena isn’t listed—Athena is Presto-compatible).
Fill in:
- Database Name: Anything descriptive, e.g.,
athena-prod. - SQLAlchemy URI: This is the critical part. The format is:
awsathena+rest://athena.{region}.amazonaws.com:443?s3_staging_dir=s3://your-results-bucket/path/&work_group=primary&poll_interval=1&kill_on_interrupt=True
Break this down:
athena.{region}.amazonaws.com: Athena’s endpoint in your region (e.g.,athena.us-east-1.amazonaws.com).s3_staging_dir: The S3 bucket where Athena writes query results. Must be writable by the IAM role Superset assumes.work_group: Athena’s workgroup name. Useprimaryunless you’ve created a custom one.poll_interval: How often (in seconds) Superset checks for query completion. 1 second is reasonable; higher values reduce API calls but increase latency.kill_on_interrupt: If set toTrue, Superset cancels the Athena query if the user cancels in the UI.
IAM Permissions
Superset needs an IAM role with these permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:StopQueryExecution",
"athena:GetWorkGroup"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::your-results-bucket",
"arn:aws:s3:::your-results-bucket/*",
"arn:aws:s3:::your-data-bucket",
"arn:aws:s3:::your-data-bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTable",
"glue:GetPartitions",
"glue:GetDatabases",
"glue:GetTables"
],
"Resource": "*"
}
]
}
If Superset runs on EC2 or EKS, attach this role to the instance or pod. If using Preset, provide your AWS credentials (though Preset recommends using cross-account IAM roles instead—see their docs).
Testing the Connection
Once configured, click Test Connection in Superset. Superset will try to query the Glue Data Catalog and list tables. If it succeeds, you’re connected. If it fails, check CloudWatch logs for Athena errors—usually missing permissions or a typo in the S3 staging directory.
Query Performance Optimisation
Understanding Athena’s Query Model
Athena is a scan-based query engine. It reads data from S3, filters, aggregates, and returns results. There’s no index, no pre-aggregated table, no query cache at the Athena level. Every query scans the underlying data.
This has two implications:
- Partition pruning is critical. If your table is partitioned by date and you query only today’s data, Athena scans only today’s files. If you query without a date filter, it scans everything. This can cost 10× more in compute and time.
- File format matters. Parquet and ORC are columnar and compressed; CSV is row-oriented and uncompressed. A Parquet query costs 10–50× less than CSV on the same data.
Partition Strategy
Partition your S3 data by the columns you filter on most often. Typical partitions:
- By date:
s3://bucket/table/year=2024/month=01/day=15/data.parquet - By tenant/customer:
s3://bucket/table/tenant_id=acme/year=2024/month=01/data.parquet - By region:
s3://bucket/table/region=au/year=2024/month=01/data.parquet
In your Glue table definition, declare these as partition columns. When Athena executes a query with a WHERE clause on a partition column, it prunes partitions and scans only the relevant files.
Example: A query on a 10TB table partitioned by date, filtering for the last 7 days, might scan only 70GB. That’s a 140× cost reduction.
File Format and Compression
Use Parquet or ORC for all new data. If you’re still generating CSV, migrate:
- Parquet: Smaller, faster, widely supported. Use Snappy or GZIP compression. Superset handles it transparently.
- ORC: Even smaller, optimised for Hive/Spark. Also excellent.
- CSV: Avoid for analytical data. Only use for small reference tables or one-off exports.
For a typical financial dataset (100GB), switching from CSV to Parquet cuts storage by 80% and query time by 70%.
Superset-Level Query Optimisation
In Superset, when you build a chart or dashboard, Superset generates SQL and sends it to Athena. You can optimise this SQL:
- Use native queries, not ad-hoc exploration: Define charts with explicit SQL. Avoid the visual query builder for large tables.
- Pre-filter in SQL: Instead of loading 1M rows and filtering in Superset, filter in the SQL itself:
SELECT customer_id, revenue, date FROM transactions WHERE date >= CURRENT_DATE - INTERVAL '7' DAY AND status = 'completed' - Aggregate in SQL: Don’t load raw data and aggregate in Superset. Aggregate in Athena:
SELECT DATE_TRUNC('day', date) as day, SUM(revenue) as total_revenue FROM transactions WHERE date >= CURRENT_DATE - INTERVAL '30' DAY GROUP BY 1 - Use LIMIT for exploration: When building dashboards, add
LIMIT 10000to test queries quickly, then remove it for the final chart.
Monitoring Query Performance
In the AWS Athena console, view query history. Look for:
- Data Scanned: How many bytes Athena read. High values suggest missing partition pruning or uncompressed data.
- Execution Time: How long the query took. Should be <10 seconds for most analytical queries.
- Cost: Athena charges $5 per TB scanned (as of 2024). A query scanning 100GB costs $0.50.
If a query scans more data than expected, check:
- Is the table partitioned? Are you filtering on the partition column?
- Is the data in Parquet or ORC? If CSV, migrate.
- Are you joining large tables? Can you pre-aggregate?
Caching Strategies
Why Caching Matters
Athena has no built-in query cache. If two users run the same query within seconds, Athena executes it twice, charging twice. For dashboards with multiple charts, this compounds: a 10-chart dashboard might run the same aggregation query 5 times if users refresh it.
Superset’s caching layer solves this. It stores query results in Redis (or another backend) and serves cached results instead of re-querying Athena.
Setting Up Redis Caching
First, deploy Redis. In production, use AWS ElastiCache (managed Redis). For development, run it in Docker:
docker run -d -p 6379:6379 redis:7-alpine
Then configure Superset to use Redis. In your Superset config file (superset_config.py), add:
CACHE_CONFIG = {
'CACHE_TYPE': 'redis',
'CACHE_REDIS_URL': 'redis://your-redis-endpoint:6379/0',
'CACHE_DEFAULT_TIMEOUT': 3600, # 1 hour
}
DATA_CACHE_CONFIG = {
'CACHE_TYPE': 'redis',
'CACHE_REDIS_URL': 'redis://your-redis-endpoint:6379/1',
'CACHE_DEFAULT_TIMEOUT': 3600,
}
Restart Superset. Now, when a user runs a query, Superset checks Redis first. If the result is cached, it returns it immediately (no Athena call). If not, it queries Athena, caches the result, and returns it.
Cache Invalidation Strategy
The challenge: how do you know when to invalidate a cache entry? If underlying data changes, stale cache causes wrong answers.
Approach 1: Time-Based Expiry
Set a CACHE_DEFAULT_TIMEOUT (e.g., 1 hour). After 1 hour, the cache expires and Superset re-queries Athena. Simple but blunt: if data updates every 5 minutes, a 1-hour cache is stale for 55 minutes.
Approach 2: Manual Invalidation
When you deploy new data (e.g., a daily dbt run completes), manually clear the cache:
from superset.extensions import cache
cache.clear()
Or use Superset’s API:
curl -X DELETE http://superset:8088/api/v1/cache
This is precise but requires orchestration. Pair it with a data pipeline: after dbt finishes, trigger a cache clear.
Approach 3: Hybrid
Set a short TTL (e.g., 5 minutes) for high-frequency dashboards, and longer TTLs (e.g., 24 hours) for static reports. Use manual invalidation for critical dashboards.
Caching at the Athena Level
Athena itself doesn’t cache queries, but you can use Trino’s Athena Connector to add a caching layer. Trino is a distributed SQL engine that can sit between Superset and Athena, caching query results at the Trino level.
This is overkill for most teams but useful if you have hundreds of concurrent users or very expensive queries. The architecture becomes: Superset → Trino → Athena. Trino caches results, reducing Athena load.
Operational Considerations
Monitoring and Alerting
Set up CloudWatch alarms for:
- Query Failure Rate: If >5% of Athena queries fail, something’s wrong (bad data, permission issue, quota hit).
- Query Latency: If P95 query time exceeds 30 seconds, investigate partition pruning or data format.
- Data Scanned: If a dashboard suddenly scans 10× more data than yesterday, a filter might have broken.
- Cost: Set a monthly budget alarm. If Athena spend exceeds your forecast, review query patterns.
Use CloudWatch Logs Insights to query Athena logs:
fields @timestamp, queryExecutionId, statementType, dataScannedInBytes, executionTimeInMillis
| filter ispresent(dataScannedInBytes)
| stats avg(dataScannedInBytes), max(executionTimeInMillis) by statementType
Handling Athena Throttling
Athena has quotas: you can run up to 100 concurrent queries (configurable). If Superset tries to run 150 concurrent queries, Athena throttles.
In production, set Superset’s database pool size to match Athena’s quota:
SQLALCHEMY_POOL_SIZE = 50
SQLALCHEMY_POOL_RECYCLE = 3600
Also, enable query queueing in Superset. Instead of failing, queries wait for a slot to free up.
Handling Glue Data Catalog Drift
Your Glue Data Catalog is the source of truth for table schemas. If data in S3 changes (new column added, type changed) but the Glue table definition doesn’t, Athena queries fail or return wrong results.
Keep your Glue catalog in sync:
- If using dbt: dbt automatically syncs schemas to Glue (via
dbt-athenaadapter). - If using Spark: Spark writes partition metadata to Glue.
- If manual: Run
MSCK REPAIR TABLEperiodically to rescan S3 and update Glue:MSCK REPAIR TABLE my_table;
Superset Upgrades
Superset releases frequently. Test upgrades in a staging environment before production. Upgrading can break:
- Custom plugins or extensions.
- Database driver compatibility (e.g., PyAthena updates).
- Cache configuration (Redis API changes).
Maintain a clear upgrade process: stage → test dashboards → production.
Cost Management
Understanding Athena Costs
Athena charges $5 per TB of data scanned (as of 2024, varies by region). A query scanning 100GB costs $0.50. A query scanning 1TB costs $5.
For a team running 100 queries per day on 50GB per query, that’s 5TB/day × $5 = $25/day = $750/month.
This is often 10–20× cheaper than a traditional BI tool (Tableau: $70–140/user/month), but it can still add up if queries are inefficient.
Cost Optimisation Tactics
- Partition aggressively: Pruning partitions can reduce scanned data by 90%.
- Use Parquet/ORC: Compression reduces scanned bytes by 80–90%.
- Pre-aggregate in the data pipeline: Instead of loading raw transactions and aggregating in Superset, aggregate in dbt or Spark before writing to S3. This reduces query scope.
- Limit dashboard refresh frequency: If a dashboard refreshes every minute, it’s 1,440 queries/day. Reduce to hourly (60 queries/day) and save 95% of Athena costs.
- Use result caching: Superset’s Redis cache prevents redundant Athena queries.
- Archive old data: Move data older than 1 year to S3 Glacier. Athena can still query it (with a small delay), but storage costs drop 90%.
Cost Monitoring
Create a Superset dashboard that tracks Athena costs:
SELECT
DATE_TRUNC('day', query_execution_time) as day,
COUNT(*) as query_count,
SUM(data_scanned_in_bytes) / 1024 / 1024 / 1024 / 1024 as tb_scanned,
SUM(data_scanned_in_bytes) / 1024 / 1024 / 1024 / 1024 * 5 as estimated_cost_usd
FROM athena.query_logs
GROUP BY 1
ORDER BY 1 DESC
This query (which you’d run on CloudTrail logs or a custom logging table) shows daily costs. If costs spike, investigate which queries caused it.
Security and Compliance
Network Security
Superset and Athena should communicate over a private connection, not the public internet.
- VPC Endpoints: Create an S3 Gateway Endpoint and an Athena VPC Endpoint in your VPC. Superset connects to Athena through the endpoint, avoiding NAT gateway costs and reducing latency.
- IAM Policies: Restrict Athena and S3 access to specific roles. Don’t use root credentials.
- Encryption in Transit: Athena supports TLS. Ensure Superset connects over HTTPS.
Data Residency and Compliance
For teams in Platform Development in Washington, D.C. or Platform Development in Toronto, data residency is critical. Ensure:
- S3 buckets are in the correct region (e.g.,
us-east-1for US,ca-central-1for Canada). - Athena queries execute in the same region.
- Query results (in the staging S3 bucket) stay in-region.
For government or defence work, check if your region supports IRAP/FedRAMP. AWS GovCloud regions are available for US federal agencies; AWS Asia Pacific (Sydney) is suitable for Australian government workloads.
Audit and Logging
Enable CloudTrail logging for all Athena and S3 API calls. This provides an audit trail for compliance audits (SOC 2, ISO 27001).
{
"eventSource": "athena.amazonaws.com",
"eventName": "StartQueryExecution",
"requestParameters": {
"queryString": "SELECT * FROM my_table WHERE ...",
"resultConfiguration": {
"outputLocation": "s3://results-bucket/"
}
},
"sourceIPAddress": "10.0.0.5",
"userIdentity": {
"principalId": "AIDAI...",
"arn": "arn:aws:iam::123456789012:user/superset"
}
}
In Superset, enable audit logging for all user actions (chart creation, dashboard access, etc.). This is essential for compliance frameworks.
Access Control
Superset has row-level security (RLS) and column-level security (CLS). Use these to restrict data access:
- Row-Level Security: Only show a user data for their own tenant or region.
- Column-Level Security: Hide PII columns (email, phone) from non-admin users.
Example: A SaaS company with 100 tenants can use RLS so each tenant sees only their data, even though all data is in the same Athena table.
Encryption
- At Rest: Enable S3 server-side encryption (SSE-S3 or SSE-KMS). Athena supports both.
- In Transit: Athena-to-Superset communication should use TLS. Superset-to-Redis should use TLS if Redis is external.
- Keys: Use AWS KMS for key management. Rotate keys annually.
Troubleshooting Common Issues
Issue 1: “No Credentials Provided”
Symptom: Superset can’t connect to Athena. Error: “No AWS credentials found.”
Cause: Superset doesn’t have IAM credentials to assume the Athena role.
Fix:
- If Superset runs on EC2, attach an IAM instance profile with Athena permissions.
- If Superset runs on EKS, use IRSA (IAM Roles for Service Accounts) to bind the pod to an IAM role.
- If using Preset, provide AWS credentials in the database connection settings.
Issue 2: “Query Timed Out”
Symptom: Superset shows “Query timed out” after 30 seconds.
Cause: Athena query is slow (scanning too much data) or stuck.
Fix:
- Check Athena query history in the AWS console. Is the query still running or did it fail?
- If running, check data scanned. If >500GB, optimize with partitioning or aggregation.
- Increase Superset’s query timeout:
SUPERSET_TIMEOUT = 300(5 minutes). - Check if Athena is throttled (too many concurrent queries). Reduce concurrent connections.
Issue 3: “Table Not Found”
Symptom: Superset lists tables in the database, but when you try to query a table, it says “Table not found.”
Cause: The Glue Data Catalog is out of sync with S3. The table exists in S3 but not in Glue.
Fix:
- In Athena console, run
SHOW TABLES;to see what Glue knows about. - Run
MSCK REPAIR TABLE my_table;to rescan S3 and update Glue. - If the table still doesn’t appear, manually create it in Glue with the correct S3 path and schema.
Issue 4: “Permission Denied”
Symptom: Superset can connect to Athena, but queries fail with “Permission denied” or “Access denied.”
Cause: The IAM role lacks S3 permissions for the data bucket.
Fix:
- Check the IAM role’s S3 permissions. It needs
s3:GetObjectands3:ListBucketon both the data bucket and the results bucket. - Check S3 bucket policies. If the bucket has a restrictive policy, the IAM role might be blocked.
- Check Glue permissions. The role needs
glue:GetTable,glue:GetDatabase, etc.
Issue 5: “Stale Cache”
Symptom: A dashboard shows old data even though the underlying table was updated.
Cause: Superset’s cache hasn’t expired or hasn’t been invalidated.
Fix:
- Manually clear the cache:
from superset.extensions import cache cache.clear() - Or use Superset’s API:
curl -X DELETE http://superset:8088/api/v1/cache - Or reduce
CACHE_DEFAULT_TIMEOUTto a shorter interval (e.g., 5 minutes).
Issue 6: “High Costs”
Symptom: Athena charges are higher than expected.
Cause: Queries are scanning too much data (missing partition pruning, uncompressed data, or inefficient SQL).
Fix:
- Review Athena query history. Sort by “Data Scanned” (highest first).
- For the top 10 expensive queries, check if they have partition filters. If not, add them.
- Check if data is in Parquet. If CSV, migrate.
- Consider pre-aggregating data in your pipeline (dbt, Spark) instead of querying raw data in Superset.
Integration with PADISO Services
If you’re building this architecture from scratch or scaling an existing Superset + Athena deployment, PADISO’s Platform Design & Engineering service can accelerate your timeline.
Our team has deployed Superset + Athena stacks across Platform Development in Australia and internationally. We handle:
- Architecture design: Optimising S3 partitioning, choosing Parquet vs. ORC, sizing Superset infrastructure.
- Performance tuning: Profiling expensive queries, implementing caching strategies, cost optimisation.
- Security hardening: IAM policies, network isolation, audit logging, compliance readiness for SOC 2 or ISO 27001.
- Operational handoff: Runbooks, monitoring, alerting, and training for your team.
For teams in Platform Development in New York, Platform Development in Chicago, or Platform Development in Austin, we also integrate Superset + Athena with existing data platforms (dbt, Fivetran, Spark) and compliance frameworks.
If you’re running this stack at scale or facing specific operational challenges, book a call to discuss your architecture.
Next Steps
Immediate Actions
- Assess your data: Is it in S3? Is it partitioned? Is it in Parquet or CSV? These answers determine your starting point.
- Set up Glue Data Catalog: If you’re not using dbt or Spark, manually create Glue tables pointing to your S3 data.
- Deploy Superset: Use Docker or Kubernetes. Start small (single instance) and scale horizontally if needed.
- Connect to Athena: Follow the connection steps above. Test with a simple query first.
- Build a pilot dashboard: Create 2–3 charts on a small table to validate performance and cost.
30-Day Plan
- Week 1: Superset + Athena connection, pilot dashboard, cost baseline.
- Week 2: Optimise data format (migrate to Parquet if needed), implement partition pruning.
- Week 3: Set up Redis caching, monitoring, and alerting.
- Week 4: Harden security (IAM, encryption, audit logging), document runbooks, train your team.
Long-Term Roadmap
- Months 2–3: Migrate all dashboards from legacy BI tools to Superset. Decommission per-seat licenses.
- Months 4–6: Integrate Superset with your data pipeline (dbt, Fivetran, Spark). Automate Glue catalog updates.
- Months 6+: Explore advanced features (embedding, alerts, custom plugins) and optimise costs further.
Resources
For detailed setup guidance, refer to:
- Apache Superset’s official documentation
- Amazon Athena’s user guide
- Preset’s documentation for managed Superset
- Preset’s tutorial on Superset + Athena
- StrongDM’s guide to Athena + Superset connections
Getting Help
If you hit roadblocks or need architecture review, PADISO’s Services team specialises in data platform engineering. We’ve deployed this exact stack across dozens of organisations—financial services, SaaS, media, and government. Our experience spans Platform Development in Dallas, Platform Development in Wellington, and Platform Development in Ottawa, so we understand regional compliance and operational requirements.
Reach out if you want to discuss your specific architecture, optimise an existing deployment, or accelerate your analytics platform modernisation.
Conclusion
Apache Superset + Amazon Athena is a proven, cost-effective stack for analytics on data lakes. It eliminates per-seat BI licensing, avoids data warehouse operational overhead, and scales with your data.
The architecture is straightforward: S3 data → Glue catalog → Athena queries ← Superset dashboards. The complexity lies in optimisation: partition pruning, file formats, query patterns, caching, cost control, and compliance.
This guide covered production patterns from dozens of D23.io customer deployments. Follow these patterns—partition aggressively, use Parquet, implement caching, monitor costs, harden security—and you’ll have a performant, cost-efficient analytics platform.
Start small, test thoroughly, and scale incrementally. Within 30 days, you should have dashboards running. Within 6 months, you should have decommissioned legacy BI tools and recouped your investment.