Table of Contents
- Why Superset + Pinot?
- Core Architecture Patterns
- Connection and Query Patterns
- Performance Tuning and Caching
- Operational Quirks and Gotchas
- Security, Compliance, and Multi-Tenancy
- Scaling Considerations
- Real-World Deployment Examples
- Next Steps and Vendor Partnership
Why Superset + Pinot?
Apache Superset and Apache Pinot have emerged as a powerful pairing for organisations building analytics platforms that demand both sub-second query latency and operational simplicity. Unlike traditional data warehouses, this combination excels at real-time analytics workloads where you need to serve thousands of users querying massive event streams without breaking budget or engineering capacity.
The case for this architecture is straightforward: Apache Pinot handles the hard problem of distributed analytical queries across petabyte-scale datasets, whilst Apache Superset provides a lightweight, SQL-native front-end that doesn’t require custom development for every dashboard or ad-hoc query. For teams at PADISO, we’ve seen this combination reduce per-seat BI costs by 60–70% whilst simultaneously improving query responsiveness and operational transparency.
What makes this pairing particularly valuable is that neither component forces you into a proprietary ecosystem. Both are Apache-licensed, widely deployed in production (Uber, Roku, LinkedIn, and others run Pinot at scale), and designed to integrate cleanly with existing data pipelines. Superset’s metadata layer and permission model sit naturally on top of Pinot’s distributed query engine, and the operational surface area—monitoring, scaling, failover—remains predictable.
However, this architecture is not a drop-in replacement for every analytics use case. Pinot excels at time-series and event-stream analytics with high cardinality dimensions and a strong preference for real-time ingestion. If your workload is primarily batch ETL with complex multi-table joins and slowly-changing dimensions, you may find ClickHouse or a traditional cloud data warehouse more ergonomic. The decision tree is simple: if you need sub-second latency on high-cardinality event streams and your team can operate a distributed cluster, Superset + Pinot is worth the investment.
Core Architecture Patterns
The Three-Layer Model
A production Superset + Pinot deployment typically follows a three-layer architecture: ingestion, storage and query, and presentation.
Ingestion Layer: Data flows into Pinot via Kafka, a message queue, or batch uploads. Pinot’s real-time table type consumes from Kafka and builds segments in memory, whilst batch tables ingest pre-built segments. This dual-mode ingestion is where Pinot differs fundamentally from row-oriented databases—segments are immutable, columnar, and indexed before query time arrives.
Storage and Query Layer: Pinot brokers and servers form a distributed cluster. Brokers receive SQL queries from Superset (or any JDBC/ODBC client), scatter the query to the appropriate servers (which hold segment replicas), gather results, and merge them. Servers are stateless and can be added or removed without coordination. The Apache Pinot architecture documentation covers this in detail, but the key insight is that query routing is deterministic—the broker knows which servers hold which segments and can route intelligently based on segment metadata.
Presentation Layer: Superset connects to Pinot via the JDBC driver or native Pinot connector. Superset’s metadata layer (datasets, columns, metrics) maps to Pinot tables and columns. Dashboards, charts, and ad-hoc queries are rendered in Superset and execute against Pinot via SQL. Superset handles caching, permission enforcement, and the UI—Pinot handles the query execution.
This separation of concerns is critical. Superset is stateless and horizontally scalable; Pinot handles all the distributed query complexity. A single Superset instance can serve thousands of users querying a multi-petabyte Pinot cluster without performance degradation, because query execution is pushed entirely to Pinot.
Cluster Topology for Production
For production deployments across Australia and internationally, we typically recommend a minimum topology of:
- 3+ Pinot controllers (odd number for quorum; typically t3.medium or equivalent)
- 3+ Pinot brokers (load-balanced, t3.large or equivalent)
- 6+ Pinot servers (depending on data volume; r5.2xlarge or larger for in-memory index)
- 1+ Superset instance (t3.xlarge or equivalent, with Redis for caching)
- Kafka cluster (3+ brokers, m5.large or equivalent)
- ZooKeeper (3+ nodes, t3.small or equivalent)
For teams in Sydney, Melbourne, or other Australian cities, this topology can run on AWS, Azure, or GCP in a single availability zone or spread across multiple zones for resilience. The key constraint is network latency between brokers and servers—keep them in the same region or VPC.
Smaller deployments (proof-of-concept or <100 GB of data) can collapse this topology: run a single controller, broker, and 1–2 servers on the same machine, with Kafka optional. The operational burden remains similar, so the decision to scale up is primarily driven by data volume and query concurrency, not architectural complexity.
Connection and Query Patterns
Connecting Superset to Pinot
Superset’s native Pinot connector is the recommended path. The official Superset documentation for Apache Pinot covers the setup, but the essence is straightforward:
- Install Superset and the Pinot JDBC driver (or use the native connector).
- Add a database connection in Superset:
pinot://pinot-broker:8099. - Sync tables from Pinot into Superset’s metadata layer.
- Create datasets, columns, and metrics in Superset.
- Build dashboards and charts.
One operational detail: Superset queries Pinot via JDBC, which means the Superset instance must have network access to a Pinot broker on port 8099 (or your configured port). In a multi-region setup, you’ll typically run a Superset instance in each region and have it connect to a local Pinot broker, or use a load balancer to route Superset connections to the nearest broker. Cross-region queries are possible but incur latency penalties—avoid them for interactive dashboards.
SQL Dialect and Query Compatibility
Pinot’s SQL dialect is SQL-92 compatible with extensions for real-time analytics. Most standard SQL operations work: SELECT, WHERE, GROUP BY, ORDER BY, LIMIT, JOIN (single-level, not nested), and aggregations (COUNT, SUM, AVG, MIN, MAX, PERCENTILE, etc.).
However, Pinot has quirks:
- No subqueries: You cannot use subqueries in FROM or WHERE clauses. Superset’s query generation is aware of this and avoids subqueries, but custom SQL queries that include subqueries will fail.
- Limited JOIN support: Pinot supports single-level joins between a fact table and a dimension table. Multi-table joins or self-joins are not supported. If your dashboard requires complex joins, you’ll need to denormalise your data into a single Pinot table or use a staging layer (e.g., ClickHouse or Postgres) to pre-join and feed Pinot.
- Aggregation-only GROUP BY: When you GROUP BY, all non-aggregated columns must be GROUP BY expressions. Pinot doesn’t support the implicit aggregation that some databases allow.
- Timestamp handling: Pinot stores timestamps as milliseconds since epoch (or configurable units). Superset and Pinot handle timezone conversion, but be explicit about timezones in your schema and queries.
For teams building custom data platforms in Sydney, Canberra, or Melbourne, these constraints are manageable if you design your data model upfront. The key is to denormalise aggressively—Pinot is optimised for flat, wide tables with high cardinality dimensions, not normalised schemas.
Query Execution Flow
When a user clicks a chart in Superset, the following happens:
- Superset generates SQL based on the chart definition and user selections.
- Superset sends the SQL to a Pinot broker via JDBC.
- The broker parses the SQL and determines which segments (partitions) are relevant based on the WHERE clause and segment metadata.
- The broker scatters the query to all relevant servers in parallel.
- Each server executes the query against its local segments using Pinot’s query engine (which is columnar and vectorised).
- Servers return partial results to the broker.
- The broker merges partial results and returns the final result to Superset.
- Superset renders the result in the UI.
The entire flow—from Superset SQL to final result—typically takes 100–500 ms for well-tuned queries on large datasets. This is orders of magnitude faster than traditional row-oriented databases querying the same data volume, because Pinot’s columnar format and segment-level pruning eliminate the need to scan billions of rows.
For advanced use cases, Uber’s engineering blog post on rebuilding their Pinot query architecture provides insights into query optimisation at scale. Uber’s learnings around predicate pushdown, segment pruning, and broker-level caching directly apply to production Superset + Pinot deployments.
Performance Tuning and Caching
Pinot-Level Caching
Pinot implements segment-level caching and query result caching. Understanding both is essential for performance.
Segment Caching: Each Pinot server loads segments into memory (or mmap if memory is constrained). Segments are immutable, so once loaded, they don’t change. Pinot uses a columnar in-memory format optimised for analytical queries. If your dataset is smaller than available server memory, segments stay hot and query latency is sub-100 ms. If your dataset exceeds memory, Pinot can spill to disk or use mmap, but latency increases.
For production deployments, the rule of thumb is: allocate 2–3x the raw data size in server memory to account for index overhead and query buffers. If you have 500 GB of raw data, plan for 1.5–2 TB of server memory across your cluster.
Query Result Caching: Pinot can cache query results at the broker level. This is useful for dashboards with static or slowly-changing data. However, result caching is most effective when queries are identical or nearly identical. Superset’s caching layer (discussed below) is more fine-grained and typically more effective than Pinot’s result cache for interactive dashboards.
Superset-Level Caching
Superset uses Redis to cache query results. When a user views a dashboard or chart, Superset checks Redis for a cached result before querying Pinot. If a cached result exists and hasn’t expired, Superset returns the cached result instantly.
Cache keys are based on the query SQL and the user’s permissions. Two users with different row-level security filters will not share a cache entry. This is important for multi-tenant deployments.
Caching strategy:
- Real-time dashboards: Cache TTL of 10–30 seconds. Users see slightly stale data but queries are fast.
- Periodic reports: Cache TTL of 1–24 hours. Queries are minimal, data freshness is acceptable.
- Ad-hoc queries: No caching. Each query hits Pinot.
For teams building multi-tenant SaaS platforms in New York, Toronto, or Dallas, Superset’s cache isolation per user is critical. You must ensure that row-level security filters are included in the cache key, otherwise users can see each other’s data.
Index Configuration
Pinot’s indexing strategy directly impacts query performance. The most common indexes are:
- Inverted Index: Used for equality and range queries on a column. Pinot builds inverted indexes by default for all columns unless disabled.
- Range Index: Used for range queries (>, <, BETWEEN). Faster than inverted index for range queries but larger on disk.
- Text Index: Used for full-text search on string columns.
- Star-Tree Index: A pre-aggregated index for GROUP BY queries. Useful for high-cardinality dimensions.
For a typical analytics use case (time-series events with dimensions), the default inverted index on all columns is usually sufficient. If you have high-cardinality dimensions (e.g., user ID with millions of unique values), consider a Star-Tree index for common GROUP BY queries.
Index configuration is specified in the Pinot schema. Example schema for an event table:
{
"tableName": "events",
"segmentsConfig": {
"timeColumnName": "ts",
"timeType": "MILLISECONDS",
"retentionTimeUnit": "DAYS",
"retentionTimeValue": 90
},
"fieldConfigList": [
{
"name": "user_id",
"dataType": "STRING",
"indexes": {"inverted": {}}
},
{
"name": "event_type",
"dataType": "STRING",
"indexes": {"inverted": {}}
},
{
"name": "revenue",
"dataType": "DOUBLE",
"indexes": {"range": {}}
}
]
}
Query Performance Profiling
Pinot’s query execution plan can be inspected via the EXPLAIN command. Running EXPLAIN SELECT ... shows the query plan, including segment pruning decisions and index usage. This is invaluable for debugging slow queries.
Superset also exposes query execution time and can log raw SQL queries. Enable query logging in Superset to capture slow queries and feed them back into Pinot tuning.
Operational Quirks and Gotchas
Segment Management and Retention
Pinot segments are immutable, time-partitioned chunks of data. Each segment covers a time range (e.g., one day or one hour). As new data arrives, Pinot creates new segments. Old segments can be deleted based on retention policy.
Key operational considerations:
- Segment naming: Segments are named based on table name, segment type (OFFLINE or REALTIME), and time range. Example:
events__REALTIME__2024_01_15__2024_01_16. Ensure your naming convention is predictable for operational tooling. - Segment size: Aim for segments of 100 MB to 1 GB on disk. Segments smaller than 100 MB are inefficient (too much overhead); segments larger than 1 GB can cause GC pauses during merge operations.
- Segment merging: Pinot periodically merges small segments into larger ones. This is an I/O-intensive operation. Schedule merges during low-traffic windows.
- Retention policy: Define a retention policy (e.g., keep 90 days of data). Pinot automatically deletes old segments. Be explicit about retention to avoid surprise data loss.
Controller and Broker Failover
Pinot’s controller is a single point of failure for cluster metadata. If the controller goes down, the cluster continues to serve queries, but you cannot add new tables, segments, or brokers. Always run 3+ controllers in production.
Brokers are stateless and can be added or removed without impact. If a broker fails, queries are automatically routed to other brokers. Load balancers should perform health checks on brokers (HTTP GET /health) and remove unhealthy brokers from the pool.
Servers hold segment replicas. If a server fails, Pinot automatically re-replicates the affected segments to other servers. This process can take minutes to hours depending on segment size and network bandwidth. During re-replication, query latency may increase because fewer server replicas are available.
Kafka Offset Management
For real-time tables, Pinot consumes from Kafka and manages offsets. If Pinot crashes or is restarted, it resumes from the last committed offset. However, Pinot’s offset management is simple and does not guarantee exactly-once semantics—you may see duplicate events or missed events in edge cases.
For use cases requiring exactly-once semantics (e.g., financial transactions), consider pre-processing data through a deduplication layer before sending to Pinot, or accept the trade-off of eventual consistency.
Schema Evolution
Changing a Pinot schema (adding or removing columns) requires careful planning. Pinot allows schema changes, but:
- Adding a column: New segments will have the column; old segments will not. Queries that reference the new column must handle missing values gracefully.
- Removing a column: Old segments still contain the column data; new segments do not. Queries that reference the removed column will fail on new segments.
- Changing a column type: Not supported. You must create a new column with the new type.
Best practice: version your schema and plan schema changes during maintenance windows. Use a staging table to test schema changes before applying to production.
Multi-Tenancy and Data Isolation
Pinot does not have built-in multi-tenancy support. If you need to serve multiple customers from a single Pinot cluster, you must implement data isolation at the application layer.
Common patterns:
- Single table with tenant_id column: All customers’ data in one table, with a
tenant_idcolumn. Queries filter bytenant_id. Superset’s row-level security can enforce tenant isolation. - Separate tables per tenant: Each customer has their own Pinot table. More operational overhead but stronger isolation.
- Separate clusters per tenant: Each customer has their own Pinot cluster. Most expensive but simplest isolation.
For PADISO’s multi-tenant SaaS platform engineering work, we typically recommend the first pattern (single table with tenant_id) for early-stage deployments, then migrate to separate tables as the number of tenants grows.
Security, Compliance, and Multi-Tenancy
Authentication and Authorisation
Pinot supports LDAP, Kerberos, and basic authentication. Superset has its own authentication layer (LDAP, OAuth, SAML, etc.). For a production deployment, configure both:
- Pinot authentication: Restrict who can connect to Pinot brokers and controllers. Use LDAP or Kerberos if available.
- Superset authentication: Restrict who can log into Superset. Use SSO (LDAP, OAuth, SAML) for ease of management.
- Row-level security (RLS): In Superset, configure RLS rules to restrict which rows each user can see. RLS filters are applied at query time and included in the cache key.
For teams pursuing SOC 2 or ISO 27001 compliance, both Pinot and Superset must be configured for audit logging. Enable query logging in both systems and store logs in a tamper-proof location (e.g., S3 with versioning and MFA delete enabled).
Network Security
Pinot brokers and servers communicate internally over a gossip protocol. Ensure this traffic is encrypted (mTLS) or isolated to a private network. Superset connects to Pinot brokers over JDBC—encrypt this connection using TLS.
For government and defence deployments in Canberra or Washington, D.C., network isolation is critical. Run Pinot and Superset in a private VPC with no internet access. Use AWS PrivateLink, Azure Private Link, or similar for secure access from user networks.
Data Encryption
Pinot does not encrypt data at rest by default. For compliance-sensitive deployments, enable encryption at rest using the underlying storage system (e.g., EBS encryption on AWS, encryption at rest in Kubernetes). Encryption in transit (TLS) should be enabled for all network communication.
Audit Logging and Compliance
For SOC 2 and ISO 27001 compliance via Vanta, configure audit logging in both Pinot and Superset. Key events to log:
- User authentication and authorisation events
- Query execution (who ran what query, when, against which data)
- Schema changes (table creation, column addition, etc.)
- Cluster topology changes (broker/server additions, failures)
Store audit logs in a centralised location (e.g., S3, CloudWatch, Splunk) and configure alerts for suspicious activity. For PADISO’s compliance work, we typically integrate Pinot and Superset audit logs into a SIEM system for real-time monitoring.
Scaling Considerations
Horizontal Scaling
Both Pinot and Superset scale horizontally. To scale:
- Superset: Add more Superset instances behind a load balancer. Superset is stateless (state is in Redis and the database), so new instances pick up traffic immediately.
- Pinot brokers: Add more brokers behind a load balancer. New brokers are automatically discovered by clients and receive traffic.
- Pinot servers: Add more servers to the cluster. Pinot automatically re-balances segment replicas across the new servers. This process takes time (proportional to data volume), so add servers during low-traffic windows.
Vertical Scaling
If adding more instances is not feasible, scale vertically:
- Superset: Increase CPU and memory on the Superset instance. Superset is memory-hungry when executing large queries, so prioritise memory.
- Pinot brokers: Increase CPU and memory. Brokers are primarily CPU-bound (query planning and result merging).
- Pinot servers: Increase memory primarily. Servers need enough memory to hold segment indexes. Increase CPU secondarily.
Data Volume Scaling
As data volume grows, segment size and count increase. Monitor segment metrics:
- Segment count: If segment count exceeds 1000 per server, consider increasing segment size (e.g., from hourly to daily segments).
- Segment size: If segments exceed 1 GB, consider decreasing segment size (e.g., from daily to hourly).
- Server memory usage: If server memory usage exceeds 80%, add more servers or increase server memory.
Query Concurrency Scaling
As query concurrency increases, monitor broker and server CPU usage. If CPU is saturated:
- Add more brokers to distribute query planning load.
- Add more servers to distribute query execution load.
- Tune query execution (caching, index configuration, segment pruning).
For teams building large-scale data platforms in San Francisco, Chicago, or Austin, we typically see linear scaling up to 10,000+ concurrent queries per second on a well-tuned cluster.
Real-World Deployment Examples
Example 1: Real-Time Event Analytics
Use case: A mobile app company wants to track user events (app opens, clicks, purchases) in real time and build dashboards for product and marketing teams.
Architecture:
- Mobile app sends events to Kafka (via a collection service).
- Pinot consumes events from Kafka in real time and creates segments every hour.
- Superset connects to Pinot and provides dashboards for user funnels, retention, revenue trends.
- Data is retained for 90 days; older data is archived to S3.
Performance: Queries return results in 100–300 ms. Dashboards refresh every 30 seconds. 5,000+ concurrent users can be supported on a 6-server Pinot cluster.
Operational overhead: ~4 FTE for Pinot cluster management, Kafka monitoring, and dashboard maintenance.
This pattern is common across PADISO’s platform engineering work in Australia and the US.
Example 2: Ad-Tech Click Analytics
Use case: An ad-tech company needs to serve real-time dashboards showing click-through rates, impressions, and revenue by campaign, geography, and device.
Architecture:
- Ad servers send click events to Kafka.
- Pinot ingests events in real time with a 5-minute lag (to allow late arrivals).
- Superset provides dashboards with filters for date range, campaign, geography, device.
- Row-level security ensures each advertiser sees only their own data.
- Data is retained for 1 year; older data is moved to a data warehouse for long-term analysis.
Performance: Queries return results in 50–200 ms. Dashboards refresh every 5 minutes. 10,000+ concurrent queries per second are supported.
Operational overhead: ~6 FTE for cluster management, data quality monitoring, and SLA enforcement.
This is a pattern we’ve deployed for financial services and media platforms in New York and Toronto.
Example 3: Roku’s Advertising Analytics
Roku’s engineering team published a case study on their Pinot and Trino architecture. They use Pinot for real-time ad analytics and Trino for complex multi-table queries. This hybrid approach is worth studying: Pinot handles the hot path (real-time dashboards), whilst Trino handles the cold path (complex analysis).
For teams building large-scale platforms, this pattern is recommended if you have both real-time and batch analytical workloads.
Next Steps and Vendor Partnership
Evaluating Superset + Pinot for Your Use Case
Before committing to this architecture, evaluate whether it fits your workload:
Superset + Pinot is a good fit if:
- You have time-series or event-stream data with high cardinality dimensions.
- You need sub-second query latency on petabyte-scale datasets.
- Your queries are primarily aggregations (GROUP BY, COUNT, SUM) with filters.
- You have high query concurrency (1000+ concurrent users).
- You want to avoid per-seat BI licensing costs.
Superset + Pinot is not a good fit if:
- Your data is primarily normalised with complex multi-table joins.
- You need ACID transactions or row-level updates.
- Your queries are primarily point lookups or single-row retrievals.
- Your dataset is smaller than 10 GB (operational overhead is not justified).
Getting Started with D23.io and PADISO
If you’re ready to build a production Superset + Pinot deployment, PADISO’s platform engineering team can help. We’ve deployed this architecture for clients across Australia, the US, and internationally. Our approach:
- Assessment: We evaluate your data volume, query patterns, and compliance requirements. This typically takes 1–2 weeks.
- Design: We design a cluster topology, schema, and ingestion pipeline tailored to your workload. This takes 2–4 weeks.
- Proof of Concept: We build a minimal viable cluster with sample data and dashboards. This takes 2–4 weeks.
- Production Deployment: We deploy to production with monitoring, alerting, and runbooks. This takes 4–8 weeks.
- Knowledge Transfer: We train your team on operations, troubleshooting, and scaling. This is ongoing.
Our clients in Sydney, Melbourne, New York, and other cities have reduced their analytics infrastructure costs by 50–70% and improved query latency by 10–100x.
Learning Resources
For deeper learning, we recommend:
- Apache Pinot’s official documentation for architecture and operations.
- Superset’s Pinot documentation for connection and setup.
- A practical hands-on guide from StarTree on visualising Pinot data in Superset for step-by-step setup.
- Acceldata’s introduction to Pinot architecture for distributed systems concepts.
- Apache Pinot’s server and broker architecture video for visual learners.
Compliance and Governance
If you’re pursuing SOC 2 or ISO 27001 compliance, Superset + Pinot can be configured to meet audit requirements. Key considerations:
- Enable audit logging in both Pinot and Superset.
- Implement row-level security in Superset to enforce data access policies.
- Encrypt data in transit (TLS) and at rest (EBS encryption, etc.).
- Implement strong authentication (LDAP, Kerberos, OAuth).
- Monitor and alert on suspicious activity.
PADISO’s security audit team can guide you through compliance configuration and help you pass your audit.
Book a Consultation
If you’re building a data platform and want to explore Superset + Pinot, we’d love to help. PADISO offers platform engineering services across Australia and internationally. We can assess your workload, design a reference architecture, and help you deploy to production.
- Platform Development in Sydney
- Platform Development in Melbourne
- Platform Development in Canberra
- Platform Development in New York
- Platform Development in San Francisco
- Platform Development in Chicago
- Platform Development in Austin
- Platform Development in Dallas
- Platform Development in Washington, D.C.
- Platform Development in Toronto
Or explore our full services menu and case studies to see how we’ve helped other companies build production analytics platforms.
Summary
Apache Superset + Apache Pinot is a powerful, operationally efficient architecture for real-time analytics at scale. By combining Pinot’s distributed columnar query engine with Superset’s lightweight SQL-native front-end, teams can build analytics platforms that serve thousands of users with sub-second latency whilst cutting per-seat BI costs by 60–70%.
The key to success is understanding the operational quirks: segment management, schema evolution, multi-tenancy patterns, and compliance configuration. With proper planning and the right partner, you can deploy this architecture to production in 8–12 weeks and scale it to petabytes of data and 10,000+ concurrent queries per second.
If you’re ready to build, PADISO is here to help.