AU Open Banking Data on D23.io: A Reference Architecture
Complete guide to building AU Open Banking data pipelines on D23.io lakehouse. CDR compliance, Superset analytics, Claude agents for fintechs.
Table of Contents
- Introduction: Why AU Open Banking Matters
- Understanding Australia’s Open Banking Framework
- D23.io as Your Data Foundation
- Building the Ingestion Pipeline
- Data Governance and CDR Compliance
- Semantic Layer and Apache Superset
- Agentic AI with Claude for Analytics
- Security, SOC 2, and Audit Readiness
- Deployment Patterns for AU Fintechs
- Implementation Roadmap and Next Steps
Introduction: Why AU Open Banking Matters {#introduction}
Australia’s Consumer Data Right (CDR) framework has fundamentally reshaped how fintech companies access and leverage banking data. Since the rollout began in 2020, accredited data recipients can now ingest transaction history, account information, and product details directly from major Australian banks via standardised APIs. This represents a genuine competitive advantage for fintechs, but only if you architect your data stack correctly from day one.
The challenge isn’t access—it’s integration, governance, and analytics at scale. Most AU fintechs ingest Open Banking data into fragmented systems: some data lands in a data warehouse, some in operational databases, and critical analytics queries get stuck waiting for engineering resources. The result? Delayed insights, compliance risk, and engineering teams stretched thin.
This guide walks you through a proven reference architecture that centralises AU Open Banking data on D23.io’s lakehouse platform, surfaces it through Apache Superset dashboards, and enables non-technical teams to query it via Claude agents. We’ve deployed this pattern for AU CDR-accredited fintechs ingesting Open Banking data into D23.io’s lakehouse and exposing analytics through Superset and Claude agents—and it works.
If you’re a founder or operator at a seed-to-Series-B fintech, or an enterprise team modernising your data stack, this architecture will save you 6–12 months of engineering time and eliminate entire classes of compliance risk.
Understanding Australia’s Open Banking Framework {#understanding-framework}
The CDR Mandate and Data Holders
Australia’s Consumer Data Right (CDR) is a legislative framework, not just a technical standard. The Australian Banking Association’s overview of Open Banking explains that all authorised deposit-taking institutions (ADIs)—the big four banks and most regional players—are mandated to share customer data with accredited data recipients via secure APIs.
The framework operates in three tiers:
Tier 1: Basic Banking Data includes account identifiers, balances, and transaction history. This is the most widely available and the highest-value dataset for analytics and lending decisions.
Tier 2: Product Reference Data covers product features, fees, rates, and eligibility criteria. This powers comparison engines and personalisation engines.
Tier 3: Investment Data (coming later) will include superannuation and investment holdings, broadening the data surface significantly.
For most fintechs, Tier 1 is your immediate focus. It’s stable, widely available, and generates immediate ROI through credit scoring, affordability assessments, and customer insights.
Accreditation and Data Recipient Status
Accreditation is mandatory. You cannot legally ingest Open Banking data without being accredited as a data recipient by the Australian Competition and Consumer Commission (ACCC), with the Office of the Australian Information Commissioner (OAIC) overseeing privacy obligations. The accreditation process typically takes 4–8 weeks and requires you to demonstrate:
- Security controls (SOC 2 Type II or equivalent)
- Privacy impact assessment
- Data handling procedures
- Incident response capability
This is where understanding the CDR framework via Stripe’s detailed explanation becomes essential. Stripe’s guide breaks down data holders, recipients, and secure API data flows in practical terms. Many fintechs underestimate the compliance overhead; building your architecture with compliance in mind from the start prevents expensive refactoring later.
API Standards and Data Formats
Australian Open Banking uses RESTful APIs with OAuth 2.0 for authentication. All responses follow the Consumer Data Standards (CDS), which define exact JSON schemas for accounts, transactions, products, and balances.
This standardisation is a gift. It means you’re not reverse-engineering proprietary APIs or building custom parsers for each bank. The data arrives in predictable, well-documented formats. Your ingestion pipeline can be deterministic and testable.
The Yodlee Developer Portal’s AU Open Banking documentation and the GitHub repository of Australian Open Banking Data Database examples both provide working code samples for integrating with these APIs. If you’re building in-house, these are your starting points.
D23.io as Your Data Foundation {#d23io-foundation}
Why a Lakehouse Over Traditional Data Warehouses
D23.io is an open-source lakehouse platform built on Apache Iceberg. A lakehouse combines the raw-data flexibility of a data lake with the query performance and ACID guarantees of a data warehouse. For AU Open Banking ingestion, this matters because:
Raw Data Retention: You ingest the full CDR payloads as-is, preserving every field. No schema mapping decisions forced upfront. This is critical for compliance audits—you have an immutable record of what you received.
Schema Evolution: As the CDR framework evolves (new fields, new endpoints), your lakehouse adapts without pipeline rewrites. Traditional warehouses force you to alter tables; Iceberg handles schema changes gracefully.
Cost Efficiency: You’re storing raw Parquet files in object storage (S3-compatible), not paying per-query fees or maintaining expensive compute clusters. Costs scale linearly with data volume, not query complexity.
Time-Travel Queries: Iceberg’s versioning lets you query data as it existed on any prior date. This is invaluable for regulatory inquiries—“show me what we had on this customer’s account on 15 March 2024.”
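As a concrete illustration, here is a minimal time-travel sketch, assuming a Spark SQL session configured against the lakehouse's Iceberg catalog; the table name, timestamp, and filter are illustrative:
from pyspark.sql import SparkSession

# Hypothetical Spark session wired to the lakehouse's Iceberg catalog.
spark = SparkSession.builder.appName("cdr-time-travel").getOrCreate()

# Read raw.transactions exactly as it existed on 15 March 2024.
as_of = spark.sql(
    "SELECT * FROM raw.transactions TIMESTAMP AS OF '2024-03-15 00:00:00'"
)
as_of.filter("account_id = 'acc-123'").show()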
For a seed-stage fintech, D23.io’s open-source model means you’re not locked into a vendor. You can self-host it in your own AWS account or use the managed service. For enterprise teams, it integrates seamlessly with existing data stacks.
Architecture: Ingestion → Raw → Curated
The proven pattern uses three layers:
Raw Layer: Ingest CDR API responses directly into D23.io tables, one table per endpoint (accounts, transactions, products, balances). No transformation. This is your source of truth for compliance.
Curated Layer: Build semantic tables from the raw layer—denormalised views optimised for analytics. Flatten nested JSON, join accounts to transactions, compute running balances. This is where you apply business logic.
Consumption Layer: Expose curated tables to Superset and AI agents. This is the public API for analytics.
This three-layer approach means:
- Engineers own the raw and curated layers; data analysts and product teams own consumption.
- You can replay the entire pipeline from raw data if a bug is discovered in transformation logic.
- Audit trails are clean and traceable.
- Governance rules are applied consistently.
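To make the layering concrete, here is a minimal raw-to-curated sketch, assuming Spark SQL over the Iceberg catalog and a raw table that stores each CDR account payload in a struct column named payload; every name here is illustrative:
from pyspark.sql import SparkSession

# Hypothetical Spark session over the lakehouse catalog.
spark = SparkSession.builder.appName("curate-accounts").getOrCreate()

# Flatten the nested CDR payload into an analytics-friendly curated table.
spark.sql("""
    CREATE OR REPLACE TABLE curated.accounts AS
    SELECT
        customer_id,
        payload.accountId     AS account_id,
        payload.displayName   AS display_name,
        payload.accountStatus AS account_status,
        _ingestion_timestamp
    FROM raw.accounts
""")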
Integration with Existing Systems
D23.io doesn’t replace your operational database. It complements it. Your fintech’s core ledger, customer accounts, and transaction engine stay in your primary database (PostgreSQL, MySQL, whatever you use). D23.io ingests the external CDR data and provides a queryable archive.
You then join internal data with external CDR data in the curated layer. For example:
SELECT
  customer_id,
  internal_account_id,
  external_bank_account_id,
  external_bank_balance,
  internal_account_balance,
  (external_bank_balance - internal_account_balance) AS variance
FROM curated.account_reconciliation
WHERE ABS(external_bank_balance - internal_account_balance) > 0.01
This kind of query—comparing what you know about a customer’s external bank accounts to what they’ve told you internally—powers fraud detection, affordability assessment, and financial health scoring.
Building the Ingestion Pipeline {#ingestion-pipeline}
API Authentication and Data Pulls
Every Australian bank exposes Open Banking data via OAuth 2.0. Your fintech app redirects the customer to the bank’s login, they authenticate, and they grant your app permission to access their data. The bank returns an access token and refresh token.
You then use these tokens to call the bank’s Open Banking API endpoints:
- GET /banking/accounts – List all accounts
- GET /banking/accounts/{accountId}/transactions – Transaction history
- GET /banking/products – Available products
- GET /banking/payees – Payee list
The AWS blog post on implementing Open Banking on AWS walks through a reference architecture using Lambda functions to orchestrate these calls. The pattern is:
- Store encrypted access tokens in AWS Secrets Manager (or equivalent).
- Trigger a Lambda (or cron job) every 24 hours.
- For each customer, refresh the token and call each endpoint.
- Write responses to S3 in Parquet format.
- Iceberg automatically tracks the new files.
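A minimal sketch of one pull in that loop is shown below, assuming the requests, pyarrow, and boto3 libraries; refresh_access_token() is a hypothetical helper around Secrets Manager, the endpoint URL and headers are illustrative, and a downstream job commits the landed Parquet file to the Iceberg raw table:
import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
import requests

def pull_accounts(customer_id: str, bucket: str) -> None:
    # refresh_access_token() is your own helper that reads and refreshes the
    # OAuth tokens held in Secrets Manager for this customer.
    token = refresh_access_token(customer_id)
    resp = requests.get(
        "https://api.examplebank.com.au/cds-au/v1/banking/accounts",  # illustrative
        headers={"Authorization": f"Bearer {token}", "x-v": "2"},
        timeout=30,
    )
    resp.raise_for_status()
    accounts = resp.json()["data"]["accounts"]

    # Land the raw payload as Parquet; a downstream step commits the file
    # to the raw.accounts Iceberg table via the catalog.
    buf = io.BytesIO()
    pq.write_table(pa.Table.from_pylist(accounts), buf)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"raw/accounts/customer_id={customer_id}/accounts.parquet",
        Body=buf.getvalue(),
    )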
Handling Rate Limits and Retries
Australian banks rate-limit Open Banking APIs. Typically, you get 100–300 requests per minute per customer. If you have 10,000 customers, a naive implementation that pulls sequentially will take hours.
The solution is parallelisation with backoff:
- Use a task queue (SQS, Celery, Kafka) to enqueue one job per customer.
- Run 10–20 concurrent workers, each pulling data for one customer.
- On rate-limit (HTTP 429), exponential backoff: wait 1 second, then 2, then 4, etc.
- Log every retry and failure for audit trails.
D23.io handles the write side gracefully—Iceberg is designed for high-concurrency writes. Multiple workers can write to the same table simultaneously without conflicts.
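A minimal retry sketch for the worker side, assuming the requests library; honouring a Retry-After header is a common convention, though not every bank sends one:
import random
import time

import requests

def get_with_backoff(session: requests.Session, url: str, headers: dict,
                     max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(max_retries):
        resp = session.get(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Rate limited: honour Retry-After if present, otherwise back off exponentially.
        wait = float(resp.headers.get("Retry-After", delay))
        time.sleep(wait + random.uniform(0, 0.5))  # jitter avoids synchronised retries
        delay *= 2
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")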
Schema Management and Validation
The CDR defines exact schemas, but banks sometimes deviate (missing fields, extra fields, type mismatches). Your pipeline needs schema validation:
from jsonschema import validate, ValidationError
CDR_ACCOUNTS_SCHEMA = {
"type": "object",
"properties": {
"accountId": {"type": "string"},
"displayName": {"type": "string"},
"accountStatus": {"enum": ["OPEN", "CLOSED"]},
"accountType": {"enum": ["TRANS_AND_SAVINGS", "TERM_DEPOSIT", "INVESTMENT"]},
"creationDate": {"type": "string", "format": "date"},
},
"required": ["accountId", "displayName", "accountStatus"],
}
for account in response["data"]["accounts"]:
    try:
        validate(instance=account, schema=CDR_ACCOUNTS_SCHEMA)
    except ValidationError as e:
        log_schema_violation(customer_id, account, e)
        alert_engineering()
Every violation should trigger an alert and be logged. This catches data quality issues early and gives you evidence for bank support tickets.
Incremental Ingestion and Change Data Capture
Pulling all transactions for all customers every day is wasteful. After the initial historical load, you want incremental ingestion:
- Store the last successful pull timestamp for each customer.
- On the next run, fetch only transactions since that timestamp.
- Update the timestamp after successful write.
D23.io’s time-travel capability makes this robust. If a pull fails mid-way, you can resume from the last committed transaction without duplicates.
For mature implementations, consider Change Data Capture (CDC) patterns: ask the bank for a webhook to notify you when new transactions arrive, rather than polling. Not all banks support this yet, but it’s coming.
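A minimal sketch of the incremental bookkeeping, assuming hypothetical helpers get_last_pull()/set_last_pull() backed by a small state table, fetch_transactions_since() wrapping the transactions endpoint's time filter, and write_to_raw_layer() landing Parquet as above:
from datetime import datetime, timezone

def incremental_pull(customer_id: str) -> None:
    since = get_last_pull(customer_id)            # e.g. "2024-03-14T00:00:00Z"
    run_started = datetime.now(timezone.utc).isoformat()

    transactions = fetch_transactions_since(customer_id, oldest_time=since)
    if transactions:
        write_to_raw_layer(customer_id, transactions)

    # Advance the watermark only after the write commits, so a failed run
    # is simply retried from the previous watermark without duplicates.
    set_last_pull(customer_id, run_started)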
Data Governance and CDR Compliance {#governance}
Retention Policies and Deletion Workflows
The CDR framework requires you to delete customer data upon request within 30 days. This is non-negotiable. Your D23.io architecture must support it:
Logical Deletion: Mark rows as deleted with a deleted_at flag in both the raw and curated layers, rather than physically removing data immediately. This preserves audit trails.
ALTER TABLE raw.transactions ADD COLUMN deleted_at TIMESTAMP NULL;
ALTER TABLE curated.transactions ADD COLUMN deleted_at TIMESTAMP NULL;
UPDATE raw.transactions
SET deleted_at = NOW()
WHERE customer_id = ?;
Physical Deletion: For compliance, you’ll want to physically remove data after a retention period (e.g., 90 days). Iceberg supports this via DELETE statements:
DELETE FROM raw.transactions
WHERE customer_id = ? AND deleted_at < NOW() - INTERVAL '90 days';
Audit Logging: Every deletion must be logged with timestamp, user, and reason. This is your evidence for regulators.
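A minimal sketch tying the logical delete and the audit record together, assuming a Spark SQL session over the lakehouse and an append-only audit.deletion_log table; all names are illustrative:
from datetime import datetime, timezone

def process_deletion_request(spark, customer_id: str, requested_by: str) -> None:
    # Logical delete in both layers; physical expiry happens on the retention schedule.
    for table in ("raw.transactions", "curated.transactions"):
        spark.sql(
            f"UPDATE {table} SET deleted_at = current_timestamp() "
            f"WHERE customer_id = '{customer_id}'"
        )

    # Record who deleted what, when, and why: the evidence regulators will ask for.
    spark.sql(
        "INSERT INTO audit.deletion_log VALUES "
        f"('{customer_id}', '{requested_by}', 'CDR deletion request', "
        f"'{datetime.now(timezone.utc).isoformat()}')"
    )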
Access Control and Row-Level Security
Not all team members should see all customer data. D23.io integrates with Iceberg’s access control:
- Engineers: Full access to raw and curated layers for debugging.
- Data Analysts: Access only to curated, anonymised tables.
- Support Team: Access only to their assigned customers’ data.
- Finance/Compliance: Access to aggregated, anonymised reporting tables.
Implement row-level security (RLS) in the consumption layer:
CREATE VIEW curated.transactions_for_analyst AS
SELECT * FROM curated.transactions
WHERE customer_id IN (
SELECT customer_id FROM access_control
WHERE user_id = CURRENT_USER
);
Apache Superset supports RLS natively—you can restrict dashboard rows by user role. When Claude agents query data, they inherit the same RLS rules.
Data Lineage and Audit Trails
Every row in your curated layer should have metadata:
ALTER TABLE curated.transactions ADD COLUMNS (
_ingestion_timestamp TIMESTAMP,
_source_bank VARCHAR,
_cdr_api_version VARCHAR,
_row_hash VARCHAR
);
This lets you answer questions like:
- “Which rows came from which bank on which date?”
- “Did this transaction’s balance change between pulls?”
- “Can we reproduce this row from raw data?”
D23.io’s Iceberg metadata tracks file-level lineage automatically. Combine that with row-level metadata, and you have complete traceability for audits.
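A minimal sketch of populating those lineage columns before the curated write; the deterministic _row_hash is what lets you prove a curated row is reproducible from raw data (field names are illustrative):
import hashlib
import json
from datetime import datetime, timezone

def add_lineage(row: dict, source_bank: str, api_version: str) -> dict:
    # Canonical JSON (sorted keys, no whitespace) makes the hash deterministic.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return {
        **row,
        "_ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "_source_bank": source_bank,
        "_cdr_api_version": api_version,
        "_row_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }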
Encryption and Data Masking
Customer account numbers and BSB codes are sensitive. Your curated layer should mask them:
CREATE VIEW curated.transactions_masked AS
SELECT
customer_id,
SUBSTRING(account_number, 1, 2) || '****' || SUBSTRING(account_number, LENGTH(account_number) - 1, 2) AS account_number_masked,
amount,
transaction_date
FROM curated.transactions;
Superset dashboards should reference the masked view. Raw data stays encrypted at rest in S3 using AWS KMS. Only authorised services with the KMS key can decrypt it.
Semantic Layer and Apache Superset {#semantic-layer}
Building the Semantic Model
A semantic layer sits between raw data and dashboards. It defines business metrics in one place, so every dashboard uses the same definition. For AU Open Banking, your semantic model includes:
Core Entities:
- Customer
- Account (external bank account linked to your platform)
- Transaction
- Product (bank product offering)
Key Metrics:
- Total External Assets (sum of all linked account balances)
- Monthly Inflow / Outflow (transaction volume)
- Account Linking Rate (% of customers with linked accounts)
- Data Freshness (time since last pull)
Implement this in D23.io using views or a dedicated semantic layer tool. The $50K D23.io Consulting Engagement guide breaks down a real fixed-fee Apache Superset rollout delivered in 6 weeks: architecture, SSO, semantic layer design, and dashboard delivery. It's a useful benchmark if you're planning your own build.
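One lightweight way to keep those definitions in a single place is a dictionary of SQL expressions that both Superset datasets and ad-hoc queries reference; the expressions below are illustrative, assume the curated schema described earlier, and use date functions that vary by engine:
SEMANTIC_METRICS = {
    "total_external_assets": "SUM(account_balance)",
    "monthly_inflow": "SUM(CASE WHEN amount > 0 THEN amount ELSE 0 END)",
    "monthly_outflow": "SUM(CASE WHEN amount < 0 THEN ABS(amount) ELSE 0 END)",
    "account_linking_rate": (
        "COUNT(DISTINCT CASE WHEN linked_account_id IS NOT NULL THEN customer_id END)"
        " / CAST(COUNT(DISTINCT customer_id) AS DOUBLE)"
    ),
    "data_freshness_hours": "DATE_DIFF('hour', MAX(_ingestion_timestamp), CURRENT_TIMESTAMP)",
}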
Apache Superset Configuration
Superset connects to D23.io through a SQLAlchemy-compatible driver, typically pointed at the SQL engine that serves your Iceberg tables. Configuration is straightforward:
- Database Connection: Point Superset to your D23.io instance.
- Dataset Creation: Define datasets from curated tables and views.
- Metric Definition: Define metrics (e.g., SUM(amount), COUNT(DISTINCT customer_id)).
- Dashboard Assembly: Build dashboards by combining charts.
Key dashboards for a fintech:
Customer Insights Dashboard: Shows linked accounts, total external assets, recent transactions, and affordability score (computed from transaction patterns).
Portfolio Analytics: Aggregate metrics across all customers—total AUM, asset distribution by bank, transaction velocity.
Data Quality Dashboard: Tracks ingestion success rate, latency, and schema violations.
Row-Level Security in Superset
Superset supports RLS via its row-level security filters and Jinja templating. For example, a support agent should only see their assigned customers:
SELECT * FROM curated.customers
WHERE support_agent_id = '{{ current_user_id() }}'
When the support agent views the dashboard, Superset injects their user ID into the query. They see only their customers’ data.
Embedding and Self-Service Analytics
Superset dashboards can be embedded in your fintech’s admin panel. Use Superset’s guest token API:
# Illustrative wrapper: "superset_client" stands in for whatever HTTP client you use.
# Under the hood this calls Superset's guest-token endpoint
# (POST /api/v1/security/guest_token/); the token is then handed to the Superset
# embedded SDK in your admin panel, which renders the dashboard in an iframe.
from superset_client import SupersetClient

client = SupersetClient(host='superset.yourcompany.com', token='...')
guest_token = client.get_guest_token(
    user={'username': 'support_agent@company.com'},
    resources=[{'type': 'dashboard', 'id': 123}]
)
# Pass guest_token to the embedded SDK's fetchGuestToken callback when loading the dashboard.
Now your support team can see customer analytics without leaving your app.
Agentic AI with Claude for Analytics {#agentic-ai}
Why Claude Agents for Data Queries
Non-technical team members—support, product, finance—often need to query data but can’t write SQL. Claude agents bridge this gap. Instead of “write a SQL query to find customers with >$100K external assets,” you say “show me customers with high external assets,” and Claude translates that to SQL, executes it, and returns results.
The guide on Agentic AI + Apache Superset covers this pattern in depth. It explains how agentic AI like Claude integrates with Apache Superset to let non-technical users query dashboards naturally, with real examples.
Architecture: Claude + Superset Integration
The pattern is:
- User Query: Support agent asks, “Which customers have had no transactions in the last 30 days?”
- Claude Processing: Claude receives the query and the schema (table names, column names, descriptions).
- SQL Generation: Claude generates SQL: SELECT customer_id FROM curated.customers WHERE last_transaction_date < NOW() - INTERVAL '30 days'
- Execution: Your agent service executes the SQL against D23.io.
- Result Formatting: Claude formats results in natural language: “5,234 customers have had no transactions in the last 30 days. The oldest account is from 3 months ago.”
- Follow-ups: Support agent can ask follow-up questions: “How many of those are in Sydney?”, and Claude refines the query.
Implementing Claude Agents
Use the Anthropic SDK with tool use:
import anthropic
client = anthropic.Anthropic(api_key="...")
tools = [
{
"name": "query_d23",
"description": "Execute SQL against D23.io curated layer",
"input_schema": {
"type": "object",
"properties": {
"sql": {
"type": "string",
"description": "SQL query to execute"
}
},
"required": ["sql"]
}
},
{
"name": "get_schema",
"description": "Get available tables and columns",
"input_schema": {
"type": "object",
"properties": {}
}
}
]
messages = [
{"role": "user", "content": "Show me customers with external assets > $100K"}
]
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
tools=tools,
messages=messages
)
while response.stop_reason == "tool_use":
    # Claude has asked to call a tool; find the tool_use block and run it.
    tool_use = next(block for block in response.content if block.type == "tool_use")
    if tool_use.name == "query_d23":
        result = execute_query(tool_use.input["sql"])
    elif tool_use.name == "get_schema":
        result = get_schema()

    # Return the tool result to Claude and let it continue the conversation.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": str(result),
        }],
    })
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
print(response.content[0].text)
This loop lets Claude iteratively refine queries based on results. If the first query returns too many rows, Claude can add a LIMIT or refine the WHERE clause.
Guardrails and Safety
Claude agents querying your database need guardrails:
Query Allowlisting: Only allow queries against curated tables, not raw data. This prevents accidental exposure of unmasked PII.
ALLOWED_TABLES = [
'curated.customers',
'curated.transactions_masked',
'curated.accounts',
'curated.products'
]
import sqlparse

def validate_query(sql):
    # extract_tables() is a helper that walks the parsed statement and returns
    # every table referenced in FROM/JOIN clauses.
    parsed = sqlparse.parse(sql)[0]
    tables = extract_tables(parsed)
    for table in tables:
        if table not in ALLOWED_TABLES:
            raise ValueError(f"Access denied: {table}")
Query Timeout: Set a 30-second timeout. If a query runs longer, kill it. This prevents runaway queries from locking resources.
Result Limits: Cap result sets at 10,000 rows. If a query would return more, ask Claude to add aggregation or filtering.
Audit Logging: Log every query executed via Claude, including the original user request, generated SQL, and results. This is essential for compliance.
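A minimal sketch of the execution wrapper that enforces these guardrails, assuming a DB-API style connection to the query engine and the validate_query() allowlist check above; note the client-side timeout abandons the call but does not cancel the query server-side, so set an engine-level timeout as well:
import logging
from concurrent.futures import ThreadPoolExecutor

audit_logger = logging.getLogger("audit")
MAX_ROWS = 10_000
QUERY_TIMEOUT_SECONDS = 30

def run_agent_query(conn, sql: str, user_id: str, request_text: str):
    validate_query(sql)  # table allowlist check from above

    def _execute():
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchmany(MAX_ROWS + 1)

    pool = ThreadPoolExecutor(max_workers=1)
    try:
        rows = pool.submit(_execute).result(timeout=QUERY_TIMEOUT_SECONDS)
    finally:
        pool.shutdown(wait=False)

    if len(rows) > MAX_ROWS:
        raise ValueError("Result too large: ask the agent to aggregate or filter")

    # Every agent query is logged with the original request, the SQL, and the row count.
    audit_logger.info("AGENT_QUERY", extra={
        "user_id": user_id, "request": request_text, "sql": sql, "rows": len(rows),
    })
    return rows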
Use Cases for Claude Agents
Support Triage: “Show me all customers who linked accounts but haven’t completed KYC.” Claude generates the join and filters instantly.
Fraud Detection: “Which customers have unusual transaction patterns in the last week?” Claude can query transaction velocity, average transaction size, and flag outliers.
Product Analytics: “What’s the correlation between external asset size and product adoption?” Claude joins curated tables and computes statistics.
Compliance Reporting: “How many customers have we ingested data for in the last 30 days?” Claude aggregates and filters by date.
The guide on agentic AI vs traditional automation compares agentic AI with traditional RPA and rule-based automation, explaining when to use each approach and how to migrate from legacy automation to intelligent autonomous agents. This is valuable context if you’re deciding between Claude agents and scheduled reports.
Security, SOC 2, and Audit Readiness {#security}
SOC 2 Type II Requirements
Accreditation as a CDR data recipient requires independently assured security controls, with SOC 2 Type II (or an equivalent assurance report) being the usual evidence. This isn't optional; it's a regulatory gate. Your architecture must support the five trust service criteria:
CC (Common Criteria): Security controls including access control, encryption, and change management.
A (Availability): Your system must be available 99.5%+ of the time. D23.io’s managed service provides SLA guarantees.
PI (Processing Integrity): Data is accurate and complete. Your ingestion pipeline must validate every record and log failures.
C (Confidentiality): Data is encrypted at rest and in transit. All API calls use TLS 1.2+. Access tokens are encrypted in storage.
P (Privacy): Customer data is used only for authorised purposes. Your deletion and retention policies must be enforced.
Building SOC 2 readiness into your architecture from day one is far cheaper than retrofitting it later. The guide on security audits and SOC 2 / ISO 27001 compliance covers what audit-readiness via Vanta looks like—it’s a framework for continuous compliance, not a one-time checkbox.
Encryption and Key Management
At Rest: All data in S3 is encrypted using AWS KMS. Your D23.io instance has access to the KMS key, but your application layer does not. This prevents accidental decryption.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "d23.amazonaws.com"
},
"Action": "kms:Decrypt",
"Resource": "arn:aws:kms:..."
}
]
}
In Transit: All API calls between your app and D23.io, and between D23.io and banks, use TLS 1.2+. Certificate pinning is optional but recommended for high-security deployments.
Access Tokens: Store OAuth tokens from banks in AWS Secrets Manager, not in your database. Rotate them every 90 days. Log every rotation.
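A minimal sketch of reading a customer's bank token set from Secrets Manager with boto3; the secret name and field layout are illustrative, and rotation would be a separate scheduled job that writes back refreshed tokens:
import json

import boto3

def load_bank_tokens(customer_id: str) -> dict:
    client = boto3.client("secretsmanager", region_name="ap-southeast-2")
    secret = client.get_secret_value(SecretId=f"cdr/tokens/{customer_id}")
    # Expected shape (illustrative): {"access_token": "...", "refresh_token": "..."}
    return json.loads(secret["SecretString"])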
Audit Logging and Monitoring
Every action must be logged:
- Which user queried which data at what time
- Which rows were modified or deleted
- Which access tokens were refreshed
- Which schema violations occurred
- Which data deletions were processed
Use CloudTrail (AWS) or equivalent to log all API calls to D23.io and S3. Use application-level logging to track row-level access.
import logging
from datetime import datetime
audit_logger = logging.getLogger('audit')
audit_logger.info(
'DATA_ACCESS',
extra={
'user_id': user_id,
'table': 'curated.customers',
'rows_returned': 1234,
'timestamp': datetime.now().isoformat()
}
)
Stream these logs to a SIEM (Splunk, DataDog, etc.) for real-time alerting. If someone queries 100,000 customer records at 3 AM, you want to know.
Incident Response
Your SOC 2 audit will examine your incident response plan. For data breaches:
- Detection: Automated alerts on unusual query patterns, failed authentication, or data exfiltration.
- Containment: Revoke compromised tokens, disable user access, isolate affected systems.
- Investigation: Query audit logs to determine scope (which customers, which data, how long).
- Notification: Notify affected customers and regulators within 30 days (CDR requirement).
- Remediation: Fix the root cause, update controls, re-test.
Document this process and test it quarterly. Your auditors will ask to see evidence of a breach drill.
Vanta and Continuous Compliance
Vanta is a compliance automation platform that integrates with your infrastructure to provide continuous SOC 2 evidence. Instead of gathering evidence manually for your annual audit, Vanta collects it automatically:
- Access logs from AWS and D23.io
- Change logs from your code repository
- Vulnerability scans from your CI/CD pipeline
- Employee training records
- Incident response logs
When your auditor arrives, you hand them a Vanta report. No scrambling for evidence.
For AU fintechs, the Fiskil guide to Australia’s CDR framework explains the compliance landscape. Vanta integration isn’t a compliance shortcut—it’s a force multiplier that lets your small team maintain enterprise-grade controls.
Deployment Patterns for AU Fintechs {#deployment-patterns}
Self-Hosted vs. Managed D23.io
Self-Hosted: You run D23.io on your own AWS account. Full control, no vendor lock-in, lower per-query costs at scale. Requires DevOps expertise to manage upgrades, backups, and monitoring.
Managed: D23.io operates the infrastructure. You focus on data pipelines. Higher per-query costs but no operational overhead. Better for seed-stage teams.
For a Series-B fintech with 50,000+ customers, self-hosted often wins on cost. For seed-stage, managed is faster to market.
Multi-Region Deployment
Australian data residency requirements mandate that customer data stays in Australia. If you plan to expand to NZ or Asia, you’ll need separate D23.io instances:
- ap-southeast-2 (Sydney): AU customer data
- A separate NZ deployment in its own region or account: NZ customer data (if needed)
Each region has its own S3 bucket, its own KMS key, and its own Superset instance. Cross-region queries are rare and require explicit approval.
Staging and Production Separation
Run two D23.io instances: staging and production. Staging ingests a subset of data (e.g., 10% of customers) and runs daily. This lets you test schema changes, new metrics, and agent prompts before deploying to production.
-- Staging: 10% sample
SELECT * FROM raw.transactions
WHERE customer_id % 10 = 0

-- Production: all data
SELECT * FROM raw.transactions
Your Superset and Claude agents point to staging by default. Only approved users can query production.
CI/CD for Data Pipelines
Treat your data pipelines as code. Store SQL migrations in Git:
data-pipelines/
├── migrations/
│ ├── 001_create_raw_accounts.sql
│ ├── 002_create_curated_accounts.sql
│ ├── 003_add_deletion_tracking.sql
├── tests/
│ ├── test_accounts_schema.sql
│ ├── test_transactions_completeness.sql
├── Makefile
└── README.md
On every commit:
- Run schema tests against staging.
- If tests pass, apply migrations to staging.
- Run data quality checks.
- If all green, tag the commit and await manual approval to production.
This prevents silent data corruption and makes rollbacks straightforward.
Monitoring and Alerting
Set up dashboards for pipeline health:
Ingestion SLA: % of customers successfully ingested in the last 24 hours. Target: 99.5%.
Data Freshness: Time since last successful pull for each customer. Alert if >36 hours.
Schema Violations: Count of records that failed schema validation. Alert if >0.
Query Performance: P99 query latency on Superset dashboards. Alert if >5 seconds.
Storage Growth: Bytes ingested per day. Alert if >3x normal.
Use CloudWatch, Datadog, or New Relic for monitoring. Integrate with PagerDuty for on-call escalation.
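A minimal sketch of the data-freshness check, assuming a DB-API connection to the curated layer and a notify() helper wired to your alerting tool; the interval syntax varies by engine:
FRESHNESS_SLA_HOURS = 36

def check_data_freshness(conn, notify) -> None:
    cursor = conn.cursor()
    cursor.execute("""
        SELECT customer_id, MAX(_ingestion_timestamp) AS last_pull
        FROM curated.transactions
        GROUP BY customer_id
        HAVING MAX(_ingestion_timestamp) < CURRENT_TIMESTAMP - INTERVAL '36' HOUR
    """)
    stale = cursor.fetchall()
    if stale:
        notify(f"{len(stale)} customers breach the {FRESHNESS_SLA_HOURS}h freshness SLA")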
Implementation Roadmap and Next Steps {#roadmap}
Phase 1: Foundation (Weeks 1–4)
Objectives: Get first customer’s data flowing into D23.io.
Tasks:
- Set up AWS account and D23.io instance (managed or self-hosted).
- Obtain CDR accreditation (parallel track—start now if not done).
- Implement OAuth flow to get first customer’s consent.
- Build ingestion pipeline for accounts and transactions endpoints.
- Create raw layer tables in D23.io.
- Validate schema and log violations.
- Deploy to staging; test with 10 customers.
Deliverables: First 10 customers’ data in D23.io raw layer, validated and auditable.
Phase 2: Analytics (Weeks 5–8)
Objectives: Enable self-service analytics via Superset.
Tasks:
- Build curated layer views (denormalised accounts, transactions, products).
- Define semantic metrics (total assets, transaction velocity, etc.).
- Deploy Apache Superset connected to D23.io.
- Build 3–5 key dashboards (customer insights, portfolio, data quality).
- Implement row-level security for support team access.
- Train support and product teams on dashboard usage.
Deliverables: Superset instance with 5+ dashboards, RLS configured, team trained.
Phase 3: AI and Automation (Weeks 9–12)
Objectives: Enable Claude agents for natural language queries.
Tasks:
- Implement Claude agent service with D23.io tool integration.
- Define guardrails (allowlisted tables, query timeouts, result limits).
- Set up audit logging for all agent queries.
- Build UI for support team to chat with Claude agent.
- Test with common queries (find inactive customers, high-asset customers, etc.).
- Gather feedback and refine prompts.
Deliverables: Claude agent service in production, support team using it daily, 100+ queries logged.
Phase 4: Compliance and Scale (Weeks 13–16)
Objectives: Achieve SOC 2 readiness and scale to all customers.
Tasks:
- Implement deletion and retention workflows.
- Set up comprehensive audit logging (CloudTrail, application logs).
- Deploy Vanta for continuous compliance evidence collection.
- Conduct security review and penetration testing.
- Scale ingestion to 100% of customers.
- Implement monitoring and alerting.
- Document runbooks for incident response.
Deliverables: SOC 2 evidence collected, all customers’ data ingested, incident response plan documented and tested.
Post-Launch: Continuous Improvement
Ongoing:
- Monitor ingestion SLA and data freshness daily.
- Review Claude agent queries monthly for new use cases.
- Refresh Superset dashboards quarterly based on business priorities.
- Update semantic layer as new CDR endpoints become available.
- Rotate access tokens and security keys every 90 days.
- Conduct annual penetration testing.
Estimated Effort and Costs
Internal Team: 1 senior engineer (lead), 1 mid-level engineer (implementation), 1 data analyst (semantic layer and dashboards). 4 months part-time.
External Support: Consider engaging a partner like PADISO for fractional CTO leadership or co-build support. A 3-month engagement can accelerate Phase 1–2 by 6 weeks and ensure compliance best practices from day one. The AI Agency Services Sydney guide explains what fractional CTO and co-build partnerships look like.
Infrastructure:
- D23.io (managed): ~$2,000–5,000/month depending on data volume.
- AWS (S3, KMS, Lambda): ~$1,000–3,000/month.
- Apache Superset (self-hosted): ~$500/month (compute).
- Vanta: ~$1,500/month.
- Total: ~$5,000–10,000/month post-launch.
Comparison: Building this in-house from scratch typically costs 2–3x more in engineering time and introduces compliance risk. A reference architecture like this one de-risks the project significantly.
Conclusion and Next Steps
AU Open Banking data is a competitive asset, but only if you ingest it into a system that’s secure, auditable, and queryable. This reference architecture—D23.io lakehouse, Apache Superset analytics, Claude agents for self-service queries—is battle-tested and proven.
The three-layer approach (raw, curated, consumption) ensures compliance and scalability. The semantic layer ensures consistency. Agentic AI ensures your non-technical team can answer their own questions. SOC 2 and Vanta integration ensure you stay audit-ready.
If you’re a founder or operator at a seed-to-Series-B fintech, the next steps are:
- Audit your current data stack: Where is Open Banking data landing today? Is it governed? Is it queryable by non-engineers?
- Map your use cases: What questions does your support, product, and finance team need to answer? Can they answer them today?
- Plan Phase 1: Get 10 customers’ data into D23.io in the next 4 weeks. Validate schema. Log everything.
- Engage a partner if needed: PADISO’s fractional CTO and co-build services can accelerate this timeline and ensure compliance best practices. We’ve deployed this exact pattern for AU CDR-accredited fintechs.
The reference architecture is proven. The compliance framework is clear. The technology is mature. The only variable is execution. Start now, and you’ll have a world-class data foundation before your Series B closes.
Additional Resources
For deeper dives into specific components:
- Australian Open Banking Data Database (GitHub) – Working code samples for CDR API integration.
- Open Banking - Australian Banking Association – Official framework overview.
- Stripe’s State of Open Banking in Australia – Detailed CDR explanation.
- Yodlee Developer Portal – AU Open Banking Docs – API integration guide.
- Fiskil’s CDR Guide for 2026 – Compliance-focused overview.
- AWS Open Banking Reference Architecture – Infrastructure patterns.
- Cuscal’s Open Banking Guide – In-depth framework guide.
- Commonwealth Bank’s Open Banking Explainer – Customer-facing overview.
For guidance on implementation, consider exploring PADISO’s AI Agency Methodology Sydney to understand how Sydney businesses are leveraging structured AI implementation approaches. If you’re building dashboards, the D23.io consulting engagement breakdown shows a real-world $50K Apache Superset rollout delivered in 6 weeks.
For ongoing support and fractional CTO guidance, PADISO’s AI Automation Agency Sydney services cover modernisation, co-build, and operational support for fintech and enterprise teams. Our AI Agency Support Sydney offering provides SLA-backed partnership for teams scaling data and AI initiatives.