Akkio Databricks Integration

1 End-to-End Integration Overview

Akkio’s application layer instantiates a DatabricksDataSource for each customer tenant. The object (1) authenticates to a Databricks SQL Warehouse as a service principal using OAuth M2M, (2) executes context-building or analytics queries, and (3) streams results back to Akkio’s Python runtime, where they are cached in memory or materialized for downstream LLM pipelines. All writes (for example, model outputs or audience tables) go back to the customer’s Databricks catalog, so data never leaves the customer’s storage account.
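
For illustration, that lifecycle can be sketched as a thin wrapper around the Databricks SQL Connector; the field and method names below are assumptions for readability rather than Akkio’s internal API, and authentication is reduced to a token placeholder (the real OAuth flow is covered in section 2).

# Hypothetical shape of the per-tenant datasource described above; the
# constructor fields, method name, and token placeholder are assumptions.
from dataclasses import dataclass

import pandas as pd
from databricks import sql  # databricks-sql-connector


@dataclass
class DatabricksDataSource:
    server_hostname: str   # e.g. "adb-1234567890123456.7.azuredatabricks.net"
    http_path: str         # "/sql/1.0/warehouses/<warehouse-id>"
    catalog: str
    schema: str
    access_token: str      # placeholder; replaced by OAuth M2M in section 2

    def query(self, statement: str) -> pd.DataFrame:
        """Run a statement on the tenant's SQL warehouse and return a DataFrame."""
        with sql.connect(
            server_hostname=self.server_hostname,
            http_path=self.http_path,
            access_token=self.access_token,
            catalog=self.catalog,
            schema=self.schema,
            user_agent_entry="akkio",  # tags every connection (section 2.3)
        ) as conn:
            with conn.cursor() as cur:
                cur.execute(statement)
                # Arrow fetch keeps egress compact (section 4), then convert to pandas
                return cur.fetchall_arrow().to_pandas()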

2 Connection & Authentication Architecture

2.1 Datasource abstraction

  • Each tenant is mapped to a datasource config that stores the catalog, schema, warehouse IDs, and role bindings. These configs can point to shared or dedicated schemas according to four isolation patterns (shared credentials, separate credentials, shared schema + roles, hybrid); a sketch of such a config follows this list.

  • Storage isolation spans Databricks (physical/role-based), Firestore (logical, by client ID), and S3 (logical, by key prefix).
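
As a concrete illustration of the “separate credentials, dedicated schema” flavour of these patterns, a tenant config might look like the following; every field name and value here is a hypothetical example, not Akkio’s actual configuration schema.

# Illustrative tenant datasource config; field names and values are
# assumptions, not Akkio's actual configuration schema.
tenant_datasource_config = {
    "tenant_id": "acme-co",                              # hypothetical tenant
    "databricks": {
        "workspace_host": "adb-1234567890123456.7.azuredatabricks.net",
        "warehouse_id": "abcdef1234567890",
        "catalog": "acme_prod",
        "schema": "akkio",
        "isolation_pattern": "separate_creds",           # one of the four patterns above
        "service_principal_client_id": "<client-id>",
        "roles": ["akkio_reader", "akkio_writer"],       # role bindings
    },
    "firestore": {"client_id": "acme-co"},               # logical isolation by client ID
    "s3": {"prefix": "s3://akkio-artifacts/acme-co/"},   # logical isolation by prefix
}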

2.2 Service principals & OAuth M2M

  • Akkio registers one service principal per workspace and grants it SQL warehouse access. OAuth secrets are generated with a TTL of up to 730 days and rotated automatically.

  • The connector builds a credentials provider in code; on each call, the SDK exchanges the client secret for a short-lived OAuth access token.

  • Service-principal identities are purpose-built for API access and cannot log in via the UI, reducing lateral-movement risk.

2.3 User-Agent tagging

user_agent_entry="akkio" is injected into every connection so that warehouse and audit logs clearly attribute activity to the Akkio integration. Beginning with databricks-sql-connector v4.0.1, Databricks exposes a first-class user_agent_entry parameter, which Akkio leverages for clean observability.
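
Putting 2.2 and 2.3 together, the documented OAuth M2M pattern for the Databricks SQL Connector looks roughly like this; hostnames, HTTP paths, and environment-variable names are placeholders rather than Akkio’s actual code.

# Minimal OAuth M2M connection sketch (databricks-sql-connector + databricks-sdk);
# hostnames, HTTP paths, and env-var names are placeholders, not Akkio's code.
import os

from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

SERVER_HOSTNAME = os.environ["DATABRICKS_SERVER_HOSTNAME"]
HTTP_PATH = os.environ["DATABRICKS_HTTP_PATH"]  # /sql/1.0/warehouses/<warehouse-id>


def credential_provider():
    # The SDK exchanges the service principal's client ID/secret for a
    # short-lived OAuth access token on each call.
    config = Config(
        host=f"https://{SERVER_HOSTNAME}",
        client_id=os.environ["DATABRICKS_CLIENT_ID"],
        client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],
    )
    return oauth_service_principal(config)


with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=HTTP_PATH,
    credentials_provider=credential_provider,
    user_agent_entry="akkio",  # attribute the session in warehouse and audit logs
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT current_user(), current_catalog()")
        print(cur.fetchone())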

(Figure 1: Warehouse monitoring showing service-account with M2M OAuth)

3 Security, Monitoring & Audit

  • AuthN / AuthZ – OAuth 2.0 M2M with a service principal and secret; token refresh is automatic.

  • Least-privilege roles – Warehouse-level USAGE and catalog-level SELECT; optional INSERT/CREATE when write-back is enabled.

  • User-Agent – All requests are tagged akkio for traceability.

  • Audit logs – The Databricks audit schema includes a userAgent field; logs record akkio calls, warehouse IDs, SQL text hashes, and response codes.

  • Network – TLS 1.2 over HTTPS/JDBC; IP allow-listing or PrivateLink where required.
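
Assuming Unity Catalog system tables are enabled in the workspace, the audit trail can be filtered down to Akkio traffic with a query along these lines (column names follow the system.access.audit system table; adjust if audit logs are delivered elsewhere).

# Sketch: attribute warehouse activity to the Akkio integration via the
# system.access.audit system table (assumes system tables are enabled).
AUDIT_QUERY = """
    SELECT event_time, service_name, action_name, user_identity.email, user_agent
    FROM system.access.audit
    WHERE user_agent LIKE '%akkio%'
      AND event_date >= date_sub(current_date(), 7)
    ORDER BY event_time DESC
    LIMIT 100
"""
# Execute with any cursor opened as in section 2, e.g. cur.execute(AUDIT_QUERY)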

4 Performance & Cost Sizing

  • Warehouse tiers – SQL Serverless, SQL Classic, and SQL Pro are supported.

  • Autoscaling – Warehouses are configured with min = 1 and max = 4 clusters; production tenants typically rely on auto-stop to spin the warehouse down off-hours and save DBUs (see the provisioning sketch after this list).

  • Query optimisation – Context sampling uses small LIMITs; heavy analytics leverage Photon-enabled compute and warehouse result caching.

  • Egress – Results returned as compressed Arrow batches; traffic is negligible compared to Warehouse compute.
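
As a sizing sketch, a tenant warehouse matching the settings above could be provisioned through the databricks-sdk warehouses API; the name, cluster size, and auto-stop value below are illustrative choices, not Akkio’s actual configuration.

# Sketch of provisioning an autoscaling SQL warehouse with databricks-sdk;
# the name, cluster size, and auto-stop values are illustrative.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()  # picks up host + service-principal credentials from the environment

warehouse = w.warehouses.create(
    name="akkio-tenant-warehouse",   # hypothetical name
    cluster_size="2X-Small",
    min_num_clusters=1,              # autoscaling floor
    max_num_clusters=4,              # autoscaling ceiling
    auto_stop_mins=10,               # scales to zero off-hours instead of min=0
    enable_photon=True,              # Photon-enabled compute for heavy analytics
    enable_serverless_compute=True,  # Serverless tier where available
    warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
).result()                           # wait until the warehouse is running

print(warehouse.id, warehouse.state)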

5 Library & Dependency Matrix

  • Connection – databricks-sql-connector, databricks-sdk, sqlalchemy

  • DataFrames – pandas, with optional pyarrow for Arrow fetches

  • Async / logging – AkkioAsyncioUtils, loguru

  • Parsing – sqlparse plus a custom tokenizer in src.sql_parser
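
An illustrative import excerpt for the third-party layers in this matrix is shown below; the Akkio-internal modules (AkkioAsyncioUtils and the src.sql_parser tokenizer) are omitted because their import paths are not public.

# Import excerpt for the third-party libraries in the matrix above
# (Akkio-internal modules omitted).
import pandas as pd
import pyarrow as pa                 # optional: Arrow-native result fetches
import sqlalchemy
import sqlparse
from databricks import sql           # databricks-sql-connector
from databricks.sdk import WorkspaceClient
from loguru import logger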

Key Takeaways

  • Service-principal + OAuth M2M provides secure, automated connectivity with token rotation.

  • user_agent_entry="akkio" ensures every query is attributable across monitoring, billing and audit systems.

  • Akkio’s context engine retrieves just-in-time data slices, keeping customer data at rest in Databricks while enabling rich LLM reasoning.

  • Cost is governed primarily by SQL Warehouse DBUs; light sampling and autoscaling keep spend low.
