Akkio Databricks Integration
1 End-to-End Integration Overview
Akkio’s application layer instantiates a DatabricksDataSource for each customer tenant.
The object (1) authenticates to a Databricks SQL Warehouse as a service principal via OAuth M2M, (2) executes context-building or analytics queries, and (3) streams results back to Akkio’s Python runtime, where they are cached in memory or materialized for downstream LLM pipelines. All writes (for example, model outputs or audience tables) go back to the customer’s Databricks catalog, so data never leaves the customer’s storage account.
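A simplified sketch of that flow, assuming a hypothetical DatabricksDataSource wrapper (class and method names here are illustrative, not Akkio’s actual implementation); authentication details are covered in section 2:

```python
import pandas as pd
from databricks import sql


class DatabricksDataSource:
    """Per-tenant wrapper: authenticate, query, and cache results in memory."""

    def __init__(self, server_hostname: str, http_path: str, credentials_provider):
        self._conn_args = dict(
            server_hostname=server_hostname,
            http_path=http_path,
            credentials_provider=credentials_provider,  # OAuth M2M, see section 2.2
            user_agent_entry="akkio",                   # see section 2.3
        )
        self._cache: dict[str, pd.DataFrame] = {}

    def query(self, statement: str) -> pd.DataFrame:
        # (1) authenticate + connect, (2) execute, (3) stream back and cache.
        if statement not in self._cache:
            with sql.connect(**self._conn_args) as conn, conn.cursor() as cur:
                cur.execute(statement)
                self._cache[statement] = cur.fetchall_arrow().to_pandas()
        return self._cache[statement]
```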
2 Connection & Authentication Architecture
2.1 Datasource abstraction
Each tenant is mapped to a datasource config that stores catalog, schema, warehouse IDs, and role bindings. These configs can point to shared or dedicated schemas according to four isolation patterns (shared creds, separate creds, shared schema + roles, hybrid).
Storage isolation spans Databricks (physical/role), Firestore (logical by client ID), and S3 (logical by prefix).
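For illustration, a minimal sketch of such a config object; the field and enum names are assumptions, not Akkio’s actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class IsolationPattern(Enum):
    # The four isolation patterns described above.
    SHARED_CREDENTIALS = "shared_creds"
    SEPARATE_CREDENTIALS = "separate_creds"
    SHARED_SCHEMA_WITH_ROLES = "shared_schema_roles"
    HYBRID = "hybrid"


@dataclass
class DatasourceConfig:
    """Per-tenant datasource configuration (illustrative field names)."""
    tenant_id: str
    catalog: str
    schema: str
    warehouse_id: str
    isolation: IsolationPattern
    role_bindings: list[str] = field(default_factory=list)
```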
2.2 Service principals & OAuth M2M
Akkio registers one service principal per workspace and grants it SQL-warehouse access. OAuth secrets are generated with a TTL of up to 730 days and rotated automatically.
The connector builds a credentials provider in code; on each call, the SDK exchanges the secret for a short-lived OAuth token.
Service-principal identities are purpose-built for API access and cannot log in via the UI, reducing lateral-movement risk.
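A minimal sketch of that credentials provider, following the M2M OAuth pattern documented for databricks-sdk and databricks-sql-connector; the environment-variable names are assumptions for illustration:

```python
import os

from databricks.sdk.core import Config, oauth_service_principal


def make_credentials_provider(server_hostname: str):
    """Build a provider that exchanges the service-principal secret
    for a short-lived OAuth token on each connection."""
    def credentials_provider():
        config = Config(
            host=f"https://{server_hostname}",
            client_id=os.environ["DATABRICKS_CLIENT_ID"],          # service-principal application ID
            client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],  # OAuth secret (rotated)
        )
        return oauth_service_principal(config)
    return credentials_provider
```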
2.3 User-Agent tagging
The parameter user_agent_entry="akkio" is injected into every connection so that Warehouse and audit logs clearly attribute activity to the Akkio integration. Beginning with SQL-Connector v4.0.1, Databricks exposes a first-class user_agent_entry parameter, which Akkio leverages for clean observability.
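As a sketch of how the tag is passed when opening a connection (host, HTTP path, and query are placeholders; the credentials provider follows section 2.2):

```python
import os

from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

server_hostname = os.environ["DATABRICKS_SERVER_HOSTNAME"]


def credentials_provider():
    # Same M2M OAuth exchange as in section 2.2.
    return oauth_service_principal(Config(
        host=f"https://{server_hostname}",
        client_id=os.environ["DATABRICKS_CLIENT_ID"],
        client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],
    ))


with sql.connect(
    server_hostname=server_hostname,
    http_path=os.environ["DATABRICKS_HTTP_PATH"],   # SQL Warehouse HTTP path
    credentials_provider=credentials_provider,
    user_agent_entry="akkio",                       # tags every request for monitoring and audit
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())
```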
(Figure 1: Warehouse monitoring showing service-account with M2M OAuth)
3 Security, Monitoring & Audit
| Control | Implementation |
| --- | --- |
| AuthN / AuthZ | OAuth 2.0 M2M with service principal + secret; token refresh is automatic |
| Least-privilege roles | Warehouse-level USAGE + catalog-level SELECT; optional INSERT/CREATE when write-back is enabled |
| User-Agent | All requests marked akkio for traceability |
| Audit logs | Databricks audit schema includes a userAgent field; logs show akkio calls, Warehouse IDs, SQL text hashes and response codes |
| Network | TLS 1.2 over HTTPS/JDBC; IP allow-list or PrivateLink if required |
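As one way to verify this attribution end-to-end, a hedged sketch that filters Databricks audit logs on the user agent, assuming the system.access.audit system table is enabled and readable by the querying principal (table and column names follow the public system-table schema, not anything Akkio-specific):

```python
AUDIT_QUERY = """
    SELECT event_time, service_name, action_name, user_identity.email AS principal
    FROM system.access.audit
    WHERE user_agent LIKE '%akkio%'
      AND event_date >= current_date() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 100
"""


def recent_akkio_activity(connection):
    """Return the last week of audit events attributed to the Akkio integration,
    given an open connection such as the one from section 2.3."""
    with connection.cursor() as cursor:
        cursor.execute(AUDIT_QUERY)
        return cursor.fetchall()
```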
4 Performance & Cost Sizing
Warehouse tiers – SQL Serverless, SQL Classic, and SQL Pro are supported.
Autoscaling – Warehouses are configured with min=1, max=4 clusters; production tenants typically rely on auto-stop to scale to zero during off-hours and save DBUs.
Query optimisation – Context sampling uses small LIMITs (see the sketch after this list); heavy analytics leverage Photon-enabled compute and cached results.
Egress – Results returned as compressed Arrow batches; traffic is negligible compared to Warehouse compute.
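A sketch of that light-sampling pattern, assuming an open connection as in section 2.3 (the table name and row count are placeholders); fetchall_arrow keeps the result in compressed Arrow form instead of Python row objects:

```python
import pyarrow as pa


def sample_context(connection, table: str, n_rows: int = 100) -> pa.Table:
    """Fetch a small, cheap slice of a table for LLM context building."""
    with connection.cursor() as cursor:
        # A small LIMIT keeps Warehouse compute and egress negligible.
        cursor.execute(f"SELECT * FROM {table} LIMIT {n_rows}")
        return cursor.fetchall_arrow()


# Usage (hypothetical table name):
# context = sample_context(connection, "main.analytics.customers", n_rows=50)
# df = context.to_pandas()
```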
5 Library & Dependency Matrix
| Layer | Key Libraries (import excerpt) |
| --- | --- |
| Connection | databricks-sql-connector, databricks-sdk, sqlalchemy |
| DataFrames | pandas, optional pyarrow for Arrow fetch |
| Async / logging | AkkioAsyncioUtils, loguru |
| Parsing | sqlparse & custom tokenizer in src.sql_parser |
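By way of illustration, a generic sqlparse snippet of the kind of normalization and tokenization such a parsing layer performs; this is not the internals of src.sql_parser, just the public sqlparse API:

```python
import sqlparse

statement = "select id, email from main.crm.customers where status = 'active' limit 10"

# Normalize formatting before hashing or tokenizing SQL text.
formatted = sqlparse.format(statement, reindent=True, keyword_case="upper")

# Flat token stream that a custom tokenizer could build on.
parsed = sqlparse.parse(formatted)[0]
tokens = [(tok.ttype, tok.value) for tok in parsed.flatten() if not tok.is_whitespace]

print(formatted)
print(tokens[:5])
```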
Key Takeaways
Service-principal + OAuth M2M provides secure, automated connectivity with token rotation.
user_agent_entry="akkio" ensures every query is attributable across monitoring, billing and audit systems.
Akkio’s context engine retrieves just-in-time data slices, keeping customer data at rest in Databricks while enabling rich LLM reasoning.
Cost is governed primarily by SQL Warehouse DBUs; light sampling and autoscaling keep spend low.