# Databricks

### 1 End-to-End Integration Overview

Akkio’s application layer instantiates a `DatabricksDataSource` for each customer tenant.\
The object (1) authenticates to a Databricks SQL Warehouse as a service principal using OAuth M2M, (2) executes context-building or analytics queries, and (3) streams results back to Akkio’s Python runtime, where they are cached in memory or materialized for downstream LLM pipelines. All writes (for example, model outputs or audience tables) go back to the customer’s Databricks catalog, so data never leaves the customer’s storage account.
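
A minimal sketch of this flow, assuming placeholder connection parameters and a `credentials_provider` built as described in section 2.2 (the class shape is illustrative, not Akkio’s actual implementation):

```python
from databricks import sql
import pandas as pd

class DatabricksDataSource:
    """Illustrative per-tenant datasource; names and shape are hypothetical."""

    def __init__(self, server_hostname: str, http_path: str, credentials_provider):
        # One connection per tenant, authenticated via OAuth M2M (section 2.2)
        # and tagged for auditability (section 2.3).
        self._conn = sql.connect(
            server_hostname=server_hostname,
            http_path=http_path,
            credentials_provider=credentials_provider,
            user_agent_entry="akkio",
        )

    def query(self, statement: str) -> pd.DataFrame:
        # Execute in the customer's SQL Warehouse and stream results back
        # as Arrow batches, materialized here as a pandas DataFrame.
        with self._conn.cursor() as cursor:
            cursor.execute(statement)
            return cursor.fetchall_arrow().to_pandas()
```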

### Cloud Platform Support

The Akkio integration natively supports Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). It is built on the Databricks SDK, so the same connection code works against Databricks workspaces on any of the three clouds without platform-specific limitations.
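
In practice, only the workspace hostname differs between clouds; the connector and SDK calls are identical. The hostnames below are illustrative placeholders:

```python
# The same connection code runs against any of the three clouds;
# only the workspace hostname changes (all values are placeholders).
WORKSPACE_HOSTS = {
    "aws":   "dbc-a1b2c3d4-e5f6.cloud.databricks.com",
    "azure": "adb-1234567890123456.7.azuredatabricks.net",
    "gcp":   "1234567890123456.7.gcp.databricks.com",
}
```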

### 2 Connection & Authentication Architecture

#### 2.1 Datasource abstraction

* Each tenant is mapped to a datasource config that stores catalog, schema, warehouse IDs, and role bindings. These configs can point to shared or dedicated schemas according to four isolation patterns (shared credentials, separate credentials, shared schema + roles, hybrid); see the sketch after this list.
* Storage isolation spans Databricks (physical/role-based), Firestore (logical, by client ID), and S3 (logical, by prefix).
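
A hedged sketch of what such a per-tenant config might look like (all field and enum names are hypothetical; Akkio's actual schema is not shown here):

```python
from dataclasses import dataclass
from enum import Enum

class IsolationPattern(Enum):
    # The four isolation patterns described above.
    SHARED_CREDENTIALS = "shared_creds"
    SEPARATE_CREDENTIALS = "separate_creds"
    SHARED_SCHEMA_WITH_ROLES = "shared_schema_roles"
    HYBRID = "hybrid"

@dataclass
class DatasourceConfig:
    tenant_id: str
    catalog: str
    schema: str
    warehouse_id: str
    role_bindings: list[str]
    isolation: IsolationPattern
```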

#### 2.2 Service principals & OAuth M2M

* Akkio registers one service principal per workspace and grants it SQL-warehouse access. OAuth secrets are generated with a TTL of up to 730 days and rotated automatically.
* The connector builds a credentials provider in code; on each call, the SDK exchanges the secret for a short-lived OAuth token (see the sketch after this list).
* Service-principal identities are purpose-built for API access and cannot log in via the UI, reducing lateral-movement risk.
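
The credentials-provider pattern follows the standard `databricks-sql-connector` + `databricks-sdk` recipe for OAuth M2M; the hostname, HTTP path, and environment-variable names below are placeholders:

```python
import os

from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

SERVER_HOSTNAME = "dbc-a1b2c3d4-e5f6.cloud.databricks.com"  # placeholder

def credentials_provider():
    # The SDK exchanges the service principal's OAuth secret for a
    # short-lived access token on each call.
    config = Config(
        host=f"https://{SERVER_HOSTNAME}",
        client_id=os.environ["DATABRICKS_CLIENT_ID"],
        client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],
    )
    return oauth_service_principal(config)

connection = sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path="/sql/1.0/warehouses/abc123def456",  # placeholder
    credentials_provider=credentials_provider,
)
```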

#### 2.3 User-Agent tagging

`user_agent_entry="akkio"` is injected into every connection so that warehouse and audit logs clearly attribute activity to the Akkio integration. Beginning with `databricks-sql-connector` v4.0.1, Databricks exposes a first-class `user_agent_entry` parameter, which Akkio leverages for clean observability.
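
The tag is passed directly at connect time; connection parameters below are placeholders:

```python
from databricks import sql

# user_agent_entry stamps every request from this session, so warehouse
# monitoring and audit logs attribute the activity to Akkio.
connection = sql.connect(
    server_hostname="dbc-a1b2c3d4-e5f6.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",              # placeholder
    credentials_provider=credentials_provider,  # from section 2.2
    user_agent_entry="akkio",
)
```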

<figure><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdxSsNdIZ2HsZIl-8s3CKC5kDzW996PVJIPliKNqqOfb6CzFMdXHXIa3pdJhtcWUsfFxuvkOuM5QgnmnURMCx9AygooQxHxHgm9JfK2SBUu_Oy5o1XAvNLX4QHpLJrsJAt69FUloA?key=ZQ8ejwNd3FHRKz8z9okQpw" alt=""><figcaption></figcaption></figure>

(**Figure 1**: Warehouse monitoring showing the service principal authenticating via OAuth M2M)

### 3 Security, Monitoring & Audit

| Control               | Implementation                                                                                                                   |
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| AuthN / AuthZ         | OAuth 2.0 M2M with service-principal + secret; token refresh is automatic                                                        |
| Least-privilege roles | Warehouse-level `USAGE` + catalog-level `SELECT`; optional `INSERT/CREATE` when write-back is enabled                            |
| User-Agent            | All requests marked `akkio` for traceability                                                                                     |
| Audit logs            | Databricks audit schema includes a `userAgent` field; logs show `akkio` calls, Warehouse IDs, SQL text hashes and response codes |
| Network               | TLS 1.2 over HTTPS/JDBC; IP-allow-list or PrivateLink if required                                                                |
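
The least-privilege row translates to Unity Catalog grants along these lines. Catalog, schema, and principal names are placeholders, and a workspace admin must separately grant the service principal "Can use" on the SQL Warehouse:

```python
# Illustrative Unity Catalog grants for the Akkio service principal.
# All object and principal names below are placeholders.
READ_GRANTS = [
    "GRANT USE CATALOG ON CATALOG customer_catalog TO `akkio-sp`",
    "GRANT USE SCHEMA ON SCHEMA customer_catalog.analytics TO `akkio-sp`",
    "GRANT SELECT ON SCHEMA customer_catalog.analytics TO `akkio-sp`",
]
WRITE_GRANTS = [  # only when write-back is enabled
    "GRANT CREATE TABLE ON SCHEMA customer_catalog.analytics TO `akkio-sp`",
    "GRANT MODIFY ON SCHEMA customer_catalog.analytics TO `akkio-sp`",
]

with connection.cursor() as cursor:  # connection from section 2.2
    for stmt in READ_GRANTS + WRITE_GRANTS:
        cursor.execute(stmt)
```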

### 4 Performance & Cost Sizing

* Warehouse tiers – SQL Serverless, SQL Classic, and SQL Pro are all supported.
* Autoscaling – Warehouses are configured with min=1 and max=4 clusters; production tenants often rely on auto-stop during off-hours to save DBUs.
* Query optimization – Context sampling uses small `LIMIT` clauses; heavy analytics leverage Photon-enabled compute and caching.
* Egress – Results are returned as compressed Arrow batches (see the sketch after this list); traffic is negligible compared to warehouse compute.
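
The sampling and egress bullets combine into a simple fetch pattern; the table name and row limit are illustrative:

```python
# Context sampling: pull a small slice of a table as compressed Arrow
# batches, then materialize it as pandas for the LLM pipeline.
with connection.cursor() as cursor:  # connection from section 2.2
    cursor.execute(
        "SELECT * FROM customer_catalog.analytics.events LIMIT 100"
    )
    arrow_table = cursor.fetchall_arrow()  # Arrow result, light on egress
    df = arrow_table.to_pandas()
```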

### 5 Library & Dependency Matrix

| Layer           | Key Libraries (import excerpt)                         |
| --------------- | ------------------------------------------------------ |
| Connection      | `databricks-sql-connector`, `databricks-sdk`, `sqlalchemy` |
| DataFrames      | `pandas`, optional `pyarrow` for Arrow fetch           |
| Async / logging | `AkkioAsyncioUtils`, `loguru`                          |
| Parsing         | `sqlparse` & custom tokenizer in `src.sql_parser`      |
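
For reference, the open-source entries in the matrix resolve roughly as follows (`AkkioAsyncioUtils` and `src.sql_parser` are Akkio-internal modules and are omitted here):

```python
from databricks import sql                   # databricks-sql-connector
from databricks.sdk import WorkspaceClient   # databricks-sdk
import sqlalchemy
import pandas as pd
import pyarrow                               # optional, enables Arrow fetch
import sqlparse
from loguru import logger
```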

#### Key Takeaways

* Service-principal + OAuth M2M provides secure, automated connectivity with token rotation.
* `user_agent_entry="akkio"` ensures every query is attributable across monitoring, billing and audit systems.
* Akkio’s context engine retrieves just-in-time data slices, keeping customer data at rest in Databricks while enabling rich LLM reasoning.
* Cost is governed primarily by SQL Warehouse DBUs; light sampling and autoscaling keep spend low.

