Databricks (Single Workspace)
Configuring the Veza integration for Databricks
This integration supports the Databricks machine learning platform. Veza discovers entity and authorization metadata through the native Databricks REST API, authenticating with a user access token. A customer-provided cluster must be available for executing SQL commands to discover table entities and permissions.
For organizations that use single sign-on (SSO) for federated access to Databricks, the integration discovers authorization and effective permissions that Azure, Okta, and AWS identities have on Databricks resources.
For more information and details on supported entities, see the Notes section below.
If you use Unity Catalog to govern access to Databricks on Microsoft Azure, AWS, or Google Cloud, you can enable Databricks discovery as part of the cloud provider integration configuration. See Databricks Unity Catalog for more information.
Requirements
To connect to Azure Databricks, the workspace must enable both Workspace Access Control and Cluster, Pool, and Jobs Access Control. These features require a Premium Azure Databricks plan. To enable these settings:
As a Databricks administrator, click your username in the top bar of the Azure Databricks workspace and click Admin Settings.
Open Workspace Settings.
Toggle Workspace Access Control and Cluster, Pool and Jobs Access Control.
Click Confirm.
See Enable Access Control for details.
Authentication
The integration requires a Databricks user with account admin rights (required to list all workspace entities and permissions). Using a non-admin user token will result in an incomplete discovery.
Create a new Databricks admin user for Veza to connect as.
As an account admin, log in to the account console.
Click Account Console > User management.
On the Users tab, click Add User.
Provide a name and email address for the user.
Click Send invite.
Generate a personal access token.
Click Settings in the lower left corner of your Databricks workspace.
Click User Settings.
Go to the Access Tokens tab.
Click the Generate New Token button. Optionally enter a description (comment) and expiration period.
Click Generate. Copy the generated token, and store it securely.
Assign the admin role to the user.
As an account admin, log in to the account console.
Click Account Console > User management.
Find and click the user you created.
On the Roles tab, turn on Account admin.
See Authentication using Databricks personal access tokens for more information.
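Before configuring Veza, you can sanity-check the new token against the workspace REST API. The sketch below uses only Python's standard library; the Bearer-token header and the `/api/2.0/clusters/list` endpoint are standard Databricks REST conventions, but the hostname and token shown are placeholders.

```python
import urllib.request

def build_databricks_request(host: str, token: str, path: str) -> urllib.request.Request:
    """Build an authenticated request for the Databricks REST API.

    Databricks personal access tokens are sent as a Bearer token
    in the Authorization header.
    """
    return urllib.request.Request(
        f"https://{host}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )

# Example (placeholder workspace hostname and token):
req = build_databricks_request(
    "adb-1234567890123456.7.azuredatabricks.net",  # placeholder host
    "dapiXXXXXXXXXXXXXXXX",                        # placeholder PAT
    "/api/2.0/clusters/list",
)
# To actually send it: urllib.request.urlopen(req)
print(req.full_url)
```

A `200` response with a JSON cluster list confirms the token authenticates; a `403` usually means the token is invalid or expired.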
Creating a cluster
To extract metadata for the Databricks storage layer, Veza needs to run SQL queries on one of the clusters in the workspace. You should create a separate cluster for this purpose. The cluster is started automatically only while Veza is conducting extractions, and stopped automatically after a set period of inactivity.
To create a cluster using the Databricks UI, pick Create > Cluster:
The cluster can be a small single-node cluster.
Enable automatic termination after an inactivity period (~10 minutes).
Add `spark.databricks.acl.sqlOnly true` to Advanced Options > Spark > Spark config.
Ensure the user created for the Veza integration has `CAN_MANAGE` permission on the cluster (More > Permissions).
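The same cluster can be created programmatically. The payload below is a sketch of a Clusters API 2.0 `create` request: the `spark.databricks.acl.sqlOnly` setting comes from this guide, the single-node profile keys follow the public Databricks Clusters API, and the runtime version and node type are placeholders you would adjust for your cloud.

```python
import json

# Sketch of a POST /api/2.0/clusters/create payload for a small,
# auto-terminating cluster dedicated to Veza extractions.
cluster_spec = {
    "cluster_name": "veza-extraction",
    "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",     # placeholder node type (Azure example)
    "num_workers": 0,                      # single-node cluster
    "autotermination_minutes": 10,         # stop after ~10 minutes idle
    "spark_conf": {
        # Restrict the cluster to SQL-only commands (from this guide)
        "spark.databricks.acl.sqlOnly": "true",
        # Single-node profile settings
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

print(json.dumps(cluster_spec, indent=2))
```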
Once the cluster is running, copy the cluster's HTTP endpoint from Advanced Options > JDBC/ODBC > HTTP path.
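The HTTP path, workspace hostname, and personal access token together identify the SQL endpoint. If you want to test connectivity yourself, the `databricks-sql-connector` package accepts them as connection parameters; the sketch below only assembles the parameters with placeholder values, and the actual `databricks.sql.connect` call is commented out so the snippet stays self-contained.

```python
# Connection parameters for the cluster's SQL endpoint.
# All values are placeholders; copy the real HTTP path from
# Advanced Options > JDBC/ODBC > HTTP path.
conn_params = {
    "server_hostname": "adb-1234567890123456.7.azuredatabricks.net",
    "http_path": "sql/protocolv1/o/1234567890123456/0123-456789-abcdef00",
    "access_token": "dapiXXXXXXXXXXXXXXXX",
}

# With the databricks-sql-connector package installed, this would open
# a connection (requires network access, so it is commented out here):
# from databricks import sql
# with sql.connect(**conn_params) as conn:
#     with conn.cursor() as cur:
#         cur.execute("SHOW SCHEMAS IN hive_metastore")
#         print(cur.fetchall())
print(sorted(conn_params))
```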
For more details, see the Databricks documentation on creating clusters.
Veza Configuration
From the Veza Configuration panel, navigate to the Apps & Data Sources tab. Scroll down to the Standalone Databases section and click Add New. Choose "Databricks" and provide the required information:
Field | Details |
---|---|
| Display name for the integration |
| Web address of the Databricks workspace (without the `https://` prefix) |
| Databricks user personal access token |
| The Identity Provider used for single sign-on (optional) |
The Azure, AWS Identity Center, or Okta Identity Provider used for SSO must be integrated with Veza as a data source.
Cross Service Connections
To add an SSO connection for Databricks:
Use Authorization Graph to search for the Azure AD Domain, Okta Domain, or AWS Identity Center service. Open the Entity Details and copy the `Datasource ID`.
Open Data Catalog > Apps and Data Sources and find your Databricks provider under Standalone Databases. Click Edit. If you haven't configured the provider yet, click Add New and choose Databricks.
Select your SSO provider as the SSO type. For SSO ID, use the `Datasource ID` of the Azure AD, AWS Identity Center, or Okta IdP.
Provider | Datasource ID |
---|---|
Azure AD | Azure Tenant ID |
Okta | Okta Domain |
AWS Identity Center | AWS Identity Center identity store ID |
Notes
Within Databricks, Access Control Lists govern permissions to different entities such as catalogs, schemas, tables, clusters, folders, and notebooks.
Every Databricks workspace has a central Hive metastore, accessible by all clusters, that stores table metadata. This metastore (`hive_metastore`) is the sole catalog entity supported by Veza, along with its schemas, tables, and the permissions on those entities. The Hive metastore does not support assigning permissions at the catalog level.
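The kind of metadata described above can be inspected manually with standard Databricks SQL statements. These are illustrative queries, not Veza's actual extraction logic; `SHOW SCHEMAS`, `SHOW TABLES`, and `SHOW GRANT` are documented Databricks SQL commands, and the schema, table, and principal names are placeholders.

```python
# Illustrative Databricks SQL statements (not Veza's actual queries) that
# surface the same metadata Veza discovers from the Hive metastore.
schema = "default"              # placeholder schema
table = "events"                # placeholder table
principal = "user@example.com"  # placeholder user or group

statements = [
    "SHOW SCHEMAS IN hive_metastore",
    f"SHOW TABLES IN hive_metastore.{schema}",
    # Grants on a specific table:
    f"SHOW GRANT ON TABLE hive_metastore.{schema}.{table}",
    # Grants a specific principal holds on a schema:
    f"SHOW GRANT `{principal}` ON SCHEMA hive_metastore.{schema}",
]
for statement in statements:
    print(statement)
```

Running these on the extraction cluster (via a notebook or the SQL endpoint) shows the raw grant rows that Veza turns into authorization metadata.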
Supported Entities
Entity | Details |
---|---|
Workspace | Contains all other resources (users, groups, clusters, directories, notebooks). Users typically have their own directories, which can contain subdirectories and notebooks. |
Cluster | A set of computation resources (a Spark cluster). Only High Concurrency clusters support permissions on tables. A cluster has access to all Hive data sources defined in the workspace. |
Notebook | Executable code that can be attached to a cluster and run. Each notebook has an owner and a set of permissions. |
Directory | Contains subdirectories and notebooks. Each directory has an owner and a set of permissions. By default, each user gets their own directory. |
Catalog | Databricks catalog |
Schema | Databricks schema |
Table | Databricks table |
View | Databricks view |
User | Local workspace user |
Group | Local workspace group |
The following entities are not currently supported:
SQL Warehouse
Experiment
Job
Cluster Pool
Pipeline
Query
Dashboard
Effective Permissions
From Authorization Graph, select an EP node and click Explain Effective Permissions to view the raw Databricks ACLs that result in a set of effective permissions. Effective Permissions can account for the following scenarios:
In Databricks, Access Control Lists (ACLs) regulate an identity's permissions on data tables, clusters, pools, and jobs, as well as workspace objects such as notebooks, experiments, and folders. By default, these controls are disabled and all users are allowed to do anything.
Permissions can be inherited from the parent entity.
Permissions on data tables are only available when table access control is enabled both in workspace settings and on the cluster (High Concurrency clusters only).
If permissions on data tables aren't enabled for a cluster, all users with permissions on that cluster can also access all tables.
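The inheritance and cluster-gating scenarios above can be sketched as a toy resolver. This is a simplified illustration of the listed rules, not Veza's algorithm; the entity hierarchy, principals, and permission names are invented for the example.

```python
# Toy model of the scenarios above: permissions inherit from parent
# entities, and if table ACLs are not enabled on a cluster, any user
# with access to that cluster effectively reaches every table.
acls = {
    # entity -> {principal: permission}
    "catalog": {},
    "schema": {"alice": "SELECT"},
    "table": {},
}
parents = {"table": "schema", "schema": "catalog", "catalog": None}

def effective_permission(entity, principal,
                         table_acls_enabled=True, can_use_cluster=False):
    # Scenario: table ACLs disabled -> cluster access implies table access.
    if not table_acls_enabled and can_use_cluster:
        return "ALL"
    # Scenario: walk up the parent chain looking for an inherited grant.
    while entity is not None:
        grant = acls[entity].get(principal)
        if grant:
            return grant
        entity = parents[entity]
    return None

print(effective_permission("table", "alice"))   # grant inherited from schema
print(effective_permission("table", "bob",
                           table_acls_enabled=False,
                           can_use_cluster=True))  # cluster access wins
```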