Configuring the Veza integration for Databricks
This integration provides support for the Databricks machine learning platform. Veza discovers entity and authorization metadata via the native Databricks REST API, authenticating with a user access token. A customer-provided cluster must be available for executing SQL commands to discover table entities and permissions.
For organizations that use single sign-on (SSO) for federated access to Databricks, the integration discovers authorization and effective permissions that Azure, Okta, and AWS identities have on Databricks resources.
For more information and details on supported entities, see the notes below.
If you use Unity Catalog to govern access to Databricks on Microsoft Azure, AWS, or Google Cloud, you can enable Databricks discovery as part of the cloud provider integration configuration. See Databricks Unity Catalog for more information.
To connect to Azure Databricks, the workspace must enable both Workspace Access Control and Cluster, Pool, and Jobs Access Control. These features require a Premium Azure Databricks plan. To enable these settings:
As a Databricks administrator, click your username in the top bar of the Azure Databricks workspace and click Admin Settings.
Open Workspace Settings.
Toggle Workspace Access Control and Cluster, Pool, and Jobs Access Control.
Click Confirm.
See Enable Access Control for details.
The integration requires a Databricks user with account admin rights (required to list all workspace entities and permissions). Using a non-admin user token will result in an incomplete discovery.
Create a new Databricks Admin user Veza can connect as.
As an account admin, log in to the account console.
Click Account Console > User management.
On the Users tab, click Add User.
Provide a name and email address for the user.
Click Send invite.
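If you prefer to script this step, the workspace-level SCIM API can create the user instead of the account console UI. A minimal sketch, assuming the Python requests library; the workspace URL, admin token, and user email are placeholders:

```python
import requests

# Placeholders: your workspace URL and an existing admin's personal access token.
WORKSPACE = "https://<workspace>.cloud.databricks.com"
ADMIN_TOKEN = "dapi-..."

# Create the integration user via the workspace SCIM API.
resp = requests.post(
    f"{WORKSPACE}/api/2.0/preview/scim/v2/Users",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": "veza-integration@example.com",  # hypothetical address
        "displayName": "Veza Integration",
    },
)
resp.raise_for_status()
print("Created user:", resp.json()["id"])
```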
Generate a personal access token.
Click Settings in the lower left corner of your Databricks workspace.
Click User Settings.
Go to the Access Tokens tab.
Click the Generate New Token button. Optionally enter a description (comment) and expiration period.
Click Generate. Copy the generated token, and store it securely.
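Tokens can also be minted programmatically with the Token API, authenticating with an existing token for the same user. A hedged sketch; the workspace URL, existing token, and 90-day lifetime are placeholders:

```python
import requests

WORKSPACE = "https://<workspace>.cloud.databricks.com"
EXISTING_TOKEN = "dapi-..."  # an existing token for the Veza user, e.g. from the UI steps above

# Create a new personal access token with a comment and a 90-day lifetime.
resp = requests.post(
    f"{WORKSPACE}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
    json={"comment": "Veza integration", "lifetime_seconds": 90 * 24 * 3600},
)
resp.raise_for_status()
print(resp.json()["token_value"])  # shown only once; store it securely
```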
Assign the admin role to the user.
As an account admin, log in to the account console.
Click Account Console > User management.
Find and click the user you created.
On the Roles tab, turn on Account admin.
See Authentication using Databricks personal access tokens for more information.
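Before configuring Veza, you can sanity-check the token and the admin role with two REST calls. A minimal sketch, assuming the requests library and placeholder values:

```python
import requests

WORKSPACE = "https://<workspace>.cloud.databricks.com"
TOKEN = "dapi-..."
headers = {"Authorization": f"Bearer {TOKEN}"}

# Confirm the token authenticates as the Veza user.
me = requests.get(f"{WORKSPACE}/api/2.0/preview/scim/v2/Me", headers=headers)
me.raise_for_status()
print("Authenticated as:", me.json()["userName"])

# Listing workspace users generally requires admin rights, so a 403 here
# suggests the admin role assignment has not taken effect yet.
users = requests.get(f"{WORKSPACE}/api/2.0/preview/scim/v2/Users", headers=headers)
print("User listing status:", users.status_code)
```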
To extract metadata for the Databricks storage layer, Veza needs to run SQL queries on one of the clusters in the workspace. You should create a separate cluster for this purpose. The cluster is started automatically only when Veza is conducting extractions and stopped automatically after a set period of inactivity.
To create a cluster using the Databricks UI, select Create > Cluster:
The cluster can be a small single-node cluster.
Enable termination after an inactivity period (~10 minutes).
Add spark.databricks.acl.sqlOnly true to Advanced Options > Spark > Spark config.
Ensure the user created for the Veza integration has CAN_MANAGE permission on the cluster (More > Permissions).
Once the cluster is running, copy the cluster's HTTP endpoint from Advanced Options > JDBC/ODBC > HTTP path.
For more details on creating Databricks clusters, see the Databricks documentation.
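The same cluster can also be created with the Clusters API. A hedged sketch of the steps above; spark_version and node_type_id are placeholders you should replace with values valid in your workspace (list them via GET /api/2.0/clusters/spark-versions and GET /api/2.0/clusters/list-node-types):

```python
import requests

WORKSPACE = "https://<workspace>.cloud.databricks.com"
TOKEN = "dapi-..."

# Single-node cluster that auto-terminates and only accepts SQL commands.
cluster_spec = {
    "cluster_name": "veza-extraction",
    "spark_version": "13.3.x-scala2.12",   # placeholder
    "node_type_id": "i3.xlarge",           # placeholder
    "num_workers": 0,
    "autotermination_minutes": 10,
    "spark_conf": {
        "spark.databricks.acl.sqlOnly": "true",
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Cluster ID:", resp.json()["cluster_id"])
```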
From the Veza Configuration panel, navigate to the Apps & Data Sources tab. Scroll down to the Standalone Databases section and click Add New. Choose "Databricks" and provide the required information:
Name: Display name for the integration.
Workspace URL: Web address of the Databricks workspace (without the https:// prefix).
Access Token: Personal access token of the Databricks user.
Cluster Endpoint: JDBC/ODBC HTTP path of the cluster configured for Veza use.
SSO Type: The identity provider used for single sign-on (optional).
SSO ID: Data Source ID of the identity provider used for SSO (optional).
The Azure, AWS Identity Center, or Okta Identity Provider used for SSO must be integrated with Veza as a data source.
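Before saving, you can verify that the Workspace URL, Access Token, and Cluster Endpoint work together using the databricks-sql-connector package. A minimal sketch with placeholder values:

```python
# pip install databricks-sql-connector
from databricks import sql

# These map one-to-one to the integration fields above (placeholder values).
with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",   # Workspace URL
    http_path="sql/protocolv1/o/<org-id>/<cluster-id>",   # Cluster Endpoint
    access_token="dapi-...",                              # Access Token
) as conn, conn.cursor() as cursor:
    cursor.execute("SHOW SCHEMAS IN hive_metastore")
    print(cursor.fetchall())
```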
Cross Service Connections
To add an SSO connection for Databricks:
Use Authorization Graph to search for the Azure AD Domain, Okta Domain, or AWS Identity Center service. Open the Entity Details and copy the Datasource ID.
Open Data Catalog > Apps and Data Sources and find your Databricks provider under Standalone Databases. Click Edit. If you haven't configured the provider yet, click Add New and choose Databricks.
Select your SSO provider as the SSO type. For SSO ID, use the Datasource ID of the Azure AD, AWS Identity Center, or Okta IdP.
Azure AD: Azure Tenant ID (ff57cf71-ac1c-43b8-8111-43b1be101dab)
Okta: Okta Domain (<domain>.okta.com)
AWS Identity Center: AWS Identity Center identity store ID (ff57cf71-ac1c-43b8-8111-43b1be101dab)
Within Databricks, Access Control Lists govern permissions to different entities such as catalogs, schemas, tables, clusters, folders, and notebooks.
Every Databricks workspace has a central Hive metastore, accessible by all clusters, that stores table metadata. This metastore (hive_metastore) is the sole catalog entity supported by Veza, along with schemas, tables, and permissions on those entities. The Hive metastore does not support assigning permissions to catalogs.
Supported Entities
Workspace: Contains all other resources (users, groups, clusters, directories, notebooks). Users typically have their own directories, which can have subdirectories and notebooks.
Cluster: A set of computation resources (a Spark cluster). Only High Concurrency clusters support permissions on tables. A cluster has access to all Hive data sources defined in the workspace.
Notebook: Executable code that can be attached to a cluster and run. Each notebook has an owner and a set of permissions.
Directory: Contains subdirectories and notebooks. Each directory has an owner and a set of permissions. By default, each user gets their own directory.
Catalog: Databricks catalog
Schema: Databricks schema
Table: Databricks table
View: Databricks view
User: Local workspace user
Group: Local workspace group
The following entities are not currently supported:
SQL Warehouse
Experiment
Job
Cluster Pool
Pipeline
Query
Dashboard
Effective Permissions
In Databricks, Access Control Lists (ACLs) regulate an identity's permissions to access data tables, clusters, pools, and jobs, as well as workspace objects such as notebooks, experiments, and folders. By default, these controls are turned off: all users are allowed to do anything.
From Authorization Graph, select an Effective Permissions node and click Explain Effective Permissions to view the raw Databricks ACLs that result in a set of effective permissions. Effective Permissions can account for the following scenarios:
Permissions can be inherited from the parent entity.
Permissions on data tables are only available when table access control is enabled both in workspace settings and on the cluster (supported only on High Concurrency clusters).
If permissions on data tables aren't enabled for a cluster, then all users that have permissions on that cluster can also access all tables.
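As a hedged illustration of inheritance (the connection values, schema, table, and principal are all hypothetical, and the cluster is assumed to enforce table access control), a grant at the database (schema) level flows down to the tables it contains, and inspecting an individual table surfaces the inherited entry:

```python
from databricks import sql

with sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="sql/protocolv1/o/<org-id>/<cluster-id>",
    access_token="dapi-...",
) as conn, conn.cursor() as cursor:
    # Grant at the database (schema) level; tables inside inherit SELECT.
    cursor.execute("GRANT SELECT ON DATABASE sales TO `analyst@example.com`")
    # Inspect a single table: the inherited grant appears in the output.
    cursor.execute("SHOW GRANT `analyst@example.com` ON TABLE sales.orders")
    for row in cursor.fetchall():
        print(row)
```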