Databricks (Single Workspace)

Configuring the Veza integration for Databricks

This integration provides support for the Databricks machine learning platform. Veza discovers entity and authorization metadata using the native Databricks REST API, authenticating with a user access token. A customer-provided cluster must be available for executing SQL commands to discover table entities and permissions.
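As an illustration of the discovery mechanism, the sketch below calls the Databricks REST API with a personal access token to list clusters. The workspace URL and token are hypothetical placeholders, not values from this integration.

```python
import requests

# Hypothetical placeholders -- substitute your workspace URL and token.
WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-example-token"

# The Databricks REST API accepts a personal access token as a Bearer token.
resp = requests.get(
    f"{WORKSPACE}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"])
```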

For organizations that use single sign-on (SSO) for federated access to Databricks, the integration discovers authorization and effective permissions that Azure, Okta, and AWS identities have on Databricks resources.

For details on supported entities, see the Notes section below.

If you use Unity Catalog to govern access to Databricks on Microsoft Azure, AWS, or Google Cloud, you can enable Databricks discovery as part of the cloud provider integration configuration. See Databricks Unity Catalog for more information.

Requirements

To connect to Azure Databricks, the workspace must enable both Workspace Access Control and Cluster, Pool, and Jobs Access Control. These features require a Premium Azure Databricks plan. To enable these settings:

  1. As a Databricks administrator, click your username in the top bar of the Azure Databricks workspace and click Admin Settings.

  2. Open Workspace Settings.

  3. Toggle Workspace Access Control and Cluster, Pool and Jobs Access Control.

  4. Click Confirm.

See Enable Access Control for details.

Authentication

The integration requires a Databricks user with account admin rights, which are required to list all workspace entities and permissions. Using a non-admin user's token will result in incomplete discovery.

  1. Create a new Databricks Admin user Veza can connect as.

    1. As an account admin, log in to the account console.

    2. Click Account Console > User management.

    3. On the Users tab, click Add User.

    4. Provide a name and email address for the user.

    5. Click Send invite.

  2. Generate a personal access token.

    1. Click Settings in the lower left corner of your Databricks workspace.

    2. Click User Settings.

    3. Go to the Access Tokens tab.

    4. Click the Generate New Token button. Optionally enter a description (comment) and expiration period.

    5. Click Generate. Copy the generated token, and store it securely.

  3. Assign the admin role to the user.

    1. As an account admin, log in to the account console.

    2. Click Account Console > User management.

    3. Find and click the user you created.

    4. On the Roles tab, turn on Account admin.

See Authentication using Databricks personal access tokens for more information.
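Before configuring Veza, you can verify the token with the SCIM Me endpoint, which returns the identity the token authenticates as. This is a minimal sketch; the workspace URL and token are placeholders.

```python
import requests

# Hypothetical placeholders for the workspace URL and the token created above.
WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-example-token"

# /api/2.0/preview/scim/v2/Me returns the user the token authenticates as.
resp = requests.get(
    f"{WORKSPACE}/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json()["userName"])  # should print the Veza service user's email
```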

Creating a cluster

To extract metadata for the Databricks storage layer, Veza needs to run SQL queries on one of the clusters in the workspace. You should create a separate cluster for this purpose. The cluster starts automatically only when Veza conducts extractions, and stops automatically after a set period of inactivity.

To create a cluster using the Databricks UI, choose Create > Cluster:

  • The cluster can be a small single-node cluster

  • Enable automatic termination after an inactivity period (~10 minutes)

  • Add spark.databricks.acl.sqlOnly true to Advanced Options > Spark > Spark config

  • Ensure the user created for the Veza integration has CAN_MANAGE permission on the cluster (More > Permissions)

Once the cluster is running, copy the cluster's HTTP endpoint from Advanced Options > JDBC/ODBC > HTTP path.

For more details, see the Databricks documentation on creating clusters.
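If you prefer to script cluster creation instead of using the UI, the sketch below creates a single-node cluster with the settings above via the Clusters API. The cluster name, runtime version, and node type are illustrative assumptions; pick values appropriate for your cloud.

```python
import requests

# Hypothetical placeholders -- substitute your workspace URL and token.
WORKSPACE = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-example-token"

# Single-node cluster with auto-termination and SQL-only table ACLs.
cluster_spec = {
    "cluster_name": "veza-extraction",      # illustrative name
    "spark_version": "13.3.x-scala2.12",    # pick a current LTS runtime
    "node_type_id": "Standard_DS3_v2",      # cloud-specific; Azure example
    "num_workers": 0,                       # single-node: driver only
    "autotermination_minutes": 10,          # stop after ~10 minutes idle
    "spark_conf": {
        "spark.databricks.acl.sqlOnly": "true",
        # Standard single-node profile settings:
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```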

Veza Configuration

From the Veza Configuration panel, navigate to the Apps & Data Sources tab. Scroll down to the Standalone Databases section and click Add New. Choose "Databricks" and provide the required information:

| Field | Details |
| --- | --- |
| Name | Display name for the integration |
| Workspace URL | Web address of the Databricks workspace (without the https:// prefix) |
| Access Token | Personal access token of the Databricks user |
| Cluster Endpoint | HTTP path of the cluster Veza uses to run SQL queries (from Advanced Options > JDBC/ODBC > HTTP path) |
| SSO Type | The identity provider used for single sign-on (optional) |
| SSO ID | Datasource ID of the SSO provider. The Azure, AWS Identity Center, or Okta identity provider must already be integrated with Veza as a data source. |

Cross Service Connections

To add an SSO connection for Databricks:

  1. Use Authorization Graph to search for the Azure AD Domain, Okta Domain, or AWS Identity Center service. Open the Entity Details and copy the Datasource ID.

  2. Open Data Catalog > Apps and Data Sources and find your Databricks provider under Standalone Databases. Click Edit. If you haven't configured the provider yet, click Add New and choose Databricks.

  3. Select your SSO provider as the SSO type. For SSO ID, use the Datasource ID of the Azure AD, AWS Identity Center, or Okta IdP.

| Provider | Datasource ID |
| --- | --- |
| Azure AD | Azure Tenant ID (ff57cf71-ac1c-43b8-8111-43b1be101dab) |
| Okta | Okta Domain (<domain>.okta.com) |
| AWS Identity Center | AWS Identity Center identity store ID (ff57cf71-ac1c-43b8-8111-43b1be101dab) |

Notes

  • Within Databricks, Access Control Lists govern permissions to different entities such as catalogs, schemas, tables, clusters, folders, and notebooks.

  • Every Databricks workspace has a central Hive metastore, accessible by all clusters, that stores table metadata. This metastore (hive_metastore) is the only catalog entity Veza supports, along with its schemas, tables, and the permissions on those entities. The Hive metastore does not support assigning permissions to catalogs.

Supported Entities

| Entity | Details |
| --- | --- |
| Workspace | Contains all other resources (users, groups, clusters, directories, notebooks). Users typically have their own directories, which can contain subdirectories and notebooks. |
| Cluster | A set of computation resources (a Spark cluster) with access to all Hive data sources defined in the workspace. Only High Concurrency clusters support permissions on tables. |
| Notebook | Executable code that can be attached to a cluster and run. Each notebook has an owner and a set of permissions. |
| Directory | Contains subdirectories and notebooks. Each directory has an owner and a set of permissions. By default, each user gets their own directory. |
| Catalog | Databricks catalog |
| Schema | Databricks schema |
| Table | Databricks table |
| View | Databricks view |
| User | Local workspace user |
| Group | Local workspace group |

The following entities are not currently supported:

  • SQL Warehouse

  • Experiment

  • Job

  • Cluster Pool

  • Pipeline

  • Query

  • Dashboard

Effective Permissions

From Authorization Graph, select an effective permissions (EP) node and click Explain Effective Permissions to view the raw Databricks ACLs that produce a set of effective permissions. Effective Permissions can account for the following scenarios:

  • In Databricks, Access Control Lists (ACLs) regulate an identity's permissions to access data tables, clusters, pools, and jobs, as well as workspace objects such as notebooks, experiments, and folders. By default, these controls are disabled and all users have full access.

  • Permissions can be inherited from the parent entity.

  • Permissions on data tables are only available when table access control is enabled both in the workspace settings and on the cluster (High Concurrency clusters only).

  • If permissions on data tables aren't enabled for a cluster, then all users that have permissions on that cluster can also access all tables.
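To inspect the raw ACLs Veza works from, you can run the same kind of SQL query yourself over the cluster's HTTP path, for example with the open-source databricks-sql-connector package. The hostname, HTTP path, and table name below are hypothetical placeholders.

```python
# pip install databricks-sql-connector
from databricks import sql

# Hypothetical placeholders -- use your workspace hostname, the cluster's
# HTTP path (Advanced Options > JDBC/ODBC), and a real table name.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="sql/protocolv1/o/1234567890123456/0123-456789-abcde123",
    access_token="dapi-example-token",
) as connection:
    with connection.cursor() as cursor:
        # Legacy table ACL syntax: lists grants on the table. Veza resolves
        # these, plus grants inherited from parent entities, into effective
        # permissions.
        cursor.execute("SHOW GRANT ON TABLE default.sales")
        for row in cursor.fetchall():
            print(row)
```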
