Limiting Extractions

Options for restricting data source extractions

When connecting to a configured identity or data provider, Veza will attempt to discover all supported resources by default. There are two methods to limit the services and resources discovered:

  • Toggle discovery of select services (skipping services such as AWS KMS or Azure SQL entirely)

  • Set allow and deny lists to limit data sources by name (only parsing individual resources)

Limiting the services or resources discovered can help you:

  • Omit unnecessary data sources following a naming pattern (such as test-db-*)

  • Prevent connection errors (for example, if you haven't yet created a required local database user)

  • Improve performance by limiting the size of the Identity Data Entities catalog

  • Ingest services one-by-one during initial parsing to incrementally update, instead of running a single long extraction

You can enable these preferences when adding a new provider, or change them for an existing integration by finding the provider in the Configuration menu and clicking the "Edit" button.

To toggle services discovered, choose Select services to enable in the provider configuration. When you save your changes, only the selected services will be scanned and added to the data catalog.

Allow or deny data sources

You can set allow and deny lists to limit extraction by resource name (including wildcards). Allow/deny lists are available for most data sources, including Google Cloud projects/domains, AWS Redshift/RDS databases, S3 buckets, and Snowflake databases.

When an allow list is saved, only resources with a matching name are parsed and added to the Identity Data Entities catalog. If a deny list is configured, any data sources with a matching name will be ignored during discovery.

The following rules apply:

  • If no values are provided, all data sources are extracted

  • If a resource name matches the allow list, it will be extracted

  • If a resource name matches the deny list, it will be ignored

  • If both allow and deny lists are configured, a resource is extracted only if it matches the allow list and does not match the deny list

List entries can contain any number of wildcards (*), each matching any number of characters.
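The rules above can be sketched in a few lines of Python. This is only an illustration of how the allow and deny lists combine, not Veza's implementation: the function names are hypothetical, matching is assumed to be case-sensitive and to apply to the whole resource name, and an empty list is treated the same as no list.

```python
import re
from typing import Optional, Sequence


def matches_any(name: str, patterns: Sequence[str]) -> bool:
    """Return True if name matches at least one pattern.

    Only '*' is treated as a wildcard (any number of characters);
    every other character is matched literally.
    """
    for pattern in patterns:
        # Escape the literal parts and join them with '.*' for each wildcard.
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.fullmatch(regex, name, flags=re.DOTALL):
            return True
    return False


def is_extracted(
    name: str,
    allow: Optional[Sequence[str]] = None,
    deny: Optional[Sequence[str]] = None,
) -> bool:
    """Apply the allow/deny rules to a single resource name (hypothetical helper)."""
    # With an allow list configured, the name must match it.
    if allow and not matches_any(name, allow):
        return False
    # With a deny list configured, a matching name is ignored.
    if deny and matches_any(name, deny):
        return False
    # No lists, or allowed and not denied: the resource is extracted.
    return True


if __name__ == "__main__":
    for resource in ["prod-orders", "test-db-1", "prod-test-db-2"]:
        print(resource, is_extracted(resource, allow=["prod-*"], deny=["*test-db-*"]))
```

In this sketch, a resource named prod-test-db-2 matches the allow entry prod-* but is still skipped, because it also matches the deny entry *test-db-*.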

Naming conventions

The value to use as the resource name depends on the provider. See the table below for more information about the format:

Data source              Resource name format
AWS Redshift database    Database ARN, for example: arn:aws:redshift:region:account-id:cluster:cluster-name
AWS RDS database         RDS database name
AWS S3 bucket            S3 bucket name
Google Cloud project     Project ID
Google BigQuery          Dataset name, table name
SQL Server               Database / Schema name
Snowflake                Snowflake database name

To retrieve these values for an entity that has already been parsed:

  1. Search for the entity using the Authorization Graph

  2. Click the node to open the actions sidebar, and choose "Show Details"

  3. Find the name to use among the entity properties

You can also see the complete metadata for entities in your data catalog by browsing to Identity Data Entities, selecting the data provider, and adding a column for the properties you want to display.

Azure settings

When modifying an Azure tenant configuration, several additional options are available:

Option                   Description
gather_guest_users       Whether to parse identity metadata for Azure AD Guest users
gather_disabled_users    Whether to include disabled users
domains                  Comma-separated list of AD domains to discover, ignoring any others
gather_personal_sites    Whether to include personal SharePoint sites
