Connecting ADLS Gen2 to Databricks

Jul 10, 2025

Overview

This lecture explains multiple methods for connecting Azure Data Lake Storage Gen2 (ADLS Gen2) to Databricks, including hands-on steps for secure and scalable integration.

Basic Setup & Storage Account Keys

  • To connect ADLS Gen2 to Databricks, you need a storage account, a container with your data, and an active Databricks cluster.
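  • The simplest way to connect is with the storage account access key: the account name acts like a username and the access key like a password.
  • This method is not recommended. An access key grants full access to the entire storage account, and it scales poorly because every additional storage account adds more keys to manage.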
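
A minimal sketch of the access-key method, assuming a Databricks notebook where `spark` is predefined. The helper names and the account, container, and file values are hypothetical placeholders; the configuration key follows the ABFS driver's `fs.azure.account.key.<account>.dfs.core.windows.net` pattern.

```python
def account_key_conf(storage_account: str, access_key: str) -> dict:
    """Return the Spark setting that authenticates with a storage account access key."""
    return {f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": access_key}


def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# In a notebook (placeholder values, not real ones):
# for key, value in account_key_conf("mystorageacct", "<access-key>").items():
#     spark.conf.set(key, value)
# df = spark.read.csv(abfss_uri("raw", "mystorageacct", "sales/2025.csv"), header=True)
```

Pasting the key into a notebook like this is exactly the exposure the notes warn about, so treat it as a quick test only.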

Service Principal & App Registration Method

  • Instead of storage keys, use an Azure service principal (app registration) for improved security and governance.
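  • Register a new app in the Azure portal to create a service principal, then generate a client secret. Note the application (client) ID and directory (tenant) ID, which are needed later along with the secret.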
  • Assign the "Storage Blob Data Contributor" role to the service principal via the storage account's IAM settings.
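
The lecture assigns the role through the portal's IAM blade. As an alternative sketch, the same assignment could be scripted with the Azure CLI's `az role assignment create` command. The helper names and all IDs below are hypothetical, and running the command assumes the Azure CLI is installed and logged in with permission to assign roles.

```python
def storage_account_scope(subscription_id: str, resource_group: str, storage_account: str) -> str:
    """Build the Azure resource ID of a storage account, used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )


def role_assignment_command(app_id: str, scope: str,
                            role: str = "Storage Blob Data Contributor") -> list:
    """Return the Azure CLI argument list that grants `role` to the service principal."""
    return ["az", "role", "assignment", "create",
            "--assignee", app_id, "--role", role, "--scope", scope]


# Example (placeholder IDs); uncomment to execute on a machine with the Azure CLI:
# import subprocess
# scope = storage_account_scope("<subscription-id>", "my-rg", "mystorageacct")
# subprocess.run(role_assignment_command("<app-client-id>", scope), check=True)
```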

Azure Key Vault & Databricks Secret Scope

  • Store the service principal client secret in Azure Key Vault instead of exposing it in Databricks notebooks.
  • Configure Key Vault access policies to allow the app registration to get and list secrets.
  • In Databricks, create a secret scope that connects to the Key Vault using its DNS name and resource ID.
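
Once the secret scope exists, a notebook can read the service principal's values through `dbutils.secrets.get` instead of hard-coding them. The sketch below assumes the client ID and tenant ID were also stored in Key Vault; the scope name and the three secret names are hypothetical and should match whatever was actually created.

```python
from typing import NamedTuple


class ServicePrincipal(NamedTuple):
    client_id: str
    client_secret: str
    tenant_id: str


def load_service_principal(secrets, scope: str,
                           client_id_key: str = "sp-client-id",
                           client_secret_key: str = "sp-client-secret",
                           tenant_id_key: str = "sp-tenant-id") -> ServicePrincipal:
    """Read the service principal's credentials from a secret scope.

    `secrets` is expected to be `dbutils.secrets` (or any object with a
    compatible `get(scope=..., key=...)` method).
    """
    return ServicePrincipal(
        client_id=secrets.get(scope=scope, key=client_id_key),
        client_secret=secrets.get(scope=scope, key=client_secret_key),
        tenant_id=secrets.get(scope=scope, key=tenant_id_key),
    )


# In a notebook (hypothetical scope name):
# sp = load_service_principal(dbutils.secrets, scope="kv-scope")
```

Passing `dbutils.secrets` in as an argument keeps the helper usable outside a notebook, for example with a stub in a unit test.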

Mounting Storage in Databricks

  • Set the connection configuration in Databricks using the client ID, client secret, and tenant ID, reading them from Key Vault through the secret scope.
  • Mounting a container at a mount point gives persistent access, so the authentication setup does not have to be repeated in every notebook.
  • Unmount and remount containers if you encounter mount errors.
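
The mounting step can be sketched as follows, assuming a notebook where `dbutils` is available and the three credential values have already been read from the secret scope. The function names, container, account, and mount point are hypothetical; the configuration keys are the ABFS driver's OAuth client-credentials settings.

```python
def service_principal_oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the ABFS OAuth (client credentials) settings for a service principal."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_container(dbutils, container: str, storage_account: str,
                    mount_point: str, configs: dict) -> str:
    """Mount an ADLS Gen2 container and return the source URI that was mounted."""
    source = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return source


# In a notebook (placeholder names):
# configs = service_principal_oauth_configs(client_id, client_secret, tenant_id)
# mount_container(dbutils, "raw", "mystorageacct", "/mnt/raw", configs)
# display(dbutils.fs.ls("/mnt/raw"))
```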

High Concurrency Cluster & Credential Passthrough

  • On high concurrency clusters, you can use credential passthrough, letting Databricks use the command runner's credentials.
  • This approach is not recommended for production and does not support R or Scala commands.
  • It is best to use the service principal and Key Vault approach for production environments.
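
For comparison, a passthrough mount needs no client secret: the mount configuration points at a token provider that the cluster supplies. This is a sketch that assumes a cluster with credential passthrough enabled; the function name is hypothetical, and the Spark setting read in the usage comment is the one Databricks documents for ADLS Gen2 passthrough.

```python
def passthrough_mount_configs(token_provider_class: str) -> dict:
    """Build mount settings that delegate authentication to the running user's identity."""
    return {
        "fs.azure.account.auth.type": "CustomAccessToken",
        "fs.azure.account.custom.token.provider.class": token_provider_class,
    }


# In a notebook on a passthrough-enabled cluster (placeholder names):
# provider = spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
# dbutils.fs.mount(
#     source="abfss://raw@mystorageacct.dfs.core.windows.net/",
#     mount_point="/mnt/raw-passthrough",
#     extra_configs=passthrough_mount_configs(provider),
# )
```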

Key Terms & Definitions

  • ADLS Gen2: Azure Data Lake Storage Gen2, a scalable Microsoft cloud storage service for big data.
  • Databricks: An analytics platform for big data and machine learning.
  • Service Principal/App Registration: An Azure security identity used by applications to access resources.
  • Key Vault: Azure service for securely storing keys, secrets, and certificates.
  • Secret Scope: A secure wrapper in Databricks for managing secrets and connecting to Key Vault.
  • Mount Point: A Databricks path tied to an external storage location, enabling persistent access.
  • Credential Passthrough: A method for Databricks to use the user's own credentials for data access.

Action Items / Next Steps

  • Practice setting up the connection using service principal, Key Vault, and secret scope in Databricks.
  • Review permissions and IAM assignments for service principals.
  • Ensure you unmount storage before remounting if you encounter access errors.
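
The unmount-then-remount step can be wrapped in a small helper so it is safe to rerun. This is a sketch that assumes a notebook where `dbutils` is available; the function name `remount` and the example values are hypothetical.

```python
def remount(dbutils, source: str, mount_point: str, configs: dict) -> bool:
    """Mount `source` at `mount_point`, unmounting first if the point is already in use.

    Returns True if an existing mount had to be removed, False otherwise.
    """
    already_mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())
    if already_mounted:
        dbutils.fs.unmount(mount_point)
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return already_mounted


# In a notebook (placeholder names; `configs` holds the authentication settings):
# remount(dbutils, "abfss://raw@mystorageacct.dfs.core.windows.net/", "/mnt/raw", configs)
```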