Overview
This lecture explains multiple methods for connecting Azure Data Lake Storage Gen2 (ADLS Gen2) to Databricks, including hands-on steps for secure and scalable integration.
Basic Setup & Storage Account Keys
- To connect ADLS Gen2 to Databricks, you need a storage account, a container with your data, and an active Databricks cluster.
- The simplest way to connect is to set the storage account name and one of its access keys in the Spark configuration, much like a username and password.
- This method is not recommended: an access key grants full access to the entire storage account, and managing and rotating many keys scales poorly.
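The key-based connection above can be sketched as follows in a notebook; the storage account, container, and key values are placeholders, not names from the lecture:

```python
# Sketch only: authenticate with a storage account access key
# (not recommended for production). All <...> values are placeholders.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<storage-account-access-key>",
)

# Once the key is set, data can be read directly via the abfss:// URI.
df = spark.read.csv(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/sample.csv",
    header=True,
)
```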
Service Principal & App Registration Method
- Instead of storage keys, use an Azure service principal (app registration) for improved security and governance.
- Register a new app in Azure portal to create a service principal, then generate a client secret.
- Assign the "Storage Blob Data Contributor" role to the service principal via the storage account's IAM settings.
Azure Key Vault & Databricks Secret Scope
- Store the service principal client secret in Azure Key Vault instead of exposing it in Databricks notebooks.
- Configure Key Vault access policies granting Get and List permissions on secrets, so the secrets can be read at runtime without appearing in notebook code.
- In Databricks, create a secret scope that connects to the Key Vault using its DNS name and resource ID.
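Once the Key Vault-backed secret scope exists, secrets are read with `dbutils.secrets`; the scope and key names below are illustrative placeholders:

```python
# Read the service principal's client secret from the Key Vault-backed
# secret scope. "kv-scope" and "sp-client-secret" are placeholder names.
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")

# Note: if printed in a notebook, the value is shown as [REDACTED],
# so the secret never appears in plain text.
```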
Mounting Storage in Databricks
- Use Spark configuration settings to authenticate with the client ID, client secret, and tenant ID, all retrieved from Key Vault through the secret scope.
- Mounting a container exposes it at a DBFS path (a mount point), giving persistent connectivity without repeated token exchanges in every notebook.
- If mounting fails (for example because the path is already mounted), unmount the container and mount it again.
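The mounting steps above can be sketched as one notebook cell. The OAuth provider class and endpoint are the standard ABFS OAuth settings; the secret scope, key names, container, and account names are placeholders:

```python
# Pull service principal credentials from the Key Vault-backed secret scope.
# Scope and key names are placeholders for illustration.
client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

# Standard OAuth 2.0 client-credentials configuration for ADLS Gen2.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

mount_point = "/mnt/raw"  # placeholder mount path

# Unmount first if the path is already mounted, to avoid
# "Directory already mounted" errors on re-runs.
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point=mount_point,
    extra_configs=configs,
)
```

After mounting, the data is addressable by path, e.g. `spark.read.csv("/mnt/raw/sample.csv")`, from any cluster in the workspace.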
High Concurrency Cluster & Credential Passthrough
- On high concurrency clusters, you can enable credential passthrough, which lets Databricks access storage with the Azure AD identity of the user running the command.
- This approach is not recommended for production, and on high concurrency clusters it does not support R or Scala commands.
- It is best to use the service principal and Key Vault approach for production environments.
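With passthrough enabled at the cluster level, no keys or secrets appear in code; data is read directly and access is decided by the user's own permissions. The path below is a placeholder:

```python
# On a high concurrency cluster with credential passthrough enabled,
# this read is authorized with the notebook user's own Azure AD identity;
# no account keys or service principal secrets are needed in the code.
df = spark.read.parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/events/"
)
```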
Key Terms & Definitions
- ADLS Gen2 – Azure Data Lake Storage Gen2, a scalable Microsoft cloud storage service for big data.
- Databricks – An analytics platform for big data and machine learning.
- Service Principal/App Registration – An Azure security identity used by applications to access resources.
- Key Vault – Azure service for securely storing keys, secrets, and certificates.
- Secret Scope – A secure wrapper in Databricks for managing secrets and connecting to Key Vault.
- Mount Point – A Databricks path tied to an external storage location, enabling persistent access.
- Credential Passthrough – A method for Databricks to use the user's own credentials for data access.
Action Items / Next Steps
- Practice setting up the connection using service principal, Key Vault, and secret scope in Databricks.
- Review permissions and IAM assignments for service principals.
- Ensure you unmount storage before remounting if you encounter access errors.