
CI/CD Setup for Databricks

Jul 10, 2025

Overview

This lecture explains how to set up Continuous Integration and Continuous Deployment (CI/CD) for Databricks using Azure DevOps, covering repo setup, branching, pipelines, and deployment automation.

Prerequisites & Setup

  • Ensure you have Databricks and Azure DevOps accounts set up under the same user ID (email), so the two services can be linked.
  • Create a new Azure DevOps project where code will be stored and managed.
  • Initialize a repository in Azure DevOps (Repos section).
  • In Databricks, access Repos to integrate with Azure DevOps.

Linking Databricks Repos with Azure DevOps

  • Use "Add Repo" in Databricks to link your development space with an Azure DevOps repository via the Git repository URL.
  • Pull the latest code from Azure DevOps to your Databricks environment for development.
  • After making changes in Databricks, commit and push code back to Azure DevOps.

Branching & Pull Requests

  • Create feature branches in Azure DevOps for individual development tasks.
  • Developers push changes to feature branches, then create pull requests to merge into the main branch.
  • Pull requests are reviewed and approved before merging; delete feature branches after merging to avoid clutter.

CI/CD Pipeline Creation

  • Create a YAML pipeline in Azure DevOps for automation.
  • The pipeline process:
    • Pulls the latest code and creates an artifact.
    • Uses the Databricks CLI for workspace operations.
    • Connects to Databricks using a host URL and token (stored in Azure DevOps variable groups).
    • Deletes old files from the target folder (e.g., 'release') and imports updated artifacts.
  • Define triggers in the pipeline YAML so the pipeline runs automatically on changes to the main branch (a minimal pipeline sketch follows this list).
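
A minimal sketch of such a pipeline is shown below. The variable group name (databricks-variables), the variable names (databricks-host, databricks-token), the target folder (/release), and the use of the pip-installable legacy Databricks CLI are all assumptions for illustration; adapt them to your workspace.

    # azure-pipelines.yml -- illustrative sketch, not the lecture's exact pipeline
    trigger:
      branches:
        include:
          - main                      # run on every push to main

    pool:
      vmImage: 'ubuntu-latest'

    variables:
      - group: databricks-variables   # assumed variable group holding databricks-host / databricks-token

    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: '3.10'

      # Install the (legacy, pip-distributed) Databricks CLI
      - script: pip install databricks-cli
        displayName: 'Install Databricks CLI'

      # Stage only the files to deploy, then publish them as a build artifact
      - task: CopyFiles@2
        inputs:
          SourceFolder: '$(Build.SourcesDirectory)'
          Contents: '**/*.py'
          TargetFolder: '$(Build.ArtifactStagingDirectory)'

      - task: PublishBuildArtifacts@1
        inputs:
          PathtoPublish: '$(Build.ArtifactStagingDirectory)'
          ArtifactName: 'release-artifact'

      # Remove the old 'release' folder in the workspace and import the new files
      - script: |
          databricks workspace delete --recursive /release || true
          databricks workspace import_dir "$(Build.ArtifactStagingDirectory)" /release --overwrite
        displayName: 'Deploy to Databricks release folder'
        env:
          DATABRICKS_HOST: $(databricks-host)     # workspace URL from the variable group (assumed name)
          DATABRICKS_TOKEN: $(databricks-token)   # personal access token secret (assumed name)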

Automating Deployments

  • On each push to the main branch, the pipeline updates the 'release' folder in Databricks with the latest code.
  • Only necessary files (e.g., .py files) are included, using an artifact ignore file (see the sketch after this list).
  • The contents of the updated 'release' folder can then be used for job scheduling in Databricks.
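
The lecture's ignore-file approach maps to Azure DevOps' .artifactignore file, which uses .gitignore-style patterns and is honored when publishing pipeline artifacts (e.g., the publish keyword or the PublishPipelineArtifact task), provided the file sits in the folder being published. The sketch below keeps only Python files; the exact patterns are an assumption and should be checked against your folder layout.

    # .artifactignore -- placed in the folder being published (illustrative)
    **/*
    !**/*.py

Note that .artifactignore applies to pipeline artifacts; with the classic PublishBuildArtifacts task used in the sketch above, a CopyFiles filter (Contents: '**/*.py') serves the same purpose.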

Error Handling & Best Practices

  • If errors occur while pushing, check the Databricks workspace admin settings: an allow list of authorized Git repositories may be blocking the push.
  • Always pull the latest code before pushing changes to avoid conflicts.
  • Regularly clean up branches after merging to maintain a tidy repository.

Key Terms & Definitions

  • Databricks — A cloud-based data analytics and engineering platform.
  • Azure DevOps — Microsoft's suite for development collaboration, version control, and CI/CD.
  • Repo (Repository) — A storage location for code and versioned files.
  • Artifact — A bundle of code/files generated during pipeline execution for deployment.
  • Pipeline (YAML) — An automated workflow defined as code to build, test, and deploy changes.
  • Pull Request (PR) — A request to merge code from one branch into another, often reviewed by peers.

Action Items / Next Steps

  • Set up Azure DevOps and Databricks accounts if not already done.
  • Integrate Databricks with your Azure DevOps repository.
  • Create and configure your YAML CI/CD pipeline with proper variables and triggers.
  • Test code commits and monitor automated deployments to Databricks.