Overview
This lecture covers how to parameterize and productionalize Databricks jobs using Databricks Utilities (dbutils) widgets, focusing on writing data to Snowflake with dynamic parameters.
Parameterizing Databricks Notebooks
- Use dbutils.widgets to add interactive input widgets, such as dropdowns or text boxes, to Databricks notebooks.
- Widgets allow users to specify parameters like database, table, and schema names at runtime, making notebooks dynamic.
- For example, dbutils.widgets.dropdown("database", "test_database", ["test_database", "prod_database"]) creates a dropdown with a default value and a configurable list of options.
- Retrieve widget values in notebook code using dbutils.widgets.get("database").
- Reference widget values instead of hardcoding them; this keeps notebooks reusable and easy to modify (see the sketch below).
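A minimal sketch of this pattern, assuming illustrative widget names ("table", "schema") beyond the "database" example above:

```python
# Create a dropdown with a default value and a fixed set of choices
dbutils.widgets.dropdown("database", "test_database", ["test_database", "prod_database"])

# Free-form text widgets for table and schema names (names are illustrative)
dbutils.widgets.text("table", "flight_delays")
dbutils.widgets.text("schema", "public")

# Read the current widget values instead of hardcoding them
database = dbutils.widgets.get("database")
table = dbutils.widgets.get("table")
schema = dbutils.widgets.get("schema")

print(f"Target: {database}.{schema}.{table}")
```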
Productionalizing Jobs
- Transform an interactive notebook into a reusable production job by parameterizing it with widgets.
- In the Databricks Jobs UI, you can pass widget values as job parameters.
- Job parameters set in the UI take precedence over the widgets' default values in the notebook (see the sketch after this list).
- This approach enables running the same notebook with different inputs (e.g., different tables or databases) without changing code.
- Production jobs typically run on job clusters; interactive clusters can be used for testing.
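To illustrate how parameters reach the notebook, here is a hedged sketch that triggers a run through the Jobs API run-now endpoint; the workspace host, token, and job ID are placeholders, and the notebook_params keys must match the widget names defined in the notebook:

```python
import requests

# Trigger a parameterized run of an existing job (placeholders throughout)
resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 123,  # placeholder job ID
        "notebook_params": {
            "database": "prod_database",  # overrides the notebook's widget default
            "table": "flight_delays",
        },
    },
)
resp.raise_for_status()
print(resp.json())  # includes the run_id of the triggered run
```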
Workflow Example
- Load and transform a Databricks dataset using PySpark.
- Filter, aggregate, and apply window functions to analyze top delayed airlines.
- Write results to a Snowflake table using parameters for table/database/schema/warehouse.
- Use mode='overwrite' to replace existing data in the target table (see the sketch below).
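A sketch of the end-to-end workflow under stated assumptions: the dataset path, column names, Snowflake account URL, "user"/"warehouse" widgets, and a secret scope for the password are all illustrative, and the Snowflake Spark connector must be installed on the cluster:

```python
from pyspark.sql import functions as F, Window

# Load a sample Databricks dataset (path and column names assumed)
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/databricks-datasets/airlines/part-00000"))

# Filter to delayed flights, then average departure delay per carrier
delays = (df.filter(F.col("DepDelay") > 0)
            .groupBy("UniqueCarrier")
            .agg(F.avg("DepDelay").alias("avg_delay")))

# Window function: rank carriers by average delay and keep the top 10
w = Window.orderBy(F.col("avg_delay").desc())
top_delayed = (delays.withColumn("rank", F.rank().over(w))
                     .filter(F.col("rank") <= 10))

# Snowflake connection options driven by widget values
sf_options = {
    "sfUrl": "<account>.snowflakecomputing.com",        # placeholder
    "sfUser": dbutils.widgets.get("user"),              # hypothetical widget
    "sfPassword": dbutils.secrets.get("snowflake", "password"),  # assumed secret scope
    "sfDatabase": dbutils.widgets.get("database"),
    "sfSchema": dbutils.widgets.get("schema"),
    "sfWarehouse": dbutils.widgets.get("warehouse"),    # hypothetical widget
}

# mode("overwrite") replaces any existing data in the target table
(top_delayed.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", dbutils.widgets.get("table"))
    .mode("overwrite")
    .save())
```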
Key Terms & Definitions
- Databricks Utilities (dbutils) — Helper functions in Databricks for tasks like creating widgets and accessing parameters.
- Widget — UI input element (dropdown, text) used to parameterize notebook execution.
- Parameterization — Process of making notebooks dynamic by using variables instead of hardcoded values.
- Productionalize — Converting a notebook for repeatable, automated execution in production settings.
- Job Cluster — A cluster that is created and terminated automatically for a job run.
Action Items / Next Steps
- Practice creating widgets and retrieving their values in a Databricks notebook.
- Update existing notebooks to replace hardcoded values with widgets.
- Set up and test a Databricks job with parameters in the UI.
- Ensure the job overwrites data only when intended by reviewing the mode option.