Azure Data Factory Overview

Jul 22, 2025

Overview

This lecture provides a comprehensive, hands-on guide to mastering Azure Data Factory (ADF), an essential cloud ETL/ELT tool for data engineers. It covers setup, core concepts, real-world pipelines, triggers, transformations, and best practices for building automated, production-ready data workflows.

Introduction to Azure Data Factory

  • Azure Data Factory (ADF) is a cloud ETL/ELT service for extracting, transforming, and loading data across diverse sources and destinations.
  • ADF is foundational for Azure-based data engineering; the same pipeline model also appears in Synapse Analytics and Microsoft Fabric.
  • Learning ADF is crucial due to industry demand and its centrality in data migration, orchestration, and automation.

Setting Up Azure Resources

  • Prerequisites: a PC or laptop, a stable internet connection, and a free Azure account (which comes with trial credit).
  • Resources needed: Resource Group (container for resources), Storage Account (for data lake).
  • A Data Lake is configured within a Storage Account by enabling the hierarchical namespace, which provides true folders and subfolders (a minimal template sketch follows this list).
  • Understand redundancy options: LRS (locally redundant, cheapest, copies within a single datacenter), ZRS (zone-redundant, copies across availability zones in a region), GRS (geo-redundant, replicated to a secondary region), and GZRS (geo-zone-redundant, combining ZRS and GRS).
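
For orientation, a Data Lake-capable storage account can be described in an ARM template roughly as follows; the account name, location, and SKU are illustrative placeholders, not values from the lecture:

    {
      "type": "Microsoft.Storage/storageAccounts",
      "apiVersion": "2023-01-01",
      "name": "mydatalakestorage",
      "location": "eastus",
      "kind": "StorageV2",
      "sku": { "name": "Standard_LRS" },
      "properties": { "isHnsEnabled": true }
    }

Here isHnsEnabled turns on the hierarchical namespace, and swapping the SKU name (Standard_ZRS, Standard_GRS, Standard_GZRS) selects a different redundancy option.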

Key Concepts: Linked Services and Datasets

  • Linked Service: Stores the connection information for a data source or destination (e.g., SQL Database, Blob Storage, a REST API).
  • Dataset: A named reference to the specific data (a file, folder, table, or view) reached through a linked service, including its format/schema (a JSON sketch of both follows this list).
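
As a rough sketch (the names ls_adls and ds_source_csv are invented for illustration), a linked service holding the connection to a Data Lake looks approximately like this in ADF's JSON view:

    {
      "name": "ls_adls",
      "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
          "url": "https://mydatalakestorage.dfs.core.windows.net",
          "accountKey": { "type": "SecureString", "value": "<storage-account-key>" }
        }
      }
    }

and a dataset that pins down one CSV file reachable through that connection:

    {
      "name": "ds_source_csv",
      "properties": {
        "linkedServiceName": { "referenceName": "ls_adls", "type": "LinkedServiceReference" },
        "type": "DelimitedText",
        "typeProperties": {
          "location": {
            "type": "AzureBlobFSLocation",
            "fileSystem": "source",
            "fileName": "data.csv"
          },
          "columnDelimiter": ",",
          "firstRowAsHeader": true
        }
      }
    }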

Core Activities and Pipelines

  • Copy Activity: Copies data from a source to a destination; almost all data movement in ADF relies on it (a minimal JSON sketch follows this list).
  • Pipelines organize and orchestrate sets of activities.
  • Supports moving data from Azure Data Lake, APIs, HTTP sources, and more.
  • Dataset parameterization allows dynamic file selection during iteration.
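
A minimal Copy activity wiring a source dataset to a destination dataset might look like the following inside a pipeline's activities array; the dataset names reuse the hypothetical ones sketched earlier:

    {
      "name": "CopyRawToDestination",
      "type": "Copy",
      "inputs": [ { "referenceName": "ds_source_csv", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "ds_destination_csv", "type": "DatasetReference" } ],
      "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "DelimitedTextSink" }
      }
    }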

Real-World Pipeline Scenarios

  • Build pipelines to move data between containers and from APIs/GitHub directly into Data Lake.
  • Use Get Metadata activity to retrieve folder contents and iterate files.
  • Implement ForEach and If Condition activities to process only files matching specific patterns (e.g., those starting with 'fact').
  • Parameterized datasets enable dynamic data processing within loops; the sketch after this list combines Get Metadata, ForEach, If Condition, and a parameterized Copy.
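
Putting the pattern together, a sketch of a pipeline that lists a folder, loops over the results, and copies only files whose names start with 'fact' could look like the following (all names are illustrative):

    {
      "name": "pl_process_fact_files",
      "properties": {
        "activities": [
          {
            "name": "GetFileList",
            "type": "GetMetadata",
            "typeProperties": {
              "dataset": { "referenceName": "ds_source_folder", "type": "DatasetReference" },
              "fieldList": [ "childItems" ]
            }
          },
          {
            "name": "ForEachFile",
            "type": "ForEach",
            "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
            "typeProperties": {
              "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
              "activities": [
                {
                  "name": "IfFactFile",
                  "type": "IfCondition",
                  "typeProperties": {
                    "expression": { "value": "@startswith(item().name, 'fact')", "type": "Expression" },
                    "ifTrueActivities": [
                      {
                        "name": "CopyFactFile",
                        "type": "Copy",
                        "inputs": [
                          {
                            "referenceName": "ds_dynamic_csv",
                            "type": "DatasetReference",
                            "parameters": {
                              "fileName": { "value": "@item().name", "type": "Expression" }
                            }
                          }
                        ],
                        "outputs": [ { "referenceName": "ds_destination_csv", "type": "DatasetReference" } ],
                        "typeProperties": {
                          "source": { "type": "DelimitedTextSource" },
                          "sink": { "type": "DelimitedTextSink" }
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }

This assumes ds_dynamic_csv is a dataset with a fileName string parameter used in its file path; passing @item().name to it is what makes the dataset dynamic inside the loop.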

Data Transformation with Data Flows

  • Data Flows provide GUI-based, code-free transformations executed on managed Spark clusters.
  • Common transformations: Select (columns), Filter (rows), Conditional Split, Derived Column, and Aggregate (group-by plus aggregate expressions).
  • The Sink transformation writes the transformed data back to Azure Data Lake or another destination (a sketch of running a data flow from a pipeline follows this list).
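
From a pipeline, a finished data flow is run by the Data Flow activity. A minimal sketch, assuming a data flow named df_clean_and_aggregate; treat the compute sizing here as illustrative:

    {
      "name": "RunTransformation",
      "type": "ExecuteDataFlow",
      "typeProperties": {
        "dataflow": { "referenceName": "df_clean_and_aggregate", "type": "DataFlowReference" },
        "compute": { "computeType": "General", "coreCount": 8 }
      }
    }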

Automation with Triggers

  • Schedule Trigger: Runs pipelines at preset clock intervals; it cannot be backdated to cover past periods.
  • Tumbling Window Trigger: Runs over fixed, non-overlapping time windows and supports historical intervals and backfilling.
  • Storage Events Trigger: Fires a pipeline automatically when a file is uploaded (a blob is created) under a specified path (a JSON sketch follows this list).
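
A storage events trigger is defined roughly as below; the container path, pipeline name, and the scope placeholders (<subscription-id>, <resource-group>, <account>) are illustrative:

    {
      "name": "tr_on_upload",
      "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
          "blobPathBeginsWith": "/source/blobs/",
          "events": [ "Microsoft.Storage.BlobCreated" ],
          "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<account>"
        },
        "pipelines": [
          {
            "pipelineReference": { "referenceName": "pl_process_fact_files", "type": "PipelineReference" }
          }
        ]
      }
    }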

Advanced Orchestration and Variables

  • Use Set Variable activity to store dynamic values (e.g., file lists, IDs) between steps.
  • Parent pipelines can execute multiple child pipelines using the Execute Pipeline activity.
  • Integrating all of these components yields a fully automated, event-driven ETL solution; a parent-pipeline sketch follows this list.
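
A minimal parent-pipeline sketch combining the two activities (the variable, its value, and the child pipeline name are invented for illustration):

    {
      "name": "pl_parent",
      "properties": {
        "variables": { "runId": { "type": "String" } },
        "activities": [
          {
            "name": "StoreRunId",
            "type": "SetVariable",
            "typeProperties": {
              "variableName": "runId",
              "value": { "value": "@pipeline().RunId", "type": "Expression" }
            }
          },
          {
            "name": "RunChildPipeline",
            "type": "ExecutePipeline",
            "dependsOn": [ { "activity": "StoreRunId", "dependencyConditions": [ "Succeeded" ] } ],
            "typeProperties": {
              "pipeline": { "referenceName": "pl_process_fact_files", "type": "PipelineReference" },
              "waitOnCompletion": true
            }
          }
        ]
      }
    }

Setting waitOnCompletion to true makes the parent block until the child finishes, which is what lets a parent sequence several child pipelines deterministically.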

Key Terms & Definitions

  • ETL/ELT — Extract, Transform, Load and Extract, Load, Transform: two orderings of the same data integration process.
  • Resource Group — Logical container in Azure for related resources.
  • Storage Account — Azure service to store blobs (files), tables, and more.
  • Blob Storage — Object storage for unstructured data in Azure.
  • Data Lake — Storage with hierarchical namespace for big data analytics.
  • Linked Service — Connection to external data source/destination.
  • Dataset — Data structure representing data to move or transform.
  • Pipeline — Set of activities in ADF to perform a workflow.
  • Copy Activity — ADF activity to move data from source to destination.
  • Get Metadata Activity — Retrieves information (e.g., file names) from a dataset.
  • ForEach Activity — Loops over a collection of items in a pipeline.
  • If Condition Activity — Branches pipeline logic based on conditions.
  • Data Flow — Visually designed, scalable transform logic in ADF.
  • Trigger — Mechanism to automate pipeline execution.
  • Set Variable Activity — Stores and passes values between activities.
  • Execute Pipeline Activity — Runs one pipeline from another.
  • Sink — The destination in a Copy activity or Data Flow.

Action Items / Next Steps

  • Create a free Azure account and set up the required Resource Group and Storage Account.
  • Practice building pipelines with copy, metadata, and iteration/condition activities.
  • Experiment with Data Flows to perform transformations.
  • Set up schedule and storage event triggers for your workflows.
  • Review and reimplement complex sections (e.g., file iteration, dynamic datasets) for mastery.
  • Consider watching additional videos on related topics (e.g., PySpark, Delta Lake) to further expand your skills.