
Next-Gen Data Integration with Pipeline Builder

May 18, 2025

Lecture Notes: Next-Generation Data Integration with Pipeline Builder

Introduction

  • Importance of data integration across sectors for future stability.
  • Transition to a non-fragmented data landscape.

Pipeline Builder Overview

  • Described as the largest advancement since Foundry's original data integration suite.
  • Key Features:
    • Democratizes data integration.
    • Maintains robustness and security.

Building a Data Pipeline

  • Objective: Create a pipeline from scratch.
  • Data Sources:
    • Tabular data of all sizes.
    • Streaming data (e.g. sensor, geospatial, IoT devices).
    • Unstructured/semi-structured formats (e.g. imagery, XML, PDFs).
  • Example Used: A supply chain disruption scenario that requires integrating regional suppliers.
    • Datasets: Supplier US West and Supplier Europe.
    • Tasks: Clean and union datasets, derive new columns.
    • Address schema mismatches by transforming data as needed.

Pipeline Workflow

  • Union Creation:
    • Resolve schema mismatches by renaming and dropping columns.
    • Immediate error feedback prevents unnecessary computations.
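Pipeline Builder performs these steps without code, but the logic of the union step can be sketched in plain Python to make it concrete. This is only an illustration under stated assumptions: the column names (`supplier_id`, `region`, `vat_note`, and so on) and the helper functions are hypothetical and are not taken from the lecture or from any Foundry API. The sketch renames and drops columns until both schemas match, and it checks the schemas before combining any rows, which mirrors the idea of failing early instead of computing on mismatched data.

```python
from typing import Dict, FrozenSet, Iterable, List

Row = Dict[str, object]


def rename_columns(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Return new rows with columns renamed according to `mapping`."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def drop_columns(rows: List[Row], to_drop: Iterable[str]) -> List[Row]:
    """Return new rows without the columns listed in `to_drop`."""
    dropped = set(to_drop)
    return [{key: value for key, value in row.items() if key not in dropped} for row in rows]


def schema_of(rows: List[Row]) -> FrozenSet[str]:
    """Column names of a dataset, assuming every row shares the first row's keys."""
    return frozenset(rows[0].keys()) if rows else frozenset()


def union(left: List[Row], right: List[Row]) -> List[Row]:
    """Union two datasets, refusing to run if their schemas differ."""
    left_schema, right_schema = schema_of(left), schema_of(right)
    if left_schema != right_schema:
        raise ValueError(
            "schema mismatch: "
            f"only in left={sorted(left_schema - right_schema)}, "
            f"only in right={sorted(right_schema - left_schema)}"
        )
    return left + right


# Hypothetical sample rows standing in for the two supplier datasets.
us_west = [{"supplier_id": "W1", "name": "Acme", "state": "CA"}]
europe = [{"id": "E1", "name": "Bolt", "country": "DE", "vat_note": "n/a"}]

# Reconcile the schemas by renaming and dropping columns, then union.
us_west_fixed = rename_columns(us_west, {"state": "region"})
europe_fixed = drop_columns(
    rename_columns(europe, {"id": "supplier_id", "country": "region"}),
    ["vat_note"],
)
combined = union(us_west_fixed, europe_fixed)
```

Calling `union(us_west, europe)` on the unreconciled inputs raises a `ValueError` that names the offending columns, so the mismatch surfaces before any rows are combined.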
  • Pipeline Output:
    • Supports diverse outputs (datasets, streams, export systems).
    • Immediate warnings when output expectations aren't met, which blocks deployment.
    • Example: the output is expected to have 35 columns.
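An output expectation of this kind can be thought of as a check that runs before deployment. The sketch below is a minimal Python illustration, not Pipeline Builder's actual mechanism: the function names and the generated column names are hypothetical, and only the figure of 35 columns comes from the notes.

```python
from typing import List

EXPECTED_COLUMN_COUNT = 35  # the expectation mentioned in the notes


def check_column_count(columns: List[str], expected: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if len(columns) != expected:
        problems.append(f"expected {expected} columns, found {len(columns)}")
    return problems


def deploy(columns: List[str], expected: int = EXPECTED_COLUMN_COUNT) -> str:
    """Refuse to deploy while any output expectation is unmet."""
    problems = check_column_count(columns, expected)
    if problems:
        raise RuntimeError("deployment blocked: " + "; ".join(problems))
    return "deployed"


# Hypothetical output schemas: one column short, and one that meets the target.
incomplete_output = [f"col_{i}" for i in range(34)]
complete_output = [f"col_{i}" for i in range(35)]
```

With these inputs, `deploy(complete_output)` succeeds, while `deploy(incomplete_output)` raises a `RuntimeError` that reports the missing column.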

Data Integration Functions

  • Create additional columns using built-in functions:
    • Encrypt supplier tax ID.
    • Concatenate address fields.
    • Calculate transaction spans.
    • Clean whitespace.
    • Create Boolean columns based on conditions.
    • Use window functions for group calculations.
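For readers who want to see what these derived columns amount to, here is a rough Python sketch of each one. Everything in it is an assumption made for illustration: the field names, the sample suppliers, and the 365-day threshold are hypothetical. SHA-256 hashing from the standard library is used as a stand-in for encryption, since hashing is one-way and is not the same thing as the encryption function the lecture mentions. The "window function" is approximated as a per-region total attached to every row.

```python
import hashlib
from collections import defaultdict
from datetime import date
from typing import Dict, List

Row = Dict[str, object]


def mask_tax_id(tax_id: str) -> str:
    """One-way SHA-256 hash, used here as a stand-in for real encryption."""
    return hashlib.sha256(tax_id.encode("utf-8")).hexdigest()


def clean_whitespace(text: str) -> str:
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(text.split())


def derive_columns(row: Row) -> Row:
    """Add the derived columns to a single supplier row."""
    out = dict(row)
    out["name"] = clean_whitespace(row["name"])
    out["tax_id_masked"] = mask_tax_id(row["tax_id"])
    del out["tax_id"]  # keep only the masked value
    out["full_address"] = ", ".join([row["street"], row["city"], row["country"]])
    out["transaction_span_days"] = (row["last_txn"] - row["first_txn"]).days
    # Boolean column from a condition; the 365-day threshold is hypothetical.
    out["is_long_term"] = out["transaction_span_days"] >= 365
    return out


def add_region_spend(rows: List[Row]) -> List[Row]:
    """Window-style calculation: total spend per region, attached to every row."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["spend"]
    return [{**row, "region_spend": totals[row["region"]]} for row in rows]


# Hypothetical suppliers; the tax IDs are made up.
suppliers = [
    {"name": "  Acme   Supply Co ", "tax_id": "00-0000001", "street": "1 Main St",
     "city": "Portland", "country": "US", "region": "US West",
     "first_txn": date(2023, 1, 1), "last_txn": date(2023, 1, 31), "spend": 100.0},
    {"name": "Bolt GmbH", "tax_id": "00-0000002", "street": "2 Hauptstrasse",
     "city": "Berlin", "country": "DE", "region": "Europe",
     "first_txn": date(2021, 1, 1), "last_txn": date(2023, 1, 1), "spend": 250.0},
    {"name": "Cascade Parts", "tax_id": "00-0000003", "street": "3 Pine Ave",
     "city": "Seattle", "country": "US", "region": "US West",
     "first_txn": date(2022, 6, 1), "last_txn": date(2022, 6, 2), "spend": 50.0},
]

enriched = add_region_spend([derive_columns(row) for row in suppliers])
```

Each row keeps its own values and also carries the total for its region, which is the defining trait of a window calculation as opposed to a grouped aggregate that collapses rows.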

User Collaboration and Version Control

  • Pipeline Builder supports collaboration among users with different expertise levels.
  • Technical users benefit from reduced complexity and stronger typing.
  • Business users have a no-code tool for integration tasks.
  • Pipeline managers get structured, easily managed pipelines with clear change logs.
  • Version control allows branching, sandboxing, and merging with conflict checks.
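The idea of merging branches with conflict checks can be illustrated with a small three-way merge. The sketch below is a generic Python illustration and makes no claim about how Pipeline Builder implements merging. It assumes, hypothetically, that a pipeline is a mapping from step name to a step description, and it flags a conflict only when both branches changed the same step in different ways.

```python
from typing import Dict, List, Tuple

Pipeline = Dict[str, str]  # hypothetical model: step name -> step description


def three_way_merge(
    base: Pipeline, ours: Pipeline, theirs: Pipeline
) -> Tuple[Pipeline, List[str]]:
    """Merge two branches against their common base.

    Returns the merged pipeline and the names of conflicting steps.
    A missing key is treated as None, so deletions merge like any other change.
    """
    merged: Pipeline = {}
    conflicts: List[str] = []
    for step in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(step), ours.get(step), theirs.get(step)
        if o == t:        # both branches agree (or neither changed the step)
            value = o
        elif o == b:      # only their branch changed the step
            value = t
        elif t == b:      # only our branch changed the step
            value = o
        else:             # both changed it, differently
            conflicts.append(step)
            continue
        if value is not None:
            merged[step] = value
    return merged, conflicts


# Hypothetical pipeline and two branches created from it.
base = {"clean": "trim whitespace", "union": "US West + Europe"}
sandbox = {**base, "state": "add state data"}       # adds a new step
main = {**base, "clean": "trim and lowercase"}      # edits an existing step

merged, conflicts = three_way_merge(base, sandbox, main)
```

Here the two branches touch different steps, so the merge is clean. If both had edited the `union` step differently, that step name would be returned in `conflicts` instead of being silently overwritten.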

Advanced Example and Deployment

  • Example: Healthcare systems pipeline templates for rapid deployment.
  • Example: Aviation industry real-time data integration.
  • Empowerment of citizen data engineers to build quality pipelines.
  • Sandbox creation for safe testing and error resolution.
    • Example: An "Add State Data" sandbox for integrating new datasets.
    • Address casting and schema matching issues.
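Casting and schema matching problems of this sort typically arise when a new dataset arrives with every value stored as text. The following Python sketch shows one way to think about the fix; the target schema, the column names, and the sample rows are hypothetical and do not come from the lecture. Each value is cast to the type the target schema expects, and a failed cast reports the exact column and value so the issue can be resolved in the sandbox before merging.

```python
from typing import Callable, Dict

# Hypothetical target schema: column name -> type to cast to.
TARGET_SCHEMA: Dict[str, Callable] = {
    "supplier_id": str,
    "state": str,
    "unit_count": int,
    "unit_price": float,
}


def cast_row(row: Dict[str, object], schema: Dict[str, Callable]) -> Dict[str, object]:
    """Cast a row to the target schema, keeping only the schema's columns."""
    missing = [column for column in schema if column not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    casted = {}
    for column, target_type in schema.items():
        try:
            casted[column] = target_type(row[column])
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"cannot cast column {column!r} value {row[column]!r} "
                f"to {target_type.__name__}"
            ) from exc
    return casted


# Hypothetical incoming rows where every value is a string.
good_row = {"supplier_id": "S1", "state": "OR", "unit_count": "12", "unit_price": "3.50"}
bad_row = {"supplier_id": "S2", "state": "WA", "unit_count": "12.5", "unit_price": "4.00"}

clean_row = cast_row(good_row, TARGET_SCHEMA)
```

Passing `bad_row` raises a `ValueError` because `"12.5"` is not a valid integer, which is the kind of mismatch a sandbox is meant to catch before the change reaches the main pipeline.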

Conclusion

  • Pipeline Builder combines no-code fluidity with robust Foundry fundamentals.
  • Designed for scalable multi-user collaboration and data integration.
  • Modular architecture supports diverse enterprise systems (data lakes, warehouses, etc.).

Final Thoughts

  • There's more to explore, including streaming workflows and ontology integration.
  • Foundry aims to streamline enterprise data systems with future updates.

Note: This was a brief overview of Foundry's next-gen Pipeline Builder, with more in-depth features to come.