
Next-Gen Data Integration with Pipeline Builder

May 18, 2025

Lecture Notes: Next-Generation Data Integration with Pipeline Builder

Introduction

Importance of data integration across sectors for future stability.
Transition to a non-fragmented data landscape.

Pipeline Builder Overview

Described as the largest advancement since Foundry's original data integration suite.
Key Features:
- Democratizes data integration.
- Maintains robustness and security.

Building a Data Pipeline

Objective: Create a pipeline from scratch.
Data Sources:
- Tabular data of all sizes.
- Streaming data (e.g. sensor, geospatial, IoT devices).
- Unstructured/semi-structured formats (e.g. imagery, XML, PDFs).
Example Used: A supply chain disruption scenario that requires integrating regional suppliers.
- Datasets: Supplier US West and Supplier Europe.
- Tasks: Clean and union datasets, derive new columns.
- Address schema mismatches by transforming data as needed.

Pipeline Workflow

Union Creation:
- Resolve schema mismatches by renaming and dropping columns.
- Immediate error feedback prevents unnecessary computations.
Pipeline Output:
- Supports diverse outputs (datasets, streams, export systems).
- Immediate warnings when the output does not meet its target, which blocks deployment until the issue is fixed.
- Example: the output is expected to have 35 columns.
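The workflow above (align schemas by renaming and dropping columns, union, then check the output against an expectation) can be sketched in plain Python. This is only an illustration of the logic, not Pipeline Builder's implementation or API: the helper names `align`, `union`, and `check_output`, the sample rows, and the column names are all hypothetical, and the expected width is 3 here rather than the 35 columns from the lecture.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]


def align(rows: Iterable[Row], rename: Dict[str, str], drop: Set[str]) -> List[Row]:
    """Drop unwanted columns, then rename the rest to the target names."""
    return [
        {rename.get(col, col): value for col, value in row.items() if col not in drop}
        for row in rows
    ]


def union(*datasets: List[Row]) -> List[Row]:
    """Concatenate datasets, raising as soon as a row's columns differ."""
    expected = None
    combined: List[Row] = []
    for index, rows in enumerate(datasets):
        for row in rows:
            columns = set(row)
            if expected is None:
                expected = columns
            elif columns != expected:
                diff = sorted(columns ^ expected)
                raise ValueError(f"schema mismatch in dataset {index}: {diff}")
            combined.append(row)
    return combined


def check_output(rows: List[Row], expected_column_count: int) -> None:
    """Raise if any output row does not have the expected number of columns."""
    for row in rows:
        if len(row) != expected_column_count:
            raise ValueError(
                f"expected {expected_column_count} columns, found {len(row)}"
            )


# Hypothetical sample data standing in for the two supplier datasets.
supplier_us_west = [{"supplier_id": 1, "name": "Acme", "region": "US West"}]
supplier_europe = [{"id": 2, "supplier_name": "Bolt", "region": "Europe", "fax": "n/a"}]

europe_aligned = align(
    supplier_europe,
    rename={"id": "supplier_id", "supplier_name": "name"},
    drop={"fax"},
)
combined = union(supplier_us_west, europe_aligned)
check_output(combined, expected_column_count=3)
```

Calling `union(supplier_us_west, supplier_europe)` without the alignment step raises a `ValueError` immediately, which mirrors the idea of surfacing schema errors before any expensive computation runs.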

Data Integration Functions

Create additional columns using built-in functions:
- Encrypt supplier tax ID.
- Concatenate address fields.
- Calculate transaction spans.
- Clean whitespace.
- Create Boolean columns based on conditions.
- Use window functions for group calculations.
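As a rough illustration of the derived columns listed above, here is a plain-Python sketch. It does not use Pipeline Builder's built-in functions, and several choices are assumptions: the sample rows and column names are invented, a SHA-256 hash stands in for encryption (hashing is one-way and is not the same as encrypting), "transaction span" is read as the number of days between a supplier's first and last transaction, and the high-spend threshold of 1000 is arbitrary.

```python
import hashlib
from collections import defaultdict
from datetime import date

# Hypothetical supplier rows; every column name here is an assumption.
suppliers = [
    {"name": "  Acme Corp ", "tax_id": "12-3456789", "street": "1 Main St",
     "city": "Reno", "region": "US West",
     "first_txn": date(2024, 1, 10), "last_txn": date(2024, 3, 10), "spend": 1200.0},
    {"name": "Bolt GmbH", "tax_id": "DE999", "street": "2 Ring Str",
     "city": "Bonn", "region": "Europe",
     "first_txn": date(2024, 2, 1), "last_txn": date(2024, 2, 15), "spend": 300.0},
    {"name": "Cedar LLC ", "tax_id": "98-7654321", "street": "3 Oak Ave",
     "city": "Boise", "region": "US West",
     "first_txn": date(2024, 1, 1), "last_txn": date(2024, 1, 31), "spend": 800.0},
]


def derive(row: dict) -> dict:
    """Return a copy of the row with the derived columns added."""
    out = dict(row)
    out["name"] = row["name"].strip()                    # clean whitespace
    # Stand-in for encryption: a one-way SHA-256 digest of the tax ID.
    out["tax_id_hash"] = hashlib.sha256(row["tax_id"].encode("utf-8")).hexdigest()
    del out["tax_id"]                                    # do not keep the raw value
    out["address"] = f'{row["street"]}, {row["city"]}'   # concatenate address fields
    out["txn_span_days"] = (row["last_txn"] - row["first_txn"]).days
    out["is_high_spend"] = row["spend"] >= 1000.0        # Boolean column from a condition
    return out


derived = [derive(row) for row in suppliers]

# Window-style calculation: attach each region's total spend to every row in it.
region_totals = defaultdict(float)
for row in derived:
    region_totals[row["region"]] += row["spend"]
for row in derived:
    row["region_spend_total"] = region_totals[row["region"]]
```

The last step keeps one row per supplier while adding a group-level value, which is the defining trait of a window function as opposed to a group-by aggregation.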

User Collaboration and Version Control

Pipeline Builder supports collaboration among users with different expertise levels.
Technical users benefit from reduced complexity and stronger typing.
Business users have a no-code tool for integration tasks.
Pipeline managers get structured, easily managed pipelines with clear change logs.
Version control allows branching, sandboxing, and merging with conflict checks.
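To make "merging with conflict checks" concrete, the sketch below models a pipeline as a dictionary of named steps and performs a simple three-way merge. This is a toy model under stated assumptions, not a description of how Foundry implements branching: the `merge` function, the step names, and the string step definitions are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pipeline = Dict[str, str]  # step name -> step definition (hypothetical model)


def merge(base: Pipeline, main: Pipeline, branch: Pipeline) -> Tuple[Pipeline, List[str]]:
    """Three-way merge: take a change made on only one side, flag the rest."""
    merged: Pipeline = {}
    conflicts: List[str] = []
    for step in sorted(set(base) | set(main) | set(branch)):
        b: Optional[str] = base.get(step)
        m: Optional[str] = main.get(step)
        s: Optional[str] = branch.get(step)
        if m == s:
            value = m            # both sides agree (or both left it untouched)
        elif m == b:
            value = s            # only the branch changed this step
        elif s == b:
            value = m            # only main changed this step
        else:
            conflicts.append(step)  # both sides changed it differently
            continue
        if value is not None:    # None means the step is absent or was deleted
            merged[step] = value
    return merged, conflicts


base = {"clean": "trim"}
main = {"clean": "trim", "union": "v1"}   # main added a union step
branch = {"clean": "trim+lower"}          # sandbox branch changed the clean step

merged, conflicts = merge(base, main, branch)

# A second branch that edits the same step differently produces a conflict.
_, conflicting = merge(base, {"clean": "trim+upper"}, branch)
```

In the first merge both changes are kept because they touch different steps; in the second, the `clean` step is reported as a conflict instead of being silently overwritten.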

Advanced Example and Deployment

Example: Healthcare systems pipeline templates for rapid deployment.
Example: Aviation industry real-time data integration.
Empowerment of citizen data engineers to build quality pipelines.
Sandbox creation for safe testing and error resolution.
- Example: an "Add State Data" sandbox for integrating new datasets.
- Address casting and schema matching issues.
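The casting and schema-matching issues mentioned above can be illustrated with a small validation pass that one might run in a sandbox before merging. This is a sketch under assumptions, not Pipeline Builder behavior: `TARGET_SCHEMA`, `cast_rows`, and the sample state data are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical target schema: column name -> Python type to cast to.
TARGET_SCHEMA = {"supplier_id": int, "name": str, "spend": float}


def cast_rows(rows: List[Dict[str, object]], schema: Dict[str, type]) -> Tuple[List[dict], List[str]]:
    """Cast each row to the schema; collect errors instead of failing midway."""
    casted: List[dict] = []
    errors: List[str] = []
    for index, row in enumerate(rows):
        missing = sorted(set(schema) - set(row))
        if missing:
            errors.append(f"row {index}: missing columns {missing}")
            continue
        try:
            # Columns not named in the schema are dropped by this comprehension.
            casted.append({col: typ(row[col]) for col, typ in schema.items()})
        except (TypeError, ValueError) as exc:
            errors.append(f"row {index}: {exc}")
    return casted, errors


# Incoming data often arrives as strings and needs casting before a union.
state_data = [
    {"supplier_id": "7", "name": "Delta", "spend": "99.5", "notes": "new"},
    {"supplier_id": "x", "name": "Bad ID", "spend": "1"},
    {"name": "No ID", "spend": "2"},
]

clean_rows, problems = cast_rows(state_data, TARGET_SCHEMA)
```

Collecting the problems in a list, rather than stopping at the first bad row, gives a complete picture of what must be fixed in the sandbox before the new dataset is merged into the main pipeline.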

Conclusion

Pipeline Builder combines no-code fluidity with robust Foundry fundamentals.
Designed for scalable multi-user collaboration and data integration.
Modular architecture supports diverse enterprise systems (data lakes, warehouses, etc.).

Final Thoughts

There's more to explore, including streaming workflows and ontology integration.
Foundry aims to streamline enterprise data systems with future updates.

Note: This was a brief overview of Foundry's next-gen Pipeline Builder, with more in-depth features to come.
