Lecture Notes: Next-Generation Data Integration with Pipeline Builder
Introduction
- Importance of data integration across sectors for future stability.
- Transition to a non-fragmented data landscape.
Pipeline Builder Overview
- Described as the largest advancement since Foundry's original data integration suite.
- Key Features:
  - Democratizes data integration.
  - Maintains robustness and security.
Building a Data Pipeline
- Objective: Create a pipeline from scratch.
- Data Sources:
  - Tabular data of all sizes.
  - Streaming data (e.g. sensor, geospatial, IoT devices).
  - Unstructured/semi-structured formats (e.g. imagery, XML, PDFs).
- Example Used: A supply chain disruption that requires integrating regional suppliers.
  - Datasets: Supplier US West and Supplier Europe.
  - Tasks: Clean and union the datasets, derive new columns.
  - Address schema mismatches by transforming data as needed.
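The clean-and-union task above can be sketched in plain Python. This is only an illustration of the logic, not Pipeline Builder's interface: the helper functions, column names, and sample rows are all hypothetical. The sketch renames and drops columns until both supplier tables share a schema, and the union fails fast if they still differ.

```python
# Illustrative only: plain Python, not Pipeline Builder's interface.
# All column names and sample rows below are hypothetical.

def rename_columns(rows, mapping):
    """Return new rows with columns renamed according to mapping."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

def drop_columns(rows, to_drop):
    """Return new rows without the listed columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

def union(left, right):
    """Union two row lists; raise before combining if the schemas differ.

    For brevity the schema is taken from the first row of each input.
    """
    if left and right:
        left_cols, right_cols = set(left[0]), set(right[0])
        if left_cols != right_cols:
            raise ValueError(
                f"schema mismatch: only in left {sorted(left_cols - right_cols)}, "
                f"only in right {sorted(right_cols - left_cols)}"
            )
    return left + right

supplier_us_west = [{"supplier_id": "US-001", "supplier_name": "Acme Metals"}]
supplier_europe = [{"id": "EU-104", "supplier_name": "Nordic Steel", "fax": "n/a"}]

# Align the Europe schema with US West, then union.
aligned_europe = drop_columns(
    rename_columns(supplier_europe, {"id": "supplier_id"}), {"fax"}
)
combined = union(supplier_us_west, aligned_europe)
```

Calling `union(supplier_us_west, supplier_europe)` on the unaligned inputs raises `ValueError` and names the offending columns, which mirrors the idea of surfacing a mismatch before any work is done.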
Pipeline Workflow
- Union Creation:
  - Resolve schema mismatches by renaming and dropping columns.
  - Immediate error feedback prevents unnecessary computations.
- Pipeline Output:
  - Supports diverse outputs (datasets, streams, export systems).
  - Immediate warnings if the output does not meet its target, which blocks deployment.
  - Example: the output is expected to have 35 columns.
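As a rough sketch of the output check described above, the following plain Python compares an output schema against an expected column count and refuses to deploy on a mismatch. The function names and the generated column names are assumptions made for illustration; Pipeline Builder performs this kind of check in its own interface.

```python
# Illustrative only: validate_output and deploy are hypothetical helpers.

def validate_output(columns, expected_count):
    """Return a list of warnings; an empty list means the target is met."""
    if len(columns) != expected_count:
        return [f"expected {expected_count} columns, found {len(columns)}"]
    return []

def deploy(columns, expected_count):
    """Deploy only when validation produces no warnings."""
    warnings = validate_output(columns, expected_count)
    if warnings:
        raise RuntimeError("deployment blocked: " + "; ".join(warnings))
    return "deployed"

# Hypothetical output schema with the 35 columns mentioned in the notes.
output_columns = [f"col_{i}" for i in range(35)]
status = deploy(output_columns, expected_count=35)
```

Dropping one column from `output_columns` makes `deploy` raise instead of returning, so a schema that misses its target never reaches deployment.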
Data Integration Functions
- Create additional columns using built-in functions:
  - Encrypt supplier tax ID.
  - Concatenate address fields.
  - Calculate transaction spans.
  - Clean whitespace.
  - Create Boolean columns based on conditions.
  - Use window functions for group calculations.
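Several of the derived columns listed above can be illustrated with a small plain-Python sketch. The field names, the 75.0 threshold, and the sample rows are hypothetical, and the per-region total stands in for a window function over a group. Encryption of the tax ID is left out because it depends on key management that these notes do not describe.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; field names are assumptions for illustration.
transactions = [
    {"supplier": "  Acme Metals ", "street": "1 Main St", "city": "Reno",
     "first_txn": date(2023, 1, 1), "last_txn": date(2023, 1, 31),
     "region": "US West", "amount": 100.0},
    {"supplier": "Bolt Co", "street": "9 Oak Ave", "city": "Boise",
     "first_txn": date(2023, 3, 1), "last_txn": date(2023, 3, 11),
     "region": "US West", "amount": 50.0},
    {"supplier": "Nordic Steel", "street": "4 Kungsgatan", "city": "Stockholm",
     "first_txn": date(2023, 2, 1), "last_txn": date(2023, 2, 15),
     "region": "Europe", "amount": 70.0},
]

def derive(rows):
    """Return copies of rows with the derived columns added."""
    # Window-style aggregate: total amount per region, attached to each row.
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]

    derived_rows = []
    for row in rows:
        new = dict(row)
        new["supplier"] = row["supplier"].strip()                  # clean whitespace
        new["address"] = f'{row["street"]}, {row["city"]}'         # concatenate fields
        new["txn_span_days"] = (row["last_txn"] - row["first_txn"]).days
        new["is_large"] = row["amount"] >= 75.0                    # Boolean condition
        new["region_total"] = totals[row["region"]]                # group calculation
        derived_rows.append(new)
    return derived_rows

derived = derive(transactions)
```

Each input row is copied rather than mutated, so the source table stays available for comparison while the new columns are reviewed.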
User Collaboration and Version Control
- Pipeline Builder supports collaboration among users with different expertise levels.
- Technical users benefit from reduced complexity and stronger typing.
- Business users have a no-code tool for integration tasks.
- Pipeline managers get structured, easily managed pipelines with clear change logs.
- Version control allows branching, sandboxing, and merging with conflict checks.
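To make "merging with conflict checks" concrete, here is a minimal sketch of a three-way merge over pipeline steps. It assumes a pipeline can be modelled as a dictionary from step name to step configuration, which is a simplification chosen for illustration and not a description of how Foundry stores pipelines. A step conflicts when both the main line and the branch changed it in different ways relative to their common base.

```python
# Illustrative only: a generic three-way merge, not Foundry's merge logic.

def merge_branches(base, main, branch):
    """Merge step configs keyed by step name.

    Returns (merged, conflicts). None marks an absent step, so a step
    config of None is not supported in this sketch.
    """
    merged, conflicts = {}, []
    for step in sorted(set(base) | set(main) | set(branch)):
        b, m, s = base.get(step), main.get(step), branch.get(step)
        if m == s:
            result = m        # both sides agree (or both removed the step)
        elif m == b:
            result = s        # only the branch changed this step
        elif s == b:
            result = m        # only main changed this step
        else:
            conflicts.append(step)  # both changed it differently
            continue
        if result is not None:
            merged[step] = result
    return merged, conflicts

# Hypothetical step configurations.
base = {"clean": "trim", "union": "v1"}
main = {"clean": "trim", "union": "v2"}
sandbox = {"clean": "trim+lower", "union": "v1", "cast": "zip->string"}

merged, conflicts = merge_branches(base, main, sandbox)
```

Here the two sides touched different steps, so the merge is clean. If the sandbox had also changed `union` to a third value, `union` would appear in `conflicts` and would need manual resolution.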
Advanced Example and Deployment
- Example: Healthcare systems pipeline templates for rapid deployment.
- Example: Aviation industry real-time data integration.
- Empowers citizen data engineers to build quality pipelines.
- Sandbox creation for safe testing and error resolution.
- Example: Add State Data sandbox for integrating new datasets.
- Address casting and schema matching issues.
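The casting issues mentioned above can be sketched as a cast step that collects failures instead of stopping at the first one, which suits sandbox testing. The `cast_column` helper, the `supplier_count` column, and the sample rows are hypothetical and are not taken from the Add State Data example itself.

```python
# Illustrative only: cast_column is a hypothetical helper.

def cast_column(rows, column, caster):
    """Cast one column in every row.

    Returns (cast_rows, errors), where errors holds
    (row_index, column, message) for each value that failed to cast.
    """
    cast_rows, errors = [], []
    for index, row in enumerate(rows):
        try:
            cast_rows.append({**row, column: caster(row[column])})
        except (TypeError, ValueError) as exc:
            errors.append((index, column, str(exc)))
    return cast_rows, errors

# Hypothetical new dataset whose count column arrived as strings.
state_data = [
    {"state": "CA", "supplier_count": "12"},
    {"state": "NV", "supplier_count": "n/a"},
]

cast_rows, errors = cast_column(state_data, "supplier_count", int)
```

After this call `cast_rows` contains only the row that converted cleanly, and `errors` points at row index 1, so the bad value can be fixed in the sandbox before the new dataset is merged into the main pipeline.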
Conclusion
- Pipeline Builder combines no-code fluidity with robust Foundry fundamentals.
- Designed for scalable multi-user collaboration and data integration.
- Modular architecture supports diverse enterprise systems (data lakes, warehouses, etc.).
Final Thoughts
- There's more to explore, including streaming workflows and ontology integration.
- Foundry aims to streamline enterprise data systems with future updates.
Note: This was a brief overview of Foundry's next-gen Pipeline Builder, with more in-depth features to come.