Transcript for:
Next-Gen Data Integration with Pipeline Builder

Today, it's clear that being able to leverage the power of data is instrumental to our collective future across sectors. There's too much at stake in an ever-changing world to settle for fragmented or siloed data landscapes. Over the past year, we've been working on a next-generation approach to data integration. It's the biggest single step forward we've taken since the release of Foundry's original data integration suite. Pipeline Builder is our new data integration capability that delivers on democratizing data integration while maintaining robustness and security. In short, it allows users to build better pipelines. Let's start from the beginning and see how it's put into action.

We're going to work through building a simple pipeline from scratch. We can add any data on a Foundry instance to the pipeline. This data can include tabular data of all sizes; streaming data such as sensor recordings, geospatial data, and telemetry from IoT devices; or unstructured and semi-structured formats such as imagery data, XML files, and PDFs. Here we'll be working through a supply chain disruption scenario. We want to bring all our regional suppliers into one place to build a worldwide view we can rely on. We'll add a couple of datasets, Supplier US West and Supplier Europe. These datasets contain rows of suppliers with columns describing various properties of each supplier. We want to clean and union these two datasets together and derive additional columns. We'll also upload some XML files, which can be quickly parsed into a structured table and joined into our supplier datasets.

So we'll start by creating a union between Supplier US West and Supplier Europe to bring them into a single dataset. Upon creating the union, we're immediately given an error message that the schemas don't line up. We can fix this by adding a couple of transforms upstream: one transform renaming Supplier to Supplier ID so that the ID columns match up, and another dropping some columns that show up on only one side of the union and that we shouldn't need. Notice that the error message is immediately resolved. This immediate feedback saves us time while iterating and means we don't need to run any data computation to find issues.

Next, we can create a pipeline output, which will let us deploy the final outputs of our pipeline. Pipeline Builder's underlying architecture is designed to support all kinds of outputs: datasets, ontological objects, streams, time series, even exports to external systems. The aim is to take you end-to-end, from the raw data source through to the final destination of your data. Pipeline outputs are the target of the pipeline. If we no longer meet this target, the user is immediately warned and deployment is prevented, so that the user can determine whether it's an intentional change to the output contract or an accidental logic break or a problem with incoming data that needs resolving. Here we'll create a dataset pipeline output. This output will expect this exact set of 35 columns. Again, if the pipeline does not produce these 35 columns, we'll immediately flag the break and prevent deployment to stop breaks downstream. Pipeline Builder will produce these outputs while abstracting all of the implementation of logic execution and pipeline deployment behind the scenes.
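To make the union step and the output contract concrete in code, here's a minimal PySpark sketch of the same logic. Pipeline Builder configures all of this without code, so this is only an illustration; the file paths, the dropped column names, and the columns in the contract check are hypothetical stand-ins.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("supplier_union_sketch").getOrCreate()

# Hypothetical stand-ins for the Supplier US West and Supplier Europe datasets.
us_west = spark.read.parquet("supplier_us_west.parquet")
europe = spark.read.parquet("supplier_europe.parquet")

# Rename Supplier to Supplier ID so the ID columns match up, and drop
# columns that only show up on one side of the union (names are illustrative).
us_west = (
    us_west
    .withColumnRenamed("Supplier", "Supplier ID")
    .drop("Legacy Code", "Region Notes")
)

# unionByName fails immediately if the schemas still disagree, loosely
# mirroring the schema-level feedback Pipeline Builder gives without
# running any data computation.
suppliers = us_west.unionByName(europe)

# The pipeline output acts as a contract: if declared columns go missing,
# deployment is blocked. A crude stand-in for that check, over an
# illustrative subset of the expected columns:
expected = {"Supplier ID", "City", "Country"}
missing = expected - set(suppliers.columns)
assert not missing, f"Output contract broken, missing columns: {missing}"
```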
We use our union as a starting point for the pipeline output and add a few columns: an address string column; an active boolean column to mark whether the supplier is currently active; a transaction span integer column for the number of days between the oldest and most recent transactions; and a total suppliers integer column for the number of suppliers in the same city. As we continue to iterate, we'll be guided to fill out this target.

A core concept of Pipeline Builder is the separation of data computation and schema computation. This separation allows us to immediately evaluate the success of our integration. We don't need to wait to run the pipeline across potentially enormous or complex input data; we can check our progress towards filling out pipeline outputs. If bad data shows up tomorrow, like a missing column, we immediately know and prevent deployment entirely. Throughout building this pipeline, we'll be getting immediate feedback.

Now we have our supplier data in a single dataset. Let's derive additional columns using the powerful set of built-in data integration functions. As we derive new columns, notice how the pipeline output immediately identifies that these are now filled in. First, we can encrypt the supplier tax ID column using Foundry's built-in capability for encryption and selective revelation. Next, we can produce an address column by concatenating street, city, and zip code together. We can calculate the time from the oldest to the most recent transaction with the timestamp difference function. There's some whitespace in the branch column, so we can use the clean function to remove it. We can also produce a boolean column for whether the supplier is active by mapping values from our string column to booleans; a histogram of the column provides the list of values we need to map: True and Yes map to true, and everything else falls back to false. And finally, we can calculate the number of active suppliers per city with a window function.
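For readers who think in code, here's a hedged PySpark sketch of these derivations, continuing from the `suppliers` frame in the earlier sketch. The column names (Street, City, Zip, Oldest Transaction, Latest Transaction, Branch, Status) are assumptions for illustration, and the encryption step is specific to Foundry's encryption and selective revelation capability, so it is omitted here.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

derived = (
    suppliers
    # Address: concatenate street, city, and zip code.
    .withColumn("Address", F.concat_ws(", ", "Street", "City", "Zip"))
    # Transaction Span: days between the oldest and most recent transactions.
    .withColumn("Transaction Span",
                F.datediff(F.col("Latest Transaction"), F.col("Oldest Transaction")))
    # Remove stray whitespace from the branch column.
    .withColumn("Branch", F.trim(F.col("Branch")))
    # Map status strings to a boolean: True and Yes become true, else false.
    .withColumn("Active",
                F.when(F.col("Status").isin("True", "Yes"), True).otherwise(False))
    # Count active suppliers per city with a window function.
    .withColumn("Total Suppliers",
                F.sum(F.col("Active").cast("int")).over(Window.partitionBy("City")))
)
```

In the product these steps are configured interactively rather than written by hand, and the resulting schema is checked against the declared output columns before any data is computed.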
Up to this point, we've been a single user working on this pipeline, but Pipeline Builder allows for collaboration between users with different expertise. Technical users now have stronger typing, spend less time running builds, wrangling code, and managing libraries, and no longer have to manage the complexity of a code repo. Business users now have a tool where they can write data integration logic that manages performance for them and protects them from making mistakes; they don't have to go through the laborious process of writing code. For pipeline managers, pipelines are highly structured, their usage and outputs are well defined, they don't have to worry about persistent datasets, and they can see exactly what's changed between deployments. And all these users can work together.

This collaboration is made possible with Pipeline Builder's sophisticated version control. Users can branch pipeline logic to work on a sandbox. Once they're happy with their changes, they can review them and propose merging into main or another sandbox. Checks are run on the changes to flag any breaks to the pipeline outputs and to check for merge conflicts. If there are merge conflicts due to other edits being merged to main, the user is prompted to rebase their sandbox and handle those conflicts. Reviewers can then inspect the changes and validate them before merging them into the pipeline.

Suppose this pipeline is up and running. When making further changes, we'll want to work on a sandbox and propose merging changes into main rather than editing and pushing directly to production. So we'll create a sandbox named Add State Data. First, we can quickly fix our remaining errors by casting TransactionSpan and TotalSuppliers from long to integer. As we add these casts, the schema computes through the entire graph, from pipeline inputs through to the pipeline output, and we're immediately given feedback that the integration is now successful. Finally, we can pull in our third dataset with a join. This will let us replace addresses that are completely missing with a country code. After joining, we'll coalesce in the country codes to fill wherever addresses are null.

Now we can create a proposal to merge our changes into main. We're shown a comparison between Add State Data and main, where we can see additions, modifications, and deletions, and we can click into individual transforms to see the exact logic changes that were made. This looks good, so I'll create the proposal. When it's merged in, we'll see our changes on main. Finally, we can click deploy to deliver all pipeline outputs in the pipeline. Pipeline Builder will deploy all these outputs while abstracting all of the implementation of logic execution and output deployment behind the scenes.

Today we've shown a simple example, but to list a couple of exciting examples where this provides a massive step up: in complex healthcare systems, disparate data sources are integrated to construct pipelines which can be templated, and these templates can then be redeployed to specific facilities where last-mile work needs to happen in hours, not weeks or months. In the aviation industry, there's a need to continuously integrate real-time streaming data sources to make routing decisions and inform other critical operations. And across all industries, where the burden of integrating data has fallen only on data engineers or technical users, citizen pipeline builders can now construct production-quality data pipelines themselves.

Thanks, Cameron. This was just a sneak peek at the user experience for Foundry's next-generation Pipeline Builder. We wanted to synthesize the fluidity of no-code, instant feedback, and declarative configuration with the ironclad fundamentals of Foundry, such as backend extensibility, Git-style change management, and active, automated metadata for data security, lineage, and health. There's so much more to dive into, including a more detailed look at how streaming workflows, multi-user collaboration, and integration with the Ontology and other target substrates work seamlessly at scale. Moreover, with Foundry's modular architecture, it's designed to power the full range of enterprise systems: data lakes, warehouses, ERPs, operational data stores, edge synchronizations, and much more. Thanks for watching, and stay tuned for what's next.