Hey everyone, welcome back to this DP-700 exam preparation course. Today we're moving on to orchestration in Microsoft Fabric. There are a few different item types we can use for orchestration in Fabric, and the main one is data pipelines, which hopefully you're already familiar with. We're going to go through data pipelines and how they can be used for orchestration, then look at some orchestration patterns you might commonly want to implement in a data pipeline or a notebook. Towards the end of the session we'll look at using Fabric notebooks for orchestration, using the notebookutils library to orchestrate other notebooks, and we'll finish by looking at the various triggers available to us in Microsoft Fabric. So let's dive in.

Okay, let's start with some of the basics of data pipelines. In a data pipeline we have a number of different activities available to us, and this is one of the massive benefits of data pipelines: we can use them to orchestrate, meaning to trigger the execution of, lots of other things in Microsoft Fabric. In this first block we have a number of different experiences or Fabric items, like the notebook, the data warehouse, the KQL experience and Spark job definitions, and all of these can be triggered by a data pipeline. We also have semantic model refreshing down here, so if you've got a semantic model and you want to trigger the refresh programmatically based on one of the triggers we'll look at in the last section of this video, you can do that too. Another really useful tool in data pipelines is the Invoke pipeline activity, which essentially allows you to run another pipeline from the pipeline you're calling it from, and we'll look at how that can be useful for the various frameworks and architectures we could build with our pipelines.

But it's not just Fabric items we can trigger from a data pipeline; we can also orchestrate various other services that we might have in our organization. These are just a selection of them: we can trigger Azure Databricks notebook runs, and we can trigger Azure Functions, so if you have a lot of Azure Functions in your business that do some sort of ETL process, you can trigger them from a Fabric data pipeline, along with a lot of other things in Azure. We also have webhooks, so if you have any services you want to trigger via a webhook, that's also possible. The final family I'll mention up front is around notifying. We're going to do a whole section on error handling and monitoring, but I just want to call out that we have the Office 365 activity and the Teams activity. This is another real benefit, because we get control over when we fire off notifications to people in your team, or anyone in your business really, so when things go wrong (or when things go right) we can notify the right people.

One activity I didn't mention there was the Copy data activity, and I wanted to spend a little more time on it because it's arguably one of the most important activities, especially for data ingestion: it's one of the primary mechanisms, at least in a data pipeline, for getting data from various external services into Fabric.
It's also very multifunctional; we can use it in a variety of different ways. If you go into a data pipeline and create a Copy data activity, the first thing you'll be asked to do is create a connection to a source, and there are lots and lots of different sources you can select; this is just a selection of them here. You'll see there's a wide variety of primarily Azure services, but there are also some interesting ones I'd like to pinpoint. The SQL Server database connection is really useful when you have on-premises data, because you can connect via the on-premises data gateway. The Copy data activity is one of the two main ways to do this; the other is a Dataflow Gen2, which can also connect to on-premises data via the on-premises data gateway, so that's really important to know. The next one is the REST connection: if you have REST APIs you want to connect to and you want to perform some sort of GET or POST request against that API to extract some data, it even allows basic pagination. So if the API returns paginated results you can handle that; I say basic pagination because it can get a little messy if your API has somewhat unique pagination behaviour, but it is available for you there. If your REST API requests get more complex, then I'd definitely recommend moving into the notebook world for your REST API data extraction (there's a rough sketch of that just below), but it is available in a Copy data activity. Finally I'll just pinpoint the HTTP connection, which is for extracting things like website data or anything that's openly accessible on the internet.

So there are many different methods and connections we can create on the source side, but this is a Copy data activity: we're copying data from the source and loading it into a destination, and again there are various destinations we can choose with a Fabric data pipeline Copy data activity. If we're talking about file data, we can obviously load it into a Lakehouse Files area. Tabular data can be written into essentially any Fabric data store. Plus (and not many people are aware of this) you can also write to external data stores, like an Azure SQL database. So you can trigger these ETL processes from within Fabric, but that doesn't necessarily mean your destination has to be within Fabric.

Let's look at a very simple pipeline, and then we'll build up the complexity and talk about some of the key features. You might find that a single Copy data activity is a pipeline in itself; this is all you need if you just want to get some data from one location external to Fabric and load it somewhere in Fabric. You create a Copy data activity, set up a connection, and you have this relative URL section here. What you'll see is that I've currently hardcoded this value, which is not ideal, but we'll talk about how to improve that in the coming slides. You're also going to say that this is an HTTP connection, so it's going to return a file, and we specify what that file format is so Fabric knows what to do with it and how to parse the response. So that's what we're looking at here, at a very basic level.
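Going back to the REST point for a second, here's a rough sketch of what that kind of notebook-based REST extraction could look like in a Fabric notebook. Treat it as a starting point under assumptions, not a finished pattern: the endpoint URL, the page and pageSize query parameters, and the output path are all placeholders I've made up, and it assumes a default Lakehouse is attached to the notebook.

```python
# Minimal sketch: paginated GET requests from a Fabric notebook.
# The endpoint, paging parameters and output path are placeholders;
# swap in your own API details and Lakehouse Files path.
import json
import requests

BASE_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
PAGE_SIZE = 100

def fetch_all_pages(base_url: str, page_size: int = 100) -> list:
    """Keep requesting pages until the API returns an empty result set."""
    records, page = [], 1
    while True:
        resp = requests.get(
            base_url,
            params={"page": page, "pageSize": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:          # no more pages to fetch
            break
        records.extend(batch)
        page += 1
    return records

rows = fetch_all_pages(BASE_URL, PAGE_SIZE)

# Land the raw JSON in the Lakehouse Files area
# (path assumes a default Lakehouse is attached to the notebook).
with open("/lakehouse/default/Files/orders.json", "w") as f:
    json.dump(rows, f)
```

This is also exactly where you'd add anything the Copy data activity struggles with, such as cursor or token-based pagination, rate limiting and retries, which is why the notebook route scales better for awkward APIs.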
Now, if we zoom in on this Copy data activity a little bit more: most pipelines we create aren't really going to be single activities, right? That kind of defeats the object. The real benefit of pipelines comes from the fact that we can build directed acyclic graphs (DAGs), which essentially means we can connect various activities together with dependencies between them. So we can say: I want to start with this activity, and then if that's successful I want to do this second activity, and so we build up this logic in our pipelines. To implement that, you need to be aware of these four condition flags. When you're connecting activities in a data pipeline you'll need to select one of these four and then connect it to the next activity in your pipeline, and whichever of the four conditions you select will determine how those activities are related.

The first one is On skip: if you declare an On skip condition, the next activity will run only if the previous activity is skipped. So if this Copy data activity doesn't run for whatever reason (perhaps you have some other logic where execution goes through a different activity instead of this one, or you're in a ForEach loop and it exits the loop before this activity has time to execute), only then will the next activity run. The next one is On success, which is probably the most common, and it means the next activity will run only if the previous activity completes successfully. We've also got On fail, meaning the next activity will run only if the previous activity fails, which you might want to use for things like notifications. Finally there's On completion, which means the next activity will run when the previous activity completes, whether it succeeds or fails; either way, the next activity will run.

So we can connect up our activities a little bit like this: we might start by getting some data from an external location, then load it into a Fabric data store, then do something like a semantic model refresh (if you're using Direct Lake mode but you've turned off automatic reframing, this might be a good use case), and then if the semantic model refresh fails, notify someone in your team. That's how we can build dependencies between different activities.

Another feature of data pipelines to be aware of is that you can have active and inactive activities. If you right-click on any activity you'll see this popup box, and you'll see Deactivate here because this is currently an active activity. Essentially what that means is that when you run the pipeline, it's not going to execute that activity; it will skip it and go straight to the next one, which becomes the first to run. This is what it looks like: you'll see this kind of greyed-out activity box, and when you run your pipeline it will just say Inactive, so it won't execute that activity; it will go straight to our silver transform data activity. This can be useful for a wide variety of things. Sometimes when you're debugging it's useful to make some activities inactive, if you're sure they're working well and you just want to focus on, say, the second half of a pipeline. There are various other ways we can use active and inactive activities, but just know that if you see that greyed-out box with this icon, it means the activity is deactivated and it's not going to run.
Okay, so that's some of the basics of data pipelines and how we construct them in Fabric. Let's talk about some patterns you'll commonly see in orchestration. In the basic pipeline example I was hardcoding the relative URL, but the problem with that is we can only handle one dataset at a time; you can only load one CSV file into Fabric at a time, for example. Normally you want to build your pipelines so there's no hardcoding of things like string values or file paths, because that means you can scale them to much larger dataset sizes and numbers of datasets without changing the pipeline itself. One of the main ways we do this is to build what we call metadata-driven pipelines. This allows us to do things like: I want to ingest 25 different tables from this Azure SQL database and load them into 25 separate tables in the data warehouse. If you have some sort of database in an external system, chances are you don't just want one table; you probably want multiple tables from that database to come through, and you don't want to be writing 25 separate Copy data activities. You want to build your pipeline in such a way that it can essentially loop through all of those tables with just one Copy data activity, and we'll look at how to do that in a second. Another example: I want to query 10 different REST API endpoints to get all of my data out of this SaaS product's API and save all of the JSON responses in a Lakehouse Files area.

A very simple way of viewing what this looks like: you need to store all of the connection details, so to take the first example, the table name and the schema name in that database, somewhere in your Fabric environment. That can be a data warehouse table; it can be a SQL database table, now that we have Fabric SQL database, so we can store our metadata table in there; but it can also be flat files, and some people use JSON files to store their metadata as well. The first step is to create a Lookup activity (this could also be a Script activity if your metadata is in a table format, because Script and Lookup can be used for similar tasks in this regard); essentially we need to read our metadata from wherever it lives and bring it into our data pipeline. I think of this as an instruction manual: to continue our first example, there will be 25 lines in this metadata table, and those 25 lines say, okay, you need to extract source tables 1 through 25 into these destination tables. Once we've got our Lookup activity, we pass all of that information, those instructions, into a ForEach loop, because we now have 25 instructions we want to act on. Inside it you'll see your Copy data activity: for each of the lines in my metadata table, go away and copy whatever is in the source system and write it into my destination (there's a small conceptual sketch of this just below).
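To make that "instruction manual" idea concrete, here's a tiny conceptual sketch in Python of what the Lookup plus ForEach pattern is doing. The column names (source_table, destination_table) and the rows are made up for illustration; in the real pipeline this lives in your metadata table, and the loop is the ForEach activity rather than Python.

```python
# Conceptual sketch only: what the Lookup + ForEach pattern is doing.
# In the real pipeline the Lookup activity reads these rows from a metadata
# table and the ForEach loop feeds each row to a single Copy data activity.
metadata = [
    {"source_table": "dbo.Customers", "destination_table": "bronze_customers"},
    {"source_table": "dbo.Orders",    "destination_table": "bronze_orders"},
    {"source_table": "dbo.Products",  "destination_table": "bronze_products"},
    # ...one row per table, up to 25 in the example above
]

for item in metadata:   # the ForEach loop
    # In the pipeline, these two values are injected into the Copy data
    # activity with dynamic content such as @item().source_table.
    print(f"copy {item['source_table']} -> {item['destination_table']}")
```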
Now, obviously we've only got one Copy data activity here, so we need to change how we configure it; we're going to need dynamic content. What this is essentially saying is that I'm no longer hardcoding the relative URL (before, I think, it was addresses.csv); instead I'm reading it dynamically. It's @item() because we're in a ForEach loop, so it refers to the current item you're iterating through (in this example there'd be 25 items in total), and we're getting the .source property from the current item. That's essentially what this means, and it allows us to iterate through all those different metadata items, dynamically passing in whatever the current value for source is. The benefit, obviously, is that once our pipeline is set up, you shouldn't need to edit it: if you add a few tables to your Azure SQL database and 25 becomes 27, you don't have to change your pipeline at all; all that needs to change is the metadata table. You've probably got some table and you simply add a few new rows at the bottom of our instruction manual, as I like to think of it, a couple more instructions, but the pipeline itself doesn't change, which is really useful.

In the last slide we saw an example that looked like this, but what if we had to do lots and lots of things for each item in our ForEach loop? What if we had to ingest and maybe do some transformation, then load it somewhere else, into a silver layer for example, then run some sort of script, maybe to build our gold layer, and then refresh a semantic model? There's a lot to be done in that ForEach loop, so what you might want to think about is doing something a bit like this. It's very similar, but all we're doing is substituting this logic here and bringing it into its own pipeline, so we create a second pipeline. This is normally called parent and child architecture: in this example, this would be the parent, and we've wrapped all of that logic up and built it into a separate child pipeline. This has a lot of benefits, but it also requires us to construct our pipelines in a slightly different way, so let's look at some of the key features that enable this kind of architecture.

In our parent pipeline, the most important thing is the Invoke pipeline activity; we saw that on one of the first slides and said how useful it is, and we need it now. Currently I recommend using the legacy Invoke pipeline activity, especially if you want to pass pipeline return values up from the child back into the parent, because that's not currently supported in the preview version of the activity. With this Invoke pipeline activity, normally what you'll want to do is configure some parameters in the child pipeline, and then within the parent pipeline you configure those parameters to pass things into the child. In this example I've got things like source directory, source file name, destination table name and expected column names; these are things my child pipeline needs in order to function, for it to be dynamic, and they're coming from the metadata table in this example. So we're reading the metadata, the metadata goes into my ForEach loop, and for each individual item, because we're using @item() again, we're dynamically passing in these four properties that we're going to use in the child pipeline. The child pipeline is set up like this: most importantly, it has some pipeline parameters, otherwise this wouldn't work. You need to declare these parameters in your child pipeline first, and then we can use whatever we pass in with @pipeline().parameters followed by the parameter name.
You can also get this from the right-hand menu in a data pipeline; you don't need to remember it off by heart, because there's a Parameters tab that will show you which parameters are available for you to use. You would then use these in your various ForEach loops and If conditions, so when you need to specify your source, for example (I think in here there is a Copy data activity), you're still not hardcoding it, you're still doing it dynamically, but this time the dynamic parameters are coming from the parent.

So it's not just data pipelines. Although data pipelines are the primary method for orchestration of Fabric items, we can also orchestrate notebooks from within other notebooks, so let's look at notebook orchestration in a little more detail. Within a notebook we can call the execution of other notebooks in Microsoft Fabric, and we do that using the notebookutils package. This is a package that's managed and maintained by Microsoft, and it gives us a number of useful functions we can call: there are things for managing files, things for managing your credentials (getting credentials from Azure Key Vault, for example), but the one we're focusing on is the notebook module, because that allows us to run other notebooks from within a notebook. That's the main distinction between this type of orchestration and data pipeline orchestration: here we're only talking about running other notebooks.

If we want to use this style of orchestration within a notebook, we've got a few different options available to us. Option one: if we want our notebooks to be executed in parallel, meaning the order doesn't matter and we're just going to run them all at the same time, we can use the notebook.runMultiple method and pass it a Python list of names of notebooks within your workspace. Again, the ordering doesn't matter because they're all going to be executed at the same time. The way we construct this: we import the relevant library, notebookutils, and I've given the notebook module an alias of nb, so it's nb.runMultiple, and we pass in the list of notebook names, a bit like the sketch below.
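As a minimal sketch (the notebook names are placeholders, and this assumes you're inside a Fabric notebook where notebookutils is available), the parallel version looks something like this:

```python
# Inside a Fabric notebook: run three workspace notebooks in parallel.
# The notebook names are placeholders for notebooks in your own workspace.
from notebookutils import notebook as nb

# All three are submitted together; ordering is not guaranteed.
nb.runMultiple([
    "Ingest_Customers",
    "Ingest_Orders",
    "Ingest_Products",
])
```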
So that's the simplest method, but what if we wanted to control the order of execution of these notebooks? Well, as well as a list, we can also pass this runMultiple function what's called a DAG, a directed acyclic graph. This comes up all the time in orchestration; if you're used to Apache Airflow, for example, it runs on things like this. What you'll notice is that the DAG is actually quite similar to a data pipeline: if you've ever looked at the Edit JSON code view, it's very similar, just a JSON-style representation of your activities. So you might be thinking, how does this structure encode a graph? There are a few key features we should dig into. We're declaring a variable called dag as a Python dictionary, and one of the main attributes of this dictionary is the activities list. This is a list of objects (dictionaries in Python), and each of these represents a notebook that you want to run, so this first block represents one notebook in your DAG. We give it a name and a path to that specific notebook, and there are a few other optional settings we can specify; this one here is the timeout per cell in seconds, for example. Then we add another notebook into the list: my second one is notebook name two, which has the same path, and the name is the same as the path because they're in the same workspace. Here we've got a few extra properties: a retry property of 1, so if it fails it's going to retry once; a retry interval, so it's going to wait 10 seconds before retrying; and, most importantly, dependencies. Dependencies are how we, as the name suggests, build dependencies between different activities; it allows us to say, start with this one, and when that has completed, do this one. So let's look at what this means: our second notebook has a dependency on notebook one. What that means is it's going to start the run, execute notebook one, and because notebook two is dependent on notebook one, notebook two will only run after notebook one has finished, like this. That is the dependency we're encoding in this DAG. These can get more and more complex: you could have multiple notebooks here and wait for all of them to complete before you run the second one, but essentially it allows us to build up this logic within this structure. When you're calling runMultiple with a DAG, you can also set displayDAGViaGraphviz to true, and it will show you a diagram very similar to the one I just drew. Alongside runMultiple, it's also a good idea to use another method called validateDAG: once you've constructed your DAG, just to make sure you've actually built it correctly, it's always recommended to pass it through this validateDAG method before you call runMultiple.
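And here's roughly what the DAG version could look like, following the structure described above. The notebook names and paths are placeholders, and it's worth double-checking the exact property names against the current notebookutils documentation:

```python
# Inside a Fabric notebook: two notebooks, where Notebook_2 only runs
# after Notebook_1 has finished. Names and paths are placeholders.
from notebookutils import notebook as nb

dag = {
    "activities": [
        {
            "name": "Notebook_1",
            "path": "Notebook_1",            # same workspace, so name == path
            "timeoutPerCellInSeconds": 120,  # optional per-cell timeout
        },
        {
            "name": "Notebook_2",
            "path": "Notebook_2",
            "retry": 1,                      # retry once on failure
            "retryIntervalInSeconds": 10,    # wait 10 seconds before retrying
            "dependencies": ["Notebook_1"],  # run only after Notebook_1 completes
        },
    ]
}

nb.validateDAG(dag)                          # sanity-check the structure first
nb.runMultiple(dag, {"displayDAGViaGraphviz": True})
```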
Okay, so that's most of what I want to talk about when it comes to notebook orchestration in Fabric. There's just one point I wanted to add: what's the benefit of this? Well, one of the key benefits was that you can execute lots of different Spark notebooks from a single notebook and they all use the same Spark session, so you can use this to run multiple notebooks within the same Spark session. The problem was that with data pipelines, previously at least, if you wanted to do this sort of notebook orchestration from a data pipeline, it would spin up a brand new Spark session, and use a lot of resources, for every single notebook. They have now introduced a feature called session tags in data pipelines, which essentially allows you to declare a session tag so that multiple notebooks executed from a data pipeline can all run on the same session. So that's less of a differentiator now, but it was one of the benefits of using notebook orchestration, at least in the past.

Okay, let's round off this session and talk a little bit about triggers in Microsoft Fabric. The main one is the schedule trigger: a data pipeline, a notebook, a Dataflow Gen2, all of these can be triggered on a schedule, for example, I want my data pipeline to run every day at 9:00 a.m. You need to declare a start time and an end time, and we have the option of declaring the time zone; I think by default it's UTC, but you can change that to whatever makes sense for your organization. Now, you can also set up individual notebook schedules and Dataflow Gen2 schedules, but I'd recommend doing it from a data pipeline, because you can trigger both of these from a data pipeline, and with the data pipeline you get benefits like error handling and notifications. So in most use cases I'd recommend data pipelines as your primary orchestration tool; you don't really want lots of separate notebooks and dataflows triggering on their own schedules, because it becomes quite difficult to manage and you're not sure what gets executed where. It's better to have everything executed through a data pipeline, in my personal opinion.

As well as the schedule trigger, they've recently introduced event-based triggers for data pipelines, and currently these trigger based on an event in Azure Blob Storage; so if a new file is uploaded, it will trigger the execution of a data pipeline. It looks a bit like this: you select which Fabric item you want to run, which is your data pipeline, and you select the event. It will ask you for the particular Azure Blob Storage account you want to monitor, you make that connection, and it basically subscribes to it; when there are any changes, depending on which event you're listening for, it runs your Fabric item. This is particularly useful for people migrating from Synapse, for example, who need this functionality and just want to trigger Fabric items directly rather than going through ADF.

The other triggers to be aware of are in the Real-Time Hub, and this is a developing space in Fabric because it's still in preview, as you can possibly see here, but the Real-Time Hub is showing quite a lot of potential in terms of real-time events, more than just Azure Blob Storage events. In the Real-Time Hub we have some other event triggers, for example job events, which are events produced by status changes on Fabric jobs you'd see in monitoring, such as a job created or a job succeeded. We also have OneLake events, which trigger on actions on files and folders in OneLake, so that's very useful, and workspace item events, so when you create items in a workspace it will raise an event. We can use these to trigger either alerts (that's Data Activator, or Activator alerts as I think it's now called), or eventstreams, or data pipelines as well.
Now, I just wanted to round off this section on triggering with a little bit on semantic models, because semantic models have some refresh functionality you can trigger within Microsoft Fabric. The first option is automatic refresh: within a semantic model you'll have this option for keeping your Direct Lake data up to date. If you turn that on, then any changes you make to your underlying OneLake files will automatically be visible and updated in your Direct Lake semantic model, assuming the user is querying that particular data on the front end. The second option is in the semantic model settings: we can turn that setting off and choose to configure a refresh schedule instead, or if you're using import mode semantic models, for example, this is the traditional way of refreshing a semantic model, the scheduled refresh. But in Fabric we have a few more options. The third one we've seen already: the semantic model refresh activity in a data pipeline. This allows us to integrate semantic model refreshes into our engineering pipelines, so you might load some tables into a warehouse and only refresh your semantic model once all of those tables have been loaded successfully. One of the negatives of the automatic Direct Lake refresh option is that it's constantly going to update all of your tables automatically, so if you're midway through an ETL job you might have some tables with fresh data and some tables that haven't completed their ETL yet, and you risk the data appearing out of sync to the user on the front end. With option three you gain a bit more control: you're saying, I only want to refresh my semantic model once I know that these 10 tables have all been updated successfully. And finally there's semantic link: it's not just from a data pipeline that we can programmatically execute the refresh of a semantic model, we also have refresh_dataset, which is a method in Sempy, the semantic link library, so we can do it via notebooks as well (there's a short sketch of that below).

Thank you very much for watching, I hope you enjoyed that. That's the last of section one of the study guide; in the next video we're going to start looking at ingesting data, data stores and things like that. So I hope to see you in the next video.
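For reference, here's the semantic link option mentioned just before the sign-off. A minimal sketch, assuming you're in a Fabric notebook; the semantic model and workspace names are placeholders.

```python
# Trigger a semantic model refresh from a notebook via semantic link (Sempy).
# Dataset and workspace names below are placeholders.
import sempy.fabric as fabric

fabric.refresh_dataset(
    dataset="Sales Model",      # semantic model name (placeholder)
    workspace="Analytics WS",   # optional; defaults to the notebook's workspace
)
```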