Transcript for:
Azure Data Factory Overview

Azure data engineers are in demand, but for that you need to master Azure Data Factory. In this four-hour Azure Data Factory course you will become a pro developer from scratch, because you will get strong hands-on practice with complex real-time scenarios using activities such as data validation, parameterized datasets, ForEach, If Condition, hierarchical pipelines, pulling data directly from APIs, and transforming data with data flows. You will also learn how to use schedule triggers, tumbling window triggers, and storage event triggers, and at the end you will build an end-to-end automated, triggered parent pipeline with multiple child pipelines. Sounds amazing, right? Why have I covered so much in one course? Because the competition is very high right now and I want you to be competitive enough to crack the interviews. So are you excited to master this tool? I know your answer, so let's do it.

Welcome to the Azure Data Factory full course, and obviously it's free. Does free mean it is not worth it? No. If you have watched my previous videos you know the quality of content I provide. If you are new to my channel, hit the subscribe button and the bell icon; I upload long-format videos every Sunday and try to provide real data engineering knowledge for free, and the comments on the previous videos will tell you how useful they are. To all my existing viewers: you are watching, liking, and commenting on my videos, so it's high time to subscribe too; it's a small request from a friend who just wants you to succeed in the world of data engineering.

Now let's get started. Why do we need to learn Azure Data Factory at all? Azure data engineers are in demand, and if you want to become one, or to become a data engineer who also uses Azure resources, you cannot imagine a single solution without Data Factory. Trust me, Data Factory is the backbone of Azure data engineering solutions. You might think: we have Synapse Analytics now, and in the future we will have Microsoft Fabric, so do we still need to learn Data Factory? The answer is yes. If you use Azure Data Factory directly, it is obviously there. If you use Synapse Analytics, Azure Data Factory is there too; we call it pipelines, or integration services, but it is the same Data Factory with the same UI, just embedded inside Synapse Analytics. And in Microsoft Fabric it is exactly the same again; it is even called Data Factory there. So if you want to be someone who builds pipelines, migrates data from one place to another, and orchestrates data solutions, you need this tool. And in this single video you do not need to go anywhere else; it is a one-stop solution, and we will learn everything from scratch, so no prior experience is required.

What about prerequisites? They are easy to cover. First, a laptop or PC with a stable internet connection (an iPad is fine too, as long as you can access your Azure account from it). Second, an Azure account; if you do not have one, I will show you how to create a free Azure account, and you will not need to pay for anything. Third, excitement to learn Azure Data Factory, because it is one of the most in-demand technologies right now; when I was learning this tool back in the day I was genuinely happy every time I learned something new, and now it is your turn to be in the same shoes.

So let's start with the basic question: what actually is Azure Data Factory? You may have seen the name on LinkedIn or in job descriptions, but what is it exactly? Basically, it is a cloud ETL/ELT tool. And if you have not started taking notes, do it right now; notes will be very handy when you want to revise or revisit these concepts later, because it is hard to remember everything when you are learning a new technology. So, as I said, it is an ETL or ELT tool. ETL is extract, transform, load; ELT is extract, load, and then transform. When we work with big data we often do not want to transform the data first, so we extract it and load it as-is, and transform it afterwards; that is the ELT pattern. So what does Data Factory do and how does it do it? Let me take an example, because examples make concepts stick. Let's say you have a source, and it can be any source at all.
Maybe it is a SQL database: MySQL, PostgreSQL, MS SQL, any SQL DB at all. Maybe it is a set of CSV files, maybe an API, maybe a website reached over an HTTP connection; imagine any source you like. Now let's set up the whole scenario: on one side is your source, and on the other side is your destination, which again can be anything, cloud storage or some external application. If you want to migrate your data from the source to the destination, you can use this powerful tool, Azure Data Factory. How does it do that? It has connectors: connectors for all of these sources, SQL databases, APIs, HTTP connections, everything, and connectors for your destinations as well. And what can the destination be? Again, almost anything: maybe you want to store the data in Azure Data Lake, maybe in S3 on AWS, maybe in an Azure SQL database, or maybe in an external application such as Dataverse. So extract and load are covered, but where is the T, the transform? For that Data Factory has something called data flows. Data flows are its transformation functionality; behind the scenes they run on Spark clusters, so you can transform your data without writing a single line of code. You build the logic through a graphical user interface and Spark runs underneath, so you get the power of Spark without actually writing Spark. It is so cool and so powerful, and don't worry, I will show you how to use it as well. This is an end-to-end course, a one-stop solution, so you do not need to go anywhere else. If you have no prior experience, we will cover everything from scratch; and if you already know a bit, you will still learn a lot, because we will walk through real-time scenarios of the kind asked in interviews. To crack those roles you need to be skilled enough to answer scenario-based questions, and we will cover that area too.

And one more time: please hit the subscribe button. I am a new creator, still quite raw, and if you want to support the dedication and hard work that goes into these long-format videos, subscribing, liking, commenting, and sharing them with other data enthusiasts is a big help. I genuinely feel happy reading comments, so share your feedback on this video and the others you have watched. My aim is to build a strong data engineering family, I know my data fam is loyal and skillful, and I need your help to keep doing that.

Now let's create our free Azure account, because obviously that is required to learn Azure. Go to Google and search for "Azure free account", hit enter, and click the first link. Here is the important step: click "Try Azure for free"; do not click "Pay as you go", because that is the paid option. It will then ask for your Microsoft account; if you do not have one, you can create it right there, and a Microsoft account is handy anyway because you can sign in to many Microsoft services with it. Once you sign in, you land on a page with a short form asking for basic details such as your name, phone number, and email ID. The important part: you get a US$200 credit to use in the first 30 days, which should be enough to complete this course, plus 55+ popular services that are free for 12 months, so take advantage of the opportunity and learn as many Azure services as you can. One thing to note: when you click the sign-up button it will ask for your card details. Do not be afraid; Azure asks for them only to confirm that you are the person who will be using the account. And what happens after 30 days? It will ask you to upgrade. If you do not reply or opt in to convert it into a pay-as-you-go account, your services are simply paused and eventually deleted; you will not get charged. So just put in the details and start learning this powerful tool. Once you have created your Azure account, it's time to actually see the Azure portal.
Getting to the Azure portal is very easy. Open a browser tab and go to portal.azure.com, and it takes you straight to the Azure portal. Don't be confused by the resources you may see on my screen; those are just services I created earlier. If this is your first time, congratulations, you have officially landed on the Microsoft Azure portal, and I have to say I love this UI.

You will be wondering what all these icons and services are, so let's cover some basics. The first important thing is the resource group. A resource group is basically a folder. Hold on, you might say, what is a resource then? A resource is any service that you create and use in Azure: if you use Azure Data Factory, that is a resource; if you use an Azure storage account, that is a resource; if you use Azure Synapse Analytics, that is a resource. And just like files on your PC live inside folders, every resource you create in Azure has to live inside a resource group. Understood? I know my data community is smart, so I know you got it.

Now, which two resources do we need? One guess is obvious: Azure Data Factory. The second resource is a data lake, and there is an important interview point hiding here, because there is no separate "Azure Data Lake" resource to create. Instead, Azure gives us the storage account. When you create a storage account you get blob storage by default, but if you want a data lake instead, there is one configuration you need to enable, and once you do that you get your data lake. And what is the difference between blob storage and a data lake? It is a small one: with blob storage you only get containers, but with a data lake you get hierarchical namespaces, meaning folders within folders where you can organize your data. In layman's terms, once that option is on you can treat it as your data lake.

So those are the two resources to create, and for that we first need a resource group. Go to the search bar, type "resource group", and click the suggestion. You will see my existing resource groups; ignore those and click the Create button. Now name your resource group; I will call mine rg-adf-course, but you can name it anything, trust me. Pick the region nearest to you (a US region is fine too). Next comes tags; tags are handy when you want to categorize resource usage and bills, but that is out of scope here, so skip them, click next, and click Create. The validation passes and our resource group is created. To find it, click the Home button and then Resource groups, or just search for it; open it and you will see it is empty, because we have not created any resource yet. So it's time to create the resources, starting with the storage account, and while we do that I will briefly talk about data redundancy, because that is something that can come up in your interviews and I do not want to skip it.
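By the way, if you ever prefer doing this step in code rather than clicking through the portal, here is a minimal sketch of the same resource-group creation, assuming the azure-identity and azure-mgmt-resource Python packages are installed and you are already signed in (for example via the Azure CLI). The subscription ID is a placeholder, and the name and region are just the ones from this walkthrough.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"   # placeholder, use your own
credential = DefaultAzureCredential()        # picks up your existing Azure login

resource_client = ResourceManagementClient(credential, subscription_id)

# Create (or update) the resource group that will hold everything in this course.
rg = resource_client.resource_groups.create_or_update(
    "rg-adf-course",
    {"location": "eastus"},  # pick the region nearest to you
)
print(rg.name, rg.location)
```

Either way, portal or code, you end up with the same empty resource group, so use whichever you are comfortable with.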
Inside the resource group, click Create, and it takes you to the marketplace, which lists all the resources, services, and even third-party applications. Search for "storage account", make sure you pick the Microsoft one, select it, and click Create. This is the configuration page we need to fill in carefully if we want a data lake instead of plain blob storage. The first field is the resource group, and it is already filled in because we are creating the resource from inside our resource group (if you had not created one, you could do it here with Create new). Next is the storage account name, and a quick tip: you cannot reuse my name, because every storage account name has to be unique across the whole of Azure. Something generic like "storagedatafactory" will almost certainly be taken, so add your own name to it; I will go with something like storagedatafactoryansh, and you can append your name, your friend's name, whoever you like, I won't tell your mom, don't worry.

Next is the primary service; pick Azure Data Lake Storage Gen2. For performance I will keep Standard; Premium gives you lower latency and faster reads, but we do not need that here. Then comes the part I promised to talk about: the redundancy options. There are four of them in the dropdown: LRS, ZRS, GRS, and GZRS. LRS is the cheapest, locally redundant storage, where your data is duplicated within the same data centre. ZRS is zone-redundant storage, where copies of your data sit in data centres in different availability zones. GRS is geo-redundant storage, which is roughly like two LRS copies kept in two different regions. And GZRS is geo-zone-redundant storage, which combines zone redundancy in the primary region with another copy in a second region. So GRS and GZRS are more resilient options than LRS and ZRS, but for this course we will pick LRS because it is the cheapest. And a word of advice for interviews: if the interviewer says they want to keep costs low, do not just blurt out "bro, pick LRS". Say it properly: if you want to keep the cost low you can pick LRS, but if you want your data stored in the most resilient way you can go for GZRS. Talk like that, or they won't pick you.

Now click Next to the Advanced tab, and this is the most important step, and yes, it gets asked in interviews: if you want a data lake instead of blob storage, you need to tick this one checkbox, the hierarchical namespace option. As the name suggests, and as I mentioned earlier, this is what gives you hierarchical namespaces, folders within folders, and the moment you tick it your data lake is created. If you are presenting yourself as an experienced data engineer, you need to know this; it is very basic. After that, just click Next, Next, Next, Next, and Create, the same way we used to install software and games as kids: next, next, next, install.
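Just to connect those portal choices to something concrete, below is roughly the ARM-style definition of the storage account we just configured, written out as a Python dict. The property names follow the Microsoft.Storage schema as I recall it, and the apiVersion is only an example, so treat it as an illustration rather than a template you would deploy as-is.

```python
import json

# Roughly what the portal builds for you behind the scenes.
storage_account_definition = {
    "type": "Microsoft.Storage/storageAccounts",
    "apiVersion": "2023-01-01",          # example version only
    "name": "storagedatafactoryansh",    # must be globally unique
    "location": "eastus",
    "sku": {"name": "Standard_LRS"},     # the redundancy choice:
                                         # Standard_LRS / Standard_ZRS /
                                         # Standard_GRS / Standard_GZRS
    "kind": "StorageV2",
    "properties": {
        "isHnsEnabled": True             # the hierarchical namespace checkbox:
                                         # this is what turns blob storage into
                                         # Data Lake Storage Gen2
    },
}
print(json.dumps(storage_account_definition, indent=2))
```

The sku name is where the redundancy decision lives, and isHnsEnabled is that one checkbox that makes it a data lake.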
Jokes apart, we do not need to worry about the networking, data protection, and encryption tabs right now, because this is just a basic project and we are not working with sensitive data, which is why I skipped them. Click the Create button, and the storage account is ready within a few seconds. So our data lake is ready, and we need it because if we want to do anything with Data Factory we need some data, and I will keep creating more and more real-time scenarios on top of it as we go, so it will be a lot of fun; just stay with me through this video and you will learn a lot. Once the resource is deployed you can click Go to resource, or go to the Home tab, click Resource groups, find your resource group, and you will see the resource sitting there.

Now for the second resource: Azure Data Factory. Click Create again, type "data factory", and click the Microsoft one; you will notice there are third-party options as well, but we love Azure Data Factory. It was my first cloud ETL (or ELT) tool, the first tool I learned in the world of cloud data engineering, and I genuinely fell in love with it. Name the data factory, say adf-course-ansh, then click Next, Next, Next, and Create. And with that our Azure Data Factory is created too; see how simple that was. You can click Go to resource this time, or take the traditional route through Home, Resource groups, and your resource group, where you will now see both resources ready and waiting for you to use them.

But before launching Data Factory, I want to cover two fundamentals without which you cannot use it, and honestly they are not even limited to Azure Data Factory; they are core data engineering ideas. They are the linked service and the dataset, the fundamentals of the fundamentals.

What is a linked service? Let me explain with the same kind of example. On one side you have a source, on the other side a destination, and Azure Data Factory sits in the middle. We said ADF has connectors, but how does a connector relate to a linked service? Suppose your source is a SQL database and you need to copy data out of it. Before ADF can pull anything, it needs to build a connection to that SQL DB, and that connection is exactly what ADF calls a linked service. The same goes for the destination: ADF cannot store data itself, it is just the bridge between source and destination, so whether the target is an Azure resource or something outside Azure, you have to have a connection to it, and that connection is a linked service. We just use a different name, but it really is just a connection.

Now, what is a dataset? This is even easier. That SQL database you connected to is the whole database, but inside it you have multiple tables, say T1, T2, T3, and maybe views V1 and V2. You have established the connection, but which table do you actually want? If I want table three, then table three is my dataset; the dataset identifies exactly which data needs to be migrated. We use datasets on the destination side as well: if I want to land the data in my data lake, the folder or the file I write to is my dataset. To summarize, a linked service is the connection to an application, and a dataset is the actual data you pick, fetch, or store through that connection. That is the fundamental you needed, and now that you know it, we are going to conquer Azure Data Factory.
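To make the distinction concrete with the same SQL example, here is roughly what those two objects look like as JSON behind the scenes (you can see this kind of definition in the Studio's code view). The property names follow the ADF schema as best I recall, and the connection string, names, and table are placeholders for this example.

```python
# The connection itself: one linked service per application you talk to.
linked_service_sql = {
    "name": "ls_sql_db",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "<your-sql-connection-string>"  # placeholder
        },
    },
}

# The actual data behind that connection: one dataset per table/view/file you need.
dataset_table3 = {
    "name": "ds_table3",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "ls_sql_db",       # reuses the connection above
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"schema": "dbo", "table": "Table3"},
    },
}
```

Notice that one linked service can be reused by many datasets: one connection to the database, and a separate small dataset for each table or view you care about.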
So what will we do with Data Factory first? We will start in the storage account, and I will tell you why. Open it and you can see Containers; these containers are your data lake. (Not Delta Lake, no, ignore that word for now; I had just been reading about Delta Lake and it slipped out, we will talk about it another time.) We create containers because that is how a data lake is organized, so let's create our first one. I will call it source, and note that the name has to be lowercase. Perfect, that is your first container, good job. Now the second container: destination.

Open the source container, which is obviously empty right now. We will upload one file here, and then I will show you how to migrate that data from this container to the destination container. This will be the first thing you learn: actually moving data. It is very fundamental, and once you know the basics we will move on to more complex concepts and real-time scenarios. And what is the use case? Think of it like this: the source container is where your manager, your team lead, a senior developer, or some stakeholder drops files, and the destination container is used by, say, the data scientists or the BI developers. You need to move the data from source to destination. It is a very simple solution, but you grow from there, so first we will build this simple pipeline that moves data from one container of the data lake to another. This is your first task as a data engineer. And if you already know these steps, treat it as a revision and stay around, because the complex scenarios are coming.

So let's put some data in the source container. Click into it and you will see an Upload option, but we will not upload directly into the root. Why? Because we created a data lake, we can create folders, which are also called directories, inside the container, and obviously we will take advantage of that. Click Add directory first; I will call the folder csv files, since it will hold my CSV files, and save it, and now you can see the folder. Open it, click Upload, and select the file. I have uploaded a file called fact_sales_1.csv; it is just some random data from my computer, and do not worry, I will put it in my GitHub repository so you can download and upload exactly the same data. With that, our scenario is created.
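If you ever want to script this setup instead of clicking through the portal, here is a minimal sketch using the azure-identity and azure-storage-file-datalake packages. It assumes you have data-plane permissions on the account (for example the Storage Blob Data Contributor role); the account and container names are the ones from this walkthrough, the folder name is written here with an underscore, and the local file path is just an example.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://storagedatafactoryansh.dfs.core.windows.net"
service = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())

# Two containers (file systems): one for incoming files, one for the copy target.
# create_file_system raises an error if the container already exists.
source_fs = service.create_file_system("source")
service.create_file_system("destination")

# A folder inside the source container, then upload the sample file into it.
csv_dir = source_fs.create_directory("csv_files")
with open("fact_sales_1.csv", "rb") as f:
    csv_dir.create_file("fact_sales_1.csv").upload_data(f, overwrite=True)
```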
Now it's time to jump into Azure Data Factory Studio and finally see what Data Factory looks like, because we haven't explored it yet. Go to your resource group, click the Data Factory resource, and click Launch Studio. It takes you into Data Factory Studio, and if this is your first time here, just take a second to enjoy the UI; I love it, and of course it is not just about the UI, it is about what it can do.

Let me give you an overview first; I will answer the whole wh-family of questions. Click the expand button on the left so you can see the names of the tabs. First there is the Home tab, with shortcuts like orchestrate, transform, and configure SSIS; do not worry about those now, we will cover them in the coming sessions. Then there is the Author tab, which is your go-to tab, because this is where you will spend 90% of your time: it is where you actually create pipelines and work with all the different activities, and we will practically live here while building the real-time scenarios, so it will be full of fun. Next is the Monitor tab: once you have created your pipelines and, say, scheduled them with triggers (yes, we will cover triggers in this course too, and you will build an end-to-end pipeline that runs on its own), this is where you monitor those runs. And then there is the Manage tab. Remember we talked about linked services and datasets? Linked services are created from this tab, because it manages those shared resources along with things like the GitHub, Azure DevOps, and git configuration. Those are the three main tabs we use.

So let's open the Manage tab first, because before we do anything we need to establish a connection, as we discussed, with the source and the destination, and in our scenario both of them are the data lake. Click Linked services, then New, and you can see just how many connectors Azure Data Factory offers; you can connect to all of these applications seamlessly, without any hassle. In our case it is the data lake, so search for Azure Data Lake Storage Gen2, select it, and click Continue. Now name the linked service, the connection; I will call it something like linked service DL, where DL stands for data lake. Then pick your Azure subscription, and then pick the storage account, which in my case is storagedatafactoryansh; in yours it will be whatever you named it. Before clicking Create, click Test connection so we can confirm it actually works, and it does, it is successful, so click Create. And that's it: our linked service, our connection, is created between this Data Factory and that storage account. That was everything there is to a linked service.

Now it's time to get into the activities, the real-time scenarios, and the kinds of questions you will face in interviews, whether you are trying to crack a role or you are already an Azure data engineer sharpening your skills. A quick note on how I will teach this: I will not cover activities in isolation, because in the real world you never run activities independently; you connect them according to the requirements and build one complete pipeline. So I have designed scenarios that make you combine activities, which means you learn the activities and the scenarios at the same time. It takes more effort to explain things this way, but I know my data fam wants to learn it properly, so I am happy to do it.

So what is the first activity? None other than Copy. You simply cannot imagine Azure Data Factory pipelines without the copy activity, because in 99.99% of solutions we are migrating data from one place to another, and this is the activity that does it: as the name suggests, Copy data copies data from one place to another. Here is how it works architecturally. On one side you have a source and on the other a sink; in Data Factory we say sink instead of destination, so whenever you hear me say sink, it means destination. For each side the copy activity needs two things: first a connection, which is the linked service, and second a dataset. Once it has both, it can move your data from the source to the sink, simple as that. So, are you excited to implement the solution? We already know the scenario: migrate data from the source container to the destination container, this time using the copy activity. And after that I will add some twists and level up the game with more complex solutions and scenarios.
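Before we build the pipeline, one quick aside: every linked service you create in the UI is stored as a small JSON definition, and you can peek at it in the Studio's code view. The one we just created looks roughly like this; the type names follow the ADF schema as I recall them, and the authentication section is left out because the UI fills that in from your subscription.

```python
# Rough shape of the Data Lake Gen2 linked service from the Studio code view.
linked_service_datalake = {
    "name": "ls_datalake",
    "properties": {
        "type": "AzureBlobFS",   # the type ADF uses for Data Lake Storage Gen2
        "typeProperties": {
            "url": "https://storagedatafactoryansh.dfs.core.windows.net"
        },
    },
}
```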
So get ready and stay focused on this part, because this is the foundation; once the fundamentals are clear, you will be able to handle all the further activities and scenarios easily. Let's do it. I will go back to Data Factory Studio and open the Author tab; it is finally time to create the pipeline. Let me collapse the menu to get some more space. In the Author tab you can see the resource types: Pipelines, Change Data Capture, Datasets, Data flows, and Power Query. We already know what pipelines are: a set of activities that you run and orchestrate. Change Data Capture, CDC, is a kind of incremental loading pipeline where changes are captured automatically into the destination; it is still in preview, and I think it has been in preview since last year, so no comments. Datasets, as we know, are the actual data behind a linked service. Data flows we have talked about too: they transform data using Spark without you using Spark, because Data Factory writes the code for you. And Power Query: if you are familiar with Power BI, it is the same thing you know there as the Power Query editor, so you can transform your data that way as well; just imagine the ease of that. We will discuss it later, don't worry.

First, let's create a pipeline. Click the three dots next to Pipelines and choose New pipeline. A canvas opens, with a Properties pane on the right where you name the pipeline. I will call it pipeline_manager, which makes sense for our scenario since the manager is the one dropping files into the source. Then collapse the Properties pane again, because as a data engineer you need the space.

Next we need to find our activity, and let me quickly explain the Activities pane that has appeared, because it is nothing but the full set of activities grouped into categories. If you want to do anything related to moving and transforming data, explore Move and transform. If you want something Synapse-related, click that group and you will find Spark notebooks and Spark job definitions. For Databricks there are notebooks, JARs, and so on. Click Iteration and conditionals and you will see ForEach, If Condition, and the rest. So Data Factory has organized the activities into categories, which is handy when you are new, but the simplest way, once you know what an activity is called, is the search box. I already know that moving data means the copy activity, so I will simply type "copy" and drag the Copy data activity onto the canvas; no need to hunt through the folders.

Now, as we discussed, the copy activity needs two things on each side: a linked service and a dataset for the source, and a linked service and a dataset for the sink (which, again, just means destination). We can also name the activity, so I will call it copy_csv; keep it short, and do not put spaces in the name or it will throw an error. Let's look at the other settings while we are here. There is a description you can give to the activity. There is the activity state; if you set it to Deactivated the activity is skipped, but why would you want to deactivate it right now, come on. Then the timeout, which by default is 12 hours, after which the activity is timed out. Then retry: by default it is 0; set it to 1 and a failed run will retry once, 2 and it retries twice, and 0 means no retries at all. And retry interval: if you want retries, you need a gap between attempts, and by default it is 30 seconds.

Now select the Source tab. The copy activity is asking us to provide a source, so let's provide one. We do not have any dataset yet; we only created the linked service. So click New, and it asks where the data is stored: Azure Data Lake Storage Gen2, click Continue. Then it asks for the format; in our scenario it is a CSV file, so pick CSV and continue. Then it asks for a name; I will call it csv_source, because it is my source dataset. Then it asks whether you have a linked service: yes, we already created one (and if you had not, you could click New right here, it is not a big thing). Pick it, and then comes the important step. As I told you, the linked service is the connection to the application, but the dataset is the actual data inside it, so Data Factory is effectively saying: you have given me the connection to this storage account and I have connected to it, but which folder, which file, do I need? Click the browse button and you will see your containers; open the source container, then the csv files folder inside it, then select the file, and click OK. Tick "First row as header", because in a CSV we need to say that the first row holds the column names, and for import schema choose None. Click OK, and our source is done.
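Purely for reference again, the source dataset we just clicked together is stored as JSON roughly like this; the names are the ones from this demo and the exact property names are from memory, so treat the Studio code view as the source of truth.

```python
# Rough shape of the source dataset behind the UI.
dataset_csv_source = {
    "name": "csv_source",
    "properties": {
        "type": "DelimitedText",                 # the CSV format
        "linkedServiceName": {
            "referenceName": "ls_datalake",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",   # Data Lake Gen2 location
                "fileSystem": "source",          # the container
                "folderPath": "csv_files",       # the directory
                "fileName": "fact_sales_1.csv",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,            # the header checkbox
        },
    },
}
```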
If you want to preview the data, click the glasses icon and you should see a preview; this is the data we are working with. Now click the Sink tab, the destination. It asks whether we have a dataset for the sink; we do not, but we will create one, and do not just reuse the source dataset, because the sink location is different: the data will land in the destination container, so it needs its own dataset. Click New, and again the data lives in Data Lake Storage Gen2 (do not pick Azure Blob Storage, we are using the data lake), Continue. In what format do we want to store it? CSV, since the source is CSV, although you could also pick Avro, Parquet, or ORC. Continue, and name it; the source one was csv_source, so this one is csv_sink. The linked service is the same one, because it is the same data lake on both sides. Now pick the location: the destination container, which has nothing inside it, and here is a little interview question: if the directory does not exist yet but you still want one, you can simply type a name here and Data Factory will create it for you. So if I type csv files, that folder will be created automatically in the destination, and I will show you in the storage account; right now there is nothing in the destination, only in the source. It also asks for a file name; if you provide one it will be used, and if you leave it empty Data Factory will just pick a name itself, so I will leave it as none. Click OK, and that is done.

Now click on the blank white canvas and hit Debug. If I run this in debug, it will migrate my data from source to destination and create that folder as well. In the output pane you can see the run; the activity status is Queued first, then Running once you refresh. When it finishes you will see a green tick if it succeeded or a red mark if it failed, and honestly I love both marks: with green you feel happy, and with red you learn, because if you fail, you learn. That was a deep line. It succeeded, because I am here to teach you, so you get to learn from me and feel happy about your pipelines; perfect combination. Do you want to see the result? Of course. Go to the storage account, open Containers, and open the destination container, which was empty before. And wow, a new folder has been created; click into it and the file is there. Trust me, I did not copy it manually; Azure Data Factory copied it for me.
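For completeness, here is roughly how the whole pipeline we just ran is represented as JSON: one Copy activity that references the source and sink datasets, with the policy block holding the timeout, retry, and retry-interval settings we looked at earlier. Property names are from memory, so again treat it as an illustration rather than an exact export.

```python
# Rough shape of the pipeline from the Studio code view.
pipeline_manager = {
    "name": "pipeline_manager",
    "properties": {
        "activities": [
            {
                "name": "copy_csv",
                "type": "Copy",
                "policy": {
                    "timeout": "0.12:00:00",        # the default 12 hours
                    "retry": 0,                     # no retries
                    "retryIntervalInSeconds": 30,   # gap between retries
                },
                "inputs": [{"referenceName": "csv_source", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "csv_sink", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}
```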
This is the first data pipeline you have built in Azure Data Factory, and I still remember how happy I was when I built mine. You have migrated data from source to destination; does that mean the data was removed from the source? No, because this is a copy activity. If I go back to the source container, the file is still there; we are copying from source to destination, not moving it.

So you have learned your first activity, the copy activity, but that was just the fundamental; now we are going to level up with real-time scenarios, adding new activities one after another. Before that, though, I want to cover one more use case of the copy activity. In the scenario we just did, we moved data from one container of the data lake to another. What if you want to pull data from an API, or from any HTTP endpoint, into the data lake? For example, I will open my GitHub account and show you a file there: instead of manually uploading data into the data lake, I want to pull it directly from the source over HTTP. Can Data Factory do that? Yes, because it has connectors for APIs and HTTP connections.

So this time the requirement is special: we have a CSV file in my GitHub repository, and I want to copy it directly from the repository into the destination folder of the data lake. It is very easy once you see how. We will use an HTTP connection to pull the data, which means we first create a linked service for that HTTP endpoint, the connection to GitHub, and then a dataset for the file inside it; on the other side we already have the linked service for our data lake, and we will just create a dataset for it. (I almost said "link set" there, mixing up linked service and dataset; there is no such thing, ignore me.) Sounds interesting, right? You are fetching data directly from GitHub, which sits outside the Azure network. Let me first show you the repository, and I am not promoting it, I am only showing it because it has the file we need: inside the repository there is a folder called raw data, and inside that is the file fact_sales_2.csv. That is the file we will pull directly.

Back in Data Factory, let me close the other tabs. We will create a new activity, and while we could add it to the same pipeline, I prefer to do this one in a new pipeline, so I will create a new pipeline and name it pipeline_git, because we are pulling data straight from git and landing it in the data lake. Then I will repeat the earlier steps, which makes this a quick revision for you: try to create the pieces yourself without copying from the video, and refer back only if you hit errors. I add a copy activity and name it copy_git. In the Source tab I have no dataset, so I click New, and this time where is our data? In an HTTP endpoint, so I search for HTTP, select it, and continue. There are actually two options here, HTTP and REST: Microsoft's recommended approach is HTTP when you are pointing at a file hosted on GitHub or a website, and REST when you are making API calls; they behave similarly, but HTTP is the recommended one for this case. The format of the data behind that HTTP connection is CSV, so pick that and continue, and name the dataset source_git.

Do we have a linked service this time? No, so click New and let's create it. Name it linked_git. Now it wants the base URL, and where do we get that? From GitHub. Go to the repository, open the raw data folder, click the file, and you will see the page with the data embedded in it; now click the Raw button, and the browser shows you the plain file. This address is what we need, so click into the address bar and copy the URL. But note that this is the complete URL, and for the linked service we only need the part that establishes the connection with GitHub: the base URL, which is just everything up to .com, that is, https://raw.githubusercontent.com. The rest of it is the relative URL, and we will use that in the dataset, when we tell Data Factory which file to pick; hold on, we will get there. So copy just that base part, paste it into the Base URL field, and remove any trailing slash. The authentication type is Anonymous, everything else looks good, so click Test connection, it is successful, and click Create. Our HTTP linked service is created, and now, as I told you it would, the dataset is asking for the relative URL.
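As before, here is roughly the JSON behind the HTTP linked service we just created; only the base of the raw GitHub URL goes into it, and the leftover part of the URL becomes the dataset's relative URL in the next step. Property names are from memory.

```python
# Rough shape of the HTTP linked service pointing at raw GitHub content.
linked_git = {
    "name": "linked_git",
    "properties": {
        "type": "HttpServer",
        "typeProperties": {
            "url": "https://raw.githubusercontent.com",  # base URL only
            "authenticationType": "Anonymous",
        },
    },
}
```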
So I copy the rest of the URL, everything after the base, paste it into the relative URL field, and click OK. That is our source, and if I click preview data I can see exactly the data we expect. What about the sink? The sink dataset is not created yet — only the linked service exists — so I click plus new, pick the Data Lake, choose CSV, and call it sink git csv. The linked service is the existing Data Lake one. It asks for the location to copy the data to, so I browse and select the destination container and the CSV files folder, but I do not pick a file, because a new file will be created here. I could type a file name now, but let's leave it empty for the moment and let it use whatever comes from the GitHub path. No schema import, click OK, and everything is done. Now when I run this in debug mode it should copy fact sales 2 from GitHub into the Data Lake. Let me click debug and wait for the result — let's hope for green. By the way, this is a genuine real-time scenario: in the real world you do not manually download files from GitHub; whenever data lands there, you import it with Data Factory. It succeeded, but I want to validate it, so I go to the destination container in my Data Lake, where I should now see two items. I do see two, but one of them is a folder. Why? Because we did not specify a file name in the sink, so it reused the directory structure from the GitHub relative URL — every slash in that path is treated as a folder, and the file sits at the bottom of that hierarchy. Let me show you the fix: I delete that folder, go back to Data Factory, select the copy activity, open the sink dataset, and edit it, this time specifying a file name, say file 2.csv — you can literally use any name you like. Run the pipeline again, refresh the container, and this time I get a single file named file 2.csv.
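To summarize what just happened on the sink side, here is roughly how the two configurations play out — the container and folder names are the ones used in the video, and the GitHub path segments are placeholders:
    Sink file name left empty:
        destination / CSV files / <account> / <repo> / main / RawData / fact_sales_2.csv
        (each slash in the relative URL becomes a nested folder)
    Sink file name set to "file 2.csv":
        destination / CSV files / file 2.csv
An empty sink file name reuses the source path, while an explicit name gives you a single flat file.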
If I had named it something else, say git file.csv, I would see that name instead. Either way, the file has been copied, and you have successfully learned how to pull data directly from an HTTP source into your Data Lake — trust me, this is very useful. If you have grasped all the concepts, great; if not, just rewatch this part, because now we are going to level up — with every new activity we level up our game. So what is the next scenario? Look at the destination folder: we have fact sales 1.csv, which is our fact file, and we have file 2.csv — for this exercise, imagine it is just a random file, not a fact file. Here is the requirement, and it is a real-time scenario. Yes, it will be asked in your interviews, but honestly I care even more about your day-to-day work, because this is exactly the kind of task you will handle on the job. Listen carefully: the fact file is used by the reporting team, and the other file is used by the data science team. Requirement number one: create a new container called reporting, which will be used by BI developers, Power BI developers, and data analysts. They build reports on top of facts and dimensions — basically the data warehouse — so they only want the fact file; they do not want file 2.csv. But your destination folder contains multiple files, and it is not just that the reporting team only needs the fact data: according to data governance and data security, we cannot expose file 2.csv to them at all. That means we need a solution where a folder holds two (or more) files, but only one type of file lands in the new container — only the fact files. In one run it might be fact sales 1, in another fact sales 2, in a third fact sales 3; anything related to facts, as the name suggests, must go to the reporting container, and nothing else.
Why? Because, per data governance and data quality rules, the data analysts should not see that other data — and this is exactly the kind of scenario you will be asked about in interviews, trust me. To achieve this we will use a new activity in Azure Data Factory: the Get Metadata activity. What does it do? We have two files inside one folder, and Get Metadata gives us the metadata of the whole folder — metadata is data about data, so here it is data about the folder, which means it returns the names of all the files inside it. Once we can see all the file names, we will apply some logic on top: a condition that says only take the files whose names look like fact sales. The activity that applies that condition comes in a few minutes; first let's see how to use Get Metadata efficiently, because it is simple but very powerful. Head to the Azure portal. Quick tip before we start: you will probably watch this video in parts, so whenever you want to save your work, click the Publish all button — it saves your progress and spares you the nasty surprise of finding your work gone. Now, as discussed, we will use Get Metadata, so let me create a new pipeline and name it only selected files. Small hint: we will actually use two different activities, but let's learn them one by one. The first is Get Metadata: search for it in the activities pane, drag it onto the canvas, and rename it simply Get Metadata. Then click the Settings tab — it asks for a dataset. We will create a new dataset pointing at the destination folder, and we will not give it a file name. Why? Because we want metadata for the whole folder; if the dataset pointed at a particular file, Get Metadata would only return that file's metadata. So click plus new — the data is in the Data Lake. These scenarios really are handy in your day-to-day work, and you can hardly expect an interview without questions like this. And we keep leveling up: later I will also introduce triggers — tumbling window triggers, schedule triggers, storage event triggers — so stay with me, you are going to learn a lot. The data is in CSV format, click continue, and I will name the dataset metadata ds. The linked service is the same one as before. Now pick the destination container and the CSV files folder, and that's it — do not select any particular file, because we want all of them.
Just select the folder, click OK, skip the schema import, and click OK again. Now let me run this pipeline — oh wait, hold on, my bad. We have selected Get Metadata, but we have not told it what type of metadata we want. For that, go to the Field list, click New, and open the argument drop-down; you will see all the available options. Child items means the items that live inside this folder — exactly what we want. There is also Exists, which checks whether a particular file or folder exists; Item name; Item type; and Last modified, which gives you the last modified date of a folder or a file. For now select Child items, because we want to see every file within the folder. Now I click Debug, and it should return an array of files. Let's see... it succeeded. To inspect the result, hover over the activity run and click the output button. You will see the output — let me expand and zoom it. This is exactly what we want: under child items there are two entries, one for the fact sales CSV and one for file 2.csv. The activity has returned the metadata of the folder and told us these two files are there. That is what I want, because even if that folder held hundreds of files, this single Get Metadata activity would list every one of them as child items. Now that Get Metadata is in place, we will use the next activity, and it is very familiar if you come from a programming background: the If Condition. The plan: take every item, and if the item's name starts with fact, copy the data for that file; if it starts with anything else, ignore it. That is my condition. For the output side, I will create a new container called reporting — only files starting with fact will be transferred into it. I told you we would level up over the course of this video, and now it is time. What exactly do we need to do? Let me walk you through it.
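Before wiring up the loop and the condition, here is roughly the shape of the Get Metadata output we will iterate over — childItems, name, and type are the field names the activity returns; the file names are approximately the ones from the video:
    {
        "childItems": [
            { "name": "fact_sales_1.csv", "type": "File" },
            { "name": "file 2.csv",       "type": "File" }
        ]
    }
The condition we are about to build will test the name field of each entry.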
First of all, we run Get Metadata to get the list of files — really an array of files, because each item carries different properties like file name and file type, and we will use the file name. Then we need an activity that applies a condition. What was the condition? (Why am I so charged up? I just had a cup of coffee, so I am all boosted up now.) The condition is: the file name starts with fact. If that condition is fulfilled, transfer the file to the reporting container. Simple. But here is the catch, and it is just common sense. We have an array: file 1, file 2, file 3 (excuse the handwriting — my mom and my school teachers used to scold me about it, and they may have had a point). These are separate entities inside one list, and the If Condition has to be applied to each file individually. That means we need to run a loop, a for loop, over these files — and that is the surprise, the new activity: ForEach. We will wrap the If Condition inside a ForEach activity. Why? The ForEach runs a loop over the list of files: it takes file 1, passes it into the If Condition, which checks whether that file's name satisfies the condition; if yes, we run a copy activity, if no, we do nothing, and the loop moves to the next file. It may not feel simple yet, but after a few minutes it will be, because you will know exactly what every piece of this solution does. We are not building basic, run-of-the-mill stuff here; we are building a complex, end-to-end solution so that you build complex skills and can conquer your interviews. So we have understood the architecture — not fully, because we have not implemented it yet. Jokes apart, now it is time to implement, and I will walk you through every step, so stay focused. Back in the Azure portal, it is time to drag in the ForEach activity.
Search for ForEach, drag it onto the canvas, and name it for each csv. One important thing here: every activity has four types of output connections — on skip, on success, on failure, and on completion. They control when the next activity runs based on the previous one. For example, if you wanted the ForEach to run only when Get Metadata fails, you would connect them on the failure node. In our scenario we connect Get Metadata to the ForEach on success: once the metadata is fetched successfully, the loop runs. If you are from a Python background — or even if you are not from any programming background — a ForEach is simply a loop used to iterate over every single entity. Moral of the story: we need a loop to iterate through an array. So where is the array? It is in the output of the Get Metadata activity — that is why I ran it and showed you the output earlier. If you look at that output, the array is called child items. The first element it will iterate over is the fact sales file, the second is file 2.csv. So child items is our array, and we will feed it to the ForEach. Now, what is the file name? An array has indexing, but we do not need to manage indexes ourselves — the for loop walks through them automatically. What we do care about is that each element of child items has two keys, and the one we need is name. So whenever we apply the If Condition, the value we test is the name of the current child item. Traditionally, without a loop, we would address it explicitly: child items is an array, arrays start at index zero, so the first element would be childItems[0], and each element is a small dictionary with two keys, name and type.
Of those two keys we only need name, because the element comes as a dictionary, as key-value pairs. So the first element would be childItems[0].name, the next iteration would be index 1, then 2, 3, 4 if more files existed. Do we type those numbers manually? No — we are running a loop, and it iterates through the values automatically. The only thing we actually reference is the child item's name: child items, dot, name — that is what will go inside our If Condition. This is the core knowledge, not just the tool; now let's configure it. In the ForEach settings, first decide whether to run the iterations sequentially; I personally like to keep things in sequence, so I tick Sequential. Then the Items field: we just discussed that we will use the output of the Get Metadata activity. Click Add dynamic content, scroll down, and select the output of the Get Metadata activity. Careful, though — that gives you the whole output object, and we only want the child items array, so append .childItems to the expression and click OK. Now the ForEach holds all the child items. To go inside the loop, open the Activities tab and click the pencil icon. You land on an empty canvas, but notice the breadcrumb — we are inside for each csv. Here I drag an If Condition and name it if file matches. Now, what is the condition? Click Add dynamic content again. We want the current element of the ForEach, and for that there is a function called item() — and if you do not remember the function name, just click the for each csv entry in the list and it inserts item() for you. Each item has two keys, name and type; we want name, so the expression is item().name — for the current iteration, this is exactly equivalent to childItems[0].name, then childItems[1].name, and so on. Click OK. Is this done? Not quite, because we still need to wrap this expression in the actual condition. I know my file names: one starts with fact sales, the other is file 2.csv.
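Putting those two settings together, the expression boxes end up looking roughly like this — the name inside activity() must match whatever you called your Get Metadata activity, so treat it as a placeholder:
    ForEach (for each csv) > Settings > Items:
        @activity('Get Metadata').output.childItems
    Current element inside the loop (what the If Condition will test):
        @item().name
item() always returns the element of the current iteration, so you never have to write childItems[0], childItems[1], and so on yourself.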
But I know my condition: if the file name starts with fact, return true; otherwise return false — an If Condition always resolves to one of those two branches. How do we express that? There is a function called startswith. Again, if you are new you will not remember all these functions, and that is fine: in the expression builder, open the Functions tab and look under string functions, because we are comparing strings. You will see concat, endsWith, guid, indexOf, lastIndexOf, replace, split, startsWith, substring, toLower, toUpper — all the string functions. We want startsWith. Let me first remove the bare item() expression, because it will go inside the function. startsWith takes two arguments: the string to check, and the prefix to check for. Before typing the prefix, confirm the casing of the actual file name — go to the destination container and check; in my case the first letter is capital and the rest is lowercase, so we will use that exact word. Now click the startsWith function; it immediately complains that it does not accept zero arguments — relax, we are about to provide them. For the first argument, click the for each iterator so it inserts item(), then append .name, exactly as we discussed. Add a comma, and the second argument is the prefix string, Fact. Click OK. What happens now? If this condition evaluates to true, we perform the copy activity; if it is false, we do nothing, because we only want to copy files into the reporting container when the condition passes. Understood so far? If your answer is a hesitant yeah, that is still fine — you are learning a complex solution, and it is completely normal not to grasp 100% of it in one go. When I was learning these things I went back to solutions again and again and rebuilt them on my own. Rewatching this part does not mean you are not smart enough; it means you are learning the skill deeply, not just skimming topics you will forget in a few days.
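To recap the technical bit before moving on, the finished condition looks roughly like this — 'Fact' is the prefix casing used by the files in the video, so adjust it to your own file names:
    If Condition (if file matches) > Expression:
        @startswith(item().name, 'Fact')
It returns true for the fact sales files and false for everything else, which is exactly what drives the True and False branches.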
I learned these things by failing and rebuilding, and after all these years I am explaining the concepts in the easiest way I know. So do not worry — rewatch this part if you need to; I am here guiding you through everything from scratch. You are my data family, and Ansh Lamba's data family is the strongest data family on this planet, mark my words. So: once the If Condition activity is in place, we configure its true and false activities. As discussed, if the condition is true we want to copy the data. Go inside the If Condition — see how deep this solution goes, but that is what makes you skillful — by clicking the pencil button on the True activities. Look at the breadcrumb: we started in the main pipeline, only selected files, entered the for each, entered the if condition, and now we are inside its True activities. (I will also show you later how to trigger this whole pipeline automatically — the fun part of the course has started, because now we are covering real-time scenarios and complex solutions.) Inside True, I add a copy activity and name it copy fact data. Now, what is the source? The source is our destination folder — but here is the catch: we will be building parameterized datasets. What does that mean? Let me create it first and then explain, because these solutions are genuinely hard to find — but now you have found me, so don't worry. Create a new dataset: the source is the Data Lake, click continue, the format is CSV, and I will name it parameterized source csv. The linked service is the same one. Now I am going to do something a bit advanced, so just watch. Click the browse button and select the destination folder — yes, the folder is named destination, but in this pipeline it is acting as our source. Select only the folder, not any file. Why not? Let me explain. Suppose the folder also contained a third file, say fact sales 3.csv.
It does not matter exactly what it is called, as long as the name starts with fact. In the first iteration the loop checks the fact sales 1 file; it matches the condition, so the copy activity runs for it. In the second iteration it checks file 2.csv; it does not match, so no copy happens. In a third iteration it would match again for fact sales 3. But if we had hard-coded one specific file into the source dataset, how could the same dataset ever pick up that third file? It could not. So what are we going to do? We will create something called a parameter. Think of it like a variable in mathematics — call it x, or y if you prefer. The idea is: I copy the data first with one value of that parameter, then with another value, so the file the dataset points at changes with every iteration of the loop. Let me show you. In the browse dialog I pick the destination container and the CSV files folder — that's it, click OK — and notice the file name box stays empty. Every time the If Condition is satisfied we want the current file name filled in there dynamically, instead of hard-coding it. To do that, click the Advanced tab and choose Open this dataset; it creates the dataset and takes us into its edit mode (I will close the properties panel for some space). In the dataset, first create a parameter; I will call it p file name — p for parameter — with no default value. Then go back to Connection: the file name part of the file path is empty, so click it, choose Add dynamic content, and pick p file name. Now that slot is a parameter — effectively a variable whose value we supply dynamically at run time; do not get confused between the two terms, the functionality here is the same. The dataset side is done. Now go back and find the copy activity. Something new appears: it is asking, hey, you created a parameter on this dataset, now give me a value for it — you cannot leave a parameter blank. And how do we provide the value? I think you already know: the file name is coming from the loop, via item().name.
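Here is a compact sketch of the source dataset we just parameterized — p_file_name is how I am writing the parameter name here; use whatever spelling you gave it:
    Parameterized source dataset (parameterized source csv):
        Parameter:  p_file_name   (String, no default value)
        File path:  destination / CSV files / @dataset().p_file_name
@dataset().p_file_name is the standard way to reference a dataset's own parameter inside its connection settings.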
So click the value box and fill it with dynamic content. I know the level is way up right now, but that is how you grow, that is how you become an outlier in the crowd — everyone can build simple solutions; what sets you apart is solving these kinds of scenarios. And on top of it, you are my data family; you are going to conquer not just the interviews but the job itself. So, we provide the file name: it is item() — the ForEach output — dot name. Once we supply item().name, the copy activity will, on every iteration, pass exactly that file name into the dataset parameter. In the first iteration it receives the fact sales file, the condition approves it, the name goes into our p file name parameter, and the copy runs. In the second iteration it sees file 2.csv, which does not start with fact, so it is ignored. If a fact sales 3 arrives in the future, its name gets passed in the same way. That is how we can run the copy activity over any number of files without worrying. Now the sink. We need a new dataset there, because we are pointing the output at the reporting container. Click new, choose the Data Lake and CSV again, and name it reporting sink; the linked service is the same Data Lake one. Select the reporting container, and inside it create a folder, say csv files. Again, what is the file name? It will change dynamically, exactly as we just did for the source — so consider this your homework and do it on your own: click Advanced, Open this dataset (do not import the schema), close the properties panel for space, and create a parameter. You can call it p file name again — there is no harm, because this parameter lives in a different dataset than the source one. Then go to Connection, click the file name slot, Add dynamic content, pick the parameter, and click OK. Back in the copy activity, on the Sink tab, the same step: define the parameter value as item().name and click OK. Bro, you have developed the solution — once you see that green tick on the pipeline run you are going to pat yourself on the back, because you have done a great job. Click back on the main pipeline, only selected files, and let's review what we have built: this is the Get Metadata activity, which returns all the child items; then we iterate through each of them — click the expand and pencil buttons to look inside.
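While we are reviewing, here is a compact recap of how the loop's current file name flows into both parameterized datasets from the copy activity — dataset and activity names are the ones used in the video, give or take:
    Copy activity (copy fact data), inside the ForEach / If Condition:
        Source > parameterized source csv > p_file_name:  @item().name
        Sink   > reporting sink           > p_file_name:  @item().name
So on every iteration the same copy activity reads and writes exactly one file: the one whose name just passed the startswith check.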
Once we have each entity, we apply the If Condition on top of it, and the condition is: if the file name starts with fact, then and only then perform the copy. That copy data activity copies the files dynamically through the parameterized datasets. That really is an end-to-end solution. We learn while having fun, but we learn the real skills — I don't like boring lectures. Let me hit Debug, and while the pipeline is running, a quick story: I used to be a back-bencher, but my whole group was studious — we were the kind of group that had the most fun in class and still scored the highest marks. Be like that: enjoy your learning, make it fun. And here is another reason to be happy: I can see all greens. Honestly I was half expecting a red somewhere, because this is a complex, hierarchical solution — an activity inside an activity inside an activity — but it succeeded. Still, I will only be happy once I see the results, because results are what count. The expected result: only the fact sales file in the reporting container, not the other one. Go to the containers, open reporting, open csv files, fingers crossed... yes! Only one file. First of all, good job — I am virtually patting you on the back. This really was a complex solution: hierarchical, multiple activities, parameterized datasets, and trust me, this combination is one of the most frequently asked interview scenarios. The pipeline looks so good. So what extra are we doing next? I promised you triggers, and they are coming, but there is one more thing I need to show you first: data flows. We have not talked about them yet, and this is the best time. There is an activity called Data flow — search for it, drag it onto the canvas, and connect it after the ForEach, so the flow is: Get Metadata, then the ForEach with the If Condition and copy, and then, once the data has landed in the reporting container, we want to transform it — apply some transformations, any transformation you can imagine.
We might want a group by, we might want to filter some records, or select just a subset of columns. For that we will use Spark — the data flow runs Spark behind the scenes — but you do not need to learn Spark for it; you use Spark through a drag-and-drop interface. And yes, we will learn data flows right now, and once that is covered we will create the trigger so this whole solution runs on its own, instead of us clicking Debug again and again. I will also show you how to schedule the trigger: I will pick a time and you will see it run the pipeline automatically. There are a few more scenarios left that we will discuss after the trigger. So, let's transform the data. Click the Data flow activity, and first rename it — I will call it data transformation (man, I am so hungry right now). Go to Settings; it asks you to attach a data flow, and we do not have one, so click plus new. This opens the data flow canvas. Name the data flow transform csv. The first thing in any transformation is the source — we need data to transform — so click Add Source and rename the step source csv. It asks for a dataset, so I create a new one for the reporting container: plus new, Data Lake, continue, CSV format, and I will call it data flow source. The linked service is, as always, the same. Now pick the data: the reporting container and the csv files folder. You can click a specific file, or leave the file empty and it will pick every file in that folder; I will leave it empty. Click OK, skip the schema import, and OK again. Next, go to Source options. It mentions wildcard paths: wildcard paths let you do pattern-based, recursive reading — you point at folders and it automatically picks up the matching files inside — but we are not using that here. Then the Projection tab: we will import the projection. As I mentioned, data flows run on Spark clusters, so there is a Data flow debug toggle that we need to turn on, and it takes a little while to start. There is also a
setting called time to live for the debug session. Time to live is the feature that keeps the cluster alive — and no, not a Databricks cluster, sorry, I have just been doing some work in Databricks — it keeps the data flow debug cluster running for a given period. One hour is enough for us, so set that and click OK. It will now spin up the cluster, which takes a minute or two, so I will go and eat something; you can too, and I will be back in a few minutes. ...And our cluster is ready — you can see the green indicator next to Data flow debug. With the cluster running we can do more inside the data flow, because now we can actually preview the data. First, import the projection. What is that? Importing the projection basically means importing the schema (and do not mind my carrot, by the way — carrots are good for your health). Once the schema is imported you can see it has automatically inferred the data types — short, timestamp, string, long and so on for each column — and you can change any of them from the drop-down if you want to set them manually, because these were decided by Spark. If you are familiar with PySpark, this is equivalent to the inferSchema option: inferSchema tells Spark, hey, please work out the schema of this data for me, and if you want to define it explicitly you can do that instead. So, not a big deal — projection done. Now, to preview the data, open the Data preview tab and click Refresh; remember, if your data flow debug cluster is not on, you cannot preview anything. There is our data: transaction ID, transaction date, product, customer, and so on. Now I want to perform some transformations. We will build one data flow and cover several of them — they are basic ones, but we need to cover them. The first and most used one is the select transformation. Click the small plus button on the source step and you will see all the available transformations: join, conditional split, lookup, union, and the schema modifiers like derived column, select, aggregate — there are so many, you can even do window functions. That is why, purely from a transformation perspective, you could do almost everything in data flows without touching PySpark — though that does not mean you should, because with PySpark you get more customization and you can optimize your code.
But from a transformation perspective, all the functions are available here, and for none of these transformations do you need to write a single line of code. So, first, the select transformation: click the plus, choose Select, and rename it select columns. As the name suggests, it keeps only the columns we want. Let's say I do not want loyalty card, quantity, credit card, or cost — I select those and click delete, and now only the remaining columns are left. Check the Data preview tab and refresh: we see just the five or six columns we kept. That is the select transformation. Next I want a group by — slightly more involved — but let's save that for the end and do a few more basic transformations first. If you want to filter your data — say I want to drop the rows where customer ID equals 12 — use the filter transformation: click the plus button, go to Row modifier, click Filter, and name it filter rows. (There are countless transformations to play with; I am showing you the important ones you will actually use in day-to-day work.) Now the filter condition: customer ID not equal to 12. You can type the expression directly, or click Open Expression Builder, which I personally prefer, because it shows the schema, the columns, and all the functions, and you can build the expression by clicking. So in the Expression Builder I build the condition that customer ID is not equal to 12, then Save and finish. Preview the data: customer 12 is gone, so we have filtered the records based on that condition.
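For reference, that filter condition in the data flow expression language is just this — CustomerID is my assumed spelling of the column; use the name shown in your projection:
    Filter (filter rows) > Filter on:
        CustomerID != 12
Unlike the pipeline expressions earlier, data flow expressions reference columns directly by name, with no @ or item() wrapper.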
What's next? If you click the plus button you will also see join and conditional split — let's split our data. I want to split it based on the payment column; we have three payment types, Visa, MasterCard, and American Express, and I want to split the data into those three. So I click Conditional split and define the conditions. A conditional split divides the data into multiple streams — think of them as separate data frames, one per condition. For the first stream I give a name, Visa, and the condition: open the Expression Builder, set payment equal to Visa, and Save and finish. That's it. Anything that does not match a condition falls into the default stream at the bottom. Let me show you the preview (we will create all three streams, don't worry — I am just showing what happens behind the scenes): the Visa stream shows only Visa records. To preview a different stream you also have to select that output stream in the preview settings, because by default it shows the first one; pick the other stream and you see the remaining two payment types, MasterCard and American Express. Now the other streams: click the conditional split step, click the plus next to the conditions, add a second stream named MasterCard with the condition payment equals MasterCard, and a third named Amex — and for Amex we do not specify anything, because every row other than Visa and MasterCard flows into it by default. Preview again: the Visa stream has Visa, the MasterCard stream has MasterCard records, and the Amex stream has the American Express records plus some nulls, because we said anything other than Visa and MasterCard goes there, so rows with a null payment land in this stream too. Don't worry about the nulls — they give us a reason to learn one more transformation, the derived column, which we will use to replace them.
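For reference, the three split conditions end up roughly like this — Payment is my assumed column name; check your own projection:
    Conditional split > Split conditions:
        Visa:        Payment == 'Visa'
        MasterCard:  Payment == 'MasterCard'
        Amex:        (default stream - everything matching neither condition, including nulls)
Only the named conditions need expressions; the last stream is simply the catch-all.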
How do we replace the nulls? We know the nulls (really, any value other than Visa and MasterCard) sit in the Amex stream, so I add one transformation after that stream only, not after the other two. Click its plus button and select Derived column — a transformation used to modify an existing column or add a new one; I will keep the default name. It asks you to select a column, and you have two options: type a new name and it creates a new column, or pick an existing column and it transforms it in place. We want to transform the existing payment column, so I pick payment from the drop-down. For the expression I open the Expression Builder and look for a coalesce function — yes, it is there — and I say: take payment, and if it is null, replace it with N/A. Save and finish, then preview: instead of nulls we now see N/A. Very good. Now the group by. Look for the transformation called Aggregate, add it, and rename it aggregate. On which column do I want to group? Customer ID, so I pick customer ID in the Group by section. Then, which aggregate do I want? The maximum product ID per customer — for each customer, the highest product ID they have. So in the Aggregates section I pick the product ID column, open the builder, find max, and write max of product ID. Save and finish, then preview: each customer ID with its maximum product ID. You are transforming data without writing a single line of code — you click boxes, move your mouse, and Spark does everything for you. Simple, but powerful and handy. Why does this matter? Data flows shine when you work with people who do not code much, and when you want the logic to stay visible to everyone, technical or not. If I showed this data flow to a non-technical stakeholder, could they follow it? Obviously yes: this is a source, then you selected some columns, then you filtered some rows, then you split the data, then you aggregated it, then you derived a column. It is completely readable, with no code to explain. That is why the industry is adopting these low-code tools. I am not saying they replace code — with PySpark you still get more control and optimization — but there are requirements where a low-code solution is the right call, so you should be well versed in all types of solutions.
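Here is a compact recap of the two expressions used above — Payment, CustomerID, and ProductID are assumed column names based on what the preview showed:
    Derived column (on the Amex stream) > Column: Payment
        coalesce(Payment, 'N/A')
    Aggregate > Group by: CustomerID
        Aggregates:  ProductID  ->  max(ProductID)
coalesce returns the first non-null value it is given, so null payments become N/A while real values pass through untouched.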
Why does this matter? What's the best use case for data flows? They shine when you work with people who don't code much, or when you want the logic visible to everyone, technical and non-technical alike, because non-technical people can also follow what's going on. If I show this data flow to a non-technical stakeholder, they can read it: this is a source, then some columns were selected, some rows filtered, the data was split, aggregated, and a column was derived. It's understandable without explaining any code, and that's why the industry is adopting these tools. I'm not saying it replaces code, never, but there are requirements where a visual solution is the right one, so you should be comfortable with all of these approaches. If anyone asks "can you do this?", you should be able to say "definitely", because you know every option and you're competitive enough. In particular, when non-technical stakeholders are actively part of the development, you need to keep everything transparent, and in those situations you should know how to build these solutions.

That was the data flow itself, but one thing is missing: the sink. We've transformed the data; now we want to write it. Click the plus sign and select Sink, rename it, and pick the dataset. Let's create a new dataset for the data flow: Azure Data Lake Storage Gen2, CSV, named data flow sink, same linked service, pointing at the reporting container, and inside it a new folder called data flow output. Click OK, and no schema import. One more thing you should generally do before the sink: the Alter Row transformation, which lets you mark rows for insert, update or upsert operations. You can run this data flow without it, but if you're writing to databases or data warehouses you should use it. Rename it to alter rows. It asks for an alter row condition: Insert if, and for the expression I'll write 1 == 1. What does that mean? Insert the data only if 1 equals 1, which is always true, so every row is inserted. If I put a real condition instead, say I open the Expression Builder and write product_id == 1 (which gives a type mismatch, since the column is a string), then only rows matching it would be inserted; we want all records, so any always-true condition works: 1 == 1, 0 == 0, even 'Ansh' == 'Ansh'. That's how you define the condition and feed the sink. If you want to validate the flow, just click the Validate button and it's done.
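For the curious, the sink plus that always-true insert condition boils down to something like this when Spark executes it, continuing the same sketch; the abfss path is only a placeholder, not our real storage account:

# sink: write the transformed stream out as CSV; the path is a placeholder for the
# data flow output folder in the reporting container
sink_path = "abfss://reporting@<storage-account>.dfs.core.windows.net/dataflowoutput"

(aggregated
    .write
    .mode("append")            # 'insert' semantics: add new files, keep whatever is already there
    .option("header", True)
    .csv(sink_path))

# Spark names the output part-00000-... and adds an empty _SUCCESS marker when the write
# finishes cleanly -- exactly the files we'll see in the output folder later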
It's not mandatory to attach a sink to every stream, though. You need at least one sink in the data flow, but the other streams can be left without one, and that's exactly why I didn't sink those two: so you'd learn that you can build a stream just to filter or inspect data without writing it anywhere. If you do want to write it, no problem: add a sink activity to that stream as well, attach a dataset, and it will write that stream to the location. Skipping it is perfectly valid too. This confuses people when they're asked why we would create streams we don't write, but it is possible and it is doable.

So we've created this beautiful data flow. What will it do? It will be triggered after our main pipeline, this one, so it will automatically pick up the data from the reporting container and write the result to the sink, which is the data flow output folder inside reporting. Excited to run it? First publish all your work; it's important to save your progress, otherwise you'll have to redo it (which, to be fair, also teaches you a lot). I don't want to run this manually, because I told you we'll use a trigger. Click Add trigger; there's nothing yet, so choose New and name it selected files trigger. What type? We want a schedule trigger, because we want to schedule this pipeline, though we could also use a tumbling window trigger. What's the difference? They're very similar: both are schedule-based triggers that fire at a fixed interval, but a tumbling window trigger can also run your pipeline for past intervals, meaning you can pick a start date in the past, whereas a schedule trigger cannot. That's the only real difference.
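That backfill behaviour is easier to picture with a little plain Python. This is purely the intuition, not how ADF actually computes its windows, and the dates are made up:

from datetime import datetime, timedelta, timezone

# why tumbling window triggers can backfill: give them a start time in the past and a fixed
# window, and you get one run per elapsed window; a schedule trigger only fires going forward
start  = datetime(2024, 11, 1, 9, 0, tzinfo=timezone.utc)   # illustrative past start time
window = timedelta(minutes=15)
now    = datetime.now(timezone.utc)

past_windows = []
t = start
while t + window <= now:
    past_windows.append((t, t + window))
    t += window

print(f"{len(past_windows)} past windows would be picked up and run")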
We're not running this pipeline for the past, only for the future, so I'll pick the schedule trigger. Now choose the start time: it's currently 3:42 p.m. for me, although the portal shows the Atlantic time zone, which is fine, 3:42 p.m. is 15:42, and I'll set the minutes to 45 so the pipeline runs at 3:45 (and yes, I do need to actually click OK). Then there's the recurrence: the first run is at 3:45 and the next one 15 minutes later. Do we want an end date? Yes: I'll end the trigger at about 3:50, so with a 15-minute recurrence it will have finished before a second run. Leave "start trigger on creation" checked, click OK, and remember the trigger only takes effect after you publish, so click Publish all.

How do we check the trigger? Go to the Manage tab (not Monitor), open Triggers, and you can see its status is Started, so it should fire the pipeline in about a minute; it's 3:44 now. In the Monitor tab there's nothing running yet, but after a few seconds I should see the pipeline appear, and everything will run again: it will fetch the metadata, write the fact CSV again, run the data flow, and drop the output into the new folder. Fingers crossed (and time for the rest of my carrot). This is what we want in real life: data pipelines running automatically, without any human intervention. Refreshing at 3:45... and there it is, status In progress. The pipeline is running on its own, kicked off by the trigger, not by me. Click into it and you can see every activity it has performed so far: Get Metadata, ForEach, If, Copy, and now the data flow transformation, with the earlier steps taking roughly 14 to 20 seconds each. The transformation step takes longer because Spark clusters are spinning up and doing the work on top of your data; not much longer, but some. You should feel happy: you've built an end-to-end, trigger-driven solution.
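One aside before the next pipeline: when you publish a trigger it is stored as a JSON definition behind the scenes, roughly like this. I've written it as a Python dict, and every name and timestamp is illustrative rather than copied from my portal, so check your own published definition rather than reusing this:

# rough shape of the published schedule trigger definition (illustrative values only)
selected_files_trigger = {
    "name": "selected_files_trigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",
                "interval": 15,                          # run every 15 minutes
                "startTime": "2024-11-17T15:45:00Z",     # first run
                "endTime": "2024-11-17T15:50:00Z",       # trigger stops after this
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "only_selected_files", "type": "PipelineReference"}}
        ],
    },
}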
Next we'll build one more pipeline, and that will be fun, because we'll cover another trigger: a very special, very powerful one. I'll keep the suspense for now, but I personally love that trigger, trust me. We'll also cover the remaining important activities we haven't touched yet, and at the end we'll create one big pipeline that embeds these smaller pipelines and runs everything together. The new trigger goes by a few names (storage events trigger, storage blob trigger, storage-based trigger), and the common thread is that it's based on a storage path.

Meanwhile the data flow transformation is still running, about two minutes now, and that's normal because it's performing transformations on your data, so give it a moment (and let me finish this carrot). And now everything is green: we've run the data flow via the trigger, and in fact we've triggered the whole pipeline, not just the data flow. Want to see the output? In the reporting container there's the new folder we created, data flow output, and inside it... wait, what is this? When you don't specify a file name and Spark writes your data, it uses this part-00000... naming convention, and the _SUCCESS file indicates the write completed and the data isn't corrupted. It is still CSV, don't worry: click the file, open the Edit tab, and there are our two columns, which means the flow worked. Back in the Author tab, look at this: we've built an end-to-end pipeline with three to four levels of hierarchy in it. I'm really happy for you.

Now, what's next? The Set Variable activity. It's extremely useful when building complex solutions: whenever you want to capture the value of something and hold on to it, there's a dedicated activity for that. There's barely anything to configure, but its use cases are genuinely handy. Here's a real-time scenario: say you run a Get Metadata activity that returns the child items, all the files. If I want to store that information, I can create a variable and store it there, and I can also use dynamic content on top of the Set Variable activity.
Say I've stored those child items and want to pull out something specific, a particular file; I can do that too. A variable created through Set Variable is just a named slot, call it xyz, and you can store anything in it: activity outputs, pipeline run IDs, whatever. It's so useful because you can connect this activity to the ones that follow and reuse the value. Let me show you; the scenario is built around Get Metadata, inside Azure Data Factory.

In the portal I'll create a new pipeline, call it var pipeline, and to save time I'll copy the Get Metadata activity we already built: select it, Ctrl+C, open var pipeline, click the canvas, Ctrl+V. We know what it returns: an array, because we selected child items. First I'll create a variable in this pipeline: click the blank canvas, go to Variables, New, name it var_files, and set its type to Array, because that's what we're storing. Now drag in the Set Variable activity, connect it on success, and name it store files. In Settings it asks which variable to use: pick var_files from the dropdown. There are two options here, pipeline variable and pipeline return value. Pipeline variable stores the value inside this pipeline; pipeline return value lets another pipeline consume the output of this Set Variable, which is very powerful. For now choose pipeline variable. Then the value: click Add dynamic content (as I said, dynamic content works here) and select the Get Metadata activity's output followed by .childItems. Run it with Debug. It succeeds; the Get Metadata output shows the child items as usual, and if you expand the Set Variable output you can see it stored exactly that information, var_files holding both entries. From here I can connect anything after this activity and reuse that value whenever the pipeline runs. Got the concept?
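To make that dynamic content a bit less abstract, the value we just stored has roughly this shape. The activity, variable and file names here are made up, so yours will differ:

# rough shape of the Get Metadata output when 'Child items' is selected -- this is what a
# dynamic content expression like @activity('Get Metadata1').output.childItems points at
get_metadata_output = {
    "childItems": [
        {"name": "fact_sales_1.csv", "type": "File"},
        {"name": "fact_sales_2.csv", "type": "File"},
    ]
}

# the set variable activity stores that array; picking one file out of it later is just
# indexing -- in ADF dynamic content that would look like variables('var_files')[0].name
var_files  = get_metadata_output["childItems"]
first_file = var_files[0]["name"]
print(first_file)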
Now for something special: the storage events trigger. This is the trigger I was teasing, so let me explain what it is. Remember the Copy Data activity we built on the source, the one we called the manager pipeline? The scenario was that a manager drops a file and we copy it to the destination folder, and as data engineers we said we'd automate everything. Focus on this part, because it's a genuinely real-time scenario, interviewers will definitely ask about this trigger and its use case, and if you answer those two questions the third one will be how to implement it.

Suppose we schedule that pipeline every 15 minutes. The manager drops a file in the source, your pipeline runs 15 minutes later and copies the data: sorted. Then the manager adds nothing for the rest of the day, adds a file the next morning, your scheduled run copies it, then uploads another file two or three minutes after that, and again your pipeline eventually copies it. The dependency is entirely on the manager: you have no idea when he or she will add data, but the requirement is that as soon as a file lands in the source, your pipeline should do its magic and drop the data into the destination container. You might push back: tell me when you'll upload, or give me a fixed time, say 9:00 a.m., so I can put a schedule trigger on the pipeline and run it daily at 9. The manager says no: I might drop the file at 2 a.m., and you still need to copy it into the container immediately. One option is to stay awake 24 hours waiting for the message; the better option is to learn the storage events trigger. Does it solve the problem? Yes, and you don't need to stay awake. This trigger automatically fires the pipeline the moment the manager adds the data, and once that run completes it simply waits for the next file; if another file arrives five minutes later, it triggers again. That's its power: your pipelines stop depending on anyone's schedule.
You can tell your manager: even if you add the data at 3:00 a.m., my pipeline will copy it to the container immediately. That's the answer. Excited to build it? Let's go to our Azure Data Factory. To recap: this is the source container where the manager drops the file, and the pipeline for it is pipeline manager, so we'll attach a storage events trigger to that pipeline and it will fire automatically whenever data arrives. I'll show you every step, just hold on. Click Add trigger, then New (we don't have any trigger for this yet), and name it manager trigger, since this pipeline depends on your manager. The trigger type is Storage events. Now pick the storage account: first the subscription, then the account (mine is the storagedatafactory one), and then the path. You'll see "Blob path begins with" and "Blob path ends with"; you can provide either one. If you hover over the info icon, it also tells you to provide the container name in the format shown, and in our scenario the container is source. "Blob path begins with" means the path after the container: in our case the CSV files folder, and inside it fact_sales_1.csv.

And here's the best part. Let's add some complexity: suppose the manager will only ever upload that particular file, but might occasionally upload other files by mistake, and your pipeline must not trigger on the wrong ones. Choosy, I know, but it is what it is. Say the manager uploads a contacts file and realizes two minutes later it was the wrong one; you should not copy that data, so your trigger has to be resilient enough to react only to the right file. The trick is to define the complete path. If you're ever unsure of the format, hover over the info icon (its example uses a 2018 folder), but in our case it's the CSV files folder plus the exact file name, which you can copy straight from the dataset, fact_sales_1, dropping the stray single quote (we'll see whether it's needed; I don't think so). So the value is CSV files/fact_sales_1.csv.
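Conceptually, the trigger's path filters behave like simple prefix and suffix checks on the blob path inside the chosen container. This little sketch only shows the matching idea, not ADF's real implementation, and reuses the paths from our setup:

def should_fire(blob_path: str,
                begins_with: str = "CSV files/fact_sales_1.csv",
                ends_with: str = "") -> bool:
    # rough illustration: fire only when the created blob's path matches the 'begins with'
    # prefix and, if one is set, the 'ends with' suffix
    return blob_path.startswith(begins_with) and blob_path.endswith(ends_with)

print(should_fire("CSV files/fact_sales_1.csv"))   # True  -> the right file, pipeline triggers
print(should_fire("CSV files/contacts.csv"))       # False -> wrong file, nothing happens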
Next it asks which event should fire the trigger. We'll trigger once the blob is created, that is, once the file is uploaded, so select Blob created, then Continue, and Continue again. Normally we'd publish now, but before publishing I'll do one more thing. We've told the trigger to fire when this file appears in the source; once that data has been copied successfully, we should delete the source file, so the folder stays clean and every new upload from the manager is a fresh blob-created event, and we only delete it after the copy succeeds, never before. That introduces a new activity for you: the Delete activity. Name it delete file. What's its source? The CSV source dataset, and you can preview the data to confirm it's the right one; yes, it is. Logging settings: we don't need to enable logging. Now connect it on success; this is where those dependency nodes matter. We don't use "on completion": only once the data is successfully copied do we delete the file, otherwise we leave it alone. That's the automated pipeline, nothing manual in it. Publish all, and let me also delete the existing file from the source first so you can see the difference; we'll re-upload it later to test the trigger. The source is now empty, and I click Publish, because publishing is required if you want the trigger to work.

And now we see errors: "failed to activate the trigger" and "failed to get the subscription status" on the storage. The reason: to use this special kind of trigger, your subscription has to be registered for the resource provider behind storage event triggers. This is a new ADF setup and the subscription doesn't have that registration yet. The fix is simple: go to the Home tab, click Subscriptions, pick your subscription, and open Resource providers. Some services are registered, some aren't; search for "event" and you'll find Microsoft.EventGrid. My subscription doesn't have it registered yet, so click the three dots next to it and choose Register. It should only take a short while, and once it's done we can activate the trigger. They used to register this service by default; now you have to enable it manually.
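If you'd rather script that registration than click through the portal, here's a minimal sketch with the Azure SDK for Python. This is optional, and the packages (azure-identity, azure-mgmt-resource) and the placeholder subscription id are my assumptions, not something shown in the course:

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# register the Event Grid resource provider on the subscription (same as the portal steps)
client = ResourceManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
client.providers.register("Microsoft.EventGrid")
print(client.providers.get("Microsoft.EventGrid").registration_state)  # 'Registering', then 'Registered'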
Maybe it's because this is a special kind of feature and we're just special people who get to use this trigger; just kidding. Enable it, give it a little time, and once it's turned on we can start using the trigger and test the pipeline we've built. The Event Grid provider is now registered successfully, so let's go back to ADF and refresh (and make sure you've published your resources before this point, otherwise you'll lose your progress). Now, do we still have that trigger or do we need to recreate it? Let's detach the old manager trigger and create a fresh one: Add trigger, New/Edit, then New trigger. I'll name it man trigger this time, type Storage events, pick the subscription, the storagedatafactory storage account, container source, and blob path begins with CSV files/fact_sales_1.csv. We're just repeating the earlier steps; the file isn't there right now, but it will be. The event type is obviously Blob created. Continue, Continue, OK. This time we shouldn't see any error on Publish all... it's publishing... and publishing completes successfully. The only problem before was the missing registration; now that it's registered, no errors.

Are you curious about the status? In Monitor you can still see the pipeline runs from a few minutes ago when we tested the schedule trigger; go to Trigger runs, refresh, and there's nothing yet for man trigger, but it will fire the pipeline automatically. Let me show you. Go to the storage account, open that folder, and upload fact_sales_1.csv (you can follow the same steps). I've selected the file; the moment I click Upload it should trigger our pipeline. Upload... the data is in... refresh the trigger runs... yes! man trigger fired automatically. Click the run button next to it to jump to the pipeline run: it triggered our pipeline successfully, completely on its own, and the best part is that it has automatically deleted the source file as well, leaving the source clean for the next drop.
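And just like the schedule trigger earlier, what we published is a JSON definition underneath; roughly this shape, again written as a Python dict, with placeholder resource ids and names that are only illustrative:

# rough shape of the published storage events trigger (placeholders, not our real values)
manager_trigger = {
    "name": "manager_trigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/source/blobs/CSV files/fact_sales_1.csv",
            "events": ["Microsoft.Storage.BlobCreated"],
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                     "Microsoft.Storage/storageAccounts/<storage-account>",
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "pipeline_manager", "type": "PipelineReference"}}
        ],
    },
}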
Let me check the source folder, refresh... yes, the file is gone. I'll test it one more time, and this time I'll upload a different file, because we also covered the scenario where the manager drops some random data and the pipeline should not trigger. I've picked a random file called 2010 128.csv and I'll click Upload. Now I should not see a trigger run. The file is uploaded; go to Trigger runs, refresh... and the pipeline is not triggered, because the blob doesn't match the exact data file we set as the condition in the trigger. So our pipeline can pull the data the moment the manager submits it, whether at 2 a.m., 3 a.m. or 4 a.m., without us running anything manually, and without wasting any compute, because nothing runs unless the right file arrives. This is genuinely powerful; I love this trigger.

Now it's time to pull all our pipelines together, and to do that we'll discuss one more important activity: Execute Pipeline. It lets us embed pipelines within pipelines, so we can build one big pipeline and nest child pipelines inside it. This is so cool. Picture a big blue box as our main pipeline; inside it we have Execute Pipeline 1 and then Execute Pipeline 2, meaning the first runs, and only after it succeeds does the second run. We attach the trigger to that outer pipeline, which is called the parent pipeline; the embedded ones are the child pipelines (children, since there are two). So instead of attaching our storage events trigger to the manager pipeline directly, we attach it to the parent. What happens then? Do you want the full flow? Let me walk through it.
Say this is our parent pipeline, because we're about to build the end-to-end solution, combining all the pipelines and activities. The first Execute Pipeline depends on your manager's data: the manager adds the file, even at 2 a.m., and the data is copied to the destination. After that, the next step pulls the second dataset, fact_sales_2, straight from GitHub (remember, that data sits there). With both files in place, they flow through Get Metadata, then ForEach, then the If condition, then the Copy activity, and finally the data flow runs. That's the whole chain, and it's fun. To demonstrate it properly I'm going to delete all the data and let the pipeline bring everything back automatically: I'll set up the trigger and declare this my production pipeline, so that once data lands in the data lake, everything downstream fires on its own. This is orchestration.

First, detach every trigger from every pipeline, because the main trigger will be attached to the parent: detach it from pipeline manager, then detach the trigger on only selected files as well; now no pipeline is attached to any trigger. A quick overview of what we're doing. In the data lake, this is the source container; I'm emptying it, and emptying the other containers too, because we'll do everything from scratch for our production pipeline. The source is not managed by us: a manager or some other stakeholder, say a business analyst, drops the file there, and it must exactly match fact_sales_1.csv; that's the strict restriction we have from the manager. Once that file lands, it gets copied to the destination's CSV files folder. Then the GitHub pipeline copies fact_sales_2 into the same destination folder. Once the data is in the destination container, it triggers the only selected files pipeline: the validation activity checks that files exist, Get Metadata returns the child items, ForEach iterates over the array, the If condition checks whether a file name starts with "fact", and the Copy activity runs for every file that does. After that we perform the transformation with the data flow. That's the end-to-end flow; I hope it's clear. Now let's build it.
I'll create one new pipeline and call it parent pipeline (or prod pipeline; let's go with parent pipeline). Add an Execute Pipeline activity and name it execute manager pipeline; in Settings, select the pipeline it should run, which is pipeline manager. Then add another Execute Pipeline, connected on success, and name it execute only selected files pipeline (I'm keeping the names aligned with the child pipelines to avoid any confusion), and point it at the only selected files pipeline. Now attach the trigger: Add trigger, New/Edit, use an existing trigger, and pick man trigger, because the whole parent pipeline should fire when that file is added to the container. Continue, Continue once more, OK, and Publish all.

And we get a publishing error: validation says the trigger "selected files trigger" cannot be activated because it contains no pipeline. That's the old schedule trigger we detached earlier; since no other pipeline uses it, I'll just delete it (I could have stopped it instead, but I don't need it). Try Publish all again; I think that was the issue, but let's see whether anything else comes up. It's publishing, and once it's through I'll test this parent pipeline, the production workflow, end to end, with no manual intervention and no manual steps: we drop the file and everything should trigger automatically. And once it works, you can pat yourself on the back and feel relaxed and happy, because you'll have learned a huge amount of Azure Data Factory, triggers included. Publishing is complete, you can see the status, so now I'm going to test the whole parent (production) pipeline, and you already know the test case.
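Before we run it, here is roughly what the parent pipeline we just wired up looks like as a definition: a Python dict mirroring the pipeline JSON, where the activity and pipeline names are simply what I chose, so yours may differ:

# rough shape of the parent pipeline definition (illustrative names)
parent_pipeline = {
    "name": "parent_pipeline",
    "properties": {
        "activities": [
            {
                "name": "execute_manager_pipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {"referenceName": "pipeline_manager", "type": "PipelineReference"},
                    "waitOnCompletion": True,   # don't start the next child until this one finishes
                },
            },
            {
                "name": "execute_only_selected_files",
                "type": "ExecutePipeline",
                "dependsOn": [
                    {"activity": "execute_manager_pipeline", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {
                    "pipeline": {"referenceName": "only_selected_files", "type": "PipelineReference"},
                    "waitOnCompletion": True,
                },
            },
        ]
    },
}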
We'll drop fact_sales_1.csv into the CSV files path in the source container; once it's uploaded, the whole production pipeline should trigger, and we'll watch it from the Monitor tab. Let me upload the file: select it, click Upload... fingers crossed... refresh... and the man trigger run appears. But wait, we also need green on both child pipelines, so click the run button to open the parent run. The whole workflow is effectively starting over, because I deleted all the data in front of you, so everything should now come back on its own. The first pipeline is successful; now I want to see green on the second one as well.

We're just about to complete the course, and I hope you've learned a lot, including some real-time scenarios with triggers and real-world use cases. If this course is helping you, please help the channel grow: I'm still small, and my whole intent is to give you as much data engineering knowledge as I can. I upload long-form videos weekly and put a lot of effort into them, and I genuinely enjoy talking to you through them, so hit subscribe, like the video, comment, and share it on different platforms; the more you engage, the more the channel can grow and the more you can learn. A small request from your buddy, and I can already see the love in the comments.

Let me refresh. It's still running, which is expected: this pipeline includes the data flow, and if the Spark cluster is turned off it takes a while to spin up, so some delay is normal, but we should see green at the end, because we've worked hard to build this end-to-end production pipeline. One more thing: if you'd like me to build a complete end-to-end project based only on ADF, with more real-time scenarios and harder situations (these are nothing compared to the complex solutions out there in data engineering), let me know in the comments. And there it is: both pipelines are green.
I'm genuinely happy you've built this solution end to end, and trust me, you're learning a lot. Even if you understood every concept in one go, I'd still recommend rewatching the harder parts, like the pipeline we built with ForEach, the If condition, Get Metadata and the Copy activity, and rebuilding those complex pipelines yourself, watching that section two or three times if needed. I do the same: I revisit and re-practice these concepts many times, and it genuinely helps.

Now I want to validate everything; I should see data in every container. Start with the destination folder: it should have two files, fact_sales_1 and fact_sales_2... but it only has one. Why? A silly mistake, not a big one; see if you can spot it before I tell you. We never included pipeline git, the one that copies the data from GitHub, and nobody else was going to include it for us. Nothing to worry about: open the execute manager activity to jump into pipeline manager, and we'll bring that pipeline in right here with an Execute Pipeline activity. This is a nice lesson in itself: you can connect an activity to a pipeline, not just a pipeline to a pipeline. What we want is that once the data is copied into the CSV files folder and the source file is deleted, the git pipeline runs too. So add Execute Pipeline, call it execute git pipeline, point Settings at pipeline git, and connect it on success. That was the missing piece; is any other pipeline missing? No, we're good. It's a good test case, and you should get comfortable finding and fixing this sort of thing, because troubleshooting and debugging is an art, and look how quickly we found the bug. Do I need to delete the data again? No, it's already gone thanks to the delete activity. Click Publish all... it's publishing... published. Now we'll add the file again, trigger the flow again, and finally watch the whole production pipeline; yes, it matters, because we're not just completing a course, we're developing skills. Go to the source container (it's empty apart from that random file) and upload fact_sales_1.csv once more. It should run cleanly this time; everything was green before, we simply forgot to wire in the one pipeline
that copies from GitHub. No worries: the file is uploaded again, so let's check Trigger runs. Is it triggered? Yes, it is. Click the run button and open the production pipeline: both child pipelines should go green again, we'll watch the flow one more time, and then we should see every file in the correct location. The first pipeline has already run successfully; now we're waiting on the second, and once it's done we'll validate the data. I'm tired, but I'm really happy, and I know you loved this video; tell me in the comments and share it with your friends and other data enthusiasts, help them grow too. Always stay curious about trends and updates, because that's what this channel is dedicated to: covering the fundamentals and the basics, but also the real-time, real-world scenarios we actually use in the industry. Knowing only the basics won't make you competitive in interviews, and even though you'll keep learning once you join an organization, you should always try to stay up to date.

Let me share a quick story. Yesterday I was working through a book on Delta tables (I'll share which one, and what I learned from it, in an upcoming video), reading some of the more complex Delta Lake topics. Before I'd even finished it, I hit something fishy while debugging: the behavior didn't match the book, even though the book only came out last year. After some research I found they'd introduced an update just last month, in October 2024. Can you imagine how fast this industry moves? Here's a hint: it's related to deletion vectors in Delta tables. Don't worry, I'll cover it, and I'll be creating more detailed Delta Lake videos. I've already covered Delta Lake in depth in a project video and in my Databricks masterclass; if you haven't watched that one, the link is in the i button, because Delta Lake is something we genuinely use in the real world and you can learn it easily from that video. I'm not just promoting my own content; I know how well
those critical parts are explained in that video and how easily you can absorb that knowledge. Just trust me, watch it and you'll learn a lot. In upcoming videos I'll also try to include the latest updates, whether in Delta Lake or any other tool or technology; you know me by now, this person is crazy about data engineering and about sharing knowledge, so all I need is your help and support.

Let's refresh... it ran successfully; the second pipeline has succeeded too. Now we should see everything in place, so let's go back to the containers. First, in source we have the file our manager uploaded. In destination we should now see two files (please, Azure Data Factory, show us two files)... perfect, both are there. Next, the reporting container: in the CSV files folder we should see only one file; click in, and yes, just one. Then the data flow output folder, where we should see the aggregation output. Why are there two files here? Because one is the previous output: check the timestamps, 5:17 for the old one and 5:31 for the latest. Remember we used the insert option in the data flow, because we don't want to remove data that the reporting team, BI team or developers might already be using, so new output is inserted alongside the old. Common sense, and you've now covered all of it.

Just a few hours ago we knew nothing about data engineering with Azure Data Factory; now look at your pipelines. You've used the validation activity, Get Metadata, ForEach with data transformation, an If condition inside it and a Copy Data activity inside that; in the production pipeline you've chained pipeline after pipeline, including the git pipeline, and the whole flow is triggered automatically by a storage events trigger. You've learned so much, and I'm happy to have shared it with you. That was all for today's video, so share it with as many people as possible. And here's a surprise: if you're a total beginner who knows nothing about data engineering, I've created a data engineering masterclass you can watch; and if you want to learn PySpark, one of the most in-demand technologies right now, I've created a free, dedicated six-hour full course on PySpark from scratch. Check the video coming up on the screen, and I'll see you there.