Transcript for:
Azure Data Warehouse Architecture

This is the only Azure Databricks end-to-end project you need in 2025 to crack the interviews and master Azure Databricks. Really? Why should you trust me? Because this project will show you how to incrementally load data using Spark Structured Streaming and ingestion pipelines, then it will demonstrate all the key PySpark functions, including Python OOP concepts such as classes, and use Unity Catalog functions to populate the silver layer. In the gold layer you will build a star schema and slowly changing dimensions, both Type 1 and Type 2. And not only this, you will also learn about Delta Live Tables, which is the most in-demand technology right now in the world of Databricks. And wait, there is one more thing: you will build end-to-end ETL pipelines and workflows in Databricks. I think now you can trust me. Why did I cover so much in this one project? Because Databricks is one of the most in-demand technologies right now and I want you to master it. I say master it because it is very easy to get started with Databricks, but it takes serious attention, focus and strong hands-on practice to actually master it, and I have taken the responsibility to make you a pro developer in the world of Databricks. If you want to support my dedication, just hit that subscribe button, like and comment on this video, take out your notebook, prepare your coffee, and let's get started with this end-to-end Azure Databricks project. What's up, my lovely data fam, and what's up, strangers; hit that subscribe button and I will welcome you as data fam too. You are going to love this video, trust me. I am saying this at the start because I am recording this introduction at the end; I wanted to build the entire project in one flow and keep it raw, and let me give you a little spoiler: what we have covered is amazing. I know you are excited, and I know you are dedicated to landing your dream role in your dream company; that is why you clicked on this video. Let me give you the whole architecture, because I have just finished the recording and this video is insane. We are mastering Databricks in this video by building an end-to-end project, and when I say end to end, I mean it. We also want it to be a real-world, real-time project, and trust me, there is a difference between a hobby project and a real-world project: in a hobby project you do not perform administrative tasks, you do not configure security, you do not set up cross-resource communication between applications. In this project you are going to do all of that. So let me walk you through the architecture.
First of all, we are obviously going to master this technology, Databricks. We are going to build a medallion architecture, which means our data will be distributed across bronze, silver and gold layers, and our data will be coming from GitHub and Azure. Let me give you another spoiler: we are going to perform incremental loading, which includes idempotency, meaning exactly-once processing; once you have processed your data, you do not need to process it again. We are going to use Spark Structured Streaming to do that incremental loading. So in the bronze layer you are going to learn incremental loading, Spark Structured Streaming, idempotency and, fourth, automated ingestion pipelines, which is a new feature in Databricks. And the best part: we are going to work with columnar file formats, also called big data file formats, instead of plain CSVs; we will work directly with Parquet files. The moment you have dumped your data in your data lake, how do we govern it? We are going to use something called Unity Catalog, the most in-demand data governance solution in the modern data world, and trust me, Unity Catalog is the backbone of Databricks right now and the backbone of all Databricks interviews as well, mark my words. So we are going to leverage Unity Catalog, and, as I mentioned, you are going to perform all the administrative tasks too, because nowadays companies are looking for developers who can handle administrative tasks as well as development. Organizations do have dedicated people for administration, but they want talented engineers who are knowledgeable enough to cover those tasks in their absence, so you need that knowledge as well. Then let's talk about the silver layer. This silver layer is really important, for three reasons. First, we are going to learn many in-demand PySpark functions. Second, we are going to learn how to leverage OOP concepts in Python and integrate them with our PySpark code. Third, we are going to learn functions in Unity Catalog for code reusability, which is again a new feature in Databricks, so you will master that as well. Those three reasons make this silver layer special, and we are going to save our data in Delta format in this layer.
Then, once the data is in the silver layer, comes what I would call the most important part of the video (all parts matter, but still): the gold layer. Why is it the most important? Because, as you can already see on the screen, we are going to build a star schema, which means we will create dimension tables and a fact table, we will build Slowly Changing Dimension Type 1 and Type 2, and we will also learn Delta Live Tables. SCD Type 1 you will build yourself, writing every line of code, and SCD Type 2 will be created with Delta Live Tables, so you will have a true end-to-end understanding of the gold layer. Delta Live Tables is again the most in-demand area in the world of Databricks right now, and you will master it here too. Then, once our star schema is ready, we will use the warehouse. Yes, we have warehousing capabilities available in Databricks itself, and there are outside tools as well, such as Synapse, Snowflake and many others. We are going to share those endpoints with Power BI. We do not need to build the Power BI dashboard from scratch, but as a data engineer you should know how to establish the connection to Power BI so that your data analysts can efficiently build reports on top of those sources. And this is a very high-level overview, not even 1% of the video; think of it as a trailer. We have covered so many interview questions along the way, because while building I kept pointing out "this is an interview question" or "this could be one", so there are lots of tips, tricks and debugging strategies too. Once you build this project and completely understand what I am doing in it, it will make you an outlier in interviews and genuinely help you land a job; you just need to explain it with confidence, because we have followed real-world strategies and approaches, so you can honestly say you are well versed in all these areas. Now you will be saying, "Ansh, we are fully charged up, we also want to build this project, so what are the prerequisites?" Here they are. First, a laptop or PC with a stable, or almost stable, internet connection. Second, a Databricks account; if you do not have one, don't worry, we will create one for free, and we will attach cloud storage to it as well, because in the real world we always attach a cloud storage solution to Databricks; both accounts will be free. Third, and this one is non-negotiable: excitement to learn Databricks end to end. Databricks is easy to start with but not easy to master; you need patience and you need excitement, and without excitement you cannot learn any technology, forget about Databricks. You should feel excited that you are learning something new, a technology that is very much in demand and that not everyone knows.
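Before moving on, here is a minimal sketch of the SCD Type 1 pattern I mentioned for the gold layer: a Delta Lake MERGE that overwrites matching rows and inserts new ones. The table and column names are hypothetical; SCD Type 2 is handled later with Delta Live Tables rather than with hand-written code like this.

```python
from delta.tables import DeltaTable

# Hypothetical names: an existing gold dimension table and a dataframe of new/changed rows.
dim_customers = DeltaTable.forName(spark, "databricks_cat.gold.dim_customers")

(
    dim_customers.alias("tgt")
        .merge(
            updates_df.alias("src"),             # updates_df: latest customer rows from silver
            "tgt.customer_id = src.customer_id"  # business key
        )
        .whenMatchedUpdateAll()                  # Type 1: overwrite old attribute values in place
        .whenNotMatchedInsertAll()               # brand-new customers are inserted
        .execute()
)
```

Type 1 keeps no history, which is exactly why the merge can simply update in place; Type 2, which preserves history, is where Delta Live Tables will do the heavy lifting for us later.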
I think that's enough introduction. Now let me give you a quick overview of the data and the repositories we are going to use. And do hit that subscribe button, share this video, like it and comment on it; it really helps me grow, and I am putting a lot of effort and time into these videos so that you do not need to pay thousands to anyone. Let me show you the data. This is the repository from where you can download it. This time we are not going to work with CSV files. Why? Because you are becoming a data engineer in today's era, or you want to excel as one, and while CSV files are important, they are just text files. You should know how to work with columnar file formats, the big data file formats such as Parquet, and how to process them. So what I did: I took the CSV files, converted them into Parquet, and split the data incrementally so that you can feel how incremental loading works. We are building incremental flows, incremental processing, incremental notebooks, everything, and you will actually validate the data as well, because we have incremental data files in Parquet form. To give you a glimpse of what is in these files: we are going to work with a retail dataset containing orders, customers and products. Why did I pick this dataset? Because you are mastering Databricks today, and when you feel aligned with the dataset, you learn the technology faster. Whether you are becoming a data engineer or already are one, you have probably worked with similar datasets, in school, in college or in your current job: orders, customers, products. When we join those tables, you will actually feel what is in the data. If I had picked some IoT dataset, you would say, "Bro, what even is this data?" That is why, when you are learning, it is important to pick an easy dataset: you are not mastering a data source, you are mastering the technology and the end-to-end processing flow. Pick datasets you feel aligned with, not complex ones; datasets where you immediately think, okay, this is my order, this is the product, this is the category, this is the customer. Your learning curve will definitely go up with this approach, and that is why I picked these datasets.
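Just to illustrate the preparation step I described (converting the CSVs to Parquet and splitting them incrementally), here is a rough PySpark sketch. The file paths and the split condition are made up for the example; you do not need to run this, since the Parquet files are already in the repository.

```python
# Read the original CSV (header row, inferred types) and write it back out as Parquet.
orders = (
    spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("/tmp/orders.csv")                  # hypothetical input path
)

# Hypothetical incremental split on a date column: a "first" chunk and a "second" chunk.
orders.filter("order_date <  '2024-01-01'").write.mode("overwrite").parquet("/tmp/orders_first")
orders.filter("order_date >= '2024-01-01'").write.mode("overwrite").parquet("/tmp/orders_second")
```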
As I just mentioned, you will be working directly with Parquet files, because in the real world you will rarely work with CSV files. You will still encounter CSVs, but Parquet is the format you should really understand: how it works under the hood and how to process it. Trust me, once you are a data engineer you are going to hate CSV files and love Parquet files. Why? Because those text files do not hold any schema, whereas Parquet files store the schema in the footer of the file. See, this can be an interview question, and we are going to cover these little interview questions throughout the video, so no need to worry. These pieces of information are really helpful when you are sitting in an interview; someone can ask a simple question that is not directly aligned with the role, and these are things you should know even though they are not written in any roadmap, which is why I will keep mentioning these small questions along the way. So this is the dataset we are going to work with, and you can simply download all the files from the repository. You will not see any preview, because these are Parquet files; just click download and the file will be downloaded. I have already downloaded everything, so that's it for the data. Once you have all the data on your local machine, it is time to create a free Azure account. You will say, "Why are we creating an Azure account if we are going to work with Databricks?" Exactly because, in order to work with Databricks, you need a cloud account associated with it so you can see how things actually work: when you master Databricks, you attach that workspace to some cloud, and Azure is the most in-demand cloud right now. And let me tell you one thing: when you attach a cloud to any processing solution, and we are learning Databricks here, you need to work through many administrative tasks, such as allowing Databricks to read the data, creating external locations, creating the Unity Catalog and so on. These are the things that create the difference between a normal hobby project and a real-time project. It is very easy to say "this is a real-time project", but there is a definition: a real-time, real-world project is one where you actually build everything from scratch the way you would in the real world. You configure the security, the permissions, your Unity Catalog, your catalogs, schemas, tables, your medallion architecture, everything, yourself. That is a real-time project; it is not just opening Databricks and writing a few lines of code. A real-time project is something else entirely.
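As a quick aside on the earlier point that Parquet stores its schema in the file footer: you can see it for yourself. Here is a small sketch using pyarrow; the file name is just an example.

```python
import pyarrow.parquet as pq

# read_schema() only reads the file footer, it does not scan the data pages.
schema = pq.read_schema("orders_first.parquet")
print(schema)          # column names and types recovered straight from the footer

# The same schema shows up automatically when Spark reads the file:
# spark.read.parquet("orders_first.parquet").printSchema()
```

This is exactly why Parquet does not need the header-and-guess game that CSV forces on you.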
And this is a real-time project, because we are going to cover almost everything in Databricks, and you are going to master Databricks today with this project. We will be using Azure Data Lake as our storage solution, our data lake, to preserve our data. Makes sense? Very good. Now let's create your Azure account; if you already have one, use your existing account, and if not, let's create your first one and get started. I hope most of you already have a free trial, but this part is for those who do not have a free Azure account or who are coming from a different cloud background. Simply open a browser tab and search "Azure free account", click the first link, and then click "Try Azure for free". You will land on a page that asks for your email ID and some credentials. If you have a Microsoft account, put it here; if not, click "Create one" and register your normal Gmail account with Microsoft, or create a new Outlook account; either works. Keep that email ID with you, because you will need it here. When you click next, it takes you to a web form asking for personal details such as name, phone number and email ID, and at the end you click sign up. At that point it will ask for card details and some billing information. Do not worry, they will not send you a bill, because this is a free account; they just need to verify that you are the authorized person using the services, because Azure gives you 200 USD of credits to use their services, and those credits last for one month. Your question will be, "What happens after 30 days?" Nothing: they simply suspend your services, and they will not send you any bill. So that was the prerequisite of creating a free account. Once you have it, go to the Azure portal: type portal.azure.com, hit enter, select your email ID, and you can log in to the portal with the same email ID you just created. Simple. Now let me take you to my Azure account and give you a quick overview. This is my Azure portal account, and this is the homepage that you will also see in your account.
Don't worry if yours looks a little different: these tiles show recent services you have used in the past, and if this is a brand-new account you obviously will not see them yet; common sense. I personally love this UI. Let me give you a quick overview of Azure; we are not discussing Azure in detail, we are focusing on Databricks, but you should have some basic understanding. This homepage is essentially a recents page, showing the resources and tabs you opened recently, and the menu bar lists some of the resources that are really popular and used on a daily basis. Even if you are learning only and only Databricks, there is one thing you must know here: all resources and resource groups. What are those? A resource group is the backbone of your resource management, because anything you create or use as a service in Azure is called a resource. Azure Databricks is a resource; a Data Lake storage account is a resource; Azure Data Factory is a resource. And every resource needs to be placed in a folder, and that folder is called a resource group. Keep it simple; life is already complicated, so do not make it more complicated by defining a resource group as some hard construct in which we store our resources. It is just a folder, the same kind of folder you create on your machine to save files; here you put your resources in it. In our case we will be creating Databricks and a data lake, so those resources need to go into a folder; let's create that resource group. Click the search bar, search "resource groups", and you can see I already have many because I am too lazy to delete them. Click "+ Create", provide a name; I will say something like "rg-databricks-etl-end-to-end"; catchy name, right? Pick a region; I will pick UK South (why? who lives there?). Click "Review + create" and then "Create", and that is done. Now I can search for my resource group, click on it, and here it is, obviously empty because we have not created anything yet. First of all we will create our data lake, because the data lake is obviously important; without a data lake we cannot really do anything.
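For reference, the resource-group step we just did in the portal can also be scripted. A minimal sketch with the azure-identity and azure-mgmt-resource packages, assuming you have a subscription ID and are already logged in locally (for example through the Azure CLI); the names are the same placeholders I used above.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"           # placeholder
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Equivalent of the portal's "Create resource group" form.
rg = client.resource_groups.create_or_update(
    "rg-databricks-etl-end-to-end",
    {"location": "uksouth"},
)
print(rg.name, rg.location)
```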
Say you are processing your data; at the end of the day you need to store it somewhere, and that somewhere is the data lake. "Ansh, are you kidding me, Azure does not have a data lake service?" We do have data lakes, but we create them indirectly. By default we create blob storage in Azure, and we configure that blob storage in such a way that it becomes a data lake. "You just said life is already complicated, why are you making it complicated?" Sorry about that; it is actually very simple. Click "+ Create" and you will land in the Marketplace; this is the place from where we create all our resources, and here you will find first-party Azure services as well as third-party offerings. Search for "storage account", because there is no resource literally called "data lake"; you will see many storage solutions, but pick the Microsoft one, click on it, and click "Create". Notice that this time the resource group field is already populated, because we started the creation from inside the resource group; common sense. Give the storage account a name; I will say something like "databricksete". One quick note: you cannot pick that exact name now, because I have taken it and a storage account name has to be unique across the whole Azure network, so pick another one. The region is fine, and for redundancy pick LRS, locally redundant storage, where the replicas of your data stay within the same data center; it is the cheapest option, and on a free account it does not really matter. Click next, and here is the thing I was talking about: when you check this box, you are telling Azure that you do not want just blob storage, you want a data lake. A data lake is built on top of blob storage; by default blob storage cannot store data hierarchically, so you cannot create hierarchical folders inside a plain blob storage account, but in a data lake you get that option. If you are doing big data analytics you need hierarchical folders so you can organize and analyze the data, and that is why we call it a data lake; and yes, the data lake is the hottest topic, because everything is built on data lakes nowadays. So check the hierarchical namespace box, click "Review + create", and click "Create". It will deploy your data lake in just a few seconds; let me refresh it. So what is the second resource we are going to create? Databricks. But before creating Databricks we will set up our data lake, because we are following the medallion architecture, so we need to create three layers, bronze, silver and gold, and we are also going to create a container to hold our source data.
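If you prefer code over the portal, here is a hedged sketch of the same storage-account creation with the azure-mgmt-storage package. The key detail is is_hns_enabled=True, which is that hierarchical-namespace checkbox that turns plain blob storage into a data lake; all names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<your-subscription-id>"            # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    resource_group_name="rg-databricks-etl-end-to-end",
    account_name="databricksete",                     # must be globally unique
    parameters={
        "location": "uksouth",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},              # locally redundant storage
        "is_hns_enabled": True,                       # hierarchical namespace = ADLS Gen2 data lake
    },
)
account = poller.result()                             # wait for the deployment to finish
print(account.name)
```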
Makes sense? Good. The deployment is done, so click "Go to resource" and it takes you straight to the data lake. If you are seeing this for the first time, do not feel overwhelmed: within a storage account we get four storage solutions. The first is Containers, which is our data lake; File shares, Queues and Tables are the other three services that come with it, which we do not use much but which you should know about. A file share is like a SharePoint-style solution where you share files with your whole team; Queues is a solution similar to Kafka where you store messages in a queue; and Tables is a solution similar to Cosmos DB where you store semi-structured data, basically JSON key-value pairs. But 99% of the time, when you create a storage account, you will simply use Containers, the data lake, so let me click on it. Inside this area we create containers: click "+ Container" and create one called "source"; in this source container we will put all our source Parquet files. Open it and create directories, four of them. Click "+ Add directory" and create "orders" first; go inside it, click Upload, and upload only the orders_first file from your system. Do not upload both orders_first and orders_second, because we will be loading the data incrementally; we will upload the second file once our notebook is set up. Go back to the source container, create another directory called "customers", and do the same thing: upload only customers_first. Create another directory called "products" and upload only products_first. You basically have four files: three major ones and one small mapping file. Back in source, create the fourth directory, "regions", and upload the regions file. Oops, I uploaded it at the container level by mistake, so let me delete it, go inside the regions directory and upload it there. Perfect, our source is set up. Now we do not need to do anything else there; we simply create empty containers called bronze, silver and gold. Click "+ Container" for each one, because we are following the medallion architecture and each layer should be isolated from the others, hence a different container per layer. Perfect, everything is set up. Now go to Home, search for your resource group, click on it, and you will see your storage account there.
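The same container, directory and upload steps can also be done with the azure-storage-file-datalake package; here is a rough sketch, with a placeholder account name and hypothetical local file paths, assuming your user already has data access on the account.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage_account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Medallion containers plus the source container.
for name in ["source", "bronze", "silver", "gold"]:
    service.create_file_system(file_system=name)      # raises if the container already exists

# Directories inside 'source', uploading only the *_first files for now.
source = service.get_file_system_client("source")
for folder, local_file in [
    ("orders", "orders_first.parquet"),
    ("customers", "customers_first.parquet"),
    ("products", "products_first.parquet"),
    ("regions", "regions.parquet"),
]:
    directory = source.create_directory(folder)
    with open(local_file, "rb") as f:                  # hypothetical local paths
        directory.create_file(local_file).upload_data(f, overwrite=True)
```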
Now let's create our Databricks workspace, because that is the main thing we are going to use for this entire project. I know we are going to learn a lot in this Databricks project; it is designed so that you master everything: orchestration, pipelines, PySpark, the silver layer, the gold layer, slowly changing dimensions and more; you already know the list, and trust me, you are going to master these things. Let's search for it: simply type "Azure Databricks" and pick this one. You will also see a related offering in the results; don't worry, we will create that one as well, and I will tell you why in a moment, but first create the Databricks workspace. Click "Create", and here you will start to feel the difference between a hobby project and a real-world project. For the workspace name I will just say "databricks"; makes sense. The region is UK South, and the pricing tier is important: I know many of you are on a free Azure account, and since this workspace is just for learning, instead of choosing Premium I will pick the Trial (Premium, 14 days free) tier, which is exactly the same as Premium but available for 14 days, and that is enough for us. Then it asks for a managed resource group name; this is optional, and if you are using Unity Catalog it is not even required. It is only used to manage your clusters and virtual machines, and in the world of Unity Catalog the rest is managed in our own storage location. If you do not provide a name, one is created automatically, and we are fine with that because we are not going to monitor that group in this project. Click next, then "Review + create", then "Create". It will take some time to deploy the workspace, but we have some work to do meanwhile, and let me tell you why I said we are going to create one more resource. Go to the resource group. This is our storage account, and imagine the Databricks workspace is there as well. Now this Databricks workspace wants to read some data from the data lake. How will it read that data? They are two different resources, so the answer is: by default, it cannot. "Then why did we create it, Ansh?" Because we know it cannot read the data yet, but we are going to allow it to, by granting permission. Here is the thing: if we were using Azure's own resources, say Azure Data Factory or Azure Synapse Analytics, we could directly give the resource access to the data lake. Databricks is a first-party service in Azure, but it is not owned by Microsoft; Databricks is a separate company. So Azure gives us something called a connector.
This connector is made specifically to allow access to the data lake; it gets installed, or let's say embedded, inside your Databricks workspace, and once it is embedded, Databricks can easily access the data lake through it. So let's create that connector and then allow it to use the data. Click "Create", search for "databricks", and you will see "Access Connector for Azure Databricks". Click on it, click "Create", provide a name; I will call it "databricks-etl-connector" (I have several old connectors I never deleted; ignore those). Then click "Review + create" and "Create". That's it; it deploys quickly because it is a small resource, and by now the Databricks workspace deployment is done as well. Go back to the resource group and refresh it, and you should see three resources: the data lake, Databricks, and the connector between the two. Simple logic. Now, without wasting any time, let's go inside our Databricks workspace: click "Launch workspace" and it will ask you to log in; pick the email account of your Azure subscription. So this is the Azure Databricks workspace, and this new UI looks amazing. In this project we are going to cover everything relevant for us, meaning for data engineers, in detail, but let me give you a quick overview first. On the left is the navigation pane. "Workspace" is basically your repository for all the things: your notebooks, your ETL pipelines, your workflows, organized just like folders. The "Recents" tab needs no explanation. "Catalog" is really important: it is the area where we register all the things, our catalogs, schemas, tables, functions, and volumes. Yes, functions and volumes too; don't worry, you will learn about all of them. "Workflows" is the ETL capability we have within Databricks itself. Can we do ETL there? Yes, and that is exactly why I did not pick Azure Data Factory for this project: I wanted you to learn the ETL capabilities within Databricks as well, and that is really important. "Compute" is what you use to process anything, or I would say everything; compute is the cluster you create, your Spark cluster with its driver node and executor nodes, and you can create it through a simple UI instead of writing spark-submit commands again and again. "Marketplace": yes, Databricks also has a marketplace from where you can consume shared resources. Then there is the newly added area focused on SQL warehouses.
Once you create your warehouse, or lakehouse, you can connect to it through external endpoints and share that link with data analysts and data scientists, and whenever they query the data, those queries are optimized through this area, which is called the SQL warehouse. It really is amazing: for SQL warehouses Databricks has launched its own clusters, optimized to run SQL workloads only, so when you query data using that compute you get much faster results compared to a job cluster or an all-purpose cluster; those clusters are built specifically for running SQL queries. By default we get a serverless SQL compute, which is the same compute we are going to use in this project, because it is already created for you, it is not expensive, and it is lightweight and enough for our transformations. Yes, we will also create some special computes in this video, because we are building Delta Live Tables and a few other things, but to get started this kind of cluster is fine. Then there is the Data Engineering section, with job runs and data ingestion pipelines, and a Machine Learning section where all the model building lives; we are not covering machine learning here. Click on SQL Warehouses and you will see the "Serverless Starter Warehouse"; this is the compute that is already there for you, and if it is not there in your case, simply click "Create SQL warehouse" to create a compute for SQL workloads. Under Compute you have all-purpose compute, the compute we create traditionally; job compute, which is used for production scenarios or when you build Delta Live Tables; and Vector Search, which is newly added: a kind of vector database used by data scientists to perform vector searches over data represented across multiple dimensions, so do not worry, it is not your area. Then there are Pools: a pool is basically a set of machines available to you, and you can define the minimum number of machines that should always be up and running for you and the maximum number of machines. Then Policies: these are cluster policies, the most famous one being Personal Compute. These are pre-built policies in which you do not get many things to configure; if you look at the job compute policy you will see some fields already filled in for you, and you can also create your own policy, plus there is Shared Compute. Several of these are for all-purpose clusters and one is just for the job cluster; Personal Compute is the most popular one, where you just click the button, everything is filled in for you, and you click create. And finally there are Apps. Okay, enough overview.
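To make the "share the endpoint" idea concrete: a BI tool, or even a plain Python script, talks to a SQL warehouse through its connection details. Here is a minimal sketch with the databricks-sql-connector package; the hostname, HTTP path and token are placeholders you would copy from the warehouse's "Connection details" tab.

```python
from databricks import sql   # pip install databricks-sql-connector

# Placeholders copied from the SQL warehouse's "Connection details" tab.
connection = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token="<personal-access-token>",
)

cursor = connection.cursor()
cursor.execute("SELECT current_catalog(), current_schema()")
print(cursor.fetchall())

cursor.close()
connection.close()
```

Power BI uses the same two pieces of information, the server hostname and the HTTP path, in its native Azure Databricks connector, which is exactly what we will hand over to the analysts later.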
Now let's get started with our workspace. Click on Workspace, click "Create", then "Folder", and give it a good folder name; I will simply say "databricks-etl-project". Click "Create"; inside this folder we will create all of our notebooks. Now let's start ingesting our data. We know we are going to ingest our data incrementally, that's true, but I told you we have one mapping file that is fixed; we are not going to pull it incrementally. So how do we ingest that data? I am going to show you an amazing feature introduced by Databricks, and I love it: with this feature you do not need to write any code to pull the data, it is totally no-code, and, hold on, you do not even need to create the table; the table is also created for you. A lot of features like this will be discussed today, so just hold your excitement. Before that, the most important thing: even before getting started with anything, we need to set up Unity Catalog. How do you do that? First you need to go to the Databricks account admin console. Click on this drop-down and you should see something called "Manage account". Sometimes you will not be able to see this, because this option is only visible to administrators. "But I am the only owner of this account, why can't I see it?" Here is the thing: this is the email ID you think you are using, but do you know the actual email ID that Azure uses? It ends with something like "#EXT#". Let me show you: go to Azure, go to Home, search for Microsoft Entra ID, click on it, go to Users, and you will see that your email ID has a user principal name like that; this is the original email ID it uses. So if you are not able to see "Manage account", the solution is very simple: go to Google and search for the Databricks account console, accounts.azuredatabricks.net, hit enter, and it will ask you for your email ID (it will not ask me, because I am already logged in). Listen carefully: when it asks for your email ID, do not provide your normal gmail.com address; provide that long email ID, your user principal name, then enter the password (it may ask you to reset the password first; do that), and you will land in the account console. Let me show you something: if I click here you can see my long email ID, so I am not logged in with my normal Gmail account, I am logged in with my long email ID. "Hold on, Ansh, if I am not logged in with my normal email ID, how can I see Manage account?" I will tell you.
Hold on. First of all, just do this and land on the account console page. Once you are there, here is what I did: I went to User management, because this is the top level of our user-management hierarchy, and I added my normal email account as a user. Click "Add user", enter the email, give the full name (obviously your name), and click "Add user". Then I assigned it the account admin role. I think there are only three roles; let me check by adding a dummy user, say hellobro@gmail.com: click "Add user" and it asks for roles; see, there are three: account admin, marketplace admin and billing admin. So I made myself an account admin; you can add marketplace admin as well, but that is not mandatory. And if you just want to add a user without making them an admin, that is up to you; you do not have to make every person an admin. I can simply save this user with no roles, and it is saved as a user with no privileges. But if you want to make yourself an admin, pick account admin. Let me delete this dummy user, otherwise next week I will be wondering who this is: click the three dots, "Delete user", confirm. Sorted. Now let's talk about the main thing: go to Catalog in the account console. We need to create something called a Unity metastore. Whenever we want to enable Unity Catalog for our Databricks workspace, we first have to enable the Unity metastore; the metastore comes at the top. Let me show you the hierarchy: search for "Unity Catalog" in the documentation and you will see it. At the top we create our Unity metastore, then we assign our Databricks workspace to that metastore, and when we do that we are enabling Unity Catalog. What is the advantage? When we enable Unity Catalog, that catalog can be shared with other Databricks workspaces as well, if we allow it. Don't worry about this too much for now; you will understand what Unity Catalog is as we go, just stay with the video. Now, "Ansh, who created these three metastores I can see?" Me, and I am not going to delete them now, so just ignore those three. "And this other one?" Azure Databricks created it for you by default, because whenever you create a Databricks workspace you get one Unity metastore for free. How costly is it? It is free, but I do not need it, because I want to create my own: I will click on it, click the three dots, click delete, copy its name to confirm, and delete it. Now let's create our own metastore, because if you are sitting in an interview the person may ask exactly this: how do you delete a metastore, and how do you create your own?
You should be prepared for all those questions, and with confidence; confidence is very, very important. And mark my words: whoever is taking your interview is knowledgeable, but that person also does not know everything in this world, and that's fine if you do not know some thing either. Just be confident in your knowledge, learn all the ins and outs of the topics, and carry yourself like the master of the skill. I can answer maybe 99% of the time, and for the remaining 1% I would think about it and simply say, "I cannot remember this exactly, but I have worked with this or something similar." That's it. Now click "Create metastore". We need to give the metastore a name, so I will say something like "databricks-etl-metastore". For the region I will pick UK South. Here is another interview question: the interviewer says, "There is already a metastore in UK South and you need to create a metastore; how will you do it?" If you answer, "I will simply click create metastore and fill in the details," cut the call, you are rejected. Why? Because you can only create one metastore per region. So if there is already a metastore in that region, the answer is: either delete the existing one or choose another region, because you cannot create two metastores in one region. This can be a tricky question, and now you are prepared for it. And by the way, hit that subscribe button right now; I was looking at my channel analytics and wondering why people are not subscribing, and I know you may have forgotten to hit it earlier, so hit it now and share this channel with others so they can learn too. Okay, now let's talk about the next option, the ADLS Gen2 path. What is it, and if it is optional, should we fill it in? Yes. Unity Catalog needs a location in which it can store your managed data. Let me tell you one thing: whenever you create a managed table, you know the data will be managed by Databricks, but where will it store that data? Databricks is not a storage solution, so it stores that data in the location you provide right here. So let's provide that location. We need a container name, and we do not have a container for this yet, so go to Azure, go to Home, go to your data lake, click on Containers, and create a container; I will simply call it "metastore", because this container will be available only and only for the metastore.
And I will add a tag to it, DNT for "do not touch" (DNT, not DND). So the metastore container is done. Now, if you remember, we created the Databricks access connector but we never allowed it to access this data; let's do that right now. Click on "Access control (IAM)", and remember to do this at the storage account level, on the page where you can see all the containers, so that the permission covers all the containers. In Access control, click "+ Add role assignment", and here we will add a role called Storage Blob Data Contributor. Select it, click next, choose "Managed identity", click "Select members", pick "Access Connector for Azure Databricks", search for your connector (in your case it will be easy; in my case I have several old ones), select it, click "Select", then click "Review + assign", and once more "Review + assign". That's it, the role is assigned. So what have we done? We have granted this connector a permission by assigning it the Storage Blob Data Contributor role, which means it can contribute to this data lake, across all of its containers. Very good. Now click back into Containers, because we need the location of the metastore container, and there is a path protocol we need to follow. Go back to the metastore form and fill in the ADLS Gen2 path: first the container name, "metastore", then "@", then the storage account name, then ".dfs.core.windows.net". It says "invalid ADLS path"; why? Because we need a trailing slash, so add "/" at the end, and it is accepted. Then it asks for the Access Connector ID; we have that, so go to your resource group, open your connector, copy its resource ID, paste it here, and click "Create". That's it, really; the metastore is created. Now it says, "It is done, do you want to assign it to a workspace?" Yes: pick our one and only workspace (we cannot assign it to the other workspaces, they are already attached elsewhere), click "Assign", and it asks, "Do you want to enable Unity Catalog?" Yes, click "Enable". Congratulations. One more thing, which is really important: you will land on the metastore page, and if you do not, simply click Catalog and pick your metastore; there you will see "Metastore admin". If you click edit, you will see that the long email account is the admin, and we do not want that, because we will be accessing Databricks with the normal account. So set the normal account as admin: pick it and save. What is data sharing? Don't worry about that for now. Everything is done on the Unity metastore side, so go back to your Databricks workspace and go to Catalog.
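The path we just typed follows the abfss:// protocol, and the exact same pattern is what you will use later when reading from or writing to the other containers. A small sketch with placeholder names:

```python
# Pattern: abfss://<container>@<storage_account>.dfs.core.windows.net/<path>
# The metastore location we configured: abfss://metastore@<storage_account>.dfs.core.windows.net/
# Reading the source data uses the same pattern; this works once the external
# location for that container exists (which is the very next step).
orders_path = "abfss://source@<storage_account>.dfs.core.windows.net/orders"
df = spark.read.parquet(orders_path)
df.display()
```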
Now, in the workspace, let's create the catalog. A catalog is simply the parent, the root, for all your schemas, and that's the hierarchy: inside one catalog you can create multiple schemas, and inside one schema you can create multiple tables. Click the plus button, choose Create catalog, and provide a name; I'll call it something like "databricks_cata". For the storage location we don't need to provide anything, because ideally the location should be provided at the metastore level, and we've already done that (I do follow the industry standards). Click Create, the catalog is created, and now we configure it. The important part is privileges: it asks which users or groups can access this catalog and says all account users are granted BROWSE by default. What is BROWSE? It lets other users see only the definitions, that is, table names, schema names, and columns; they cannot read the data itself, only browse. It's the minimum level of permission you can assign on a catalog. Click Grant. You could also grant USE CATALOG or USE SCHEMA, but we don't want that, so close it and click Next; BROWSE is enough for now. Who are the account users? Whoever has access to this Databricks workspace. You might ask: "I'm the only one with access to this workspace, will I also have just BROWSE?" No, because you are the metastore admin; there's a hierarchy of permissions, admins sit at the top as the principal and already have everything, so don't worry about that. Click Next, then Save, and our catalog is done. Click Catalog again and you'll see it was created with a default schema. You'll create lots of schemas, but before that, click Catalog once more and then External Data. This is one of the most important parts of building a real-world project, because in a real-world project you'll be storing and reading data in external data lakes, and for that you need to create external locations. How? Well, you'll say we already used the connector to allow permissions, and that's exactly the idea, but with one difference: if you go to your resource group and open the storage account's containers, you'll see that each container is isolated, so we need an external location for every container we want to use. Nothing is needed for the metastore container, since that permission was already granted at the metastore level. So we need to create external locations for those remaining containers.
And what do we use for that? The access connector, of course. So go to your Catalog, click External Data, and first click Credentials, because you need to paste the resource ID of your access connector. Click Create credential; I'll name it after myself, something like "anish_creds". Then go to the resource group, open the access connector, copy its resource ID, paste it here, and click Create. That's it. Essentially this is just gift wrap around the access connector: it's the same thing, but inside Databricks we call it a storage credential, because a credential could be backed by other things too, and calling it a credential matches the industry standard. (Gift wrap takes me back to buying wrapped Ludo and chess sets from the stationery shop for friends' birthdays; golden days, but let's not go there.) Now click Catalog again, since our credential is ready, then External Data, and create your external location. Click External Locations and give it a name, and as an efficient data engineer, trust me, these details make a difference in interviews: if one candidate names their external location "my_location" and another names it "bronze_ext_location", who has the better chance of being selected? Obviously the second one, because it follows a standard and shows concern for the reusability of the development. So in interviews, show that you follow these conventions instead of just flexing. My external location name is set, and the URL uses abfss, which is a protocol, roughly "Azure Blob File System Secure": abfss://bronze@ then the storage account name, then .dfs.core.windows.net, and that's it, because we want to grant access to the whole container. (You could fine-grain it down to a folder, but that's not a good practice here.) For the storage credential pick anish_creds, and click Create; I'll also copy this URL because I'll be reusing it. It's done, and you can even test it: click Test connection, and perfect. Now let's create more. Click External Locations and create the next one: this time it's the silver external location, with silver instead of bronze in the URL. These are exactly the kinds of administrative tasks you need to know in today's era to crack the interviews.
Why do they matter? Because it's not just about writing code in a notebook anymore; it's about managing the whole ecosystem. Companies are not spreading the work across many people for every little thing. There will still be different people for different tasks, but they want someone well-versed across all the areas, someone they can call whenever they need something. "I'm a data engineer, I just want to write my PySpark code and that's it": those days are gone. Be an outlier and learn things end to end. Don't skip these parts and jump straight to the PySpark section, because if you take them seriously you will see the difference in your interviews. And if a few haters spit bad words in the comments, ignore them; every big personality has a few, it usually says more about their own dissatisfaction than about you, so just focus on your skills. So pick the credential, click Create, test the connection, and done. Now click External Data again and create another one, the gold external location, set the storage credential, click Create, test the connection, perfect. And now the last one, for source, because if you remember we have a source container as well: set the URL for source, test the connection, and done. (You say I should become a philosopher? I'll try. Honestly, I'd love to get on a TEDx stage someday, so pray for me, and if it happens I'll invite you all.) All of our Unity Catalog external data is now set up.
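As a side note, the same external locations can also be created with SQL instead of clicking through the UI. Here is a minimal sketch, assuming the storage credential created above is called anish_creds and using the placeholder storage account name "mystorageacct"; adjust both to your environment.

```python
# Sketch: the four external locations created with SQL instead of the UI.
# Assumes the storage credential "anish_creds" already exists; "mystorageacct" is a placeholder.
containers = ["bronze", "silver", "gold", "source"]

for c in containers:
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {c}_ext_location
        URL 'abfss://{c}@mystorageacct.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL anish_creds)
    """)
```

Running this is equivalent to the four UI creations above, and the IF NOT EXISTS makes it safe to re-run.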
Now let's perform our first data ingestion, and you'll see what I was talking about. Go to Catalog, and inside our catalog let's create a schema: select the catalog, click the drop-down, and you'll see the Create schema button. What should the schema name be? We're working with the medallion architecture, so the schemas should be bronze, silver, and gold. I'll start with bronze. External location? No, because we provided the location at the metastore level. Click Create, and this is our bronze schema, where we will land the raw data. Now, I mentioned there's an amazing feature that lets us pull data with no code, so let's look at it. Open the sidebar and click Data Ingestion, and you'll land on the ingestion capabilities available inside Databricks. Databricks has a native connector for Salesforce (for now the only one), plus some closely integrated options: files, volumes, and Azure Data Lake. If I pick the data lake option, it will ingest data from the data lake, push it into the managed location, and create a table for me that I can use directly, all in a few clicks. And what's the other area? Databricks has partnered with Fivetran, the well-known data-movement SaaS that handles data migration from a huge number of sources, so you can push data into Databricks from any of those connectors. Click "See all" and you'll see how many there are; I'm genuinely impressed with this integration, and maybe I'll make a separate video on Fivetran in the future. For now we'll use the data lake option. Click it, and it asks for an external location. Don't pick bronze, because our data sits in source. It then lists all the folders automatically, and I'll pick regions, because regions is a small, static mapping file with nothing incremental in it. Click it, then Preview table, and it builds the table for you. This is your chance to pick the right schema, so choose the bronze schema, and for the table name I'll keep regions. It loads a preview of the data, and you'll notice a rescued-data column. What is that? If the schema changes in the future, any columns that don't fit will land in there. You'll learn more about it shortly, because next we'll be writing our own Spark streaming code to load data incrementally.
For now, click Create table, and in just two clicks your table is ready and the data is migrated to the managed location. It's created; refresh the catalog and you'll see it there. Click the table, open Details, and you'll see it's a managed table whose location is inside the metastore container, which makes sense, because we already know the metastore location is where Databricks keeps managed data. If you want to see it physically, go to the resource group, open the data lake, open Containers, go into the metastore container, click the GUID, then tables, and there it is: a Delta table created for you, data migrated, in two clicks. So that's the no-code ingestion option, and it's genuinely useful when you want a one-time pull or some static data. Now let's go to the workspace, because we're going to create a notebook and perform streaming to load data incrementally, and we're going to make it a dynamic notebook. What does dynamic mean? I don't want to spoil it yet. Click the project folder, then Create and Notebook, and give it a nice name; I'll call it "Bronze Layer". I could create multiple notebooks, one per table, but that's old-fashioned; I'll build one dynamic notebook that handles the incremental loads for all the tables. Before that, let's connect the notebook to compute. Click the Connect button and you'll see a serverless cluster already up and running, so you don't strictly need to create compute, but when you're performing streaming you should create an all-purpose cluster, and keep serverless for the rest of the transformations. So let me create the all-purpose cluster right now: go to Compute, Create compute, pick the Personal Compute policy because it's easy to configure, choose the 15.4 LTS runtime (LTS means long-term support), a Standard_DS3_v2 with four cores is fine, and set auto-terminate to 20 minutes rather than the default 4,320. Leave the advanced options, click Create compute, and that's it. It will take at least three to five minutes to create the cluster, because it has to provision the virtual machine, the disk, and everything else before it turns on.
So we wait here. Once the cluster is up, I'll show you how to load data incrementally using something called Auto Loader, which is built on top of Spark Structured Streaming, and I'll cover everything around it: dynamic notebooks, and how to loop over that dynamic notebook inside the ETL workflows. We have great stuff right from the bronze layer. And the cluster is now up and running (it shows green), so click Recents, since we just opened the notebook it will be there, click Connect, and attach it to this cluster; in your case the cluster name will be different. First, a quick orientation: this is just a notebook made of cells, and whatever code you write in a cell gets executed. That's all the notebook knowledge you need here. I'll start with a markdown cell: click Markdown and add a heading. H1 would be too much, so an H3 in bold that says "Data Reading". Now, we're going to use something called Auto Loader. What is it? Picture three boxes: your source data lake, Databricks, and a destination (which could be anything, but here it's a data lake too). There's a file in the source; we read it and write it to the destination. On the next day, or the next iteration, a new file arrives. Instead of reading both files, we read only the new one; only the incremental data is read and written. This property is called exactly-once processing: data that has already been processed will not be processed again, which is your idempotency. That's the high-level overview of Auto Loader, and under the hood it uses Spark Structured Streaming. Let's go a bit deeper, because if you explain Auto Loader in an interview, these are the natural follow-up questions. How does it process only the new file and skip the old one? There is something called the checkpoint location, and inside it a RocksDB folder.
In that RocksDB folder, Auto Loader records the current state of the destination, in other words which files have already been processed, so the next run reads only the files that have not been processed yet. That's where the idempotency comes from. One more thing: what if the data's schema changes tomorrow? Auto Loader has schema-evolution capabilities, and for that it uses a schema location. We don't write the schema by hand; Databricks infers it automatically, writes it to that location, and matches against it before writing data to the destination. Strictly speaking the schema location and the checkpoint location are different folders, but it's good practice to keep the schema location inside the checkpoint location (the checkpoint already contains RocksDB and several other folders), so that you only have one folder to monitor and manage. That's the end-to-end picture, and it's all built on top of Spark Structured Streaming. Now that the concept is clear, let's actually do it. We have data in the source container; let's start with the orders folder, which has one file. Under the "Data Reading" heading I'll start typing: df = spark.read.format, and what's the format of the data? You'd say parquet, but with Auto Loader the format we pass is "cloudFiles"; the actual file format goes into an option, cloudFiles.format, set to parquet. That's the first thing to define. Then we give it a schema location with another option, cloudFiles.schemaLocation, and for the path I'll point at the destination side, because it's convenient to manage everything there: something like abfss://silver@<storage account>.dfs.core.windows.net, and inside it a folder I'll call checkpoint_orders, because we have several datasets and each one gets its own checkpoint folder. There's also an optional cloudFiles.schemaHints option: if you want to give hints for specific columns, say order_id, you can, and it's not a big deal if you don't.
I'm just showing you that schema hints exist. Then .load() with the location, the abfss path to the orders folder, and run it with "attach and run". Oh, the cluster was terminated, because it was inactive for 20 minutes, so it needs another minute or two to come back. While it starts, let me show you the data itself with a plain batch read: df = spark.read.format("parquet"), run on serverless for now (as I said, serverless is fine for normal transformations), then df.display() so you can see the columns. It complains that the path does not exist, and of course it does, because I wrote bronze instead of source. Fix that, run again, and here's the data: order_id, customer_id, product_id, order_date, quantity, and total_amount. This is our fact table, IDs plus numeric columns, and we'll build the star schema from it later in the gold layer. I'm showing it with a batch read because with streaming you can't display the data; it goes straight to the destination. The all-purpose cluster is up now, so attach it, and fix one thing in the streaming cell: it should be readStream, not read, because we're performing streaming, and the path should point at source. Run it and... an error: the location is not found. Let me copy the path again in case something is off... no. Usually this kind of error appears when the external location isn't defined precisely, but that's not it, because the batch read of the same data works fine. Reading the error again: schema location... and actually the schema location shouldn't be in silver, it should be in bronze, but that's not what this error is about either. And then I see it: just a typo in the "windows.net" part of the URL.
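Putting the read side together, here is roughly what the Auto Loader read looks like once the typo is fixed. This is a sketch: the storage account name is a placeholder, the checkpoint/schema folder is kept in bronze as discussed, and the schema hint shown is made up purely to illustrate the option.

```python
# Sketch of the Auto Loader read (placeholder storage account "mystorageacct").
# "cloudFiles" is the Auto Loader source; the real file format goes in cloudFiles.format.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        # the inferred schema is persisted here, kept next to the checkpoint
        .option("cloudFiles.schemaLocation",
                "abfss://bronze@mystorageacct.dfs.core.windows.net/checkpoint_orders")
        # optional: nudge inference for specific columns (this hint is only an example)
        .option("cloudFiles.schemaHints", "order_id BIGINT")
        .load("abfss://source@mystorageacct.dfs.core.windows.net/orders")
)
```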
A quick tip: these errors almost always mean one of two things. Either your external location is wrong (or you pointed at the wrong one), or there's a typo in your URL so the two don't match. Correct the typo, run it again, and it works, because I know my external locations are fine. More generally, you should be able to debug your own errors. If you see an error, don't feel sad; feel happy, because there's something new to learn. Don't be the person who immediately asks someone else for the solution. You're a data engineer: spend the time, and use the AI tools you already have. I did exactly that here. I pasted my code into ChatGPT and asked whether anything was wrong with it, and it pointed out the typo, because when you're writing a lot of code you can't eyeball every character. That's a reasonable debugging sequence: first check whether the code is syntactically fine and typo-free (let AI scan it), and if the code is clean, then go check your external locations and URLs. Solve errors step by step like that. Good, our streaming read query is initiated. Data reading is done; now we need to write the data. I'll write df.writeStream, because this is a stream, then .format. I'd love to say Delta, because I love Delta, but in the bronze zone there's no need for Delta, so we'll write parquet. Then .outputMode("append"), because append is what we want here. The assistant auto-filled the checkpointLocation option, and yes, that is the checkpoint location I want.
As I mentioned in the explanation, this checkpoint location should match the schema location so that you only have one folder to manage. Then .option("path", ...): copy the bronze container path and add an orders folder. Now the trigger, because this is streaming and we have to decide how it runs. I'll use once=True. There are other options; for example processingTime="10 seconds" would keep the query running and fire every 10 seconds, which is fine if that's the requirement, but I don't want my cluster running continuously because that's costly. With once=True the query reads every file that hasn't been processed yet, writes it, and then stops; it doesn't keep running. That's exactly what I want, so let's start the query. You'll see "stream initializing", and you can expand the cell to see the streaming graph. It won't be a continuous graph, because with once=True it spikes once and that's it. The stream shows inactive, which means it wrote the data and stopped; it was quick. Let's verify in the data lake: in the bronze container you'll find checkpoint_orders (told you) and the orders folder with the data. Open the checkpoint folder and you'll see everything: the schemas folder (your schema location), commits, offsets, and under sources/0 the RocksDB folder, the one that maintains the current state and is responsible for exactly-once processing. So the data landed in bronze, only the one file we had, and the query stopped, so you don't need to worry about it at all. To confirm the row count, I'll do a quick batch read from bronze/orders: 9,990 rows. Now let's simulate the next day: in the source container I'll upload orders_second.parquet, the next day's orders. It's uploaded.
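And here is the write side consolidated into one sketch: the same placeholder storage account, parquet output to the bronze orders folder, the checkpoint kept next to the schema location, and a run-once trigger.

```python
# Sketch of the write side with a run-once trigger (placeholder storage account "mystorageacct").
query = (
    df.writeStream
        .format("parquet")                     # plain parquet is enough for the bronze zone
        .outputMode("append")
        # the checkpoint (with the RocksDB state inside) sits next to the schema location
        .option("checkpointLocation",
                "abfss://bronze@mystorageacct.dfs.core.windows.net/checkpoint_orders")
        .option("path",
                "abfss://bronze@mystorageacct.dfs.core.windows.net/orders")
        .trigger(once=True)                    # process all pending files, then stop
        .start()
)
query.awaitTermination()
```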
Now go back to Databricks, rerun the streaming query, and write the data one more time. The stream initializes, finishes almost instantly (you can barely see it), and stops, again because of once=True; if you don't need your cluster running 24/7, why run it every 10 seconds? Now rerun the batch read and watch the magic. How many rows do you see? 10,000. Earlier we had 9,990 rows, and on the second day we got only 10 more, so how much data did the stream read? Only the 10 new rows, not both files. If you want to confirm it precisely, df.count() is better than display (display sometimes caps what it shows at 10,000), and the count is 10,000, not 9,990 plus 9,990 plus 10 equals 19,990. Even though the query had been stopped in between, it still processed only the new file, because the RocksDB folder is capturing all of that state. That's an interview question right there: even if the streaming query is stopped and restarted, only new data gets processed, and the reason is RocksDB. Good, you've understood the concept and seen it work, so now let's make this notebook dynamic. What do I mean by dynamic? I have three source folders: orders, customers, and products. Imagine that in the real world you have hundreds of folders. Will you create a new notebook and rewrite the code for all hundred tables? No. You'll create one dynamic notebook, parameterize it, and drive it with loops in workflows. If that doesn't make sense yet, it will now. First let me tidy the notebook, since I'll be sharing it: I'll add a "Data Writing" heading. Then I'll create some variables, or rather parameters, because if you look closely there's a pattern: in the read path the only thing that changes is the folder (orders, customers, products), the checkpoint folder follows the same name, and the write path is the same story. So I'll create a single variable; call it file name, folder name, or source name, whatever you like. Let's set that up: remove the scratch cell and add a heading for this section called "Dynamic Capabilities".
Under "Dynamic Capabilities", we'll use dbutils. Do we need to import it? No. Unlike, say, shutil in Python, which you would import, dbutils is already available in the notebook. So: dbutils.widgets.text(...) creates a widget; I'll name it "file_name" with a default of "orders". Run it and a text box appears at the top of the notebook. One widget is enough. To read its value use dbutils.widgets.get("file_name") (get, not text), which returns "orders". Change the widget value to customers and it returns customers; change it to products and it returns products. It changes at runtime, which is exactly what I want. There's also an "add parameter" button in the notebook UI that does the same thing, and you can add or delete widgets from there, but I prefer creating them in code; it's up to you. I'll remove the default value, because I don't want any default; whatever value is passed in is what it takes. Then I'll store it in a variable, p_file_name, just for readability, and use it to make the code dynamic at runtime. In the read cell I'll switch to an f-string, the common Python way to combine strings with variables (some people use the format function; I personally prefer f-strings), and drop p_file_name into the path, and do the same in the write cell. This is a design decision you should be able to defend in an interview: "Why did you do it this way?" Because I don't want to rewrite the same code for thousands of tables, so I created a parameter I can change at runtime. Give a confident answer like that.
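Here is the parameterization in one place, a short sketch of the widget plus the f-string paths, again with a placeholder storage account name.

```python
# Sketch of the widget-based parameterization (placeholder storage account "mystorageacct").
dbutils.widgets.text("file_name", "")           # creates a text box at the top of the notebook
p_file_name = dbutils.widgets.get("file_name")  # e.g. "orders", "customers", "products"

# the paths are then built with f-strings (note the leading f)
source_path     = f"abfss://source@mystorageacct.dfs.core.windows.net/{p_file_name}"
checkpoint_path = f"abfss://bronze@mystorageacct.dfs.core.windows.net/checkpoint_{p_file_name}"
sink_path       = f"abfss://bronze@mystorageacct.dfs.core.windows.net/{p_file_name}"
```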
Now run the whole notebook with Run all. It executes every cell, and the result here should be nothing, because there are no new files. Let's confirm: go to the bronze container, where there should be only the orders folder and its checkpoint. Wait, what is this? There's an extra folder, because I forgot the f prefix on one of the f-strings, so the literal "{p_file_name}" was treated as a plain string. A small typo; delete those folders, fix it, and don't feel sad when this happens, these slips are directly aligned to your learning. Run all again, and this time, as expected, nothing changes: still just the two items, orders and its checkpoint. (What is _spark_metadata? Whenever you write with Auto Loader or streaming it keeps per-file metadata there; don't worry about it, the things that matter are RocksDB and the schema location.) Now it's time to actually create the pipeline, because you'll ask: "If I have a hundred tables, am I going to open this notebook and click Run all a hundred times?" No. You're my data fam, so here's the better way. Go to the workspace and create a new notebook called "parameters". In it, define an array of dictionaries; I'll call it datasets, with one entry per source: file_name orders, file_name customers, file_name products. Run it (attaching it to the cluster) and that's the array I want to use as the input for a loop. If you're coming from an Azure Data Factory background, think of the ForEach activity; this is exactly that idea, and if you're not, just ignore the comparison. How do we expose the array? With dbutils.jobs.taskValues.set(): give it a name, I'll say "output_datasets" because that reads better, and pass the datasets variable as the value. What this utility does is the following: whenever this notebook runs as a task in the workflow we're about to build, it returns the datasets array under that name, so we can use it as the input to a loop, as shown in the sketch below.
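So the whole parameters notebook is just a few lines; here is a sketch of it.

```python
# Sketch of the "parameters" notebook: one dictionary per source folder,
# published as a task value so a workflow for-each task can loop over it.
datasets = [
    {"file_name": "orders"},
    {"file_name": "customers"},
    {"file_name": "products"},
]

# only meaningful when this notebook runs as a job task; the key name
# "output_datasets" is what the for-each input will reference
dbutils.jobs.taskValues.set(key="output_datasets", value=datasets)
```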
Just picture the flow. There's no real need to run this notebook by hand, but you can run it once to confirm the syntax is fine. Now go to Workflows and let's create our first job; I'll name it "Bronze Incremental". The first task is the parameters notebook: task name "parameters", type Notebook (the UI hung on me for a moment, so a quick refresh sorted it), the path is the parameters notebook inside the project folder, and for compute a job cluster or the all-purpose cluster is fine. (A task can be many things, by the way: a Python script, a Python wheel, SQL, even a whole pipeline, and we'll use that at the end of this project when we wire up the full end-to-end flow.) This task needs no parameters of its own. Click Create task, and the first task is created. Now add a second task, another notebook, already linked to the first; I'll call it "bronze_incremental_autoloader", with the path pointing to the bronze notebook and the same compute. This one does have a parameter: the key is file_name, and it must match the widget name in the Databricks notebook exactly. But what about the value? You'll say "just feed it the output of the parameters task". No, because that output is an array, and a single parameter can only take a single value. So here's what we do: create the task, click "fit to viewport", select the autoloader task, and click "loop over this task", which is the For each activity. It asks for the inputs of the loop, and this is where you feed it the array as dynamic content: start typing and it suggests tasks.<task name>.values, and the key I used was output_datasets (I opened the parameters notebook again to double-check the exact name, because it has to match). So the loop's input is the array, and on each iteration it moves on to the next item.
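For reference, the two dynamic references typed into the job UI look roughly like this, with the task name "parameters" and the key "output_datasets" as chosen above; they're shown here only as strings, since they live in the job configuration rather than in a notebook.

```python
# The two dynamic references typed into the job UI (shown here only as strings):
# 1) the for-each task's "Inputs" field: the array published by the parameters task
foreach_inputs = "{{tasks.parameters.values.output_datasets}}"

# 2) the inner notebook task's file_name parameter: one item of the array per iteration
file_name_param = "{{input.file_name}}"
```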
Now configure the inner task. Each iteration passes one item of the array, and each item is a whole dictionary: the first iteration gets the orders dictionary, the second customers, the third products. And we know how to fetch a value out of a dictionary: by its key. The key is file_name, so for the task's file_name parameter click "insert dynamic reference" and pick the input.file_name reference. That's the whole task setup; this is the pipeline for the bronze incremental load (we can rename it later if we want). Click Save task, then Run now, and View run. We didn't test it beforehand, so if there are errors, that's fine. The run first executes the parameters notebook, then the array flows into the loop, which runs the Auto Loader notebook once per dataset, one by one. And we can validate that it behaves incrementally, because so far only the orders data had been copied. Can you see it? Three iterations scheduled, exactly as many as the values we passed. When it finishes, check the data lake: the bronze container now has all three folders along with their checkpoint locations. Beautiful. One more test, just for the sake of it: products has 490 records in its first file and 10 in a second file, 500 in total, and customers has 1,990 in the first and 10 in the second, 2,000 in total. Upload products_second.parquet (the first upload didn't show up, so I uploaded it again) and the second customers file, then rerun the job: Workflows, Bronze Incremental, Run now. It should process only the new files, and do nothing for orders since there's no new file there, which makes it a double test. One iteration scheduled, then the second, then the third. And that is how you build pipelines dynamically. Done.
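If you ever want to test the same fan-out without building the job, a plain loop with dbutils.notebook.run does the equivalent of the for-each task. To be clear, this is an alternative sketch rather than what the workflow does internally, and the notebook path below is an assumption; point it at your own bronze notebook.

```python
# Alternative sketch: the same fan-out as the for-each task, done from a scratch notebook.
# The notebook path is an assumption -- use your own workspace path.
datasets = [
    {"file_name": "orders"},
    {"file_name": "customers"},
    {"file_name": "products"},
]

for d in datasets:
    # runs the bronze Auto Loader notebook once per source folder,
    # passing the folder name into its "file_name" widget
    dbutils.notebook.run("/Workspace/DatabricksETL/Bronze Layer", 600, {"file_name": d["file_name"]})
```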
Now let's validate the results. Open the bronze layer notebook from Recents and read each dataset with df = spark.read.format("parquet").load(...). First orders, where we expect 10,000 rows. An error about p_file_name appears, because I forgot the f on the f-string again; fix it, and there they are, 10,000 rows. Then customers: we should have 2,000 in total, and after scrolling down to refresh the count, 2,000 it is. Then products, which should be 500; run it, and 500 rows. Our bronze-layer ingestion is done, and now it's time to build the silver layer, with all the transformations and the other crazy things we're going to learn in this section. (Quick mic check: yes, it's working; sometimes it isn't and I have to re-record everything.) Go to the workspace, open the project folder, and create a new notebook for the silver layer. There will actually be four silver notebooks, one per table, since we have four files and every table has its own transformations. I'll start with "silver_orders" and attach it to the cluster. First a heading: "Data Reading". Then read the data: df = spark.read.format("parquet"), and parquet doesn't need a schema defined because the schema lives in the file footer, then .load("abfss://bronze@<storage account>.dfs.core.windows.net/..."). "Unable to infer schema for parquet"? Of course, because I forgot the folder; the folder is orders. Now display it. This is our data, and usually there isn't much to transform on a fact table, because it's full of IDs (which we can't transform) and a few numeric columns. But when we use Auto Loader we get one extra column, _rescued_data, which, as I mentioned at the beginning, exists to catch new columns if the schema evolves. We don't need to carry it into the enriched layer, and it won't disturb your schema; that's why it starts with an underscore. You can inspect the schema with df.printSchema().
df.printSchema() shows the table's schema, including _rescued_data, and there's nothing special about it. If you want, you can rename it: PySpark has withColumnRenamed, so df.withColumnRenamed("_rescued_data", "rescued_data") gives it a name without the underscore, and you'll see the column renamed. You could even transform it if it ever holds values; that's up to you, but in our case we simply don't need it, so let's drop it: df = df.drop("rescued_data"). One note: earlier we didn't save the rename back into df (we only chained .display()), so reread the data first, not a big deal, then apply the drop and display again, and the column is gone. We keep _rescued_data only up to the bronze layer; after that, keep it or drop it depending on whether you care about what lands in it. Now the data looks right: order_id, customer_id, product_id, order_date, quantity as an integer, and total_amount as a float. Next, some date functions. Say you want to convert order_date into another format, or into a timestamp: how do you do date transformations in PySpark? With withColumn: df = df.withColumn(...). withColumn either modifies an existing column or creates a new one; if the column name you give already exists in the DataFrame it gets modified, otherwise a new column is created. Sorted. So: df = df.withColumn("order_date", to_timestamp(col("order_date"))), then display, and you'll see the date column is now a timestamp (the time part is all zeros, but now you know how to convert a column into a timestamp).
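Here are those clean-up steps in one sketch, with the placeholder storage account name again.

```python
from pyspark.sql.functions import col, to_timestamp

# Sketch of the silver-orders clean-up steps (placeholder storage account "mystorageacct").
df = (
    spark.read.format("parquet")
        .load("abfss://bronze@mystorageacct.dfs.core.windows.net/orders")
)

df = df.withColumnRenamed("_rescued_data", "rescued_data")  # optional rename
df = df.drop("rescued_data")                                # not carried past bronze

# withColumn with an existing column name modifies that column in place
df = df.withColumn("order_date", to_timestamp(col("order_date")))
df.printSchema()
```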
Now you'll ask, can you also show us an example of creating a brand new column? Yes, why not. I'll say df equals df.withColumn("year", ...). PySpark checks whether a year column already exists in the DataFrame; it doesn't, so it creates one, and the value comes from the year function applied to col("order_date"). Run it and you'll see a new column called year: 2023, 2024, 2025 and so on. Very good. Now, since we're working with numbers, we'd obviously love to apply some window functions. What are window functions? They are functions that operate over a window of rows, a partition, instead of collapsing the whole column, so you get row-level results. For example, say I want a ranking based on total_amount, but partitioned by year: all the amounts ranked top to bottom within 2023, then again within 2024, then 2025. Got the scenario? Very good. I'll first show you the plain way, and then I'll show you how to wrap the same thing in a Python class, because in the silver layer you're going to learn a lot of PySpark plus Python OOP concepts, and I'll do my best to make these complex things easy. To keep things tidy I won't overwrite df; I'll create df1. So df1 equals df.withColumn("flag", ...). We need to import Window: from pyspark.sql.window import Window. The transformation is dense_rank(), which is a window function, .over(Window.partitionBy("year").orderBy(col("total_amount").desc())), because we're sorting the amounts in descending order. Now df1.display(), and you can see 1 1 1, 2 2 2, 3 3. Next I want another column, rank_flag, to show the difference between the two ranking functions: dense_rank keeps the counting dense even when there are ties, so if several rows share rank 1 the next distinct value still gets 2, whereas rank leaves gaps, so after six tied rows at 1 the next value jumps straight to a later number. It does not treat ties the way dense_rank does.
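A compact sketch of those window functions side by side, using the column names from this notebook:

    from pyspark.sql.functions import col, year, dense_rank, rank, row_number
    from pyspark.sql.window import Window

    # Rank orders by total_amount within each year, highest amount first
    w = Window.partitionBy("year").orderBy(col("total_amount").desc())

    df1 = (df.withColumn("year", year(col("order_date")))
             .withColumn("flag", dense_rank().over(w))       # 1,1,1,2,2,... no gaps after ties
             .withColumn("rank_flag", rank().over(w))         # 1,1,1,4,...  gaps after ties
             .withColumn("row_flag", row_number().over(w)))   # 1,2,3,4,...  ignores ties entirely

    df1.display()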
Let me show you: I'll add the rank function, everything else stays the same, and now we have the rank_flag column. But hold on, where is our flag column? Oh, that makes sense: we applied the first one to df, not df1, so the second withColumn overwrote it. Let me just do it once more, not a big deal; I'll rerun the cells with Run All, and sorted. Now I can see both functions side by side, and you can see rank jumps to six after the ties, exactly as I told you. Now one more window function, row_number, which doesn't care about ties at all; it simply numbers the rows 1, 2, 3, 4, 5, 6 and then resets the counting for the next partition, 2024 and so on. These are the most popular window functions, and row_number works the same way: add a row_flag column with row_number() and there it is. So far we have hardcoded everything. Say that further down, after a bunch of other transformations, you want window functions again; you'd be writing the same code a second time. What's the best way to reuse our functions efficiently? We can create a class. Let me add a heading, Classes, because this is an OOP concept. We'll write class windows, because it will hold all the window functions. One option is to give it an __init__ with self and a df parameter and store it as self.df, an instance attribute, so that every time you create an instance you hand it the DataFrame it should work on. Then, say you've moved on from df1 to df2 after more transformations; instead of rewriting the window code you just create an instance of this class and call its method. See how powerful that is? Hold on, let me write it: inside the class I'll create a dense_rank method taking self, apply the same transformation we wrote above (and by the way, the assistant keeps autocompleting my code, which is great when you're developing, but if you're learning you should type it yourself), and return the DataFrame with that transformation applied. But is this code correct?
No, it isn't, and here's why: inside the method we're referring to df, but df is not native to this class. A method can only use its own local variables or whatever comes in from outside, either through the constructor or through the method's parameters. Either way works, but I personally prefer passing the DataFrame into the method rather than into the instance, so let's remove the df from __init__; in fact we don't need the constructor at all. So each method takes a df parameter, applies its transformation, and returns the result: a dense_rank method, a rank method, and a row_number method, three functions inside one class. Now let's use it. First let me create a fresh DataFrame so we don't mess up the original: df_new equals df, and if I display df_new it has everything df has. Say I want the dense-rank column on this new DataFrame. Instead of writing the whole window expression again, I simply create an object of my class: obj equals windows(). "windows is not defined", are you sure? Didn't we run that cell? Okay, let me run the class cell first, and now it works. Then I call obj.dense_rank(df_new), and it runs fine. To look at the result I'll save it into df_result (you could just as well save it back into df_new, it's only for clarity), and in df_result you'll see the new flag column ready for you: 1, 1, 1, 2, 2 and so on. So if you have many DataFrames and thousands of lines of code, you can reuse the class: create the object once, pass in whatever DataFrame you like, and that's it.
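Here is a minimal sketch of that reusable class, assuming the same window spec (partition by year, order by total_amount descending) used above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    class windows:
        # Shared window spec: rank within each year by total_amount, highest first
        _w = Window.partitionBy("year").orderBy(F.col("total_amount").desc())

        def dense_rank(self, df):
            return df.withColumn("flag", F.dense_rank().over(self._w))

        def rank(self, df):
            return df.withColumn("rank_flag", F.rank().over(self._w))

        def row_number(self, df):
            return df.withColumn("row_flag", F.row_number().over(self._w))

    obj = windows()
    df_result = obj.dense_rank(df_new)   # works for any DataFrame that has year / total_amount
    df_result.display()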
Very good. Obviously we're not going to persist those flag columns; I only wanted to show you how to transform the data based on requirements. What we will write is the DataFrame with the year column but without the flag column. Let me add a heading, Data Writing. I'll say df.write.format("delta"), because in the silver layer we should save the data in Delta format. Mode is overwrite, which is fine here; I could use append as well, not a big deal. Then the location: abfss://silver@ our storage account .dfs.core.windows.net/orders. Run it, and our first table is written into the enriched layer, which is also called the silver layer.
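The write itself, sketched with the same placeholder storage account as before:

    # Persist the enriched orders data to the silver container in Delta format
    (df.write.format("delta")
        .mode("overwrite")    # "append" would also be fine for this first load
        .save("abfss://silver@datafamstorage.dfs.core.windows.net/orders"))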
That was our extra touch of classes and OOP concepts, because you should know these things if you're becoming a data engineer, and you can pick them up along with this project. Now let's create the next notebook: go to the workspace, click Create, make a new notebook called silver_customers, and connect it to the cluster. Same start: import the libraries, from pyspark.sql.functions import * and from pyspark.sql.types import *, then a Data Reading heading. df equals spark.read.format("parquet").load from the bronze container (not silver, bronze; let me fix that), folder customers. We have our 2,000 customers, and again that extra column, so first let's drop it: df equals df.drop("_rescued_data"), display, and it's gone. Now the first transformation, and it's genuinely useful, so let me give you the requirement. We have an email column, and no, do not actually send mail to these addresses; they're pseudo emails. I want a new column that stores just the email domains, so I know which domains my customers use; then I could target vouchers or promo codes per domain, because some will be company domains and some will be public ones like Gmail or Yahoo. How do we do it? We split the email column, virtually, not physically, and pick only the value after the @. So the approach is: split the column, then use array indexing to grab the element after the separator, because a split produces a list of values based on a delimiter, and our delimiter is @. So: df equals df.withColumn("domain", split(col("email"), "@")) and then take element 1 of the resulting array; index 0 is the part before the @, index 1 is the part after it. I could also use -1 since it's the last element, but there are only two items, so 1 is clearer. I had a small parenthesis typo, I hate parentheses, but once the brackets close in the right places we're good. Run it, and I've fetched all the domains: ryan.org, kelly.com and plenty of others, with lots of repeats of the popular ones. Next I want to know how many customers I have per domain, for gmail.com, for yahoo.com and so on, which means aggregation: df.groupBy("domain").agg(count("customer_id").alias("total_customers")), since customer_id is the unique key, then display. And because there are 811 distinct domains in total, let's sort it too: .sort("total_customers", ascending=False). This one step combined several transformations, and that's exactly why you build projects: you already know the individual PySpark functions, but here you chain them together. My top three domains turn out to be Gmail, Hotmail and Yahoo. Now something extra: I want to filter on those domains. For Gmail I'll say df_gmail equals df.filter(col("domain") == "gmail.com") and display it, so I only see the gmail.com customers.
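A sketch of the domain extraction and the aggregation, using the column names from this table:

    from pyspark.sql.functions import col, split, count

    # Everything after the "@" becomes the domain
    df = df.withColumn("domain", split(col("email"), "@").getItem(1))

    # Customers per domain, largest domains first
    (df.groupBy("domain")
       .agg(count("customer_id").alias("total_customers"))
       .sort("total_customers", ascending=False)
       .display())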
Now let's do the same thing for Yahoo, yahoo.com, and for Hotmail. Run them, and the three cells execute one by one: first this code runs, then this one, then this one. Just to make that visible I can import the time module and put a time.sleep(5) between them; rerun, and you'll see each cell wait five seconds before the next one starts. So if you ever need a delay between pieces of code, or you want a previous task to finish before the next one starts, the time module is one way to do it; here the Gmail, Yahoo and Hotmail DataFrames were simply computed and returned in sequence. Let me remove that cell now because it takes up a lot of space; we know the output. Next, we've seen split, groupBy and the rest; now let's build a full_name column instead of separate first and last names. We can do that with a text function called concat: df equals df.withColumn("full_name", concat of the first name, a space, and the last name). And of course, once we have full_name we no longer need the first-name and last-name columns, so we drop them; the autocomplete even guessed that for me. Display it, and we have one full_name column instead of two, which completes the set of transformations we wanted for this table. Now let's write the data: df.write.format("delta").mode("append") to the silver container, customers folder. Run it, and the data is written. I'll confirm in the storage containers too: under silver we now have a customers folder, and orders has its partitions there as well. We could also re-read everything at the end of the silver layer just to confirm the data looks fine.
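A sketch of those last steps, assuming the name columns are called first_name and last_name (adjust if yours differ):

    from pyspark.sql.functions import col, concat, lit

    # Per-domain slices of the customers DataFrame
    df_gmail   = df.filter(col("domain") == "gmail.com")
    df_yahoo   = df.filter(col("domain") == "yahoo.com")
    df_hotmail = df.filter(col("domain") == "hotmail.com")

    # Build full_name and drop the two source columns
    df = (df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
            .drop("first_name", "last_name"))

    # Write the enriched customers data to silver
    (df.write.format("delta")
        .mode("append")
        .save("abfss://silver@datafamstorage.dfs.core.windows.net/customers"))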
Now let's create another notebook for our third table, silver_products, and attach it to the cluster. Same start: import the libraries, add a Data Reading heading, and read the parquet folder. What's the typo... ah, the storage account name in the path; fixed. Let's look at the data: first drop that extra column again, df.drop("_rescued_data"), rerun, and we're left with product_id, product_name, category, brand and price. Very good. Now I'm going to show you something special and genuinely important: functions. We already know we can create user-defined functions in PySpark, but a UDF only lives until your program or session ends. We can instead store functions in our catalog so they can be reused, the same way we do it in SQL. If you click on Catalog and expand the bronze schema, you'll see you can create an object type called Function there; the syntax is very close to the T-SQL functions you'd write in Microsoft SQL Server, and once it's created you can reuse it from any other notebook. That's something a UDF can't give you, because a UDF isn't available across sessions, but a catalog function is stored in Unity Catalog, retained after the session is killed, and governed there. These functions can be long and genuinely complex, so you write them once and reuse them. We'll create some simple ones so you get the concept, and you can scale the idea to much more complex logic later. There are two kinds: scalar functions, which return a single value, and table functions, which return tables. Let me show you the scalar function first. Add a heading, Functions. We can write them in SQL and in Python, and I'll show you how to call them from SQL and from a DataFrame, both. To test them with SQL we need a SQL object to query, and we don't have a table for products yet, so let's create a temporary view first.
Creating a temporary view from a DataFrame is simple: df.createOrReplaceTempView("products_view"). Hmm, "df is not defined"; that makes sense, my cluster had terminated in the meantime, so let me just Run All. Perfect, the view is created; it's only a temporary view for testing, because a SQL function needs a SQL object to run against. Now the function itself. I'll say CREATE OR REPLACE FUNCTION, and the function has to live inside the catalog and schema, so databricks_cata.bronze, then the function name. Looking at the columns, we have price, so let's build a price-after-discount function: the discount is 10%, which means the function returns 90% of the amount. I'll call it discount_func to keep it readable. It takes one parameter, p_price, of type DOUBLE, and RETURNS DOUBLE: this is a scalar function because it returns exactly one value, which could be a string, a double, a float or an integer, but always just one value. Then we say which language we're using, LANGUAGE SQL, and finally what it returns: RETURN p_price * 0.90, just 90% of the price. Create it, and it's done. Now let's call it from SQL on top of the view: SELECT product_id, discount_func(price) FROM products_view. "Cannot resolve function on search path"; is there a typo? Ah, we need to qualify it with the catalog and schema, databricks_cata.bronze.discount_func, because only then can it find the function. And now it works: the original price next to the discounted price, which is exactly 90% of it. (By the way, my camera was off for a bit, so if you missed my face, here I am.) This function can now be used again and again from anything stored under this catalog.
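A sketch of the scalar Unity Catalog function and its SQL usage, run from Python; the catalog and schema names (databricks_cata.bronze) follow the ones used in this project:

    # Register a reusable scalar SQL function in Unity Catalog
    spark.sql("""
        CREATE OR REPLACE FUNCTION databricks_cata.bronze.discount_func(p_price DOUBLE)
        RETURNS DOUBLE
        LANGUAGE SQL
        RETURN p_price * 0.90
    """)

    # Expose the DataFrame as a temp view and call the function with its fully qualified name
    df.createOrReplaceTempView("products_view")
    spark.sql("""
        SELECT product_id,
               price,
               databricks_cata.bronze.discount_func(price) AS discounted_price
        FROM products_view
    """).display()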
In the catalog, if you click on the bronze schema, you now have two sections, Tables and Functions, where earlier there was only Tables; click on Functions and there is the function we can reuse any time. That's how you use it from SQL. Now you'll say, why are we creating temporary views, we want to use it directly on the DataFrame. You can: df equals df.withColumn("discounted_price", ...), and because we're calling a SQL function we wrap the call in expr, writing the function inside expr exactly the way we'd write it in SQL: expr("databricks_cata.bronze.discount_func(price)"). Then df.display(). We have an error... just a typo, I forgot an s, and now it's applied directly on my DataFrame too. Just remember: expr lets you use SQL-style expressions inside DataFrame code; that's the flexibility we have here. Now, can we build the same kind of function using Python? Yes. Let's create a different one, something simple like upper-casing a string, and apply it to the brand column. Of course PySpark already has an upper function; I'm only showing you how to create catalog functions in SQL and in Python, because in the real world your functions will be complex and you'll want to save and govern them, so I'm making your fundamentals strong; this is also a relatively new capability in Databricks, so you should be aware of it. About half the statement is the same: CREATE OR REPLACE FUNCTION databricks_cata.bronze.upper_func, taking a parameter p_brand of type STRING, RETURNS STRING. The difference is LANGUAGE PYTHON, followed by AS and a body wrapped in double dollar signs, $$ ... $$; whatever Python you write inside that block is what gets executed and returned. (And yes, the autocomplete picked upper before I even typed it; I'm telling you, it's listening.) Inside the block you write normal Python: you need a return, and you can write loops, conditions, whatever you like, even iterating over p_brand character by character; you can put any Python you want inside the dollar-dollar block.
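A sketch of the Python-bodied catalog function plus the expr() usage on the DataFrame, under the same assumed catalog and schema names:

    from pyspark.sql.functions import expr

    # A Unity Catalog function whose body is plain Python
    spark.sql("""
        CREATE OR REPLACE FUNCTION databricks_cata.bronze.upper_func(p_brand STRING)
        RETURNS STRING
        LANGUAGE PYTHON
        AS $$
            return p_brand.upper()
        $$
    """)

    # Both catalog functions can be called from DataFrame code via expr()
    df = df.withColumn("discounted_price", expr("databricks_cata.bronze.discount_func(price)"))
    df = df.withColumn("brand", expr("databricks_cata.bronze.upper_func(brand)"))
    df.display()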
For our purposes I'll simply say return p_brand.upper(), which is the usual Python way, and if I wanted to reverse the string instead I could return a sliced or reversed version of p_brand; you get my point, you can write anything and save it. Create the function, and now let's use it the same way: SELECT product_id and upper_func(brand) from the view, fully qualified. Run it, and the function has been applied: all the brand values are upper case. Again, I'll repeat: reach for these catalog functions when you need complex logic that you can't express cleanly in PySpark and you want to save and govern it through Unity Catalog. Now let's write our DataFrame: df.write.format("delta"), mode append, then .option("path", ...) pointing at the silver container's products folder, and .save(). Run it, and our third notebook is done, with some new learnings along the way. Now let's quickly create the fourth notebook; it's a very basic one, but you'll still learn something new. Go to the workspace, click New Notebook, and call it silver_regions. The regions data isn't sitting as files, it's a table, so how do you read a table? You simply say df equals spark.read.table("databricks_cata.bronze.regions"); no location needed, because it's a table. Display it, and it's just the four regions, and again we remove that extra column, df.drop("_rescued_data"); everything else is fine and we're satisfied with the transformations. By the way, this is a deliberate requirement of the project: this file will not be part of the star schema, but it still has to exist, because it's a mapping file that travels through the medallion architecture and gets served to stakeholders. It's a static mapping file, and we'll simply create an object for it in gold. So let's write it: df.write.format("delta").mode("overwrite"); overwrite is fine because it's a small, static file that can safely be overwritten every time.
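The regions notebook in sketch form; spark.read.table works here because regions was ingested as a catalog table rather than as plain files:

    # Regions lives as a table in the bronze schema, so read it by name rather than by path
    df = spark.read.table("databricks_cata.bronze.regions")
    df = df.drop("_rescued_data")

    # Small static mapping file -- safe to overwrite on every run
    (df.write.format("delta")
        .mode("overwrite")
        .save("abfss://silver@datafamstorage.dfs.core.windows.net/regions"))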
And note I'm not saving it as a table, just to a location; that's the other way to write to an external location, passing the path straight to .save() instead of using option("path", ...), and it goes into the silver container. What's the error... ah, a typo in the storage account name that the autocomplete suggested; fixed. Now let's quickly query everything we've written, just to be sure, because before building the gold layer we should confirm our data looks right. df equals spark.read.format("delta").load for each one: orders is fine, regions is fine, customers shows 2,000 rows, and products at the end, perfect. All the data is fine, so it's time to start the gold layer, and the gold layer is very special: we're going to build the star schema, dimension tables, fact tables, slowly changing dimension type 1 and slowly changing dimension type 2, so there is a lot to learn. Honestly this is the most important part of the video; all the parts matter, but here we create the gold layer, the dimensional data model, the star schema and the slowly changing dimensions. We have one fact table and two dimensions, customers and products: we'll build the customers dimension as SCD Type 1 and the products dimension as SCD Type 2, so you understand both of the most commonly used types; Type 3 is rarely used, Type 1 and Type 2 are what you'll actually meet. And a small disclaimer: this is a complex area, so you may need to rewatch this part a few times, and that's completely fine; nobody absorbs a new topic in one go. So, without wasting time, let's create the notebook: go to the workspace, click Create Notebook, and call it gold_customers. We'll also need a schema for the gold tables. You could simply run CREATE SCHEMA in a cell, but I want to show you the UI as well: open Catalog from the menu, click on our catalog, click Create Schema, create the gold schema, and that's it. Let's go back to our notebook.
From Recents I'll open gold_customers, collapse the sidebar, and attach the notebook to the cluster. First of all, I want to create a flag at the very beginning of this notebook. Why? Because in this notebook I'm going to build a real-world solution for an SCD Type 1 dimension, and in the real world we handle the initial (full) load and the incremental load inside one notebook. So the notebook has to cope with creating the dimension table on the first run and with every incremental run afterwards. The overview: if the dimension table doesn't exist yet, you simply write it. Once it exists you can't just append; you need an upsert, update plus insert, meaning you insert only the new records and update the existing ones. And there's one more rule: a dimension gets a surrogate key, and you have to maintain that too, all in one notebook. That's what makes this a real-world project. Our source is the silver data. We could work with the Delta files directly, but ideally we should have tables registered on top of the silver data so we can just use spark.sql, so let's do that now; it's good practice. Go back to the silver_customers notebook. To create the table we say CREATE TABLE IF NOT EXISTS databricks_cata.silver..., and we don't have a silver schema yet, so quickly, in a code cell: CREATE SCHEMA databricks_cata.silver; refresh the catalog and there it is. We don't even need to define the columns, because everything is in the Delta log, so the statement is just USING DELTA with a LOCATION pointing at the customers folder in the silver container. The autocomplete keeps suggesting the wrong storage account name, but once the path is right the table is created, and I can query it: SELECT * FROM databricks_cata.silver.customers_silver. It got created with the name customers_silver, which is fine, and because we said CREATE TABLE IF NOT EXISTS, rerunning the cell is safe.
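A sketch of registering the silver folders as external Delta tables, with the same placeholder storage account and the databricks_cata catalog:

    # One-time setup: a silver schema plus an external table over the silver Delta folder
    spark.sql("CREATE SCHEMA IF NOT EXISTS databricks_cata.silver")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS databricks_cata.silver.customers_silver
        USING DELTA
        LOCATION 'abfss://silver@datafamstorage.dfs.core.windows.net/customers'
    """)

    spark.sql("SELECT * FROM databricks_cata.silver.customers_silver").display()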
This is fine; let me also plug in the camera cable so my screen doesn't keep going dark and you don't miss my face. Now let's do the same thing for the other tables: open silver_products from Recents and register products_silver over the products folder, then orders_silver over orders, and finally regions_silver over regions. Our tables are done, so let's move back to the gold_customers notebook. I need to read my source, which is the silver customers table, but as I mentioned, let's create the flag first. Click Edit, then Add Parameter, and name it init_load_flag: is this the initial load or an incremental load? Its initial value is 1, because there is no table in the gold layer yet, so this is the initial load; when we run incrementally we simply change the value to 0. That's exactly how it works in production too: you run the pipeline once with the initial-load value, and after it's deployed you flip the parameter to 0 and it stays 0 for every subsequent run. Now let's read the data; I'll add a heading, Data Reading from Source, because from here the process handles both the initial and the incremental case. df equals spark.sql("SELECT * FROM ...customers_silver"). "Silver cannot be found", are you sure? Ah, we didn't qualify the catalog; with databricks_cata.silver.customers_silver it works. So this is my source data, and you can display it if you like.
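In sketch form, with the widget created programmatically (in the video it's added through the notebook UI, which is equivalent):

    # 1 = initial/full load, 0 = incremental load (flipped after the first production run)
    dbutils.widgets.text("init_load_flag", "1")
    init_load_flag = int(dbutils.widgets.get("init_load_flag"))   # widgets return strings

    # Source for the customer dimension
    df = spark.sql("SELECT * FROM databricks_cata.silver.customers_silver")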
Next, as a second step, I'll remove duplicates based on the primary key, and the primary key is customer_id; there shouldn't be any duplicates, but we want to be doubly sure, so add a heading, Removing Duplicates. I'll say df equals df.dropDuplicates(subset=["customer_id"]), because we only need to deduplicate on customer_id, then display(df.limit(10)) since ten records are enough to look at. Duplicates handled. Now we need to assign a surrogate key column to the DataFrame. What is a surrogate key? It's a pseudo key we create in a dimension so that joins become easy and efficient. Our business key is customer_id, and alongside it we create a dimension key, often called the dim key or surrogate key, and downstream we join on that instead of on the business key. You can generate it in several ways; row_number would work, since it's just a unique identifier per record, 1, 2, 3 and so on, but Spark also has a function called monotonically_increasing_id, and that's the one I'll use. Import everything from pyspark.sql.functions and pyspark.sql.types, then df equals df.withColumn("dim_customer_key", monotonically_increasing_id() + 1); it starts from zero, so I add one, and because a bare constant can't be added like that we wrap the one in lit. Run it and display df.limit(10), and there's my dim key: 1, 2, 3 up to 10.
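The dedup and the first-pass surrogate key, as demonstrated at this point (the new-versus-old handling below refines it):

    from pyspark.sql.functions import monotonically_increasing_id, lit

    # Defensive de-duplication on the business key
    df = df.dropDuplicates(subset=["customer_id"])

    # First-cut surrogate key; monotonically_increasing_id() starts at 0, so shift by 1.
    # Note: the ids are unique and increasing, but not guaranteed consecutive across partitions.
    df = df.withColumn("dim_customer_key", monotonically_increasing_id() + lit(1))
    display(df.limit(10))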
Now the real work starts. Tell me one thing: if this is the initial load, there is no existing table, so we can simply write this DataFrame into the gold layer. But suppose the next run brings, say, 50 new records; does it make sense to generate their surrogate keys starting from one again? Obviously not, because we already have 2,000 customers, so the next surrogate key should start from 2,001. Exactly. So whatever DataFrame we receive, we have to divide it into two parts: new records and old records. The old records we keep aside, because they already have surrogate keys in the dimension; the new records get keys, and those keys must start from the max value already stored in the gold table, so with 2,000 existing records the new keys start from 2,001, with 10,000 they start from 10,001. Now that you have the concept, let's implement it. We'll step back to just before the surrogate-key cell, because we shouldn't be creating keys for every row; first we work out which records are new and which are old. One more thing to know: in the final dimension table we keep two extra columns, create_date and update_date, to track when each record was created and when it was last updated; you'll be dealing with those shortly. (Let me also turn the brightness down a bit; it's 7:53 p.m. here and I started in the morning.) So, add a heading: dividing new vs old records. How do we decide? Our incoming DataFrame is df. To see which of its records already exist, you build a DataFrame on top of the gold table and left join to it; wherever the lookup columns come back null, that record is new. But what if there is no gold table yet, because this is the initial load? We know every record is new in that case, but we still want to handle it programmatically, so we'll build a pseudo table: a frame with just the important columns, meaning the joining key, create_date, update_date and the surrogate key, because all we're trying to figure out is which records already exist. So I'll write an if condition on the flag. If init_load_flag equals 0, that's the straightforward case: the gold table already exists, this is an incremental load, and df_old is simply a spark.sql query against the gold dimension, which we'll call dim_customers; it doesn't exist yet, but we write the code as if it does. We don't need every column, only the main ones, so in the incremental case df_old is a SELECT of dim_customer_key, customer_id, create_date and update_date from databricks_cata.gold.dim_customers.
I know this feels complex, but stay with me: we grab the surrogate key, dim_customer_key, the joining key, customer_id, plus create_date and update_date. Those columns don't exist anywhere yet, because the table hasn't been created, but we're writing the notebook to manage both scenarios, so you have to imagine it. When the SQL spans multiple lines we wrap it in triple quotes, and that's df_old for the case where the table exists. Else, meaning the table isn't created yet, I build the pseudo table, so the join still works and simply flags every incoming record as new. Look at the trick: df_old equals spark.sql with a WHERE 1 = 0 at the end. What does that return? No rows at all, only the columns, just the schema, and that's exactly what I want. Run it... "init_load_flag is not defined", really? Ah, that makes sense, we never loaded the widget value, so: init_load_flag equals dbutils.widgets.get("init_load_flag"). Next error: "databricks_cata.gold.dim_customers cannot be found". Also makes sense, because on this first run the table genuinely isn't there, and you can't select even a schema from a table that doesn't exist. So in the else branch the FROM clause has to point at some table that does exist, any decent table, for example databricks_cata.silver.customers_silver. One more error, "dim_customer_key cannot be resolved": of course, because these are pseudo columns, so I select literals with aliases instead, 0 AS dim_customer_key, 0 AS customer_id, 0 AS create_date, 0 AS update_date. The zeros will never be used; combined with WHERE 1 = 0 the query returns nothing but the column names, and the names are all we need, trust me. Run it, and it works.
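That branch, sketched end to end; on the first run only the else path can execute because the gold table does not exist yet:

    # Existing-records lookup: read the gold dimension on incremental runs,
    # otherwise build an empty frame with the same four columns (WHERE 1 = 0 returns schema only)
    init_load_flag = int(dbutils.widgets.get("init_load_flag"))

    if init_load_flag == 0:
        df_old = spark.sql("""
            SELECT dim_customer_key, customer_id, create_date, update_date
            FROM databricks_cata.gold.dim_customers
        """)
    else:
        df_old = spark.sql("""
            SELECT 0 AS dim_customer_key,
                   0 AS customer_id,
                   0 AS create_date,
                   0 AS update_date
            FROM databricks_cata.silver.customers_silver
            WHERE 1 = 0
        """)

    df_old.display()   # empty on the first run -- just the schema needed for the join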
Now let me show you df_old.display(), and you'll see just the schema, which is all I wanted: the joining key, create_date, update_date and the dim key, with no values. Now listen carefully. On the initial load this table is empty, but it is still a table, so we can left join our incoming DataFrame to it on customer_id; on incremental runs it will actually contain data. What decides whether an incoming record is existing or new? If the joined row has values in those lookup columns, any of them, though I prefer checking the dim key column because it's the easiest to match on, then that customer_id already exists in the dimension and already has a dim_customer_key, so we must not create a new one. We only create new keys for customer IDs that aren't in the table. Let's do it. Next step, a heading: applying the join with the old records, and I'll make it a sub-step under dividing new vs old records. df_join equals df.join(df_old, on customer_id, "left"). Display df_join, and because this is the initial load every lookup value is null: dim_customer_key, the old customer_id, create_date, update_date, all null, which means every record is new, and that's the truth for an initial load. One thing to take care of, though: both sides have a column named customer_id, and identical names will cause confusion when we start filtering, so we'll rename the df_old columns.
You can rename at the join step as well, but I will do it on df_old directly, under a heading "Renaming columns of df_old". So df_old = df_old.withColumnRenamed("DimCustomerKey", "old_DimCustomerKey") — I am just adding old_ as a prefix, purely as an identifier — and the same for customer_id, create_date and update_date. A stray space caused an error; remove it and it runs. Now let's reapply the join to eliminate any confusion — and we get an error, because of course the key column has a new name, so the join condition now has to use old_customer_id. Run it again and the confusion is gone: old_DimCustomerKey clearly refers to the old surrogate key, not a new one, so it is easy to tell new from old. And bro, it is not easy to create dimensions, so feel happy that you are learning how to do this — and like I told you, watch this part at least three times, because that is how you learn and grow. Now let's separate the new records from the old ones. As I mentioned, every old record will have some value in old_DimCustomerKey and every new record will have null there. So under "Separating new vs old records": df_new = df_join.filter(...) where old_DimCustomerKey is null — those rows have no existing dim key, so they are new. And df_old is the opposite: df_join filtered where old_DimCustomerKey is not null. Now df_new and df_old are separated. df_old means records that are already in the gold layer, so we do not insert anything new for them; we only update. What do we update? If any value changed for a customer that already has a dim key — say the customer name changed — we push that change, plus one more column, update_date: create_date stays as it is, but update_date becomes the current timestamp.
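Here is a hedged sketch of that rename, re-join and split, again using my assumed column names:

# Prefix the dimension-side columns so they cannot collide with the incoming ones
df_old = (df_old
    .withColumnRenamed("DimCustomerKey", "old_DimCustomerKey")
    .withColumnRenamed("customer_id", "old_customer_id")
    .withColumnRenamed("create_date", "old_create_date")
    .withColumnRenamed("update_date", "old_update_date"))

# Re-apply the left join against the renamed key
df_join = df.join(df_old, df["customer_id"] == df_old["old_customer_id"], "left")

# Rows with no existing surrogate key are new; the rest already live in gold
df_new = df_join.filter(df_join["old_DimCustomerKey"].isNull())
df_old = df_join.filter(df_join["old_DimCustomerKey"].isNotNull())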
Preparing df_old. We need to shape it to match the destination table: the destination does not have those old_ columns, so we remove them, and we need proper create_date and update_date columns. So df_old = df_old.drop(...) to drop the columns we do not need. Can we drop old_update_date too? Yes, not a big deal, because update_date is supposed to be refreshed anyway, so drop it and recreate it. I'll add a comment, "dropping all the columns which are not required", and another one, "adding update_date column", because we need that column back when we write. Let me explain the reasoning: we have to write our data with exactly the destination schema, we cannot change the schema, and the end schema has create_date and update_date. If a record already exists, do we change its create_date? No, because it was created long ago, so for that one we simply rename: "renaming old_create_date column to create_date". Now tell me one thing: will we modify update_date? Yes, because this data is being processed right now — even if no values actually changed, these records are going through the pipeline at this moment — so update_date gets the current timestamp. That is why we could drop old_update_date: "recreating update_date column with the current timestamp", using the current_timestamp() function, which is similar to the utcnow-style functions you see elsewhere in Azure. I know this is not simple; maybe you understand 20% of it now, 60% on a rewatch, 90% the third time, and the last 10% is the area of growth you close when you build it yourself from scratch. Now if I display df_old it is obviously empty for now, but it will fill up very soon.
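A minimal sketch of that preparation step — note it already includes the fix the walkthrough arrives at a little later, renaming old_DimCustomerKey instead of dropping it:

from pyspark.sql.functions import current_timestamp

# Existing records keep their surrogate key and original create_date,
# lose the duplicated old_* columns, and get a fresh update_date
df_old = (df_old
    .drop("old_customer_id", "old_update_date")
    .withColumnRenamed("old_DimCustomerKey", "DimCustomerKey")
    .withColumnRenamed("old_create_date", "create_date")
    .withColumn("update_date", current_timestamp()))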
The schema is what matters, because we cannot play with the schema: create_date, update_date — and right now create_date is actually an integer, because on the initial run it came from that literal 0 we selected when we built the pseudo df_old. So we convert it: df_old = df_old.withColumn("create_date", to_timestamp(...)) — one missing bracket later, it is fine. This conversion is really only a concern on the initial load; once the real table exists, the other branch reads proper timestamps from it and this pseudo path never runs again. Don't worry, I will run this notebook multiple times so you can see both paths. Just remember: this is the schema, and the destination must have exactly this schema. Now it's time to prepare df_new in the same way, under "Preparing df_new". What do we have in df_new? All the source columns plus the old_ columns, and for new records every old_ column is null, so we drop them, and we recreate create_date and update_date. Now tell me one thing: what should create_date be for this DataFrame? If you answered the current timestamp, you have understood more than 20% on the first go — these are new records, created right now and processed right now, so both create_date and update_date get the current timestamp. I copied the drop-and-recreate code from above, made sure old_create_date is in the drop list too, and replaced every df_old with df_new. That is a real-world project, bro; trust me, after three runs of this you will get it, and on the fourth it will feel like a piece of cake, because you are learning the concept and implementing it at the same time.
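Sketching both of those fixes, under the same naming assumptions as before:

from pyspark.sql.functions import current_timestamp, to_timestamp

# Initial load only: the pseudo create_date is an integer literal, so align its
# type with the timestamp column expected by the gold schema
df_old = df_old.withColumn("create_date", to_timestamp("create_date"))

# New records: drop the all-NULL old_* columns and stamp both audit columns now
df_new = (df_new
    .drop("old_DimCustomerKey", "old_customer_id", "old_create_date", "old_update_date")
    .withColumn("create_date", current_timestamp())
    .withColumn("update_date", current_timestamp()))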
Don't worry, I'm with you. Let's run this and display df_new — and the schema matches exactly, current_timestamp in both audit columns, very good. One more thing in the preparation of df_new: what value should DimCustomerKey start from? We already discussed this — the maximum of what was previously loaded. So it's time to add the surrogate key: we generate the key here, and then add the last loaded maximum on top. Say the last load had 200 customers: we add 200 to every generated value so the new keys continue at 201, 202, 203, 204. And how do we know that maximum? Again we branch on the initial load flag: on the initial load there is no previous value, so the offset is zero; otherwise it is the max of the previously loaded table. Now you are getting the concept — drop a lovely comment, I'm really hungry, I haven't eaten anything, just kidding, that's my love for you. So, "Creating the surrogate key, starting from 1": df_new = df_new.withColumn("DimCustomerKey", monotonically_increasing_id() ...) and display it — 1, 2, 3, up to 10 and beyond. Next heading: "Adding max surrogate key". If init_load_flag == 1 then max_surrogate_key = 0, that is fixed; else we load it: df_max_sur = spark.sql("SELECT MAX(DimCustomerKey) FROM databricks_cata.gold.dim_customers"). Just read that statement: it grabs the max surrogate key from the table if the table is there. But it returns a DataFrame, and we need a plain variable, so we will convert it using .collect(), which brings the rows back as a list (there is a .take() function as well). Because of the else block's indentation we could write that conversion inside the same cell, or we can just do it in the next cell.
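Backing up a step, here is a sketch of that key generation — monotonically_increasing_id() only guarantees unique, increasing values, not consecutive ones, but on a small single-partition DataFrame it behaves like the 1, 2, 3 … sequence shown in the walkthrough:

from pyspark.sql.functions import monotonically_increasing_id

# Provisional surrogate keys for the new records, starting from 1
df_new = df_new.withColumn("DimCustomerKey", monotonically_increasing_id() + 1)
df_new.display()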
So, converting df_max_sur into a max_surrogate_key variable — and first fixing the spelling, it is "surrogate key"; I was staring at it wondering what was written there. We run the cell and get "table or view cannot be found". But wait, init_load_flag is 1, so why is it going into the else branch? Is it indentation? No — I have hit this before: the widget value is a string by default, so either quote the 1 or, better, cast the widget value to an integer. That is the power of debugging when you have already seen the error — and now it goes down the right branch. Now the conversion itself: df_max_sur has a single column, max_surrogate_key, and one row, and we need that one value. So max_surrogate_key = df_max_sur.collect()[0]["max_surrogate_key"]. What does that do? .collect() turns the result into a list of rows, [0] grabs the first row — you already know indexing, we used it in the silver layer — and ["max_surrogate_key"] grabs the column. Since we applied MAX there is exactly one row; with more rows you could index 0, 1, 2, but with MAX it is one record, common sense. Running it standalone errors for now because it only makes sense inside the else branch, so I cut and paste it under there, and I leave the display here so you can see what we are doing — otherwise you would be asking "what are you doing, man?". Now df_new and df_old are almost ready to be inserted into the gold layer. How do we combine them? A union, since both now have the same schema: "Union of df_old and df_new", df_final = df_new.unionByName(df_old). What is the difference between union and unionByName? unionByName matches columns by name rather than by position, which is good practice even when the schemas already line up. And then — ah, a very good thing I forgot, which the autocomplete just reminded me about.
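The offset lookup as a small sketch — note the int() cast on the widget value, which is exactly the string-versus-integer bug the walkthrough just debugged:

# Widgets return strings, so compare as an integer
init_load_flag = int(dbutils.widgets.get("init_load_flag"))

if init_load_flag == 1:
    # First ever load: nothing in gold yet
    max_surrogate_key = 0
else:
    # Continue numbering from the highest key already in the dimension
    df_max_sur = spark.sql("""
        SELECT MAX(DimCustomerKey) AS max_surrogate_key
        FROM databricks_cata.gold.dim_customers
    """)
    # collect() returns a list of Row objects; a single row here because of MAX()
    max_surrogate_key = df_max_sur.collect()[0]["max_surrogate_key"]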
We have max_surrogate_key, but who actually adds that value onto the keys? Me. So before the union: df_new = df_new.withColumn("DimCustomerKey", lit(max_surrogate_key) + col("DimCustomerKey")). Now that 1, 2, 3 becomes 2001, 2002, 2003 if the previous load ended at 2000, so the keys never collide. Run it, display it — and for now the output is exactly the same as before, because max_surrogate_key is currently zero, but on the subsequent runs it will be the max of the previously loaded values. Now the union — and there is an error: "current_date" among customer_id, create_date, update_date? Who created a column called current_date? It is called create_date, bro. Fix the name, run all from the top — and now "DimCustomerKey cannot be resolved". Let me check the schemas of both DataFrames: df_old.display() shows we do not have the dim key. Why? Ah, I see: we dropped old_DimCustomerKey in df_old, but we should not drop it, we only need to rename it, because the existing dim key has to stay the same. So add "renaming old_DimCustomerKey to DimCustomerKey" as well. Run all again — still complaining? Are you serious? We wrote the rename, but the column is not showing — oh, because we never removed it from the drop() list. That is why I keep saying slowly changing dimensions are not easy to maintain, and that is exactly why data engineers are in such high demand. Run all once more and it is finally done. Let me see df_final after that great effort: 2,000 rows, with the surrogate key DimCustomerKey, and create_date and update_date exactly equal, because all our records are new. I love you, man — it feels so good to have finally created our dimension. Now it's time for the final piece, the upsert condition — update plus insert. All the hard work is done; the upsert itself is almost written for us.
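For reference, that offset-plus-union step as a sketch:

from pyspark.sql.functions import lit, col

# Shift the provisional keys by the last loaded maximum, then stack the
# new and existing records back together by column name
df_new = df_new.withColumn("DimCustomerKey", lit(max_surrogate_key) + col("DimCustomerKey"))
df_final = df_new.unionByName(df_old)
df_final.display()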
Databricks — or rather PySpark with Delta Lake — has made the upsert code really simple to apply, and I will show you how. Quick question first: which column do we upsert on? The surrogate key. If the gold table already had those keys, the merge would update the values of the other columns for those rows; and if the keys are in our DataFrame but not yet in the gold table — which is our case right now — it inserts them. So the surrogate key is the deciding column: it is basically the replacement for your primary key, a new primary key that is just a simple running number. I am really happy that we have implemented SCD Type 1, and now you know the difference between a hobby project and a real-world, real-time project: it is easy to grab a Kaggle dataset and drag and drop, but this is data warehousing — you are actually building a star schema — so feel happy; these things take time. Good morning everyone, it is 10:30 a.m. here, next day, let's pick up where we left off and apply the upsert. Upsert is a term you should know for interviews: an interviewer may ask how you implement upsert and which slowly changing dimension it maps to — upsert is essentially SCD Type 1, because we either update existing records or insert new ones. To apply it we first need a table, some data at the destination. So again I will use the flag: under a heading "SCD Type 1", if init_load_flag == 1 we simply write our data out and create the table on top of it, calling it dim_customers. But you could ask: Lamba, is there a better way than maintaining this flag by hand? Yes — Spark's catalog. If I write if spark.catalog.tableExists(...) and print "hello" inside, nothing is printed, because the table does not exist; add an else branch and print there, and that branch is the one that fires.
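That catalog check as a tiny sketch (the table name is my assumption again):

# Alternative to hand-maintaining an init_load_flag widget:
# ask the catalog whether the target dimension already exists
if spark.catalog.tableExists("databricks_cata.gold.dim_customers"):
    print("table exists - go straight to the merge/upsert")
else:
    print("first run - write the data and create the table")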
Perfect, so that is the alternative. My job was to show you both approaches: with tableExists you never have to maintain or flip the flag, but both solutions exist in real organizations, so you should recognize either one when you join a team — you will not be standing there asking "what is this flag?". For this notebook let's keep the flag version, so that you have both patterns in one place whenever you revise this or present it to an interviewer. So: if the table is not there, create it first; else, apply the upsert logic. For the upsert we need from delta.tables import DeltaTable — this gives us a DeltaTable object, basically an instance of our table, on which we can run the merge. So dlt_obj = DeltaTable.forPath(...). There are two options, forPath and forName; I personally like forPath because it keeps the code easily reusable. We pass the Spark session and the location, the abfss path: the gold container on our storage account, under dim_customers — this is the location that will exist after the first run, because on the first run the create branch runs and on every subsequent run the merge branch runs. Now the object is ready and we can apply the upsert, the merge condition. The target is our gold table, so dlt_obj.alias("trg") — trg for target — then .merge(df_final.alias("src"), ...), aliasing the source too purely for readability, so that even a less experienced developer immediately sees which side is target and which is source. Then the joining condition for the merge: trg.DimCustomerKey equals src.DimCustomerKey. Stop there for a second, because the rest needs explaining (and this auto-suggestion is really annoying). We have the merge condition on a column, but how do we tell it what to update and what to insert? Good news: that part is already automated. There is a function called whenMatchedUpdateAll() — if the DimCustomerKey of my source matches the DimCustomerKey of my target, we know that record needs updating,
and updateAll means it updates every column of that record. Then whenNotMatchedInsertAll(): if the dim key is not in the target, it is a new record and gets inserted. In real-world scenarios you sometimes need multiple conditions — whenMatched plus an extra predicate — and in those cases some teams prefer the SQL MERGE syntax over the Python API, but most of the time you will use PySpark, which is why I did it this way. There is also whenMatchedUpdate, where you pass a set of just the columns you want to update instead of all of them. And once it is done you must call .execute(), which is the most important thing to write at the end. For the create branch, provide the table location as well with .option("path", ...). Excited to run it? It should create the table at that location — and error: "is not a Delta table". Wait, why is it running the merge branch directly? Let me comment things out and add a print("testing") to confirm which branch runs... it goes straight to the else. Is the if condition broken? Let me add parentheses... still the else. Let me change the value and test... ah, I see. We need to switch the code around — clearly I need more caffeine. Just think: if the table exists, we should merge; we only create the table if it does not exist, and earlier we wrote it the other way around. So cut the create-table code, paste it into the else branch, and put the merge under the if. Use your brain: if the table is there we upsert, and we create the table only and only if it is not there. Run it again and it creates the table at the location. And now I will do one more thing that will at least double or triple your understanding of this whole gold-customers solution: rerun everything, because now there is data in the destination — earlier we were only imagining it, now we will actually see it.
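Putting the corrected branch order together, a hedged sketch of the SCD Type 1 upsert — the abfss path uses a placeholder storage account, and the table and column names are the same assumptions as before:

from delta.tables import DeltaTable

# Hypothetical external location for the gold dimension
target_path = "abfss://gold@<storage_account>.dfs.core.windows.net/dim_customers"

if spark.catalog.tableExists("databricks_cata.gold.dim_customers"):
    # Table already exists: merge (upsert) on the surrogate key
    dlt_obj = DeltaTable.forPath(spark, target_path)
    (dlt_obj.alias("trg")
        .merge(df_final.alias("src"), "trg.DimCustomerKey = src.DimCustomerKey")
        .whenMatchedUpdateAll()      # existing keys: overwrite all columns (SCD Type 1)
        .whenNotMatchedInsertAll()   # unknown keys: insert as new rows
        .execute())
else:
    # First run: write the data and register the external table on top of it
    (df_final.write.format("delta")
        .mode("overwrite")
        .option("path", target_path)
        .saveAsTable("databricks_cata.gold.dim_customers"))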
Okay, let's see. Our init_load_flag is now 0, because the data has already been loaded, so let's run the code cell by cell and validate our earlier imagination against the real thing. First we read the source again — which is also a nice test of the upsert, because every record is exactly the same, so nothing should be inserted, only updated; two tests at once. Remove the duplicates, display the data, looks good. Now the other branch of the df_old code runs, because the flag is 0 — but why am I seeing red? Is the column called DimCustomersKey or DimCustomerKey? Let me check... it is DimCustomerKey, just a typo; fix it (we could fix it in the rename cell too, but it does not really matter there because we rename it anyway — tidy it just for readability). Run it, and where df_old was empty before, now you see the data — the game changes. And what are create_date and update_date showing? The moment we created them: it shows the 27th because I left the code last night and restarted today, and the timestamp says 13:36. So now we have real data where before we were imagining it. Rename the columns as before (removing that stray "s" here as well), reapply the join with the old records, and look at the result: old_customer_id, old_create_date, old_update_date and old_DimCustomerKey are all populated — nothing to imagine anymore. Then we separate new versus old, and in this case no record is new; all records are old. Now the really interesting part, preparing df_old: we drop the duplicated columns, keep the old surrogate key by renaming it, keep create_date as it is because it must stay 13:36, and refresh only the update_date timestamp. Run it and you see the magic: create_date stays at 13:36, update_date becomes 13:40, while the table we pulled from still says 13:36.
Preparing df_new: we have nothing new, so it is empty, zero rows, which is exactly what we want. Create the surrogate key — important step — then "adding max surrogate key", which is a great test, because the offset should be around 2,000-something and with no new records nothing strange should appear. So far so good. Check df_final: 2,000 rows, very good. Now remove my little test, run the delta load and apply the merge — we should not end up with more than 2,000 records, and every record should carry the refreshed update date. Almost there... very good. Now SELECT * FROM databricks_cata.gold.dim_customers: 2,000 rows, create_date is 13:36 and update_date is 13:42 (it took a couple of minutes to run), so dim_customers is working absolutely fine. I am really happy: we have built dim_customers, and I have shown you both approaches, including the one where you do not even need to manage the flag parameter — but you should know about both. Done. Now one small fix back in the workspace: in the silver notebooks I will switch the writes from append to overwrite, because we do not want to accumulate data in silver — it is a transient layer that only holds the current increment, so every run should replace it. Same change for the other silver tables. And now I am really excited to show you Delta Live Tables. Before we start, listen carefully: I have talked about Delta Live Tables before, and people skip the part about configuring the cluster and then complain that their clusters do not work — so watch that configuration part carefully and do not skip it. That warning is mostly for the new people; my data fam, I know you are focused. Okay, this write is already overwrite, perfect. So let's talk briefly about Delta Live Tables — let me take you to the documentation. Disclaimer: it is very, very easy, and it is built to make your life easy, bro.
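For reference before we move to DLT, the append-to-overwrite switch in a silver notebook might look roughly like this — the path and table name here are placeholders, not necessarily the project's exact ones:

# Silver acts as a transient layer: each run replaces the previous increment
(df.write.format("delta")
    .mode("overwrite")   # previously .mode("append")
    .option("path", "abfss://silver@<storage_account>.dfs.core.windows.net/customers")
    .save())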
Just think about our own example: we spent a lot of time building slowly changing dimension type 1, we had to be extremely careful, and we still made mistakes. One of the many features of Delta Live Tables is that it automates exactly that — slowly changing dimension tables. Really, yes. DLT is built to make your life easy, so do not panic just because it is new; it is new to us as well, and it is genuinely easy. So what is DLT? DLT stands for Delta Live Tables, and it is an ETL framework — specifically a declarative ETL framework, which is what sets it apart from other ETL frameworks. Declarative means you do not worry about how to achieve the result; you only declare what you want to achieve, and the how is taken care of by Databricks, by Delta Live Tables. That is the simplest definition. So, say you want to build a products dimension: you just declare it — and don't worry, I will show you how to build it with DLT as a slowly changing dimension type 2, and SCD Type 2 normally needs even more focus and more code than type 1, because you have to maintain history as well. All of that will be automated for you; you are going to love it, trust me. Now the backbone of DLT — let me find the key concepts page in the docs — is three kinds of objects: streaming tables, materialized views, and streaming views. We already know what streaming tables are: unbounded tables that expect to receive data which can only be appended, and the docs note they can have one or more streaming flows (append, apply changes). Then materialized views: what is the difference from a simple view? A view only holds the query, just the logic, and every time you call it the query runs in real time and gives you the result; a materialized view stores the result of that query as well, in a physical table. And a streaming view is just a view built on top of a streaming source, one that can
be appended to incrementally, and it gives you exactly-once processing as well, which is why it is a great fit for idempotency — and you already know about that. A normal view re-runs the whole query over the whole table every time, whereas a streaming view processes incrementally. And for your reference, streaming views should — not should, must — be built on top of a streaming source: streaming tables, or directly on top of any Delta table, which is fine because Delta tables maintain the transaction log and versions, so the stream knows which version it still needs to read. But not on top of materialized views, not at all. That is the high-level overview; now let's go to our notebook and actually build something with DLT. You can read the documentation yourself if you want — I have not read every page either, I did some investigation while I was learning and then just started building, and that is fine. Go to the Python language reference, then the overview: the first thing we need is import dlt, and then the main APIs — append_flow, apply_changes, create_streaming_table, and expectations. apply_changes is the API we use to build slowly changing dimensions, and expectations matter too: whenever we create tables in production we want constraints — this is not null, this is unique — and those constraints are applied on the table using expectations; we will use them shortly. Then look at the dataset definition: a Delta Live Table is created with a decorator. What is a decorator? As the name suggests, it takes your function, does some decoration around it, and returns a result — that is plain Python, not something invented for DLT, and it is exactly the kind of fundamental knowledge that matters more than ever, because everything is built on top of fundamentals. Then the function name: whatever function name you pick becomes the table name, unless you pass a table name to the dlt.table() decorator, in which case the function name is ignored — that is up to you. And inside the function you simply return the query, the DataFrame you want; that is your Delta Live Table, trust me, that's it.
Then there is apply_changes — you can definitely go through that page too, because we will build a streaming table and everything around it, and it is all very simple once you have seen it once. Whenever anyone sees this code for the first time the reaction is "what is this?"; I had the same reaction, so have some water, have some coffee and chill — I will show you exactly what to do. The docs also show the three things you can create: a view, a materialized view, and a table. What is the difference between a materialized view and a streaming table in code? If you use @dlt.table with a normal read, that is a materialized view; if you use a streaming read, that is a streaming table — the docs note that the dlt module includes dlt.read and dlt.read_stream style functions for exactly this. Enough theory; let's build it. Go back to the workspace, create a new notebook, call it gold_products — and by the time you finish this notebook you will be saying that Delta Live Tables are great and automate a huge amount of work. One thing about compute: I could turn on my all-purpose cluster, but DLT pipelines do not run on all-purpose clusters — they need a job cluster when they actually run or are debugged. So why turn on the all-purpose cluster at all? Because it will surface any syntactical errors while we write the code. The workflow is: write the notebook first, then create the pipeline, run it in development mode, and debug anything that breaks there. Strictly you do not even need a cluster attached while writing — serverless is fine too, since you cannot see DLT output in the notebook anyway — but I turned it on because this caused issues for a lot of you, so listen carefully. What is written on my cluster? Four cores. And for your information this is not only a free-trial thing; I have a paid account, so I know
about this too. By default, in each region you get a quota of roughly 10 cores. My all-purpose cluster is using four of them, and a DLT job cluster needs at least eight, because a job cluster has a driver node and a worker node and each node takes at least four cores, whereas an all-purpose cluster can run with just a driver doing everything. So 8 + 4 = 12, and if your quota is 10 cores, will your pipeline run? No. That is why I keep telling you: whenever you want to run a DLT pipeline, shut this cluster down — actually confirm it is terminated — and at the same time make sure the DLT pipeline itself is configured to use only eight cores, because by default it can ask for 16, and then it simply will not run. I will not repeat this again. The clean fix, if your account allows it, is to raise the quota: go to the Azure portal home, search for "Quotas", open "My quotas", and search for the VM family — standard DS3 v2 in my case. My current usage shows 40%, four out of ten cores in UK South, so I also have only ten available in this region. Click the pencil, request a new limit, say 20, submit it, and it is usually approved within a few hours. Try that, and do not cry when your pipelines fail — and even if you cannot raise the quota, just turn off the all-purpose cluster and make sure the DLT pipeline uses only eight cores. Do some investigation, bro, that is how you learn, and I have created a dedicated community for help, so ask your questions in the right place. If you want to be a data engineer, or grow as one, you have to learn how to debug: everything is new for everyone, and people learn by sitting with errors for weeks. Don't expect everything to be spoon-fed; troubleshoot, explore, take it positively. I see plenty of errors too — sometimes I spend days on one — but that is how you grow. Use the AI tools, do your best, tell yourself you will not eat until you solve it, and watch the error get solved in minutes. Build that attitude. Okay, Lamba, focus — this is
not a philosophy class, I was just cheering you up so you keep trying new things. Now let's build the pipeline for gold products. First a text heading, "DLT pipeline". The first object is a streaming table, built on a source — and our source is silver.products. So I write the decorator @dlt.table, and inside it I can give the table a name, say name="dim_products_stage", then define the function underneath. Ideally keep the function name the same as the table name just to avoid confusion, but as I mentioned, if you pass name= in the decorator the function name does not matter; if you omit it, the function name becomes the table name. Simple. Inside the function we create a DataFrame — the docs call this a streaming read on a table, because we want to build a streaming table on top of our existing silver table. So df = spark.readStream.table("databricks_cata.silver.products_silver") — let me double-check the table name... yes, products_silver. If I wanted transformations I could add withColumn calls here, but I do not need any, so I simply return df, and DLT creates a streaming table from that DataFrame. That is the declarative framework at work: no saveAsTable, no format, no mode — all of that is managed by Delta Live Tables. And of course we need import dlt at the top, plus the usual from pyspark.sql.functions import *. Now the streaming table is ready. By the way, this staging object should not be called dim_products, because it is not our final dimension table — it is just the ingested silver data — so I call it dim_products_stage, a kind of staging table, and on top of it I will define a view with @dlt.view.
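Here is roughly what that staging streaming table looks like so far — the table and catalog names are my assumptions based on the walkthrough:

import dlt
from pyspark.sql.functions import *

# Streaming table that incrementally ingests the silver products table
@dlt.table(name="dim_products_stage")
def dim_products_stage():
    df = spark.readStream.table("databricks_cata.silver.products_silver")
    # any transformations (withColumn, filters, ...) would go here
    return df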
Inside that view we again create a DataFrame: spark.readStream.table(...) — and now you will say, "Lamba, this table is not created yet, we have not even run the pipeline, how can we refer to it?" That is what the live keyword is for: we write live.dim_products_stage, and that is how you reference tables defined inside the same DLT pipeline, because those objects only come into existence when the pipeline runs. Then return df, that's it. Now it's time for the dimension table itself — and do we have an error? "There was an exception while executing the Python proxy on the Python side"... that is just because there is no DLT cluster behind this notebook, so it cannot actually execute the streaming read; ignore errors like this while authoring, you can only really debug once the pipeline runs. So, dim_products. To build the dimension we use an API called apply_changes, and before that we call dlt.create_streaming_table, which creates an empty streaming table for us — we call it dim_products. That empty streaming table is our target. Then dlt.apply_changes with its parameters — go to the apply_changes API page and let me explain what each one means; it is simple. Whenever you build any dimension table, DLT or not, you need a target: in our case, you have three seconds — yes, the empty streaming table we just created. And the source? Another three seconds — yes, the view, because the staging view feeds that table. Then keys: to do any kind of upsert or merge we need a key column, and here the key is the primary key, product_id. Then sequence_by: in slowly changing dimension type 2 we order the history by a date column, and in the real world your source will have one; ours does not, which is fine — we could add one to the silver table, and that would be good for your understanding, but to keep things simple for now we will not. Then except_column_list, for columns you do not want carried into the target.
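Stepping back, the streaming view itself might look like this sketch — the view name is again my assumption:

import dlt

# Streaming view feeding the dimension; "live." refers to objects
# defined inside the same DLT pipeline
@dlt.view(name="dim_products_view")
def dim_products_view():
    return spark.readStream.table("live.dim_products_stage")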
Now let's walk through the apply_changes parameters. Every dimension load needs a target — you have three seconds — and if your answer is the empty streaming table we just created, you're right. What is the source? Another three seconds — if your answer is the view, correct, because the view feeds that table. Then keys: apply_changes performs an upsert, a merge, so it needs a key column, and that should be the primary key, which here is product_id. Then sequence_by: in slowly changing dimension type 2 we maintain history based on a date column. Our demo source doesn't have one, but in the real world your source will, and you could easily add one to the silver table; to keep things simple in the beginning we'll just sequence by product_id. Next, except_column_list: say your source does have a date column — you wouldn't want it in the target dimension, because the dimension maintains its own start and end dates, so this parameter lets you exclude such columns. Then the main one, stored_as_scd_type: write 1 and you get SCD Type 1, write 2 and you get SCD Type 2 — we want type 2. You don't need to worry about track_history_column_list and track_history_except_column_list here. So fill the parameters in: target is dim_products, source is live.dim_products_view, the key column is product_id (let me confirm — yes, product_id), sequence_by is product_id since we have no date column, remove except_column_list and the two track_history parameters, and set stored_as_scd_type to 2. Run it — you'll get the same "can't execute without a DLT cluster" message, so ignore it. And just appreciate this: we built a slowly changing dimension type 2 in maybe 10 or 15 lines of code, whereas earlier we wrote a whole notebook just for type 1. That's how easy Delta Live Tables makes these objects.
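Putting those parameters together, a sketch of the apply_changes call as used in this walkthrough. Sequencing by product_id is only a stand-in because this demo source has no date column; in a real project you would sequence by a date or timestamp column and use except_column_list to keep unwanted source columns out of the dimension. The source is written with the live. prefix as in the video — some examples reference it without the prefix, so keep whichever form your validation accepts:

```python
import dlt

# CDC into the empty streaming table, stored as a slowly changing dimension type 2.
dlt.apply_changes(
    target="dim_products",            # the empty streaming table created above
    source="live.dim_products_view",  # the staging view
    keys=["product_id"],              # primary key used for the merge
    sequence_by="product_id",         # normally a date/timestamp column
    stored_as_scd_type=2              # 1 = SCD Type 1, 2 = SCD Type 2
)
```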
Now it's time to actually create the pipeline. Kill the all-purpose cluster first — terminate it and wait for it to finish terminating. Then go to Pipelines and create a pipeline; you'll see ETL pipeline, sample ETL pipelines and the ingestion pipelines that ingest data from external sources in a few clicks — you've already used those — so pick ETL pipeline. Pipeline name: gold_products. Serverless and product edition Advanced are fine, pipeline mode Triggered is fine. For the path, pick the gold products notebook. Unity Catalog: default catalog is our catalog, default schema is gold. For compute, change the cluster mode from enhanced autoscaling to fixed size with just one worker, open the configuration, keep the channel on Current, pick a 4-core VM as the worker type — people tend to grab the 8-core one, you don't need it — and pick 4 cores for the driver as well, then hit Create. Your job cluster is ready and the pipeline sits in development mode; clicking Start would run it, but we're not starting it yet, because I want to add one more thing: expectations. Let me take you to the documentation — there's a diagram there I really like. Expectations are basically constraints you apply to your table, for example "product_id should not be null", plus a decision about what happens when the constraint is violated. There are three actions: warn, drop, or fail. If the expectation passes, processing simply continues; if it fails, you can ask DLT to record a warning, drop the offending records, or fail the whole pipeline. In the real world you usually apply several expectations to a table — say product_id is not null, price is greater than zero, and so on — and the same logic applies: if all pass, the rows continue; if any fail, you warn, drop or fail the flow. Doing this is easy: you create a dictionary of rules, and then you have the decorators — @dlt.expect, which defaults to warning, @dlt.expect_or_drop, and @dlt.expect_or_fail — plus the expect_all variants like @dlt.expect_all_or_drop for applying a whole dictionary of rules at once. This is very important, and it makes your DLT pipeline much more robust in the production environment. Let's go to the gold products notebook and add some expectations.
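To make the three failure actions concrete, here's an illustrative, hypothetical example — the table name and the price rule are placeholders for demonstration, not part of the project — showing one decorator per rule; the decorator you choose decides what happens on a violation:

```python
import dlt

@dlt.table(name="products_checked_demo")                               # hypothetical table, illustration only
@dlt.expect("valid_product_id", "product_id IS NOT NULL")              # warn: violations are only recorded in metrics
@dlt.expect_or_drop("valid_price", "price > 0")                        # drop: offending rows are removed
@dlt.expect_or_fail("valid_product_name", "product_name IS NOT NULL")  # fail: any violation stops the update
def products_checked_demo():
    return spark.readStream.table("databricks_cata.silver.products_silver")
```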
the data shouldn't travel all the way to the end and only fail there — block it at the start, which means putting the expectations on the staging table. First prepare the rules: I'll create a dictionary called my_rules, with rule1 as "product_id IS NOT NULL" and rule2 as "product_name IS NOT NULL" — simple SQL-style constraints, exactly like the ones you write in SQL. (Quick check in the silver table that product_name actually exists — yes, it does, so these two rules are fine.) Then I add the decorator @dlt.expect_all_or_drop(my_rules) to the staging table, because if there are corrupted records I want them dropped; you'll see it sketched just after this. You can confirm the pattern in the docs: the table decorator first, then the expectations decorator, then the function returning the DataFrame. It isn't complex, it's just new — and whenever something is new you feel less confident, so absorb it as new knowledge instead of calling it complex. Now, to test this, I can attach the notebook to the DLT pipeline itself: click Connect, pick the pipeline, and it tells you that you are now connected to a DLT pipeline. It starts validating, and that takes a while because the job cluster has to spin up — usually six or seven minutes — but it's worth waiting, because it's always better to validate and debug the notebook first rather than throwing it straight into the pipeline and reading the errors there. If you click "Open in DLT UI" you can watch it waiting for resources while you keep writing code. In my case the cluster startup took about 8 minutes.
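Here is the staging table again, now guarded by those two rules — a sketch assuming the same names as before; the *_all_* decorators take a dictionary of rules, and the _or_drop variant removes the rows that violate them:

```python
import dlt

# Rules used in this walkthrough: plain SQL boolean expressions.
my_rules = {
    "rule1": "product_id IS NOT NULL",
    "rule2": "product_name IS NOT NULL",
}

# Same staging table as before, with the expectations applied up front
# so bad rows are blocked at the start of the flow.
@dlt.table(name="dim_products_stage")
@dlt.expect_all_or_drop(my_rules)
def dim_products_stage():
    return spark.readStream.table("databricks_cata.silver.products_silver")
```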
Once the cluster was up, the waiting-for-resources step passed without an error. A quick tip first, though: go to Compute and look at your clusters. I deleted the all-purpose (personal) compute entirely, because even when it's terminated you can still hit quota-limit errors, and the clean way to eliminate them is to delete that cluster — Compute, All-purpose compute, delete — and then wait a while, I'd say ten minutes, because it takes time for the quota logs to refresh. Make sure the only thing left is the job compute (keep a SQL warehouse if you have one, otherwise delete that too). You can see the effect: the active cores are now eight, whereas before the all-purpose compute was occupying four of them and creating that mess. In short: delete the other clusters, keep just the one, and before restarting your job cluster hold off for roughly ten minutes, because killing a cluster and updating the logs takes time — don't be in a rush. Now let's look at the actual errors in the DLT event log, since the cluster configuration is sorted. One says "failed to resolve flow due to upstream failure: dim_products_view", another "failed to resolve flow due to upstream failure: databricks_cata.gold.dim_products", and there's a catalog-related failure too. Use the Validate step instead of a full run — it checks everything and surfaces the errors. One of them points at the read of the silver products table in the staging function, which is strange, because the message mentions a gold table and I haven't written the word gold anywhere in this notebook.
Let me validate again from the notebook — this time it's quick, because the resources already exist — and go straight to the event log. Now it says "failed to analyze flow databricks_cata.gold.dim_products_stage", and that's the clue: whenever you use the live keyword, DLT resolves it against the pipeline's target schema, which we set to gold, so it's looking for gold.dim_products_stage. And the actual root cause of the failure was a typo in the staging table: I had written silver with an underscore instead of silver with a dot in the source table reference. Fix that and validate once more — and there's a second typo, the catalog name was misspelled, so fix that too. That's the current pain with DLT notebooks: you can't debug cell by cell, you have to validate the whole notebook and read the event log to find the root cause. They'll improve that — ideally they'd allow debugging DLT against an all-purpose cluster as well — but this is still a new feature. After the fixes the validation passes, and the DLT graph looks exactly right: dim_products_stage, then dim_products_view, then dim_products. To run it I just click Start — we already validated it — and once it's done we'll create the fact table, where you'll learn how to apply joins, bring in the dimension keys and ignore the rest. The run finishes: the staging table ingested 500 records, the view passed those new records through, and the dimension should show 500 as well — with all the constraints respected along the way.
There were no expectation failures, and the dimension table was built for us completely automatically, with exactly-once processing. To prove it, click Start again: this time the pipeline reads and writes zero records across the whole flow, because there is no new data. Think about that — exactly-once handling, upserts, a slowly changing dimension, constraints, full history tracking, all in a few lines of declarative code. Now only the fact table is missing: both dimensions are already in the gold layer, so let's bring the fact. Go to Workspace, Create, Notebook, call it gold orders, and attach serverless — we don't need another job cluster here. Creating the fact table is the simplest step in a dimensional model. First read the source: df = spark.sql("SELECT * FROM databricks_cata.silver.orders_silver") — I initially forgot the catalog prefix and got an error, so add databricks_cata (there's a small sketch of the read just below). This is the orders table from the silver layer. In a fact table we keep only the dimension keys and the numeric measures and remove everything else. The date column is the controversial one: by the strict rule of thumb you shouldn't keep a date column in the fact — you'd build a date dimension on top of it — but in plenty of real solutions developers do keep it, and there's no harm, so for now we'll leave it as it is. We keep order_id because it's the primary key, and we'll replace customer_id and product_id with the dimension keys.
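A small sketch of that first read, using the project's catalog and schema naming (adjust to your own):

```python
# Source for the fact table: the silver orders table.
# `spark` and display() are provided by the Databricks notebook runtime.
df = spark.sql("SELECT * FROM databricks_cata.silver.orders_silver")
display(df)
```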
So how do we swap customer_id and product_id for the dimension keys? With joins against the dimensions — see the sketch after this paragraph. I'll add a header, fact data prep, and build the lookup DataFrames first. For customers: df_dim_customer = spark.sql(...) selecting only dim_customer_key plus the joining column customer_id, which I rename to dim_customer_id so it's easy to drop after the join. Then the same idea for products — and here I hit an error: what is the product key actually called? We never looked at the data, so let's query it: SELECT * FROM databricks_cata.gold.dim_products and see what Delta Live Tables created for us. You'll notice it has already maintained the history: there are __START_AT and __END_AT columns. The start column tells you when a row version became active, and the end column stays null while that version is still in use — normally those would be dates, but since we sequenced by product_id you see product_id values there. That is the fundamental idea of a slowly changing dimension type 2: it keeps track of when each version started and when it stopped being in use. There's no generated surrogate key in this table, so I'll use product_id itself as the key — select product_id as the product key (you could rename it dim_product_key or anything you like) and also keep product_id renamed for the join, the same trick as with customers. And remember the hassle we went through to build this kind of table by hand earlier — Delta Live Tables did all of it automatically. Now the joins: join the orders DataFrame to df_dim_customer on customer_id equal to dim_customer_id, and to the products lookup on product_id equal to dim_product_id. An inner join would work, but ideally we perform a left join.
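A sketch of those two lookups — the column names (dim_customer_key and so on) follow the earlier gold notebooks and the apply_changes output, so treat them as assumptions and check your own schema; since this dimension is SCD Type 2, in a real project you would also filter to the current rows (__END_AT IS NULL) before joining:

```python
# Customer lookup: surrogate key plus the natural key, renamed for easy dropping.
df_dim_customer = spark.sql("""
    SELECT dim_customer_key, customer_id AS dim_customer_id
    FROM databricks_cata.gold.dim_customers
""")

# Product lookup: the DLT dimension has no generated surrogate key,
# so product_id doubles as the key here.
df_dim_product = spark.sql("""
    SELECT product_id AS dim_product_key, product_id AS dim_product_id
    FROM databricks_cata.gold.dim_products
""")
```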
Let's perform the left joins: add how="left" on both. Then, rather than selecting order_id, dim_customer_key, order_date, quantity and so on one by one, I'll keep everything and just drop the helper columns: df_fact_new = df_fact.drop("dim_customer_id", "dim_product_id"). Running it I got "String object has no attribute 'session'" — that was the auto-suggestion sneaking something wrong into the cell, so remove it; that's why I don't fully trust autocomplete — sometimes it's really quick, sometimes it's really annoying. Display df_fact_new and we're nearly there — but we still have customer_id and product_id, and here's a possible interview question: why drop them? Because dim_customer_key and dim_product_key are now the joining keys to the dimensions, so keeping the natural ids adds nothing. Drop those two as well, rerun, and now the fact table looks right: just the dimension keys and the numeric columns (the join and clean-up are sketched below). Next, the upsert on the fact table. If your fact is small you could simply overwrite it, but ideally a fact table is upserted, so: from delta.tables import DeltaTable, then if spark.catalog.tableExists("databricks_cata.gold.fact_orders") we'll run the upsert, else it's the first load — df_fact_new.write.format("delta").option("path", ...).saveAsTable(...), and the default mode is fine. For the upsert itself, this time let's use DeltaTable.forName — just to show you both forName and forPath — give the table the alias trg, merge df_fact_new aliased as src, and now the question: what should the merge key be?
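The join and clean-up, as a sketch under the same naming assumptions:

```python
# Replace the natural keys with dimension keys via left joins,
# then drop the helper join columns and the original ids,
# leaving only keys and numeric measures in the fact.
df_fact = (
    df.join(df_dim_customer,
            df["customer_id"] == df_dim_customer["dim_customer_id"], how="left")
      .join(df_dim_product,
            df["product_id"] == df_dim_product["dim_product_id"], how="left")
)

df_fact_new = df_fact.drop("dim_customer_id", "dim_product_id",
                           "customer_id", "product_id")
display(df_fact_new)
```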
If your fact table has a natural primary key, great — ours does, order_id. But it's not a rule that a fact table must have one, and it's not a problem when it doesn't, because the primary key of a fact table is effectively the combination of all its dimension keys: customer, product, and the date dimension if you had one. So I'll build the merge condition from those keys — trg.dim_customer_key equals src.dim_customer_key, and the same for the product key — alongside the order key (the full first-load-or-upsert block is sketched after this paragraph). Then whenMatchedUpdateAll, whenNotMatchedInsertAll, execute. Run it, and let's check: SELECT * FROM databricks_cata.gold.fact_orders — 10,000 rows, which matches the orders table, perfect. Now the double test: rerun the whole thing and see whether it still returns 10,000 rows, with the upsert, with idempotency, with everything. Fingers crossed... perfect, still 10,000. Mission successful — we have built our star schema. Refresh the catalog and the gold schema now holds dim_customers, dim_products and fact_orders. And what is this __databricks_internal catalog? It's created automatically, and it's where Delta Live Tables manages everything for your DLT tables — a nice answer to have ready in an interview. It looks empty from our side: earlier you could browse its data and locations, but Databricks has restricted it to system use, which is a good move, because poking around in there could break the whole warehouse. Now let's go to Workflows and create the end-to-end pipeline. We could start a new job or continue from the existing bronze incremental one; note that a task can be a notebook, a pipeline, or even "run job", so you can embed a whole job inside another job — that's a parent pipeline with child pipelines, just for your understanding — but we'll build the entire workflow in one job.
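A sketch of the first-load-or-upsert logic. The storage path is a placeholder (the video uses its own external location), and the merge condition is my reading of the walkthrough: order_id is the natural key, and the dimension keys are included to reflect the point that a fact's primary key is really the combination of its dimension keys — if you have no natural key, merge on that combination alone:

```python
from delta.tables import DeltaTable

fact_table = "databricks_cata.gold.fact_orders"

if spark.catalog.tableExists(fact_table):
    # Incremental run: upsert into the existing fact table.
    (DeltaTable.forName(spark, fact_table).alias("trg")
        .merge(
            df_fact_new.alias("src"),
            "trg.order_id = src.order_id "
            "AND trg.dim_customer_key = src.dim_customer_key "
            "AND trg.dim_product_key = src.dim_product_key"
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First run: create the external Delta table.
    (df_fact_new.write
        .format("delta")
        .option("path", "<external-location>/gold/fact_orders")  # placeholder path
        .saveAsTable(fact_table))
```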
Instead of embedding jobs inside jobs, I'll go to Workflows, open the job and rename it to "end to end pipeline" — the parent pipeline. It already has two tasks: the parameters notebook and the bronze autoloader. Now add a notebook task called silver_orders and point it at the silver orders notebook. Next, silver_customers — but should it run after silver_orders? Does that make any sense? No, because there is no dependency between those notebooks; they can run independently, so we run them in parallel instead of in sequence. Just set the dependency of silver_customers to bronze_autoloader instead of silver_orders, give it the silver customers path, and it runs alongside. Do the same for silver_products: notebook task, path, depends on bronze_autoloader. So now the three silver tasks run in parallel, which is exactly what we want. We don't orchestrate the regions notebook, because it's static — there's no incremental data — so it can run independently. Once the silver layer is done we move to gold. Strictly speaking dim_customers only needs silver_customers, but here's a personal tip: wait for all the silver tasks to finish before entering the gold layer. So add a notebook task gold_customers, point it at the gold customers notebook, and make it depend on silver_orders, silver_customers and silver_products. Then insert another task, and this one is a pipeline task: our DLT pipeline, gold_products, also depending on all three silver tasks. Yes, the graph starts to look messy with all those threads, but that's the price of running both dimensions in parallel to keep the job optimized — just hover over any task and its path lights up, so don't be intimidated.
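For reference, here is an illustrative sketch of the dependency graph this job builds, written in Databricks Jobs API 2.1 form. It is not exported from the UI: the task keys, notebook paths and the DLT pipeline id are placeholders, and the fact task shown at the end is the one we wire up in the next step:

```python
# Illustrative only: the same graph built in the Workflows UI,
# expressed as a Jobs API 2.1 job definition.
job_definition = {
    "name": "end_to_end_pipeline",
    "tasks": [
        {"task_key": "parameters",
         "notebook_task": {"notebook_path": "/Workspace/Project/parameters"}},
        {"task_key": "bronze_autoloader",
         "notebook_task": {"notebook_path": "/Workspace/Project/bronze_autoloader"},
         "depends_on": [{"task_key": "parameters"}]},
        {"task_key": "silver_orders",
         "notebook_task": {"notebook_path": "/Workspace/Project/silver_orders"},
         "depends_on": [{"task_key": "bronze_autoloader"}]},
        {"task_key": "silver_customers",
         "notebook_task": {"notebook_path": "/Workspace/Project/silver_customers"},
         "depends_on": [{"task_key": "bronze_autoloader"}]},
        {"task_key": "silver_products",
         "notebook_task": {"notebook_path": "/Workspace/Project/silver_products"},
         "depends_on": [{"task_key": "bronze_autoloader"}]},
        {"task_key": "gold_customers",
         "notebook_task": {"notebook_path": "/Workspace/Project/gold_customers"},
         "depends_on": [{"task_key": "silver_orders"},
                        {"task_key": "silver_customers"},
                        {"task_key": "silver_products"}]},
        {"task_key": "gold_products_dlt",
         "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"},
         "depends_on": [{"task_key": "silver_orders"},
                        {"task_key": "silver_customers"},
                        {"task_key": "silver_products"}]},
        {"task_key": "fact_orders",  # added in the next step
         "notebook_task": {"notebook_path": "/Workspace/Project/gold_orders"},
         "depends_on": [{"task_key": "gold_customers"},
                        {"task_key": "gold_products_dlt"}]},
    ],
}
```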
Hover over any task and you'll see its path highlighted in blue, so the graph stays readable even when it looks busy. Now, compute. The DLT pipeline task runs on the pipeline's own job cluster, as we configured, but for every other task you pick the compute: click dim customers, for example, and it asks which compute to use. Serverless works, but since these are scheduled tasks the better practice is to run them on a job cluster too, so set the notebook tasks — and the parameters task — to the job cluster; it takes a bit longer to start, but that's the cleaner deployment pattern. Finally, the fact. It has to wait for both gold activities, because the fact table reads both dimensions — common sense — so add a notebook task fact_orders pointing at the gold orders notebook, depending on gold_products and gold_customers. And that's it: congratulations, you've built this beautiful flow. It reads the parameters file, incrementally loads the bronze data, runs the three silver notebooks in parallel, then gold customers (the slowly changing dimensions, type 1 and type 2) and the gold products DLT pipeline, and finally the fact table. Click Run now whenever you like — it takes a while, so run it on your own machine, watch the output, and if there are errors, debug them; at this point it's just orchestration, nothing else. Before we sum up, one more thing: your data warehouse is ready, so go to SQL Warehouses. We don't have one yet, so create it — you'll see a large default cluster size, just pick 2X-Small — name it something like "SQL cluster", leave Manage permissions with the owner, and create it. Once it's up, use the "Only my SQL warehouses" filter to find it: a 2X-Small warehouse, which is a cluster designed specifically to run your SQL workloads.
It strips away everything you don't need and is built specifically to run SQL queries on top of your warehouse. Now let me show you one more thing: open the SQL editor. This time it's attached to the SQL warehouse, not an all-purpose cluster, and you can see the catalog on the left. Run something like SELECT * FROM databricks_cata.gold.fact_orders and you get the same experience you'd have in a SQL workbench or SQL Server Management Studio — it genuinely feels like writing SQL scripts, and this is one of the nicer recent advancements in Databricks; you can run SQL workloads efficiently this way. It's also how you share the gold schema with data analysts, report builders and business analysts: they can query it through the Databricks SQL warehouse or from any SQL client. Another feature: saved queries. Write a query in the SQL editor, press Ctrl+S, choose where to save it, and it appears under Queries — if it still shows as unsaved it won't be listed, so make sure you saved it, then refresh. And one more: visualizations. Click the plus button next to the results and add a visualization on top of your query. Pick the columns — I put order_date on the X axis and quantity on the Y axis as a line chart, which came out as a flat line because the timestamps are recorded at second granularity; switching to year against quantity, as a pie or bar chart, shows the quantities per year nicely. I'm not an expert at building charts, so just play with it — the point is that Databricks has visualization built in, and you can even create dashboards as well.
Ideally we build dashboards in Power BI, yes, but Databricks has also launched its own dashboarding feature. And there's Partner Connect — you'll find it near the Marketplace — where you can see Power BI Desktop, Tableau Desktop, dbt Cloud, Fivetran and others. Say you want a dashboard in Power BI: do you have to configure all the connections to the warehouse by hand? No — click Power BI, click Connect, and it downloads a Power BI connection file with everything already filled in for you; that's why it's called Partner Connect. There are plenty more features in Databricks, and improvements land every single month, so keep exploring on your own — but let's call the project done. To sum up, go back to Workflows and look at the end-to-end pipeline we created. I'm really happy with how much you learned in this project, and now it's time to flex a little, because this is a rare, genuinely end-to-end project: incremental loading, slowly changing dimensions, a star schema, constraints, Delta Live Tables — all the latest pieces. Complete the project, drop a comment, share it and tag me, and I'll try to reply. This was a real milestone — it was not easy, and you learned a lot of new things — so congratulations. Now it's time to say bye-bye; click the video coming up on the screen, because it will teach you a lot more about data engineering and help you become an outlier. I'll see you there. Happy learning.