Hello and welcome back to the channel. In today's video we are going to go through a project based on Spark, dbt and a cloud provider. This was decided by the community: you can see how the community voted last week — we had 108 votes, of which 43 went to Spark, dbt and any cloud platform. The runner-up was a data warehousing project with best practices at 38%, then Snowflake at 16%, and an Apache Flink streaming project at 4%. I've decided we're going with the winner of the poll, which is the Spark, dbt and any-cloud-platform project. So let's go through the system architecture.
As decided, we will be adopting the Medallion architecture for this project. The Medallion architecture gives us three layers of transformation: bronze, silver and gold. In the bronze layer, the data coming into the system is left as is, untouched. In the silver layer, a level of transformation has taken place, and some of the data may have been changed, transformed or even removed. The gold layer is where we have the data that is ready to be consumed by the different business intelligence units and users, for analytics purposes, and even by consumers like data scientists who want to build data science and machine learning models.
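To make the layering concrete, here is a minimal sketch of how the three layers will map onto the ADLS Gen2 containers we create later in this video; the container and storage account names are just the ones I use in this walkthrough, so treat them as placeholders.

```python
# Illustrative only: the three Medallion layers as ADLS Gen2 container paths.
# "medallionsa" stands in for whatever storage account name you choose.
LAYERS = {
    "bronze": "abfss://bronze@medallionsa.dfs.core.windows.net/",  # raw data, left as is
    "silver": "abfss://silver@medallionsa.dfs.core.windows.net/",  # cleaned / transformed
    "gold":   "abfss://gold@medallionsa.dfs.core.windows.net/",    # consumption-ready
}

for layer, path in LAYERS.items():
    print(f"{layer:>6} -> {path}")
```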
For this project we will be using Azure Data Factory, Azure Databricks, dbt and ADLS Gen2 services, so the project sits largely on the Azure cloud and we'll try as much as possible to keep it that way. If you're new here, don't forget to like and subscribe to the channel, and turn on the notification bell so you're the first to know when a new video is released. So let's get into the project and get started.
If you do not have an account, I really recommend you go on Google and search for "Azure free account". The free account gives you $200 of Azure credit and access to 55 services for free, usually for 12 months, so if that's something you want to consider, I suggest you go and start for free right now. I'll be using my brother's account; it's a free account he opened a couple of days ago. So now, to get started, I'm going to start by creating a resource group.
We'll start with resource groups: you can search from here, or select from the list of Azure services shown on the UI. I'll click on Resource groups — I have a couple already — and click Create to make a new one. The subscription currently selected is Azure for Students; if you're on the free account you'll see your Azure free subscription here instead. The resource group I'm creating will be called the medallion spark dbt resource group. The region defaults to East US, so I'll pick something closer to me, around UK South; you can choose whichever region is closest to you, and if you hit a resource constraint and can't create one of the services, I really recommend changing the region of the resource group, or of whichever resource you're creating. So I'll use UK South. Next is the tags section; I don't have any tags, so it's straight to review and create. After validation passes, if you want to download the template for automation with something like Terraform you can do that from here, but I don't need it at this moment because it's out of scope for this video, so I'll just click Create. Now you can see the medallion spark dbt resource group is available for us.
The next thing I want to do is create an ADLS Gen2 storage account — the Azure Data Lake storage where we'll keep our data. I'll search for "Gen2", which points at storage accounts, or you can simply search for "storage accounts". Selecting that shows the ones created previously, so I'll create a new one in the medallion spark dbt resource group I created earlier. For the storage account name I'm going with "medallion" — I'm naming things after the Medallion architecture, which, as I said, is simply the bronze, silver and gold layers: bronze is usually raw data; silver is fine-tuned data that can already be consumed by some analysts and data scientists if needed; and gold is where the refined data sits in a presentable format — it could be data marts, it could feed multiple downstream services, but it's very refined data at that point. Since the storage account name can only contain letters and numbers, I'll settle on something like "medallionsa". The region is UK South, performance is Standard, and I'll keep the default geo-redundancy. Then, in the Advanced tab, I enable the hierarchical namespace, which complements the ADLS Gen2 endpoints by enabling file and directory semantics — in plain English, you get files and directories structured hierarchically, just like the folders and files under My Documents on your own machine. I leave the rest as it is, then under Networking I enable public access from all networks, leave data protection as it is, click Next, keep the default encryption, wait for validation to pass, and click Create.
Now the storage account is being created, and this is where we'll keep the multiple stores — the bronze layer, the silver layer and the gold layer — inside the storage account. I'll wait a couple of seconds for the deployment to finish and then switch over. Okay, the deployment is done, so I go to the resource. On the left I have all the blades, with some keys and shortcuts on the right; on the left, under Data storage, I click Containers. There are no containers yet, so I'll create them: first bronze, then silver, and finally gold. These are the three layers we'll be working with: whatever data comes into the system lands in the bronze layer, from there we process it into the silver layer, and then finally into the gold layer.
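If you'd rather script the container creation than click through the portal, a sketch along these lines should work with the azure-storage-file-datalake package; the account name and key below are placeholders for your own storage account's values.

```python
# Sketch: create the bronze/silver/gold containers programmatically.
# Assumes `pip install azure-storage-file-datalake` and your own account name/key.
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_NAME = "medallionsa"            # placeholder storage account name
ACCOUNT_KEY = "<storage-account-key>"   # placeholder; keep this out of source control

service = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)

for layer in ("bronze", "silver", "gold"):
    service.create_file_system(file_system=layer)  # one container per Medallion layer
    print(f"created container: {layer}")
```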
Once this is done, the next thing is to create a Data Factory. I search for "data factory" and click on Data factories. I already have one, but I'll create a new one: I select the medallion spark dbt resource group and call it medallion-adf. On the next tab, the Git configuration stays at its default — we're not going to use it, and we can configure it later — networking stays on a public endpoint, nothing under Advanced, no tags, then review and create. We wait a few seconds for the Data Factory to be created as well. The deployment succeeded, so I click "Go to resource" and we can see what we have in our Data Factory; I'm not going to launch the Data Factory Studio just yet.
There's one other thing I need: a Key Vault, so I search for "key vault". The Key Vault is where I'll store all my keys and secrets, so I don't have to copy them around every time — I can just drop values into secrets and reuse them from Databricks when we get to that section. That way we don't have to re-copy keys all the time or expose them, which is how you'd want it in a production environment anyway. So I select Key Vault; I already have one, but I'll create a new one, selecting the medallion spark dbt resource group, naming the vault something like medallion-spark-kv, with the region set to UK South and the pricing tier left at Standard. Then on to access configuration.
For access configuration I could use RBAC — role-based access control — but I'll leave it on the vault access policy option, where you can select a user. I think it's best to choose this so you can fine-grain the policy for each particular user it's assigned to. I already created a user, but if you want to create a new one you can use this button, and if you like you can start from a template — just select the permissions you're going to attach to that particular user. The principal is the application that will be accessing this particular Key Vault: for instance, if you have Databricks, the Databricks resource provider can be added to this list, and if anything else — maybe the ADLS Gen2 storage — needs to access the vault for some reason, you can add it here too, along with any other service or application. Then it's just review and create; that's how I created this user. If you click on the user and click Edit you can see the properties and permissions it was given: I don't need any cryptographic operations, privileged key operations or privileged secret operations — I only kept the key, secret and certificate permissions plus the rotation policy operation, which are the four things I attached to that particular user. If you click Review and create you can see all of them, but I won't do that right now.
Once that's done, I click Next and check whether I'm missing anything: resource access for virtual machines for deployment — no; Azure Resource Manager for template deployment — no; disk encryption — no, it's fine. Click Next, allow access from all networks and leave it at that, the tags are empty, then review and create, and once it validates we click Create. It takes a few seconds to finish the deployment, and once that's done we can continue.
Now the deployment is done, so let's switch over to our Data Factory, because that's where the major work happens. I select the medallion ADF and click Launch Studio, which opens Azure Data Factory Studio. On the left we have Home, which is of course the landing page; Author, where we'll create our pipelines, datasets and the rest; and Monitor, where we can monitor pipelines, triggers, change data capture and so on. Right now we're going to focus on two things: the Author section and the Manage section. We start with Manage, because what we want is access to our Gen2 lake storage as well as the database, which we'll be creating shortly — we don't have a database right now, and we don't have any data either. So let's sort that out before we continue into Data Factory and get cracking.
Heading back into the portal, we're going to do one last thing here: search for SQL databases. This is Azure SQL, and mind you, it uses the SQL Server dialect, so if your background is Postgres, MySQL, MariaDB or another database scripting language, some of the syntax may need fine-tuning; for now we're using SQL Server's dialect. I'll click through to create a new database, selecting the medallion spark dbt resource group. Then for the database itself — the one I currently have is very expensive, about $396 — so what I'm going to do first is remove the database that's already on the system. I'll just delete it; it takes a few seconds to complete, and once it's done we can continue.
As usual, a database resides on a particular server, so I'll search for SQL servers. I currently have one server sitting here; I could delete it and create a new one, or see whether there's anything I can adjust — but I basically created that server on the fly, so I think I'll just delete it too. Skip the feedback; we're going to create a new one eventually, and there's nothing important on it, so I confirm the delete. I close this, go back to SQL servers and refresh — yes, the SQL server is deleted, along with the database itself. So we can go back into SQL databases and create a new one.
Creating the new SQL database: right now I have an offer of 32 GB of data and 32 GB of backup storage for free, so I'll apply that offer and avoid paying too much. I select the medallion spark dbt resource group, and for the database name I'll go with something like codewithyu-medallion-db, maybe with a dev suffix, since this is semi-production already. We don't have a server right now, so we create a new one, also called codewithyu-medallion-db, with the location set to UK South — where is it, UK South, show yourself... there it is. For authentication I'll use SQL authentication, with codewithyu as the server admin login and a password that I'm creating now and not sharing with you — use your own password, and don't share it with me either, thank you very much. The passwords match, so I click OK, and that creates the server for us on the fly: the username is codewithyu and the password is something I'm not going to tell you.
For the behaviour when the free offer limit is reached, I want it to auto-pause the database until the next month, and that's fine. I click on to Networking: for the connectivity method I'll use a public endpoint, because I need to be able to reach this — not everybody, just me, and of course I'll be deleting this database after the video so nobody else can access it. No firewall rules needed, the connection policy is fine as it is, so I click Next. Under Security I don't need Defender for SQL, no ledger, no server identity, and the rest stays at the defaults. Then the additional settings — this is very important: for the data source I want the database populated with the AdventureWorks data, which is a ready-made sample database. You need to select the sample so AdventureWorks gets added for you; if you don't, you'd have to download AdventureWorks yourself and load the database with it, which I think is a stressful way of going about it — it's best to just use the sample, which gives you the latest AdventureWorks with all the data sitting there once it's done. So I click Sample and it gives us AdventureWorksLT; I click OK. The tags are empty, and at review and create the cost shows as free of charge. This matters: with the configuration I had before, the first 32 GB is free but beyond that you pay — it was showing about $395, and I can't be paying something like that for a single video. If it were something production-ready, fine, but not here. I click Create, the deployment starts, and we wait a few more seconds for it to finish so we can get cracking.
The resource deployment is complete and we can go to the resource. Let's double-check that the data was populated automatically and that we can reach it: if I open the Query editor (preview), it recognises my username, so I just type in my password — and it says it could not log in from this IP address, so I need to allow my client IP. Once that's allowed, I log in again and I'm in. If you expand the tables you can see all of them; if I run SELECT * FROM SalesLT.Address, you can see the address rows sitting there, and the same for CustomerAddress — I'll clear that, run it again, and there's CustomerAddress sitting there nicely. We have Product as well. Just to show you that the data has been populated for us, we don't need to write any scripts or generate fake or synthetic data: this is AdventureWorks, and it has sales information, customer information and product information, so it works for us.
What we want to do now is automate the process, so the data automatically gets into our bronze layer and from there to silver and then gold. How do we do that? We start with ADF: ADF will go into this database on our server, fetch the table names, loop through each of those tables, and dump all of that data into our bronze layer as the first level. Once the data lands in the bronze layer it will sit there as plain files; in the second and third layers — silver and gold — it will be more like Delta Lake files, though still stored as Parquet underneath, and the initial files in the bronze layer will also be written as Parquet. Let me show you what I mean, so it's a little clearer.
Right now we have all of these tables, and if you want to list them all you can query INFORMATION_SCHEMA.TABLES. Running that gives us the tables, with the schema name and table name for each — let me see if I can widen this a little... maybe not — so we have TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME and TABLE_TYPE for every entry. What we want to focus on are the ones in the SalesLT schema; we don't need dbo, which is more or less added automatically, and we don't need the views either — you could keep the views if you wanted, but for now we exclude them. So we filter with WHERE TABLE_SCHEMA = 'SalesLT' AND TABLE_TYPE = 'BASE TABLE'. Running that again gives us something much more fine-tuned: a clean list of table names that we can feed into a dynamic query in Azure Data Factory.
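If you want to sanity-check that query outside the portal's query editor, a small client-side sketch like this works too; the server, database and credentials below are placeholders for the ones created earlier, and it assumes pyodbc plus the Microsoft ODBC Driver 18 are installed.

```python
# Sketch: list the SalesLT base tables that the ADF Lookup will iterate over.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;"   # placeholder server name
    "DATABASE=<your-database>;"                    # placeholder database name
    "UID=<admin-login>;PWD=<password>"             # placeholder credentials
)

query = """
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'SalesLT'
  AND TABLE_TYPE = 'BASE TABLE'
"""

for row in conn.cursor().execute(query):
    print(row.TABLE_SCHEMA, row.TABLE_NAME)
```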
Let's switch over to Azure Data Factory and get cracking with that. In Data Factory we need a linked service that connects to our SQL server, and based on that we can create a dataset. The way I got here is by clicking Manage in the left sidebar and then Linked services; click the plus, go to the Azure tab, type "SQL" and you'll see Azure SQL Database. Click Continue: it's an Azure SQL Database linked service — call it whatever you like, I'll leave the default — with the AutoResolve integration runtime. From the Azure subscription I select my subscription, the server name is the medallion DB server, the database is codewithyu-medallion-db-dev, the username is codewithyu, and the password is, once again, something only I have. Once that's done, let's test the connection — and it fails. The error says the client IP for user codewithyu is not allowed, so we need a firewall rule that allows it. To fix that I click Configure on the server, where it says public network access is on selected networks; I scroll down to "Allow Azure services and resources to access this server", tick it, and click Save. Once that's done we go back to our linked service and test again — the connection is successful — and I click Create. Now we have a linked service from our Data Factory to Azure SQL. We also need access to where we'll be writing and dumping data, our Gen2 lake storage, so I create another linked service, select ADLS Gen2, continue, leave the name as it is, select the subscription and the medallion storage account, and for the file path we'll be writing into I'll just put bronze. Test the connection — successful — and create. Good.
Now we can go back into the Author section and create our first pipeline. Inside the pipeline — I'll minimise this for now — we need something that brings the data in, and for that we need a dataset. So let's create a new dataset; this will be the one reading data from our Azure SQL database into Data Factory. I search for "SQL", pick Azure SQL Database and continue. The linked service is the one we just made, and I'll call this dataset something like "tables query" — something to fetch all of the tables. We cannot select just a single table here, because what we want is all of the tables, so we leave everything else as it is and click OK. It looks a bit odd, but this is how we're going to do it, because we want to assign the table names to this dataset dynamically. So we leave the tables query dataset as it is.
Now let's go into the pipeline and bring in a Lookup, which is going to fetch all of this metadata for us. You can either search for "lookup" or go into the General section and drag a Lookup activity onto the canvas. The Lookup has some settings; for now, click the Settings tab. I don't want "First row only" — if you only needed a single record you could tick it, but I need all the records, so I untick it. We have a source dataset, which is where the data will come from, but nothing is selected as a query for our tables dataset yet, so we need a query that fetches the table list. I copied one earlier but seem to have overwritten it, so we can either go back into the database or take our chances and write it here — actually, let me quickly go back into the SQL database, because I can't recall the exact query. In the query editor I write SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'SalesLT' AND TABLE_TYPE = 'BASE TABLE', check that it works — it does — copy it, go back into Data Factory and paste it into the Lookup's query box. If we preview the data now, we get TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME and TABLE_TYPE for every table.
We can use this to get started and load the data. Once we have this list, the next question is where the data goes; the Lookup itself — I'll rename it "Fetch All Tables" — is fine. Now, for each of the records it returns, we want to loop through and dump that table into our bronze layer. So I search for "ForEach": the ForEach activity iterates through each item, and inside it we add the activities to run per item. The first thing we want inside the loop is to copy the data coming in, so under Move and transform I add a Copy data activity — you can add as much logic in here as you like, but for now we keep it simple. We need to configure the source and the sink. First, back on the ForEach settings: the Items property has to point at the Lookup's output, so I add dynamic content and set Items to the "Fetch All Tables" activity's output value — not the pipeline value, the value array itself. Now, for each item coming through as a row of that output, we want to pick it up and dump it into the bronze layer.
We have the items coming in, but we can't really preview anything yet because there's nothing downstream to work with. We could reuse the tables query dataset, but what we actually need is a separate dataset that reads a specific table based on each record from that query. So I create a new dataset — again Azure SQL — and continue; this time it represents a single SQL table rather than the table list we had earlier. I don't pick a table name here either; I click OK and then add two parameters to the dataset: the schema name and the table name. These two are what will carry each record coming out of our table query into this SQL table dataset, so we can use it to dump each table into the bronze layer. Back in the pipeline, I select this SQL table dataset as the source of the Copy activity. It now asks for the schema name and table name, and since we are not hard-coding them we supply them as dynamic content: for each item, the schema name is the item's TABLE_SCHEMA — remember those column names from the database — and the table name is the item's TABLE_NAME. Click OK for both, and now the copy source follows whatever table the loop is currently on.
So the pipeline fetches all the tables, passes them into the ForEach, the ForEach loops through and hands each one to the source, and the source feeds the sink. Now let's configure the sink — everything on the source side looks fine. The sink is where we send the data, and we don't have a destination dataset yet, so let's create one: I select ADLS Gen2 and then Parquet, since Parquet is the format we'll be writing; this will be the "parquet output" dataset. For the linked service I pick the data lake storage linked service and leave the defaults, but we still need to supply the directory and the file name, so I leave those blank for now and parameterise the dataset instead: it takes two parameters, the folder name and the file name. With access to our data lake from here, the file path is then set with dynamic content — the folder name as the directory, and the file name passed in as the file. I leave the compression as Snappy, though you could use gzip or another supported codec if you prefer. That's how the parameters are wired up, and now we can use this parquet output dataset as the sink.
Selecting the parquet output as the sink, it asks for the folder name and the file name. To configure those, we need to understand what the final output should look like. Based on the records we're copying — take Customer, for instance: if I run SELECT * FROM SalesLT.Customer you can see its rows — what we want is to group everything from one run as a single entity in the bronze layer: the date we bring the data in is the folder the data sits in, and the file name identifies each table separately. So for the folder name I use a dynamic expression that formats the current UTC date and time as year, month and day — you could use day-month-year if you like, it doesn't really matter, but pick one format for consistency, and add separators if that's what you prefer. That becomes my folder name. For the file name, we concatenate the schema name and the table name to build the output name: a concat of the item's TABLE_SCHEMA, a dot, the item's TABLE_NAME, and finally the ".parquet" extension, giving us schema_name.table_name.parquet. That way, if the same table name exists in multiple schemas, the schema name keeps each one as a separate entity.
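The pipeline itself builds these values with ADF's expression functions (formatDateTime over utcnow, and concat over the ForEach item); purely to make the naming scheme concrete, here is the equivalent logic written out in Python — the date format and the SalesLT/Customer values are just examples.

```python
# Equivalent of the sink's dynamic content, for illustration only.
from datetime import datetime, timezone

def bronze_target(table_schema: str, table_name: str) -> tuple[str, str]:
    """Return (folder_name, file_name) the way the Copy activity builds them."""
    folder_name = datetime.now(timezone.utc).strftime("%Y-%m-%d")   # e.g. 2023-12-17
    file_name = f"{table_schema}.{table_name}.parquet"              # e.g. SalesLT.Customer.parquet
    return folder_name, file_name

print(bronze_target("SalesLT", "Customer"))
# one dated folder per run, one parquet file per table inside it
```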
I'll select that, leave the copy behaviour as it is, and keep every other setting at its default. Now we can test this and see whether everything is fine or something needs adjusting before we continue. Let's validate — and it says a table is required for the copy activity. Let's see why: we have the source passed in, but in our SQL table dataset we only have the schema and the parameters sitting there, so we need to feed the table in. Instead of leaving the table selector empty, I choose "Enter manually" on the dataset and pass the parameters in as the schema name and the table name, as dynamic content. With that in place, we go back to the copy activity and validate again — excellent, everything passes — so let's debug and see whether there are any errors before continuing.
Running the debug, you can see multiple Copy data runs, and once the Lookup output comes through you can see TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME and TABLE_TYPE for each table — so we didn't have to hard-code any table names, and the copy is replicated for each of them. Beautifully, each Copy data activity completes successfully, and you can see, for example, 35 KB of data written and the row count for each table — one of them has 847 records. Let's double-check that in the storage account to confirm everything worked as expected: I duplicate the tab, go to the portal home, open the medallion storage account, select the container, and look inside bronze — and there's no data in there, so something is wrong. Even though the copy activities succeeded, nothing was written where we expected. Let's see why.
writing it to a file so let's see why so we have our copy data in here so if I select this and we
have the activity so it's fetching all of this output data and if you go into the copy data the
sync we have the table schema and the table name okay let's see why this is not working as expected
so if I refresh this I'm supposed to have data in this container ah but it's actually created but
it's inside a wrong Direct Tre if you look at what we have right now we have this here and you
can see all of them sitting pretty nicely here so let's restructure the uh where the data is sitting
in and then we can fix that so we created a new storage account for that but we don't I mean he
created a new folder for that but that's not what we want we want to put it inside the the brones
so let's go back in here and in our pile output so the file system we need to put that as uh as
our file uh our container name which is going to be the brones the current one that we have
right now that's what is missing I don't know how I missed this but it's interesting okay so
now going back in here we just have to R run this I just uh run this I need to delete this one I'll
just delete this and then delete it okay confirm it takes a few seconds for the pipeline running
to run to be done and we can continue once it is done now all the activities are done and you can
see if we go into the bronze layer right now we can see the folder of the date the current date
and all of the tables sitting there pretty nicely excellent so now at this point we've been able to
So at this point we've managed to dump the data from our SQL Server into the bronze layer. The next thing is to get this into a Databricks notebook, which involves setting up the notebook, connecting it to our ADF, and populating the schema and database in the notebook using the dynamic content we have in ADF. Let's quickly do that. Back in the portal UI, I duplicate the tab and search for "Azure Databricks". I already have a workspace, but I'm going to create a new one: I select the medallion spark resource group, name the workspace medallion-spark-databricks, keep UK South, select the Standard pricing tier, and click through networking, encryption, compliance and tags with the defaults. Then review and create; it takes a few seconds to finish.
The resource deployment is complete, so we click through to the resource. We're now in the Databricks environment; I just need to launch the workspace, which automatically logs me into the Databricks UI — the Databricks studio, if you like. The first thing I want to do is set up a place for our notebooks. I'll create a folder right here in the root directory — you could put it under a particular user or in the Shared folder, but I'll keep it in the root for simplicity — and call it "Medallion Workspace". Any notebook we create is going to sit in there, so I right-click, create a notebook, and rename it to "base notebook".
In this base notebook we'll be writing some scripts and queries, but before that we need to set up something called a secret scope — not so many people know how to do this. To create one, you append a hash fragment to the workspace URL: take your workspace URL and add "#secrets/createScope" to the end. This is where we connect our Databricks workspace to our Key Vault: we need the vault's DNS name and resource ID in this form, so Databricks can automatically read the keys and we can use them to access our medallion storage. To do that, we just need to establish the relationship between the Key Vault and Databricks.
We already have a Key Vault, the medallion spark key vault, so let's put some data in it and see how that works. I go into Secrets and generate a new one — the upload type is manual anyway — and a secret is just a name plus a secret value. The value we're going to store is the key used to access our storage account. So I go into the medallion storage account, open Access keys, click Show and copy the key — don't worry, I'll rotate this key once the video is done, but basically you grab the access key here and use it as the secret value. For the name I'll call it "storage account key". The content type is optional and I leave the rest as it is; there's no expiration date, though you can set one if you want. Now we have a storage account key secret we can reuse. Just in case you run into problems — for example you can't list the secrets or can't see the storage account key that was created — all you have to do is go back to the vault's access configuration, switch on the access policy model, go to Access policies, and add your own user to the access policy there. Do that and you should be able to see all the records; as it is, we leave it at that, with the storage account key created.
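As an aside, the same secret can be written without the portal. This is only a sketch: it assumes the azure-identity and azure-keyvault-secrets packages, that your identity has permission to set secrets on the vault, and the vault and secret names are just the ones I happen to be using in this video.

```python
# Sketch: store the storage account key in Key Vault programmatically.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://<your-key-vault-name>.vault.azure.net"  # placeholder vault URI

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())

# The value would be the key copied from the storage account's "Access keys" blade.
client.set_secret("storage-account-key", "<storage-account-access-key>")

# Read it back to confirm it was stored.
print(client.get_secret("storage-account-key").name)
```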
Now that I have my storage account key in the Key Vault, we need to feed this Key Vault's information into Databricks. How do we do that? In the Key Vault's Properties I copy the Vault URI and paste it into the Databricks create-scope form as the DNS name, then I go back, copy the Resource ID, and paste that in as well. Then I give the scope a name — it doesn't matter much what you call it, as long as you remember it — so I'll call it "databricks-scope". When I click Create it complains that the Premium tier is disabled in this workspace and a secret scope can only be created with the initial Manage Principal set a certain way — all that big grammar just means you change the Manage Principal to "All Users" — and then you can create it. The secret scope has been created, and I click OK.
Once this is done, we can go back into the base notebook to confirm it was initialised correctly. In the workspace, in the notebook, we use dbutils.fs.mount — we need to mount our storage containers for this to work. For the source I'll just copy a template so I don't make mistakes, change the container to bronze and the storage account name to the medallion account, and for the key I reference my secret scope, "databricks-scope", with the key name being the secret inside our medallion Key Vault, which is the storage account key. We can confirm whether the mount succeeded — except we haven't created a compute yet, so let's do that. To create a compute, go to Compute and create a cluster: I'm going with a single node, Databricks Runtime 14.0, and Standard_DS3_v2 as the node type, leaving the rest as it is, then Create compute. It takes a little while for the compute to be created, and we can continue once it's done.
While that's starting, let's talk through what we have in the base notebook — there are two mount cells in here right now. What I'm doing, basically, is mounting whatever we have in the bronze container of our medallion storage account onto the mount point /mnt/bronze. Once we can connect to the storage account using the key stored in the Key Vault, read through the secret scope, we can access everything under /mnt/bronze and see the contents of the medallion storage account from DBFS. The compute is still creating, so we wait a few more seconds.
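For reference, a mount cell along these lines is what I'm describing; this is one common pattern for mounting an ADLS Gen2 container with an account key pulled from a Key Vault-backed secret scope, and the storage account, scope and secret names are the ones set up earlier in this video — adjust them to your own.

```python
# Sketch: mount the bronze container onto DBFS using the secret we stored earlier.
storage_account = "medallionsa"          # placeholder storage account name
container = "bronze"

dbutils.fs.mount(
    source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
    mount_point=f"/mnt/{container}",
    extra_configs={
        # Account-key auth; the key itself never appears in the notebook.
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net":
            dbutils.secrets.get(scope="databricks-scope", key="storage-account-key")
    },
)

# Repeat with container = "silver" and "gold" to mount the other two layers.
display(dbutils.fs.ls(f"/mnt/{container}"))
```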
The compute creation is done, so we go back into the notebook and check. If I run the mount cell, any error would show up here, but there's none — it returns True, which means it connected to the storage account successfully. If I want to see what's in there, I just run dbutils.fs.ls against the bronze mount, /mnt/bronze, and you can see the 2023-12-17 directory. If I go into that directory — I'll copy the path, paste it and run again — you can see all the files: Address, CustomerAddress, Product, ProductCategory and the rest. That means our connection to the storage account is working as expected. Perfect.
Now we need to mount silver and gold as well, so I copy the cell a couple of times, change bronze to silver in one and to gold in the other, and run it again — but because bronze is already mounted I comment that one out first, so we get silver and gold mounted too. So we have bronze, and if you list the silver mount — nothing in silver yet; ah, I mistyped "silver" — you probably saw that — so I comment the others out, remount silver properly and run it again; now it's fine. The gold mount has nothing in it either. Seeing an empty array is fine: it means the connection is successful and we're good to go. I'll tidy this up and keep the mount code in a script cell once we're done, so if you want the same code you can just copy it and use it in your own project.
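One quick way to double-check what ended up mounted (and to avoid the "already mounted" error when re-running these cells) is to inspect the existing mounts first; this is just a convenience sketch.

```python
# List everything currently mounted on DBFS and note which layer mounts already exist.
existing = {m.mountPoint for m in dbutils.fs.mounts()}

for container in ("bronze", "silver", "gold"):
    if f"/mnt/{container}" in existing:
        print(f"/mnt/{container} is already mounted, skipping")
    else:
        print(f"/mnt/{container} still needs to be mounted")
```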
Once this is done, let's test one more thing before we continue, because this is where everything comes into play: I want to double-check that we can create a database. I run spark.sql with CREATE DATABASE, calling it "testing"; if we can create this database, it means we're ready to run our connections and scripts. Checking the Catalog, it succeeded and you can see the testing database, which means the catalog is fine — by the time we run our real queries we should see all of them there. So I just drop it again and we're good.
Perfect. At this point the connection between Databricks and the storage account has been established, and we've been able to run queries that create a new database in the Databricks catalog. What we want now is this: once data is dumped into the bronze layer, we use those files to create tables — and the database — on Databricks; not in the storage account as such, but in the Databricks catalog. So if we have a Customer parquet file in the bronze layer, we want to use it to create a Customer table in our database, if that makes sense. For that we need a script that automatically recognises the database: if it hasn't been created before, create it; if it has, skip creating it — and the same for the tables — so that running it multiple times only creates things once. The simple way is CREATE DATABASE IF NOT EXISTS for the database and, if you want, CREATE TABLE IF NOT EXISTS for the tables; that's basically the script that's going to do this for us. Let's quickly get into it.
I'll clear out those test cells and we'll work in the base notebook, but before that let's establish a connection between Databricks and our Data Factory — I almost said data frame; I mean Data Factory. In the pipeline I click the plus and add a Notebook activity, which needs to be configured; I'll minimise the other panels. In the notebook activity's Azure Databricks tab there's no linked service for Databricks yet, so let's create one. I leave the default name — change it if you like — keep the AutoResolve integration runtime, and instead of picking the workspace from the Azure subscription I prefer to enter the connection details manually. For the workspace URL I go to my base notebook and copy the address from the start up to the azuredatabricks.net part — I don't need the storage account keys for now — then back in Data Factory I paste that in as the workspace URL. I'll use an access token to authenticate from Data Factory to Databricks: in Databricks that's under user settings, in the Developer section, Access tokens — generate a new token, which I'll call "medallion-adf-databricks", with maybe a 30-day lifetime; it will be deleted by the time this video is released anyway, so that's fine. I copy the token, go back to Data Factory and paste it in.
Once that's done it automatically starts scanning what's available, but this next part is important, because I made this mistake earlier: if you select "New job cluster" you have to be sure you have enough compute quota — maybe you have a pay-as-you-go account and you actually want a fresh cluster spun up each time the job runs. I don't think that's the best approach here, although in some cases, when you have very heavy processing, you might want to spin up a dedicated cluster per job. Let me explain why I'm choosing otherwise: if, like me, you have a limited constraint on the amount of compute you can use, it's best to select the existing cluster already running in your workspace and run the job on that. In a production environment you might prefer a new job cluster — it spins down automatically when the job finishes, but it costs you a cluster start-up for every big workload. So I select "Existing interactive cluster", and at this point the dropdown should populate with what we have — let's see... the existing pool and cluster list should load automatically, and I don't know why it isn't.
So what I'm going to do is discard the changes and set it up again: I open the Notebook activity's linked service again, choose New, keep AutoResolve, enter the details manually, paste in the token, and for the Databricks workspace URL I copy it straight from the Databricks tab and paste it in — and now it loads. It offers the new job cluster, but that's not what we want; we want the existing interactive cluster that's currently running, which is the cluster we created earlier. This is a bit tricky — probably something that could be improved — because the first time around it didn't load the running cluster at all; I just had to close the pop-up and reopen it, and then it loaded. So if you don't see your cluster at first, close the pop-up and create the linked service again. Once you select the existing cluster and test it, the connection is successful, and I click Create. To quickly talk through what this means: once this is set up, any job we submit from ADF will be sent to this cluster using the Databricks submit API.
that's what is going to be happening there so but what we want to do basically is uh we want to have
a notebook path that is going to be triggered once to submit a job to that so I'll be selecting
this and in our root folder I have a medallion work workspace and in my Medallion workspace I
have a base notebook this is where we this is The Notebook that will be triggered once we submit
a job to this particular notebook cluster Okay so so we need to pass in some parameters today so
we know uh the parameters that are going to be accessed uh by this notebook so I'll be adding
three in here so the first one is going to be the schema name which is the table schema and
uh I had the file name uh I'll call this the file name as well as okay I need to add the table
name and I need to have the file name okay file name now for the dynamic content since this is a
parameter so I just need to we are still inside it for each Loop don't forget so we can just pass
in the table schema here okay and for the table name I passing the table name and then the file
name I just need to get the file name and this is going to be automatic Al populated using the
format uh query that we had earlier which is the UTC now you can use any other name that you want
if you if that's what you prefer are using the year year month and day okay once that is done
I just click on okay and we should be able to trigger whatever it is in this uh Bas notebook so
let's populate what will be triggered by the time we submit a job to this particular notebook now
Now, this is a script that I got online and I'll just fine-tune it so we can work with it. I'm getting the file name from dbutils.widgets — I read it from the file name widget, which is the one we just defined — and I do the same for the table name and the table schema. I'll copy this and replicate it for the table schema and the table name; I'll stick with "table schema" and "table name" as the variable names (you can use underscores if you like, it doesn't really matter, I just want something slightly different). Next we need to create the database if it doesn't exist. To do that we call spark.sql, just like we did earlier, with a CREATE DATABASE IF NOT EXISTS statement, and the name of that database is going to be the table schema. Once the database is created, we want to create the table if it doesn't already exist in that database. Again we use spark.sql and put our query in here: CREATE TABLE IF NOT EXISTS, and the identifier we build is the table schema, then a dot, then the table name — so if the schema were "medallion" we would end up with something like medallion.table_name. The table is going to be built from the Parquet files we dumped into our bronze directory, so it's USING PARQUET, and the location of those Parquet files is inside the mount point (/mnt) — don't forget we already have access to the bronze layer mount — so I pass in bronze, then the file path, which is the file name, separated by forward slashes. In fact, let me just copy this part so I don't make a mistake. So what this does is point at the bronze mount with the table schema, the table name and the file name — the year/month/day folder we formatted earlier — and it uses the Parquet dumped there to create the table. That's what it's doing, basically, and that's all we need to do.
All we have to do at this point is trigger the pipeline to see if everything is working as expected. I'll save this, go back into my Data Factory and publish it again. If everything is in place we have the base notebook with the parameters going into it, so let's run it; if there's any error we'll fix it, but for now everything looks fine. Okay, it's loading the data right now and dumping it into our bronze layer; it hasn't reached the notebook stage yet. Now it's submitting to our notebook. If you want to see how the jobs are being submitted, just go into Databricks, and in the Data Engineering section under job runs you'll see the submitted jobs. Looking at one of them, it says the function has no attribute 'sql' — right, it shouldn't be just sql, it should be spark.sql; you probably saw that, why didn't you tell me? Let's fix that: there's a line in here that should be spark.sql, I don't know how I missed it. I'll rerun, and it looks like there's another error: it says the name table_schema is not defined — I'm no longer using the lower-case variable name, so I just need to change these references to the table schema and table name variables I actually defined. Anything else, please tell me now before I run this again. Okay, I've run it again: the notebook is getting submitted, and we shouldn't have any error this time around. You can see all of them are successful; it takes a few seconds to reflect here, but all of them succeeded regardless.
So if you go into the catalog, you will see that we now have SalesLT as our database name, with address, customer address and all the other tables in there. If you check the customer table you can see the sample data populated from the Parquet that was dumped into our bronze layer, and the same for customer address, the address itself, the product and so on — everything is fine, everything is good. What we need to do now is proceed to the DBT section. At this point we have completed the ADF and Azure Databricks integration: we've been able to move data from our SQL database using ADF and land it in Databricks. What we want to do now is set up DBT for the transformations. As I explained earlier, DBT is the transformation layer; it's not really concerned with how data gets into the system or how the data is managed — the only thing it's concerned with is the manipulation of the data, so whatever data you have in your database can be transformed, manipulated and saved into other dimensions and other forms that make sense. So let's switch over to the DBT section and continue from there.
I'll be creating a new project, and I'll be using PyCharm — for some reason PyCharm is now my favourite IDE for Python — but you can use any IDE you like, Visual Studio Code or whatever you prefer. We'll stick with the Medallion naming, so I'll call the project something like medallion-spark-azure-dbt; actually, maybe there's no need for all of that, so I'll just leave it at that. I'm using Python 3.9, so if you want to follow along exactly you should use Python 3.9 as your Python environment. You can change it to something newer, maybe the latest 3.11, but to prevent any issues and errors that may come from dependency updates I'll stick to 3.9; with time all of those dependencies will be fully compatible and you'll be able to use the latest Python environment. We just wait a few seconds for the project to be instantiated, and once that is done we can get started.
Okay, so the first thing I need to do is create a new Python file — I'll just touch main.py — and let me see if I can increase the font size of the IDE to 20; that should be good for clarity. Then I need to pip install dbt-databricks; this brings in dbt and the Databricks adapter, all coined together and wired up into my system, so that's the first thing I need. Once this is done I need one more thing before continuing: I need to make sure my machine can connect to Databricks, and for that I use the package released by the Databricks team, so I'll pip install databricks-cli as well; this gives me access to Databricks from my local system. I'm just waiting for the indexing to finish, because everything slows down while it's still in progress. Now that the indexing is done, the next thing is to configure Databricks on my local system, so that my machine can effectively connect to the Databricks workspace I have in the cloud. To do that I run databricks configure and pass in the token flag. It asks for the host, which is my Databricks workspace URL — this one here, without the trailing forward slash — and the token is the one I created and pasted earlier, so I use that, paste it in and press enter. Now let's test whether I actually have access to Databricks: I run the databricks secrets list-scopes command, and if I can see the scope I created earlier on Databricks, that means I've established a connection. The next check is whether I can list the data I have, so I run databricks fs ls, and if I can see the output, my system is able to connect to Databricks. If you have any problem here, fix whatever error you have before you continue, because this is a critical point for DBT to work on your machine and to connect to Azure Databricks.
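To have those terminal steps in one place, this is roughly the sequence (the prompts will ask for your own workspace URL and personal access token):

    pip install dbt-databricks
    pip install databricks-cli
    databricks configure --token        # prompts for the workspace host and the access token
    databricks secrets list-scopes      # quick check that the CLI can reach the workspace
    databricks fs ls                    # another check: lists the DBFS root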
Now the next thing is to run dbt init. Once I press enter it asks for the name of my project and the related information. For the project name I enter Medallion DBT Spark, but it says that's not a valid project name — only letters, digits and underscores — so I use medallion_dbt_spark instead. Next, which database would you like to use? I'll be using databricks, so I select that option. Then it asks for my Databricks host; let's see what I have as the host — it's the azuredatabricks.net URL, and if there's anything wrong with it we'll fix it when the connection fails. Then the HTTP path: to get this, go into your compute, select your cluster, and under the advanced options you'll see the JDBC/ODBC tab. The server hostname shown there is actually the same host I entered, just without the https prefix — I'll fix that shortly — and the HTTP path is the value shown there, so I copy it and paste it in. For the desired access-token option I choose to use an access token, and the token is the one I copied earlier, so I paste that in. I don't need Unity Catalog, and the default schema I'll be using for this project is sales_lt. One thread is fine, I don't need more than that. The final thing I need to do is bring up my profiles.yml and fix the host I added here, because if I run dbt debug now there's an error that is going to be generated — it says the host is not valid — so let's fix that.
To fix that I need to go into my profiles.yml — it usually lives in the .dbt folder in your home directory — so I'll just drag it into PyCharm. You can see it already contains profiles from earlier projects, like the dbt BigQuery crash-course demo, plus the Medallion one we just generated. It's complaining about the catalog, and the host information is wrong, so I'll remove that, and I'll move the Medallion profile up to the top because that's the one I want to work with. I save it and run dbt debug again, and it says there are still errors with the file, so I'm just going to comment out the older profiles — they're causing parsing issues for me — and in fact I'll remove them entirely for now so I can establish the connection, then paste them back once this is done. Now the connection test passes, but it says the medallion_dbt_spark project is not found — that's because I'm in the wrong directory, so I just need to move into the medallion_dbt_spark directory and run dbt debug again, and all things being equal we should be good. Perfect, everything is fine and I'm able to connect to Databricks with this profile. I'll paste the other profiles back in and just leave them commented out.
Good. Now, going back into my medallion_dbt_spark project, I have my dbt_project.yml, and when I run dbt debug I'm able to connect successfully; you can see the catalog has been resolved as hive_metastore. At this point we need to write some SQL for the transformations, and I'm going to start with the snapshots so we capture the current state of the data we have right now. If you look at our SalesLT schema we only have the base tables, no snapshots at the moment, so we need to keep a snapshot of the current data before any transformation; that way we can replay any of it at any point in time. So I'll go back into my project and, under the snapshots folder, create a new SQL file for each table — each table is handled as its own snapshot. I'll start with address.sql, then customers.sql, and still inside the snapshots directory I'll add customer_address.sql, product.sql, product_category.sql, product_description.sql, product_model.sql, sales_order_detail.sql and sales_order_header.sql. Those are the SQL files we'll be creating, and if you look at the snapshots section you can see all of them. What I'm going to do is copy and paste the scripts I've already written and talk through what each of them does, because it takes some time to write them all out. I'll paste in the address, the customer, the customer address, the product and the product model snapshots. I don't have the product category or product description ones — I removed those — so I'll just focus on the ones I've been able to do and delete the category and description files (you can of course populate all of them yourself if you like). So right now I have address, customer, customer address, product, product model, sales order detail and sales order header.
Okay, let's talk through each of them, starting with the address — I've already pasted it in. What I'm doing is declaring a snapshot called address_snapshot; that's the name of the snapshot. In the config, the file format is Delta — like I said, our silver and gold layers are going to be Delta, but that's not the case for bronze, which is plain Parquet. We have a mount point at /mnt/silver/address, and that's the location where the snapshot data will be written on our storage account (the name just escaped me for a second). Then we have the target schema for the snapshot, invalidate_hard_deletes, the unique key — for address that's AddressID — the strategy, which is check, and the columns to check, which is all of them. The query itself just selects these columns from the SalesLT address source — the same address table we loaded — wraps that in a CTE, and runs a select star over it. That's all it does, and the same pattern applies to the customers snapshot, the customer address, the product, the product model, the sales order header and the sales order detail: they are exactly the same, it's just the column names and the snapshot names that differ.
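As a condensed sketch of what one of these snapshot files can look like — the column and source names follow the AdventureWorks SalesLT schema, and the target schema and /mnt/silver location_root are assumptions based on what was described above:

    {% snapshot address_snapshot %}

    {{
        config(
            target_schema='sales_lt',
            file_format='delta',
            location_root='/mnt/silver',
            unique_key='AddressID',
            strategy='check',
            check_cols='all',
            invalidate_hard_deletes=True
        )
    }}

    with source_data as (
        select * from {{ source('sales_lt', 'address') }}
    )

    select * from source_data

    {% endsnapshot %}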
Once these are done, I think this is a good point for us to commit what we have to our repository on GitHub. If you look at what we have, everything is inside the snapshots folder. It turns out I initialized git in the wrong directory, so I'll remove that .git folder — it should be in the root directory — run git init there and check the status again. I have all of these files, so I'll just add everything; I don't need the main.py anymore, so I remove it, then add everything and commit with the message "added snapshots to DBT".
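In the terminal that boils down to something like:

    git init
    git add .
    git commit -m "added snapshots to DBT"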
Now that that's done, we can run this and see if everything works as expected: just run dbt snapshot, which executes each of the snapshot queries. It fails, saying the snapshot depends on source sales_lt.customer, which was not found. So we need to fix the sources, and then we should be good. I'm actually missing one other important thing here: I only pasted the snapshot code, but I haven't added the sources file, which is where we register all the tables and the schema that live on our Azure Databricks workspace and that we want to model in DBT. To fix that we need a YAML file that declares all of these tables as well as the schema. In the models folder I'll create a new folder — we can call it staging, or bronze if you prefer — and inside it the YAML file itself, which I'll call bronze.yml. In there we have version: 2, since it's a version-2 schema file, and then our sources; let me just copy and paste that in to save time. Basically the source is the SalesLT schema — the AdventureWorks tables loaded into bronze, or whatever description you want to give it — and the tables are address, customer, customer address, product, product category, product description, product model, sales order detail and sales order header.
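A trimmed version of that models/staging/bronze.yml looks something like this — the exact table name casing in your Databricks catalog may differ, so treat the names as assumptions:

    version: 2

    sources:
      - name: sales_lt
        schema: sales_lt
        description: "AdventureWorks SalesLT tables loaded into the bronze layer"
        tables:
          - name: address
          - name: customer
          - name: customeraddress
          - name: product
          - name: productcategory
          - name: productdescription
          - name: productmodel
          - name: salesorderdetail
          - name: salesorderheader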
want to check if everything is fine debuted the BG okay so everything is fine let's check if we
can do DBT snapshot again hopefully we should be good yeah you can see it's unable to pass unable
to do partial passing because saved manifest not found so it's doing a full pass and running
everything from the beginning and you can see each of the snapshots are being run at this point
so we have the address snapshots being run first so all of the seven snapshots are going
to be running and uh we in the three of seven okay perfect it's completed successfully
so if you look at our sales LT right now and you refresh you will see that we have uh we
have our snapshots created and these are the seven tables and if you look at each of
them they have a snapshot with addition of DBT updated DBT xci updated at Value from
and value two right these are very these are important keys for us to know the changes
that are happening and what time each of these record is invalidated so for instance
if you change or remove or deleted whatever record that is here the DBT value to will be
populated and you will see the last record valid date uh before it was changed okay and
that's that's one of the good thing about DBT it keeps track of the history of each of the
data and this is our data is pulling data from P okay so that means uh at this point our snapshot
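For example, once the snapshots exist you could pull only the currently valid version of each customer with a query like this (the snapshot table name is an assumption based on the naming used here):

    select *
    from sales_lt.customer_snapshot
    where dbt_valid_to is null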
So at this point our snapshots are fine, and if you go into the storage account, into the containers, and check the silver section, you can see all of them sitting there nicely: the address snapshot is here, and if you look at the customer you'll see the customer snapshot sitting there as well. Each of these can be used to replay the data — if anything changes, or if for some reason we lose data in our production environment, which here is the gold or silver layer, we can replay what we have in bronze into silver and then into gold, and everything will be back in place. Now, if you have transformations to do, the models folder is where the major transformation work happens. I'm going to delete the example folder that dbt generated, since I don't need it, and in the models folder I'll add a marts folder. The marts are the streamlined versions that go into gold: the snapshots we just built go into the silver layer, and you can do transformations there too — rename columns, merge, whatever you want — but those land in silver; the models that go into gold, which is the final stage of our transformation layer, go into the marts section. Under marts we'll have customer, products and sales, so I'll just mkdir models/marts with those three folders (one of them already exists). Then we create the files: sales.yml, customer.yml and products.yml, as well as the SQL version of each of them, so products.sql, customer.sql and sales.sql. If you look at each pair, the SQL file is where the transformations for the data going into gold will live, while the structure and the tests go into the YAML file — just like the example files we already deleted — so the YAML holds the structure of the table and, optionally, tests for each of the columns going into it, while the SQL holds the transformations that are going to happen.
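To give a feel for what goes into one of those structure files, a minimal sketch of customer.yml might look like this — the column name and tests are illustrative assumptions, not the exact ones used here:

    version: 2

    models:
      - name: dim_customer
        description: "Customer dimension built from the customer, customer address and address snapshots"
        columns:
          - name: CustomerID
            tests:
              - not_null
              - unique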
So let's start with the customer. I already have this written down, so we can quickly talk through it. I'll paste it in, but I'll change the name from customer to dim_customer, because it's a dimension — I'll do a refactor-rename on the model and the file as well. You can name your fact and dimension tables however you like; I'm using the dim prefix here, but use whatever convention works for you. I'll leave the sales model as it is so we can work through each of them. What I'm doing in dim_customer is this: the config says it's materialized as a table — meaning it's written out as a table, not a view — the file format is Delta, and the location is the gold customers mount; that's where it's going to be written. Then come the CTEs I've written: one pulling from the address snapshot, one from the customer snapshot, and one from the customer address snapshot. At the end of the day the transformation that happens is basically a select statement joining the customer snapshot, the customer address snapshot and the address snapshot, and picking out the columns, but you can adjust this and add as many transformations of your own as you like.
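A simplified sketch of what dim_customer.sql boils down to — the join keys, the selected columns and the exact snapshot names are assumptions based on the SalesLT schema, and /mnt/gold/customers mirrors the gold mount described above:

    {{
        config(
            materialized='table',
            file_format='delta',
            location_root='/mnt/gold/customers'
        )
    }}

    with customer as (
        select * from {{ ref('customer_snapshot') }}
        where dbt_valid_to is null     -- only the current version of each record
    ),

    customer_address as (
        select * from {{ ref('customer_address_snapshot') }}
        where dbt_valid_to is null
    ),

    address as (
        select * from {{ ref('address_snapshot') }}
        where dbt_valid_to is null
    )

    select
        c.CustomerID,
        c.FirstName,
        c.LastName,
        c.EmailAddress,
        a.City,
        a.StateProvince,
        a.CountryRegion
    from customer c
    join customer_address ca on c.CustomerID = ca.CustomerID
    join address a on ca.AddressID = a.AddressID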
I'll paste in the structure and the tests I have for the customer, and for the product I'll copy in the SQL and paste the structure as well; likewise I'll paste the SQL for sales and the sales structure. To quickly talk through these: the product model is also materialized as a table, with the Delta file format and its location pointing at the product folder in gold. It selects from the product snapshot we created earlier, and the product model snapshot is selected from there as well. If you notice something here, we are looking at dbt_valid_to: while a record is still current it is null, because nothing has changed yet. But if any of the columns of a record have been modified — updated, removed, deleted, whatever it is — dbt_valid_to gets set and a new row is added, so we select only the latest records, the ones without a dbt_valid_to value, since that column is the timestamp of the point at which the record was changed. That's important. At the end of the day we have our transformed CTE and we select from it, and the same goes for the structure file, which describes dim_product. The sales model works the same way: we select the sales order snapshots, join in the product and the order header, and in the final transformation step select all of the columns as the transformed output. That's basically what is happening there. All we need to do now is run this, and we should see everything sitting in our gold section, ready to be queried. So let's do that and check. In the terminal, a quick pwd shows we're in the medallion_dbt_spark directory, so I just need to do a dbt run — but before that I want to run dbt test to see whether all the tests I've written in these yml files work as expected. It says "configuration paths exist in your dbt_project.yml that do not apply to any resources", referring to the generated example model, so let's remove that. If you look at the dbt_project.yml, we still have the example entry under models, so I'll just remove it. You can list each of your own folders there if you want them to materialize as a table or a view; in case you're wondering how to change or modify that, I have a crash course — check the link in the description or the card on this video — where I show how I converted tables to views and views to tables and how to work with multiple nested folders, but that's beyond the scope of this video, so let's leave it as it is. We remove the example config, run dbt test again, and we should be good.
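Concretely, the generated models section of dbt_project.yml looks roughly like this, and the fix is simply deleting the example block (or replacing it with entries for your own folders if you want to force a materialization per folder):

    models:
      medallion_dbt_spark:
        # delete this generated block, or replace it with your own folders, e.g.
        # marts:
        #   +materialized: table
        example:
          +materialized: view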
The test complains because we haven't run the models yet, so let's do a dbt run first so all of these get registered in the system and we can test them. You can see it tried to run the tests but there was no dim_sales in the package yet; now it's building each of the models as a table, and once that is done we can run the tests properly — the tests do run in the background as part of the build, but it's still worth doing that separately. So now we run dbt test, and it says it found three models and seven snapshots and there's nothing to do, check your configs and model specifications, so I think everything is okay. Let's check the UI: if you go to the catalog, refresh, and open the SalesLT section, you'll see we have dim_customer here, and this is the final shape of the customer dimension. The same goes for the product — you can see columns like the sell end date, the discontinued date and so on — and the last one is the sales model, which is still crunching its sample data and then finishes. You can see what it looks like, the records, the column information, the types and all of that. And if you go into the gold section of the storage account — just go one step up into the containers and open the gold container — you have the customer, product and sales folders; inside customer you have dim_customer, and any further transformations can be handled through DBT at the product and sales level here as well. They're all there. Finally, the last thing we want to do is generate the documentation so we can view what the structure of the system looks like. We just run dbt docs generate, which generates the documentation for all the transformations that have been done so far and writes it to catalog.json in the target directory, and then we run dbt docs serve to view it in the browser.
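Those last two commands are simply:

    dbt docs generate    # compiles the docs and writes manifest.json / catalog.json into target/
    dbt docs serve       # serves the generated documentation site locally in the browser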
I'll drag the browser window over so you can see how it looks. So this is what the documentation looks like: we have the sources here, we have the medallion_dbt_spark project, and under it the models — the marts, the customer folder, and dim_customer — along with its code. The description says "not documented" because I didn't add much documentation there, and then you have the columns, the dependencies — which are these three snapshots — and the full code. The same goes for the product, where you'll see the details and the code, and for the sales. You can also see the snapshots, what references what, and how each of them is referenced: for the customer, dim_customer references the customer snapshot and has a node for it, and for the sales you can see its sources. For each of them you can see the compiled SQL that gets selected — this one selects data from our hive_metastore on Azure Databricks — and the same with the address: you can see its snapshot and the query that is used there. If you look at it at the database level, it shows SalesLT and the snapshots, which is exactly what we have in our catalog, with the exception of the product model and product description tables that weren't added. That's how it looks; I don't have any grouped models here, but that's how you can see it at the project level as well as the database level. If you want to see the lineage graphs: the one for product category is simple, but a slightly more interesting one is dim_customer — it has a graph going from the SalesLT sources into the address snapshot, the customer into the customer snapshot, and everything flowing into the customer dimension. The same way, if you view the sales in full screen, we have the sales order detail going into the snapshots, then the product and the order header, everything flowing into the sales model. That's how the graphs look. The product graph isn't particularly interesting — nothing too complex — but that's how it looks anyway. So that's basically it. If you have any questions, comments or feedback, drop them in the comment section. That caps it all; that's where we're stopping for our DBT work. But before you go, if you enjoyed this video and learned something new, please give it a like, subscribe and turn on the notification bell so you don't miss out on the latest content. If you're new here, don't forget to subscribe before leaving, and you can check out the videos listed on the screen for more content like this. Thank you very much for watching, and I'll see you in the next one.