Transcript for:
Notes: Spark, DBT, and Cloud Provider Project

hello and welcome back to the channel. In today's video we're going through a project based on Spark, dbt, and a cloud provider, as decided by the community. You can see how the community voted: last week we had 108 votes, of which 43 voted that we go for Spark, dbt, and any cloud platform; as the runner-up we have a data warehousing project with best practices at 38%, then Snowflake at 16% and an Apache Flink streaming project at 4%. So we're going with the winner of the poll: the Spark, dbt, and cloud platform project.

Let's go through the system architecture. We'll be adopting the medallion architecture for this project, which gives us three layers of transformation: bronze, silver, and gold. In the bronze layer, the data coming into the system is left as is, untouched. In the silver layer, a level of transformation has taken place, and some of the data may have been changed, transformed, or even removed. The gold layer is where the data is ready to be consumed by the different business intelligence units and users, for analytics, and even by consumers like data scientists who want to build data science and machine learning models. For this project we'll be using Azure Data Factory, Azure Databricks, dbt, and ADLS Gen2 services, so the project sits mainly on the Azure cloud, and we'll try as much as possible to keep it that way.

If you're new here, don't forget to like and subscribe to the channel, and turn on the notification bell so you're the first to know when a new video is released.

Let's get into the project and get started. If you do not have an account, I would really recommend you search Google for "Azure free account". The free account gives you $200 of Azure credit and access to 55 services for free, usually for 12 months, so if that's something you want to consider, I'd suggest you start free right now. I'll be using my brother's account; it's a free account he opened a couple of days ago.

To get started I'm going to create a resource group. You can search for "resource groups" or select it from the list of Azure services on the UI. I'll click on Resource groups (I already have a couple here) and click Create to make a new one. The subscription on this account is Azure for Students; if you're using the free account you'll see an Azure free subscription here instead. I'll call the new resource group the medallion spark dbt resource group. The region defaults to East US, so I'll select something closer to me, around UK South, or anywhere in Europe; you can choose whichever region is closest to you. And if you run into a resource constraint and cannot create any of the services,
I really recommend you change the region for the resource group, or for whatever resource you're going to use; you can pick any resource group and region that works for you. I'll be using UK South. Next is the tags section; I don't have any tags, so it's straight to review and create. Once validation has passed you can download the template if you want to automate this later with something like Terraform, but I don't need that right now because it's out of scope for this video, so I'll just click Create. Now you can see the medallion spark dbt resource group is available for us.

The next thing I want to do is create an ADLS Gen2 storage account (Azure Data Lake Storage Gen2) for storing our data. I'll search for "Gen2", which brings up storage accounts, or you can search for "storage accounts" directly. That shows the ones created previously, so I'll create a new one and select the medallion spark dbt resource group I created earlier. For the storage account name I'm going with "medallion", since I'm using the medallion architecture, which, like I said, is simply the bronze, silver, and gold layers: bronze is usually for raw data; silver is for fine-tuned data that can already be consumed by some analysts and data scientists if they want; and gold is where the refined data lives in a presentable format, whether that's data marts or multiple services consuming from it, but in any case very refined data at that point. A storage account name can only contain letters and numbers, so something like "medallionsa" will do as the account name.

The region is UK South, with Standard performance and the default geo-redundancy. Then I enable the hierarchical namespace, which complements the ADLS Gen2 endpoints by enabling file and directory semantics; in plain English, you get files and directories structured hierarchically, just like the folders and files under My Documents on your own machine. I leave the rest as it is, then under networking I enable public access from all networks, leave data protection as it is, click next, leave encryption at its defaults, wait for validation to pass, and click Create.

The storage account is now being created, and this is where we'll keep the different layers, whether it's the bronze layer, the silver layer, or the gold layer. I'll wait a couple of seconds for the deployment to finish and then switch over. Okay, the deployment is done, so I can go to the resource.
In the resource I have access to everything on the left and some keys and shortcuts on the right. On the left, under data storage, I'll click on Containers. There are no containers yet, so I'll create them, starting with bronze, then silver, and finally gold. These are the three layers we'll be working with: whatever data comes into the system lands in the bronze layer, from there we process it into the silver layer, and finally into gold.

Once that is done, the next thing is to create a Data Factory. I'll search for "Data Factory" and click on Data factories. I have one already, but I'll create a new one: select the medallion spark dbt resource group and call it medallion ADF. Next, the Git configuration stays at its default (we're not going to use it, and we could configure it later), networking stays on a public endpoint, nothing under advanced, no tags, then review and create, and that's it. We wait a few seconds for the Data Factory to be created as well. The deployment has succeeded, so I click "Go to resource" and we can see what we have in our Data Factory. I'm not going to launch the Data Factory Studio just yet.
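As a quick aside before we move on: the three medallion containers we just clicked together in the portal could also be created from Python with the Data Lake SDK. This is only a hedged sketch, not what the video does; the account URL and key are placeholders for whatever you created above.

```python
# Sketch: create the bronze/silver/gold containers programmatically (placeholder account name and key).
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<storage-account-key>",
)

for layer in ("bronze", "silver", "gold"):
    service.create_file_system(file_system=layer)  # a "file system" is a container on ADLS Gen2
```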
There's one other thing I need, and it's called Key Vault, so I search for "key vault". The Key Vault is where I'll store all my keys and secrets so I don't have to copy them around every time; I can just put the values into secrets and reuse them in Databricks by the time we get to that section, so we don't have to re-copy and expose keys, which matters even more in a production environment. I select Key Vault; I already have one, but I'll create a new one: select the medallion spark dbt resource group, call the vault something like medallion spark KV, set the region to UK South, and leave the pricing tier at Standard.

Next is the access configuration. I could use RBAC (role-based access control), but I'll leave it on the vault access policy model, where you select individual users; I think that's best here because you can fine-grain the policy you want for each particular user or principal it gets assigned to. I already created a user, but if you want to create a new one you can use this button, optionally start from a template, and pick the permissions you're going to attach to that user. The principal is the application that is going to access this particular Key Vault; for instance, the Databricks resource provider can be added to this list, and if the Gen2 data lake storage needs to access the Key Vault for some reason you can add that too, along with any other service or application. Then just review and create; that's how I created this user. If you click on the user and click edit you can see the properties and permissions it was given: I don't need any of the cryptographic operations, privileged key operations, or privileged secret operations; I only need the key, secret, and certificate permissions plus the rotation policy operations. Those are the four groups I attached to that particular user, and if you click review and create you can see all of them, but I won't redo that right now.

Once that is done I just click next. Let's see if I'm missing anything: resource access (virtual machines, Azure Resource Manager for template deployment), no; disk encryption, no, it's fine, just click next. Networking: allow access from all networks, I'll leave it at that. No tags, then review and create, and once it's validated we click Create. It takes a few seconds for the deployment to finish as well, and once it's done we can continue.

The deployment is done, so let's switch over to our Data Factory, because that's where the major work is. I select the medallion ADF and click Launch Studio, which opens Azure Data Factory for me. Starting on the left: Home is the landing page; Author is where we'll create our pipelines, datasets, and the rest; Monitor is where we can monitor the pipelines, the triggers, change data capture, and so on; and then there's Manage. Right now we're focusing on two of them, the Author section and the Manage section. We'll start with Manage, because what we want is access to our Gen2 lake service as well as a database, which we'll be creating shortly; we don't have a database or any data yet, so let's sort that out before we come back to Data Factory and get cracking.

So one last thing in here: search for "SQL databases"; this is Azure SQL. Mind you, this uses SQL Server syntax, so if you have a background in Postgres or another database scripting language like MySQL or MariaDB, some of the syntax might need fine-tuning, but right now we're using the SQL Server dialect. I'll create a new database: select the medallion spark dbt resource group, then the database name. Hold on, this is very expensive, the estimate is around $396, so what I'm going to do first is remove the database I currently have on the system. I'll just delete it; it takes a few seconds to complete, and once it's done we can continue. Now, as usual, a database resides on a particular server, so I also search for "SQL servers". I currently have one server sitting there nicely, so I could delete it and we can create a new one,
or let me see if there's anything we can adjust on it. I basically created that server on the fly, so I think I'll just delete it: delete, skip the feedback, confirm. We're going to create a new one eventually and there's nothing important on it, so I just delete it, close this, go back to SQL servers, and refresh. Our SQL server is deleted, along with the database itself.

So let's go back into SQL databases and create a new SQL database. Right now I have access to an offer of 32 GB of data and 32 GB of backup storage for free, so I'll apply that offer so I don't pay so much. I select the medallion spark dbt resource group, and for the database name I'll say medallion DB, or rather medallion DB dev, because this is semi-production already. We don't have a server right now, so we create a new one; the server name is going to be something like "code with you medallion DB", and the location is UK South (where is it... UK South, show yourself... there it is). For authentication I'll use SQL authentication, with "code with you" as the server admin login and a password that I create and am not going to share with you; use your own password as well, and don't share it with me either, thank you very much. The passwords match, I click OK, and that creates a server for us on the fly; I know the username, and the password is something I'm not going to be telling you.

For the behaviour when the free offer limit is reached, I want it to auto-pause the database until the next month, and that's fine. I click on networking: for the connectivity method there's going to be a public endpoint, because I need to be able to access this myself; not everybody, just me, and of course I'll be deleting all of this after the video so nobody else can reach this database. Firewall rules: no need for those, nor for changing the connection policy, so I click next. For security I don't need Defender for SQL, no ledger, no server identity, and the rest is left at the defaults.

Additional settings: this is very important. At the data source level I want the database to be populated with AdventureWorks data, which is a sample database. You need to select the sample database for AdventureWorks to be added to your system; if you don't, you'd have to download AdventureWorks yourself and populate the database with it, which I think is a stressful way of doing it. It's best to just use the sample: it gives you the latest AdventureWorks in the system with all the data sitting there nicely once it's done. So I click Sample and it gives us AdventureWorksLT; I click OK. Next, no tags, and on review and create there's no cost, it's free of charge. This is important: with the offer selected, just like I had it before, your first 32 GB is free, but beyond that you have
to start paying; a moment ago it was showing $395, and I can't be paying something like that for a single video. If this were something production-ready, then fine, but for now it's okay, so I just click Create.

The deployment is in progress; we wait a few more seconds for it to finish and then we can get cracking. The resource deployment is complete, and we can go to the resource. Let's double-check that the data was automatically populated and that we have access to it. If I open the query editor (preview), it recognises my username, so I type in my password... and it says I could not log in from this IP address, so I just need to allow it; once the IP is allowed I click OK and log in again. Now I have access, and if I expand the tables dropdown I can see all of the tables. If I run a SELECT * against the Address table, the addresses are sitting there nicely; the same goes for CustomerAddress, and we have the Product table as well. That's just to show you the data has been populated for us, so we don't need to write any scripts to generate fake or synthetic data. This is AdventureWorks, and it has sales information, customer information, and product information.
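Those sanity checks are nothing more than plain SELECTs against the SalesLT sample schema, along these lines:

```sql
-- Quick sanity checks in the query editor against the AdventureWorksLT sample
SELECT * FROM SalesLT.Address;
SELECT * FROM SalesLT.CustomerAddress;
SELECT * FROM SalesLT.Product;
```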
So this works for us, but what I want to do is automate the process, so the data automatically gets into our bronze layer, and from there to silver, then gold. How do we do this? We start with ADF: Data Factory is going to go into this database on our server and fetch the table names, each of those tables is going to be looped through, and then we dump all of that data into our bronze container as the first level. Once the data lands in the bronze layer we'll have it as plain Parquet files; in the second and third layers, silver and gold, we'll store the data as Delta Lake files, but initially everything also lands as Parquet, which will be our initial files in the bronze layer. Let me show you what I mean so it's a little bit clearer.

Right now we have all of these tables in the database, and if you want to list all of them you can query INFORMATION_SCHEMA.TABLES. These are the tables I have: for each entry you get the table catalog, the table schema, the table name, and the table type. What we want to focus on are the tables in the SalesLT schema; we don't need dbo, which is more or less added automatically, and we don't need the views either. If you want you can keep the views, but for now we exclude them by adding WHERE the table schema equals 'SalesLT' and the table type equals 'BASE TABLE'. If we run this again we get something much more fine-tuned, and we can use it to get all the table names and, based on that, feed a dynamic query in our Azure Data Factory.
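Putting that together, the metadata query we'll reuse in the Lookup activity is simply:

```sql
-- List the user tables ADF should loop over
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'SalesLT'
  AND TABLE_TYPE = 'BASE TABLE';
```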
All right, let's switch over to Azure Data Factory and get cracking with that. In our Data Factory we need to create a linked service that connects to our SQL server, and based on that we can create a dataset. The way I got here is by clicking Manage on the left sidebar and then Linked services; click the plus, go to the Azure tab, and type in SQL, and you have Azure SQL Database. Click continue; it's an Azure SQL Database linked service, and whatever name it suggests I'll leave as is, along with the AutoResolveIntegrationRuntime. From the Azure subscription I select the subscription, the server name is the medallion DB server, the database name is the "code with you medallion DB dev" database, the username is "code with you", and the password is the one only I know.

Once that is done, let's test the connection. The connection fails; the error says the client IP for the "code with you" user is not allowed, so we need a firewall rule that allows it. Let's fix that: I click configure on the server, and in there, where public network access is set to selected networks, I just scroll down, tick "Allow Azure services and resources to access this server", and click save. Once that is done we can go back to our medallion linked service and test again; the connection is now successful, and I click create. We now have linked access from our Data Factory to Azure SQL.

We also need access to where we're going to be writing and dumping data on our Gen2 lake service. So back in linked services we select ADLS Gen2, continue, leave the name as it is, select the subscription, set the storage account name to the medallion storage account, and for the file path we'll be storing data into I'll just use bronze. We test the connection, it's successful, and we create it. Good.

Now we can go back into the Author section and create our first pipeline. Let's create a new pipeline; I'll just minimise this panel for now. In our pipeline we're going to have something that brings the data up for us, and for that we need a dataset: this dataset is the one that reads the data from our Azure SQL database into Data Factory. So I create a new dataset, type SQL, choose Azure SQL Database, and continue. The linked service is the one we just made, and I'll call the dataset "tables query" or something like that, something to fetch all of the tables. We can't select just a single table here, because what we want is all of the tables, so for now leave everything as it is and click OK.

Now this seems funny, but this is how we're going to do it, because we want to dynamically assign the table names to this dataset, so we leave the tables query dataset as it is. Let's go into our pipeline and bring in a Lookup activity that is going to fetch all of this data for us. You can either search for "lookup" or go to the General section and drag Lookup onto the canvas. The Lookup has some settings, so for now click the Settings tab. I don't need "first row only"; I need multiple records, so if you only needed a single record you could select first row only, but I need the entire thing. We have a dataset, which is where the data is going to come from, but right now nothing is selected for our tables query, so what we need is a query that fetches the data for us. I copied it earlier, but for some reason I've overwritten it, so we can either go back into our database or take our chances and write it here. Actually, I'll quickly go back into the SQL database so I can recall the exact query and names: SELECT * FROM INFORMATION_SCHEMA.TABLES, trimmed down so that the table schema equals 'SalesLT' and the table type equals 'BASE TABLE'. Let's see if this works; I think it does, so I copy it, go into our Data Factory, and paste it in as the query. If you preview the data now, we have all of these columns: table catalog, table schema, table name, and table type, so we can use this to get started and load data from our dataset.

Once we have this, the next thing is where the data is going to go. The Lookup, which I have as "tables query", is fine; now, for each of the records being returned by the Lookup from that base table query, we want to loop through them and, for each of them, dump the table into our bronze layer. So we search for "ForEach": ForEach is going to iterate through each of the items, and I drag it in. For each of those iterations, the first thing we want to do is copy the data coming into the system, so under Move and transform I grab Copy data. You can add as much logic in here as possible, but right now keep it simple: we have our Copy data activity, and we just need to configure the source, which is going to be the tables query records coming in. Back in the pipeline, if you click the ForEach you have its Settings, and the items are going to come from the Lookup; before we configure what is inside the ForEach we need to set that up as dynamic content. Let's see what we have; I'll just minimise this for now. We have the Lookup, and I'll change its name to
"fetch all tables", so that will be the name. Once we get into the ForEach we have the settings, and the items are going to be the "fetch all tables" output value; this is not the pipeline return value but the value array itself, and that's what we use. Now, for each of the items coming in from that output value, we want to pick it up and dump it into the bronze layer.

So right now we have this coming in, but if you try to preview the data you can't, because nothing downstream is wired up yet. To be able to work with this we could use the tables query as well, but what we need is a separate dataset that is going to be fed from that tables query. So I create a new dataset, still the same SQL flavour, select Azure SQL, and continue, and this time I'm creating it as my "SQL table" dataset, instead of the table-list dataset we had earlier; there's no need to pick a table, and I click OK. This SQL table dataset is the one we're going to feed from when we're copying, but in here we need to add parameters to specify the schema as well as the table name, so I add a schema name parameter and a table name parameter. These two are what will carry each of the records coming in from our tables query into the SQL table dataset, so we can use it to dump each table into our bronze layer.

Coming back to the pipeline, we're good, so I select the Copy activity and, instead of a fixed table, point its source at the SQL table dataset, which exposes the schema name and table name. Since we're not hard-coding the schema name and the table name, we add them as dynamic content: for each item, if you can recall, the records from our database have a table_schema and a table_name, so I select the item's table_schema for the schema name and click OK, and then do the same for the table name with the item's table_name. Now we have access to the tables coming in from our pipeline: once we fetch all of them we pass them into the ForEach, the ForEach loops through each of these items and sends them into our source, and the source transfers them to the sink.

So now let's configure the sink. Everything looks fine here, so I select the sink. The sink is where we're going to be sending data to, and we don't have a dataset for that yet, so let's create one: I select the Gen2 lake storage, and Parquet as the file format we'll be writing, so this will be the Parquet output dataset. For the linked service I select the data lake storage linked service and leave the rest as it is, but we need to pass in the directory as well as the file name, so we parameterise this one too: it's going to take a folder name parameter and a file name parameter. Those are the two things this dataset requires as parameters, and we have access to our data lake from here.
Now, in the file path we pass those in as dynamic content: the folder name parameter is our directory, which is fine, and (I'll just minimise this) the file name parameter carries the file name coming into the system. We leave the compression as Snappy, though if you want you can use gzip or LZO, and that's how we configure the parameters. Now we can use this Parquet output dataset as the output section in the sink: I select the Parquet output, and it asks me for the folder name as well as the file name.

For us to configure those, we need to understand what the final output is going to look like. Based on what we have in each of these records, take the Customer table for instance: if I run a SELECT * against it, you can see we have a pile of records, but what we want to do is group all of the tables from a run as a single entity in our bronze layer. The date we bring the data in is going to be the folder the data sits in, and the file name is going to be separate per table. So for the folder name I'll use formatDateTime over the current UTC date and time (utcnow), formatted as year, month, and day; if you like you can use day, month, and year, it doesn't really matter, but for consistency's sake I'll use year, month, day, and you can add separators if that's what you prefer. I select that, and that will be my folder name. For the file name we need to concatenate the schema name and the table name to create the output name, so inside a concat function I take the item's table_schema, join it with a dot, join that with the item's table_name, and finally join on ".parquet", because that's the extension we're storing. So the file name ends up as schema_name.table_name.parquet, and if we have the same table name in multiple schemas, the schema name differentiates each of them as a separate entity.
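For reference, the dynamic content wired into the pipeline so far comes out roughly like this. The activity, parameter, and column names follow the video's naming, and the exact date format string is an assumption (the dated folder seen later looks like 20231217; add separators if you prefer):

```
ForEach > Settings > Items:
    @activity('fetch all tables').output.value

Copy data > Source (SQL table dataset parameters):
    schema_name : @item().TABLE_SCHEMA
    table_name  : @item().TABLE_NAME

Copy data > Sink (Parquet output dataset parameters):
    folder_name : @formatDateTime(utcnow(), 'yyyyMMdd')
    file_name   : @concat(item().TABLE_SCHEMA, '.', item().TABLE_NAME, '.parquet')
```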
I select that, and once that's done I leave the copy behaviour as it is, and every other thing is left as it is too. Right now we can test this out and see if everything is fine or whether we need to adjust anything before we continue. So let's validate: it says "table is required for copy activity" on the Copy data activity. Let's see why a table is required: we have the source, which is passed in here, but in our SQL table dataset we only have the schema and the parameters sitting there, so I think we need to feed in the table. Instead of selecting a fixed table, I choose "Enter manually" and pass in the schema name and the table name as dynamic content from the dataset parameters. I select that and this should be good, so we go back to our Copy data activity and validate again. Excellent, everything is fine now, so let's debug and see if there are any errors before continuing.

Okay, you can see we have multiple Copy data runs in here, and once the data comes in you can see the output: the table catalog, table schema, table name, and table type, with each of the tables there, so we didn't have to hard-code any of them. The copy is replicated multiple times, and beautifully, each Copy data run completed successfully; you can see the data written (35 KB for one of them) and the number of rows, each with its own record count, for example 847 records for one table. So let's double-check that in our storage account to see if everything is working as expected. I'll duplicate the tab (I'm in ADF Studio here), go to the home page, select the storage account, go into the medallion storage account, select a container, and look inside our bronze container to see if we have data in there.

There's no data in here, so something is wrong: even though our Copy data runs were successful, the output isn't being written where we expect. Let's see why. We have our Copy data activity here, it's fetching all of the output data, and if you go into the Copy data sink we have the table schema and the table name. If I refresh, I'm supposed to have data in this container... ah, the data was actually created, but it's inside the wrong directory: if you look at what we have right now, there's a new path and you can see all of the files sitting there nicely. So let's restructure where the data sits and fix that. It created a new folder, but that's not what we want; we want to put it inside bronze. So let's go back, and in our Parquet output dataset the file system needs to be set to our container
name, which is going to be bronze, the one we already have. That's what was missing; I don't know how I missed it, but it's interesting. So going back in, we just have to rerun this. I also need to delete the wrongly created folder, so I delete that and confirm. It takes a few seconds for the pipeline run to finish, and we can continue once it's done. Now all the activities are done, and if we go into the bronze layer we can see the folder for the current date with all of the tables sitting there nicely. Excellent.

At this point we've been able to dump the data we have in our SQL server into our bronze layer. The next thing we want to do is get this into our Databricks notebook, which involves setting up the notebooks, connecting them to our ADF, and populating the schema and the database in our notebook with the dynamic content we have in ADF. So let's quickly do that. I'll duplicate the tab and search for Azure Databricks. I already have a workspace, but what I'm going to do is create a new one: select the medallion spark dbt resource group, set the workspace name to medallion spark databricks, still UK South, select the Standard tier, and leave networking, encryption, compliance, and tags as they are. Review and create, wait a few seconds, and create.

The resource deployment is complete, and we can click through to the resource. Now we're in the Databricks environment; I just need to launch the workspace, which automatically logs me into the Databricks workspace (the Databricks studio, if you like), and we have access to it. Good. The first thing I want to do is create a place to work. I could use the existing workspace folders, but I prefer to create a folder right here, so I create one, call it "medallion workspace", and click create. If you like you can put this under a particular user or into the Shared folder; I'll just leave it in the root directory for simplicity's sake. Any notebook we create is going to sit in here, so I right-click, create a notebook, and change its name to "base notebook".

With this base notebook we're going to be writing some scripts and queries, but before we do that we need to set up something called secrets. Not so many people know how to do this, but to create a secret scope you just need to append a hash and a path to the workspace URL;
Medallion  information in here so to do that we just need to   establish a relationship between our key Vault  and our data breaks so we have a key Vault that   has been created already which is currently  sitting here if you go to the key volt you   have our Medallion Spar key volt so let's put data  into our key volt and let's see how that works so   I can come into the secrets in here and uh I'll  create a new one generate a new one so the man   it's going to be manual anyways now the secret is  going to be the name and the secret value so the   secret value that we're going to be putting inside  here the key that we use to that will be used to   access our um our Bruns our storage account in  short okay so basically I go into this guy and uh   in our storage account Medallion storage account  here I just need to come into the access keys okay   and uh I'll click on show and I can copy this now  don't worry I need I'll just rotate this key once   it is done but basically you come into the access  key here and get the access key and that is what   you you use to create a secret as a secret value  here and you can call the name let's say storage   account storage account key okay and the content  type is going to be optional you you leave the   rest as it is there's no expiration date if  you want you can set it but I'll just leave   it at that now you have a storage account key in  here that you can use just in case you have some   challenges with uh the you know maybe listing  the records the or whatever if you not able to   see the storage account ke that is created all  you just have to do is come back to the access   configuration in here and uh you know activate the  access policy and go to the access policy add your   user as a user to this particular access policy  which is currently here okay if you do that you   should be able to access all of the records in  there but as it is we just leave it at that with   the storage account key created okay now I have  my storage account key my key VT so we need to   get this key Vault information into our data braks  okay how do we do that we go into properties here   as our KEYT I'll copy The Vault URI and I'll go  into my data braks and the URI is going to be   the data uh the DNS name okay and I'll come back  again uh not this one yeah I'll copy the resource   ID which is currently here and I'm going to be  pasting it here all right then once that is done   I just need to put it to give it a particular  scope name it doesn't matter what I give it but   you should Al also have the the name or whatever  it is you give to it uh in mind so I'll call this   data bricks data bricks Bri scope if you like  and um I'll create I say the premium tier is uh   disabled in this uh in this workspace secret scope  can only be created with initial manage principle   user so all you just need to do that big grammar  just changes to all users and you can create okay   now the secret Cod has been created and I click  on okay now once this is done you can go back   into your base notebook to confirm if this has  been correctly initialized and Order okay so   to do that you just go into the workspace you  just go into the workspace in here and just do   the GBU Tools fs. 
mount: we need to mount our storage containers into Databricks for this to work. So we have our source; I'll just copy a template so that I don't make mistakes, change the container name to bronze, change the storage account name to the medallion storage account, change the mount point to match, and then change the secret lookup to use the scope I created, "databricks scope", with the key name being the secret inside our Key Vault, which is "storage account key".

We can confirm whether we've been able to mount this successfully... except we haven't created a compute yet, so let's create one. To create a compute, go to Compute and create a cluster: I'm going to have a single node, with 14.0 as my Databricks runtime, Standard_DS3_v2 as the node type, and the rest left as it is, then click create compute. It takes a little while for the compute to be created, so while that's going on we can go back to the notebook and talk through what we have in the base notebook. What I'm doing, basically, is mounting whatever we have in the bronze container of our medallion storage account onto the mount point /mnt/bronze. Once we've been able to connect successfully to our storage account using the key stored in the Key Vault, we can access it at /mnt/bronze, and we should be able to see every piece of content of our medallion storage account from Databricks. It's still creating, so we wait a few more seconds.

The compute creation is done, so we go back into the notebook and double-check. If you run the cell and there's any error we'll see it here, but if there's none, everything is fine. You can see it returns True, which means it connected successfully to our storage account. If I want to see what we have in there, I just do a dbutils.fs.ls on the mount, /mnt/bronze, and we can see the 20231217 directory. If I copy that directory name, paste it into the path, and run it again, you can see all of the files in there: address, customer address, product, product category, and the rest. That means our connection to the storage account is successful and working as expected. Perfect.

Now we need to mount silver and gold as well, so we copy the mount cell a couple of times, change one to silver and the other to gold, and run it again; but because I've already mounted bronze, we need to comment that out first. We should then have access to silver and gold... silver isn't mounted? You probably saw that: the silver mount didn't actually run, so I'll comment this out and remount silver.
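The mount cells end up looking roughly like the sketch below. The storage account, scope, and secret names follow what was set up above but are effectively placeholders, and since the video copies its template from elsewhere, the exact URI scheme (abfss against the dfs endpoint here, versus wasbs against the blob endpoint) may differ from what is typed on screen:

```python
# Hedged sketch: mount the three containers using the account key held in the Key Vault-backed scope.
storage_account = "<storage-account>"  # the medallion storage account created earlier
account_key = dbutils.secrets.get(scope="databricks-scope", key="storage-account-key")

for layer in ("bronze", "silver", "gold"):
    dbutils.fs.mount(
        source=f"abfss://{layer}@{storage_account}.dfs.core.windows.net/",
        mount_point=f"/mnt/{layer}",
        extra_configs={f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": account_key},
    )

display(dbutils.fs.ls("/mnt/bronze"))  # should show the dated folder written by ADF
```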
now it's okay and uh  if you go into the gold we don't have anything   in there so we have everything fine so so far we  can see an empty array that means the connection   is successful and we're good to go right now  I just need to delete this uh okay I'll delete   this I'll leave this in a script section once we  done so in case you want to use the same code and   you can just copy this and uh use it in your own  project as well now once this is done uh we can   test something out as well before we continue  because this is where we're going to be putting   all of this into play so I just need to don't show  confirm I just wanted to delete that okay so what   I want to do basically is to double check if we  can create a database I'll just say spark. SQL   and I'll say create database I'll call it testing  and if we are able to create this database that   means we are ready to run our connections and  script so let's double check this in our catalog   it says it's successful and you can see we have  a testing database so that means our catalog is   fine by the time we run our queries we should be  able to see all of them so I just need to drop it   again and we we're good okay perfect so at this  point the connection between our data bricks and   the storage account has been established we've  also been able to run our queries to create a   new database on our catalog on data breaks now  what we want to do is once we dump data into our   bronze layer we want to create use those files  to create tables and you know the database in   our storage account well not storage account per  say uh on our on our data breaks on our DB so we   have a customer paret file in our bronze layer  we want to use that to create a customer table   on our database account if that makes sense  so that's what we want to do basically so for   that we need to initially have a script that is  going to automatically recognize the database if   it has not been created before create it but if  it has been created skip skip the creation of the   database and as well as the table so if you run  it multiple times you can only only create once   you know the simple query is you know creative not  exist or create uh uh creative not exist for the   database and maybe create if not exist as well for  the tables if you want so but that's basically the   script that is going to be helping us to do that  so let's quickly go into that and continue okay   so I'll just remove all of this and we're going  to be using the base notebook okay but before we   do that let's establish a connection between our  data breakes and our data frame uh I mean data   Factory not data frame okay so in here I'll just  create a plus and I'll select notebook here now   the notebook needs to be configured we need to  be able to configure this I'll minimize this and   I'll minimize uh this guy as well okay now in our  notebook I just need to come into the Azure data   braks here we don't have any linked service  for our data break so let's create one so we   leave that as your data breaks if you like you can  change the name now I'll come back in here for my   autor resolve integration I'll leave it as it is  but instead of you know from the aure subscription   uh you can use this but I think I'll prefer to  enter the connections manually so right now I for   my workspace URL I can come into my base notebook  here and copy everything from net to the to the   starting point I don't need the storage account  keys for now uh maybe I'll just leave it as it 
But before we do that, let's establish a connection between our Databricks workspace and our Data Factory (not data frame, Data Factory). In the ADF pipeline I click the plus and add a Notebook activity; the notebook needs to be configured, so I'll minimise these panels. In the activity I go to the Azure Databricks tab: we don't have a linked service for Databricks yet, so let's create one. I leave it as Azure Databricks (you can change the name if you like), keep the AutoResolveIntegrationRuntime as it is, and instead of picking the workspace from the Azure subscription, which you can do, I prefer to enter the connection manually. For the workspace URL I go to my base notebook in the browser and copy everything from the start of the address up to .net; I don't need the storage account keys for now.

Now, back in our Data Factory, I paste that in as my workspace URL. I'll be using an access token to connect from Data Factory to Databricks; to get one, go to the user settings in Databricks, into the developer profile, and select access tokens to generate a new token. I'll call it "medallion ADF databricks" and give it maybe 30 days; this is going to be deleted once the video is released anyway, so it's fine. I copy the token, go back into Data Factory, and paste it in. Once that's done it automatically scans through what we have, but this part is very important, because I made this mistake earlier: if you select a new job cluster, you have to be sure you have enough compute. Maybe you have a pay-as-you-go account and you want a new cluster spun up each time the job runs; I don't think that's the best approach here, but in some cases, when you have very heavy processing to do, you might want to spin up dedicated compute per job, and it does automatically spin back down afterwards, though it costs you to spin up a cluster for every big workload in production. If, like me, you have a limited constraint on the number of computes you can use, it's best to select the existing compute that's already running and run your jobs on that cluster, so I'll be selecting the existing interactive cluster.

This should automatically populate with what we have, so let's see. I remove the trailing backslash from the URL, but under existing interactive clusters nothing shows up; this should be populated automatically and I don't know why it isn't. So I discard the changes and set it up again: new linked service, AutoResolve, enter manually, paste the token, and for the Databricks URL I copy the workspace URL from Databricks and paste it in again, and this time it loads. It defaults to a new job cluster, but that's not what we want; we want the existing interactive cluster that is currently running, which is my cluster. This is very tricky, and probably something that needs improving: the first time it wasn't loading the cluster that is currently running, and I just needed to close the pop-up and then reopen it, after which it loaded the cluster. So if you don't see your cluster at first, close the pop-up and create the linked service again. Once you select the running cluster and test it out, you can see the connection is successful, and I select create. To quickly talk through what that means: once this is done, when you
submit a job, any job that runs from our ADF is going to be submitted to this cluster using the Runs Submit API; that's what will be happening there. What we want is a notebook path that gets triggered when we submit a job, so I select the path: in the root folder I have a medallion workspace, and inside it the base notebook; this is the notebook that will be triggered once we submit a job to this cluster. We also need to pass in some parameters so the notebook knows what to work on, so I add three base parameters. The first is the schema name, which is the table schema; then the table name; and then the file name. For the dynamic content, since we're still inside the ForEach loop (don't forget), we pass in the item's table schema for the schema name, the item's table name for the table name, and for the file name the same formatted date we built earlier with formatDateTime over utcnow, using year, month, and day (you can use any other format you prefer). Once that's done I click OK, and whatever is in the base notebook will be triggered whenever we submit a job to it.
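To recap, the Notebook activity's three base parameters end up as dynamic content along these lines (the parameter names and date format are the same assumptions as before):

```
schema_name : @item().TABLE_SCHEMA
table_name  : @item().TABLE_NAME
file_name   : @formatDateTime(utcnow(), 'yyyyMMdd')
```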
Then, once the database is created, we want to create the table if it is not already on the database. To do that we call spark.sql again, this time with a multi-line query: CREATE TABLE IF NOT EXISTS followed by the table schema, a dot, and the table name, so that if the schema were, say, medallion_schema, this would produce something like medallion_schema.table_name. The table is created USING PARQUET, and the location of the Parquet files is under the mount point; don't forget we already have access to the bronze layer through the mount, so I pass in the bronze path followed by the file path, which is built from the table schema, the table name and the file name. Rather than retype this section I'll just copy it in, because it looks like the kind of thing I might get wrong by hand. So, to recap what the notebook is doing: if the table doesn't exist, it creates it on top of the Parquet files mounted under bronze at table schema, table name and file name, with the file name being the year, month and day folder we formatted earlier, and it uses those Parquet files to back the table. That's all we need to do here.
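Continuing the sketch, the table-creation half of the notebook would look roughly like this; the exact folder layout under /mnt/bronze is an assumption based on the path the pipeline builds (schema, then table, then the yyyy/MM/dd folder):

```python
# register an external table over the Parquet files the copy activity dumped into bronze
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {table_schema}.{table_name}
    USING PARQUET
    LOCATION '/mnt/bronze/{table_schema}/{table_name}/{file_name}'
""")
```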
All that's left now is to trigger the pipeline and see if everything works as expected. I save the notebook, go back into my Data Factory, publish the changes, and run the pipeline again. We have the base notebook and the parameters going into it; if there is any error we'll fix it, but for now everything seems fine. The pipeline starts loading the data and dumping it into our bronze layer, it hasn't reached the notebook activity yet, and then it starts submitting to the notebook. If you want to see how the jobs are being submitted, go into the base notebook in Databricks, open the Data Engineering section and look at Job Runs; you will see the submitted jobs there. Opening one of them shows an error: the object has no attribute 'SQL'. You probably saw that; why didn't you tell me? There's an error in there, it should be spark.sql, so let's fix it. I don't know how I missed that; I just rerun the pipeline. This time there is another error: it says the name table schema is not defined, because the variable names in the query no longer match the ones I defined, so I change them to table_schema and table_name. Any other issues? Please tell me now before I run this again. I rerun it, the notebooks get submitted, and we shouldn't have any errors this time around. You can see all of the runs are successful; it takes a few seconds to reflect here, but all of them are successful regardless. Now, if you go into the catalog you will see salesLT in there as our database name, with address, customer address and everything else. Check the customer table and you can see the sample data populated from the Parquet files that were dumped into our bronze layer, and the same goes for customer address, the address itself, the dates, product and the rest. Everything is fine and everything is good.
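If you want to double-check from a notebook cell that the tables really landed, a quick query like the sketch below works, assuming the salesLT schema name used in this walkthrough:

```python
# list the tables created by the pipeline and peek at one of them
spark.sql("SHOW TABLES IN saleslt").show()
spark.sql("SELECT * FROM saleslt.customer LIMIT 5").show()
```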
What we need to do now is proceed to the DBT section. At this point we have completed the ADF and Azure Databricks integration: we've been able to move data from our SQL database using ADF and dump it into Databricks. What we want to do now is set up DBT for the transformation. As I explained earlier, DBT is the transformation layer; it is not concerned with how data gets into the system or how it is managed, the only thing it cares about is the manipulation of the data, so whatever data you have in your warehouse can be transformed and saved into other dimensions and forms that make sense. So let's switch over to the DBT section and continue from there. I'll be creating a new project, and I'll be using PyCharm, which for some reason is now my favourite IDE for Python, but you can use any IDE you like, Visual Studio Code or whatever you prefer. We'll stick with the Medallion naming, so I'll call the project something along the lines of medallion-dbt-spark-azure, or maybe just leave it shorter than that. I'm using Python 3.9, so if you want to follow along exactly, use Python 3.9 as your environment. You can change it to the latest one, 3.11, if you like, but to prevent issues and errors that may come from dependency updates I'll stick to 3.9; with time those dependencies will be fully up to date and there will be no errors, so at that point you may well want the latest Python environment. We wait a few seconds for the environment to be created, and once that is done we can get started. The first thing I do is add a new Python file with touch main.py, and I increase the IDE font size to 20 for clarity. Then I pip install dbt-databricks; this brings in the dbt and Databricks dependencies all wired up together on my system. Once that is done there is one more thing before continuing: I need to make sure my machine can connect to Databricks, and for that I use the package released by the Databricks team, so I pip install databricks-cli as well. I wait for PyCharm's indexing to finish, because everything slows down while it is still in progress, and then I configure Databricks on my local system so it can effectively connect to the Databricks workspace I have in the cloud. I run databricks configure with the --token flag; the host is my Databricks workspace URL, without the trailing forward slash, and the token is the one I created earlier, which I paste in and press Enter. To test whether I actually have access to Databricks, I run databricks secrets list-scopes, and if I can see the scope I created on Databricks, a connection has been established. Next I check whether I can list what I have in the file system with databricks fs ls. If I can see that, my system is able to connect to Databricks effectively; if you have any challenge here, fix whatever error you have before you continue, because this is a very critical point for DBT to work on your system and connect to Azure Databricks. Once this is done, the next thing is dbt init. It asks for the name of the project and all that: I enter Medallion DBT spark, it says that is not a valid project name, letters, digits and underscores only, so I use medallion_dbt_spark instead. Then it asks which database I would like to use; I'll be using Databricks, so I select option 1, and it moves on to the connection details.
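For reference, the terminal steps up to this point look roughly like this, a sketch assuming a Python 3.9 virtual environment and the legacy databricks-cli package used here:

```bash
pip install dbt-databricks      # dbt-core plus the Databricks adapter
pip install databricks-cli      # legacy Databricks CLI, used here for a quick connectivity check

databricks configure --token    # prompts for the workspace host URL and the access token

# sanity checks -- fix any failure here before moving on to dbt
databricks secrets list-scopes
databricks fs ls dbfs:/

dbt init                        # scaffolds the project and walks through the connection prompts
```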
It asks for my Databricks host. That is the server host name, the same workspace URL I have been using, but it needs to go in without the https:// prefix, which I'll fix shortly. The HTTP path comes from the cluster: go into your compute, select your cluster, open the advanced options, and under JDBC/ODBC you will see both the server hostname and the HTTP path. I copy the HTTP path and paste it in. For the authentication option I choose the access token and paste in the token I copied earlier. I don't need Unity Catalog, and the default schema I'll be using for this project is salesLT. One thread is fine, I don't need more than that. The final thing I need to do is open my profiles.yml and fix the https I pasted into the host, because if I run dbt debug now it fails and tells me the host is not valid. The file usually lives in the .dbt folder in your home directory; I drag it into PyCharm, and you can see it already contains profiles from earlier projects, the codewithyou dbt profile, a dbt BigQuery demo and the crash course one, plus the new medallion entry. The new profile doesn't set a catalog, which is fine since we are not using Unity Catalog, but the host information is wrong, so I remove the https:// part, move this profile up to the top because it is the one I want to work with at this point, save, and run dbt debug again. It still reports errors with the file, so I comment the old profiles out for now, since they are causing issues, and remove the rest so I can establish the connection first; I'll paste them back once this is done. Now the connection test is fine, but it says the medallion_dbt_spark project cannot be found. That's because I'm inside the wrong directory, so I move into the medallion_dbt_spark directory and run dbt debug again, and all things being equal everything passes: I am able to connect to Databricks. I paste the other profiles back and leave them commented out. Back in my medallion_dbt_spark project I have my dbt_project.yml, and when I run dbt debug now it connects successfully and you can see the catalog has been resolved as hive_metastore.
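The resulting profile ends up looking roughly like the sketch below of what dbt init writes to ~/.dbt/profiles.yml; the host, http_path and token values are placeholders:

```yaml
medallion_dbt_spark:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: saleslt                                            # default schema for the project
      host: adb-1234567890123456.7.azuredatabricks.net           # workspace host, no https:// prefix
      http_path: /sql/protocolv1/o/<workspace-id>/<cluster-id>   # from the cluster's JDBC/ODBC tab
      token: dapiXXXXXXXXXXXXXXXXXXXXXXXX                        # the personal access token created earlier
      threads: 1
```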
This is the point where we need to write some SQL for the transformations, and I'm going to start with the snapshots, to capture the current state of the data we have right now. If you look at our salesLT schema we only have these base tables, no snapshots at the moment, and we want to keep a snapshot of the current data before any transformation so we can replay it at any point in time. So I go back into my medallion_dbt_spark project and, under the snapshots folder, create a new SQL file for each table; every table is handled as its own snapshot. I start with the address, so touch snapshots/address.sql, then customers.sql, customer_address.sql, products.sql, product_category.sql, product_description.sql, product_model.sql, product_model_product_description.sql, sales_order_detail.sql and sales_order_header.sql. Those are the SQL files we are creating, and you can see all of them under the snapshots section. Rather than write them all out on camera, which would take some time, I'll paste in scripts I have already written and we can talk through them: the address, the customer, the customer address, the product and the product model. I don't have the product category and product description ones prepared, so I'll remove those (and the product model description cross table) and focus on the ones I have; you can populate all of them yourself if you like. That leaves address, customer, customer address, product, product model, sales order detail and sales order header. Let's talk through the address one. I'm defining a snapshot named address_snapshot; that is the name of the snapshot. In the config, the file format is Delta: like I said, our silver and gold layers are going to be Delta, while bronze stays plain Parquet. The location is a mount point under /mnt/silver/address, which is where the data will be written on our storage account. Then we set the target schema for the snapshot, invalidate hard deletes, the unique key, which is the AddressID column, the strategy, which is check, and the columns to check, which is all of them. The body is a CTE that selects the columns I want from the salesLT address source, followed by a select star from that CTE, and that's all it does. The customers snapshot does exactly the same thing, as do the customer address, the product, the product model, the sales order header and the sales order detail; only the column names and the snapshot names differ.
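Here is a sketch of what the address snapshot file looks like; the target schema name and the exact column list are assumptions based on the AdventureWorksLT Address table, and the source name matches the sources file defined a little further down:

```sql
-- snapshots/address.sql
{% snapshot address_snapshot %}

{{
    config(
        file_format='delta',
        location_root='/mnt/silver/address',
        target_schema='snapshots',
        invalidate_hard_deletes=True,
        unique_key='AddressID',
        strategy='check',
        check_cols='all'
    )
}}

with source_data as (

    select
        AddressID,
        AddressLine1,
        City,
        StateProvince,
        CountryRegion,
        PostalCode,
        ModifiedDate
    from {{ source('saleslt', 'address') }}

)

select * from source_data

{% endsnapshot %}
```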
Once these are done, this is a good point to commit what we have into our repository on GitHub. Looking at what we have, it turns out I initialised git in the wrong directory, so I remove the .git folder, since it should be in the root directory, run git init again there, and check the status. I don't need that main.py any more, so I remove it, then git add everything and commit with the message "added snapshots to dbt". Now we can run this and see if everything works as expected: dbt snapshot will run each of the snapshot queries we have written. It fails, saying the snapshot depends on source saleslt.customer, which was not found. So we need to fix the models. I'm actually missing one other important thing here: I only pasted in the snapshot code, but I haven't added the sources file, which is where we declare the tables and the schema sitting on our Azure Databricks that we want to model in DBT. Let's fix that: we need a YAML file that registers all of these tables and the schema. In the models folder I create a new directory, call it staging, or bronze if you prefer, and inside it a YAML file, say bronze.yml. It is a version 2 YAML file with a sources block, and I'll copy-paste mine in to save time. Essentially, the source is named saleslt, pointing at the salesLT schema, with a description along the lines of "AdventureWorks tables loaded into bronze" (put whatever description you like), and the tables are address, customer, customer address, product, product category, product description, product model, sales order detail and sales order header.
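The sources file looks roughly like this sketch, in which the source name and the exact table names are assumptions based on the AdventureWorksLT tables loaded into bronze:

```yaml
# models/staging/bronze.yml
version: 2

sources:
  - name: saleslt
    schema: saleslt
    description: "AdventureWorks tables loaded into the bronze layer"
    tables:
      - name: address
      - name: customer
      - name: customeraddress
      - name: product
      - name: productcategory
      - name: productdescription
      - name: productmodel
      - name: salesorderdetail
      - name: salesorderheader
```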
With that in place, I run dbt debug to check that everything is fine, and then dbt snapshot again; hopefully we should be good. You can see it is unable to do partial parsing because no saved manifest was found, so it does a full parse and runs everything from the beginning, and each of the snapshots is run in turn, starting with the address snapshot, all seven of them. Perfect, it completes successfully. If you refresh salesLT now you will see our snapshots created, these seven tables, and each of them carries the extra snapshot columns: dbt_scd_id, dbt_updated_at, dbt_valid_from and dbt_valid_to. These are important keys for knowing what changed and at what time each record was invalidated: if you change, remove or delete a record, dbt_valid_to is populated and you can see the last date that version of the record was valid before it was changed. That is one of the good things about DBT, it keeps track of the history of the data, and the data here is being pulled from the Parquet files. So at this point our snapshots are fine, and if you go into the storage account, into the containers, and check the silver section, you can see all of them sitting there nicely: the address snapshot is there, the customer snapshot is there, and so on. Each of these can be used to replay the data: if anything changes, or for some reason we lose data in the production environment, meaning the gold or silver section, we can replay whatever we have from bronze into silver and then into gold, and everything will be back in place. And if you have heavier transformations to do, this is where the major transformation work is going to happen. Next, I delete the example folder that dbt scaffolded, since I don't need it, and add a marts folder under models. The marts are the streamlined models that go into gold: the snapshots we just built go into the silver section, and you can do any transformation you like there, change column names, merge tables, whatever it is, while the models that go into gold, the final stage of our transformation layer, live inside the marts folder. Under marts I create three folders, customer, products and sales, and then the files for each: the customer model with its SQL and YAML, products.sql with products.yml, and sales.sql with its YAML.
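The resulting layout, roughly as typed out here, using the final file names (the customer model gets renamed to dim_customer just below):

```bash
mkdir -p models/marts/customer models/marts/products models/marts/sales

touch models/marts/customer/dim_customer.sql models/marts/customer/dim_customer.yml
touch models/marts/products/products.sql     models/marts/products/products.yml
touch models/marts/sales/sales.sql           models/marts/sales/sales.yml
```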
If you look at each of them, the .sql file is where the transformation that goes into gold will live, while the .yml file, just like in the examples we deleted, holds the structure of the table and the tests for each of the columns going into it. Let's start with the customer. I already have this written down, so I'll paste it in and we can quickly talk through it, but I'll change the name from customer to dim_customer, doing a refactor so the yml is renamed as well; you could use a fact and dim naming convention here if you like, I'm just using dim, and I'll leave the sales as it is. So, in dim_customer: this model is materialised as a table, not a view, the file format is Delta, and the location is the gold customers mount point, which is where it will be written. Then come the CTEs, the common table expressions I've written: one pulls from the address snapshot, one from the customer snapshot and one from the customer address snapshot. At the end of the day the transformation is basically a select statement that joins the customer snapshot, the customer address snapshot and the address snapshot and picks out the columns I want; you can adjust this and do multiple transformations of your own. I also paste in the structure and the tests for the customer in its yml file, simple as that.
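A sketch of what that customer dimension can look like; the join keys and the column list here are assumptions based on the AdventureWorksLT schema rather than the exact code pasted in the video:

```sql
-- models/marts/customer/dim_customer.sql
{{
    config(
        materialized='table',
        file_format='delta',
        location_root='/mnt/gold/customers'
    )
}}

with customer_snapshot as (
    select * from {{ ref('customer_snapshot') }}
    where dbt_valid_to is null          -- only the current version of each record
),

customer_address_snapshot as (
    select * from {{ ref('customer_address_snapshot') }}
    where dbt_valid_to is null
),

address_snapshot as (
    select * from {{ ref('address_snapshot') }}
    where dbt_valid_to is null
),

transformed as (
    select
        c.CustomerID,
        c.FirstName,
        c.LastName,
        c.CompanyName,
        c.EmailAddress,
        a.AddressLine1,
        a.City,
        a.StateProvince,
        a.CountryRegion,
        a.PostalCode
    from customer_snapshot c
    join customer_address_snapshot ca on c.CustomerID = ca.CustomerID
    join address_snapshot a on ca.AddressID = a.AddressID
)

select * from transformed
```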
I do the same for the product, pasting in its SQL and its structure, and for sales along with the sales structure. To talk through the product: it is also materialised as a table in Delta format, written to the gold product location, and it selects from the product snapshot we created earlier as well as the product model snapshot. Notice the dbt_valid_to column here: so far it is null, meaning the record has not changed, but if any column of a record is updated, removed, deleted or otherwise modified, dbt_valid_to is populated with the timestamp at which that version stopped being valid and a new row is added. So we select only the latest records, the ones without a dbt_valid_to value, and at the end of the day we have our transformed CTE that the model selects from. The same goes for its structure file, with the dim product columns as they are. The sales model works the same way: it selects from the sales order detail and sales order header snapshots, joins them to the products, and in the transformation layer I simply select all of the columns out as transformed. All we need to do now is run this, see everything land in our gold section, and then we can write queries against it. So, in the terminal, pwd tells me we are in medallion_dbt_spark. I want to do a dbt run, but before that a dbt test, to check that the tests I've written in the yml files work as expected. It says configuration paths exist in your dbt_project.yml that do not apply to any resources, which is the example model config from the scaffold, so let's remove that: in dbt_project.yml we still have the example entry under models, and I'll just delete it. You can list each of your model folders there if you want to control whether they materialise as tables or views; in case you are wondering how to change or modify that, I have a crash course, linked in the description and in the card on this video, where I show how to convert tables to views and views to tables and how to work with multiple nested model folders, but that is beyond the scope of this video. With that removed I run dbt test again; it is trying to run the tests but there is nothing built yet, there is no dim or sales model in the package, so I do a dbt run to build each of them as tables and get everything registered in the system. The tests do run checks in the background during the run, but it is still important to run them separately, so once the build is done I run dbt test, and it reports that it found 3 models and 7 snapshots with nothing to do, telling me to check my configs and models, so I think everything should be okay.
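To recap, the two commands used in this step are simply:

```bash
dbt run    # builds the three mart models as Delta tables in their gold locations
dbt test   # runs whatever tests are declared in the marts' .yml files
```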
Let's check the UI. If I refresh the catalog and go into the salesLT section, you will see dim_customer, which is the final shape of the customer dimension, and the same for the product, where you can see columns like the sell end date, the discontinued date and even the thumbnail photo. The last one is sales, with its sample data, and once it finishes crunching you can see the records, the columns, their types and all the rest. If you then go one step up into the containers and open the gold section, you have customer, products and sales, with dim_customer sitting under customer; any further transformations can be done at the DBT level for the product and the sales as well, and all of them will land there. Finally, the next thing we want to do is generate the documentation so we can view what the structure of the system looks like. We run dbt docs generate, which builds the documentation for all the transformations done so far and writes it to catalog.json in the target directory, and then dbt docs serve to view it on the UI.
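For reference, those two commands (dbt docs serve listens on localhost port 8080 by default):

```bash
dbt docs generate   # compiles the project and writes target/catalog.json plus the docs site
dbt docs serve      # serves the generated documentation locally in the browser
```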
I'll drag the browser window over so you can see how it looks. This is how the documentation looks: we have the sources here, the medallion_dbt_spark project, and the models, which are the marts, with the customer folder and the dim_customer model. You can see its code; the description says it is not documented, because I didn't do much documentation there; then the columns, the dependencies, which are the three snapshots it builds on, and the full compiled code. The same goes for the product, where you see the details and the code, and for the sales, and for the snapshots: you can see the data, what is referencing what, and how each of them is referenced. For the customer, you can see that dim_customer references it and it has a customer node; for the sales sources you can see each of the tables and the SQL being used to select them, which is pulling data from our Hive metastore on Azure Databricks, and the same for the address, where you can see the snapshot and the query that is used. If you look at it on the database level instead, it shows salesLT and the snapshot schema, exactly what we have in our catalog, with the exception of the product model and product description table that wasn't added. I don't have any grouped models in here, but that's how you can view it at the project level as well as the database level. If you want to see the lineage graphs, the one for product category is quite simple, but the one that is a little more interesting is dim_customer: salesLT address flows into the address snapshot, customer into the customer snapshot, and everything comes together in the customer dimension. The same goes for sales: in full-screen view, the sales order detail and sales order header flow into their snapshots, and then, together with the product, everything flows into the sales model. That's how the graphs look. The product graph is a fairly simple one too, nothing too complex, but that's how it looks anyway. So that's basically it. If you have any questions, comments or feedback, drop them in the comment section; that caps it all, and this is where we are going to stop with our DBT. Before you go, if you enjoyed this video and learned something new, please give it a like, subscribe and turn on the notification bell so you don't miss out on the latest content. If you are new here, don't forget to subscribe before going, and you can check out the videos listed on this screen for more content like this. Thank you very much for watching, and I'll see you in the next one.